uchrom.core¶

ChromData¶

class uchrom.core.ChromData(coords: ndarray, spots: DataFrame, *, cells: DataFrame | None = None, cellm: Dict[str, ndarray] | None = None, tracks: DataFrame | None = None, traces: DataFrame | None = None, layers: Dict[str, ndarray] | None = None, results: dict | None = None, uns: dict | None = None, linked_adata=None, validate: bool = True)[source]¶

Bases: object

Chromatin Data — the core container for U-Chrom.

See the module docstring of uchrom.core.cdata for the full purpose, hierarchy, FOF-CT mapping, and on-disk format contract. Summary:

The central abstraction is a “structure table” — genomic bins mapped to 3D coordinates. Each row of spots (with the corresponding row of coords) is one Spot.
Spots are grouped hierarchically as Cell → Trace → Spot. A Trace is an ordered chromatin-fibre polymer; a Cell contains one or more traces.
All analysis in U-Chrom consumes or produces ChromData. Reconstruction modules (uchrom.recon) emit it, structure callers (uchrom.strc) decorate cd.results[...] with TADs / loops / compartments, and the browser / plotters render it.

Parameters:

coords (ndarray, shape (n_spots, 3)) – 3-D coordinates (x, y, z) per spot.
spots (DataFrame, shape (n_spots, ≥4)) – Per-spot metadata. Required columns: chrom (str, will be categorified), start (int, 0-based BED-style), end (int, non-inclusive), trace_id (int or str, will be categorified). Optional: cell_id (int or str), spot_id, FOF-CT sub_cell_roi_id / extra_cell_roi_id, and any experiment-specific annotation column (carried through verbatim).
cells (DataFrame, optional) – Per-cell metadata indexed by cell_id.
cellm (dict[str, ndarray], optional) – Per-cell multi-dimensional annotations (embeddings, UMAP, …). The first axis of each array must have length n_cells.
tracks (DataFrame, optional) – Epigenomic signals (ATAC, ChIP-seq, …) row-aligned to spots. Length must equal n_spots.
traces (DataFrame, optional) – Per-trace metadata indexed by trace_id.
layers (dict[str, ndarray], optional) – Alternative coordinate sets, each with shape (n_spots, 3) — e.g. raw / drift-corrected / aligned.
results (dict, optional) – Analysis outputs. Conventional keys: 'loops' → DataFrame (chrom1, start1, end1, …), 'tads' → DataFrame (chrom, start, end, …), 'compartments' → ndarray or DataFrame per bin.
uns (dict, optional) – Unstructured metadata preserved on disk. Conventional keys: 'genome_assembly', 'xyz_unit', 'fofct_header'. Auto-discovery context may use 'dataset_references' for source papers/repositories and 'user_annotations' for user-provided priors, constraints, or hypothesis seeds.
validate (bool) – If True (default), validate internal consistency on construction (coords shape, spots required columns, tracks / layers alignment).
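The shape and alignment rules listed above can be restated as a small check on plain NumPy / pandas inputs. This is a minimal sketch, not the library's own validator: `check_inputs` and the example values are hypothetical, and only the rules themselves (coords shape, required spots columns, tracks / layers alignment) come from the parameter descriptions.

```python
import numpy as np
import pandas as pd

# Two traces of three spots each, all in one cell (values are made up).
spots = pd.DataFrame({
    "chrom": ["chr1"] * 6,
    "start": [0, 10_000, 20_000, 0, 10_000, 20_000],          # 0-based, BED-style
    "end": [10_000, 20_000, 30_000, 10_000, 20_000, 30_000],  # non-inclusive
    "trace_id": [0, 0, 0, 1, 1, 1],
    "cell_id": ["cellA"] * 6,
})
rng = np.random.default_rng(0)
coords = rng.normal(size=(len(spots), 3))  # one (x, y, z) row per spot

REQUIRED = ("chrom", "start", "end", "trace_id")


def check_inputs(coords, spots, tracks=None, layers=None):
    """Hypothetical restatement of the documented consistency rules."""
    problems = []
    n_spots = len(spots)
    if coords.shape != (n_spots, 3):
        problems.append(f"coords must have shape ({n_spots}, 3), got {coords.shape}")
    missing = [c for c in REQUIRED if c not in spots.columns]
    if missing:
        problems.append(f"spots is missing required columns: {missing}")
    if tracks is not None and len(tracks) != n_spots:
        problems.append("tracks must have one row per spot")
    for name, arr in (layers or {}).items():
        if arr.shape != (n_spots, 3):
            problems.append(f"layer {name!r} must have shape ({n_spots}, 3)")
    return problems


# Consistent inputs produce no problems; a 2-D coords array and a missing
# 'end' column produce one problem each.
problems_ok = check_inputs(coords, spots, layers={"raw": coords.copy()})
problems_bad = check_inputs(coords[:, :2], spots.drop(columns="end"))
```

Inputs that pass a check like this are the kind that would be handed to the constructor as ChromData(coords, spots, ...), following the signature above.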

Derived accessors: n_spots, n_traces, n_cells, chroms.

Key methods: from_dataframe, from_fofct, read, write, to_dataframe, get_cell, get_trace, get_chrom, compute_distances.

On-disk format (.h5cd): Versioned HDF5. See the uchrom.core.cdata module docstring for the full layout and the read() / write() round-trip contract.

Notes

Subsetting (cd[mask], get_chrom etc.) always returns a new ChromData; the source is not mutated.
Global pairwise distance matrices are intentionally not stored: they are biologically meaningful per trace, not across cells, and would need O(n²) memory. Compute them on demand via compute_distances(trace_id=...).
String columns (chrom, trace_id, cell_id) are auto-converted to pd.Categorical for ~10× memory savings.
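The effect of the categorical conversion can be checked with pandas alone. This is a minimal sketch on a synthetic chrom column; the actual saving depends on the pandas version, the string dtype, and how repetitive the values are, so the roughly 10× figure above should be read as indicative.

```python
import pandas as pd

# Synthetic column: 100,000 spots spread over 20 chromosome names.
n = 100_000
chrom = pd.Series([f"chr{i % 20 + 1}" for i in range(n)], dtype=object)
as_category = chrom.astype("category")

# deep=True counts the Python string objects, not only the pointers.
bytes_object = chrom.memory_usage(deep=True, index=False)
bytes_category = as_category.memory_usage(deep=True, index=False)
ratio = bytes_object / bytes_category

print(f"object: {bytes_object:,} B, category: {bytes_category:,} B, ratio: {ratio:.1f}x")
```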

Add a source reference to uns['dataset_references'].

Parameters are intentionally metadata-oriented rather than tied to one publication database. role should describe how the reference relates to the dataset, for example 'primary_dataset_paper', 'data_repository', 'supplementary_table', or 'related_biology_prior'.

add_user_annotation(*, annotation_id: str | None = None, scope: str, text: str, target: str | None = None, tags: List[str] | None = None, confidence: str = 'user_asserted', **extra: Any) → dict[source]¶

Add a user annotation to uns['user_annotations'].

Use annotations for cell-type notes, marker priors, analysis constraints, hypothesis seeds, negative constraints, field semantics, or quality warnings. They are surfaced in the discovery schema and agent context but still require notebook validation before becoming evidence.

property auto_discovery_schema: dict¶: Alias for discovery_schema.

build_discovery_schema(*, store: bool = True, **kwargs) → dict[source]¶

Build the auto-discovery schema, optionally storing it in uns.

The stored representation is an HDF5-friendly JSON payload under uns['auto_discovery_schema'], so it round-trips with .h5cd.

property chroms: List[str]¶

compute_distances(trace_id=None) → ndarray[source]¶

Compute pairwise Euclidean distance matrix.

Parameters: trace_id (optional) – If given, compute only for spots in that trace. If None, compute for all spots (use with caution on large data).
Return type: np.ndarray, shape (n, n)
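For reference, a pairwise Euclidean distance matrix over an (n, 3) coordinate array can be written with NumPy broadcasting as below. This is a sketch of the computation the method describes, not its actual implementation; `pairwise_distances` and the toy coordinates are illustrative names and values.

```python
import numpy as np


def pairwise_distances(xyz):
    """Euclidean distance between every pair of rows of an (n, 3) array."""
    diff = xyz[:, None, :] - xyz[None, :, :]   # shape (n, n, 3)
    return np.sqrt((diff ** 2).sum(axis=-1))   # shape (n, n)


coords = np.array([
    [0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
])
trace_id = np.array([0, 0, 1, 1])

# Per-trace matrix: only the spots of trace 0 (2 x 2).
dist_trace0 = pairwise_distances(coords[trace_id == 0])

# All spots at once (4 x 4); memory grows quadratically with n.
dist_all = pairwise_distances(coords)
```

The intermediate `diff` array holds n × n × 3 values, which is one reason restricting the computation to a single trace is the safer default on large datasets.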

copy() → ChromData[source]¶

property dataset_references: List[dict]¶

Dataset-level source references used as auto-discovery priors.

References are stored in uns['dataset_references'] and round-trip with .h5cd files. They are intended for primary dataset papers, data repositories, supplementary tables, method papers, and related biological priors.

describe_for_agent(*, max_items: int = 40) → str[source]¶: Return a compact prompt-ready description of available data.

property discovery_schema: dict¶

Agent-readable auto-discovery schema for this ChromData.

If a schema is stored in uns['auto_discovery_schema'] it is parsed and returned. Otherwise a fresh in-memory schema is built without mutating uns.

classmethod from_dataframe(df: DataFrame, *, cell_id=None, **kwargs) → ChromData[source]¶

Create from a reconstruction output DataFrame.

Expects columns: chrom, start, end, x, y, z. Each chromosome becomes one trace.

Parameters:

df (DataFrame with columns chrom, start, end, x, y, z.)
cell_id (hashable, optional) – If given, tag every spot with this cell identifier (e.g. derived from the output filename for single-cell reconstruction). The DataFrame’s own cell_id column, if any, takes precedence.
**kwargs – Forwarded to the ChromData constructor.
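The input layout and the "one trace per chromosome" rule can be illustrated with pandas. The sketch below only mimics the documented behaviour; the integer trace labels, the `cell_001` identifier, and the coordinate values are assumptions made for illustration.

```python
import pandas as pd

# Reconstruction-style output: one row per genomic bin with its 3D position.
df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2", "chr2", "chr2"],
    "start": [0, 50_000, 0, 50_000, 100_000],
    "end": [50_000, 100_000, 50_000, 100_000, 150_000],
    "x": [0.1, 0.4, 1.0, 1.2, 1.5],
    "y": [0.0, 0.2, 0.9, 1.1, 1.0],
    "z": [0.3, 0.1, 0.5, 0.4, 0.6],
})

# Assumption: one integer trace label per chromosome, in order of appearance.
trace_codes, trace_chroms = pd.factorize(df["chrom"])
spots = df.drop(columns=["x", "y", "z"]).assign(trace_id=trace_codes)

# Documented precedence: an existing cell_id column wins over the argument.
cell_id = "cell_001"  # hypothetical identifier, e.g. derived from a filename
if "cell_id" not in spots.columns:
    spots["cell_id"] = cell_id

coords = df[["x", "y", "z"]].to_numpy()  # shape (n_spots, 3)
```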

classmethod from_fofct(core_path: str | Path, **kwargs) → ChromData[source]¶

Read from FOF-CT core table file.

Parameters:

core_path (path) – Path to the FOF-CT core table (CSV/TSV/TXT).
**kwargs – Additional keyword arguments passed to ChromData constructor (e.g. cells, tracks, uns).

classmethod from_pyhim_trace(ecsv_path: str | Path, barcode_dict: dict | DataFrame | None = None, **kwargs) → ChromData[source]¶

Read a PyHiM chromatin-trace ECSV table into a ChromData.

PyHiM (Devos et al. 2024) emits one ECSV file per trace-building run. Schema (from chromatin_trace_table.py upstream):

Spot_ID, Trace_ID, x, y, z, Chrom, Chrom_Start, Chrom_End, ROI #, Mask_id, Barcode #, label

meta['comments'] carries xyz_unit=... and genome_assembly=....

Parameters:

ecsv_path (path) – Path to the ECSV file written by PyHiM.
barcode_dict (dict[int, (chrom, start, end)] or DataFrame, optional) – Required when Chrom/Chrom_Start/Chrom_End are empty in the ECSV (PyHiM does not always populate them). As a DataFrame, expects columns barcode, chrom, start, end. If Chrom is populated, barcode_dict is ignored.
**kwargs – Additional keyword arguments passed to the ChromData constructor (cells, tracks, uns, …).

Notes

Mask_id becomes cell_id (PyHiM convention).
ECSV header comments are captured in cd.uns['pyhim']['ecsv_comments'] and any xyz_unit / genome_assembly entries are also promoted to cd.uns directly (matching from_fofct()).
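When the genomic columns are empty, the barcode lookup amounts to a join on the barcode number. The following pandas sketch shows that idea on a hand-made table; it does not read an ECSV file, the barcode coordinates are invented, and the renamed column names are assumptions except for the Mask_id to cell_id mapping noted above.

```python
import pandas as pd

# Hypothetical PyHiM-style rows with Chrom / Chrom_Start / Chrom_End left out.
trace_table = pd.DataFrame({
    "Spot_ID": ["s1", "s2", "s3"],
    "Trace_ID": ["t1", "t1", "t1"],
    "x": [0.0, 0.5, 1.0],
    "y": [0.0, 0.1, 0.2],
    "z": [0.0, 0.0, 0.1],
    "Mask_id": [7, 7, 7],
    "Barcode #": [1, 2, 3],
})

# barcode -> (chrom, start, end); the coordinates are invented for the example.
barcode_dict = {
    1: ("chr3R", 100_000, 103_000),
    2: ("chr3R", 103_000, 106_000),
    3: ("chr3R", 106_000, 109_000),
}

lookup = pd.DataFrame.from_dict(
    barcode_dict, orient="index", columns=["chrom", "start", "end"]
)

# Join the lookup onto each spot by barcode number, then rename to the
# column names ChromData expects.
spots = trace_table.join(lookup, on="Barcode #").rename(
    columns={"Trace_ID": "trace_id", "Mask_id": "cell_id", "Spot_ID": "spot_id"}
)
```

In a join like this, a barcode missing from the lookup leaves NaN in chrom, start, and end, so it is worth checking for unmatched barcodes before building the container.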

classmethod from_seqfish_multiomics(spot_glob, **kwargs) → ChromData[source]¶

Load Takei 2025 DNA seqFISH+ cerebellum data.

Thin shim around uchrom.io.seqfish_multiomics.read_seqfish_multiomics(). See that function for the full parameter list.

classmethod from_seqfish_multiomics_linked(spot_glob, **kwargs)[source]¶

Load linked Takei 2025 DNA tracing + RNA AnnData artifacts.

Thin shim around uchrom.io.seqfish_multiomics.load_seqfish_multiomics_linked(). Returns a ChromData with RNA expression available at cd.linked_adata; the loader can also write paired .h5cd / .h5ad files.

classmethod from_takei2025_cerebellum(**kwargs)[source]¶

Load linked Takei 2025 cerebellum data.

Thin shim around uchrom.io.seqfish_multiomics.load_takei2025_cerebellum(). Returns a ChromData with RNA expression available at cd.linked_adata.

get_cell(cell_id) → ChromData[source]¶

get_chrom(chrom: str) → ChromData[source]¶

get_trace(trace_id) → ChromData[source]¶

link_anndata(adata, *, cell_id_col: str | None = None, copy_obs: bool = True, copy_obsm: bool = True) → int[source]¶

Import cell-level metadata from an AnnData into this ChromData.

Matches cells by cell_id: each unique value in spots['cell_id'] is looked up in adata.obs (by index, or by the column cell_id_col if given). Matched cells get their adata.obs columns merged into self.cells and their adata.obsm arrays copied into self.cellm.

If self.cells already exists, its row order is preserved and AnnData rows are aligned onto that cell axis. This is important for multi-omics loaders such as Takei 2025, where chromatin tracing coordinates live in coords/spots, RNA/IF signals live in spot-level tracks, and mRNA clustering/UMAP already live in cells / cellm.

Parameters:

adata (anndata.AnnData) – The single-cell dataset to link (e.g. scRNA-seq).
cell_id_col (str, optional) – Column in adata.obs that holds cell identifiers matching spots['cell_id']. If None, adata.obs.index is used as the key.
copy_obs (bool) – If True (default), copy adata.obs columns into self.cells.
copy_obsm (bool) – If True (default), copy adata.obsm arrays into self.cellm.

Returns:

Number of cells matched.

Return type:

int

Raises:

KeyError – If spots has no cell_id column.
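The matching and alignment described above can be pictured with pandas alone, using a plain DataFrame in place of adata.obs. This is a sketch of the idea under that assumption, not the method's implementation; the cell identifiers and the cluster labels are made up.

```python
import pandas as pd

# cell_id per spot, and an existing per-cell table whose row order must be kept.
spots_cell_id = pd.Series(["c1", "c1", "c2", "c3", "c3"])
cells = pd.DataFrame(
    {"n_traces": [1, 1, 1]},
    index=pd.Index(["c1", "c2", "c3"], name="cell_id"),
)

# Stand-in for adata.obs: different row order, one cell absent ("c2"),
# one extra cell ("c9") that has no chromatin data.
obs = pd.DataFrame({"cluster": ["A", "B", "C"]}, index=["c3", "c1", "c9"])
# If identifiers lived in a column instead, obs.set_index(cell_id_col) would
# play the role of the cell_id_col argument.

# Count the cells present on both sides.
n_matched = int(obs.index.isin(spots_cell_id.unique()).sum())

# Align obs onto the existing cell axis; unmatched cells get NaN.
cells_linked = cells.join(obs.reindex(cells.index))
```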

property linked_adata¶: Linked AnnData object, loaded lazily from uns metadata if possible.

load_linked_anndata(path: str | Path | None = None)[source]¶: Load, cache, and return the linked AnnData object.

property n_cells: int¶

property n_spots: int¶

property n_traces: int¶

classmethod read(path: str | Path) → ChromData[source]¶

Read from HDF5 (.h5cd) file.

Dispatches to a version-specific reader based on f.attrs['uchrom_format_version']. Files written before versioning was introduced are read with a warning using the v1.0 reader (the on-disk layout has been stable from the start).

Forward compatibility:

Same MAJOR, higher MINOR → read with a warning; unknown fields are ignored silently by the lower-level helpers.
Different MAJOR → raise ValueError with guidance.
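The compatibility rules above can be sketched as a small dispatch function. This is an illustration of the policy only: `choose_reader`, the "MAJOR.MINOR" string format, and the returned reader labels are assumptions, and no HDF5 file is opened.

```python
import warnings

SUPPORTED = (1, 0)  # assumption: this reader understands format 1.0


def choose_reader(version_attr, supported=SUPPORTED):
    """Pick a reader label from a 'MAJOR.MINOR' version string (or None)."""
    if version_attr is None:
        # File written before versioning: fall back to the v1.0 reader.
        warnings.warn("no format version found; assuming 1.0", UserWarning)
        return "v1"
    major, minor = (int(part) for part in str(version_attr).split(".")[:2])
    if major != supported[0]:
        raise ValueError(
            f"format {version_attr} has a different MAJOR version than "
            f"{supported[0]}.{supported[1]}; upgrade the reader or convert the file"
        )
    if minor > supported[1]:
        # Same MAJOR, higher MINOR: readable, but unknown fields are ignored.
        warnings.warn(f"file format {version_attr} is newer than the reader", UserWarning)
    return f"v{major}"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    same = choose_reader("1.0")         # no warning
    newer_minor = choose_reader("1.3")  # warning
    legacy = choose_reader(None)        # warning
warning_count = len(caught)

try:
    choose_reader("2.0")
    major_error = None
except ValueError as exc:
    major_error = str(exc)
```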

to_anndata()[source]¶

Export cell-level data as an AnnData object.

Creates an AnnData where each observation is a cell, obs is self.cells, and obsm is self.cellm. The X matrix is left empty (zeros) because ChromData has no cell-by-feature expression matrix. Spot-level RNA-FISH / IF / epigenomic signals, such as Takei 2025 tracks, remain in self.tracks and are not flattened into AnnData.X.

Return type: anndata.AnnData
Raises: ImportError – If anndata is not installed.

to_dataframe() → DataFrame[source]¶: Export as a flat DataFrame with coords + spots columns.

update_discovery_schema(schema: dict | None = None, **kwargs) → dict[source]¶: Store a supplied or newly built auto-discovery schema in uns.

property user_annotations: List[dict]¶

User-provided discovery context and analysis constraints.

Annotations are stored in uns['user_annotations'] and are treated as user-supplied priors or constraints by discovery agents, not as validated data evidence.

validate_discovery_schema(schema: dict | None = None, *, raise_on_error: bool = False) → List[str][source]¶

Validate a discovery schema against this ChromData.

Returns a list of issues. If raise_on_error=True, raises ValueError when any issue is found.
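The "collect issues, optionally raise" contract can be sketched in plain Python. The schema layout used here (a `spot_columns` list) is purely hypothetical, since the real schema structure is not described in this section; only the return-a-list and raise_on_error behaviour follow the text above.

```python
def find_issues(schema, available_columns):
    """Return human-readable problems; an empty list means the schema is consistent."""
    issues = []
    # Hypothetical schema field: names of spots columns the schema refers to.
    for col in schema.get("spot_columns", []):
        if col not in available_columns:
            issues.append(f"schema references unknown spots column {col!r}")
    return issues


def validate(schema, available_columns, *, raise_on_error=False):
    """Mirror the documented contract: list of issues, or ValueError on request."""
    issues = find_issues(schema, available_columns)
    if issues and raise_on_error:
        raise ValueError("; ".join(issues))
    return issues


columns = {"chrom", "start", "end", "trace_id"}
clean = validate({"spot_columns": ["chrom", "start"]}, columns)
stale = validate({"spot_columns": ["chrom", "cell_type"]}, columns)

try:
    validate({"spot_columns": ["cell_type"]}, columns, raise_on_error=True)
    raised = False
except ValueError:
    raised = True
```

Returning a list by default lets a caller report every problem at once, while raise_on_error=True suits pipelines that should stop on the first inconsistent schema.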

write(path: str | Path) → None[source]¶: Write to HDF5 (.h5cd) file.