uchrom.core

ChromData

class uchrom.core.ChromData(coords: ndarray, spots: DataFrame, *, cells: DataFrame | None = None, cellm: Dict[str, ndarray] | None = None, tracks: DataFrame | None = None, traces: DataFrame | None = None, layers: Dict[str, ndarray] | None = None, results: dict | None = None, uns: dict | None = None, linked_adata=None, validate: bool = True)[source]

Bases: object

Chromatin Data — the core container for U-Chrom.

See the module docstring of uchrom.core.cdata for the full purpose, hierarchy, FOF-CT mapping, and on-disk format contract. Summary:

  • The central abstraction is a “structure table” — genomic bins mapped to 3D coordinates. Each row of spots (with the corresponding row of coords) is one Spot.

  • Spots are grouped hierarchically as Cell → Trace → Spot. A Trace is an ordered chromatin-fibre polymer; a Cell contains one or more traces.

  • All analysis in U-Chrom consumes or produces ChromData. Reconstruction modules (uchrom.recon) emit it, structure callers (uchrom.strc) decorate cd.results[...] with TADs / loops / compartments, and the browser / plotters render it.

Parameters:
  • coords (ndarray, shape (n_spots, 3)) – 3-D coordinates (x, y, z) per spot.

  • spots (DataFrame, shape (n_spots, ≥4)) – Per-spot metadata. Required columns: chrom (str, will be categorified), start (int, 0-based BED-style), end (int, non-inclusive), trace_id (int or str, will be categorified). Optional: cell_id (int or str), spot_id, FOF-CT sub_cell_roi_id / extra_cell_roi_id, and any experiment-specific annotation column (carried through verbatim).

  • cells (DataFrame, optional) – Per-cell metadata indexed by cell_id.

  • cellm (dict[str, ndarray], optional) – Per-cell multi-dimensional annotations (embeddings, UMAP, …). Each array’s first axis length = n_cells.

  • tracks (DataFrame, optional) – Epigenomic signals (ATAC, ChIP-seq, …) row-aligned to spots. Length must equal n_spots.

  • traces (DataFrame, optional) – Per-trace metadata indexed by trace_id.

  • layers (dict[str, ndarray], optional) – Alternative coordinate sets, each with shape (n_spots, 3) — e.g. raw / drift-corrected / aligned.

  • results (dict, optional) – Analysis outputs. Conventional keys: 'loops' → DataFrame (chrom1, start1, end1, …), 'tads' → DataFrame (chrom, start, end, …), 'compartments' → ndarray or DataFrame per bin.

  • uns (dict, optional) – Unstructured metadata preserved on disk. Conventional keys: 'genome_assembly', 'xyz_unit', 'fofct_header'.

  • validate (bool) – If True (default), validate internal consistency on construction (coords shape, spots required columns, tracks / layers alignment).
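As a rough illustration of the inputs described above, the sketch below builds a minimal coords array and spots table, then checks the alignment rules that validate=True is documented to enforce. The toy values and the check_inputs helper are hypothetical and written only for this example; they are not U-Chrom's validation code. The constructor call is left commented out so the block runs without U-Chrom installed.

```python
import numpy as np
import pandas as pd

# Toy data: two traces of three spots each (all values are made up).
coords = np.array(
    [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.2, 0.1, 0.0],
     [1.0, 1.0, 1.0], [1.1, 1.0, 1.0], [1.2, 1.1, 1.0]]
)
spots = pd.DataFrame(
    {
        "chrom": ["chr1"] * 6,
        "start": [0, 5000, 10000] * 2,    # 0-based, BED-style
        "end": [5000, 10000, 15000] * 2,  # non-inclusive
        "trace_id": [0, 0, 0, 1, 1, 1],
        "cell_id": ["cellA"] * 6,         # optional column
    }
)

REQUIRED = ("chrom", "start", "end", "trace_id")


def check_inputs(coords, spots, tracks=None, layers=None):
    """Hypothetical stand-in for the documented consistency checks."""
    issues = []
    if coords.ndim != 2 or coords.shape[1] != 3:
        issues.append("coords must have shape (n_spots, 3)")
    n_spots = coords.shape[0]
    if len(spots) != n_spots:
        issues.append("spots must have one row per coords row")
    missing = [c for c in REQUIRED if c not in spots.columns]
    if missing:
        issues.append(f"spots is missing required columns: {missing}")
    if tracks is not None and len(tracks) != n_spots:
        issues.append("tracks length must equal n_spots")
    for name, arr in (layers or {}).items():
        if arr.shape != (n_spots, 3):
            issues.append(f"layer {name!r} must have shape (n_spots, 3)")
    return issues


issues = check_inputs(coords, spots, layers={"raw": coords + 0.01})
print(issues)

# With U-Chrom installed, the documented constructor would then be called as:
# from uchrom.core import ChromData
# cd = ChromData(coords, spots, layers={"raw": coords + 0.01})
```

Returning a list of issues rather than raising on the first one is a choice made for this sketch; the documentation does not say how the real validator reports problems.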

Derived accessors: n_spots, n_traces, n_cells, chroms.

Key methods: from_dataframe, from_fofct, read, write, to_dataframe, get_cell, get_trace, get_chrom, compute_distances.

On-disk format (.h5cd): versioned HDF5. See the module docstring of uchrom.core.cdata for the full layout and the read / write round-trip contract.

Notes

  • Subsetting (cd[mask], get_chrom etc.) always returns a new ChromData; the source is not mutated.

  • Global pairwise distance matrices are intentionally not stored: they are biologically meaningful per trace, not across cells, and would need O(n²) memory. Compute them on demand via compute_distances(trace_id=...).

  • String columns (chrom, trace_id, cell_id) are auto-converted to pd.Categorical for ~10× memory savings.
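The memory effect of the categorical conversion mentioned in the last note can be checked with plain pandas. The sketch below uses made-up chromosome labels and does not involve U-Chrom. The saving it reports depends on the data and the pandas version, so it should not be read as confirming the ~10× figure.

```python
import pandas as pd

# 100,000 toy spots spread over four chromosome labels.
chrom = pd.Series(["chr1", "chr2", "chr10", "chrX"] * 25_000, name="chrom")

# Categorical storage keeps one copy of each label plus small integer codes.
as_category = chrom.astype("category")

bytes_plain = chrom.memory_usage(deep=True)
bytes_cat = as_category.memory_usage(deep=True)
print(f"plain: {bytes_plain:,} bytes, categorical: {bytes_cat:,} bytes")
```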

property auto_discovery_schema: dict

Alias for discovery_schema.

build_discovery_schema(*, store: bool = True, **kwargs) dict[source]

Build the auto-discovery schema, optionally storing it in uns.

The stored representation is an HDF5-friendly JSON payload under uns['auto_discovery_schema'], so it round-trips with .h5cd.

property chroms: List[str]
compute_distances(trace_id=None) ndarray[source]

Compute the pairwise Euclidean distance matrix.

Parameters:

trace_id (optional) – If given, compute only for spots in that trace. If None, compute for all spots (use with caution on large data).

Return type:

np.ndarray, shape (n, n)
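For reference, a pairwise Euclidean distance matrix restricted to one trace can be sketched in plain NumPy as follows. This shows the computation only; pairwise_distances is a hypothetical helper, and how compute_distances is implemented internally is not documented here.

```python
import numpy as np


def pairwise_distances(xyz):
    """Euclidean distance between every pair of rows of an (n, 3) array."""
    diff = xyz[:, None, :] - xyz[None, :, :]  # shape (n, n, 3)
    return np.sqrt((diff ** 2).sum(axis=-1))  # shape (n, n)


# Toy coordinates for four spots in two traces.
coords = np.array(
    [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 5.0]]
)
trace_id = np.array([0, 0, 1, 1])

# Restrict to one trace first, so the matrix stays small.
mask = trace_id == 0
d = pairwise_distances(coords[mask])
print(d.shape)
```

The intermediate array has shape (n, n, 3), which is why computing across all spots at once gets expensive on large datasets.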

copy() ChromData[source]
describe_for_agent(*, max_items: int = 40) str[source]

Return a compact prompt-ready description of available data.

property discovery_schema: dict

Agent-readable auto-discovery schema for this ChromData.

If a schema is stored in uns['auto_discovery_schema'] it is parsed and returned. Otherwise a fresh in-memory schema is built without mutating uns.
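The "parse if stored, otherwise build without mutating uns" behaviour can be sketched with a plain dict and the json module. The schema contents, get_schema, store_schema, and the build callback below are all hypothetical; only the uns['auto_discovery_schema'] key and the JSON representation come from the documentation.

```python
import json


def get_schema(uns, build):
    """Return the stored schema if present, else a fresh one (uns untouched)."""
    stored = uns.get("auto_discovery_schema")
    if stored is not None:
        return json.loads(stored)
    return build()


def store_schema(uns, schema):
    """Store the schema as a JSON string, which HDF5 can hold as plain text."""
    uns["auto_discovery_schema"] = json.dumps(schema)
    return schema


def build():
    # Hypothetical schema payload, for illustration only.
    return {"n_spots": 6, "spots_columns": ["chrom", "start", "end", "trace_id"]}


uns = {}
fresh = get_schema(uns, build)       # built in memory; uns stays empty
store_schema(uns, fresh)             # now persisted as JSON text
round_tripped = get_schema(uns, build)
```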

classmethod from_dataframe(df: DataFrame, *, cell_id=None, **kwargs) ChromData[source]

Create from a reconstruction output DataFrame.

Expects columns: chrom, start, end, x, y, z. Each chromosome becomes one trace.

Parameters:
  • df (DataFrame) – Input table with columns chrom, start, end, x, y, z.

  • cell_id (hashable, optional) – If given, tag every spot with this cell identifier (e.g. derived from the output filename for single-cell reconstruction). The DataFrame’s own cell_id column, if any, takes precedence.

  • **kwargs – Forwarded to the ChromData constructor.
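The mapping from a flat reconstruction table to the constructor inputs can be sketched as below. split_reconstruction is a hypothetical helper, and using the chromosome name itself as the trace_id is an assumption: the documentation states only that each chromosome becomes one trace. The cell_id precedence follows the parameter description above.

```python
import pandas as pd

# Toy reconstruction output: two bins on each of two chromosomes.
df = pd.DataFrame(
    {
        "chrom": ["chr1", "chr1", "chr2", "chr2"],
        "start": [0, 100_000, 0, 100_000],
        "end": [100_000, 200_000, 100_000, 200_000],
        "x": [0.0, 0.5, 2.0, 2.5],
        "y": [0.0, 0.1, 1.0, 1.1],
        "z": [0.0, 0.0, 0.3, 0.4],
    }
)


def split_reconstruction(df, cell_id=None):
    """Hypothetical helper: split a flat table into coords and spots."""
    coords = df[["x", "y", "z"]].to_numpy(dtype=float)
    spots = df[["chrom", "start", "end"]].copy()
    spots["trace_id"] = spots["chrom"]  # assumption: one trace per chromosome
    if "cell_id" in df.columns:
        spots["cell_id"] = df["cell_id"]  # the table's own column wins
    elif cell_id is not None:
        spots["cell_id"] = cell_id
    return coords, spots


coords, spots = split_reconstruction(df, cell_id="cell_001")

# With U-Chrom installed, the documented entry point would be:
# cd = ChromData.from_dataframe(df, cell_id="cell_001")
```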

classmethod from_fofct(core_path: str | Path, **kwargs) ChromData[source]

Read from a FOF-CT core table file.

Parameters:
  • core_path (path) – Path to the FOF-CT core table (CSV/TSV/TXT).

  • **kwargs – Additional keyword arguments passed to ChromData constructor (e.g. cells, tracks, uns).

classmethod from_pyhim_trace(ecsv_path: str | Path, barcode_dict: dict | DataFrame | None = None, **kwargs) ChromData[source]

Read a PyHiM chromatin-trace ECSV table into a ChromData.

PyHiM (Devos et al. 2024) emits one ECSV file per trace-building run. Schema (from chromatin_trace_table.py upstream):

Spot_ID, Trace_ID, x, y, z, Chrom, Chrom_Start, Chrom_End, ROI #, Mask_id, Barcode #, label

meta['comments'] carries xyz_unit=... and genome_assembly=....

Parameters:
  • ecsv_path (path) – Path to the ECSV file written by PyHiM.

  • barcode_dict (dict[int, (chrom, start, end)] or DataFrame, optional) – Required when Chrom/Chrom_Start/Chrom_End are empty in the ECSV (PyHiM does not always populate them). As a DataFrame, expects columns barcode, chrom, start, end. If Chrom is populated, barcode_dict is ignored.

  • **kwargs – Additional keyword arguments passed to the ChromData constructor (cells, tracks, uns, …).

Notes

  • Mask_id becomes cell_id (PyHiM convention).

  • ECSV header comments are captured in cd.uns['pyhim']['ecsv_comments'] and any xyz_unit / genome_assembly entries are also promoted to cd.uns directly (matching from_fofct()).
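When the genomic columns are empty, barcode_dict supplies them. The sketch below shows that lookup on a hand-built table that uses the PyHiM column names listed above. annotate_spots is a hypothetical helper, the barcode coordinates are invented, and ECSV parsing is omitted.

```python
import pandas as pd

# Toy stand-in for a parsed PyHiM trace table without genomic coordinates.
table = pd.DataFrame(
    {
        "Spot_ID": ["s1", "s2", "s3"],
        "Trace_ID": ["t1", "t1", "t1"],
        "x": [0.0, 0.2, 0.4],
        "y": [0.0, 0.1, 0.1],
        "z": [0.0, 0.0, 0.2],
        "Mask_id": [7, 7, 7],
        "Barcode #": [1, 2, 3],
    }
)

# barcode -> (chrom, start, end); the coordinates here are made up.
barcode_dict = {
    1: ("chr2L", 100_000, 105_000),
    2: ("chr2L", 105_000, 110_000),
    3: ("chr2L", 110_000, 115_000),
}


def annotate_spots(table, barcode_dict):
    """Hypothetical helper: derive the required spots columns from barcodes."""
    genomic = pd.DataFrame(
        [barcode_dict[b] for b in table["Barcode #"]],
        columns=["chrom", "start", "end"],
        index=table.index,
    )
    return genomic.assign(
        trace_id=table["Trace_ID"],
        cell_id=table["Mask_id"],  # Mask_id becomes cell_id, per the note above
        spot_id=table["Spot_ID"],
    )


spots = annotate_spots(table, barcode_dict)
coords = table[["x", "y", "z"]].to_numpy()
```

A barcode missing from barcode_dict raises KeyError in this sketch; how from_pyhim_trace itself handles that case is not stated in the documentation.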

classmethod from_seqfish_multiomics(spot_glob, **kwargs) ChromData[source]

Load Takei 2025 DNA seqFISH+ cerebellum data.

Thin shim around uchrom.io.seqfish_multiomics.read_seqfish_multiomics(). See that function for the full parameter list.

classmethod from_seqfish_multiomics_linked(spot_glob, **kwargs)[source]

Load linked Takei 2025 DNA tracing + RNA AnnData artifacts.

Thin shim around uchrom.io.seqfish_multiomics.load_seqfish_multiomics_linked(). Returns a ChromData with RNA expression available at cd.linked_adata and can write paired .h5cd / .h5ad files.

classmethod from_takei2025_cerebellum(**kwargs)[source]

Load linked Takei 2025 cerebellum data.

Thin shim around uchrom.io.seqfish_multiomics.load_takei2025_cerebellum(). Returns a ChromData with RNA expression available at cd.linked_adata.

get_cell(cell_id) ChromData[source]
get_chrom(chrom: str) ChromData[source]
get_trace(trace_id) ChromData[source]

Import cell-level metadata from an AnnData into this ChromData.

Matches cells by cell_id: each unique value in spots['cell_id'] is looked up in adata.obs (by index, or by the column cell_id_col if given). Matched cells get their adata.obs columns merged into self.cells and their adata.obsm arrays copied into self.cellm.

If self.cells already exists, its row order is preserved and AnnData rows are aligned onto that cell axis. This is important for multi-omics loaders such as Takei 2025, where chromatin tracing coordinates live in coords/spots, RNA/IF signals live in spot-level tracks, and mRNA clustering/UMAP already live in cells / cellm.

Parameters:
  • adata (anndata.AnnData) – The single-cell dataset to link (e.g. scRNA-seq).

  • cell_id_col (str, optional) – Column in adata.obs that holds cell identifiers matching spots['cell_id']. If None, adata.obs.index is used as the key.

  • copy_obs (bool) – If True (default), copy adata.obs columns into self.cells.

  • copy_obsm (bool) – If True (default), copy adata.obsm arrays into self.cellm.

Returns:

Number of cells matched.

Return type:

int

Raises:

KeyError – If spots has no cell_id column.
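The cell-matching step can be sketched with pandas alone, using a DataFrame as a stand-in for adata.obs. The first-appearance ordering of the cell axis and the NaN fill for unmatched cells are assumptions of this sketch, not documented behaviour.

```python
import pandas as pd

# Spot-level table: cell identifiers repeat once per spot.
spots = pd.DataFrame({"cell_id": ["c2", "c2", "c1", "c3", "c1"]})

# Cell axis in order of first appearance (assumption): c2, c1, c3.
cell_ids = pd.Index(spots["cell_id"].unique(), name="cell_id")

# Stand-in for adata.obs, indexed by cell identifier; "c9" has no spots.
obs = pd.DataFrame({"cluster": ["A", "B", "C"]}, index=["c1", "c2", "c9"])

# Align obs rows onto the cell axis; cells absent from obs get NaN.
cells = obs.reindex(cell_ids)

# Count the cells that were found in obs.
n_matched = int(cell_ids.isin(obs.index).sum())
print(n_matched)
```

Reindexing onto the existing cell axis, rather than joining on the AnnData order, is what keeps per-cell arrays such as those in cellm row-aligned with cells.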

property linked_adata

Linked AnnData object, loaded lazily from uns metadata if possible.

load_linked_anndata(path: str | Path | None = None)[source]

Load, cache, and return the linked AnnData object.

property n_cells: int
property n_spots: int
property n_traces: int
classmethod read(path: str | Path) ChromData[source]

Read from an HDF5 (.h5cd) file.

Dispatches to a version-specific reader based on f.attrs['uchrom_format_version']. Files written before versioning was introduced are read with a warning using the v1.0 reader (the on-disk layout has been stable from the start).

Forward compatibility:

  • Same MAJOR, higher MINOR → read with a warning; unknown fields are ignored silently by the lower-level helpers.

  • Different MAJOR → raise ValueError with guidance.
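The compatibility rules above can be sketched as a small version check. resolve_reader is a hypothetical function, and the "MAJOR.MINOR" string format for uchrom_format_version is an assumption; the real dispatch logic lives in U-Chrom and may differ.

```python
import warnings


def resolve_reader(file_version, supported=(1, 0)):
    """Pick a reader version for a file, following the documented rules."""
    if file_version is None:
        # File predates versioning: fall back to the v1.0 reader with a warning.
        warnings.warn("unversioned file; using the v1.0 reader")
        return (1, 0)
    major, minor = (int(part) for part in str(file_version).split(".")[:2])
    if major != supported[0]:
        raise ValueError(
            f"file format {file_version} is incompatible with reader "
            f"{supported[0]}.{supported[1]}"
        )
    if minor > supported[1]:
        # Same MAJOR, higher MINOR: readable, but unknown fields are ignored.
        warnings.warn(f"file format {file_version} is newer than the reader")
    return supported


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    reader = resolve_reader("1.3")
print(reader, len(caught))
```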

to_anndata()[source]

Export cell-level data as an AnnData object.

Creates an AnnData where each observation is a cell, obs is self.cells, and obsm is self.cellm. The X matrix is left empty (zeros) because ChromData has no cell-by-feature expression matrix. Spot-level RNA-FISH / IF / epigenomic signals, such as Takei 2025 tracks, remain in self.tracks and are not flattened into AnnData.X.

Return type:

anndata.AnnData

Raises:

ImportError – If anndata is not installed.

to_dataframe() DataFrame[source]

Export as a flat DataFrame with coords + spots columns.

update_discovery_schema(schema: dict | None = None, **kwargs) dict[source]

Store a supplied or newly built auto-discovery schema in uns.

validate_discovery_schema(schema: dict | None = None, *, raise_on_error: bool = False) List[str][source]

Validate a discovery schema against this ChromData.

Returns a list of issues. If raise_on_error=True, raises ValueError when any issue is found.

write(path: str | Path) None[source]

Write to an HDF5 (.h5cd) file.