ChromData

uchrom.ChromData is the central container. It is conceptually similar to AnnData but organises the data around the chromatin-tracing hierarchy Cell → Trace → Spot.

Construction

import numpy as np
import pandas as pd
from uchrom import ChromData

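# One (x, y, z) row per spot
coords = np.asarray([[1.0, 2.0, 3.0], [1.5, 2.3, 2.9]])
# Per-spot metadata; row i describes coords[i]
spots = pd.DataFrame({
    "chrom":    ["chr1", "chr1"],
    "start":    [0, 100_000],
    "end":      [100_000, 200_000],
    "trace_id": [0, 0],
})
# uns carries unstructured, dataset-level metadata
cd = ChromData(coords, spots, uns={"genome_assembly": "GRCh38"})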

Required spot columns

Column	Type	Required	FOF-CT field
`chrom`	str (category)	yes	`Chrom`
`start`	int	yes	`Chrom_Start`
`end`	int	yes	`Chrom_End`
`trace_id`	int/str (category)	yes	`Trace_ID`
`cell_id`	int/str (category)	no	`Cell_ID`
`spot_id`	int/str	no	`Spot_ID`

String columns are automatically converted to pd.Categorical, which cuts memory use by roughly 10× on large datasets.
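To see where the saving comes from, the following pandas-only sketch compares an object-dtype string column with its categorical equivalent. It does not call uchrom; the column contents and sizes are made up for illustration, and the exact ratio depends on string length and the number of distinct values.

```python
import pandas as pd

# Hypothetical spot table column: 100,000 spots spread over 23 chromosome labels.
n_spots = 100_000
labels = [f"chr{i}" for i in range(1, 24)]
chrom_obj = pd.Series([labels[i % len(labels)] for i in range(n_spots)], dtype=object)

# Categorical storage keeps one small integer code per row
# plus a single lookup table of the distinct labels.
chrom_cat = chrom_obj.astype("category")

bytes_obj = chrom_obj.memory_usage(deep=True)
bytes_cat = chrom_cat.memory_usage(deep=True)
print(f"object:   {bytes_obj / 1e6:.2f} MB")
print(f"category: {bytes_cat / 1e6:.2f} MB")
print(f"ratio:    {bytes_obj / bytes_cat:.0f}x")
```

The values themselves are unchanged by the conversion; only the storage layout differs.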

Attributes

Attribute	Shape	Purpose
`coords`	`(n_spots, 3)`	x, y, z for each spot
`spots`	`(n_spots, ≥4)`	per-spot metadata
`cells`	`(n_cells, ?)`	per-cell metadata
`cellm`	dict	multi-dim cell annotations (embeddings, UMAP)
`tracks`	`(n_spots, n_tracks)`	epigenomic signals aligned to spots
`traces`	`(n_traces, ?)`	per-trace metadata
`layers`	dict	alternative coordinate sets
`results`	dict	analysis outputs (loops, TADs, …)
`uns`	dict	unstructured metadata

Properties: cd.n_spots, cd.n_traces, cd.n_cells, cd.chroms.
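As a rough illustration of how these counts relate to the Cell → Trace → Spot hierarchy, the sketch below derives equivalent numbers from a plain spot table with pandas. It is an assumption about what the properties summarise, not uchrom's implementation, and the example table is invented.

```python
import pandas as pd

# Hypothetical spot metadata: 2 cells, 3 traces, 6 spots.
spots = pd.DataFrame({
    "chrom":    ["chr1", "chr1", "chr2", "chr2", "chr1", "chr1"],
    "start":    [0, 100_000, 0, 100_000, 0, 100_000],
    "end":      [100_000, 200_000, 100_000, 200_000, 100_000, 200_000],
    "trace_id": [0, 0, 1, 1, 2, 2],
    "cell_id":  ["cell_0", "cell_0", "cell_0", "cell_0", "cell_1", "cell_1"],
})

# Counts at each level of the hierarchy.
n_spots = len(spots)
n_traces = spots["trace_id"].nunique()
n_cells = spots["cell_id"].nunique()
chroms = sorted(spots["chrom"].unique().tolist())

# How the levels nest: spots per trace, traces per cell.
spots_per_trace = spots.groupby("trace_id").size()
traces_per_cell = spots.groupby("cell_id")["trace_id"].nunique()

print(n_spots, n_traces, n_cells, chroms)
```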

Subsetting

Subset operations return new ChromData instances and never mutate the original.

cd.get_chrom("chr1")           # all spots on chr1
cd.get_trace(5)                # only trace 5
cd.get_cell("cell_0")          # only cell "cell_0"
cd[cd.spots["chrom"] == "chr1"]  # boolean mask
cd[:100]                       # first 100 spots

All children (coords, spots, tracks, layers, cellm) are filtered consistently; traces and cells that become empty are dropped.
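The idea of consistent filtering can be sketched with plain NumPy and pandas: one boolean mask over spots is applied to every spot-aligned child, and trace-level rows are kept only if a surviving spot still refers to them. This is an illustration under assumed data, not uchrom's code; the `quality` column is hypothetical.

```python
import numpy as np
import pandas as pd

# Spot-aligned children: same length along the first axis.
coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
spots = pd.DataFrame({
    "chrom":    ["chr1", "chr1", "chr2", "chr2"],
    "trace_id": [0, 0, 1, 1],
})
# Trace-level table, indexed by trace_id ("quality" is a made-up column).
traces = pd.DataFrame({"quality": [0.9, 0.7]}, index=pd.Index([0, 1], name="trace_id"))

# One mask drives every spot-aligned child.
mask = (spots["chrom"] == "chr1").to_numpy()
sub_coords = coords[mask]                           # boolean indexing returns a copy
sub_spots = spots.loc[mask].reset_index(drop=True)

# Drop traces that no longer have any spots.
sub_traces = traces.loc[traces.index.isin(sub_spots["trace_id"])]

print(sub_coords.shape, len(sub_spots), sub_traces.index.tolist())
```

The original arrays and tables are left untouched, which matches the non-mutating behaviour described above.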

On-demand distance matrix

A global (n_spots, n_spots) distance matrix is never stored: distances between spots in different cells are meaningless, and memory grows quadratically with the number of spots. Compute distances per trace on demand:

D = cd.compute_distances(trace_id=0)  # (n_spots_in_trace, n_spots_in_trace)
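For intuition, here is a minimal NumPy sketch of a per-trace Euclidean distance matrix, together with the arithmetic behind the scaling concern. It is not uchrom's implementation; the helper name `pairwise_distances` and the example coordinates are hypothetical.

```python
import numpy as np

def pairwise_distances(xyz):
    """Euclidean distance between every pair of rows in an (n, 3) array."""
    xyz = np.asarray(xyz, dtype=float)
    diff = xyz[:, None, :] - xyz[None, :, :]   # shape (n, n, 3) via broadcasting
    return np.sqrt((diff ** 2).sum(axis=-1))   # shape (n, n)

# Four spots belonging to two traces.
coords = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 5.0]])
trace_id = np.array([0, 0, 1, 1])

# Restrict to one trace before computing distances.
D0 = pairwise_distances(coords[trace_id == 0])
D1 = pairwise_distances(coords[trace_id == 1])

# Size of a hypothetical global float64 matrix for one million spots.
n_spots_total = 1_000_000
global_bytes = 8 * n_spots_total ** 2   # 8 bytes per float64 entry
print(D0.shape, D1.shape, f"{global_bytes / 1e12:.0f} TB")
```

Each per-trace matrix is small and symmetric with a zero diagonal, whereas the global matrix in this example would need about 8 TB.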

I/O

Input	Method
Reconstruction CSV (chrom, start, end, x, y, z)	`ChromData.from_dataframe`
4DN FOF-CT core table	`ChromData.from_fofct`
`.h5cd` (HDF5)	`ChromData.read`

cd.write("data.h5cd")
cd2 = ChromData.read("data.h5cd")

The .h5cd format carries a version stamp, and legacy files are handled gracefully on read. See "On-disk format" under Concepts for details.

Why not AnnData?

AnnData observations are flat (one row = one cell). Chromatin tracing adds a hierarchy (Cell → Trace → Spot), and the natural "sample axis" differs per operation: a distance matrix is per-trace, cell-type embeddings are per-cell, and TAD calls are per-locus. ChromData exposes that hierarchy directly while reusing the AnnData-like obsm/uns conventions where they apply (cellm, uns).