ChromData — the core data container¶
uchrom.ChromData is the central data structure for U-Chrom, analogous to
AnnData for chromatin 3D structure data.
It holds genomic bins mapped to 3D coordinates, grouped in a
Cell → Trace → Spot hierarchy, and persists as .h5cd (HDF5) files.
This notebook covers:
Creating a ChromData from scratch
The three “core tables” (core / cell / chromatin) and auxiliary storage
Subsetting by chromosome, trace, or cell
Round-tripping to .h5cd with format versioning
Importing from reconstruction CSV and FOF-CT
import numpy as np
import pandas as pd
from uchrom import ChromData
Minimal construction¶
ChromData needs at least a coords array of shape (n_spots, 3) and a spots DataFrame with the columns chrom / start / end / trace_id, one row per coordinate.
rng = np.random.default_rng(0)
coords = rng.normal(size=(60, 3))
spots = pd.DataFrame({
'chrom': ['chr1']*30 + ['chr2']*30,
'start': list(range(0, 30_000_000, 1_000_000)) * 2,
'end': list(range(1_000_000, 31_000_000, 1_000_000)) * 2,
'trace_id': [0]*30 + [1]*30,
})
cd = ChromData(coords, spots, uns={'genome_assembly': 'GRCh38'})
cd
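Before constructing, it can help to sanity-check the two inputs. The helper below is a hypothetical sketch using only NumPy and pandas: `check_inputs` is not part of uchrom, and the checks ChromData itself performs may differ.

```python
import numpy as np
import pandas as pd

REQUIRED_COLUMNS = ("chrom", "start", "end", "trace_id")

def check_inputs(coords, spots):
    """Hypothetical pre-flight check; not part of the uchrom API."""
    coords = np.asarray(coords, dtype=float)
    # coords must be 2-D with one (x, y, z) row per spot
    if coords.ndim != 2 or coords.shape[1] != 3:
        raise ValueError(f"coords must have shape (n_spots, 3), got {coords.shape}")
    # one row of genomic metadata per coordinate row
    if len(spots) != coords.shape[0]:
        raise ValueError("coords and spots must have the same number of rows")
    missing = [c for c in REQUIRED_COLUMNS if c not in spots.columns]
    if missing:
        raise ValueError(f"spots is missing columns: {missing}")
    # every genomic bin should have positive length
    if not (spots["end"] > spots["start"]).all():
        raise ValueError("each bin needs end > start")
    return coords

rng = np.random.default_rng(0)
coords = rng.normal(size=(60, 3))
spots = pd.DataFrame({
    "chrom": ["chr1"] * 30 + ["chr2"] * 30,
    "start": list(range(0, 30_000_000, 1_000_000)) * 2,
    "end": list(range(1_000_000, 31_000_000, 1_000_000)) * 2,
    "trace_id": [0] * 30 + [1] * 30,
})
checked = check_inputs(coords, spots)
print(checked.shape)
```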
Attributes¶
print('n_spots :', cd.n_spots)
print('n_traces:', cd.n_traces)
print('chroms :', cd.chroms)
cd.spots.head()
spots['chrom'] and spots['trace_id'] are auto-converted to pd.Categorical for ~10× memory savings on large datasets.
print(cd.spots['chrom'].dtype)
print(cd.spots['trace_id'].dtype)
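To see why a categorical column is cheaper, the pandas-only sketch below compares the same repetitive chromosome labels stored as Python strings and as a category. The column size is an arbitrary choice for illustration, and the ratio you observe depends on the pandas version and on the number of distinct labels.

```python
import pandas as pd

n = 100_000  # illustrative size, not taken from any real dataset
chrom_obj = pd.Series(["chr1"] * (n // 2) + ["chr2"] * (n // 2), dtype=object)
chrom_cat = chrom_obj.astype("category")

# deep=True counts the Python string objects, not just the pointers to them
obj_bytes = chrom_obj.memory_usage(deep=True)
cat_bytes = chrom_cat.memory_usage(deep=True)

print(f"object  : {obj_bytes:,} bytes")
print(f"category: {cat_bytes:,} bytes")
print(f"ratio   : {obj_bytes / cat_bytes:.1f}x")
```

A categorical stores each distinct label once and keeps a small integer code per row, so the saving grows with the number of rows per label.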
Subsetting¶
All subset operations return a new ChromData and never mutate the original.
chr1 = cd.get_chrom('chr1')
print(chr1)
trace0 = cd.get_trace(0)
print(trace0)
# Array indexing also works (here an integer array of row positions)
first_ten = cd[np.arange(10)]
print(first_ten)
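Conceptually, a subset applies the same row selection to the coordinates and to the spots table and hands back copies. The sketch below mimics those semantics with plain NumPy and pandas. It is an illustration only, not the uchrom implementation, and `subset` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def subset(coords, spots, index):
    """Hypothetical helper: select the same rows from both containers, as copies."""
    index = np.asarray(index)
    # NumPy boolean/integer array indexing returns a copy, not a view
    sub_coords = coords[index]
    # iloc accepts both boolean masks and integer positions;
    # reset_index keeps row i of the table aligned with row i of sub_coords
    sub_spots = spots.iloc[index].reset_index(drop=True)
    return sub_coords, sub_spots

rng = np.random.default_rng(0)
coords = rng.normal(size=(6, 3))
spots = pd.DataFrame({
    "chrom": ["chr1"] * 3 + ["chr2"] * 3,
    "trace_id": [0] * 3 + [1] * 3,
})

# Select by boolean mask (all chr1 rows) ...
chr1_coords, chr1_spots = subset(coords, spots, (spots["chrom"] == "chr1").to_numpy())
# ... or by integer positions (first two rows)
head_coords, head_spots = subset(coords, spots, np.arange(2))

chr1_coords[:] = 0.0                      # overwrite the subset
print(np.allclose(coords[:3], 0.0))       # the original rows are unaffected
```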
On-demand distance matrix¶
Global pairwise distance matrices are not stored (they’d need O(n²) memory and are biologically meaningless across cells). Compute them per trace on demand:
d = cd.compute_distances(trace_id=0)
print('distance matrix shape:', d.shape)
print('diagonal is zero :', np.allclose(np.diag(d), 0))
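For intuition, a per-trace Euclidean distance matrix can be computed with NumPy broadcasting alone. This is a minimal sketch of the calculation, assuming plain Euclidean distance between spot coordinates. `pairwise_distances` is a hypothetical name, and `compute_distances` may differ in its details (for example in how it treats missing spots).

```python
import numpy as np

def pairwise_distances(xyz):
    """Euclidean distance between every pair of rows of an (n, 3) array."""
    diff = xyz[:, None, :] - xyz[None, :, :]   # shape (n, n, 3)
    return np.sqrt((diff ** 2).sum(axis=-1))   # shape (n, n)

rng = np.random.default_rng(0)
coords = rng.normal(size=(60, 3))
trace_id = np.repeat([0, 1], 30)               # 30 spots in each of two traces

# Restrict to one trace first, so the matrix is 30 x 30 rather than 60 x 60
d0 = pairwise_distances(coords[trace_id == 0])
print("shape    :", d0.shape)
print("symmetric:", np.allclose(d0, d0.T))
```

Restricting to one trace before computing keeps the cost quadratic in the trace length rather than in the total number of spots.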
Persistence — .h5cd¶
ChromData serialises to HDF5 with a format version attribute so future releases can migrate old files.
import tempfile, os
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, 'demo.h5cd')
cd.write(path)
cd2 = ChromData.read(path)
print(cd2)
print('coords match:', np.allclose(cd.coords, cd2.coords))
Importing from existing formats¶
| Source | Method |
|---|---|
| Reconstruction CSV (chrom, start, end, x, y, z) | |
| 4DN FOF-CT core table | |
# Example: from a reconstruction output CSV (here we build a fake one)
recon = pd.DataFrame({
'chrom': ['chr1']*10,
'start': range(0, 10_000_000, 1_000_000),
'end': range(1_000_000, 11_000_000, 1_000_000),
'x': rng.normal(size=10),
'y': rng.normal(size=10),
'z': rng.normal(size=10),
})
cd_recon = ChromData.from_dataframe(recon, cell_id='demo_cell')
cd_recon
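With a real file on disk, the DataFrame would typically come from `pd.read_csv` rather than being built by hand. The sketch below reads a small in-memory CSV (standing in for a file path) and checks the expected columns before handing the table on. The sample values are made up for illustration, and the final `from_dataframe` call is left as a comment so the snippet runs without uchrom installed.

```python
import io
import pandas as pd

# Stand-in for a reconstruction CSV on disk; the values are made up
csv_text = """chrom,start,end,x,y,z
chr1,0,1000000,0.1,0.2,0.3
chr1,1000000,2000000,0.4,0.5,0.6
"""

recon = pd.read_csv(
    io.StringIO(csv_text),                      # replace with a file path
    dtype={"chrom": str, "start": "int64", "end": "int64"},
)

# Fail early if the file does not have the expected layout
expected = ["chrom", "start", "end", "x", "y", "z"]
missing = [c for c in expected if c not in recon.columns]
if missing:
    raise ValueError(f"CSV is missing columns: {missing}")

print(recon.dtypes)
# The table can then be passed on as in the cell above:
# cd_recon = ChromData.from_dataframe(recon, cell_id='demo_cell')
```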
Where to go next¶
Importing FOF-CT imaging data and inspecting trace structure.
3D reconstruction from Hi-C .pairs files.
Loop calling on FOF-CT data.