ChromData — the core data container

uchrom.ChromData is the central data structure of U-Chrom. It plays the role that AnnData plays for single-cell expression data, but for 3D chromatin structure: it holds genomic bins mapped to 3D coordinates, grouped in a Cell → Trace → Spot hierarchy, and persists as .h5cd (HDF5) files.

This notebook covers:

  • Creating a ChromData from scratch

  • The three “core tables” (core / cell / chromatin) and auxiliary storage

  • Subsetting by chromosome, trace, or cell

  • Round-tripping to .h5cd with format versioning

  • Importing from reconstruction CSV and FOF-CT

import numpy as np
import pandas as pd
from uchrom import ChromData

Minimal construction

ChromData needs at least a coords array of shape (n_spots, 3) and a spots DataFrame with chrom, start, end, and trace_id columns, one row per spot.

rng = np.random.default_rng(0)
coords = rng.normal(size=(60, 3))
spots = pd.DataFrame({
    'chrom':    ['chr1']*30 + ['chr2']*30,
    'start':    list(range(0, 30_000_000, 1_000_000)) * 2,
    'end':      list(range(1_000_000, 31_000_000, 1_000_000)) * 2,
    'trace_id': [0]*30 + [1]*30,
})

cd = ChromData(coords, spots, uns={'genome_assembly': 'GRCh38'})
cd

Attributes

print('n_spots :', cd.n_spots)
print('n_traces:', cd.n_traces)
print('chroms  :', cd.chroms)
cd.spots.head()
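
As a sanity check, the same counts can be derived from the input table with plain pandas. This is a sketch that assumes n_spots counts rows, n_traces counts distinct trace_id values, and chroms lists the distinct chromosomes, which is what the attribute names suggest; it does not call ChromData at all.

```python
import pandas as pd

# Same layout as the spots table built above: 2 traces of 30 bins each.
spots = pd.DataFrame({
    'chrom':    ['chr1'] * 30 + ['chr2'] * 30,
    'start':    list(range(0, 30_000_000, 1_000_000)) * 2,
    'end':      list(range(1_000_000, 31_000_000, 1_000_000)) * 2,
    'trace_id': [0] * 30 + [1] * 30,
})

n_spots = len(spots)                               # one row per spot
n_traces = spots['trace_id'].nunique()             # distinct traces
chroms = sorted(spots['chrom'].unique().tolist())  # chromosomes present
spots_per_trace = spots.groupby('trace_id').size() # Trace -> Spot fan-out

print('n_spots :', n_spots)
print('n_traces:', n_traces)
print('chroms  :', chroms)
print(spots_per_trace)
```

If the values printed by ChromData differ from these, the input table is the first place to look.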

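spots['chrom'] and spots['trace_id'] are auto-converted to pd.Categorical. Because each distinct value is stored once and rows hold only small integer codes, this saves memory on large datasets (on the order of 10×, depending on the number of rows and distinct values).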

print(cd.spots['chrom'].dtype)
print(cd.spots['trace_id'].dtype)
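
To see where the saving comes from, the sketch below compares an object-dtype string column with its categorical equivalent using pandas only. The column size is an arbitrary choice for illustration, and the ratio it prints is specific to this synthetic column, not a measurement of ChromData.

```python
import numpy as np
import pandas as pd

n = 100_000  # arbitrary size for the illustration

# A chromosome column stored as Python strings (object dtype) ...
chrom_obj = pd.Series(np.repeat(['chr1', 'chr2'], n // 2), dtype=object)
# ... and the same column stored as integer codes plus a small category table.
chrom_cat = chrom_obj.astype('category')

obj_bytes = chrom_obj.memory_usage(deep=True)
cat_bytes = chrom_cat.memory_usage(deep=True)

print(f'object  : {obj_bytes / 1e6:.2f} MB')
print(f'category: {cat_bytes / 1e6:.2f} MB')
print(f'ratio   : {obj_bytes / cat_bytes:.1f}x')
```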

Subsetting

All subset operations return a new ChromData and never mutate the original.

chr1 = cd.get_chrom('chr1')
print(chr1)
trace0 = cd.get_trace(0)
print(trace0)

# Indexing with an integer array of spot positions also works
first_ten = cd[np.arange(10)]
print(first_ten)
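
The index used above, np.arange(10), is an array of integer positions. The sketch below shows the equivalent Boolean mask on a plain NumPy coordinate array, and that this kind of indexing yields a copy. Whether ChromData's indexing accepts Boolean masks is not shown in this notebook, so the example deliberately stays with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
coords = rng.normal(size=(60, 3))

idx = np.arange(10)                  # integer-array index: row positions
mask = np.arange(len(coords)) < 10   # Boolean mask: one flag per row

by_index = coords[idx]
by_mask = coords[mask]

print(by_index.shape, by_mask.shape)
print('same rows:', np.array_equal(by_index, by_mask))

# Advanced indexing returns a copy, so editing the subset leaves coords intact.
by_index[0] = 0.0
print('original untouched:', not np.allclose(coords[0], 0.0))
```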

On-demand distance matrix

Global pairwise distance matrices are not stored: they would need O(n²) memory, and distances between spots in different cells are biologically meaningless. Compute them per trace on demand:

d = cd.compute_distances(trace_id=0)
print('distance matrix shape:', d.shape)
print('diagonal is zero      :', np.allclose(np.diag(d), 0))
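
For intuition, a per-trace distance matrix can be computed with NumPy broadcasting as sketched below. The function name trace_distances is hypothetical and the Euclidean metric is an assumption; this is not ChromData's implementation, only an illustration of the per-trace idea.

```python
import numpy as np

rng = np.random.default_rng(0)
coords = rng.normal(size=(60, 3))
trace_id = np.array([0] * 30 + [1] * 30)

def trace_distances(coords, trace_id, which):
    """Euclidean distance matrix between the spots of one trace (hypothetical helper)."""
    xyz = coords[trace_id == which]             # (n, 3) spots of this trace only
    diff = xyz[:, None, :] - xyz[None, :, :]    # (n, n, 3) pairwise differences
    return np.sqrt((diff ** 2).sum(axis=-1))    # (n, n) distances

d0 = trace_distances(coords, trace_id, 0)
print('shape    :', d0.shape)
print('symmetric:', np.allclose(d0, d0.T))
```

Restricting to one trace keeps the matrix at 30 × 30 here, instead of 60 × 60 for all spots together.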

Persistence — .h5cd

ChromData serialises to HDF5 with a format version attribute so future releases can migrate old files.

import os
import tempfile

tmp = tempfile.mkdtemp()
path = os.path.join(tmp, 'demo.h5cd')

cd.write(path)
cd2 = ChromData.read(path)
print(cd2)
print('coords match:', np.allclose(cd.coords, cd2.coords))

Importing from existing formats

Source                                             Method
Reconstruction CSV (chrom, start, end, x, y, z)    ChromData.from_dataframe(df)
4DN FOF-CT core table                              ChromData.from_fofct(path)
.h5cd                                              ChromData.read(path)

# Example: from a reconstruction output CSV (here we build a fake one)
recon = pd.DataFrame({
    'chrom': ['chr1']*10,
    'start': range(0, 10_000_000, 1_000_000),
    'end':   range(1_000_000, 11_000_000, 1_000_000),
    'x': rng.normal(size=10),
    'y': rng.normal(size=10),
    'z': rng.normal(size=10),
})
cd_recon = ChromData.from_dataframe(recon, cell_id='demo_cell')
cd_recon
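
When the table comes from a real CSV file, it can help to check the expected columns before handing it over. The sketch below uses pandas only; the column list is taken from the table above, the CSV rows are made-up values, and whether from_dataframe performs its own validation is not stated in this notebook.

```python
import io
import pandas as pd

# Columns listed for the reconstruction CSV format above.
REQUIRED = ['chrom', 'start', 'end', 'x', 'y', 'z']

# Made-up rows standing in for a file on disk.
csv_text = """chrom,start,end,x,y,z
chr1,0,1000000,0.1,0.2,0.3
chr1,1000000,2000000,0.4,0.5,0.6
"""

df = pd.read_csv(io.StringIO(csv_text))

# Fail early if a required column is absent.
missing = [c for c in REQUIRED if c not in df.columns]
if missing:
    raise ValueError(f'missing columns: {missing}')

# Flag bins whose end does not lie after their start.
bad = df[df['end'] <= df['start']]
print(len(df), 'rows;', len(bad), 'with end <= start')
```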

Where to go next

  • Importing FOF-CT imaging data and inspecting trace structure.

  • 3D reconstruction from Hi-C .pairs files.

  • Loop calling on FOF-CT data.