Concepts
The structure table
The central abstraction in U-Chrom is the structure table: a flat mapping from genomic bins to 3D coordinates.
| chrom | start | end | x | y | z |
|---|---|---|---|---|---|
| chr1 | 0 | 100_000 | 0.34 | 1.20 | -0.55 |
| chr1 | 100_000 | 200_000 | 0.38 | 1.15 | -0.52 |
| … | … | … | … | … | … |
Both reconstruction (Hi-C → 3D) and imaging (chromatin tracing → 3D) produce this format, and every analysis module in U-Chrom consumes it.
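To make the "flat mapping from genomic bins to 3D coordinates" concrete, here is a minimal sketch in plain Python. The `Bin` type and `bin_index` helper are hypothetical names used only for illustration; they are not part of the U-Chrom API, and the rows are the two example rows from the table above.

```python
import math
from typing import NamedTuple


class Bin(NamedTuple):
    """One row of a structure table (hypothetical type, not a U-Chrom class)."""
    chrom: str
    start: int  # bin start, in bp
    end: int    # bin end, in bp
    x: float
    y: float
    z: float


# The two example rows shown in the table above.
table = [
    Bin("chr1", 0, 100_000, 0.34, 1.20, -0.55),
    Bin("chr1", 100_000, 200_000, 0.38, 1.15, -0.52),
]


def bin_index(rows):
    """Map (chrom, start, end) -> (x, y, z) for a single trace."""
    return {(r.chrom, r.start, r.end): (r.x, r.y, r.z) for r in rows}


index = bin_index(table)

# Euclidean distance between the two consecutive bins.
step = math.dist(index[("chr1", 0, 100_000)], index[("chr1", 100_000, 200_000)])
```

A dictionary keyed on the bin alone only works when each bin is observed once. For imaging data, where several traces observe the same bin, the key would also need a trace identifier.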
Hierarchy: Cell → Trace → Spot
A single row in the structure table is a Spot — one 3D observation of one genomic bin. Spots are organised hierarchically:
Spot — one observation of one bin in one trace.
Trace — an ordered polymer of spots along a chromatin fibre. One allele’s copy of the locus is one trace. In chromatin tracing, a diploid cell can give two traces per chromosome (maternal + paternal).
Cell — a physical cell containing one or more traces.
For reconstructed data this hierarchy is thin: each chromosome is typically a single trace, and the whole output is one cell.
For imaging data it matters: a single FOF-CT file may hold hundreds of traces of the same chromosome, usually collected from many cells. Without explicit cell segmentation, however, separate traces cannot be assumed to come from separate cells; each one is simply a separate allele observation.
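The Cell → Trace → Spot hierarchy can be sketched as a nested grouping over spot records. This is an illustration under assumptions, not U-Chrom code: the record layout, the identifiers, and the `group_spots` helper are all hypothetical, and a `cell_id` of `None` stands in for traces with no cell segmentation.

```python
from collections import defaultdict

# Hypothetical spot records; cell_id is None when no segmentation is available.
spots = [
    {"cell_id": "c1", "trace_id": "t1", "chrom": "chr1", "start": 100_000},
    {"cell_id": "c1", "trace_id": "t1", "chrom": "chr1", "start": 0},
    {"cell_id": "c1", "trace_id": "t2", "chrom": "chr1", "start": 0},
    {"cell_id": None, "trace_id": "t3", "chrom": "chr1", "start": 0},
]


def group_spots(records):
    """Build a nested cell -> trace -> [spots] mapping.

    Spots within each trace are sorted by genomic position, so every trace
    reads as an ordered polymer along the fibre.
    """
    cells = defaultdict(lambda: defaultdict(list))
    for spot in records:
        cells[spot["cell_id"]][spot["trace_id"]].append(spot)
    for traces in cells.values():
        for trace in traces.values():
            trace.sort(key=lambda s: (s["chrom"], s["start"]))
    return cells


hierarchy = group_spots(spots)
```

Here cell `c1` carries two traces of chr1 (for example, the two alleles), while trace `t3` is kept under `None` rather than being assigned to a cell it may not belong to.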
ChromData — the container
uchrom.ChromData holds the structure table plus hierarchical metadata and analysis results:
ChromData
├── coords   (n_spots, 3)             x, y, z coordinates
├── spots    DataFrame (n_spots)      chrom, start, end, trace_id, [cell_id]
├── cells    DataFrame (n_cells)      cell-level metadata
├── cellm    dict[str, ndarray]       per-cell embeddings etc.
├── tracks   DataFrame (n_spots)      per-bin epigenomic signals (ATAC, ChIP)
├── traces   DataFrame (n_traces)     trace-level metadata
├── layers   dict[str, (n_spots, 3)]  alternative coordinate sets
├── results  dict                     analysis outputs (loops, tads, ...)
└── uns      dict                     genome_assembly, xyz_unit, ...
Key design choices are documented in User guide — ChromData.
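The layout above implies a shape invariant: every per-spot field has exactly one row per spot. The sketch below illustrates that invariant with a dependency-free stand-in. `ChromDataSketch` is a hypothetical class, not the real `uchrom.ChromData`, which holds arrays and DataFrames rather than lists; the `uns` values are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class ChromDataSketch:
    """Hypothetical stand-in for the layout above; not the real uchrom.ChromData."""

    coords: list                                 # n_spots rows of (x, y, z)
    spots: list                                  # n_spots dicts: chrom, start, end, trace_id
    layers: dict = field(default_factory=dict)   # name -> n_spots rows of (x, y, z)
    results: dict = field(default_factory=dict)  # analysis outputs (loops, tads, ...)
    uns: dict = field(default_factory=dict)      # genome_assembly, xyz_unit, ...

    def __post_init__(self):
        # Every per-spot coordinate set must have one (x, y, z) row per spot.
        n_spots = len(self.spots)
        for name, rows in [("coords", self.coords), *self.layers.items()]:
            if len(rows) != n_spots:
                raise ValueError(f"{name}: expected {n_spots} rows, got {len(rows)}")
            if any(len(row) != 3 for row in rows):
                raise ValueError(f"{name}: every row must be (x, y, z)")


data = ChromDataSketch(
    coords=[(0.34, 1.20, -0.55), (0.38, 1.15, -0.52)],
    spots=[
        {"chrom": "chr1", "start": 0, "end": 100_000, "trace_id": 0},
        {"chrom": "chr1", "start": 100_000, "end": 200_000, "trace_id": 0},
    ],
    uns={"genome_assembly": "hg38", "xyz_unit": "um"},  # illustrative values
)
```

Checking the invariant at construction time means a mismatched layer or a truncated coordinate array fails immediately instead of surfacing later inside an analysis module.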
Data flow
┌──────────────────────────────────┐
│  Hi-C / Dip-C (.pairs, .cool)    │
└──────────────────────────────────┘
                 │
                 ▼  uchrom.recon.{sc,bulk}
┌──────────────────────────────────┐
│  3D coordinates (.h5cd)          │──┐
└──────────────────────────────────┘  │
                                      │
┌──────────────────────────────────┐  │
│  Imaging (FOF-CT .csv)           │──┼──▶ ChromData
└──────────────────────────────────┘  │
                 │                    │
                 ▼  uchrom.im (WIP)   │
┌──────────────────────────────────┐  │
│  3D coordinates (.h5cd)          │──┘
└──────────────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────┐
│  Downstream analysis             │
│    - uchrom.strc.loop            │
│    - uchrom.strc.tad             │
│    - uchrom.strc.comp (WIP)      │
│    - uchrom.fea                  │
│    - uchrom.emb (WIP)            │
│    - uchrom.pl                   │
│    - uchrom.browser              │
└──────────────────────────────────┘
On-disk format: .h5cd
Serialisation is HDF5 with a small root-group schema:
data.h5cd (HDF5)
├── @uchrom_format_version = "1.0"
├── @uchrom_version = "0.2.0"
├── coords    (n_spots, 3) float64
├── spots/    group of per-column datasets
├── cells/    group
├── tracks/   group
├── traces/   group
├── cellm/    group of per-key ndarrays
├── layers/   group of (n_spots, 3) ndarrays
├── results/  group (DataFrames as sub-groups, arrays as datasets)
└── uns/      nested group (scalars as attrs, dicts as sub-groups)
Format version follows MAJOR.MINOR:
same MAJOR, higher MINOR than the reader supports → reader warns and proceeds (unknown fields are ignored);
different MAJOR → reader raises an error that points to the upgrade path;
missing version attribute → reader warns and assumes the legacy 1.0 layout.
See uchrom/core/spec.md for the complete on-disk spec.
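The three MAJOR.MINOR rules listed above can be sketched as a small check. This is one possible reading of the policy, not the reader's actual code: the function name `check_format_version`, the `SUPPORTED` constant, and the choice of `ValueError` are assumptions made for illustration.

```python
import warnings

# Assumed version understood by this hypothetical reader,
# matching the "1.0" shown in the schema above.
SUPPORTED = (1, 0)


def check_format_version(attr, supported=SUPPORTED):
    """Return the (major, minor) layout to read, following the rules above.

    attr is the value of the uchrom_format_version attribute as a string,
    or None if the file does not carry the attribute.
    """
    if attr is None:
        # Missing attribute: warn and fall back to the legacy layout.
        warnings.warn("no format version found; assuming legacy 1.0 layout")
        return (1, 0)

    major, minor = (int(part) for part in attr.split("."))

    if major != supported[0]:
        # Different MAJOR: refuse to read.
        raise ValueError(
            f"file format {attr} is incompatible with reader "
            f"{supported[0]}.{supported[1]}; see the spec for the upgrade path"
        )
    if minor > supported[1]:
        # Same MAJOR, newer MINOR: read what we understand, ignore the rest.
        warnings.warn(f"file format {attr} is newer than the reader; "
                      "unknown fields will be ignored")
    return (major, minor)


version = check_format_version("1.0")
```

Under these assumptions, `check_format_version("1.3")` warns and returns `(1, 3)`, `check_format_version("2.0")` raises, and `check_format_version(None)` warns and returns `(1, 0)`.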