uchrom.recon.fish¶
Joint Hi-C + FISH 3-D chromosome reconstruction.
Joint Hi-C + FISH 3D chromosome reconstruction.
First backend: uchrom.recon.fish.gem_fish — independent PyTorch
reimplementation of Abbas et al. 2019 (Nature Communications,
doi:10.1038/s41467-019-10005-6), GEM-FISH.
Unlike the Hi-C-only reconstructors in uchrom.recon.sc and
uchrom.recon.bulk, this module consumes both a Hi-C matrix and
a FISH-derived pairwise-distance / Rg measurement (e.g. from a
ChromData loaded via
from_fofct()) and solves a joint
optimisation over the 3-D coordinates.
- class uchrom.recon.fish.GEMFISHParams(lambda_E_tad: float = 0.05, lambda_F: float = 0.01, lambda_E_intra: float = 0.05, lambda_R: float = 0.01, stage1_iter: int = 1000, stage1_lr: float = 0.05, stage1_ensemble: int = 20, stage1_optimiser: str = 'adam', stage2_iter: int = 500, stage2_lr: float = 0.05, stage2_ensemble: int = 10, hic_alpha: float = 0.25, hic_fallback_c: float = 1.0, hic_fallback_beta: float = 0.5, tad_di_window: int = 50, tad_di_smoothing: float = 0.1, tad_di_min_size_frac: float = 0.05, tad_max_size_frac: float = 0.05, device: str = 'auto', verbose: bool = True)[source]¶
Bases:
objectRuntime parameters for
reconstruct_gem_fish().
- uchrom.recon.fish.reconstruct_gem_fish(hic_path: str, chrom: str, resolution: int | None = None, fish_cd: Any = None, tads: DataFrame | None = None, params: GEMFISHParams | None = None, return_intermediate: bool = False) Any[source]¶
End-to-end GEM-FISH reconstruction for a single chromosome.
- Parameters:
hic_path (str) –
.coolor.mcoolfile.chrom (str) – Chromosome name as stored in the cooler.
resolution (int, optional) – Required for
.mcool; ignored for.cool.fish_cd (ChromData, optional) – FISH data, e.g. from
ChromData.from_fofct(...). Supplies the C_3 (TAD-centre distances) and C_4 (per-TAD Rg²) targets. IfNone, both FISH terms are dropped and the result degenerates to a Hi-C-only reconstruction.tads (DataFrame, optional) – Override the TAD partition. Must have columns
startandend(bp) and be sorted. If omitted, the Dixon DI caller (uchrom.strc.tad.get_domains()) is run on the Hi-C matrix.params (GEMFISHParams)
return_intermediate (bool) – If True, return a dict with every intermediate artifact (Stage-1 coords, per-TAD coords, loss histories).
- Returns:
Reconstructed 3-D model with bin-resolution coordinates. Each Hi-C bin becomes a spot;
trace_id = 0for the whole chain (single consensus model). Per-stage info is stored incd.uns['gem_fish']. Ifreturn_intermediate=True, returns(cd, dict)with the full artefact bundle.- Return type:
GEM-FISH — three-stage joint optimiser¶
Independent PyTorch reimplementation of Abbas et al. 2019 (Nature Communications 10:2049, doi:10.1038/s41467-019-10005-6).
- class uchrom.recon.fish.GEMFISHParams(lambda_E_tad: float = 0.05, lambda_F: float = 0.01, lambda_E_intra: float = 0.05, lambda_R: float = 0.01, stage1_iter: int = 1000, stage1_lr: float = 0.05, stage1_ensemble: int = 20, stage1_optimiser: str = 'adam', stage2_iter: int = 500, stage2_lr: float = 0.05, stage2_ensemble: int = 10, hic_alpha: float = 0.25, hic_fallback_c: float = 1.0, hic_fallback_beta: float = 0.5, tad_di_window: int = 50, tad_di_smoothing: float = 0.1, tad_di_min_size_frac: float = 0.05, tad_max_size_frac: float = 0.05, device: str = 'auto', verbose: bool = True)[source]¶
Bases:
objectRuntime parameters for
reconstruct_gem_fish().
- uchrom.recon.fish.reconstruct_gem_fish(hic_path: str, chrom: str, resolution: int | None = None, fish_cd: Any = None, tads: DataFrame | None = None, params: GEMFISHParams | None = None, return_intermediate: bool = False) Any[source]¶
End-to-end GEM-FISH reconstruction for a single chromosome.
- Parameters:
hic_path (str) –
.coolor.mcoolfile.chrom (str) – Chromosome name as stored in the cooler.
resolution (int, optional) – Required for
.mcool; ignored for.cool.fish_cd (ChromData, optional) – FISH data, e.g. from
ChromData.from_fofct(...). Supplies the C_3 (TAD-centre distances) and C_4 (per-TAD Rg²) targets. IfNone, both FISH terms are dropped and the result degenerates to a Hi-C-only reconstruction.tads (DataFrame, optional) – Override the TAD partition. Must have columns
startandend(bp) and be sorted. If omitted, the Dixon DI caller (uchrom.strc.tad.get_domains()) is run on the Hi-C matrix.params (GEMFISHParams)
return_intermediate (bool) – If True, return a dict with every intermediate artifact (Stage-1 coords, per-TAD coords, loss histories).
- Returns:
Reconstructed 3-D model with bin-resolution coordinates. Each Hi-C bin becomes a spot;
trace_id = 0for the whole chain (single consensus model). Per-stage info is stored incd.uns['gem_fish']. Ifreturn_intermediate=True, returns(cd, dict)with the full artefact bundle.- Return type:
Internal components¶
Loss primitives¶
Loss primitives for GEM-FISH.
Independent re-implementation of the loss terms from Abbas et al. 2019
(Nat. Commun., doi:10.1038/s41467-019-10005-6). Each primitive takes
a coordinate tensor of shape (*batch, n_bins, 3) (PyTorch) and
returns a scalar loss per batch item. A leading batch dimension is
supported for ensemble training — stacking K independent models on the
first axis lets them share autograd and optimiser state at zero extra
code cost.
Summary of terms¶
C_1— KL divergence between row-normalised Hi-C interaction frequencies and row-normalised inverse reconstructed distances. This is the paper’s Hi-C structural consistency term (Eqn. 3).C_2— polymer conformation energy (bond + angle + excluded volume). The paper cites prior GEM / uchrom.recon.sc.gem for the functional form; implemented here as a sum of three standard harmonic / hinge terms.C_3— sum of squared errors between reconstructed pairwise distances and FISH-measured distances on TAD centres (Eqn. 6).C_4— L1 penalty on the deviation of squared radius-of-gyration from a FISH-derived target (Eqn. 7).
All lengths are in the coordinate system of coords (typically nm),
with the Hi-C-derived distances scaled to the same units.
- uchrom.recon.fish._losses.fish_distance_loss(coords: torch.Tensor, fish_distances: torch.Tensor, mask: torch.Tensor | None = None) torch.Tensor[source]¶
Σ_{i,j ∈ M} (||s_i - s_j|| - F_ij)².- Parameters:
coords (Tensor (..., N, 3))
fish_distances (Tensor (N, N)) – FISH-measured average pairwise distances. Symmetric; NaN or non-finite entries are ignored.
mask (Tensor (N, N), optional) – Extra Boolean mask — set to
Falsewhere no FISH measurement is available. If omitted, finite entries offish_distancesdefine the mask automatically.
- uchrom.recon.fish._losses.hic_kl_loss(coords: torch.Tensor, contacts: torch.Tensor, mask: torch.Tensor | None = None) torch.Tensor[source]¶
Row-wise KL divergence between normalised Hi-C and normalised inverse reconstructed distances.
P_ij = f_ij / Σ_{k≠i} f_ikandQ_ij = (1 + d_ij)^{-1} / Σ_{k≠i} (1 + d_ik)^{-1}, whered_ij = ||s_i - s_j||is the reconstructed 3-D distance.Returns
Σ_i Σ_{j≠i} P_ij log(P_ij / Q_ij)summed overi.- Parameters:
coords (Tensor (..., N, 3))
contacts (Tensor (N, N)) – Raw Hi-C interaction frequencies. Symmetric; diagonal ignored.
mask (Tensor (N, N), optional) – Boolean mask of
(i, j)pairs to include (for e.g. dropping unmappable bins). Default: all off-diagonal pairs.
- uchrom.recon.fish._losses.polymer_energy(coords: torch.Tensor, bond_length: float = 1.0, bond_weight: float = 1.0, angle_weight: float = 0.1, exclude_weight: float = 1.0, exclude_radius: float = 1.0) torch.Tensor[source]¶
Bond-stretching + angle-bending + excluded-volume energy.
Bond:
Σ_i (||s_{i+1} - s_i|| - b)²Angle:
Σ_i θ²whereθis the angle between successive bond vectors.Exclude:
Σ_{i<j} max(0, r - ||s_i - s_j||)²(hinge).
Matches the functional form used in the existing
uchrom.recon.sc.gemTaichi kernels.
Hi-C preprocessing¶
Hi-C helpers for GEM-FISH.
load_contact_matrix()pulls a chromosome-restricted, square contact matrix from a.cool/.mcoolfile at the requested resolution, returning(matrix, bin_df)withbin_dfcarrying(chrom, start, end)for each row / column.aggregate_over_tads()collapses a bin-resolution matrix into a TAD-resolution matrix given a list of(start, end)TAD windows.contact_to_distance()applies Abbas et al. 2019 Eqn. 10 to convert inter-TAD contact counts to initial distance estimates used by the Stage-3 assembly step.
- uchrom.recon.fish._hic.aggregate_over_tads(matrix: ndarray, bin_df: DataFrame, tads: DataFrame) Tuple[ndarray, List[Tuple[int, int]]][source]¶
Collapse a bin-resolution matrix into a TAD-resolution matrix.
- Parameters:
matrix (ndarray (n_bins, n_bins)) – Row / column ordering matches
bin_df.bin_df (DataFrame with
chrom, start, end.)tads (DataFrame with
start, end(bp). Assumed sorted by start) – and on the same chromosome as the matrix.
- Returns:
tad_matrix (ndarray (n_tads, n_tads)) – Sum of bin-level contacts over each
(TAD_i, TAD_j)window.tad_bin_indices (list of (start_idx, end_idx) into
bin_df) – Bin-index range covered by each TAD (end exclusive).
- uchrom.recon.fish._hic.contact_to_distance(contacts: ndarray, genomic_distances: ndarray | None = None, alpha: float = 0.25, fallback_c: float = 1.0, fallback_beta: float = 0.5) ndarray[source]¶
Abbas 2019 Eqn. 10 — convert contact counts to initial distances.
d_ij = f_ij^{-α}whenf_ij > 0. For zero-contact pairs the paper falls back to a genomic-distance-based estimated_ij = c · g_ij^{β}whereg_ijis the genomic separation in bp. We use the two fallback parameters in the same sense.- Parameters:
contacts (ndarray (N, N))
genomic_distances (ndarray (N, N), optional) – Absolute genomic separation of each TAD pair in bp. Must be supplied if any zero contacts need the fallback estimate.
alpha (float) – Power-law exponent; 0.25 from Wang 2016 (empirical for IMR90).
fallback_c (float) – Parameters of the power-law fallback for zero-contact pairs.
fallback_beta (float) – Parameters of the power-law fallback for zero-contact pairs.
- Returns:
d – Symmetric, non-negative. Diagonal is 0.
- Return type:
ndarray (N, N)
- uchrom.recon.fish._hic.load_contact_matrix(path: str, chrom: str, resolution: int | None = None, normalize: bool = True) Tuple[ndarray, DataFrame][source]¶
Load a single-chromosome contact matrix from
.cool/.mcool.- Parameters:
path (str) – Path to a
.cool(single resolution) or.mcoolfile. For.mcool, passresolutionto pick the resolution group; the function prepends::resolutions/{resolution}as needed.chrom (str) – Chromosome name as stored in the cooler.
resolution (int, optional) – Required for
.mcoolfiles; ignored for.coolfiles that already fix the resolution.normalize (bool) – Apply cooler’s weight column (KR / ICE as stored) when available. Falls back to raw counts if no weights are present.
- Returns:
matrix (ndarray (n_bins, n_bins), float64)
bin_df (DataFrame with
chrom, start, endper row.)
Stage-1 TAD-level optimisation¶
Stage 1 of GEM-FISH — TAD-level coarse 3-D model.
Minimises the objective of Abbas et al. 2019 Eqn. (2):
C_g = C_1 + lambda_E * C_2 + lambda_F * C_3
C_1: row-wise KL between Hi-C and reconstructed inverse distances (_losses.hic_kl_loss()).C_2: polymer energy (bond + angle + excluded volume) (_losses.polymer_energy()).C_3: squared error against FISH-measured TAD-centre distances (_losses.fish_distance_loss()).
Gradient descent is run on an ensemble of n_ensemble independent
initialisations; the best single model (lowest final loss) is returned
by default but the full ensemble is also available via return_all.
Loss weights in the paper (λ_E = 5e12, λ_F = 1e-8) assume raw (not
row-normalised) Hi-C counts as input; our PyTorch C_1 is scale-free
after row-normalisation, so we expose smaller defaults that balance the
three terms for unit-normalised data — users should tune them for their
data.
- class uchrom.recon.fish._tad_level.Stage1Params(lambda_E: float = 0.05, lambda_F: float = 0.01, bond_length: float = 1.0, bond_weight: float = 1.0, angle_weight: float = 0.1, exclude_weight: float = 1.0, exclude_radius: float = 0.9, n_iter: int = 1000, lr: float = 0.05, optimiser: str = 'adam', n_ensemble: int = 20, init_scale: float = 3.0, init_from_mds: bool = True, device: str = 'auto', verbose: bool = False)[source]¶
Bases:
objectOptimisation parameters for Stage-1 (TAD-level) reconstruction.
- init_from_mds: bool = True¶
If True, seed the Gaussian-chain optimisation with a classical MDS embedding of the FISH distance matrix (filled in with Hi-C- derived distances where FISH is absent). Starting from an MDS solution that already respects the pairwise-distance structure gives the gradient descent a much better basin than N(0, σ²) — crucial on chromosomes with partial FISH coverage where the un-anchored centres otherwise drift into extended configurations. Each ensemble replica gets a small random perturbation on top of the MDS seed so the ensemble still has diversity.
- uchrom.recon.fish._tad_level.reconstruct_tad_level(contacts: ndarray, fish_distances: ndarray | None, params: Stage1Params | None = None, initial_coords: ndarray | None = None, return_all: bool = False) Tuple[ndarray, dict][source]¶
Stage-1 TAD-level optimisation.
- Parameters:
contacts (ndarray (N, N)) – Inter-TAD Hi-C contact matrix (sums over TAD windows). Raw or normalised — row-normalisation happens inside
C_1.fish_distances (ndarray (N, N), optional) – FISH-measured pairwise distances between TAD centres.
NaNor non-positive entries are ignored. IfNone, the FISH term is dropped.params (Stage1Params)
initial_coords (ndarray (N, 3) or (n_ensemble, N, 3), optional) – Override the Gaussian initialisation.
return_all (bool) – If True, return the full ensemble
(n_ensemble, N, 3); otherwise return the best single model(N, 3).
- Returns:
coords (ndarray)
info (dict with keys
final_loss(per-ensemble),) –best_ensemble_idx,loss_history(shape(n_iter, n_ensemble)), and the split C_1 / C_2 / C_3 at the end.
Stage-2 intra-TAD optimisation¶
Stage 2 of GEM-FISH — intra-TAD fine 3-D model.
For each TAD t, minimises Abbas et al. 2019 Eqn. (9):
C_t = C_1 + lambda_E * C_2 + lambda_R * C_4
C_1: row-wise KL on the intra-TAD Hi-C sub-matrix.C_2: polymer energy (bond + angle + excluded volume).C_4: L1 penalty on the deviation of the reconstructed squared radius of gyration from a FISH-derived target.
Each TAD is solved independently so the work is trivially parallel over TADs. We run an ensemble per TAD just like Stage 1 and keep the best single fibre per TAD.
- class uchrom.recon.fish._intra_tad.Stage2Params(lambda_E: float = 0.05, lambda_R: float = 0.01, bond_length: float = 1.0, bond_weight: float = 1.0, angle_weight: float = 0.1, exclude_weight: float = 1.0, exclude_radius: float = 0.9, n_iter: int = 500, lr: float = 0.05, optimiser: str = 'adam', n_ensemble: int = 10, init_scale: float = 2.0, device: str = 'auto', verbose: bool = False, max_bins_per_tad: int = 2000)[source]¶
Bases:
objectOptimisation parameters for Stage-2 (intra-TAD) reconstruction.
- uchrom.recon.fish._intra_tad.reconstruct_intra_tad(contacts: ndarray, tad_bin_indices: List[Tuple[int, int]], target_rg_sq_per_tad: List[float] | None = None, params: Stage2Params | None = None) Tuple[List[ndarray], List[dict]][source]¶
Run Stage-2 on every TAD.
- Parameters:
contacts (ndarray (n_bins, n_bins)) – Full bin-resolution Hi-C matrix (the same one used to build the TAD partition).
tad_bin_indices (list of (start_idx, end_idx)) – Bin-index windows (end exclusive), as produced by
uchrom.recon.fish._hic.aggregate_over_tads().target_rg_sq_per_tad (list of float, optional) – FISH-derived \(\hat{R}_g^2\) per TAD (same length as
tad_bin_indices). UseNoneorNaNfor TADs without a measurement.params (Stage2Params)
- Returns:
coords_per_tad (list of ndarray (n_bins_in_tad, 3))
infos (list of dict per TAD (final loss, C_1, C_2, C_4))
Stage-3 assembly¶
Stage 3 of GEM-FISH — assemble per-TAD intra models under the TAD-level scaffold.
For each TAD i with intra-TAD coordinates y^{(i)} and a target
TAD-centre position s_i from Stage 1:
Translate each TAD so its centroid (mean of its bin coords) coincides with
s_i.Rotate / reflect each TAD (except the first) to minimise the distance between its start point and the previous TAD’s end point, penalised against the expected inter-TAD gap
d_{i-1,i}(computed from Hi-C contact counts via Eqn. 10).
Step 2 is solved independently per TAD: with the rigid-body transform
constrained to keep the centroid fixed at s_i, the optimal
rotation/reflection that minimises ||y^{(i)}_{start} − (y^{(i-1)}_{end}
+ d̂_{i-1,i})||² is a closed-form Kabsch problem on a single vector
pair, but since we also have the freedom to reflect and the objective
is a single squared distance, the simplest robust thing is to rotate
the TAD around its centroid so that the vector centroid → start
points toward the desired target. That’s a two-atom Kabsch, done via
SVD of the outer product.
- class uchrom.recon.fish._assembly.AssemblyParams(allow_reflection: bool = True, endpoint_k: int = 1, match_scales: bool = True, centre_gap_factor: float = 2.0, iterative: bool = True, inner_max_iter: int = 5000, outer_max_iter: int = 5, alpha: float = 0.1, reflect_threshold: float = 0.2, convergence_abs: float = 0.1)[source]¶
Bases:
objectParameters for the Stage-3 assembly step.
- allow_reflection: bool = True¶
If True, accept reflections (negative-determinant rotations) when they reduce the boundary-gap objective.
- centre_gap_factor: float = 2.0¶
When
match_scales=True, the TAD centres are rescaled so thatmean_inter_TAD_gap = centre_gap_factor * mean_intra_TAD_Rg.2.0corresponds to roughly-tangent TAD packing;1.5gives slight overlap between neighbours;3.0gives visible spacing.
- convergence_abs: float = 0.1¶
stop inner loop when
Σ|d_prior - df| < convergence_abs.- Type:
Convergence
- endpoint_k: int = 1¶
Number of bins from each TAD end used as ‘anchor’ points when computing the Kabsch alignment.
1uses just the first / last bin;3uses the first/last 3 bins (robust to noise).
- iterative: bool = True¶
If True, after the single-shot Kabsch rotation run the upstream GEM-FISH iterative gradient descent (Abbas 2019 Eqn. 13): for each adjacent-TAD pair, rotate TAD i+1 around its centre so the gap between
y_end[i]andy_start[i+1]converges towardd_prior[i, i+1]. Repeats with reflection retry for TADs whose boundary gap has >reflect_thresholdrelative error.
- match_scales: bool = True¶
If True, uniformly rescale the Stage-1 TAD-centre layout so the mean inter-TAD gap matches the intra-TAD Rg scale (roughly tangent packing of TADs). Without this, Stage 1 and Stage 2 end up on incompatible scales: Stage-1 centres drift apart under minimal polymer constraint while each Stage-2 intra-TAD cloud optimises to unit bond length, giving tight blobs with huge bridges between them. The intra-TAD geometry (which Stage 2 optimised against real Hi-C) is preserved; only the global layout is rescaled.
- uchrom.recon.fish._assembly.assemble(tad_centres: ndarray, intra_tad_coords: List[ndarray], inter_tad_distances: ndarray | None = None, params: AssemblyParams | None = None) ndarray[source]¶
Stitch intra-TAD conformations together into a single chain.
- Parameters:
tad_centres (ndarray (n_tads, 3)) – Stage-1 TAD-centre coordinates.
intra_tad_coords (list of ndarray (n_bins_i, 3)) – Stage-2 intra-TAD coordinate clouds (one per TAD).
inter_tad_distances (ndarray (n_tads, n_tads), optional) – Pairwise expected distances used for Stage-3 alignment. Only the immediate-neighbour entries
d[i, i+1]are consulted in this simple implementation. If omitted, we fall back to using the Stage-1 centre distances directly.params (AssemblyParams)
- Returns:
coords – Concatenated, rigid-body-aligned per-bin coordinates. Order is TAD-0 bins first, then TAD-1, …
- Return type:
ndarray (sum n_bins_i, 3)
- uchrom.recon.fish._assembly.gap_distance_from_contacts(tad_contacts: ndarray, tad_genomic_distances: ndarray, alpha: float = 0.25, fallback_c: float = 1.0, fallback_beta: float = 0.5) ndarray[source]¶
Wrapper around
uchrom.recon.fish._hic.contact_to_distance()producing the neighbour-gap matrix used byassemble().