uchrom.recon.fish¶

Joint Hi-C + FISH 3-D chromosome reconstruction.

Joint Hi-C + FISH 3D chromosome reconstruction.

First backend: uchrom.recon.fish.gem_fish — independent PyTorch reimplementation of Abbas et al. 2019 (Nature Communications, doi:10.1038/s41467-019-10005-6), GEM-FISH.

Unlike the Hi-C-only reconstructors in uchrom.recon.sc and uchrom.recon.bulk, this module consumes both a Hi-C matrix and a FISH-derived pairwise-distance / Rg measurement (e.g. from a ChromData loaded via from_fofct()) and solves a joint optimisation over the 3-D coordinates.

class uchrom.recon.fish.GEMFISHParams(lambda_E_tad: float = 0.05, lambda_F: float = 0.01, lambda_E_intra: float = 0.05, lambda_R: float = 0.01, stage1_iter: int = 1000, stage1_lr: float = 0.05, stage1_ensemble: int = 20, stage1_optimiser: str = 'adam', stage2_iter: int = 500, stage2_lr: float = 0.05, stage2_ensemble: int = 10, hic_alpha: float = 0.25, hic_fallback_c: float = 1.0, hic_fallback_beta: float = 0.5, tad_di_window: int = 50, tad_di_smoothing: float = 0.1, tad_di_min_size_frac: float = 0.05, tad_max_size_frac: float = 0.05, device: str = 'auto', verbose: bool = True)[source]¶

Bases: object

Runtime parameters for reconstruct_gem_fish().

device: str = 'auto'¶

hic_alpha: float = 0.25¶

hic_fallback_beta: float = 0.5¶

hic_fallback_c: float = 1.0¶

lambda_E_intra: float = 0.05¶

lambda_E_tad: float = 0.05¶

lambda_F: float = 0.01¶

lambda_R: float = 0.01¶

stage1_ensemble: int = 20¶

stage1_iter: int = 1000¶

stage1_lr: float = 0.05¶

stage1_optimiser: str = 'adam'¶

stage2_ensemble: int = 10¶

stage2_iter: int = 500¶

stage2_lr: float = 0.05¶

tad_di_min_size_frac: float = 0.05¶

tad_di_smoothing: float = 0.1¶

tad_di_window: int = 50¶

tad_max_size_frac: float = 0.05¶: Cap any called TAD at tad_max_size_frac * n_bins bins. Oversized DI windows (often in heterochromatic regions like the chr21 p-arm where DI signal is weak) are split into equal chunks. Set to 1.0 to disable capping.

verbose: bool = True¶

uchrom.recon.fish.reconstruct_gem_fish(hic_path: str, chrom: str, resolution: int | None = None, fish_cd: Any = None, tads: DataFrame | None = None, params: GEMFISHParams | None = None, return_intermediate: bool = False) → Any[source]¶

End-to-end GEM-FISH reconstruction for a single chromosome.

Parameters:

hic_path (str) – .cool or .mcool file.
chrom (str) – Chromosome name as stored in the cooler.
resolution (int, optional) – Required for .mcool; ignored for .cool.
fish_cd (ChromData, optional) – FISH data, e.g. from ChromData.from_fofct(...). Supplies the C_3 (TAD-centre distances) and C_4 (per-TAD Rg²) targets. If None, both FISH terms are dropped and the result degenerates to a Hi-C-only reconstruction.
tads (DataFrame, optional) – Override the TAD partition. Must have columns start and end (bp) and be sorted. If omitted, the Dixon DI caller (uchrom.strc.tad.get_domains()) is run on the Hi-C matrix.
params (GEMFISHParams)
return_intermediate (bool) – If True, return a dict with every intermediate artifact (Stage-1 coords, per-TAD coords, loss histories).

Returns:

Reconstructed 3-D model with bin-resolution coordinates. Each Hi-C bin becomes a spot; trace_id = 0 for the whole chain (single consensus model). Per-stage info is stored in cd.uns['gem_fish']. If return_intermediate=True, returns (cd, dict) with the full artefact bundle.

Return type:

ChromData

GEM-FISH — three-stage joint optimiser¶

Independent PyTorch reimplementation of Abbas et al. 2019 (Nature Communications 10:2049, doi:10.1038/s41467-019-10005-6).

Bases: object

Runtime parameters for reconstruct_gem_fish().

device: str = 'auto'¶

hic_alpha: float = 0.25¶

hic_fallback_beta: float = 0.5¶

hic_fallback_c: float = 1.0¶

lambda_E_intra: float = 0.05¶

lambda_E_tad: float = 0.05¶

lambda_F: float = 0.01¶

lambda_R: float = 0.01¶

stage1_ensemble: int = 20¶

stage1_iter: int = 1000¶

stage1_lr: float = 0.05¶

stage1_optimiser: str = 'adam'¶

stage2_ensemble: int = 10¶

stage2_iter: int = 500¶

stage2_lr: float = 0.05¶

tad_di_min_size_frac: float = 0.05¶

tad_di_smoothing: float = 0.1¶

tad_di_window: int = 50¶

tad_max_size_frac: float = 0.05¶: Cap any called TAD at tad_max_size_frac * n_bins bins. Oversized DI windows (often in heterochromatic regions like the chr21 p-arm where DI signal is weak) are split into equal chunks. Set to 1.0 to disable capping.

verbose: bool = True¶

End-to-end GEM-FISH reconstruction for a single chromosome.

Parameters:

hic_path (str) – .cool or .mcool file.
chrom (str) – Chromosome name as stored in the cooler.
resolution (int, optional) – Required for .mcool; ignored for .cool.
fish_cd (ChromData, optional) – FISH data, e.g. from ChromData.from_fofct(...). Supplies the C_3 (TAD-centre distances) and C_4 (per-TAD Rg²) targets. If None, both FISH terms are dropped and the result degenerates to a Hi-C-only reconstruction.
tads (DataFrame, optional) – Override the TAD partition. Must have columns start and end (bp) and be sorted. If omitted, the Dixon DI caller (uchrom.strc.tad.get_domains()) is run on the Hi-C matrix.
params (GEMFISHParams)
return_intermediate (bool) – If True, return a dict with every intermediate artifact (Stage-1 coords, per-TAD coords, loss histories).

Returns:

Return type:

ChromData

Internal components¶

Loss primitives¶

Loss primitives for GEM-FISH.

Independent re-implementation of the loss terms from Abbas et al. 2019 (Nat. Commun., doi:10.1038/s41467-019-10005-6). Each primitive takes a coordinate tensor of shape (*batch, n_bins, 3) (PyTorch) and returns a scalar loss per batch item. A leading batch dimension is supported for ensemble training — stacking K independent models on the first axis lets them share autograd and optimiser state at zero extra code cost.

Summary of terms¶

C_1 — KL divergence between row-normalised Hi-C interaction frequencies and row-normalised inverse reconstructed distances. This is the paper’s Hi-C structural consistency term (Eqn. 3).
C_2 — polymer conformation energy (bond + angle + excluded volume). The paper cites prior GEM / uchrom.recon.sc.gem for the functional form; implemented here as a sum of three standard harmonic / hinge terms.
C_3 — sum of squared errors between reconstructed pairwise distances and FISH-measured distances on TAD centres (Eqn. 6).
C_4 — L1 penalty on the deviation of squared radius-of-gyration from a FISH-derived target (Eqn. 7).

All lengths are in the coordinate system of coords (typically nm), with the Hi-C-derived distances scaled to the same units.

uchrom.recon.fish._losses.fish_distance_loss(coords: torch.Tensor, fish_distances: torch.Tensor, mask: torch.Tensor | None = None) → torch.Tensor[source]¶

Σ_{i,j ∈ M} (||s_i - s_j|| - F_ij)².

Parameters:

coords (Tensor (..., N, 3))
fish_distances (Tensor (N, N)) – FISH-measured average pairwise distances. Symmetric; NaN or non-finite entries are ignored.
mask (Tensor (N, N), optional) – Extra Boolean mask — set to False where no FISH measurement is available. If omitted, finite entries of fish_distances define the mask automatically.

uchrom.recon.fish._losses.hic_kl_loss(coords: torch.Tensor, contacts: torch.Tensor, mask: torch.Tensor | None = None) → torch.Tensor[source]¶

Row-wise KL divergence between normalised Hi-C and normalised inverse reconstructed distances.

P_ij = f_ij / Σ_{k≠i} f_ik and Q_ij = (1 + d_ij)^{-1} / Σ_{k≠i} (1 + d_ik)^{-1}, where d_ij = ||s_i - s_j|| is the reconstructed 3-D distance.

Returns Σ_i Σ_{j≠i} P_ij log(P_ij / Q_ij) summed over i.

Parameters:

coords (Tensor (..., N, 3))
contacts (Tensor (N, N)) – Raw Hi-C interaction frequencies. Symmetric; diagonal ignored.
mask (Tensor (N, N), optional) – Boolean mask of (i, j) pairs to include (for e.g. dropping unmappable bins). Default: all off-diagonal pairs.

uchrom.recon.fish._losses.polymer_energy(coords: torch.Tensor, bond_length: float = 1.0, bond_weight: float = 1.0, angle_weight: float = 0.1, exclude_weight: float = 1.0, exclude_radius: float = 1.0) → torch.Tensor[source]¶

Bond-stretching + angle-bending + excluded-volume energy.

Bond: Σ_i (||s_{i+1} - s_i|| - b)²
Angle: Σ_i θ² where θ is the angle between successive bond vectors.
Exclude: Σ_{i<j} max(0, r - ||s_i - s_j||)² (hinge).

Matches the functional form used in the existing uchrom.recon.sc.gem Taichi kernels.

uchrom.recon.fish._losses.rg_loss(coords: torch.Tensor, target_rg_sq: torch.Tensor) → torch.Tensor[source]¶

|R_g² - R̂_g²| — L1 penalty on the squared-Rg deviation.

Parameters:

coords (Tensor (..., N, 3))
target_rg_sq (Tensor (...) or scalar) – FISH-derived R̂_g². Broadcast over the batch dimension.

uchrom.recon.fish._losses.rg_squared(coords: torch.Tensor) → torch.Tensor[source]¶: Squared radius of gyration (1/N) Σ ||s_i - s̄||².

Hi-C preprocessing¶

Hi-C helpers for GEM-FISH.

load_contact_matrix() pulls a chromosome-restricted, square contact matrix from a .cool / .mcool file at the requested resolution, returning (matrix, bin_df) with bin_df carrying (chrom, start, end) for each row / column.
aggregate_over_tads() collapses a bin-resolution matrix into a TAD-resolution matrix given a list of (start, end) TAD windows.
contact_to_distance() applies Abbas et al. 2019 Eqn. 10 to convert inter-TAD contact counts to initial distance estimates used by the Stage-3 assembly step.

uchrom.recon.fish._hic.aggregate_over_tads(matrix: ndarray, bin_df: DataFrame, tads: DataFrame) → Tuple[ndarray, List[Tuple[int, int]]][source]¶

Collapse a bin-resolution matrix into a TAD-resolution matrix.

Parameters:

matrix (ndarray (n_bins, n_bins)) – Row / column ordering matches bin_df.
bin_df (DataFrame with chrom, start, end.)
tads (DataFrame with start, end (bp). Assumed sorted by start) – and on the same chromosome as the matrix.

Returns:

tad_matrix (ndarray (n_tads, n_tads)) – Sum of bin-level contacts over each (TAD_i, TAD_j) window.
tad_bin_indices (list of (start_idx, end_idx) into bin_df) – Bin-index range covered by each TAD (end exclusive).

uchrom.recon.fish._hic.contact_to_distance(contacts: ndarray, genomic_distances: ndarray | None = None, alpha: float = 0.25, fallback_c: float = 1.0, fallback_beta: float = 0.5) → ndarray[source]¶

Abbas 2019 Eqn. 10 — convert contact counts to initial distances.

d_ij = f_ij^{-α} when f_ij > 0. For zero-contact pairs the paper falls back to a genomic-distance-based estimate d_ij = c · g_ij^{β} where g_ij is the genomic separation in bp. We use the two fallback parameters in the same sense.

Parameters:

contacts (ndarray (N, N))
genomic_distances (ndarray (N, N), optional) – Absolute genomic separation of each TAD pair in bp. Must be supplied if any zero contacts need the fallback estimate.
alpha (float) – Power-law exponent; 0.25 from Wang 2016 (empirical for IMR90).
fallback_c (float) – Parameters of the power-law fallback for zero-contact pairs.
fallback_beta (float) – Parameters of the power-law fallback for zero-contact pairs.

Returns:

d – Symmetric, non-negative. Diagonal is 0.

Return type:

ndarray (N, N)

uchrom.recon.fish._hic.load_contact_matrix(path: str, chrom: str, resolution: int | None = None, normalize: bool = True) → Tuple[ndarray, DataFrame][source]¶

Load a single-chromosome contact matrix from .cool / .mcool.

Parameters:

path (str) – Path to a .cool (single resolution) or .mcool file. For .mcool, pass resolution to pick the resolution group; the function prepends ::resolutions/{resolution} as needed.
chrom (str) – Chromosome name as stored in the cooler.
resolution (int, optional) – Required for .mcool files; ignored for .cool files that already fix the resolution.
normalize (bool) – Apply cooler’s weight column (KR / ICE as stored) when available. Falls back to raw counts if no weights are present.

Returns:

matrix (ndarray (n_bins, n_bins), float64)
bin_df (DataFrame with chrom, start, end per row.)

Stage-1 TAD-level optimisation¶

Stage 1 of GEM-FISH — TAD-level coarse 3-D model.

Minimises the objective of Abbas et al. 2019 Eqn. (2):

C_g = C_1 + lambda_E * C_2 + lambda_F * C_3

C_1: row-wise KL between Hi-C and reconstructed inverse distances (_losses.hic_kl_loss()).
C_2: polymer energy (bond + angle + excluded volume) (_losses.polymer_energy()).
C_3: squared error against FISH-measured TAD-centre distances (_losses.fish_distance_loss()).

Gradient descent is run on an ensemble of n_ensemble independent initialisations; the best single model (lowest final loss) is returned by default but the full ensemble is also available via return_all.

Loss weights in the paper (λ_E = 5e12, λ_F = 1e-8) assume raw (not row-normalised) Hi-C counts as input; our PyTorch C_1 is scale-free after row-normalisation, so we expose smaller defaults that balance the three terms for unit-normalised data — users should tune them for their data.

class uchrom.recon.fish._tad_level.Stage1Params(lambda_E: float = 0.05, lambda_F: float = 0.01, bond_length: float = 1.0, bond_weight: float = 1.0, angle_weight: float = 0.1, exclude_weight: float = 1.0, exclude_radius: float = 0.9, n_iter: int = 1000, lr: float = 0.05, optimiser: str = 'adam', n_ensemble: int = 20, init_scale: float = 3.0, init_from_mds: bool = True, device: str = 'auto', verbose: bool = False)[source]¶

Bases: object

Optimisation parameters for Stage-1 (TAD-level) reconstruction.

angle_weight: float = 0.1¶

bond_length: float = 1.0¶

bond_weight: float = 1.0¶

device: str = 'auto'¶

exclude_radius: float = 0.9¶

exclude_weight: float = 1.0¶

init_from_mds: bool = True¶: If True, seed the Gaussian-chain optimisation with a classical MDS embedding of the FISH distance matrix (filled in with Hi-C- derived distances where FISH is absent). Starting from an MDS solution that already respects the pairwise-distance structure gives the gradient descent a much better basin than N(0, σ²) — crucial on chromosomes with partial FISH coverage where the un-anchored centres otherwise drift into extended configurations. Each ensemble replica gets a small random perturbation on top of the MDS seed so the ensemble still has diversity.

init_scale: float = 3.0¶

lambda_E: float = 0.05¶

lambda_F: float = 0.01¶

lr: float = 0.05¶

n_ensemble: int = 20¶

n_iter: int = 1000¶

optimiser: str = 'adam'¶

verbose: bool = False¶

uchrom.recon.fish._tad_level.reconstruct_tad_level(contacts: ndarray, fish_distances: ndarray | None, params: Stage1Params | None = None, initial_coords: ndarray | None = None, return_all: bool = False) → Tuple[ndarray, dict][source]¶

Stage-1 TAD-level optimisation.

Parameters:

contacts (ndarray (N, N)) – Inter-TAD Hi-C contact matrix (sums over TAD windows). Raw or normalised — row-normalisation happens inside C_1.
fish_distances (ndarray (N, N), optional) – FISH-measured pairwise distances between TAD centres. NaN or non-positive entries are ignored. If None, the FISH term is dropped.
params (Stage1Params)
initial_coords (ndarray (N, 3) or (n_ensemble, N, 3), optional) – Override the Gaussian initialisation.
return_all (bool) – If True, return the full ensemble (n_ensemble, N, 3); otherwise return the best single model (N, 3).

Returns:

coords (ndarray)
info (dict with keys final_loss (per-ensemble),) – best_ensemble_idx, loss_history (shape (n_iter, n_ensemble)), and the split C_1 / C_2 / C_3 at the end.

Stage-2 intra-TAD optimisation¶

Stage 2 of GEM-FISH — intra-TAD fine 3-D model.

For each TAD t, minimises Abbas et al. 2019 Eqn. (9):

C_t = C_1 + lambda_E * C_2 + lambda_R * C_4

C_1: row-wise KL on the intra-TAD Hi-C sub-matrix.
C_2: polymer energy (bond + angle + excluded volume).
C_4: L1 penalty on the deviation of the reconstructed squared radius of gyration from a FISH-derived target.

Each TAD is solved independently so the work is trivially parallel over TADs. We run an ensemble per TAD just like Stage 1 and keep the best single fibre per TAD.

class uchrom.recon.fish._intra_tad.Stage2Params(lambda_E: float = 0.05, lambda_R: float = 0.01, bond_length: float = 1.0, bond_weight: float = 1.0, angle_weight: float = 0.1, exclude_weight: float = 1.0, exclude_radius: float = 0.9, n_iter: int = 500, lr: float = 0.05, optimiser: str = 'adam', n_ensemble: int = 10, init_scale: float = 2.0, device: str = 'auto', verbose: bool = False, max_bins_per_tad: int = 2000)[source]¶

Bases: object

Optimisation parameters for Stage-2 (intra-TAD) reconstruction.

angle_weight: float = 0.1¶

bond_length: float = 1.0¶

bond_weight: float = 1.0¶

device: str = 'auto'¶

exclude_radius: float = 0.9¶

exclude_weight: float = 1.0¶

init_scale: float = 2.0¶

lambda_E: float = 0.05¶

lambda_R: float = 0.01¶

lr: float = 0.05¶

max_bins_per_tad: int = 2000¶

n_ensemble: int = 10¶

n_iter: int = 500¶

optimiser: str = 'adam'¶

verbose: bool = False¶

uchrom.recon.fish._intra_tad.reconstruct_intra_tad(contacts: ndarray, tad_bin_indices: List[Tuple[int, int]], target_rg_sq_per_tad: List[float] | None = None, params: Stage2Params | None = None) → Tuple[List[ndarray], List[dict]][source]¶

Run Stage-2 on every TAD.

Parameters:

contacts (ndarray (n_bins, n_bins)) – Full bin-resolution Hi-C matrix (the same one used to build the TAD partition).
tad_bin_indices (list of (start_idx, end_idx)) – Bin-index windows (end exclusive), as produced by uchrom.recon.fish._hic.aggregate_over_tads().
target_rg_sq_per_tad (list of float, optional) – FISH-derived \(\hat{R}_g^2\) per TAD (same length as tad_bin_indices). Use None or NaN for TADs without a measurement.
params (Stage2Params)

Returns:

coords_per_tad (list of ndarray (n_bins_in_tad, 3))
infos (list of dict per TAD (final loss, C_1, C_2, C_4))

Stage-3 assembly¶

Stage 3 of GEM-FISH — assemble per-TAD intra models under the TAD-level scaffold.

For each TAD i with intra-TAD coordinates y^{(i)} and a target TAD-centre position s_i from Stage 1:

Translate each TAD so its centroid (mean of its bin coords) coincides with s_i.
Rotate / reflect each TAD (except the first) to minimise the distance between its start point and the previous TAD’s end point, penalised against the expected inter-TAD gap d_{i-1,i} (computed from Hi-C contact counts via Eqn. 10).

Step 2 is solved independently per TAD: with the rigid-body transform constrained to keep the centroid fixed at s_i, the optimal rotation/reflection that minimises ||y^{(i)}_{start} − (y^{(i-1)}_{end} + d̂_{i-1,i})||² is a closed-form Kabsch problem on a single vector pair, but since we also have the freedom to reflect and the objective is a single squared distance, the simplest robust thing is to rotate the TAD around its centroid so that the vector centroid → start points toward the desired target. That’s a two-atom Kabsch, done via SVD of the outer product.

class uchrom.recon.fish._assembly.AssemblyParams(allow_reflection: bool = True, endpoint_k: int = 1, match_scales: bool = True, centre_gap_factor: float = 2.0, iterative: bool = True, inner_max_iter: int = 5000, outer_max_iter: int = 5, alpha: float = 0.1, reflect_threshold: float = 0.2, convergence_abs: float = 0.1)[source]¶

Bases: object

Parameters for the Stage-3 assembly step.

allow_reflection: bool = True¶: If True, accept reflections (negative-determinant rotations) when they reduce the boundary-gap objective.

alpha: float = 0.1¶: Gradient-descent learning rate for the iterative assembly.

centre_gap_factor: float = 2.0¶: When match_scales=True, the TAD centres are rescaled so that mean_inter_TAD_gap = centre_gap_factor * mean_intra_TAD_Rg. 2.0 corresponds to roughly-tangent TAD packing; 1.5 gives slight overlap between neighbours; 3.0 gives visible spacing.

convergence_abs: float = 0.1¶

stop inner loop when Σ|d_prior - df| < convergence_abs.

Type:: Convergence

endpoint_k: int = 1¶: Number of bins from each TAD end used as ‘anchor’ points when computing the Kabsch alignment. 1 uses just the first / last bin; 3 uses the first/last 3 bins (robust to noise).

inner_max_iter: int = 5000¶: Max gradient-descent iterations per inner assembly pass.

iterative: bool = True¶: If True, after the single-shot Kabsch rotation run the upstream GEM-FISH iterative gradient descent (Abbas 2019 Eqn. 13): for each adjacent-TAD pair, rotate TAD i+1 around its centre so the gap between y_end[i] and y_start[i+1] converges toward d_prior[i, i+1]. Repeats with reflection retry for TADs whose boundary gap has > reflect_threshold relative error.

match_scales: bool = True¶: If True, uniformly rescale the Stage-1 TAD-centre layout so the mean inter-TAD gap matches the intra-TAD Rg scale (roughly tangent packing of TADs). Without this, Stage 1 and Stage 2 end up on incompatible scales: Stage-1 centres drift apart under minimal polymer constraint while each Stage-2 intra-TAD cloud optimises to unit bond length, giving tight blobs with huge bridges between them. The intra-TAD geometry (which Stage 2 optimised against real Hi-C) is preserved; only the global layout is rescaled.

outer_max_iter: int = 5¶: Max outer passes (each can reflect bad TADs and re-run inner).

reflect_threshold: float = 0.2¶: TADs whose adjacent-boundary relative error exceeds this threshold are reflected through their centre before the next outer pass (matches the upstream reflection-retry strategy).

uchrom.recon.fish._assembly.assemble(tad_centres: ndarray, intra_tad_coords: List[ndarray], inter_tad_distances: ndarray | None = None, params: AssemblyParams | None = None) → ndarray[source]¶

Stitch intra-TAD conformations together into a single chain.

Parameters:

tad_centres (ndarray (n_tads, 3)) – Stage-1 TAD-centre coordinates.
intra_tad_coords (list of ndarray (n_bins_i, 3)) – Stage-2 intra-TAD coordinate clouds (one per TAD).
inter_tad_distances (ndarray (n_tads, n_tads), optional) – Pairwise expected distances used for Stage-3 alignment. Only the immediate-neighbour entries d[i, i+1] are consulted in this simple implementation. If omitted, we fall back to using the Stage-1 centre distances directly.
params (AssemblyParams)

Returns:

coords – Concatenated, rigid-body-aligned per-bin coordinates. Order is TAD-0 bins first, then TAD-1, …

Return type:

ndarray (sum n_bins_i, 3)

uchrom.recon.fish._assembly.gap_distance_from_contacts(tad_contacts: ndarray, tad_genomic_distances: ndarray, alpha: float = 0.25, fallback_c: float = 1.0, fallback_beta: float = 0.5) → ndarray[source]¶: Wrapper around uchrom.recon.fish._hic.contact_to_distance() producing the neighbour-gap matrix used by assemble().