uchrom.fea

Geometric / statistical features over chromatin traces.

uchrom.fea.axis_variance_cube(cd, chrom: str, device: str = 'auto') → dict

Compute per-axis pairwise variance + sample-count cubes.

Returns a dict with var, count, mean (all (3, B, B), where B is the number of bins) plus bin_ids, n_traces, chrom, and the full (3, T, B, B) pairwise diff tensor (T traces) under key "diff". That tensor stays on the selected torch device so that the downstream filter_normalize() can consume it.
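To make the shapes concrete, here is a minimal NumPy sketch of a per-axis pairwise variance cube. It assumes the spots are already pivoted into a (T, B, 3) array with NaN for missing spots and uses the population variance (divide by the valid count). The name pairwise_axis_cube is hypothetical and not part of uchrom; the library function itself runs on a torch device.

```python
import numpy as np

def pairwise_axis_cube(coords):
    """coords: (T, B, 3) array, NaN where a spot is missing (assumed layout)."""
    # (T, B, 3) -> (3, T, B): one plane of bin coordinates per axis
    per_axis = np.moveaxis(coords, -1, 0)
    # Signed pairwise difference per axis and trace: (3, T, B, B)
    diff = per_axis[:, :, :, None] - per_axis[:, :, None, :]
    valid = ~np.isnan(diff)
    count = valid.sum(axis=1)                              # (3, B, B)
    total = np.where(valid, diff, 0.0).sum(axis=1)
    with np.errstate(invalid="ignore", divide="ignore"):
        mean = total / count                               # NaN where count == 0
        sq = np.where(valid, (diff - mean[:, None]) ** 2, 0.0).sum(axis=1)
        var = sq / count                                   # NaN-aware variance
    return {"var": var, "count": count, "mean": mean, "diff": diff}
```

Each trace only contributes to a bin pair when both spots were detected on that trace, which is why count can differ from entry to entry.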

uchrom.fea.axis_weight(cd, chrom: str | None = None, device: str = 'auto') → ndarray

Compute per-axis weights w ∝ 1 / median(trace_variance).

For each axis we centre every trace at its own mean (bins with NaN excluded) and take the median across traces of each trace’s variance. The inverse of that median is the axis weight, normalised to sum 1. Consistent with ArcFISH’s axis_weight routine.
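The weighting described above can be sketched in a few lines of NumPy. This is an illustration only: it assumes a (T, B, 3) coordinate array with NaN for missing spots, and axis_weight_sketch is a hypothetical name, not the uchrom implementation.

```python
import numpy as np

def axis_weight_sketch(coords):
    """coords: (T, B, 3) array with NaN for missing spots (assumed layout)."""
    # Centre each trace at its own per-axis mean, ignoring missing bins
    centred = coords - np.nanmean(coords, axis=1, keepdims=True)
    # Per-trace, per-axis variance over detected bins: (T, 3)
    trace_var = np.nanmean(centred ** 2, axis=1)
    # Median across traces, then invert and normalise to sum 1
    med = np.nanmedian(trace_var, axis=0)                  # (3,)
    w = 1.0 / med
    return w / w.sum()
```

An axis with a larger typical spread (often z in imaging data) therefore receives a smaller weight.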

uchrom.fea.contact_frequency(df: DataFrame, threshold: float, chrom=None)

Fraction of traces where a pair of bins are within threshold.

NaN distances (missing spots) are excluded from both numerator and denominator — each bin pair’s frequency is over the set of traces that have both endpoints detected.

Parameters:
  • df (DataFrame) – spots with coordinates.

  • threshold (float) – distance threshold in the same units as x/y/z.

  • chrom (optional) – chromosome filter.

Returns:

  • frequency (ndarray (n_bins, n_bins)) – values in [0, 1]; NaN where no trace had both endpoints detected.

  • bin_ids (list of (start, end))

  • n_traces (int)
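A minimal NumPy sketch of the NaN-aware contact fraction follows. It assumes a (T, B, 3) coordinate array with NaN for missing spots and treats "within threshold" as a strict less-than comparison; both are assumptions, and contact_frequency_sketch is a hypothetical name.

```python
import numpy as np

def contact_frequency_sketch(coords, threshold):
    """coords: (T, B, 3) array with NaN for missing spots (assumed layout)."""
    # Per-trace Euclidean distance between every bin pair: (T, B, B)
    delta = coords[:, :, None, :] - coords[:, None, :, :]
    dist = np.sqrt((delta ** 2).sum(axis=-1))
    valid = ~np.isnan(dist)
    n_valid = valid.sum(axis=0)                 # traces with both endpoints
    with np.errstate(invalid="ignore"):
        n_contact = (valid & (dist < threshold)).sum(axis=0)
    # Denominator is the per-pair valid count, not the total trace count
    return np.where(n_valid > 0, n_contact / np.maximum(n_valid, 1), np.nan)
```

Because the denominator varies per bin pair, a pair observed in only a handful of traces can show an extreme frequency; inspecting the valid count alongside the frequency helps when interpreting such entries.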

uchrom.fea.filter_normalize(cube: dict, k_sigma: float = 4.0, frac: float = 0.1) → dict

ArcFISH-style per-trace LOWESS filter + normalise.

Operates on the full (3, n_traces, n_bins, n_bins) pairwise-diff tensor kept on the GPU (under cube['diff']). Two passes:

  1. Per-pair raw_var = nanmedian(trace_diff²) → LOWESS over log(genomic_distance) → strata_std. Individual trace observations where |diff - median(diff)| > k_sigma × strata_std are NaN’d in place in the 4D tensor.

  2. After filtering, per-pair filtered_var = nanmean((diff - mean)²) and per-pair count = n_valid are recomputed. LOWESS again over log(genomic_distance) → expected; normalised variance = filtered_var / expected.

Output (numpy, on CPU): var, count (refreshed after filter), norm_var, expected, raw_var, genomic_distance. The original 4D tensor under "diff" is consumed (may be modified).
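The outlier-masking step of the first pass and the recomputation that follows can be sketched in NumPy as below. This is a simplified illustration: the LOWESS fit is not reproduced, so strata_std is taken as a precomputed (3, B, B) input, and the sketch returns a masked copy rather than modifying the tensor in place. The name filter_pass_sketch is hypothetical.

```python
import numpy as np

def filter_pass_sketch(diff, strata_std, k_sigma=4.0):
    """diff: (3, T, B, B) pairwise diffs; strata_std: (3, B, B), assumed given."""
    # Per-pair median across traces: (3, 1, B, B)
    med = np.nanmedian(diff, axis=1, keepdims=True)
    with np.errstate(invalid="ignore"):
        outlier = np.abs(diff - med) > k_sigma * strata_std[:, None]
    # Mask individual trace observations, not whole bin pairs
    filtered = np.where(outlier, np.nan, diff)
    # Refresh count and variance on the cleaned observations
    count = (~np.isnan(filtered)).sum(axis=1)
    mean = np.nanmean(filtered, axis=1, keepdims=True)
    var = np.nanmean((filtered - mean) ** 2, axis=1)
    return filtered, var, count
```

Given an expected array of the same (3, B, B) shape from the second smoothing pass, the normalised variance would then be var / expected.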

uchrom.fea.mean_distance_matrix(df: DataFrame, chrom=None, reduce: str = 'median')

Population-level mean/median pairwise distance matrix.

For each pair of genomic bins (i, j), the distance is computed per-trace and then reduced across traces with np.nanmedian (the Bintu 2018 convention) or np.nanmean.

Parameters:
  • df (DataFrame) – spots with coordinates.

  • chrom (optional) – chromosome filter.

  • reduce (str) – 'median' (default) or 'mean'.

Returns:

  • matrix (ndarray (n_bins, n_bins))

  • bin_ids (list of (start, end))

  • n_traces (int)
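As an illustration of the per-trace-then-reduce order, here is a short NumPy sketch. It assumes a (T, B, 3) coordinate array with NaN for missing spots; distance_matrix_sketch is a hypothetical name.

```python
import numpy as np

def distance_matrix_sketch(coords, reduce="median"):
    """coords: (T, B, 3) array with NaN for missing spots (assumed layout)."""
    # Distance is computed per trace first: (T, B, B)
    delta = coords[:, :, None, :] - coords[:, None, :, :]
    dist = np.sqrt((delta ** 2).sum(axis=-1))
    # Then reduced across traces, skipping NaN entries
    reducer = {"median": np.nanmedian, "mean": np.nanmean}[reduce]
    return reducer(dist, axis=0)
```

Reducing per-trace distances is not the same as taking the distance between mean positions; the former preserves cell-to-cell variability in the statistic.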

uchrom.fea.radius_of_gyration(df: DataFrame, chrom=None) → Series

Per-trace radius of gyration.

Rg = sqrt(mean over spots of ||r - centroid||²). Traces with fewer than 2 spots contribute NaN.
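The formula can be written out in NumPy as follows. This sketch assumes a (T, B, 3) coordinate array with NaN for missing spots and returns a plain array rather than a Series; radius_of_gyration_sketch is a hypothetical name.

```python
import numpy as np

def radius_of_gyration_sketch(coords):
    """coords: (T, B, 3) array with NaN for missing spots (assumed layout)."""
    detected = ~np.isnan(coords).any(axis=-1)              # (T, B)
    n_spots = detected.sum(axis=1)                         # (T,)
    safe_n = np.maximum(n_spots, 1)                        # avoid divide by zero
    # Centroid over detected spots only
    filled = np.where(detected[..., None], coords, 0.0)
    centroid = filled.sum(axis=1) / safe_n[:, None]        # (T, 3)
    # Squared distance of each detected spot to its trace centroid
    sq = ((coords - centroid[:, None, :]) ** 2).sum(axis=-1)
    sq = np.where(detected, sq, 0.0)
    rg = np.sqrt(sq.sum(axis=1) / safe_n)
    # Traces with fewer than 2 spots yield NaN
    return np.where(n_spots >= 2, rg, np.nan)
```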

Distance-based aggregates

Distance-based aggregate statistics over a population of traces.

Input convention: a flat DataFrame with columns chrom, start, end, x, y, z, trace_id (what ChromData.to_dataframe() produces, or what the browser’s ChromatinLayer.df stores).

The core helper _bin_coord_cube() pivots the flat table into a (n_traces, n_bins, 3) array with NaN for missing spots, which lets every aggregate statistic be computed as a straightforward NaN-aware reduction.
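The pivot can be pictured with the following pandas/NumPy sketch. It is not the library's _bin_coord_cube(): the function name, the return tuple, and the ordering of bins and traces are assumptions, and chromosome filtering is omitted.

```python
import numpy as np
import pandas as pd

def bin_coord_cube_sketch(df):
    """Pivot a flat spot table into a (n_traces, n_bins, 3) array (sketch)."""
    # Unique genomic bins, ordered by position
    bins = df[["start", "end"]].drop_duplicates().sort_values(["start", "end"])
    bin_ids = list(bins.itertuples(index=False, name=None))
    trace_ids = sorted(df["trace_id"].unique())
    bin_index = {b: i for i, b in enumerate(bin_ids)}
    trace_index = {t: i for i, t in enumerate(trace_ids)}
    # Start from all-NaN so undetected spots stay missing
    cube = np.full((len(trace_ids), len(bin_ids), 3), np.nan)
    rows = df["trace_id"].map(trace_index).to_numpy()
    cols = [bin_index[b] for b in zip(df["start"], df["end"])]
    cube[rows, cols] = df[["x", "y", "z"]].to_numpy(dtype=float)
    return cube, bin_ids, trace_ids
```

Once the data is in this dense layout, statistics such as the distance matrix or contact frequency reduce to broadcasting plus a NaN-aware reduction over the trace axis.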

uchrom.fea.distance.contact_frequency(df: DataFrame, threshold: float, chrom=None)

Fraction of traces where a pair of bins are within threshold.

NaN distances (missing spots) are excluded from both numerator and denominator — each bin pair’s frequency is over the set of traces that have both endpoints detected.

Parameters:
  • df (DataFrame) – spots with coordinates.

  • threshold (float) – distance threshold in the same units as x/y/z.

  • chrom (optional) – chromosome filter.

Returns:

  • frequency (ndarray (n_bins, n_bins)) – values in [0, 1]; NaN where no trace had both endpoints detected.

  • bin_ids (list of (start, end))

  • n_traces (int)

uchrom.fea.distance.mean_distance_matrix(df: DataFrame, chrom=None, reduce: str = 'median')

Population-level mean/median pairwise distance matrix.

For each pair of genomic bins (i, j), the distance is computed per-trace and then reduced across traces with np.nanmedian (the Bintu 2018 convention) or np.nanmean.

Parameters:
  • df (DataFrame) – spots with coordinates.

  • chrom (optional) – chromosome filter.

  • reduce (str) – 'median' (default) or 'mean'.

Returns:

  • matrix (ndarray (n_bins, n_bins))

  • bin_ids (list of (start, end))

  • n_traces (int)

uchrom.fea.distance.radius_of_gyration(df: DataFrame, chrom=None) → Series

Per-trace radius of gyration.

Rg = sqrt(mean over spots of ||r - centroid||²). Traces with fewer than 2 spots contribute NaN.

Axis-wise preprocessing

ArcFISH-style axis-wise preprocessing for chromatin tracing data.

References

Yu H. et al. Accurate and robust 3D genome feature discovery from multiplexed DNA FISH, bioRxiv 2025.11.26.690837v1.

Independent implementation in uchrom — not derived from the GPL-3.0 ArcFISH source.

Pipeline (per chromosome)

  1. axis_variance_cube: builds (3, n_bins, n_bins) per-axis pairwise variance + count cubes from ChromData spots. Each trace contributes a rank-1 outer difference for each axis; aggregation is NaN-aware.

  2. filter_normalize: two-pass LOWESS stratification on log(1D genomic distance):

    • first pass: flag entries whose per-pair squared deviation is more than k_sigma × stratified std as outliers and NaN them;

    • second pass: refit LOWESS on the cleaned variances to give each entry a genome-distance-matched expectation, then normalise.

  3. axis_weight: returns a 3-vector of weights (sum 1) inversely proportional to the per-axis trace-variance median. This is the weighting used by the ACAT combination step in the loop / tad / comp callers.

All tensor-heavy computation runs on a user-selected torch device ('auto' | 'cpu' | 'cuda' | 'mps'). LOWESS stays on CPU via statsmodels because it’s a non-vectorised kernel smoother whose input size is O(n_bins²) (typically ≤ 10 k).

uchrom.fea.arc.axis_variance_cube(cd, chrom: str, device: str = 'auto') → dict

Compute per-axis pairwise variance + sample-count cubes.

Returns a dict with var, count, mean (all (3, B, B), where B is the number of bins) plus bin_ids, n_traces, chrom, and the full (3, T, B, B) pairwise diff tensor (T traces) under key "diff". That tensor stays on the selected torch device so that the downstream filter_normalize() can consume it.

uchrom.fea.arc.axis_weight(cd, chrom: str | None = None, device: str = 'auto') → ndarray

Compute per-axis weights w ∝ 1 / median(trace_variance).

For each axis we centre every trace at its own mean (bins with NaN excluded) and take the median across traces of each trace’s variance. The inverse of that median is the axis weight, normalised to sum 1. Consistent with ArcFISH’s axis_weight routine.

uchrom.fea.arc.filter_normalize(cube: dict, k_sigma: float = 4.0, frac: float = 0.1) → dict

ArcFISH-style per-trace LOWESS filter + normalise.

Operates on the full (3, n_traces, n_bins, n_bins) pairwise-diff tensor kept on the GPU (under cube['diff']). Two passes:

  1. Per-pair raw_var = nanmedian(trace_diff²) → LOWESS over log(genomic_distance) → strata_std. Individual trace observations where |diff - median(diff)| > k_sigma × strata_std are NaN’d in place in the 4D tensor.

  2. After filtering, per-pair filtered_var = nanmean((diff - mean)²) and per-pair count = n_valid are recomputed. LOWESS again over log(genomic_distance) → expected; normalised variance = filtered_var / expected.

Output (numpy, on CPU): var, count (refreshed after filter), norm_var, expected, raw_var, genomic_distance. The original 4D tensor under "diff" is consumed (may be modified).