FastHigashi cell embedding¶
U-Chrom wraps FastHigashi (Zhang et al. 2022, Cell Systems) so
that single-cell Hi-C data can be embedded into a per-cell low-rank
representation that’s stored inside ChromData (cd.cellm['higashi'])
alongside everything else.
This notebook walks the full pipeline:
Convert per-cell sci-Hi-C contact matrices into Higashi v2 input format (
filelist.txt,label_info.pickle,config.JSON,chromsizes.tsv) usinguchrom.io.write_higashi_inputs/ helpers.Run FastHigashi end-to-end via
uchrom.emb.higashi.run— the wrapper applies an Apple-Siliconpin_memoryshim so it works on CPU-only Macs.Score the resulting embedding against the known cell-type labels using ARI / NMI.
Dataset: Kim et al. 2020 sci-Hi-C (Noble lab), H1Esc + HFF mix at
500 kb on hg19 — 1 931 cells with ground-truth labels. See
example-data/README.md for source.
Performance on a Mac mini M2 (16 GB, CPU only):
n cells |
rank |
wall (s) |
ARI |
|---|---|---|---|
40 |
16 |
8 |
0.07 |
150 |
64 |
52 |
0.42 |
300 |
64 |
136 |
0.55 |
ARI grows with cell count — Tucker decomposition needs population diversity to separate cell types. This notebook defaults to 150 cells (~1 min on a Mac mini).
0. Install FastHigashi¶
FastHigashi is an optional dependency — pip install -e .[emb] (or
pip install fast-higashi from a checkout of the upstream repo) before
running this notebook. The wrapper imports fasthigashi lazily and
raises a clear ImportError pointing at the right extra if it’s
missing.
# Sanity check the dep is available; the import is otherwise lazy.
import importlib.util
if importlib.util.find_spec("fasthigashi") is None:
raise RuntimeError(
"Install FastHigashi first: `pip install -e .[emb]` "
"(or `pip install fast-higashi`)."
)
1. Locate the demo data¶
We use Kim et al. 2020 sci-Hi-C (H1Esc + HFF mix, 500 kb, hg19), as
hosted by the Noble lab at
noble.gs.washington.edu/proj/schic-topic-model. The helper below
auto-downloads the tarball (~128 MB) and labels file (~92 KB) into
example-data/ on first run.
Format note: each cell’s .matrix file is a sparse triplet
bin1<TAB>bin2<TAB>count<TAB>weight<TAB>chrom1<TAB>chrom2, where
bin1/bin2 are global bin indices across the whole genome. We
discover each chrom’s offset by scanning the data, then convert to
Higashi v2’s (chrom1, pos1, chrom2, pos2, count) rows.
import tarfile, urllib.request
from pathlib import Path
import numpy as np
import pandas as pd
RESOLUTION = 500_000
KIM2020_BASE = (
"https://noble.gs.washington.edu/proj/schic-topic-model/data/"
)
HG19_CHROMSIZES_500K = {
"chr1": 249250621, "chr2": 243199373, "chr3": 198022430, "chr4": 191154276,
"chr5": 180915260, "chr6": 171115067, "chr7": 159138663, "chr8": 146364022,
"chr9": 141213431, "chr10": 135534747, "chr11": 135006516, "chr12": 133851895,
"chr13": 115169878, "chr14": 107349540, "chr15": 102531392, "chr16": 90354753,
"chr17": 81195210, "chr18": 78077248, "chr19": 59128983, "chr20": 63025520,
"chr21": 48129895, "chr22": 51304566, "chrX": 155270560,
}
def _repo_root() -> Path:
for p in [Path.cwd(), *Path.cwd().parents]:
if (p / "pyproject.toml").exists():
return p
return Path.cwd()
def find_kim2020():
root = _repo_root()
data = root / "example-data"
data.mkdir(parents=True, exist_ok=True)
tar_path = data / "H1Esc-HFF.R1.tar.gz"
matrices_dir = data / "H1Esc-HFF.R1"
labels_path = data / "H1Esc-HFF.R1.labeled"
if not labels_path.exists():
url = KIM2020_BASE + "matrix_labels/H1Esc-HFF.R1.labeled"
print(f"[data] fetching cell-type labels → {labels_path}")
urllib.request.urlretrieve(url, labels_path)
if not matrices_dir.exists():
if not tar_path.exists():
url = KIM2020_BASE + "matrix_files/H1Esc-HFF.R1.tar.gz"
print(f"[data] fetching sci-Hi-C tarball (~128 MB) → {tar_path}")
urllib.request.urlretrieve(url, tar_path)
print(f"[data] extracting {tar_path.name} → {matrices_dir}")
with tarfile.open(tar_path, "r:gz") as tf:
tf.extractall(data)
return matrices_dir, labels_path
matrices_dir, labels_path = find_kim2020()
print(f"matrices: {matrices_dir} ({sum(1 for _ in matrices_dir.glob('*.matrix'))} cells)")
2. Pick a balanced subset¶
Take 75 H1Esc + 75 HFF cells. Larger subsets give better ARI but cost more wall-clock time — see the table at the top of the notebook.
N_PER_TYPE = 75 # 150 cells total — ~1 min on Mac mini CPU
label_df = pd.read_csv(
labels_path, sep="\t", header=None, names=["matrix", "cell_type"]
)
h1 = label_df[label_df.cell_type == "H1Esc"].head(N_PER_TYPE)
hff = label_df[label_df.cell_type == "HFF"].head(N_PER_TYPE)
sub = pd.concat([h1, hff]).reset_index(drop=True)
sub["cell_name"] = sub["matrix"].str.replace(".matrix", "", regex=False)
print(f"subset: {len(sub)} cells")
sub["cell_type"].value_counts()
3. Discover bin offsets, convert to Higashi v2¶
The matrix files use global bin indices, but Higashi v2 wants
(chrom, pos) per contact. We compute each chrom’s offset by taking
the minimum global bin observed for that chrom across a sample of
cells, then translate to local genomic coordinates.
def discover_bin_offsets(matrices_dir, sample=200):
files = sorted(matrices_dir.glob("*.matrix"))[:sample]
per_chrom_min = {}
for f in files:
df = pd.read_csv(
f, sep="\t", header=None,
names=["bin1", "bin2", "count", "weight", "chrom1", "chrom2"],
)
for col_bin, col_chrom in [("bin1", "chrom1"), ("bin2", "chrom2")]:
for chrom, mn in df.groupby(col_chrom)[col_bin].min().items():
if chrom not in per_chrom_min or mn < per_chrom_min[chrom]:
per_chrom_min[chrom] = int(mn)
return per_chrom_min
def matrix_to_higashi(matrix_path, bin_offsets):
df = pd.read_csv(
matrix_path, sep="\t", header=None,
names=["bin1", "bin2", "count", "weight", "chrom1_full", "chrom2_full"],
)
df["chrom1"] = df["chrom1_full"].str.replace("human_", "", regex=False)
df["chrom2"] = df["chrom2_full"].str.replace("human_", "", regex=False)
df["pos1"] = (df["bin1"] - df["chrom1_full"].map(bin_offsets)) * RESOLUTION
df["pos2"] = (df["bin2"] - df["chrom2_full"].map(bin_offsets)) * RESOLUTION
out = df[["chrom1", "pos1", "chrom2", "pos2", "count"]].dropna()
out = out[(out["pos1"] >= 0) & (out["pos2"] >= 0)]
return out.astype({"pos1": int, "pos2": int, "count": int})
bin_offsets = discover_bin_offsets(matrices_dir)
print("first 5 chrom offsets:", sorted(bin_offsets.items(), key=lambda x: x[1])[:5])
4. Materialise the Higashi v2 input directory¶
uchrom.io.contacts exposes three helpers used here:
write_label_info(cells_df, path)— dump per-cell metadata asdict[col] -> ndarrayto a pickle.make_higashi_config(...)— produce theconfig.JSONdict with all fields FastHigashi expects (resolution, chrom_list, header_included, contact_header, resolution_fh, …)._materialise_chromsizes(...)— writechromsizes.tsvfrom a{chrom: size}dict (FastHigashi needsgenome_reference_path).
write_higashi_inputs orchestrates these for the common .pairs/.cool
case. Here we open the sparse .matrix files directly, so we call the
helpers ourselves.
import json
import tempfile
from uchrom.io.contacts import (
write_label_info, make_higashi_config, _materialise_chromsizes,
)
td = Path(tempfile.mkdtemp(prefix="higashi_tut_"))
higashi_dir = td / "higashi"
data_dir = higashi_dir / "data"
data_dir.mkdir(parents=True)
chrom_list = list(HG19_CHROMSIZES_500K.keys())
filelist_lines = []
for _, row in sub.iterrows():
df = matrix_to_higashi(matrices_dir / row.matrix, bin_offsets)
df = df[df["chrom1"].isin(chrom_list) & df["chrom2"].isin(chrom_list)]
out = data_dir / f"{row.cell_name}.txt"
df.to_csv(out, sep="\t", index=False, header=False)
filelist_lines.append(str(out.resolve()))
(higashi_dir / "filelist.txt").write_text("\n".join(filelist_lines) + "\n")
write_label_info(sub[["cell_name", "cell_type"]],
higashi_dir / "label_info.pickle")
chromsizes_path = _materialise_chromsizes(HG19_CHROMSIZES_500K, higashi_dir)
cfg = make_higashi_config(
data_dir=str(higashi_dir),
genome_reference="hg19",
resolution=RESOLUTION,
chrom_list=chrom_list,
overrides={"genome_reference_path": str(chromsizes_path)},
)
(higashi_dir / "config.JSON").write_text(json.dumps(cfg, indent=2))
print("Higashi input directory:", higashi_dir)
print("files:", sorted(p.name for p in higashi_dir.iterdir()))
5. Build a stub ChromData¶
The embedding goes into cd.cellm['higashi'], so we need a
ChromData with a cells frame whose cell_name matches the names
in label_info.pickle. Spots/coords here are placeholders — in a real
analysis you’d already have a populated ChromData from
reconstruction or tracing.
from uchrom.core import ChromData
cd = ChromData(
coords=np.zeros((len(sub), 3)),
spots=pd.DataFrame({
"chrom": ["chr1"] * len(sub),
"start": np.arange(len(sub)) * RESOLUTION,
"end": np.arange(len(sub)) * RESOLUTION + RESOLUTION,
"trace_id": [f"t{i}" for i in range(len(sub))],
"cell_id": list(sub.cell_name),
}),
cells=sub[["cell_name", "cell_type"]].reset_index(drop=True),
)
cd
6. Run FastHigashi¶
uchrom.emb.higashi.run drives the four-stage FastHigashi pipeline
(fast_process_data → prep_dataset → run_model → fetch_cell_embedding)
and writes the L2-normalised cell embedding into
cd.cellm['higashi'], aligned to cd.cells row order. Provenance
(rank, config path, output dir) is recorded in cd.uns['higashi'].
The wrapper_kwargs enable the recommended preprocessing for noisy
scHi-C: graph convolution, random-walk-with-restart, and column
normalization.
import time
from uchrom.emb.higashi import run as higashi_run
RANK = 64
t0 = time.time()
cd = higashi_run(
cd,
contacts_dir=higashi_dir,
rank=RANK,
cache_dir=td / "cache",
result_dir=td / "result",
wrapper_kwargs={"do_conv": True, "do_rwr": True, "do_col": True},
)
print(f"wall clock: {time.time()-t0:.1f}s")
print(f"embedding shape: {cd.cellm['higashi'].shape}")
print(f"provenance: {dict(cd.uns['higashi'])}")
7. Score the embedding against ground truth¶
Cluster the embedding with k-means (k = 2 = number of known cell types) and compare to the labels using:
ARI (Adjusted Rand Index): random = 0, perfect = 1.
NMI (Normalised Mutual Information): random ≈ 0, perfect = 1.
For 150 cells at rank 64 with conv+rwr+col preprocessing we expect ARI ≈ 0.4. Bumping to 300 cells gets ARI ≈ 0.55. GPU + full PFC dataset in the original paper reaches ARI > 0.7.
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
emb = cd.cellm["higashi"]
true_labels = pd.Categorical(cd.cells["cell_type"]).codes
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
ari = adjusted_rand_score(true_labels, pred)
nmi = normalized_mutual_info_score(true_labels, pred)
print(f"ARI = {ari:.3f} (random ≈ 0, perfect = 1)")
print(f"NMI = {nmi:.3f}")
8. Visualise with UMAP¶
A quick 2-D UMAP of the embedding shows whether H1Esc and HFF separate
visually. (UMAP is optional — pip install umap-learn.)
try:
from umap import UMAP
import matplotlib.pyplot as plt
coords = UMAP(n_components=2, n_neighbors=15, random_state=0).fit_transform(emb)
fig, ax = plt.subplots(figsize=(6, 5))
for ct, color in [("H1Esc", "tab:red"), ("HFF", "tab:blue")]:
mask = cd.cells["cell_type"].values == ct
ax.scatter(coords[mask, 0], coords[mask, 1], s=14, alpha=0.7,
label=ct, color=color)
ax.set_xlabel("UMAP 1"); ax.set_ylabel("UMAP 2")
ax.set_title(f"FastHigashi cell embedding (ARI={ari:.2f})")
ax.legend()
plt.tight_layout()
except ImportError:
print("install umap-learn for the UMAP plot: pip install umap-learn")
Next steps¶
Use the embedding for cell-type / trajectory analysis — store it in
cd.cellm[...]and persist withcd.write('with_higashi.h5cd').Imputed contact maps are written under
cd.uns['higashi']['result_dir']— not loaded intoChromData(per the “nospotp” design — O(n²) per cell is prohibitive).Larger / more cell types: rerun with
N_PER_TYPE=150(300 cells, ~2 min on Mac mini, ARI ≈ 0.55) or move to a GPU box for the full 1 931-cell dataset.