Example data
Catalog of the datasets used by the tutorials (under tutorials/) and
benchmarks (under benchmarks/). Small files (< 10 MB) are checked
into the repo; larger files are listed here with their source URL and
auto-downloaded by the tutorials on first run into this directory.
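Contents at a glance
| File / target location | Size | Shipped in repo | Auto-download | Used by |
|---|---|---|---|---|
| cell1.pairs | 4.7 MB | yes | n/a | tutorials/reconstruction.ipynb |
| cell2.pairs.gz | 5.4 MB | yes | n/a | (available for experimentation; bulkier variant of cell1.pairs) |
| IMR90_chr21_30kb.cool | 280 KB | yes | n/a | gem_fish_reconstruction.ipynb |
| takei2025_cerebellum_fixture/ | ~750 KB | yes | n/a | seqfish_multiomics_cerebellum.ipynb |
| Takei 2025 full Zenodo dump | ~8 GB | no | manual | |
| 4DNFIHF3JCBY.csv | 22 MB | no | 4DN public S3 | loop_calling.ipynb, tad_calling.ipynb, compartment.ipynb, fishnet_domains.ipynb |
| DNAseqFISH+.zip | 144 MB | no | Zenodo 3735329 | jie_aligner.ipynb |
| IMR90_chr21-18-20Mb.csv | 2 MB | no | GitHub raw | gem_fish_reconstruction.ipynb |
| CSVs extracted from DNAseqFISH+.zip | ~35 MB each | no | from the zip | jie_aligner.ipynb |
| H1Esc-HFF.R1.tar.gz | 128 MB | no | UW Noble lab | higashi_embedding.ipynb |
| H1Esc-HFF.R1.labeled | 92 KB | no | UW Noble lab | higashi_embedding.ipynb |
| with_loops.h5cd | 24 MB | no | generated | output of loop_calling.ipynb |
| fofct_core.csv | ~6 MB | no | generated | synthetic offline fallback for tutorials 4/5/6/7 |
| GSE63525_GM12878_insitu_primary+replicate_combined_30.hic | ~40 GB | no | manual (too large) | benchmark.ipynb |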
*.csv, *.h5cd, and the DNAseqFISH+ zip/folder are ignored by git;
they’re either generated or downloaded on demand.
Small, in-repo datasets
cell1.pairs: single-cell Hi-C read pairs (Stevens 2017)
Format: plain-text .pairs (no header) with 7 tab-separated columns: read_id, chrom1, pos1, chrom2, pos2, strand1, strand2.
Content: 105,700 paired-end reads from one mESC G1 cell.
Source: Stevens et al. 2017, Nature 544:59–64, “3D structures of individual mammalian genomes studied by single-cell Hi-C”. GEO accession GSE80006. Distributed with the Nuc Dynamics software (github.com/tjs23/nuc_dynamics).
Used by: tutorials/reconstruction.ipynb (Nuc Dynamics worked example); mentioned in the root README.md CLI demo.
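The headerless 7-column layout described above can be read with the standard library alone. The following is a minimal sketch, not the loader the tutorial uses; the function names are hypothetical, and the demo rows are made up purely to show the shape of the data.

```python
import csv
import io
from collections import Counter

# Column order as listed in the Format note for cell1.pairs.
PAIRS_COLUMNS = ("read_id", "chrom1", "pos1", "chrom2", "pos2", "strand1", "strand2")


def read_headerless_pairs(handle):
    """Yield one dict per read pair from a headerless, tab-separated .pairs stream."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # tolerate blank lines
        if len(row) != len(PAIRS_COLUMNS):
            raise ValueError(f"expected 7 columns, got {len(row)}: {row!r}")
        record = dict(zip(PAIRS_COLUMNS, row))
        record["pos1"] = int(record["pos1"])
        record["pos2"] = int(record["pos2"])
        yield record


def count_cis_trans(records):
    """Count pairs whose two ends share a chromosome (cis) or not (trans)."""
    counts = Counter()
    for rec in records:
        counts["cis" if rec["chrom1"] == rec["chrom2"] else "trans"] += 1
    return counts


# Made-up rows in the same layout, so the sketch runs without the real file.
demo = io.StringIO(
    "r1\tchr1\t100\tchr1\t5000\t+\t-\n"
    "r2\tchr1\t200\tchr2\t300\t+\t+\n"
    "r3\tchrX\t10\tchrX\t90\t-\t+\n"
    "r4\tchr3\t70\tchr3\t80\t-\t-\n"
)
counts = count_cis_trans(read_headerless_pairs(demo))
print(dict(counts))
```

To run it on the shipped file, open example-data/cell1.pairs in text mode and pass the handle in place of the demo buffer.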
cell2.pairs.gz: single-cell Hi-C (v1.0 pairs format, gzipped)
Format: gzipped pairs v1.0 (with ## header), 503,720 reads.
Source: same Stevens 2017 dataset, a different cell.
Used by: not wired into any tutorial yet; kept for users who want to experiment with a larger, header-carrying pairs file.
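For anyone experimenting with the header-carrying variant, a quick way to inspect it is to stream the gzip and separate header lines from records. This sketch assumes the header follows the 4DN pairs v1.0 convention (lines starting with #, including an optional "#columns:" line); whether cell2.pairs.gz carries a "#columns:" line has not been checked here, and scan_pairs_gz is a hypothetical helper name.

```python
import gzip
import os
import tempfile


def scan_pairs_gz(path):
    """Return (header_lines, column_names, n_records) for a gzipped pairs file."""
    header, columns, n_records = [], None, 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                header.append(line.rstrip("\n"))
                if line.startswith("#columns:"):
                    # Assumed pairs v1.0 convention: space-separated column names.
                    columns = line[len("#columns:"):].split()
            elif line.strip():
                n_records += 1
    return header, columns, n_records


# Tiny made-up file so the sketch is runnable without the real data.
demo_text = (
    "## pairs format v1.0\n"
    "#columns: readID chr1 pos1 chr2 pos2 strand1 strand2\n"
    "r1\tchr1\t100\tchr1\t5000\t+\t-\n"
    "r2\tchr1\t200\tchr2\t300\t+\t+\n"
)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pairs.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write(demo_text)
    header, columns, n_records = scan_pairs_gz(demo_path)

print(header[0], columns, n_records)
```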
takei2025_cerebellum_fixture/: Takei 2025 cerebellum DNA seqFISH+ slice (~750 KB)
Format: three CSVs that together exercise the read_seqfish_multiomics loader end-to-end, plus two helper scripts:
dna_spots.csv: 1 070 rows from cerebellum_rep1_pos0.csv (rep 1, FOV 0, 6 cells, chr19 only), all 76 source columns preserved verbatim (59 z-scores + 3 DBSCAN allele variants + μm coordinates + dot_int / n_rad_score / n_per_dist(um)).
locus_annotation.csv: the 838 chr19 rows from LC1-100k-09022022-mm10-25kb-meta.csv, trimmed to name/chrom/start/end.
clustering.csv: the 6 matching rows from 100k-002-001-cerebellum_mRNA_cluster_nuc_vol_filtered.csv.
derive_fixture.py: the slicing script, kept for reproducibility. Not executed in CI.
verify.py: five-layer correctness check (HDF5 layout + ChromData consistency + value-level row reconciliation against the source CSVs + semantic checks + .h5cd round-trip). Run python example-data/takei2025_cerebellum_fixture/verify.py to re-validate; pass --skip-value-recon plus --spot-glob / --locus / --clustering paths to validate the full Zenodo data instead.
Cell-type coverage: the 6 cells span leiden clusters {0, 2, 3, 4, 6, 7} → cell types Granule, Bergmann, Other, MLI1, Purkinje, MLI2+PLI (one cell per type).
Source: Takei et al. 2025, Nature “Spatial multi-omics reveals cell-type-specific nuclear compartments” (doi:10.1038/s41586-025-08838-x). Raw data: Zenodo record 7693825. Locus annotation + clustering CSVs: CaiGroup/dna-seqfish-plus-multi-omics GitHub repo.
Used by: seqfish_multiomics_cerebellum.ipynb (loader walk-through, .h5cd round-trip).
Real-data run: the full distribution (rep 1 + rep 2 tarballs, ~8 GB compressed; tens of GB uncompressed; tens of millions of spots) is not auto-fetched. Manual download:
    wget https://zenodo.org/records/7693825/files/cerebellum_rep1.tar.gz
    wget https://zenodo.org/records/7693825/files/cerebellum_rep2.tar.gz
    git clone https://github.com/CaiGroup/dna-seqfish-plus-multi-omics.git
Then point the loader at the extracted CSV(s):
    cd = ChromData.from_seqfish_multiomics(
        spot_glob='cerebellum_rep*/cerebellum_rep*_pos*.csv',
        locus_annotation='dna-seqfish-plus-multi-omics/data/annotation/'
                         'LC1-100k-09022022-mm10-25kb-meta.csv',
        cell_clustering='dna-seqfish-plus-multi-omics/data/cerebellum/'
                        'clustering/100k-002-001-cerebellum_mRNA_cluster_nuc_vol_filtered.csv',
    )
IMR90_chr21_30kb.cool: Rao 2014 IMR90 chr21 at 30 kb (0.3 MB)
Format: single-resolution cooler. 1 557 bins × 30 kb covering chr21 (hg38).
Content: IMR90 in situ Hi-C pair counts from Rao et al. 2014, aggregated from the native 5 kb to 30 kb to match the Bintu 2018 chromatin-tracing resolution.
Source: 4DN accession 4DNFI4QQPDMR (the full 810 MB IMR90 .mcool; we extract chr21 at 30 kb into this small .cool for redistribution).
How it was derived:
    from cooler import Cooler
    import cooler
    import numpy as np
    import pandas as pd

    # Raw (unbalanced) 5 kb contact matrix for chr21
    c5 = Cooler('4DNFI4QQPDMR.mcool::/resolutions/5000')
    m = c5.matrix(balance=False, as_pixels=False).fetch('chr21')

    # aggregate 5 kb × 6 → 30 kb by summation
    f = 6
    n5 = m.shape[0]
    n30 = (n5 + f - 1) // f
    agg = np.zeros((n30, n30))
    for i in range(n30):
        for j in range(n30):
            agg[i, j] = m[i*f:(i+1)*f, j*f:(j+1)*f].sum()

    # 30 kb bin table; the last bin is clipped to the chromosome length
    bins = pd.DataFrame({
        'chrom': ['chr21'] * n30,
        'start': np.arange(n30) * 30_000,
        'end': np.minimum((np.arange(n30) + 1) * 30_000, c5.chromsizes['chr21']),
    })

    # Upper-triangle pixels only, dropping empty ones
    iu = np.triu_indices(n30, k=0)
    pixels = pd.DataFrame({'bin1_id': iu[0], 'bin2_id': iu[1],
                           'count': agg[iu].astype(np.int64)})
    pixels = pixels[pixels['count'] > 0]
    cooler.create_cooler('IMR90_chr21_30kb.cool', bins, pixels, assembly='hg38')

    # Balance with ICE so reconstruct_gem_fish can pull balanced counts:
    #   python -m cooler balance IMR90_chr21_30kb.cool
Used by: gem_fish_reconstruction.ipynb, paired with the Bintu 2018 FISH CSV for the paper-faithful Part 2 run.
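The 5 kb to 30 kb aggregation in the derivation above uses a Python double loop. As an optional alternative, the same block sums can be computed with a zero-pad and a reshape. This is a sketch of that equivalent formulation, not the code that produced the shipped file; block_sum and block_sum_loops are hypothetical names, and the random matrix stands in for the real chr21 matrix.

```python
import numpy as np


def block_sum(matrix, factor):
    """Sum a square matrix over factor x factor blocks, zero-padding the ragged edge."""
    n = matrix.shape[0]
    n_out = (n + factor - 1) // factor
    padded = np.zeros((n_out * factor, n_out * factor), dtype=matrix.dtype)
    padded[:n, :n] = matrix
    # Axes after reshape: (block row, row within block, block col, col within block).
    return padded.reshape(n_out, factor, n_out, factor).sum(axis=(1, 3))


def block_sum_loops(matrix, factor):
    """Reference implementation mirroring the double loop in the derivation."""
    n = matrix.shape[0]
    n_out = (n + factor - 1) // factor
    out = np.zeros((n_out, n_out), dtype=matrix.dtype)
    for i in range(n_out):
        for j in range(n_out):
            out[i, j] = matrix[i * factor:(i + 1) * factor,
                               j * factor:(j + 1) * factor].sum()
    return out


# Small symmetric stand-in whose size (14) is not a multiple of the factor (6).
rng = np.random.default_rng(0)
half = rng.integers(0, 5, size=(14, 14))
m = half + half.T
fast = block_sum(m, 6)
slow = block_sum_loops(m, 6)
print(fast.shape, bool(np.array_equal(fast, slow)))
```

Both versions preserve the total count, since every source bin lands in exactly one 30 kb block.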
Large datasets: auto-downloaded on first run
The tutorials locate these files via a find_*() helper that:
Returns any cached copy under example-data/ or ~/Downloads/….
Otherwise downloads into example-data/ and caches for the next run.
Falls back to a small synthetic dataset if the network is unreachable (only applies to the 4DN CSV path; the Zenodo zip has no synthetic fallback).
Every tutorial’s first data-loading cell prints the location it ends up using.
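The lookup order described above (cached copy, then download, then synthetic fallback) can be sketched as follows. This is an illustration of the pattern only, not the repository's find_*() implementation: find_dataset, its parameters, and the injected download / make_fallback callables are all hypothetical, chosen so the sketch runs without network access.

```python
import tempfile
from pathlib import Path


def find_dataset(filename, search_dirs, download, make_fallback=None):
    """Return (path, origin), where origin is 'cache', 'download' or 'synthetic'."""
    # 1. Reuse any copy that already exists in one of the search directories.
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate, "cache"

    # 2. Otherwise download into the first directory, under a temporary name so
    #    an interrupted transfer is never mistaken for a cached file.
    target_dir = Path(search_dirs[0])
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    partial = target_dir / (filename + ".part")
    try:
        download(partial)
        partial.replace(target)
        return target, "download"
    except OSError:
        # 3. Network unreachable: use a synthetic stand-in if one is offered.
        partial.unlink(missing_ok=True)
        if make_fallback is None:
            raise
        return make_fallback(target_dir), "synthetic"


def offline_download(path):
    raise OSError("network unreachable")  # simulates a failed fetch


def fake_download(path):
    path.write_text("Spot_ID,Trace_ID\n")  # simulates a successful fetch


def synthetic_fofct(directory):
    path = directory / "fofct_core.csv"
    path.write_text("synthetic\n")
    return path


with tempfile.TemporaryDirectory() as tmp:
    dirs = [Path(tmp) / "example-data"]
    p1, origin_offline = find_dataset("real.csv", dirs, offline_download, synthetic_fofct)
    fallback_name = p1.name
    _, origin_first = find_dataset("real.csv", dirs, fake_download)
    _, origin_second = find_dataset("real.csv", dirs, offline_download)

print(origin_offline, origin_first, origin_second)
```

Passing make_fallback=None models the Zenodo zip case, where a failed download simply raises.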
IMR90_chr21_pyhim.ecsv: Bintu 2018 IMR90 chr21 in PyHiM ECSV format (5.7 MB)
Format: PyHiM chromatin trace table (Astropy ECSV). Columns: Spot_ID, Trace_ID, x, y, z, Chrom, Chrom_Start, Chrom_End, ROI #, Mask_id, Barcode #, label. meta['comments'] carries xyz_unit=micron, genome_assembly=hg38.
Content: hg38 chr21:18.6–20.6 Mb, 1,277 traces × 66 loci × 30 kb spacing. IMR90 fibroblasts.
Source: Bintu et al. 2018, Science 362:eaau1783, “Super-resolution chromatin tracing reveals domains and cooperative interactions in single cells”. Original CSV at mendeley.com/datasets/3jkp7zhwbr/1.
How it was derived: The original Bintu CSV has columns Chromosome index, Segment index, Z, X, Y (nm). The conversion script bintu_to_pyhim_ecsv.py maps these to the PyHiM schema:
Chromosome index → Trace_ID (1..1277)
Segment index → Barcode # (1..66)
X, Y, Z (nm) → x, y, z (microns)
Chrom = chr21, Chrom_Start/End derived from segment index × 30 kb
Mask_id = Chromosome index (each trace = one “cell”)
ROI # = 0, label = "None"
Used by: import_pyhim_ecsv.ipynb, which demonstrates ChromData.from_pyhim_trace() on real chromatin tracing data.
Generated on demand: If missing, the tutorial runs bintu_to_pyhim_ecsv.py to convert IMR90_chr21-18-20Mb.csv (which is auto-downloaded if needed).
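The column mapping listed under "How it was derived" can be illustrated on a single row. This is a sketch of the mapping only, not the contents of bintu_to_pyhim_ecsv.py. Two things here are assumptions: the region start used to turn a segment index into Chrom_Start (taken from the chr21 start coordinate quoted for the Bintu CSV), and the use of a running counter for Spot_ID.

```python
BIN_SIZE_BP = 30_000
# Assumption: first segment starts at the chr21 coordinate quoted for the source CSV.
REGION_START_BP = 18_627_714


def bintu_row_to_pyhim(row, spot_id):
    """Map one Bintu CSV row (coordinates in nm) to a PyHiM-style record (microns)."""
    trace = int(row["Chromosome index"])
    segment = int(row["Segment index"])
    start = REGION_START_BP + (segment - 1) * BIN_SIZE_BP  # assumed offset rule
    return {
        "Spot_ID": spot_id,               # assumption: running counter
        "Trace_ID": trace,
        "x": float(row["X"]) / 1000.0,    # nm -> microns
        "y": float(row["Y"]) / 1000.0,
        "z": float(row["Z"]) / 1000.0,
        "Chrom": "chr21",
        "Chrom_Start": start,
        "Chrom_End": start + BIN_SIZE_BP,
        "ROI #": 0,
        "Mask_id": trace,                 # each trace is treated as one "cell"
        "Barcode #": segment,
        "label": "None",
    }


# Made-up row in the source column layout.
demo_row = {"Chromosome index": "1", "Segment index": "2",
            "Z": "3000", "X": "1500", "Y": "250"}
record = bintu_row_to_pyhim(demo_row, spot_id=1)
print(record["x"], record["y"], record["z"], record["Chrom_Start"])
```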
4DNFIHF3JCBY.csv: Takei 2021 mESC FOF-CT chromatin tracing (22 MB)
Format: 4DN FISH Omics Format - Chromatin Tracing (FOF-CT) core table. Header lines (##…) describe the experiment; the data table has columns: Spot_ID, Trace_ID, X, Y, Z, Chrom, Chrom_Start, Chrom_End, Cell_ID, ….
Content: mm10, 20 chromosomes × 60 bins × 25 kb, ~400 traces per chromosome across 201 E14 mESC cells.
Source: Takei et al. 2021, Nature 590:344–350, “Integrated spatial genomics reveals global architecture of single nuclei”. 4DN portal: data.4dnucleome.org/4DNFIHF3JCBY.
Download URL (public S3, no credentials needed):
https://4dn-open-data-public.s3.amazonaws.com/fourfront-webprod/wfoutput/e699334e-fb34-4a0e-8ef6-670b2099831a/4DNFIHF3JCBY.csv
Used by (all use the find_fofct() helper): loop_calling.ipynb, tad_calling.ipynb, compartment.ipynb, fishnet_domains.ipynb.
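A FOF-CT core table can be split into header metadata and data rows with a few lines of standard-library code. The sketch below assumes the header names its columns in a line of the form "##columns=(A, B, ...)" and stores other metadata as "##key=value"; those conventions and the demo header keys are assumptions to check against the real file, and read_fofct is a hypothetical name rather than the project's loader.

```python
import csv
import io
import re
from collections import Counter


def read_fofct(handle):
    """Return (metadata dict, list of row dicts) from a FOF-CT style CSV stream."""
    metadata, columns, data_lines = {}, None, []
    for line in handle:
        if line.startswith("#"):
            body = line.lstrip("#").strip()
            match = re.match(r"columns\s*=\s*\((.*)\)$", body)
            if match:
                columns = [name.strip() for name in match.group(1).split(",")]
            elif "=" in body:
                key, value = body.split("=", 1)
                metadata[key.strip()] = value.strip()
        elif line.strip():
            data_lines.append(line)
    if columns is None:
        raise ValueError("no '##columns=(...)' header line found")
    rows = [dict(zip(columns, fields)) for fields in csv.reader(data_lines)]
    return metadata, rows


# Made-up miniature table; header keys are illustrative only.
demo = io.StringIO(
    "##FOF-CT_version=v0.1\n"
    "##genome_assembly=mm10\n"
    "##columns=(Spot_ID, Trace_ID, X, Y, Z, Chrom, Chrom_Start, Chrom_End, Cell_ID)\n"
    "1,1,0.1,0.2,0.3,chr1,3000000,3025000,7\n"
    "2,1,0.4,0.5,0.6,chr1,3025000,3050000,7\n"
    "3,2,0.7,0.8,0.9,chr2,3000000,3025000,7\n"
)
metadata, rows = read_fofct(demo)
spots_per_trace = Counter(row["Trace_ID"] for row in rows)
print(metadata["genome_assembly"], len(rows), dict(spots_per_trace))
```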
IMR90_chr21-18-20Mb.csv: Bintu 2018 IMR90 chr21:18.6–20.6 Mb chromatin tracing (2 MB)
Format: CSV with a header line, then columns Chromosome index, Segment index, Z, X, Y. One row per detected segment per imaged chromosome. Coordinates in nanometres. Segment spacing is 30 kb.
Content: 1 278 imaged chromosomes × 66 segments in IMR90 cells, covering chr21:18,627,714–20,577,518 (hg38).
Source: Bintu et al. 2018, Science 362, eaau1783, “Super-resolution chromatin tracing reveals domains and cooperative interactions in single cells”. The paper is a higher-resolution follow-up to Wang et al. 2016 (Science 353:598), which Abbas et al. 2019 (GEM-FISH) originally used; the Bintu 2018 data is directly accessible from the authors’ GitHub repository.
Download URL:
https://raw.githubusercontent.com/BogdanBintu/ChromatinImaging/master/Data/IMR90_chr21-18-20Mb.csv
Used by: gem_fish_reconstruction.ipynb (part 2, real FISH data). The tutorial synthesises a matching Hi-C from the FISH population-mean so it runs offline; to reproduce the paper’s exact setup, substitute the Rao 2014 IMR90 .mcool from 4DN at 30 kb resolution.
GSE63525_GM12878_insitu_primary+replicate_combined_30.hic: GM12878 in-situ combined Hi-C (~40 GB)
Format: Juicer .hic, a multi-resolution contact matrix file (read with hicstraw).
Content: GM12878 lymphoblastoid cells, in-situ Hi-C, primary + replicate merged and MAPQ ≥ 30 filtered. Contains every standard Juicer resolution from 1 kb to 2.5 Mb, all chromosomes.
Source: Rao et al. 2014, Cell 159:1665–1680, “A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping”.
Download: not auto-fetched (40 GB is too large to ship through download_data.py). Grab it manually from one of:
GEO GSE63525 (look for GSE63525_GM12878_insitu_primary+replicate_combined_30.hic)
Used by: benchmark.ipynb (MDS reconstruction benchmark). The tutorial checks example-data/ first and falls back to the UCHROM_GM12878_HIC env var, so you can keep the file on an external drive: export UCHROM_GM12878_HIC=/path/to/…_combined_30.hic.
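The lookup order described above (example-data/ first, then the UCHROM_GM12878_HIC environment variable) might look like the following. This is a sketch under that description, not the notebook's code; resolve_gm12878_hic is a hypothetical name, and the environ parameter exists only so the demo can run without touching the real environment.

```python
import os
import tempfile
from pathlib import Path

HIC_NAME = "GSE63525_GM12878_insitu_primary+replicate_combined_30.hic"


def resolve_gm12878_hic(data_dir="example-data", environ=None):
    """Return the .hic path from data_dir, else from UCHROM_GM12878_HIC, else None."""
    environ = os.environ if environ is None else environ
    local = Path(data_dir) / HIC_NAME
    if local.is_file():
        return local
    override = environ.get("UCHROM_GM12878_HIC")
    if override and Path(override).is_file():
        return Path(override)
    return None


# Demo with empty placeholder files standing in for the 40 GB original.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "example-data"
    data_dir.mkdir()
    external = Path(tmp) / "external" / HIC_NAME
    external.parent.mkdir()
    external.write_bytes(b"")
    env = {"UCHROM_GM12878_HIC": str(external)}

    found_nothing = resolve_gm12878_hic(data_dir, environ={}) is None
    found_via_env = resolve_gm12878_hic(data_dir, environ=env) == external
    (data_dir / HIC_NAME).write_bytes(b"")
    found_local_first = resolve_gm12878_hic(data_dir, environ=env) == data_dir / HIC_NAME

print(found_nothing, found_via_env, found_local_first)
```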
DNAseqFISH+.zip: Takei 2021 raw seqFISH+ spots (144 MB)
Format: zip containing 8 CSVs: 4 replicates at 1-Mb resolution and 4 at 25-kb resolution. Columns: fov, channel, cellID, regionID (hyb1-60), x, y, z, dot_intensity, chr{N}_intensity × 20, chromID, labelID.
Content: same experiment as the FOF-CT above, but before trace assignment: every row is a detected fluorescent spot with a decoded chromosome ID but ambiguous fiber assignment (median 6 candidate spots per (cell, chromID, region)). labelID ≥ 0 marks the upstream pipeline’s trace choice (useful as ground truth when benchmarking aligners).
Source: Takei et al. 2021, Zenodo record 3735329, doi:10.5281/zenodo.3735329.
Download URL:
https://zenodo.org/records/3735329/files/DNAseqFISH%2B.zip?download=1
Used by: jie_aligner.ipynb (spot-to-fiber tracing). The tutorial opens the CSV directly from the zip without extracting.
Coordinate conventions (applied by the tutorial): x, y in pixels × 103 nm/pixel; z in pixels × 250 nm/pixel.
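Reading a CSV straight out of the zip and applying the pixel-to-nanometre factors above can be sketched with the standard library. The member name and the demo rows are made up, iter_spots_nm is a hypothetical helper, and only the x, y, z columns and the two scale factors come from the description above.

```python
import csv
import io
import zipfile

XY_NM_PER_PIXEL = 103.0
Z_NM_PER_PIXEL = 250.0


def iter_spots_nm(zip_source, member):
    """Yield CSV rows from a zip member, adding x_nm / y_nm / z_nm columns."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as raw:
            # Wrap the binary member stream so csv can read text from it.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.DictReader(text):
                yield {
                    **row,
                    "x_nm": float(row["x"]) * XY_NM_PER_PIXEL,
                    "y_nm": float(row["y"]) * XY_NM_PER_PIXEL,
                    "z_nm": float(row["z"]) * Z_NM_PER_PIXEL,
                }


# Build a tiny in-memory zip so the sketch runs without the 144 MB download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo_spots.csv", "fov,cellID,x,y,z\n0,1,10,20,2\n0,1,11.5,21,3\n")
buffer.seek(0)

spots = list(iter_spots_nm(buffer, "demo_spots.csv"))
print(spots[0]["x_nm"], spots[0]["y_nm"], spots[0]["z_nm"])
```

With the real archive, pass the path to DNAseqFISH+.zip and one of the names returned by ZipFile.namelist().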
H1Esc-HFF.R1.tar.gz + H1Esc-HFF.R1.labeled: Kim 2020 sci-Hi-C (128 MB + 92 KB)
Format: tarball of per-cell .matrix files; each is a sparse triplet bin1<TAB>bin2<TAB>count<TAB>weight<TAB>chrom1<TAB>chrom2, with bin1/bin2 as global bin indices across the whole hg19 genome at 500 kb (offsets are not encoded; derive them by min-bin per chrom across the cells, or compute from canonical hg19 chromsizes). Chromosome strings are prefixed human_ (e.g. human_chr14). The companion *.labeled is a 2-column TSV matrix_filename<TAB>cell_type with values in {H1Esc, HFF}.
Content: 1 931 cells (750 H1Esc + 1 181 HFF), pooled from a combinatorial-indexing sci-Hi-C library at 500 kb. Used as a benchmark in Kim et al.’s topic-model paper.
Source: Kim et al. 2020, Nature Communications 11:6386, “Capturing cell type-specific chromatin compartment patterns by applying topic modeling to single-cell Hi-C data” — accompanying website at noble.gs.washington.edu/proj/schic-topic-model.
Download URLs:
https://noble.gs.washington.edu/proj/schic-topic-model/data/matrix_files/H1Esc-HFF.R1.tar.gz
https://noble.gs.washington.edu/proj/schic-topic-model/data/matrix_labels/H1Esc-HFF.R1.labeled
Used by: higashi_embedding.ipynb (FastHigashi cell embedding + ARI vs ground-truth labels). The tutorial picks a balanced subset (e.g. 150 H1Esc + 150 HFF), converts each .matrix to Higashi v2 contact-pair format, runs FastHigashi at rank 64 with do_conv/do_rwr/do_col=True, and reports ARI / NMI between the k-means clustering of cd.cellm['higashi'] and the cell-type labels. Reproduces ARI ≈ 0.55 on a Mac mini M2 in ~2 min on CPU at 300 cells.
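Because the .matrix files store global bin indices without per-chromosome offsets, a converter first has to recover those offsets. The sketch below shows the "min-bin per chrom across the cells" option mentioned in the Format note; it is not the tutorial's converter, the function names are hypothetical, and the demo lines are made up. The estimate is only exact if at least one cell has a contact in the first bin of each chromosome.

```python
PREFIX = "human_"


def parse_matrix_line(line):
    """Split one .matrix line into typed fields, dropping the 'human_' prefix."""
    bin1, bin2, count, weight, chrom1, chrom2 = line.rstrip("\n").split("\t")

    def strip(chrom):
        return chrom[len(PREFIX):] if chrom.startswith(PREFIX) else chrom

    return int(bin1), int(bin2), float(count), float(weight), strip(chrom1), strip(chrom2)


def min_bin_offsets(cells):
    """Estimate each chromosome's first global bin as the minimum seen in any cell."""
    offsets = {}
    for entries in cells:
        for bin1, bin2, _count, _weight, chrom1, chrom2 in entries:
            for global_bin, chrom in ((bin1, chrom1), (bin2, chrom2)):
                if chrom not in offsets or global_bin < offsets[chrom]:
                    offsets[chrom] = global_bin
    return offsets


def to_local(entry, offsets):
    """Convert an entry's global bin indices to per-chromosome bin indices."""
    bin1, bin2, count, weight, chrom1, chrom2 = entry
    return bin1 - offsets[chrom1], bin2 - offsets[chrom2], count, weight, chrom1, chrom2


# Two made-up cells in the documented six-field layout.
cell_a = [parse_matrix_line(text) for text in (
    "0\t3\t2\t1.0\thuman_chr1\thuman_chr1",
    "501\t505\t1\t1.0\thuman_chr2\thuman_chr2",
)]
cell_b = [parse_matrix_line("2\t499\t1\t1.0\thuman_chr1\thuman_chr2")]

offsets = min_bin_offsets([cell_a, cell_b])
local = to_local(cell_a[1], offsets)
print(offsets, local[:2])
```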
Generated outputs
with_loops.h5cd: created at the end of loop_calling.ipynb as a round-trip demonstration. Safe to delete; the tutorial will regenerate it on next run.
fofct_core.csv: a small synthetic FOF-CT that tutorials 4–7 generate if the real Takei 2021 CSV can’t be downloaded. Has a hand-crafted 3-TAD + 1-loop structure purely to keep the notebook runnable offline.
Pre-staging all data (optional)
Tutorials download what they need on first run, so you normally don’t have to do anything. To pre-stage all large files (useful for offline / CI environments), run:
python example-data/download_data.py
This is idempotent — files already present are skipped.
Adding a new dataset
If under ~10 MB and redistributable → commit to example-data/ and add a row to the “Small, in-repo datasets” table above.
Otherwise → add a download_* helper to download_data.py, a row to “Large datasets”, and a find_*() function in whichever tutorial needs it (follow the existing find_fofct() pattern).
Always cite the paper + accession, and note the licence / terms of use where they matter.