Example data

Catalog of the datasets used by the tutorials (under tutorials/) and benchmarks (under benchmarks/). Small files (< 10 MB) are checked into the repo; larger files are listed here with their source URL and auto-downloaded by the tutorials on first run into this directory.

Contents at a glance

| File / target location | Size | Shipped in repo | Auto-download | Used by |
|---|---|---|---|---|
| cell1.pairs | 4.7 MB | yes | | reconstruction.ipynb, root README CLI example |
| cell2.pairs.gz | 5.4 MB | yes | | (available for experimentation; bulkier variant of cell1.pairs) |
| IMR90_chr21_30kb.cool | 280 KB | yes | | gem_fish_reconstruction tutorial (Hi-C side) |
| takei2025_cerebellum_fixture/ | ~750 KB | yes | | seqfish_multiomics_cerebellum tutorial (Takei 2025 loader fixture) |
| Takei 2025 full Zenodo dump | ~8 GB | no | manual | seqfish_multiomics_cerebellum tutorial (real-data run) |
| 4DNFIHF3JCBY.csv | 22 MB | no | 4DN public S3 | loop_calling, tad_calling, compartment, fishnet_domains tutorials |
| DNAseqFISH+.zip | 144 MB | no | Zenodo 3735329 | jie_aligner tutorial |
| IMR90_chr21-18-20Mb.csv | 2 MB | no | GitHub raw | gem_fish_reconstruction tutorial |
| DNAseqFISH+/*.csv (extracted) | ~35 MB each | no | from the zip | jie_aligner tutorial |
| H1Esc-HFF.R1.tar.gz | 128 MB | no | UW Noble lab | higashi_embedding tutorial (sci-Hi-C H1Esc+HFF mix) |
| H1Esc-HFF.R1.labeled | 92 KB | no | UW Noble lab | higashi_embedding tutorial (cell-type labels) |
| with_loops.h5cd | 24 MB | no | generated | output of loop_calling.ipynb (intermediate cache, safe to delete) |
| fofct_core.csv | ~6 MB | no | generated | synthetic offline fallback for tutorials 4/5/6/7 |
| GSE63525_GM12878_insitu_primary+replicate_combined_30.hic | ~40 GB | no | manual (too large) | benchmark.ipynb |

*.csv, *.h5cd, and the DNAseqFISH+ zip/folder are ignored by git; they’re either generated or downloaded on demand.

Small, in-repo datasets

cell1.pairs — single-cell Hi-C read pairs (Stevens 2017)

  • Format: plain-text .pairs (no header) with 7 tab-separated columns read_id, chrom1, pos1, chrom2, pos2, strand1, strand2.

  • Content: 105,700 paired-end reads from one mESC G1 cell.

  • Source: Stevens et al. 2017, Nature 544:59–64, “3D structures of individual mammalian genomes studied by single-cell Hi-C”. GEO accession GSE80006. Distributed with the Nuc Dynamics software (github.com/tjs23/nuc_dynamics).

  • Used by: tutorials/reconstruction.ipynb (Nuc Dynamics worked example); mentioned in the root README.md CLI demo.

cell2.pairs.gz — single-cell Hi-C (v1.0 pairs format, gzipped)

  • Format: gzipped pairs v1.0 (with ## header), 503,720 reads.

  • Source: same Stevens 2017 dataset, a different cell.

  • Used by: not wired into any tutorial yet — kept for users who want to experiment with a larger, header-carrying pairs file.
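Both pairs files can be read with a few lines of standard-library Python. The sketch below is illustrative only and is not part of the tutorials: `read_pairs` and `PAIRS_COLUMNS` are hypothetical names, and it assumes that cell2.pairs.gz keeps the same first seven columns as cell1.pairs (the usual pairs v1.0 order) and that header lines start with `#`.

```python
import gzip
from pathlib import Path

# Column order described for cell1.pairs; assumed to hold for cell2.pairs.gz too.
PAIRS_COLUMNS = ("read_id", "chrom1", "pos1", "chrom2", "pos2", "strand1", "strand2")


def read_pairs(path):
    """Yield one dict per read pair from a plain or gzipped .pairs file."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip pairs v1.0 header lines and blank lines
            fields = line.rstrip("\n").split("\t")
            record = dict(zip(PAIRS_COLUMNS, fields[:7]))
            record["pos1"] = int(record["pos1"])
            record["pos2"] = int(record["pos2"])
            yield record
```

If the assumptions hold, `sum(1 for _ in read_pairs("example-data/cell1.pairs"))` should match the read count quoted above.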

takei2025_cerebellum_fixture/ — Takei 2025 cerebellum DNA seqFISH+ slice (~750 KB)

  • Format: three CSVs that together exercise the read_seqfish_multiomics loader end-to-end —

    • dna_spots.csv — 1 070 rows from cerebellum_rep1_pos0.csv (rep 1, FOV 0, 6 cells, chr19 only), all 76 source columns preserved verbatim (59 z-scores + 3 DBSCAN allele variants + μm coordinates + dot_int / n_rad_score / n_per_dist(um)).

    • locus_annotation.csv — the 838 chr19 rows from LC1-100k-09022022-mm10-25kb-meta.csv trimmed to name/chrom/start/end.

    • clustering.csv — the 6 matching rows from 100k-002-001-cerebellum_mRNA_cluster_nuc_vol_filtered.csv.

    • derive_fixture.py — the slicing script for reproducibility. Not executed in CI.

    • verify.py: five-layer correctness check (HDF5 layout + ChromData consistency + value-level row reconciliation against the source CSVs + semantic checks + .h5cd round-trip). Run python example-data/takei2025_cerebellum_fixture/verify.py to re-validate; pass --skip-value-recon plus --spot-glob/--locus/--clustering paths to validate the full Zenodo data instead.

  • Cell-type coverage: the 6 cells span leiden clusters {0, 2, 3, 4, 6, 7} → cell types Granule, Bergmann, Other, MLI1, Purkinje, MLI2+PLI (one cell per type).

  • Source: Takei et al. 2025, Nature “Spatial multi-omics reveals cell-type-specific nuclear compartments” (doi:10.1038/s41586-025-08838-x). Raw data: Zenodo record 7693825. Locus annotation + clustering CSVs: CaiGroup/dna-seqfish-plus-multi-omics GitHub repo.

  • Used by: seqfish_multiomics_cerebellum.ipynb (loader walk-through + .h5cd round-trip).

  • Real-data run: the full distribution (rep 1 + rep 2 tarballs, ~8 GB compressed; tens of GB uncompressed; tens of millions of spots) is not auto-fetched. Manual download:

    wget https://zenodo.org/records/7693825/files/cerebellum_rep1.tar.gz
    wget https://zenodo.org/records/7693825/files/cerebellum_rep2.tar.gz
    git clone https://github.com/CaiGroup/dna-seqfish-plus-multi-omics.git
    

    Then point the loader at the extracted CSV(s):

    cd = ChromData.from_seqfish_multiomics(
        spot_glob='cerebellum_rep*/cerebellum_rep*_pos*.csv',
        locus_annotation='dna-seqfish-plus-multi-omics/data/annotation/'
                         'LC1-100k-09022022-mm10-25kb-meta.csv',
        cell_clustering='dna-seqfish-plus-multi-omics/data/cerebellum/'
                        'clustering/100k-002-001-cerebellum_mRNA_cluster_nuc_vol_filtered.csv',
    )
    

IMR90_chr21_30kb.cool — Rao 2014 IMR90 chr21 at 30 kb (0.3 MB)

  • Format: single-resolution cooler. 1 557 bins × 30 kb covering chr21 (hg38).

  • Content: IMR90 in situ Hi-C pair counts from Rao et al. 2014, aggregated from the native 5 kb to 30 kb to match the Bintu 2018 chromatin-tracing resolution.

  • Source: 4DN accession 4DNFI4QQPDMR (the full 810 MB IMR90 .mcool; we extract chr21 at 30 kb into this small .cool for redistribution).

  • How it was derived:

    import cooler
    import numpy as np
    import pandas as pd

    # Raw (unbalanced) dense chr21 matrix at the native 5 kb resolution
    c5 = cooler.Cooler('4DNFI4QQPDMR.mcool::/resolutions/5000')
    m = c5.matrix(balance=False, as_pixels=False).fetch('chr21')
    # aggregate 5 kb × 6 → 30 kb by summing each 6 × 6 block
    # (m is symmetric, so diagonal blocks count off-diagonal 5 kb pairs twice)
    f = 6; n5 = m.shape[0]; n30 = (n5 + f - 1) // f
    agg = np.zeros((n30, n30))
    for i in range(n30):
        for j in range(n30):
            agg[i,j] = m[i*f:(i+1)*f, j*f:(j+1)*f].sum()
    bins = pd.DataFrame({
        'chrom': ['chr21']*n30,
        'start': np.arange(n30)*30_000,
        'end':   np.minimum((np.arange(n30)+1)*30_000, c5.chromsizes['chr21']),
    })
    iu = np.triu_indices(n30, k=0)
    pixels = pd.DataFrame({'bin1_id': iu[0], 'bin2_id': iu[1],
                            'count': agg[iu].astype(np.int64)})
    pixels = pixels[pixels['count'] > 0]
    cooler.create_cooler('IMR90_chr21_30kb.cool', bins, pixels, assembly='hg38')
    # Balance with ICE so reconstruct_gem_fish can pull balanced counts:
    # python -m cooler balance IMR90_chr21_30kb.cool
    
  • Used by: gem_fish_reconstruction.ipynb — paired with the Bintu 2018 FISH CSV for the paper-faithful Part 2 run.

Large datasets — auto-downloaded on first run

The tutorials locate these files via a find_*() helper that:

  1. Returns any cached copy under example-data/ or ~/Downloads/….

  2. Otherwise downloads into example-data/ and caches for the next run.

  3. Falls back to a small synthetic dataset if the network is unreachable (only applies to the 4DN CSV path; the Zenodo zip has no synthetic fallback).

Every tutorial’s first data-loading cell prints the location it ends up using.
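The three-step lookup above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tutorials' actual helper: `find_dataset`, its parameters, and the injectable `fetch` callable are hypothetical names chosen for the example.

```python
from pathlib import Path
from urllib.request import urlretrieve


def find_dataset(filename, url, search_dirs, make_fallback=None, fetch=urlretrieve):
    """Return a local path for `filename`, downloading or synthesising it if needed.

    search_dirs[0] plays the role of example-data/ (the download target);
    make_fallback(directory) is an optional callable that writes a synthetic file.
    """
    search_dirs = [Path(d).expanduser() for d in search_dirs]

    # 1. Reuse any cached copy.
    for directory in search_dirs:
        candidate = directory / filename
        if candidate.exists():
            return candidate

    # 2. Otherwise download into the first directory and cache it there.
    target = search_dirs[0] / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    try:
        fetch(url, target)
        return target
    except OSError:  # urllib's URLError is a subclass of OSError
        target.unlink(missing_ok=True)  # drop any partial download
        # 3. Offline: use the synthetic fallback when one is defined.
        if make_fallback is None:
            raise
        return make_fallback(search_dirs[0])
```

Passing `make_fallback=None` mirrors the Zenodo zip case, where a failed download is an error rather than a silent switch to synthetic data.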

IMR90_chr21_pyhim.ecsv — Bintu 2018 IMR90 chr21 in PyHiM ECSV format (5.7 MB)

  • Format: PyHiM chromatin trace table (Astropy ECSV). Columns: Spot_ID, Trace_ID, x, y, z, Chrom, Chrom_Start, Chrom_End, ROI #, Mask_id, Barcode #, label. meta['comments'] carries xyz_unit=micron, genome_assembly=hg38.

  • Content: hg38 chr21:18.6–20.6 Mb, 1,277 traces × 66 loci × 30 kb spacing. IMR90 fibroblasts.

  • Source: Bintu et al. 2018, Science 362:eaau1783, “Super-resolution chromatin tracing reveals domains and cooperative interactions in single cells”. Original CSV at mendeley.com/datasets/3jkp7zhwbr/1.

  • How it was derived: The original Bintu CSV has columns Chromosome index, Segment index, Z, X, Y (nm). The conversion script bintu_to_pyhim_ecsv.py maps these to PyHiM schema:

    • Chromosome index → Trace_ID (1..1277)

    • Segment index → Barcode # (1..66)

    • X, Y, Z (nm) → x, y, z (microns)

    • Chrom = chr21, Chrom_Start/End derived from segment index × 30 kb

    • Mask_id = Chromosome index (each trace = one “cell”)

    • ROI # = 0, label = "None"

  • Used by: import_pyhim_ecsv.ipynb — demonstrates ChromData.from_pyhim_trace() on real chromatin tracing data.

  • Generated on demand: If missing, the tutorial runs bintu_to_pyhim_ecsv.py to convert IMR90_chr21-18-20Mb.csv (which is auto-downloaded if needed).
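The column mapping listed under "How it was derived" can be sketched per row as below. This is not the code of bintu_to_pyhim_ecsv.py: `bintu_row_to_pyhim` is a hypothetical helper, the caller-supplied `spot_id` stands in for however the script generates Spot_ID, and the `region_start + (segment - 1) × spacing` rule for Chrom_Start is an assumption (the text only says it is derived from the segment index × 30 kb).

```python
def bintu_row_to_pyhim(row, spot_id, region_start, spacing=30_000):
    """Map one Bintu CSV row (a dict of strings) onto the PyHiM trace-table columns."""
    chrom_index = int(row["Chromosome index"])
    segment = int(row["Segment index"])
    # Assumed genomic placement: segment 1 starts at region_start.
    start = region_start + (segment - 1) * spacing
    return {
        "Spot_ID": spot_id,
        "Trace_ID": chrom_index,
        "x": float(row["X"]) / 1000.0,  # nm -> microns
        "y": float(row["Y"]) / 1000.0,
        "z": float(row["Z"]) / 1000.0,
        "Chrom": "chr21",
        "Chrom_Start": start,
        "Chrom_End": start + spacing,
        "ROI #": 0,
        "Mask_id": chrom_index,  # each trace is treated as one "cell"
        "Barcode #": segment,
        "label": "None",
    }
```

Rows would typically come from `csv.DictReader` over IMR90_chr21-18-20Mb.csv, with `region_start` set to whatever region origin the conversion script actually uses.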

4DNFIHF3JCBY.csv — Takei 2021 mESC FOF-CT chromatin tracing (22 MB)

  • Format: 4DN FISH Omics Format — Chromatin Tracing (FOF-CT) core table. Headers (##…) describe the experiment; the data table has columns: Spot_ID, Trace_ID, X, Y, Z, Chrom, Chrom_Start, Chrom_End, Cell_ID, .

  • Content: mm10, 20 chromosomes × 60 bins × 25 kb, ~400 traces per chromosome across 201 E14 mESC cells.

  • Source: Takei et al. 2021, Nature 590:344–350, “Integrated spatial genomics reveals global architecture of single nuclei”. 4DN portal: data.4dnucleome.org/4DNFIHF3JCBY.

  • Download URL (public S3, no credentials needed):

    https://4dn-open-data-public.s3.amazonaws.com/fourfront-webprod/wfoutput/e699334e-fb34-4a0e-8ef6-670b2099831a/4DNFIHF3JCBY.csv
    
  • Used by (all use the find_fofct() helper): loop_calling.ipynb, tad_calling.ipynb, compartment.ipynb, fishnet_domains.ipynb.

IMR90_chr21-18-20Mb.csv — Bintu 2018 IMR90 chr21:18.6–20.6 Mb chromatin tracing (2 MB)

  • Format: CSV with header line then columns Chromosome index, Segment index, Z, X, Y. One row per detected segment per imaged chromosome. Coordinates in nanometres. Segment spacing is 30 kb.

  • Content: 1 278 imaged chromosomes × 66 segments in IMR90 cells, covering chr21:18,627,714–20,577,518 (hg38).

  • Source: Bintu et al. 2018, Science 362:eaau1783, “Super-resolution chromatin tracing reveals domains and cooperative interactions in single cells”. The paper is a higher-resolution follow-up to Wang et al. 2016 (Science 353:598) which Abbas et al. 2019 (GEM-FISH) originally used; Bintu 2018 data is directly accessible from the authors’ GitHub repository.

  • Download URL: https://raw.githubusercontent.com/BogdanBintu/ChromatinImaging/master/Data/IMR90_chr21-18-20Mb.csv

  • Used by: gem_fish_reconstruction.ipynb (part 2 — real FISH data). The tutorial synthesises a matching Hi-C from the FISH population-mean so it runs offline; to reproduce the paper’s exact setup, substitute the Rao 2014 IMR90 .mcool from 4DN at 30 kb resolution.

GSE63525_GM12878_insitu_primary+replicate_combined_30.hic — GM12878 in-situ combined Hi-C (~40 GB)

  • Format: Juicer .hic — a multi-resolution contact matrix file (read with hicstraw).

  • Content: GM12878 lymphoblastoid cells, in-situ Hi-C, primary + replicate merged and MAPQ ≥ 30 filtered. Contains every standard Juicer resolution from 1 kb to 2.5 Mb, all chromosomes.

  • Source: Rao et al. 2014, Cell 159:1665–1680, “A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping”.

  • Download: not auto-fetched (~40 GB is too large to ship through download_data.py). Grab it manually, for example from the GEO series GSE63525 named in the file prefix.

  • Used by: benchmark.ipynb (MDS reconstruction benchmark). The tutorial checks example-data/ first and falls back to the UCHROM_GM12878_HIC env var, so you can keep the file on an external drive: export UCHROM_GM12878_HIC=/path/to/…_combined_30.hic.
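The lookup order described for benchmark.ipynb (example-data/ first, then the UCHROM_GM12878_HIC environment variable) might look roughly like this. `locate_gm12878_hic` is a hypothetical name, and returning `None` when neither location has the file is an assumption about how a missing file is signalled.

```python
import os
from pathlib import Path

HIC_NAME = "GSE63525_GM12878_insitu_primary+replicate_combined_30.hic"


def locate_gm12878_hic(data_dir="example-data", environ=os.environ):
    """Return the path to the GM12878 .hic file, or None if it cannot be found."""
    # 1. Prefer a copy sitting in example-data/.
    local = Path(data_dir) / HIC_NAME
    if local.exists():
        return local
    # 2. Fall back to the path named by the environment variable.
    override = environ.get("UCHROM_GM12878_HIC")
    if override and Path(override).exists():
        return Path(override)
    return None
```

Taking `environ` as a parameter keeps the function easy to exercise without touching the real process environment.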

DNAseqFISH+.zip — Takei 2021 raw seqFISH+ spots (144 MB)

  • Format: zip containing 8 CSVs — 4 replicates at 1-Mb resolution and 4 at 25-kb resolution. Columns: fov, channel, cellID, regionID (hyb1-60), x, y, z, dot_intensity, chr{N}_intensity × 20, chromID, labelID.

  • Content: same experiment as the FOF-CT above, but before trace assignment — every row is a detected fluorescent spot with a decoded chromosome ID but ambiguous fiber assignment (median 6 candidate spots per (cell, chromID, region)). labelID 0 marks the upstream pipeline’s trace choice (useful as ground truth when benchmarking aligners).

  • Source: Takei et al. 2021, Zenodo record 3735329, doi:10.5281/zenodo.3735329.

  • Download URL:

    https://zenodo.org/records/3735329/files/DNAseqFISH%2B.zip?download=1
    
  • Used by: jie_aligner.ipynb (spot-to-fiber tracing). The tutorial opens the CSV directly from the zip without extracting.

  • Coordinate conventions (applied by the tutorial): x, y in pixels × 103 nm/pixel; z in pixels × 250 nm/pixel.
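Reading a CSV straight from the zip and applying the pixel-to-nanometre factors above takes only the standard library. The sketch below is an illustration, not the tutorial's code: `iter_spots_nm` is a hypothetical helper, the member name is whatever CSV path the archive actually contains, and only the x, y, z column names are relied on.

```python
import csv
import io
import zipfile

XY_NM_PER_PIXEL = 103.0  # x and y pixel size quoted above
Z_NM_PER_PIXEL = 250.0   # z step quoted above


def iter_spots_nm(zip_source, member):
    """Yield each spot row of `member` with added x_nm / y_nm / z_nm fields.

    zip_source may be a path or a binary file-like object; nothing is extracted to disk.
    """
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.DictReader(text):
                spot = dict(row)  # keep every original column as a string
                spot["x_nm"] = float(row["x"]) * XY_NM_PER_PIXEL
                spot["y_nm"] = float(row["y"]) * XY_NM_PER_PIXEL
                spot["z_nm"] = float(row["z"]) * Z_NM_PER_PIXEL
                yield spot
```

`zipfile.ZipFile(...).namelist()` lists the member names to choose from.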

H1Esc-HFF.R1.tar.gz + H1Esc-HFF.R1.labeled — Kim 2020 sci-Hi-C (128 MB + 92 KB)

  • Format: tarball of per-cell .matrix files; each is a sparse triplet bin1<TAB>bin2<TAB>count<TAB>weight<TAB>chrom1<TAB>chrom2, with bin1/bin2 as global bin indices across the whole hg19 genome at 500 kb (offsets are not encoded — derive them by min-bin per chrom across the cells, or compute from canonical hg19 chromsizes). Chromosome strings are prefixed human_ (e.g. human_chr14). The companion *.labeled is a 2-column TSV matrix_filename<TAB>cell_type with values in {H1Esc, HFF}.

  • Content: 1 931 cells (750 H1Esc + 1 181 HFF), pooled from a combinatorial-indexing sci-Hi-C library at 500 kb. Used as a benchmark in Kim et al.’s topic-model paper.

  • Source: Kim et al. 2020, Nature Communications 11:6386, “Capturing cell type-specific chromatin compartment patterns by applying topic modeling to single-cell Hi-C data” — accompanying website at noble.gs.washington.edu/proj/schic-topic-model.

  • Download URLs:

    https://noble.gs.washington.edu/proj/schic-topic-model/data/matrix_files/H1Esc-HFF.R1.tar.gz
    https://noble.gs.washington.edu/proj/schic-topic-model/data/matrix_labels/H1Esc-HFF.R1.labeled
    
  • Used by: higashi_embedding.ipynb (FastHigashi cell embedding + ARI vs ground-truth labels). The tutorial picks a balanced subset (e.g. 150 H1Esc + 150 HFF), converts each .matrix to Higashi v2 contact-pair format, runs FastHigashi at rank 64 with do_conv/do_rwr/do_col=True, and reports ARI / NMI between the k-means clustering of cd.cellm['higashi'] and the cell-type labels. Reproduces ARI ≈ 0.55 on a Mac mini M2 in ~2 min on CPU at 300 cells.
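The min-bin offset derivation mentioned under Format can be sketched as follows. `chrom_offsets` and `to_local_bin` are hypothetical helper names, and the sketch assumes each .matrix line has exactly the six tab-separated fields listed above, with integer bin indices.

```python
def chrom_offsets(matrix_lines_per_cell):
    """Smallest global bin index observed for each chromosome across all cells.

    matrix_lines_per_cell: an iterable with one iterable of .matrix text lines per cell.
    """
    offsets = {}
    for lines in matrix_lines_per_cell:
        for line in lines:
            if not line.strip():
                continue  # ignore blank lines
            bin1, bin2, _count, _weight, chrom1, chrom2 = line.strip().split("\t")
            for chrom, bin_id in ((chrom1, int(bin1)), (chrom2, int(bin2))):
                offsets[chrom] = min(bin_id, offsets.get(chrom, bin_id))
    return offsets


def to_local_bin(chrom, global_bin, offsets):
    """Convert a genome-wide bin index into a per-chromosome bin index."""
    return global_bin - offsets[chrom]
```

The observed minimum equals the true offset only if at least one cell has a contact in the chromosome's first bin; computing offsets from canonical hg19 chromsizes avoids that dependence.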

Generated outputs

  • with_loops.h5cd — created at the end of loop_calling.ipynb as a round-trip demonstration. Safe to delete; the tutorial will regenerate it on next run.

  • fofct_core.csv — a small synthetic FOF-CT that tutorials 4–7 generate if the real Takei 2021 CSV can’t be downloaded. Has a hand-crafted 3-TAD + 1-loop structure purely to keep the notebook runnable offline.

Pre-staging all data (optional)

Tutorials download what they need on first run, so you normally don’t have to do anything. To pre-stage all large files (useful for offline / CI environments), run:

python example-data/download_data.py

This is idempotent — files already present are skipped.
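The skip-if-present behaviour can be illustrated with a short sketch. It does not reproduce download_data.py: `MANIFEST`, `prestage`, and the `.part` temporary-file convention are assumptions made for the example, while the URLs are the ones listed in this document.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical manifest built from the download URLs quoted in this document.
MANIFEST = {
    "4DNFIHF3JCBY.csv": "https://4dn-open-data-public.s3.amazonaws.com/fourfront-webprod/wfoutput/e699334e-fb34-4a0e-8ef6-670b2099831a/4DNFIHF3JCBY.csv",
    "DNAseqFISH+.zip": "https://zenodo.org/records/3735329/files/DNAseqFISH%2B.zip?download=1",
    "IMR90_chr21-18-20Mb.csv": "https://raw.githubusercontent.com/BogdanBintu/ChromatinImaging/master/Data/IMR90_chr21-18-20Mb.csv",
    "H1Esc-HFF.R1.tar.gz": "https://noble.gs.washington.edu/proj/schic-topic-model/data/matrix_files/H1Esc-HFF.R1.tar.gz",
    "H1Esc-HFF.R1.labeled": "https://noble.gs.washington.edu/proj/schic-topic-model/data/matrix_labels/H1Esc-HFF.R1.labeled",
}


def prestage(data_dir, manifest=MANIFEST, fetch=urlretrieve):
    """Download every missing file in `manifest`; return the names actually fetched."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for name, url in manifest.items():
        target = data_dir / name
        if target.exists():
            continue  # already staged, so a second run does nothing
        partial = target.with_name(target.name + ".part")
        fetch(url, partial)      # download under a temporary name
        partial.replace(target)  # rename only after the download completed
        fetched.append(name)
    return fetched
```

Downloading to a temporary name first means an interrupted transfer is never mistaken for a staged file on the next run.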

Adding a new dataset

  1. If under ~10 MB and redistributable → commit to example-data/ and add a row to the “Small, in-repo datasets” table above.

  2. Otherwise → add a download_* helper to download_data.py, a row to “Large datasets”, and a find_*() function in whichever tutorial needs it (follow the existing find_fofct() pattern).

  3. Always cite the paper + accession, and note the licence / terms of use where they matter.