Iterative Auto-discovery Design

This page records the proposed next architecture for U-Chrom auto-discovery. The goal is to move from one-shot idea generation toward a graph-guided discovery loop that uses dataset metadata, user annotations, external literature, prior notebook results, and explicit evidence classification.

Motivation

The current auto-discovery workflow already supports:

  • h5cd-backed discovery schemas

  • agent-generated structured ideas

  • schema review before execution

  • notebook-first exploration with free-form code

  • runner re-execution and verification

  • evidence classification separate from notebook verification

The next step is to make the system iterative. New ideas should not be independent samples from a prompt. They should be linked to what is already known, what was already tested, what was supported, what was contradicted, and which regions of the dataset remain underexplored.

The central design principle is:

schema tells what is measurable
references tell what is known
annotations tell what the user cares about
graph tells what has already been tried
evidence tells what held up
frontiers tell what to ask next
notebooks verify each claim

High-level Architecture

The proposed system is organized around an IdeaGraph plus an orchestrator. Agents do not call each other directly. Instead, each agent reads a graph slice and writes structured artifacts that are ingested back into the graph.

cdata / h5cd
  -> discovery schema
  -> dataset references
  -> user annotations
  -> graph pointer / run summaries

Browser agent
  -> external sources, papers, dataset pages, claims, citations

Idea agent
  -> graph-guided computable hypotheses

Notebook agent
  -> notebook-first quantitative validation

Runner
  -> deterministic re-execution and evidence classification

IdeaGraph
  -> shared memory, provenance, coverage, and next frontiers

This keeps responsibilities clear:

  • ChromData is the authority for data and data-level context.

  • IdeaGraph is the authority for discovery history and scientific relationships.

  • Browser agents collect knowledge.

  • Idea agents propose hypotheses.

  • Notebook agents test hypotheses.

  • The runner re-executes notebooks deterministically and classifies evidence.

  • The orchestrator schedules work based on graph state and budget.

Relationship to ChromData

ChromData should remain the entry point for the dataset. It should not store large notebooks, browser logs, or full graph artifacts. Instead, it should store lightweight context and pointers.

Recommended h5cd metadata keys:

cdata.uns["dataset_references"] = [...]
cdata.uns["user_annotations"] = [...]
cdata.uns["auto_discovery_schema"] = {...}
cdata.uns["auto_discovery_graph"] = {...}
cdata.uns["auto_discovery_runs"] = [...]

dataset_references records source material associated with the dataset. These may include primary papers, data repositories, supplementary tables, method papers, related biology priors, or user-provided references.

Example:

[
  {
    "reference_id": "takei2025_primary",
    "role": "primary_dataset_paper",
    "title": "Primary dataset paper title",
    "doi": "10.xxxx/xxxxx",
    "pmid": null,
    "url": "https://example.org/paper",
    "year": 2025,
    "notes": "Primary dataset paper for linked chromatin tracing and RNA data."
  },
  {
    "reference_id": "zenodo_7693825",
    "role": "data_repository",
    "url": "https://zenodo.org/records/7693825",
    "notes": "Raw data and annotation tables."
  }
]

user_annotations records user-provided knowledge, constraints, or seeds. These should be treated as high-priority context but not as validated evidence.

Example:

[
  {
    "annotation_id": "purkinje_marker_note",
    "scope": "cell_type",
    "target": "Purkinje",
    "text": "Pcp2 should be treated as a Purkinje marker in this dataset.",
    "tags": ["marker", "cell_type_prior"],
    "confidence": "user_asserted"
  },
  {
    "annotation_id": "rna_geometry_constraint",
    "scope": "analysis_constraint",
    "target": "linked_adata",
    "text": "RNA and chromatin tracing are linked at cell_id level; do not assume RNA spot geometry.",
    "tags": ["constraint", "multiomics_alignment"]
  }
]

The discovery schema should include these as first-class fields:

schema["references"] = cdata.uns.get("dataset_references", [])
schema["user_annotations"] = cdata.uns.get("user_annotations", [])
schema["knowledge_seed_context"] = ...

This lets the first iteration start from dataset-specific knowledge rather than from schema fields alone.
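A minimal sketch of how the schema could pick up this context, assuming the uns store behaves like a plain dict. The helper name build_schema_context and the shape of knowledge_seed_context are hypothetical; the design above leaves that field unspecified.

```python
def build_schema_context(uns):
    """Copy dataset-level context from an uns-like mapping into a schema dict."""
    references = list(uns.get("dataset_references", []))
    annotations = list(uns.get("user_annotations", []))
    return {
        "references": references,
        "user_annotations": annotations,
        # Hypothetical shape: just the ids an idea agent could seed from.
        "knowledge_seed_context": {
            "reference_ids": [r["reference_id"] for r in references],
            "annotation_ids": [a["annotation_id"] for a in annotations],
        },
    }


# A plain dict stands in for cdata.uns here.
uns = {
    "dataset_references": [
        {"reference_id": "zenodo_7693825", "role": "data_repository"}
    ],
    "user_annotations": [
        {"annotation_id": "purkinje_marker_note", "scope": "cell_type"}
    ],
}
schema = build_schema_context(uns)
```

Using .get with an empty default keeps the helper usable on h5cd files that predate these keys.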

IdeaGraph

The graph provides shared scientific memory. It should support both machine-readable scheduling and human-readable audit trails.

Suggested node types:

Dataset
SchemaSnapshot
Reference
LiteratureSource
LiteratureClaim
UserAnnotation
CellType
Gene
IFMarker
Modality
ParameterFamily
Frontier
Idea
NotebookRun
EvidenceResult

Suggested edge types:

Idea -> uses_cell_type -> CellType
Idea -> uses_gene -> Gene
Idea -> uses_marker -> IFMarker
Idea -> uses_modality -> Modality
Idea -> has_parameter_family -> ParameterFamily
Idea -> motivated_by -> LiteratureClaim
Idea -> motivated_by -> UserAnnotation
Idea -> derived_from -> Idea
Idea -> refines -> Idea
Idea -> generalizes -> Idea
Idea -> specializes -> Idea
Idea -> alternative_definition_of -> Idea
Idea -> negative_control_for -> Idea
Idea -> tested_by -> NotebookRun
NotebookRun -> produced -> EvidenceResult
EvidenceResult -> supports -> Idea
EvidenceResult -> contradicts -> Idea
EvidenceResult -> not_supports -> Idea
LiteratureClaim -> sourced_from -> Reference
LiteratureClaim -> mentions_gene -> Gene
LiteratureClaim -> mentions_cell_type -> CellType
LiteratureClaim -> mentions_marker -> IFMarker

Each idea should be graph-linked. For any generated idea, the graph should be able to answer:

  • Which prior idea or result motivated it?

  • Which literature claim or user annotation motivated it?

  • Which fields, genes, markers, and cell types does it use?

  • Which previous ideas are near duplicates?

  • What would count as support, contradiction, or non-support?
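As an illustration only, the node and edge types above could be held in a very small typed graph. The class below is a hypothetical stand-in written with the standard library, not the planned uchrom.auto_discovery.graph API; it shows how the first three questions reduce to edge lookups.

```python
from collections import defaultdict


class IdeaGraph:
    """Tiny typed multigraph: nodes carry a type, edges carry a relation."""

    def __init__(self):
        self.nodes = {}                     # node_id -> {"type": ..., attrs}
        self.out_edges = defaultdict(list)  # source_id -> [(relation, target_id)]

    def add_node(self, node_id, node_type, **attrs):
        self.nodes[node_id] = {"type": node_type, **attrs}

    def add_edge(self, source, relation, target):
        # Refuse dangling edges so provenance questions always resolve.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.out_edges[source].append((relation, target))

    def targets(self, source, relation):
        """Return ids reached from `source` through edges named `relation`."""
        return [t for r, t in self.out_edges[source] if r == relation]


graph = IdeaGraph()
graph.add_node("idea_1", "Idea")
graph.add_node("claim_pcp2_purkinje_marker", "LiteratureClaim")
graph.add_node("Pcp2", "Gene")
graph.add_edge("idea_1", "motivated_by", "claim_pcp2_purkinje_marker")
graph.add_edge("idea_1", "uses_gene", "Pcp2")

# "Which literature claim motivated it?" becomes a single lookup.
motivations = graph.targets("idea_1", "motivated_by")
```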

External Knowledge Layer

External knowledge should be handled by a dedicated browser or literature agent, not by the notebook agent during analysis.

The browser agent is a knowledge ingestion worker:

input:
  schema references
  user annotations
  graph gaps
  frontier query plan

output:
  papers.jsonl
  claims.jsonl
  sources.bib
  retrieval_log.jsonl
  browser_records.jsonl

The browser agent should extract source-backed claims, not free-floating summaries.

Example claim:

{
  "claim_id": "claim_pcp2_purkinje_marker",
  "source": {
    "reference_id": "some_paper",
    "doi": "10.xxxx/xxxxx",
    "pmid": "12345678",
    "title": "Paper title",
    "year": 2024,
    "url": "https://example.org/paper"
  },
  "claim_text": "Pcp2 is used as a Purkinje-cell marker.",
  "entities": {
    "genes": ["Pcp2"],
    "cell_types": ["Purkinje"],
    "markers": [],
    "modalities": ["RNA"]
  },
  "relationship": "supports_marker_identity",
  "direction": "positive",
  "evidence_type": "paper_text",
  "confidence": 0.8,
  "quoted_evidence": "Short source-backed quote or excerpt."
}

External knowledge is a prior. It can motivate ideas, but it does not replace data validation. Notebook results remain the evidence for whether an idea is supported in the loaded h5cd dataset.
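A possible ingestion check for the "source-backed claims" requirement, sketched under assumptions: the list of required fields and the rule that a source needs at least one of doi, pmid, or url are illustrative choices, and claim_problems is a hypothetical helper.

```python
REQUIRED_CLAIM_FIELDS = ("claim_id", "source", "claim_text", "entities", "quoted_evidence")
SOURCE_LOCATORS = ("doi", "pmid", "url")


def claim_problems(claim):
    """Return the reasons why a claim record is not source-backed."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_CLAIM_FIELDS
        if not claim.get(name)
    ]
    source = claim.get("source") or {}
    # Assumed rule: at least one resolvable locator must be present.
    if not any(source.get(key) for key in SOURCE_LOCATORS):
        problems.append("source has no doi, pmid, or url")
    confidence = claim.get("confidence")
    if confidence is not None and not 0.0 <= confidence <= 1.0:
        problems.append("confidence outside [0, 1]")
    return problems


good_claim = {
    "claim_id": "claim_pcp2_purkinje_marker",
    "source": {"reference_id": "some_paper", "url": "https://example.org/paper"},
    "claim_text": "Pcp2 is used as a Purkinje-cell marker.",
    "entities": {"genes": ["Pcp2"], "cell_types": ["Purkinje"]},
    "confidence": 0.8,
    "quoted_evidence": "Short source-backed quote or excerpt.",
}
free_floating = {"claim_id": "summary_1", "claim_text": "Unsourced summary."}
```

A claim that fails the check can be logged in retrieval_log.jsonl and kept out of the graph rather than ingested as a prior.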

Suggested operation modes:

closed_world
  Use only h5cd schema and previous discovery results.

knowledge_guided
  Use cached literature claims and user annotations, but do not browse.

browser_refresh
  Refresh external knowledge first, then generate new ideas.
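The three modes could be expressed as an enum with a small policy lookup. This is a sketch; DiscoveryMode, context_policy, and the flag names are hypothetical, and it assumes browser_refresh also uses the claims it has just refreshed.

```python
from enum import Enum


class DiscoveryMode(Enum):
    CLOSED_WORLD = "closed_world"
    KNOWLEDGE_GUIDED = "knowledge_guided"
    BROWSER_REFRESH = "browser_refresh"


def context_policy(mode):
    """Say which external-knowledge sources an iteration may use."""
    return {
        # Cached or freshly retrieved claims are visible outside closed_world.
        "use_literature_claims": mode is not DiscoveryMode.CLOSED_WORLD,
        # Only browser_refresh is allowed to reach the network.
        "allow_browsing": mode is DiscoveryMode.BROWSER_REFRESH,
    }


policy = context_policy(DiscoveryMode("knowledge_guided"))
```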

Agent Scheduling

The orchestrator should run discovery in iterations. Each iteration updates the graph before planning the next step.

1. Sync graph from cdata and discovery schema.
2. Optionally refresh external knowledge with browser agents.
3. Plan dynamic frontiers from graph state.
4. Dispatch idea agents over selected frontiers.
5. Run schema review and graph novelty review.
6. Select a diverse idea portfolio.
7. Dispatch notebook agents over selected ideas.
8. Re-execute notebooks with the U-Chrom runner.
9. Classify evidence and update the graph.
10. Write graph summaries and h5cd pointers.

Pseudo-code:

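# Assumes the schema is read from the h5cd metadata key listed earlier.
schema = cdata.uns["auto_discovery_schema"]

while budget.remaining():
    # Step 1: bring the graph up to date with the dataset.
    graph.sync_from_cdata(cdata)

    # Step 2: optionally refresh external knowledge.
    if planner.needs_knowledge_refresh(graph):
        claims = browser_pool.run(graph.knowledge_queries())
        graph.ingest_literature_claims(claims)

    # Steps 3-4: plan frontiers, then generate candidate ideas over them.
    frontiers = planner.plan_frontiers(graph)
    candidates = idea_pool.run(frontiers)

    # Step 5: keep the schema review and the graph review next to each idea.
    reviewed = [
        (
            idea,
            review_idea_against_schema(idea, schema),
            review_idea_against_graph(idea, graph),
        )
        for idea in candidates
    ]
    # Step 6: select a diverse portfolio within budget.
    selected = portfolio_select(reviewed, graph, budget)

    # Steps 7-9: test ideas, re-execute, and record evidence.
    notebooks = notebook_pool.run(selected)
    results = runner.verify(notebooks)
    graph.ingest_results(results)

    # Step 10: write the lightweight pointer back to the h5cd.
    cdata.uns["auto_discovery_graph"] = graph.lightweight_summary()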

Dynamic Frontiers

The current framework uses fixed idea direction buckets. This is useful for a first pass, but it is too rigid for iterative discovery. The next design should replace static buckets with graph-derived frontier cards.

Frontier examples:

Follow up a supported rDNA inter-chromosomal hub result.
Explain a contradicted H3K27ac radial-enrichment result.
Refine a borderline Aldoc lamina-associated signal.
Fill an underexplored Bergmann RNA-linked region.
Add negative controls for active chromatin assortativity.
Generate literature-guided Pcp2/Purkinje spatial coupling ideas.

Example frontier:

{
  "frontier_id": "contradicted_h3k27ac_radial_followup",
  "goal": "Explain why H3K27ac radial enrichment was contradicted.",
  "strategy": "alternative_definition",
  "preferred_facets": {
    "markers": ["H3K27ac"],
    "cell_types": ["Purkinje", "Granule"],
    "parameter_families": ["radial_position", "local_distance"]
  },
  "avoid_idea_ids": ["previous_near_duplicate_id"],
  "required_novelty": 0.65,
  "required_relation": "alternative_definition_of"
}

Frontiers can be generated from:

  • supported findings that need replication, refinement, or controls

  • contradicted findings that need alternative definitions or confounder tests

  • borderline findings that need more power or less noisy parameters

  • not-supported findings that should not be repeated exactly

  • coverage gaps in cell types, markers, genes, modalities, or parameter families

  • literature claims that have not yet been tested in this h5cd

  • user annotations that encode hypotheses or analysis constraints
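Two of these sources, coverage gaps and untested literature claims, can be derived from idea signatures alone. The sketch below assumes list-valued facets on each signature; the function names and the strategy labels coverage_gap and literature_guided are hypothetical.

```python
def coverage_gap_frontiers(known_facets, idea_signatures):
    """One frontier card per facet value that no previous idea has used."""
    frontiers = []
    for facet, values in known_facets.items():
        used = {v for sig in idea_signatures for v in sig.get(facet, [])}
        for value in values:
            if value not in used:
                frontiers.append({
                    "frontier_id": f"coverage_gap_{facet}_{value}",
                    "strategy": "coverage_gap",
                    "preferred_facets": {facet: [value]},
                })
    return frontiers


def untested_claim_frontiers(claims, idea_signatures):
    """One frontier card per claim that no idea lists as a source."""
    tested = {
        claim_id
        for sig in idea_signatures
        for claim_id in sig.get("source_claim_ids", [])
    }
    return [
        {
            "frontier_id": f"literature_{claim['claim_id']}",
            "strategy": "literature_guided",
            "source_claim_ids": [claim["claim_id"]],
        }
        for claim in claims
        if claim["claim_id"] not in tested
    ]


known_facets = {"cell_types": ["Purkinje", "Bergmann"], "markers": ["H3K27ac"]}
signatures = [{"cell_types": ["Purkinje"], "markers": ["H3K27ac"], "source_claim_ids": []}]
claims = [{"claim_id": "claim_pcp2_purkinje_marker"}]

gap_cards = coverage_gap_frontiers(known_facets, signatures)
claim_cards = untested_claim_frontiers(claims, signatures)
```

Here the only gap card targets Bergmann, and the Pcp2 claim yields one literature-guided card because no idea cites it yet.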

Diversity and Novelty

Static prompt instructions are not enough to avoid repetition. The graph should provide a structured novelty review.

Each idea should get an idea_signature:

{
  "modalities": ["chromatin_tracing", "rna_expression"],
  "cell_types": ["Purkinje"],
  "genes": ["Pcp2"],
  "markers": ["H3K27ac"],
  "parameter_family": "local_distance",
  "null_model": "cell_label_permutation",
  "expected_direction": "negative",
  "parent_idea_id": "previous_idea",
  "source_claim_ids": ["claim_pcp2_purkinje_marker"]
}

Graph review should compute:

schema_feasibility
history_novelty
coverage_gain
literature_relevance
control_value
redundancy_penalty
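One way history_novelty could be computed is as one minus the highest Jaccard similarity between a candidate's signature and any earlier signature. This is an assumed metric chosen for illustration, not a decided part of the design; signature_tokens is a hypothetical helper.

```python
LIST_FACETS = ("modalities", "cell_types", "genes", "markers")
SCALAR_FACETS = ("parameter_family", "null_model", "expected_direction")


def signature_tokens(signature):
    """Flatten an idea_signature into comparable 'facet:value' tokens."""
    tokens = set()
    for facet in LIST_FACETS:
        tokens.update(f"{facet}:{value}" for value in signature.get(facet, []))
    for facet in SCALAR_FACETS:
        if signature.get(facet):
            tokens.add(f"{facet}:{signature[facet]}")
    return tokens


def history_novelty(candidate, history):
    """Return 1 minus the highest Jaccard similarity to any earlier signature."""
    cand = signature_tokens(candidate)
    best = 0.0
    for previous in history:
        prev = signature_tokens(previous)
        union = cand | prev
        if union:
            best = max(best, len(cand & prev) / len(union))
    return 1.0 - best


earlier = {"cell_types": ["Purkinje"], "genes": ["Pcp2"], "parameter_family": "local_distance"}
candidate = {"cell_types": ["Purkinje"], "genes": ["Pcp2"], "parameter_family": "radial_position"}
novelty = history_novelty(candidate, [earlier])
```

The two signatures share two of four distinct tokens, so the candidate scores 0.5, which would fall below the required_novelty of 0.65 used in the frontier example.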

The final set of ideas should be selected as a portfolio, not accepted in the order returned by agents.

Example scoring:

final_score =
  0.30 * novelty
+ 0.25 * evidence_relevance
+ 0.20 * coverage_gain
+ 0.15 * feasibility
+ 0.10 * control_value
- redundancy_penalty
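The example weights translate directly into a small function. The component names follow the formula above; treating a missing component as 0.0 is an assumption of this sketch.

```python
WEIGHTS = {
    "novelty": 0.30,
    "evidence_relevance": 0.25,
    "coverage_gain": 0.20,
    "feasibility": 0.15,
    "control_value": 0.10,
}


def final_score(review):
    """Weighted sum of review components minus the redundancy penalty."""
    weighted = sum(
        weight * review.get(name, 0.0) for name, weight in WEIGHTS.items()
    )
    return weighted - review.get("redundancy_penalty", 0.0)


review = {
    "novelty": 0.8,
    "evidence_relevance": 0.6,
    "coverage_gain": 0.5,
    "feasibility": 1.0,
    "control_value": 0.0,
    "redundancy_penalty": 0.1,
}
score = final_score(review)
```

For this review the weighted terms are 0.24, 0.15, 0.10, 0.15, and 0.00, giving 0.64 before the penalty and about 0.54 after it.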

Example portfolio constraints:

No more than K ideas per marker.
No more than K ideas per cell type.
No more than K ideas per parameter family.
At least N RNA-linked ideas.
At least N literature-guided ideas.
At least N negative-control or robustness ideas.
At least N follow-ups to supported or contradicted findings.
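The "no more than K" constraints can be enforced with a greedy pass over ideas sorted by score. The sketch below covers only the per-marker and per-cell-type caps and ignores the "at least N" minimums, which would need a second pass or a different solver. The name greedy_portfolio and its parameters are hypothetical, and marker_x is a placeholder value.

```python
from collections import Counter


def greedy_portfolio(scored_ideas, max_per_marker, max_per_cell_type, budget):
    """Pick ideas by descending score, skipping any that exceed a facet cap."""
    marker_counts, cell_type_counts = Counter(), Counter()
    selected = []
    ranked = sorted(scored_ideas, key=lambda idea: idea["final_score"], reverse=True)
    for idea in ranked:
        if len(selected) >= budget:
            break
        markers = idea.get("markers", [])
        cell_types = idea.get("cell_types", [])
        if any(marker_counts[m] >= max_per_marker for m in markers):
            continue
        if any(cell_type_counts[c] >= max_per_cell_type for c in cell_types):
            continue
        selected.append(idea)
        marker_counts.update(markers)
        cell_type_counts.update(cell_types)
    return selected


ideas = [
    {"idea_id": "a", "final_score": 0.9, "markers": ["H3K27ac"], "cell_types": ["Purkinje"]},
    {"idea_id": "b", "final_score": 0.8, "markers": ["H3K27ac"], "cell_types": ["Granule"]},
    {"idea_id": "c", "final_score": 0.7, "markers": ["marker_x"], "cell_types": ["Bergmann"]},
]
picked = greedy_portfolio(ideas, max_per_marker=1, max_per_cell_type=2, budget=2)
picked_ids = [idea["idea_id"] for idea in picked]
```

Idea b is skipped even though it outscores c, because the H3K27ac cap is already used by a. This is the behavior that separates portfolio selection from accepting ideas in returned order.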

Evidence-driven Follow-up

New ideas should be derived from previous evidence states.

Suggested follow-up policies:

Supported
  Generate replication, refinement, mechanism, cross-cell-type, and
  negative-control ideas.

Contradicted
  Generate inverted-direction ideas, alternative operational definitions, and
  confounder checks.

Borderline
  Improve power, reduce noise, change aggregation, or test an adjacent marker.

Not supported
  Avoid exact repeats. Generate orthogonal formulations or explanatory null
  checks.

Verification failed
  Repair data access, simplify the parameter, or add missing data checks before
  biological interpretation.
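These policies could live in a plain lookup table so that the frontier planner and the documentation stay in sync. The strategy identifiers below are hypothetical labels derived from the text above, not an agreed vocabulary.

```python
FOLLOW_UP_POLICIES = {
    "Supported": [
        "replication", "refinement", "mechanism",
        "cross_cell_type", "negative_control",
    ],
    "Contradicted": [
        "inverted_direction", "alternative_definition", "confounder_check",
    ],
    "Borderline": [
        "improve_power", "reduce_noise", "change_aggregation", "adjacent_marker",
    ],
    "Not supported": ["orthogonal_formulation", "explanatory_null_check"],
    "Verification failed": [
        "repair_data_access", "simplify_parameter", "add_data_checks",
    ],
}


def follow_up_strategies(evidence_state):
    """Look up follow-up strategies; unknown states raise instead of guessing."""
    try:
        return list(FOLLOW_UP_POLICIES[evidence_state])
    except KeyError:
        raise ValueError(f"unknown evidence state: {evidence_state!r}") from None


contradicted = follow_up_strategies("Contradicted")
```

Raising on an unknown state keeps a typo in a result file from silently producing no follow-ups.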

This turns each run into a source of new questions.

Notebook Agent Contract

The notebook agent should receive one accepted idea plus a focused graph slice. It should not browse the web and should not decide the next global direction.

Required notebook outputs:

idea brief
data/schema checks
inspection code cell
main analysis code cell
explicit hypothesis test
result_table
analysis_summary
statistical matplotlib figure
verification output
final interpretation

The runner then re-executes the notebook and writes an EvidenceResult node:

{
  "idea_id": "...",
  "notebook_status": "pass",
  "hypothesis_status": "Supported",
  "test_method": "...",
  "p_value": 0.01,
  "effect_size": 0.42,
  "observed_statistic": 0.42,
  "figure_path": "...",
  "result_table_path": "..."
}

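The graph then connects this result back to the idea with a supports, contradicts, or not_supports edge, or with an inconclusive edge for borderline results (a relation that would need to be added to the suggested edge types).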
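A sketch of that last step, mapping a result record to a graph edge. Two assumptions are made here: "Borderline" maps to the inconclusive relation, and a result whose notebook_status is not "pass" produces no biological edge. The evidence node id format is also hypothetical.

```python
EDGE_BY_STATUS = {
    "Supported": "supports",
    "Contradicted": "contradicts",
    "Not supported": "not_supports",
    "Borderline": "inconclusive",  # assumed mapping
}


def evidence_edge(result):
    """Return (evidence_node_id, relation, idea_id), or None if no edge applies."""
    if result.get("notebook_status") != "pass":
        return None  # failed verification says nothing about the hypothesis
    relation = EDGE_BY_STATUS.get(result.get("hypothesis_status"))
    if relation is None:
        return None
    idea_id = result["idea_id"]
    return (f"evidence:{idea_id}", relation, idea_id)


edge = evidence_edge({
    "idea_id": "idea_1",
    "notebook_status": "pass",
    "hypothesis_status": "Supported",
})
```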

Storage Layout

The h5cd should stay lightweight. Full graph and agent artifacts should live in the run directory.

Recommended layout:

runs/iterative_001/
  graph/
    idea_graph.json
    idea_graph.graphml
    graph_summary.md
  knowledge/
    papers.jsonl
    claims.jsonl
    sources.bib
    retrieval_log.jsonl
    browser_records.jsonl
  frontiers.jsonl
  ideas.jsonl
  reviews.jsonl
  graph_reviews.jsonl
  selected_ideas.jsonl
  results.jsonl
  agent_records.jsonl
  report.md
  notebooks/

The h5cd pointer can remain small:

{
  "schema_hash": "...",
  "latest_graph_path": "runs/iterative_001/graph/idea_graph.json",
  "n_ideas": 80,
  "n_verified": 62,
  "coverage_summary": {
    "cell_types": {},
    "markers": {},
    "genes": {},
    "parameter_families": {}
  }
}
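A pointer of this shape could be assembled from the run artifacts as sketched below. The function is hypothetical; it assumes ideas carry list-valued facets and that n_verified counts results whose notebook_status is "pass".

```python
from collections import Counter

COVERAGE_FACETS = ("cell_types", "markers", "genes", "parameter_families")


def lightweight_summary(schema_hash, graph_path, ideas, results):
    """Build the small pointer dict that is stored in the h5cd."""
    coverage = {facet: Counter() for facet in COVERAGE_FACETS}
    for idea in ideas:
        for facet, counter in coverage.items():
            counter.update(idea.get(facet, []))
    n_verified = sum(1 for r in results if r.get("notebook_status") == "pass")
    return {
        "schema_hash": schema_hash,
        "latest_graph_path": graph_path,
        "n_ideas": len(ideas),
        "n_verified": n_verified,
        # Plain dicts keep the pointer JSON-serializable.
        "coverage_summary": {f: dict(c) for f, c in coverage.items()},
    }


ideas = [
    {"idea_id": "i1", "cell_types": ["Purkinje"], "genes": ["Pcp2"]},
    {"idea_id": "i2", "cell_types": ["Purkinje"], "markers": ["H3K27ac"]},
]
results = [
    {"idea_id": "i1", "notebook_status": "pass"},
    {"idea_id": "i2", "notebook_status": "fail"},
]
summary = lightweight_summary(
    "abc123", "runs/iterative_001/graph/idea_graph.json", ideas, results
)
```

Only counts and a path are stored, so the pointer stays small no matter how large the run directory grows.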

Proposed Implementation Phases

Phase 1: Schema context

Add dataset references and user annotations to ChromData and to the discovery schema.

Deliverables:

  • cdata.add_reference(...)

  • cdata.add_user_annotation(...)

  • schema["references"]

  • schema["user_annotations"]

  • validation and round-trip tests

Phase 2: IdeaGraph MVP

Build a graph from existing run artifacts.

Deliverables:

  • uchrom.auto_discovery.graph

  • graph import from ideas.jsonl, reviews.jsonl, results.jsonl

  • evidence nodes from classify_hypothesis_evidence

  • graph export as JSON and GraphML

  • graph summary Markdown

Phase 3: Novelty and coverage review

Add structured idea signatures and graph-aware review.

Deliverables:

  • idea_signature(idea)

  • review_idea_against_graph(idea, graph)

  • nearest-neighbor duplicate detection

  • coverage gain scores

  • portfolio selector

Phase 4: Dynamic frontier planner

Replace fixed direction buckets with graph-derived frontiers.

Deliverables:

  • plan_discovery_frontiers(schema, graph, ...)

  • frontier JSON schema

  • Pantheon idea-agent prompts that consume frontiers

  • CLI flags for iterative mode and history runs

Phase 5: External knowledge layer

Add browser/literature ingestion as a separate step.

Deliverables:

  • literature-refresh CLI

  • source and claim schemas

  • browser agent records

  • claim-to-graph ingestion

  • citation/provenance checks

Phase 6: Iterative orchestrator

Run complete graph-guided discovery loops.

Deliverables:

  • an iterative-run subcommand or a run --iterative flag

  • budgeted multi-iteration scheduling

  • graph snapshots per iteration

  • cdata graph pointer updates

  • documentation and Takei example update

Open Questions

  • Should the graph use networkx internally, a lightweight custom JSON graph, or both?

  • How much of the graph summary should be written back into h5cd versus kept only in run directories?

  • What should be the default portfolio constraints for small datasets?

  • How should literature claim confidence be represented when sources disagree?

  • Should browser refresh be opt-in only to keep runs reproducible by default?

  • How should multiple users’ annotations be namespaced and versioned?

Summary

The proposed system turns auto-discovery into an evidence-guided loop:

Browser agents enrich the graph.
Idea agents propose graph-linked hypotheses.
Notebook agents validate graph-linked hypotheses.
The runner classifies evidence.
The graph plans the next round.
ChromData anchors everything to real data.

This provides a path from automated notebook generation toward a persistent, auditable, literature-aware discovery engine for spatial multi-omics and chromatin tracing data.