Multi-parent snapshot lineage #203

@larstalian

Context

Snapshot.lineage today is a single-parent ancestry chain — each
snapshot has a parent_id pointing to the snapshot it derived from
(via evolve()). It's a nice clean structure for "child snapshot is
the parent's curriculum step."

It breaks down once training puts more demands on it:

  • Multiple children from one parent: a curriculum runner that
    forks one parent into N variants ("patch sql_injection vs add ssrf
    vs harden accounts") wants all three to have the same parent_id —
    fine. But it also wants to remember they were generated as a
    batch from the same parent state. Today's chain doesn't capture
    that batching.
  • Multiple parents per child: counterfactual training compares
    rollouts across two related-but-distinct snapshots; a derived
    snapshot might want to reference both as "I'm derived from these."
    A single-parent chain can't represent that.
  • Lineage as the training data store: the long-term frame from
    the meeting notes is "snapshots are the training data; the lineage
    graph IS the curriculum." A pool / DAG fits that frame; a chain
    doesn't.

What this issue tracks

Promote Snapshot.lineage from a single-parent chain to a
multi-parent reference graph. The graph should be persistent
(snapshots store their relationships) and queryable (you can ask
"what derived from X" and "what is Y's full provenance").
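For discussion, here is a minimal sketch of what a multi-parent node could look like. `LineageNodeSketch` and its `parent_ids` field are hypothetical names for illustration, not the existing LineageNode; the design doc may land on a different shape.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LineageNodeSketch:
    """Hypothetical node shape; not the project's actual LineageNode."""

    snapshot_id: str
    # Zero parents = root, one = today's chain, two or more = merge.
    parent_ids: tuple[str, ...] = ()


# One parent forked into three variants: each child has the same single parent.
root = LineageNodeSketch("base")
forks = [
    LineageNodeSketch(name, parent_ids=("base",))
    for name in ("patch_sql_injection", "add_ssrf", "harden_accounts")
]

# A counterfactual snapshot that references two related snapshots as parents.
merged = LineageNodeSketch(
    "counterfactual", parent_ids=("patch_sql_injection", "add_ssrf")
)
```

A tuple keeps the node hashable and immutable; whether parent order should carry meaning is one of the things the design doc would need to decide.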

Why it matters

The training-loop integration (#198) and the curriculum-driven demo
(#200) will both push on the chain shape. Promoting it before they
land is cheaper than refactoring after.

Where to start

  • src/openrange/core/snapshot.py: the LineageNode dataclass and how
    it's serialized in Snapshot.as_dict / from_mapping.
  • src/openrange/core/store.py: SnapshotStore. If we want
    efficient "what derived from X" queries, the store needs to index
    by parent (it doesn't today).
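As a rough illustration of the "index by parent" idea, a reverse index from parent id to direct children might look like the sketch below. `ParentIndexSketch` and its methods are hypothetical and are not part of SnapshotStore today; a real version would also need to be rebuilt or persisted alongside the on-disk snapshots.

```python
from __future__ import annotations

from collections import defaultdict
from collections.abc import Iterable


class ParentIndexSketch:
    """Hypothetical reverse index: parent id -> ids of direct children."""

    def __init__(self) -> None:
        self._children: defaultdict[str, set[str]] = defaultdict(set)

    def add(self, snapshot_id: str, parent_ids: Iterable[str]) -> None:
        # Register the snapshot under every parent it references.
        for parent_id in parent_ids:
            self._children[parent_id].add(snapshot_id)

    def children_of(self, snapshot_id: str) -> set[str]:
        # .get avoids creating empty entries for unknown ids;
        # the copy keeps callers from mutating the index.
        return set(self._children.get(snapshot_id, ()))
```

With this in place, "what derived from X" is a dictionary lookup instead of a scan over every stored snapshot.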

Design questions

Open a design doc PR; these questions are unresolved:

  • Multi-parent shape: explicit list of parents per snapshot, or
    a sibling-reference mechanism (snapshot ↔ snapshot relations
    outside ancestry)?
  • Curriculum metadata: when a snapshot is generated as part of a
    batch, do we record the batch identifier? A "curriculum cohort"
    concept?
  • Backwards compatibility: existing snapshots have single-parent
    chains. The new model needs to accept those without a migration.
  • Storage: today it's JSON on disk. Multi-parent + queries
    could push for SQLite / similar.
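On the backwards-compatibility question, one possible approach is to normalize at load time rather than migrate files. The sketch below assumes the serialized form uses a `parent_id` key for legacy snapshots and a hypothetical `parent_ids` key for new ones; the actual key names in Snapshot.as_dict / from_mapping may differ.

```python
from __future__ import annotations

from collections.abc import Mapping
from typing import Any


def parent_ids_from_mapping(data: Mapping[str, Any]) -> tuple[str, ...]:
    """Read parents from either the legacy or the (hypothetical) new layout."""
    if "parent_ids" in data:  # hypothetical new key: list of parent ids
        return tuple(data["parent_ids"])
    legacy = data.get("parent_id")  # assumed legacy key: single parent or None
    return () if legacy is None else (legacy,)
```

Under this scheme a legacy snapshot simply loads as a node with zero or one parent, so existing JSON files would not need rewriting.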

Acceptance

  • Design doc PR landed.
  • Snapshot.lineage supports multiple parents.
  • SnapshotStore exposes "children of X" and "ancestors of Y"
    queries.
  • Existing single-parent snapshots load correctly under the new
    model.
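If the storage question ends up favoring SQLite, both queries could be served from a single edge table. The following is only a sketch using Python's standard `sqlite3` module; the table name, column names, and function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per (child, parent) edge; a child with two parents has two rows.
conn.execute(
    "CREATE TABLE lineage_edge ("
    " child_id TEXT NOT NULL,"
    " parent_id TEXT NOT NULL,"
    " PRIMARY KEY (child_id, parent_id))"
)
# Index by parent so "children of X" does not scan the whole table.
conn.execute("CREATE INDEX lineage_edge_parent ON lineage_edge (parent_id)")
conn.executemany(
    "INSERT INTO lineage_edge (child_id, parent_id) VALUES (?, ?)",
    [("a", "base"), ("b", "base"), ("m", "a"), ("m", "b")],
)


def children_of(conn: sqlite3.Connection, snapshot_id: str) -> set[str]:
    rows = conn.execute(
        "SELECT child_id FROM lineage_edge WHERE parent_id = ?", (snapshot_id,)
    )
    return {row[0] for row in rows}


def ancestors_of(conn: sqlite3.Connection, snapshot_id: str) -> set[str]:
    # Recursive CTE walks parent edges upward; UNION (not UNION ALL)
    # deduplicates, so diamond-shaped lineage yields each ancestor once.
    rows = conn.execute(
        """
        WITH RECURSIVE ancestor(id) AS (
            SELECT parent_id FROM lineage_edge WHERE child_id = ?
            UNION
            SELECT e.parent_id FROM lineage_edge AS e
            JOIN ancestor AS a ON e.child_id = a.id
        )
        SELECT id FROM ancestor
        """,
        (snapshot_id,),
    )
    return {row[0] for row in rows}
```

Here `m` has two parents that share a grandparent, which is the diamond case a single-parent chain cannot express.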

Notes

Defer until the training-loop work pushes back on the chain shape.
This issue is filed for visibility, not as a near-term priority.

Labels: core (Core library / runtime / admission), design-needed (Needs a design pass before code), roadmap (Tracked on the public roadmap)