Skip to content

Distribute EpisodeService across hosts #204

@larstalian

Description

@larstalian

Context

EpisodeService holds running episodes in a local
dict[str, _RunningEpisode]. One process, one machine, one in-memory
registry. That's fine for the current use cases (eval scripts, single-
user demos).

Distributed training fleets need a different model:

  • The trainer fans out N rollouts in parallel across M workers
  • Each worker runs its episodes locally but the dispatcher needs to
    see global state (which episodes are live? which ones errored?)
  • Snapshots and episode artifacts live on shared storage, not
    per-worker disks
  • Health, restart, and stragglers need real handling

What this issue tracks

Generalize EpisodeService so it can run across multiple hosts.
Concretely:

  • Pluggable backend for the in-memory registry (Redis / etcd /
    similar) — the dict becomes one impl, distributed becomes another
  • Episode IDs that are globally unique (today they're per-process
    uuids — could collide across hosts in theory, ok in practice)
  • Heartbeat / liveness so the orchestrator can detect a dead worker
  • Shared storage abstraction for run artifacts (today's local
    filesystem becomes one impl)

Why it matters

This is the difference between OpenRange-as-tool (single-machine
eval, what we have) and OpenRange-as-platform (training fleets,
benchmark farms, multi-tenant). The architecture allows it; nothing
ships today that does it.

Where to start

  • src/openrange/core/episode.py — the EpisodeService class.
    _RunningEpisode is the per-episode state today.
  • src/openrange/core/store.py — the SnapshotStore is the
    filesystem-shaped artifact store.
  • src/openrange/runtime.pyOpenRangeRun wraps the service +
    artifact store + dashboard. Distributed mode probably introduces
    a "controller" role on top of it.

Suggested phasing

Open a design doc PR first — distributed-systems shape decisions are
expensive to change later:

  1. Define what "controller" / "worker" mean in the OpenRange model.
  2. Pick the registry backend (Redis is the obvious low-effort
    choice).
  3. Decide the shared-storage abstraction (S3 / NFS / pluggable).
  4. Implement behind a flag.

Acceptance

  • Design doc PR landed.
  • A multi-host run works: 1 controller + N workers, each running
    episodes, all visible in one dashboard / report.
  • Single-host mode still works unchanged (no perf regression for
    the common case).

Notes

Out of scope for this issue: actual training-loop orchestration on
top of distributed episodes — that's the trainer's concern (#198).
This issue is just the service primitive.

Pick this up when someone actually needs it. Filed for visibility.

Metadata

Metadata

Assignees

No one assigned

    Labels

    coreCore library / runtime / admissiondesign-neededNeeds a design pass before coderoadmapTracked on the public roadmap

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions