Context
EpisodeService holds running episodes in a local
dict[str, _RunningEpisode]. One process, one machine, one in-memory
registry. That's fine for the current use cases (eval scripts, single-
user demos).
Distributed training fleets need a different model:
- The trainer fans out N rollouts in parallel across M workers
- Each worker runs its episodes locally but the dispatcher needs to
see global state (which episodes are live? which ones errored?)
- Snapshots and episode artifacts live on shared storage, not
per-worker disks
- Health, restart, and stragglers need real handling
What this issue tracks
Generalize EpisodeService so it can run across multiple hosts.
Concretely:
- Pluggable backend for the in-memory registry (Redis / etcd /
similar) — the dict becomes one impl, distributed becomes another
- Episode IDs that are globally unique (today they're per-process
uuids — could collide across hosts in theory, ok in practice)
- Heartbeat / liveness so the orchestrator can detect a dead worker
- Shared storage abstraction for run artifacts (today's local
filesystem becomes one impl)
Why it matters
This is the difference between OpenRange-as-tool (single-machine
eval, what we have) and OpenRange-as-platform (training fleets,
benchmark farms, multi-tenant). The architecture allows it; nothing
ships today that does it.
Where to start
src/openrange/core/episode.py — the EpisodeService class.
_RunningEpisode is the per-episode state today.
src/openrange/core/store.py — the SnapshotStore is the
filesystem-shaped artifact store.
src/openrange/runtime.py — OpenRangeRun wraps the service +
artifact store + dashboard. Distributed mode probably introduces
a "controller" role on top of it.
Suggested phasing
Open a design doc PR first — distributed-systems shape decisions are
expensive to change later:
- Define what "controller" / "worker" mean in the OpenRange model.
- Pick the registry backend (Redis is the obvious low-effort
choice).
- Decide the shared-storage abstraction (S3 / NFS / pluggable).
- Implement behind a flag.
Acceptance
- Design doc PR landed.
- A multi-host run works: 1 controller + N workers, each running
episodes, all visible in one dashboard / report.
- Single-host mode still works unchanged (no perf regression for
the common case).
Notes
Out of scope for this issue: actual training-loop orchestration on
top of distributed episodes — that's the trainer's concern (#198).
This issue is just the service primitive.
Pick this up when someone actually needs it. Filed for visibility.
Context
EpisodeServiceholds running episodes in a localdict[str, _RunningEpisode]. One process, one machine, one in-memoryregistry. That's fine for the current use cases (eval scripts, single-
user demos).
Distributed training fleets need a different model:
see global state (which episodes are live? which ones errored?)
per-worker disks
What this issue tracks
Generalize
EpisodeServiceso it can run across multiple hosts.Concretely:
similar) — the dict becomes one impl, distributed becomes another
uuids — could collide across hosts in theory, ok in practice)
filesystem becomes one impl)
Why it matters
This is the difference between OpenRange-as-tool (single-machine
eval, what we have) and OpenRange-as-platform (training fleets,
benchmark farms, multi-tenant). The architecture allows it; nothing
ships today that does it.
Where to start
src/openrange/core/episode.py— theEpisodeServiceclass._RunningEpisodeis the per-episode state today.src/openrange/core/store.py— theSnapshotStoreis thefilesystem-shaped artifact store.
src/openrange/runtime.py—OpenRangeRunwraps the service +artifact store + dashboard. Distributed mode probably introduces
a "controller" role on top of it.
Suggested phasing
Open a design doc PR first — distributed-systems shape decisions are
expensive to change later:
choice).
Acceptance
episodes, all visible in one dashboard / report.
the common case).
Notes
Out of scope for this issue: actual training-loop orchestration on
top of distributed episodes — that's the trainer's concern (#198).
This issue is just the service primitive.
Pick this up when someone actually needs it. Filed for visibility.