Distribute EpisodeService across hosts #204

## Context

`EpisodeService` holds running episodes in a local
`dict[str, _RunningEpisode]`. One process, one machine, one in-memory
registry. That's fine for the current use cases (eval scripts,
single-user demos).

Distributed training fleets need a different model:

- The trainer fans out N rollouts in parallel across M workers
- Each worker runs its episodes locally but the dispatcher needs to
  see global state (which episodes are live? which ones errored?)
- Snapshots and episode artifacts live on shared storage, not
  per-worker disks
- Worker health checks, restarts after a crash, and straggling
  episodes need explicit handling rather than best effort

## What this issue tracks

Generalize `EpisodeService` so it can run across multiple hosts.
Concretely:

- Pluggable backend for the in-memory registry (Redis / etcd /
  similar) — the dict becomes one impl, distributed becomes another
- Episode IDs that are guaranteed globally unique (today they're
  uuids generated per process; a cross-host collision is possible in
  theory but negligible in practice)
- Heartbeat / liveness so the orchestrator can detect a dead worker
- Shared storage abstraction for run artifacts (today's local
  filesystem becomes one impl)
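
A minimal sketch of how the registry, ID, and heartbeat items above could fit behind one interface. Everything here is hypothetical: `EpisodeRecord`, `EpisodeRegistry`, and `InMemoryRegistry` are illustrative names, not existing OpenRange types, and the real `_RunningEpisode` state may not map onto a flat record like this. The in-memory class stands in for today's dict; a Redis- or etcd-backed class would implement the same methods.

```python
from __future__ import annotations

import time
import uuid
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class EpisodeRecord:
    """Hypothetical serializable view of one episode (not an OpenRange type)."""

    episode_id: str
    worker_id: str
    status: str = "running"  # assumed states: "running", "done", "errored"
    last_heartbeat: float = 0.0


class EpisodeRegistry(Protocol):
    """Hypothetical interface shared by local and distributed backends."""

    def register(self, worker_id: str) -> EpisodeRecord: ...
    def heartbeat(self, episode_id: str) -> None: ...
    def set_status(self, episode_id: str, status: str) -> None: ...
    def live(self, timeout_s: float) -> list[EpisodeRecord]: ...
    def stale(self, timeout_s: float) -> list[EpisodeRecord]: ...


class InMemoryRegistry:
    """Dict-backed implementation, equivalent in spirit to the current registry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        # A real distributed backend would need a clock shared across hosts
        # (for example the store's own TTLs); a local clock is an assumption here.
        self._clock = clock
        self._records: dict[str, EpisodeRecord] = {}

    def register(self, worker_id: str) -> EpisodeRecord:
        # uuid4 is random, so IDs do not depend on the host or process.
        record = EpisodeRecord(
            episode_id=uuid.uuid4().hex,
            worker_id=worker_id,
            last_heartbeat=self._clock(),
        )
        self._records[record.episode_id] = record
        return record

    def heartbeat(self, episode_id: str) -> None:
        self._records[episode_id].last_heartbeat = self._clock()

    def set_status(self, episode_id: str, status: str) -> None:
        self._records[episode_id].status = status

    def live(self, timeout_s: float) -> list[EpisodeRecord]:
        now = self._clock()
        return [
            r for r in self._records.values()
            if r.status == "running" and now - r.last_heartbeat <= timeout_s
        ]

    def stale(self, timeout_s: float) -> list[EpisodeRecord]:
        # Running episodes whose worker has stopped reporting in.
        now = self._clock()
        return [
            r for r in self._records.values()
            if r.status == "running" and now - r.last_heartbeat > timeout_s
        ]
```

With this shape, the orchestrator would poll `stale(timeout_s)` to find episodes whose worker has gone quiet. Whether liveness should be tracked per episode or per worker is one of the decisions the design doc would need to settle.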

## Why it matters

This is the difference between OpenRange-as-tool (single-machine
eval, what we have) and OpenRange-as-platform (training fleets,
benchmark farms, multi-tenant). The architecture allows it; nothing
ships today that does it.

## Where to start

- `src/openrange/core/episode.py` — the `EpisodeService` class.
  `_RunningEpisode` is the per-episode state today.
- `src/openrange/core/store.py` — the `SnapshotStore` is the
  filesystem-shaped artifact store.
- `src/openrange/runtime.py` — `OpenRangeRun` wraps the service +
  artifact store + dashboard. Distributed mode probably introduces
  a "controller" role on top of it.
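
For the artifact side, one possible shape is a small key/value interface with the local filesystem as the first implementation. This is a sketch under assumptions: `ArtifactStore` and `LocalArtifactStore` are hypothetical names, and the methods below do not mirror the actual `SnapshotStore` API, which this issue does not describe.

```python
from __future__ import annotations

from pathlib import Path
from typing import Protocol


class ArtifactStore(Protocol):
    """Hypothetical interface; an S3 or NFS backend would implement the same calls."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def keys(self, prefix: str = "") -> list[str]: ...


class LocalArtifactStore:
    """Filesystem-backed implementation rooted at a single directory."""

    def __init__(self, root: Path) -> None:
        self._root = Path(root).resolve()
        self._root.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        path = (self._root / key).resolve()
        # Reject keys such as "../x" that would escape the store root.
        if not path.is_relative_to(self._root):
            raise ValueError(f"key escapes store root: {key!r}")
        return path

    def put(self, key: str, data: bytes) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return self._path(key).read_bytes()

    def keys(self, prefix: str = "") -> list[str]:
        # Keys are POSIX-style paths relative to the root, sorted for stable output.
        found = (
            p.relative_to(self._root).as_posix()
            for p in self._root.rglob("*")
            if p.is_file()
        )
        return sorted(k for k in found if k.startswith(prefix))
```

Keeping keys as plain strings with `/` separators means the same key works for an object store, where there are no real directories.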

## Suggested phasing

Open a design doc PR first — distributed-systems shape decisions are
expensive to change later:

1. Define what "controller" / "worker" mean in the OpenRange model.
2. Pick the registry backend (Redis is the obvious low-effort
   choice).
3. Decide the shared-storage abstraction (S3 / NFS / pluggable).
4. Implement behind a flag.
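
As an illustration of step 4, the flag could be as small as an environment toggle that defaults to single-host mode. The variable names `OPENRANGE_DISTRIBUTED` and `OPENRANGE_REGISTRY_URL` and the `ServiceMode` type are assumptions made up for this sketch; nothing in OpenRange reads them today.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceMode:
    """Hypothetical resolved configuration for EpisodeService."""

    distributed: bool
    registry_url: str | None = None


def mode_from_env(env: Mapping[str, str] | None = None) -> ServiceMode:
    """Resolve the mode from environment variables (names are assumptions)."""
    env = os.environ if env is None else env
    # Default stays single-host so existing callers are unaffected.
    if env.get("OPENRANGE_DISTRIBUTED", "0") != "1":
        return ServiceMode(distributed=False)
    url = env.get("OPENRANGE_REGISTRY_URL")
    if not url:
        raise ValueError("distributed mode requires OPENRANGE_REGISTRY_URL")
    return ServiceMode(distributed=True, registry_url=url)
```

Defaulting to the local path keeps the single-host acceptance criterion easy to verify: with no flag set, no distributed code runs.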

## Acceptance

- Design doc PR landed.
- A multi-host run works: 1 controller + N workers, each running
  episodes, all visible in one dashboard / report.
- Single-host mode still works unchanged (no perf regression for
  the common case).

## Notes

Out of scope for this issue: actual training-loop orchestration on
top of distributed episodes — that's the trainer's concern (#198).
This issue is just the service primitive.

Pick this up when someone actually needs it. Filed for visibility.

