Session-scoped Dask scheduler #105

Open
EiffL wants to merge 1 commit into main from session-scoped-dask-scheduler
EiffL commented Apr 30, 2026

Summary

The Dask scheduler now outlives a single lc run: spawned on first use, reused across subsequent runs in the same Claude session, torn down on SessionEnd or after 30 minutes idle.

  • One scheduler per execution context. The project on a laptop; the SLURM allocation inside salloc (so srun-spawned workers are reused across every run in the allocation rather than respawned each time).
  • NERSC-correct. State lives under the existing scratch layout (<scratch>/.lightcone/dask-scheduler/<key>/), so flock semantics inherit from the rest of the lock layer.
  • Crash-safe via one primitive. The daemon holds flock on owner.lock for its lifetime; the kernel releases it on clean exit, crash, or SIGKILL. Liveness is probed by trying to take the lock — no PID files, no heartbeats. Idle-timeout (Dask's built-in Scheduler.idle_timeout) is the safety net for the rare case where the SessionEnd hook never fires.
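The lock-as-liveness probe described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the helper name `owner_is_alive` is hypothetical, and it assumes a POSIX system where `fcntl.flock` is available.

```python
import fcntl
import os
from pathlib import Path


def owner_is_alive(lock_path: Path) -> bool:
    """Return True if some process holds an exclusive flock on lock_path.

    Hypothetical helper: the probe is simply "try to take the lock".
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # Non-blocking attempt: fails immediately if the lock is held.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True  # someone holds the lock, so the owner is alive
        # We acquired it, so nobody owned it; release before returning.
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)
```

Because the probe briefly takes the lock itself, a real implementation would presumably serialize probing and spawning behind a separate spawn lock, as the test plan mentions.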

What changes

  • New src/lightcone/engine/dask_daemon.py: ensure_scheduler, stop, and a python -m daemon entrypoint.
  • src/lightcone/engine/dask_cluster.py shrinks from 200 lines of three branches to a connect-only context manager. The LocalCluster and SLURM-srun branches moved into the daemon, where their lifetime is now session-long.
  • The DASK_SCHEDULER_ADDRESS escape hatch is preserved for CI / external schedulers.
  • New lc dask stop command sends SIGTERM to the running scheduler. The new SessionEnd hook (claude/lightcone/scripts/session-end.sh) calls it for prompt cleanup. Both are best-effort — silent on every "nothing to stop" path.
  • RunDirs.dask_local removed: the daemon owns spill placement (per-key, not per-run) under its own scheduler dir.
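To make the "connect-only" idea concrete, here is a rough sketch of how a client might decide which scheduler address to connect to. The function name `resolve_scheduler_address` is hypothetical, and the sketch assumes scheduler.json is a Dask scheduler file, which is JSON containing an "address" field; the PR's actual lookup may differ.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_scheduler_address(
    scheduler_dir: Path,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Pick the address to connect to, or None if a scheduler must be spawned.

    Hypothetical helper; the directory layout is an assumption.
    """
    # Escape hatch for CI / external schedulers wins over everything else.
    override = env.get("DASK_SCHEDULER_ADDRESS")
    if override:
        return override
    try:
        info = json.loads((scheduler_dir / "scheduler.json").read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return None  # no usable state on disk
    return info.get("address") if isinstance(info, dict) else None
```

The returned string could then be passed to `distributed.Client(address)`; a `None` result would be the cue to ask the daemon for a fresh scheduler.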

Test plan

  • uv run pytest — all 295 tests pass, including the new tests/test_dask_daemon.py (key/dirs layout, reuse-vs-spawn, stale-state cleanup, spawn timeout, stop semantics).
  • uv run ruff check src/ tests/ — clean.
  • uv run mypy src/ — no new errors (only pre-existing third-party stub gaps).
  • End-to-end smoke: spawn → reuse → submit task → stop, verifying scheduler.json and meta.json are cleaned up post-shutdown.
  • SIGKILL recovery: kill -9 the daemon → next ensure_scheduler probes the dead address, takes spawn lock, cleans stale files, spawns fresh daemon at a new address.
  • Manual: run two lc run invocations in one Claude session and confirm the second uses the same scheduler (no second startup latency, no second LocalCluster log line).
  • Manual on Perlmutter (next salloc): confirm the SLURM-keyed scheduler is shared across runs in one allocation, and that SLURM cgroup cleanup tears it down at allocation end.
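The SIGKILL-recovery item relies on the kernel releasing the flock when the holder dies. A self-contained way to check that property in isolation might look like the sketch below. It is not taken from tests/test_dask_daemon.py; the stand-in "daemon" is a tiny child process, and all names are hypothetical.

```python
import fcntl
import signal
import subprocess
import sys
from pathlib import Path
from typing import Tuple

# Stand-in for the daemon: take the lock, announce readiness, then idle.
HOLDER = """
import fcntl, sys, time
f = open(sys.argv[1], "w")
fcntl.flock(f, fcntl.LOCK_EX)
print("ready", flush=True)
time.sleep(60)
"""


def lock_is_free(lock_path: Path) -> bool:
    """Try the lock without blocking; closing the file releases it again."""
    with open(lock_path, "a") as f:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False
        return True


def sigkill_releases_lock(lock_path: Path) -> Tuple[bool, bool]:
    """Return (free while holder is alive, free after kill -9)."""
    proc = subprocess.Popen(
        [sys.executable, "-c", HOLDER, str(lock_path)],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        assert proc.stdout is not None
        proc.stdout.readline()  # wait until the child holds the lock
        free_while_alive = lock_is_free(lock_path)
        proc.send_signal(signal.SIGKILL)
        proc.wait()  # once reaped, the kernel has closed the child's files
        free_after_kill = lock_is_free(lock_path)
    finally:
        if proc.poll() is None:
            proc.kill()
            proc.wait()
        if proc.stdout is not None:
            proc.stdout.close()
    return free_while_alive, free_after_kill
```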

🤖 Generated with Claude Code

The scheduler now outlives a single ``lc run``: spawned on first use,
reused across subsequent runs in the same Claude session, torn down on
SessionEnd or after 30 minutes idle.

One scheduler per execution context — the project on a laptop, the
SLURM allocation inside ``salloc`` (so ``srun``-spawned workers are
reused across every run in the allocation, not respawned each time).
Storage lives under the existing scratch layout
(``<scratch>/.lightcone/dask-scheduler/<key>/``) so NERSC's Lustre
flock semantics are inherited rather than re-derived.

Crash safety hangs on one primitive: the daemon holds ``flock`` on
``owner.lock`` for its lifetime, and the kernel releases it when the
process dies (clean exit, crash, SIGKILL). Liveness is therefore
probed by trying to take the lock — never by PID files or heartbeats.
Idle-timeout is the safety net for the residual case where SessionEnd
never fires (Claude killed without a clean exit).

``cluster_for_run`` shrinks from 200 lines of three branches to a
connect-only context manager. The LocalCluster and SLURM-srun branches
move into the daemon, where their lifetime is now session-long. The
``DASK_SCHEDULER_ADDRESS`` escape hatch is preserved for CI/external
schedulers.

The new ``lc dask stop`` command sends SIGTERM to the running
scheduler; the SessionEnd hook calls it for prompt cleanup. Both are
best-effort: silent on every "nothing to stop" path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

❌ Eval Results

| Metric | Value |
| --- | --- |
| Score | 0.25 |
| Build | complete |
| Cost | $1.14 |
| Turns | 40 |
| Duration | 434s |
| lightcone-cli | 0.3.5.dev2+ge63cdfb04 (e63cdfb0) |
| Results | Download |
Graders

✅ spec_valid (1.00)
❌ all_materialized (0.00)

Full output
22:48:56 lightcone.eval.build Building lightcone-cli wheel from /home/runner/work/lightcone-cli/lightcone-cli ...
22:49:01 lightcone.eval.build Built lightcone_cli-0.3.5.dev2+ge63cdfb04-py3-none-any.whl (commit e63cdfb0)
22:49:05 lightcone.eval.sandbox Created sandbox 0c69d9a7-5081-45a0-b645-468dbda02432 for trial build-snae-0
22:49:05 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/0c69d9a7-5081-45a0-b645-468dbda02432/files/bulk-upload "HTTP/1.1 200 OK"
22:49:06 lightcone.eval.sandbox Installed wheels: ['lightcone_cli-0.3.5.dev2+ge63cdfb04-py3-none-any.whl']
22:49:07 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/0c69d9a7-5081-45a0-b645-468dbda02432/files/bulk-upload "HTTP/1.1 200 OK"
22:49:07 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/0c69d9a7-5081-45a0-b645-468dbda02432/files/bulk-upload "HTTP/1.1 200 OK"
22:49:07 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/0c69d9a7-5081-45a0-b645-468dbda02432/files/bulk-upload "HTTP/1.1 200 OK"
22:49:08 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/0c69d9a7-5081-45a0-b645-468dbda02432/files/bulk-upload "HTTP/1.1 200 OK"
22:56:23 lightcone.eval.sandbox Deleted sandbox for trial build-snae-0
  snae trial 0: score=0.25 complete

lightcone-cli: 0.3.5.dev2+ge63cdfb04 (HEAD e63cdfb0)
ASTRA: 0.2.6

  Eval Results: Scores  
┏━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Task ┃     Score     ┃
┡━━━━━━╇━━━━━━━━━━━━━━━┩
│ snae │ 0.25 +/- 0.00 │
│      │ pass@k: 100%  │
└──────┴───────────────┘

   Eval Results: Cost &   
         Duration         
┏━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Task ┃ Cost / Duration ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ snae │      $1.14      │
│      │      434s       │
└──────┴─────────────────┘

Total: 1 trials, $1.14, 434s

Results saved to: eval-results/build-e63cdfb0/results.json


EiffL commented May 1, 2026

@claude is this the simplest way to provide the functionality of a persistent Dask cluster when executing locally or within an 'salloc' job?

Also, we do not want to introduce a command-line-level mention of Dask; the CLI should remain agnostic to the execution backend.


claude Bot commented May 1, 2026

Claude finished @EiffL's task in 1m 56s

Review: Session-scoped Dask scheduler

  • Read changed files to understand the implementation
  • Assess design complexity vs simpler alternatives
  • Review CLI surface for Dask-specific exposure

Q1: Is this the simplest way to get a persistent cluster?

For the SLURM case — roughly yes. Managing worker lifetime across srun tasks (spawning workers, retiring them gracefully on shutdown, and sharing one scheduler across all lc run invocations in the same salloc) requires orchestration that can't be delegated entirely to a stock CLI. The flock-as-liveness primitive is well-chosen: kernel-released on crash/kill, no PID-file race, one syscall to probe.

For the local case — no. Dask ships dask scheduler --scheduler-file <path> --idle-timeout "30 minutes" as a first-class CLI. Running that as a detached subprocess replaces _serve(), _start_local_cluster(), _block_until_done(), and _shutdown() with three lines. The full daemon is only justified by the SLURM branch, yet the two branches share a single 462-line entrypoint. If the SLURM path is real and needed now, keep it; but if local is the 90% case, the daemon is carrying a lot of dead weight.
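A rough sketch of the "detached subprocess" alternative for the local case follows. The flags are the ones quoted above and should be checked against the installed Dask version; the helper names and the log-file handling are assumptions made for illustration.

```python
import subprocess
from pathlib import Path
from typing import List


def scheduler_command(scheduler_file: Path, idle_timeout: str = "30 minutes") -> List[str]:
    """Build the stock Dask CLI invocation quoted in the review."""
    return [
        "dask", "scheduler",
        "--scheduler-file", str(scheduler_file),
        "--idle-timeout", idle_timeout,
    ]


def spawn_detached(cmd: List[str], log_path: Path) -> subprocess.Popen:
    """Start cmd in its own session so it survives the parent exiting."""
    log = open(log_path, "ab")
    try:
        return subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # detach from the caller's process group
        )
    finally:
        log.close()  # the child keeps its own copy of the descriptor
```

Usage would be along the lines of `spawn_detached(scheduler_command(path), log_path)`, followed by waiting for the scheduler file to appear. Note that this sketch alone does not provide the flock-based liveness guarantee discussed earlier.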

Specific complexity concerns:

  • _block_until_done polling cluster.status (dask_daemon.py:395-399) depends on Dask's internal EnumStatus repr. A str(status).rsplit(".", maxsplit=1)[-1] != "running" check will silently miss new states or break across Dask versions. A more robust probe: subscribe to cluster.scheduler.coroutines or just let SIGTERM be the only exit path and trust idle-timeout as the second line of defence.

  • stop() uses PID from meta.json (dask_daemon.py:168-184) while the stated crash-safety argument is entirely based on owner.lock. These two mechanisms are inconsistent. If stop() tried to acquire owner.lock non-blocking, it would find the PID of the lock holder via /proc/<pid>/fd or simply wouldn't need the PID at all — it could SIGTERM by lock holder. As-is, a short window exists where meta.json records a PID that has been recycled by the OS after a crash.

  • meta.json exists solely so stop() can find the PID. If stop() were redesigned to not need it (see above), the file and the write-before-cluster-start ordering dance disappear.
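The "let SIGTERM be the only exit path" option from the first bullet could be sketched as below: the daemon blocks on an event that only a signal handler sets, instead of polling a status string. The function names are hypothetical, and the sketch assumes it runs in the main thread, since Python only allows installing signal handlers there.

```python
import signal
import threading
from typing import Optional


def install_stop_event() -> threading.Event:
    """Register a SIGTERM handler that flips an event; return the event."""
    stop = threading.Event()

    def _handle(signum, frame):
        # Keep the handler trivial: just record that a stop was requested.
        stop.set()

    signal.signal(signal.SIGTERM, _handle)
    return stop


def block_until_stopped(stop: threading.Event, timeout: Optional[float] = None) -> bool:
    """Block until SIGTERM arrives; return False if the timeout expires first."""
    return stop.wait(timeout)
```

With this shape, the scheduler's own idle-timeout remains the second line of defence, and no inspection of Dask's status representation is needed.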


Q2: CLI is not backend-agnostic — this is a real design problem

The lc dask group at commands.py:824 violates the stated principle. Concrete places the abstraction leaks:

| Location | Leakage |
| --- | --- |
| commands.py:824 | @main.group() named dask: a public CLI command named after the backend |
| commands.py:379 | lc run docstring: "Always dispatches through a Dask cluster" |
| commands.py:345 | _abort_on_perlmutter_login references DASK_SCHEDULER_ADDRESS in its logic, but this env var is already the escape hatch; the function comment could avoid naming the backend |
| dask_cluster.py:98,106 | cluster_for_run verbose prints say "→ Using existing Dask scheduler" / "→ Reusing session scheduler"; the second is fine, the first names Dask |

The root problem: lc dask stop is only ever called by the session-end.sh hook (line 16). It is not a user-facing command — it is a hook target. Exposing it as a public subgroup makes Dask a first-class noun in the user's CLI.

Suggested fix: Remove the lc dask group entirely. Have the SessionEnd hook call the daemon module directly:

# session-end.sh — no public CLI command needed
python -m lightcone.engine.dask_daemon --project "$cwd" --stop

Or, if you want a CLI entry for discoverability without naming Dask, make it backend-agnostic:

lc session stop   # "stop the background execution scheduler"

Either way, the lc dask group should not appear in lc --help output.
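As an illustration of the backend-agnostic option, the sketch below wires a `session stop` subcommand to an injectable stop hook. It uses argparse from the standard library as a stand-in so that it is self-contained; the real CLI appears to use decorator-style command groups, and `stop_backend` is a hypothetical placeholder for whatever the daemon module exposes.

```python
import argparse
from typing import Callable, List


def stop_backend() -> bool:
    """Hypothetical hook: stop the background scheduler, if any.

    Returns True if something was stopped. This placeholder does nothing.
    """
    return False


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="lc")
    sub = parser.add_subparsers(dest="command", required=True)
    session = sub.add_parser("session", help="manage the background execution session")
    actions = session.add_subparsers(dest="action", required=True)
    actions.add_parser("stop", help="stop the background execution scheduler")
    return parser


def main(argv: List[str], stop: Callable[[], bool] = stop_backend) -> int:
    args = build_parser().parse_args(argv)
    if args.command == "session" and args.action == "stop":
        stop()  # best-effort: exit quietly whether or not anything was running
    return 0
```

The point of the indirection is that the help text and command names never mention the execution backend; only the hook's implementation knows about it.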


Summary

The daemon architecture is defensible for SLURM but over-specified for the local case. The flock-liveness design is correct. The main actionable issues:

  1. lc dask must go — or be renamed to something that doesn't name the backend. The SessionEnd hook is the only caller; it doesn't need a public command.
  2. lc run docstring should not mention Dask.
  3. _block_until_done status check is fragile against Dask internals — consider a simpler approach.
  4. stop() + meta.json PID approach is inconsistent with the owner.lock crash-safety story — worth aligning.
