Session-scoped Dask scheduler #105
The scheduler now outlives a single ``lc run``: spawned on first use, reused across subsequent runs in the same Claude session, torn down on SessionEnd or after 30 minutes idle.

One scheduler per execution context: the project on a laptop, the SLURM allocation inside ``salloc`` (so ``srun``-spawned workers are reused across every run in the allocation, not respawned each time). Storage lives under the existing scratch layout (``<scratch>/.lightcone/dask-scheduler/<key>/``) so NERSC's Lustre flock semantics are inherited rather than re-derived.

Crash safety hangs on one primitive: the daemon holds ``flock`` on ``owner.lock`` for its lifetime, and the kernel releases it when the process dies (clean exit, crash, SIGKILL). Liveness is therefore probed by trying to take the lock, never by PID files or heartbeats. Idle-timeout is the safety net for the residual case where SessionEnd never fires (Claude killed without a clean exit).

``cluster_for_run`` shrinks from 200 lines of three branches to a connect-only context manager. The LocalCluster and SLURM-srun branches move into the daemon, where their lifetime is now session-long. The ``DASK_SCHEDULER_ADDRESS`` escape hatch is preserved for CI/external schedulers.

The new ``lc dask stop`` command sends SIGTERM to the running scheduler; the SessionEnd hook calls it for prompt cleanup. Both are best-effort: silent on every "nothing to stop" path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
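To make the lock-based liveness probe concrete, here is a minimal sketch of the idea, not the code in this PR. The helper names ``hold_owner_lock`` and ``scheduler_is_alive`` are hypothetical; only the ``owner.lock`` file name comes from the description above. It assumes a POSIX system where ``flock`` is available on the filesystem in question.

```python
import fcntl
import os
from pathlib import Path


def hold_owner_lock(lock_path: Path) -> int:
    """Daemon side (hypothetical helper): take an exclusive flock and keep the fd open.

    The kernel drops the lock when the fd is closed, which includes any
    kind of process death, so no explicit cleanup is needed.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    return fd


def scheduler_is_alive(lock_path: Path) -> bool:
    """Client side (hypothetical helper): probe liveness by trying to take the lock."""
    try:
        fd = os.open(lock_path, os.O_RDWR)
    except FileNotFoundError:
        return False  # no lock file, so no daemon has started here
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return True  # someone else holds the lock: the owner is alive
    else:
        fcntl.flock(fd, fcntl.LOCK_UN)  # we got it, so nobody owned it
        return False
    finally:
        os.close(fd)
```

Because the probe never reads a PID, there is no stale-PID or PID-reuse case to handle: a dead owner simply means the lock can be taken.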
❌ Eval Results
Graders: ✅ spec_valid (1.00)
@claude is this the simplest way to provide the functionality of a persistent Dask cluster when executing locally or within an 'salloc' job? Also, we do not want to introduce a command-line-level mention of Dask; the CLI should remain agnostic to the execution backend.
Claude finished @EiffL's task in 1m 56s

Review: Session-scoped Dask scheduler
Q1: Is this the simplest way to get a persistent cluster?
For the SLURM case: roughly yes. Managing worker lifetime across
For the local case: no. Dask ships
Specific complexity concerns:

Q2: CLI is not backend-agnostic: this is a real design problem
The
The root problem:
Suggested fix: Remove the

    # session-end.sh - no public CLI command needed
    python -m lightcone.engine.dask_daemon --project "$cwd" --stop

Or, if you want a CLI entry for discoverability without naming Dask, make it backend-agnostic:
Either way, the

Summary
The daemon architecture is defensible for SLURM but over-specified for the local case. The
Summary
The Dask scheduler now outlives a single ``lc run``: spawned on first use, reused across subsequent runs in the same Claude session, torn down on SessionEnd or after 30 minutes idle.

- One scheduler per execution context: the project on a laptop, the SLURM allocation inside ``salloc`` (so ``srun``-spawned workers are reused across every run in the allocation rather than respawned each time).
- Storage lives under the existing scratch layout (``<scratch>/.lightcone/dask-scheduler/<key>/``), so flock semantics inherit from the rest of the lock layer.
- Crash safety hangs on one primitive: the daemon holds ``flock`` on ``owner.lock`` for its lifetime; the kernel releases it on clean exit, crash, or SIGKILL. Liveness is probed by trying to take the lock: no PID files, no heartbeats. Idle-timeout (Dask's built-in ``Scheduler.idle_timeout``) is the safety net for the rare case where the SessionEnd hook never fires.

What changes

- ``src/lightcone/engine/dask_daemon.py``: ``ensure_scheduler``, ``stop``, and a ``python -m`` daemon entrypoint.
- ``src/lightcone/engine/dask_cluster.py`` shrinks from 200 lines of three branches to a connect-only context manager. The LocalCluster and SLURM-srun branches moved into the daemon, where their lifetime is now session-long.
- The ``DASK_SCHEDULER_ADDRESS`` escape hatch is preserved for CI / external schedulers.
- The new ``lc dask stop`` command sends SIGTERM to the running scheduler. The new ``SessionEnd`` hook (``claude/lightcone/scripts/session-end.sh``) calls it for prompt cleanup. Both are best-effort: silent on every "nothing to stop" path.
- ``RunDirs.dask_local`` removed: the daemon owns spill placement (per-key, not per-run) under its own scheduler dir.

Test plan

- ``uv run pytest``: all 295 tests pass, including the new ``tests/test_dask_daemon.py`` (key/dirs layout, reuse-vs-spawn, stale-state cleanup, spawn timeout, ``stop`` semantics).
- ``uv run ruff check src/ tests/``: clean.
- ``uv run mypy src/``: no new errors (only pre-existing third-party stub gaps).
- … ``scheduler.json`` and ``meta.json`` are cleaned up post-shutdown.
- … ``ensure_scheduler`` probes the dead address, takes spawn lock, cleans stale files, spawns fresh daemon at a new address.
- … ``lc run`` invocations in one Claude session and confirm the second uses the same scheduler (no second startup latency, no second LocalCluster log line).
- … ``salloc``): confirm the SLURM-keyed scheduler is shared across runs in one allocation, and that SLURM cgroup cleanup tears it down at allocation end.

🤖 Generated with Claude Code
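For readers trying to picture "one scheduler per execution context", the following is a rough sketch of how a key and its state directory could be derived. It is an illustration under stated assumptions, not the PR's implementation: the function names, the ``slurm-``/``local-`` prefixes, and the path hashing are all hypothetical. Only the ``<scratch>/.lightcone/dask-scheduler/<key>/`` layout comes from the summary, and ``SLURM_JOB_ID`` is the variable SLURM sets inside an allocation.

```python
import hashlib
import os
from pathlib import Path
from typing import Mapping


def scheduler_key(project: Path, env: Mapping[str, str] = os.environ) -> str:
    """Pick the execution context (hypothetical naming scheme).

    Inside a SLURM allocation the job id identifies the context, so every
    run in the allocation maps to the same scheduler; otherwise the
    project path does.
    """
    job_id = env.get("SLURM_JOB_ID")
    if job_id:
        return f"slurm-{job_id}"
    # Hash the resolved path so the key is stable and filesystem-safe.
    digest = hashlib.sha256(str(project.resolve()).encode()).hexdigest()[:12]
    return f"local-{digest}"


def scheduler_dir(scratch: Path, key: str) -> Path:
    """State directory for one scheduler, following the layout in the summary."""
    return scratch / ".lightcone" / "dask-scheduler" / key
```

With a scheme like this, two ``lc run`` invocations from the same project, or from the same ``salloc`` shell, resolve to the same directory and therefore contend on the same ``owner.lock``.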