Session-scoped Dask scheduler by EiffL · Pull Request #105 · LightconeResearch/lightcone-cli

EiffL · 2026-04-30T22:48:28Z

Summary

The Dask scheduler now outlives a single lc run: spawned on first use, reused across subsequent runs in the same Claude session, torn down on SessionEnd or after 30 minutes idle.

One scheduler per execution context: the project on a laptop, or the SLURM allocation inside salloc (so srun-spawned workers are reused across every run in the allocation rather than respawned each time).
NERSC-correct. State lives under the existing scratch layout (<scratch>/.lightcone/dask-scheduler/<key>/), so flock semantics are inherited from the rest of the lock layer.
Crash-safe via one primitive. The daemon holds flock on owner.lock for its lifetime; the kernel releases it on clean exit, crash, or SIGKILL. Liveness is probed by trying to take the lock — no PID files, no heartbeats. Idle-timeout (Dask's built-in Scheduler.idle_timeout) is the safety net for the rare case where the SessionEnd hook never fires.
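The liveness probe described above can be sketched in a few lines. This is a minimal illustration, not the code from dask_daemon.py: the function name `daemon_is_alive` is hypothetical, and it assumes a POSIX system where `fcntl.flock` is available.

```python
import fcntl
import os
from pathlib import Path


def daemon_is_alive(lock_path: Path) -> bool:
    """Return True if another open file description holds an exclusive flock."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # Non-blocking attempt: fails immediately if the lock is held.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True  # someone holds owner.lock, so the daemon is alive
        # We got the lock, so nobody was holding it; give it back at once.
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)
```

Because the kernel drops the lock when the holder's descriptor is closed for any reason, the probe needs no PID bookkeeping and cannot observe a stale "alive" state.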

What changes

New src/lightcone/engine/dask_daemon.py — ensure_scheduler, stop, and a python -m daemon entrypoint.
src/lightcone/engine/dask_cluster.py shrinks from 200 lines of three branches to a connect-only context manager. The LocalCluster and SLURM-srun branches moved into the daemon, where their lifetime is now session-long.
The DASK_SCHEDULER_ADDRESS escape hatch is preserved for CI / external schedulers.
New lc dask stop command sends SIGTERM to the running scheduler. The new SessionEnd hook (claude/lightcone/scripts/session-end.sh) calls it for prompt cleanup. Both are best-effort — silent on every "nothing to stop" path.
RunDirs.dask_local removed: the daemon owns spill placement (per-key, not per-run) under its own scheduler dir.
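To illustrate what "connect-only" can mean in practice, here is a hedged sketch of the address resolution step. The helper name `resolve_scheduler_address` is hypothetical, and the assumption that scheduler.json carries an "address" key is mine, not something the PR states.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_scheduler_address(
    scheduler_file: Path, env: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Pick the scheduler address: env override first, then the daemon's file."""
    env = os.environ if env is None else env
    override = env.get("DASK_SCHEDULER_ADDRESS")
    if override:
        return override  # escape hatch for CI / external schedulers
    try:
        # Assumption: the daemon writes {"address": "tcp://..."} to this file.
        return json.loads(scheduler_file.read_text())["address"]
    except (OSError, ValueError, KeyError, TypeError):
        return None  # no usable state: the caller would ask the daemon to spawn
```

The returned string could then be handed to `distributed.Client(address)`, leaving cluster creation entirely to the daemon.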

Test plan

uv run pytest — all 295 tests pass, including the new tests/test_dask_daemon.py (key/dirs layout, reuse-vs-spawn, stale-state cleanup, spawn timeout, stop semantics).
uv run ruff check src/ tests/ — clean.
uv run mypy src/ — no new errors (only pre-existing third-party stub gaps).
End-to-end smoke: spawn → reuse → submit task → stop, verifying scheduler.json and meta.json are cleaned up post-shutdown.
SIGKILL recovery: kill -9 the daemon → next ensure_scheduler probes the dead address, takes spawn lock, cleans stale files, spawns fresh daemon at a new address.
Manual: run two lc run invocations in one Claude session and confirm the second uses the same scheduler (no second startup latency, no second LocalCluster log line).
Manual on Perlmutter (next salloc): confirm the SLURM-keyed scheduler is shared across runs in one allocation, and that SLURM cgroup cleanup tears it down at allocation end.
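The SIGKILL recovery item rests on the kernel releasing the lock when the holder dies. A small self-contained check of that property, independent of the daemon, might look like the sketch below. All names here are hypothetical and it assumes a POSIX system with a local filesystem.

```python
import fcntl
import os
import signal
import subprocess
import sys
import tempfile

# Child process: take the lock, report readiness, then sleep until killed.
HOLDER = """
import fcntl, os, sys, time
fd = os.open(sys.argv[1], os.O_RDWR | os.O_CREAT, 0o644)
fcntl.flock(fd, fcntl.LOCK_EX)
print("ready", flush=True)
time.sleep(60)
"""


def can_take_lock(path: str) -> bool:
    """Try a non-blocking exclusive flock; the lock is dropped again on close."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False
    finally:
        os.close(fd)


def check_sigkill_releases_lock() -> tuple[bool, bool]:
    """Return (held while child alive, free after kill -9)."""
    with tempfile.TemporaryDirectory() as tmp:
        lock = os.path.join(tmp, "owner.lock")
        child = subprocess.Popen(
            [sys.executable, "-c", HOLDER, lock],
            stdout=subprocess.PIPE,
            text=True,
        )
        assert child.stdout is not None
        child.stdout.readline()  # block until the child reports "ready"
        held_while_alive = not can_take_lock(lock)
        child.send_signal(signal.SIGKILL)
        child.wait()  # reap the child so its descriptors are closed
        free_after_kill = can_take_lock(lock)
        child.stdout.close()
        return held_while_alive, free_after_kill
```

Whether the same guarantee holds on a given network filesystem depends on how that filesystem implements flock, which this check does not cover.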

🤖 Generated with Claude Code

The scheduler now outlives a single ``lc run``: spawned on first use, reused across subsequent runs in the same Claude session, torn down on SessionEnd or after 30 minutes idle.

One scheduler per execution context: the project on a laptop, the SLURM allocation inside ``salloc`` (so ``srun``-spawned workers are reused across every run in the allocation, not respawned each time). Storage lives under the existing scratch layout (``<scratch>/.lightcone/dask-scheduler/<key>/``) so NERSC's Lustre flock semantics are inherited rather than re-derived.

Crash safety hangs on one primitive: the daemon holds ``flock`` on ``owner.lock`` for its lifetime, and the kernel releases it when the process dies (clean exit, crash, SIGKILL). Liveness is therefore probed by trying to take the lock, never by PID files or heartbeats. Idle-timeout is the safety net for the residual case where SessionEnd never fires (Claude killed without a clean exit).

``cluster_for_run`` shrinks from 200 lines of three branches to a connect-only context manager. The LocalCluster and SLURM-srun branches move into the daemon, where their lifetime is now session-long. The ``DASK_SCHEDULER_ADDRESS`` escape hatch is preserved for CI/external schedulers.

The new ``lc dask stop`` command sends SIGTERM to the running scheduler; the SessionEnd hook calls it for prompt cleanup. Both are best-effort: silent on every "nothing to stop" path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-30T22:56:26Z

❌ Eval Results

Metric	Value
Score	0.25
Build complete	✅
Cost	$1.14
Turns	40
Duration	434s
lightcone-cli	`0.3.5.dev2+ge63cdfb04` (`e63cdfb0`)
Results	Download

Graders

✅ spec_valid (1.00)
❌ all_materialized (0.00)

Full output

22:48:56 lightcone.eval.build Building lightcone-cli wheel from /home/runner/work/lightcone-cli/lightcone-cli ...
22:49:01 lightcone.eval.build Built lightcone_cli-0.3.5.dev2+ge63cdfb04-py3-none-any.whl (commit e63cdfb0)
22:49:05 lightcone.eval.sandbox Created sandbox 0c69d9a7-5081-45a0-b645-468dbda02432 for trial build-snae-0
22:49:05 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/0c69d9a7-5081-45a0-b645-468dbda02432/files/bulk-upload "HTTP/1.1 200 OK"
22:49:06 lightcone.eval.sandbox Installed wheels: ['lightcone_cli-0.3.5.dev2+ge63cdfb04-py3-none-any.whl']
22:49:07 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/0c69d9a7-5081-45a0-b645-468dbda02432/files/bulk-upload "HTTP/1.1 200 OK"
22:49:07 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/0c69d9a7-5081-45a0-b645-468dbda02432/files/bulk-upload "HTTP/1.1 200 OK"
22:49:07 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/0c69d9a7-5081-45a0-b645-468dbda02432/files/bulk-upload "HTTP/1.1 200 OK"
22:49:08 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/0c69d9a7-5081-45a0-b645-468dbda02432/files/bulk-upload "HTTP/1.1 200 OK"
22:56:23 lightcone.eval.sandbox Deleted sandbox for trial build-snae-0
  snae trial 0: score=0.25 complete

lightcone-cli: 0.3.5.dev2+ge63cdfb04 (HEAD e63cdfb0)
ASTRA: 0.2.6

  Eval Results: Scores  
┏━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Task ┃     Score     ┃
┡━━━━━━╇━━━━━━━━━━━━━━━┩
│ snae │ 0.25 +/- 0.00 │
│      │ pass@k: 100%  │
└──────┴───────────────┘

   Eval Results: Cost &   
         Duration         
┏━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Task ┃ Cost / Duration ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ snae │      $1.14      │
│      │      434s       │
└──────┴─────────────────┘

Total: 1 trials, $1.14, 434s

Results saved to: eval-results/build-e63cdfb0/results.json

EiffL · 2026-05-01T06:48:45Z

@claude is this the simplest way to provide the functionality of a persistent Dask cluster when executing locally or within an 'salloc' job?

Also, we do not want to introduce a command line level mention of Dask, the cli should remain agnostic to the execution backend.

claude · 2026-05-01T06:48:58Z

Claude finished @EiffL's task in 1m 56s

Review: Session-scoped Dask scheduler

Read changed files to understand the implementation
Assess design complexity vs simpler alternatives
Review CLI surface for Dask-specific exposure

Q1: Is this the simplest way to get a persistent cluster?

For the SLURM case — roughly yes. Managing worker lifetime across srun tasks (spawning workers, retiring them gracefully on shutdown, and sharing one scheduler across all lc run invocations in the same salloc) requires orchestration that can't be delegated entirely to a stock CLI. The flock-as-liveness primitive is well-chosen: kernel-released on crash/kill, no PID-file race, one syscall to probe.

For the local case — no. Dask ships dask scheduler --scheduler-file <path> --idle-timeout "30 minutes" as a first-class CLI. Running that as a detached subprocess replaces _serve(), _start_local_cluster(), _block_until_done(), and _shutdown() with three lines. The full daemon is only justified by the SLURM branch, yet the two branches share a single 462-line entrypoint. If the SLURM path is real and needed now, keep it; but if local is the 90% case, the daemon is carrying a lot of dead weight.
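As a rough sketch of the "detached subprocess" alternative for the local case: the helper names below are hypothetical, and the example assumes the `dask` executable from the distributed package is on PATH. The two flags are the ones quoted above.

```python
import subprocess
from pathlib import Path


def scheduler_command(scheduler_file: Path, idle_timeout: str = "30 minutes") -> list[str]:
    """Build the stock Dask scheduler command line quoted in the review."""
    return [
        "dask", "scheduler",
        "--scheduler-file", str(scheduler_file),
        "--idle-timeout", idle_timeout,
    ]


def spawn_detached(cmd: list[str]) -> subprocess.Popen:
    """Start cmd in its own session so it survives the parent's exit."""
    return subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # detach from the caller's terminal and process group
    )
```

Usage would be `spawn_detached(scheduler_command(path))`. Note that a bare scheduler has no workers, so a real replacement would also need to start workers (for example with `dask worker --scheduler-file <path>`), which narrows the gap with the daemon somewhat.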

Specific complexity concerns:

_block_until_done polling cluster.status (dask_daemon.py:395-399) depends on Dask's internal EnumStatus repr. A str(status).rsplit(".", maxsplit=1)[-1] != "running" check will silently miss new states or break across Dask versions. A more robust probe: subscribe to cluster.scheduler.coroutines or just let SIGTERM be the only exit path and trust idle-timeout as the second line of defence.
stop() uses the PID from meta.json (dask_daemon.py:168-184), while the stated crash-safety argument rests entirely on owner.lock. These two mechanisms are inconsistent. If stop() first tried to acquire owner.lock non-blocking, a successful acquire would prove that no daemon is alive, so there would be nothing to signal and the recorded PID could be discarded as stale. Alternatively, the holder's PID could be looked up from the kernel's lock table (on Linux, /proc/locks) rather than trusted from a file. As-is, a short window exists where meta.json records a PID that has been recycled by the OS after a crash.
meta.json exists solely so stop() can find the PID. If stop() were redesigned to not need it (see above), the file and the write-before-cluster-start ordering dance disappear.
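One way to align stop() with the lock-based story, while still keeping meta.json, is to consult owner.lock before trusting the PID. The sketch below is an illustration under assumptions: the "pid" key in meta.json and the injectable `kill` parameter are hypothetical, not taken from the PR.

```python
import fcntl
import json
import os
import signal
from pathlib import Path
from typing import Callable


def stop(scheduler_dir: Path, kill: Callable[[int, int], None] = os.kill) -> bool:
    """Send SIGTERM only if owner.lock shows that a daemon is alive."""
    lock_path = scheduler_dir / "owner.lock"
    meta_path = scheduler_dir / "meta.json"
    if not lock_path.exists():
        return False  # nothing was ever started here
    fd = os.open(lock_path, os.O_RDWR)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            pass  # lock is held, so the daemon is alive
        else:
            # We took the lock: no daemon exists and the recorded PID is stale.
            meta_path.unlink(missing_ok=True)
            return False
        try:
            pid = int(json.loads(meta_path.read_text())["pid"])
        except (OSError, ValueError, KeyError, TypeError):
            return False  # best-effort: silent when there is no usable PID
        kill(pid, signal.SIGTERM)
        return True
    finally:
        os.close(fd)
```

This narrows the PID-recycling window to the interval between the probe and the signal, but does not remove it entirely.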

Q2: CLI is not backend-agnostic — this is a real design problem

The lc dask group at commands.py:824 violates the stated principle. Concrete places the abstraction leaks:

Location	Leakage
`commands.py:824`	`@main.group()` named `dask` — a public CLI command named after the backend
`commands.py:379`	`lc run` docstring: "Always dispatches through a Dask cluster"
`commands.py:345`	`_abort_on_perlmutter_login` references `DASK_SCHEDULER_ADDRESS` in its logic but this env var is already the escape hatch — the function comment could avoid naming the backend
`dask_cluster.py:98,106`	`cluster_for_run` verbose prints say `"→ Using existing Dask scheduler"` / `"→ Reusing session scheduler"` — the second is fine, the first names Dask

The root problem: lc dask stop is only ever called by the session-end.sh hook (line 16). It is not a user-facing command — it is a hook target. Exposing it as a public subgroup makes Dask a first-class noun in the user's CLI.

Suggested fix: Remove the lc dask group entirely. Have the SessionEnd hook call the daemon module directly:

# session-end.sh — no public CLI command needed
python -m lightcone.engine.dask_daemon --project "$cwd" --stop

Or, if you want a CLI entry for discoverability without naming Dask, make it backend-agnostic:

lc session stop   # "stop the background execution scheduler"

Either way, the lc dask group should not appear in lc --help output.
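To make the backend-agnostic option concrete, here is a small sketch of an `lc session stop` command whose help text never names the backend. It uses the standard library's argparse purely so the example is self-contained; the project's own CLI appears to use decorator-style groups, and the names `build_parser`, `run_cli`, and `stop_backend` are hypothetical.

```python
import argparse
from typing import Callable, Sequence


def build_parser() -> argparse.ArgumentParser:
    """CLI surface that talks about sessions, not about any particular backend."""
    parser = argparse.ArgumentParser(prog="lc")
    commands = parser.add_subparsers(dest="command", required=True)
    session = commands.add_parser("session", help="manage the background execution session")
    actions = session.add_subparsers(dest="action", required=True)
    actions.add_parser("stop", help="stop the background execution scheduler")
    return parser


def run_cli(argv: Sequence[str], stop_backend: Callable[[], bool]) -> int:
    """Dispatch to whichever backend is configured; the CLI never names it."""
    args = build_parser().parse_args(argv)
    if args.command == "session" and args.action == "stop":
        stop_backend()  # best-effort: the result is ignored on purpose
        return 0
    return 2
```

The backend is injected as a plain callable, so swapping the execution engine would not change the command name or its help output.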

Summary

The daemon architecture is defensible for SLURM but over-specified for the local case. The flock-liveness design is correct. The main actionable issues:

lc dask must go — or be renamed to something that doesn't name the backend. The SessionEnd hook is the only caller; it doesn't need a public command.
lc run docstring should not mention Dask.
_block_until_done status check is fragile against Dask internals — consider a simpler approach.
stop() + meta.json PID approach is inconsistent with the owner.lock crash-safety story — worth aligning.
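On the status-check point, the fragility comes from parsing the string form of an enum. The sketch below uses stand-in enums (both hypothetical, not Dask's own classes) to show how a change in string rendering breaks the parsed check while a comparison on the member name keeps working. It assumes the real status object is a standard `enum.Enum` member.

```python
import enum


class Status(enum.Enum):
    """Stand-in for a scheduler status enum with default string rendering."""
    running = "running"
    closing = "closing"


class RenderedStatus(enum.Enum):
    """Same states, but a library update changed how members are rendered."""
    running = "running"

    def __str__(self) -> str:
        return f"<{self.value}>"


def is_running_by_repr(status: enum.Enum) -> bool:
    # The pattern flagged in the review: depends on str() looking like "X.running".
    return str(status).rsplit(".", maxsplit=1)[-1] == "running"


def is_running_by_name(status: enum.Enum) -> bool:
    # Independent of how the member is rendered as text.
    return status.name == "running"
```

With a real library enum, comparing directly against the imported member (identity or equality) would be the more conventional form of the second check.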

EiffL wants to merge 1 commit into main from session-scoped-dask-scheduler
