Skip to content

Change asyncio.gather to TaskGroup in the Harbor generator to correctly propagate cancellation signals.#1160

Draft
alexgshaw wants to merge 6 commits intoNovaSky-AI:mainfrom
alexgshaw:main
Draft

Change asyncio.gather to TaskGroup in the Harbor generator to correctly propagate cancellation signals.#1160
alexgshaw wants to merge 6 commits intoNovaSky-AI:mainfrom
alexgshaw:main

Conversation

@alexgshaw
Copy link
Contributor

@alexgshaw alexgshaw commented Feb 16, 2026

See #1194

Fix Daytona sandbox leak on Ctrl+C and unhandled exceptions

Motivation

When a user presses Ctrl+C during Harbor RL training, Daytona sandboxes are not
cleaned up. They remain running indefinitely and must be manually deleted (or up
to a user-specified timeout like 30 mins). The same issue occurs when an unhandled
exception kills the training loop. In a run with 50+ concurrent sandboxes, all of them leak.

This PR fixes the full chain of issues — from the asyncio task structure in the
generator, through the Ray worker's signal handling, to the driver's cleanup
logic. Each fix addresses a specific failure mode discovered through E2E testing.

What's changed

1. harbor_generator.py — TaskGroup instead of asyncio.gather

Symptom: When one trajectory raises an exception, asyncio.gather propagates
the error but leaves all sibling coroutines running as orphans. Nobody awaits
them, so their finally blocks (sandbox teardown) may never execute.

Fix: Replace tqdm.asyncio.tqdm.gather with asyncio.TaskGroup. When one
task fails or the group is cancelled, TaskGroup cancels all siblings and waits
for their cleanup (finally blocks) before propagating the exception.

2. main_base.py — SIGTERM → KeyboardInterrupt in the worker

Symptom: When the driver calls ray.cancel(ref, force=False), Ray raises
KeyboardInterrupt in the worker. But when Ray sends SIGTERM to terminate
the worker (e.g. on force-cancel or cluster shutdown), the default handler
raises SystemExit. asyncio.run() does not handle SystemExit — it
exits immediately without cancelling tasks or running finally blocks.
Sandboxes leak.

Fix: Install a SIGTERM handler in BasePPOExp.run() that raises
KeyboardInterrupt instead. This lets asyncio.run() do its proper shutdown:
cancel the main task, propagate CancelledError to subtasks, and run all
finally blocks.

3. main_harbor.py — Driver waits for worker cleanup

Three issues were discovered in the driver (the process that calls ray.get()):

3a. Ray masks SIGINT during ray.get()

Symptom: ray.get() blocks in C code where SIGINT is set to SIG_IGN.
Pressing Ctrl+C has no effect — the driver hangs forever.

Fix: Replace ray.get(ref) with a polling loop: time.sleep(1) (which is
interruptible by signals) followed by a non-blocking ray.wait([ref], timeout=0).
Signal handlers are reinstalled after every ray.wait() call since Ray may
re-mask them.

3b. Driver exits before worker finishes cleanup

Symptom: The old code did ray.cancel(ref, force=False); raise. The driver
exits immediately after calling cancel. The worker starts cleanup (deleting
sandboxes) but the driver is already gone. At scale (50+ sandboxes), the worker
needs time to delete them all. Sandboxes leak.

Fix: After ray.cancel(), poll ray.wait() for up to 120 seconds, waiting
for the worker to finish cleanup. A second Ctrl+C force-kills the worker for
users who want to exit immediately.

3c. uv sends SIGTERM when it dies

Symptom: When Ctrl+C sends SIGINT to the process group, uv (the parent
process) also receives SIGINT, dies, and sends SIGTERM to the Python child.
The driver receives: SIGINT → first KeyboardInterrupt → enters cleanup handler
→ SIGTERM from uv → second KeyboardInterrupt → force-kills worker before
cleanup can start.

Fix: Temporarily set SIGINT and SIGTERM to SIG_IGN upon entering the
cleanup handler, call ray.cancel(), wait 0.5s for uv's SIGTERM to be
absorbed, then restore handlers so a deliberate second Ctrl+C still works.

Testing

Both tests run with 2× H100 GPUs, Qwen3-1.7B, 8 prompts × 8 samples = 64
trajectories, 50 concurrent Daytona sandboxes.

Test Before After
Ctrl+C during generation 27–50 sandboxes leaked 0 leaked
Unhandled exception during generation 4 leaked 0 leaked

Companion PR

Harbor fix (asyncio.shield in _create_sandbox): laude-institute/harbor#819


Open with Devin

gemini-code-assist[bot]

This comment was marked as resolved.

Copy link
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

Open in Devin Review

@alexgshaw alexgshaw marked this pull request as draft February 16, 2026 21:50
@CharlieFRuan
Copy link
Collaborator

Thanks @alexgshaw !

I think my default linter was configured slightly differently from yours. Lmk if you want me to get rid of those linter changes, or if you're fine as is.

No worry, I can edit the PR directly. Let me know when it's ready!

@alexgshaw alexgshaw marked this pull request as ready for review February 17, 2026 01:43
@alexgshaw
Copy link
Contributor Author

Okay, I might add some more changes later, but this should be ready. Pretty much all of my logic changes are in the generate method.

@alexgshaw
Copy link
Contributor Author

(By more changes later I mean additional PRs, this one is ready to merge!)

gemini-code-assist[bot]

This comment was marked as resolved.

CharlieFRuan added a commit to CharlieFRuan/SkyRL that referenced this pull request Feb 19, 2026
…n propagation

Fixes laude-institute/harbor#656 — when users
interrupt RL with Ctrl+C during sandbox rollout, sandboxes now properly
terminate instead of requiring manual cleanup.

asyncio.TaskGroup propagates cancellation to all child tasks, ensuring
Harbor trials (Daytona/Modal sandboxes) are cleaned up on KeyboardInterrupt.

Based on NovaSky-AI#1160 with lint reverted.

Co-Authored-By: Alex Shaw <alexgshaw64@gmail.com>
@CharlieFRuan
Copy link
Collaborator

/gemini review

devin-ai-integration[bot]

This comment was marked as resolved.

gemini-code-assist[bot]

This comment was marked as resolved.

…n propagation

Fixes laude-institute/harbor#656 — when users
interrupt RL with Ctrl+C during sandbox rollout, sandboxes now properly
terminate instead of requiring manual cleanup.

asyncio.TaskGroup propagates cancellation to all child tasks, ensuring
Harbor trials (Daytona/Modal sandboxes) are cleaned up on KeyboardInterrupt.

Also wraps the TaskGroup in try/finally to ensure the progress bar is
always closed, even if the TaskGroup raises an ExceptionGroup.

Ports the change to examples/train_integrations/harbor/ for the
skyrl migration (NovaSky-AI#1145).

Co-Authored-By: Alex Shaw <alexgshaw64@gmail.com>
CharlieFRuan and others added 2 commits February 19, 2026 23:24
When the driver process receives KeyboardInterrupt (Ctrl+C), `ray.get()` raises
the exception but does NOT cancel the remote worker task. The worker keeps
running until Ray eventually kills it (often with SIGKILL), skipping all async
cleanup — including Harbor sandbox teardown.

Fix: catch KeyboardInterrupt at the `ray.get()` call site and explicitly invoke
`ray.cancel(ref, force=False)`, which delivers KeyboardInterrupt to the worker.
asyncio.run() has built-in handling for KeyboardInterrupt: it cancels the main
task and runs all finally blocks, allowing Trial.run() to delete sandboxes.

Additionally, add a SIGTERM→KeyboardInterrupt handler in BasePPOExp.run() as
defense-in-depth for when Ray terminates the worker with SIGTERM directly.
Without this, SIGTERM raises SystemExit which asyncio.run() does not handle
gracefully (cleanup callbacks are skipped).

Tested E2E:
  - KeyboardInterrupt (Ctrl+C): 0/50 sandboxes leaked (was 27/50 before fix)
  - Unhandled exception:        0/64 sandboxes leaked (was 4/64 before fix,
    remaining leaks fixed by harbor asyncio.shield() change in harbor#656)

Fixes NovaSky-AI#1160
See also: laude-institute/harbor#656

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…trl+C

Ray's C code sets SIGINT to SIG_IGN during initialization, which prevents
Python from ever seeing KeyboardInterrupt.  The previous fix (ray.cancel +
raise) failed at production scale because the driver exited before the
worker could finish cleanup.

Key changes:
- Reinstall SIGINT/SIGTERM handlers AFTER initialize_ray() to override
  Ray's SIG_IGN
- Replace blocking ray.get() with time.sleep() + non-blocking ray.wait()
  polling loop — time.sleep() is interruptible by signals whereas ray.get()
  blocks in C code
- After first Ctrl+C: temporarily ignore signals (uv sends SIGTERM when
  it dies), call ray.cancel(force=False), then wait up to 2 minutes for
  the worker to clean up sandboxes
- Second Ctrl+C force-kills the worker

Tested: 50 Daytona sandboxes created → 0 leaked after SIGINT.
Also tested: exception during generation → 0 leaked.

See harbor#656 / SkyRL#1160.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@CharlieFRuan
Copy link
Collaborator

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the asynchronous task handling in the Harbor generator to use asyncio.TaskGroup instead of asyncio.gather. This is a great improvement for ensuring correct cancellation signal propagation. The changes also include robust signal handling in the main entrypoint to gracefully manage ray worker lifecycle and cleanup, which is crucial for stability.

My review identifies a critical bug in the SIGTERM handling logic that could lead to a crash, and also points out opportunities for code deduplication and style improvements in the signal handling implementation. Overall, the changes are in the right direction for making the training process more robust.

Comment on lines 83 to 94
signal.signal(signal.SIGINT, signal.default_int_handler)
signal.signal(signal.SIGTERM, lambda s, f: signal.default_int_handler(s, f))

try:
# Poll with time.sleep() (interruptible by signals) + non-blocking
# ray.wait(). We cannot use ray.get() or ray.wait(timeout=N>0)
# because those block in C code and may re-mask SIGINT.
while True:
time.sleep(1)
# Reinstall handlers in case ray.wait() overrode them
signal.signal(signal.SIGINT, signal.default_int_handler)
signal.signal(signal.SIGTERM, lambda s, f: signal.default_int_handler(s, f))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The signal handling logic has a few issues:

  1. Critical Bug: The SIGTERM handler lambda s, f: signal.default_int_handler(s, f) is incorrect. signal.default_int_handler() takes no arguments, so passing s and f will cause a TypeError when the handler is invoked. The process will crash instead of gracefully handling SIGTERM.
  2. Code Duplication: The signal handler installation logic is repeated multiple times (here, and later in the except block on lines 118-119 and 125). This makes the code harder to read and maintain.

I suggest refactoring this into a helper function and correcting the SIGTERM handler. This will fix the bug and improve code quality. You can then use this helper function in all places where signal handlers are set or reset, which will also fix an inconsistency on line 125 where only the SIGINT handler is re-installed.

Suggested change
signal.signal(signal.SIGINT, signal.default_int_handler)
signal.signal(signal.SIGTERM, lambda s, f: signal.default_int_handler(s, f))
try:
# Poll with time.sleep() (interruptible by signals) + non-blocking
# ray.wait(). We cannot use ray.get() or ray.wait(timeout=N>0)
# because those block in C code and may re-mask SIGINT.
while True:
time.sleep(1)
# Reinstall handlers in case ray.wait() overrode them
signal.signal(signal.SIGINT, signal.default_int_handler)
signal.signal(signal.SIGTERM, lambda s, f: signal.default_int_handler(s, f))
def _reinstall_signal_handlers():
"""Restores default signal handlers for SIGINT and SIGTERM."""
def _sigterm_handler(signum, frame):
raise KeyboardInterrupt
signal.signal(signal.SIGINT, signal.default_int_handler)
signal.signal(signal.SIGTERM, _sigterm_handler)
_reinstall_signal_handlers()
try:
# Poll with time.sleep() (interruptible by signals) + non-blocking
# ray.wait(). We cannot use ray.get() or ray.wait(timeout=N>0)
# because those block in C code and may re-mask SIGINT.
while True:
time.sleep(1)
# Reinstall handlers in case ray.wait() overrode them
_reinstall_signal_handlers()

Comment on lines 69 to 70
import signal
import time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

According to PEP 8, imports should generally be at the top of the file. Please move import signal and import time to the top-level of the module for consistency with other imports.

1. Refactor duplicated signal installation into _install_interrupt_handlers()
   helper function (Gemini suggestion).

2. Fix SIGTERM not reinstalled in cleanup wait loop — previously only
   SIGINT was reinstalled after ray.wait(), leaving SIGTERM ineffective
   during the 120s cleanup wait (Devin review, valid).

3. Move `import signal` and `import time` to module top level (PEP 8).

4. Wrap _worker() in try/except Exception so that a single failing
   trajectory (e.g. tokenizer error) does not cancel the entire
   TaskGroup batch.  CancelledError still propagates for cleanup.
   (Devin review, valid concern about TaskGroup semantics.)

Note: Gemini's claim that signal.default_int_handler takes no arguments
is incorrect — it takes (signalnum, frame).  The lambda is correct.

Tested: Ctrl+C (50 sandboxes → 0 leaked), exception (0 leaked).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
devin-ai-integration[bot]

This comment was marked as resolved.

CharlieFRuan and others added 2 commits February 20, 2026 20:08
Per user preference, one unexpected error canceling the entire TaskGroup
batch is fine. The try/except in harbor_agent_loop's retry loop already
handles expected failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@CharlieFRuan
Copy link
Collaborator

/gemini review

gemini-code-assist[bot]

This comment was marked as resolved.

Copy link
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 8 additional findings in Devin Review.

Open in Devin Review

Comment on lines 100 to 104
#
# See harbor#656 / SkyRL#1160.
signal.signal(signal.SIGTERM, signal.SIG_IGN)
signal.signal(signal.SIGINT, signal.SIG_IGN)
trainer = self._setup_trainer()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 SIGINT masked too early makes the entire setup phase uninterruptible by Ctrl+C

In HarborExp.run(), signal.signal(signal.SIGINT, signal.SIG_IGN) is set before self._setup_trainer() is called. The SIGINT handler is only restored later inside the event loop by loop.add_signal_handler(signal.SIGINT, _cancel_main) within _run_with_graceful_shutdown. This means that during the entire _setup_trainer() phase — which includes building inference engines, creating the trainer, and calling trainer.build_models() — Ctrl+C is completely ignored.

Root cause and impact

The intent (documented in the comment at line 96-98) is to prevent asyncio.run() from installing its own _on_sigint handler. However, placing the SIG_IGN call at line 103 — before _setup_trainer() at line 104 — has the unintended side effect of also masking SIGINT during setup.

_setup_trainer() (skyrl/train/entrypoints/main_base.py:370-414) performs heavyweight operations: initializing trackers, starting inference engines (get_inference_client()), building models (trainer.build_models()), etc. This can take several minutes with large models. During this entire period, the user has no way to interrupt the process with Ctrl+C.

The fix is to move the SIG_IGN calls to just before asyncio.run(), after _setup_trainer() returns:

trainer = self._setup_trainer()
signal.signal(signal.SIGTERM, signal.SIG_IGN)
signal.signal(signal.SIGINT, signal.SIG_IGN)

The same issue exists in the duplicate file at skyrl-train/examples/harbor/entrypoints/main_harbor.py:100-104.

Suggested change
#
# See harbor#656 / SkyRL#1160.
signal.signal(signal.SIGTERM, signal.SIG_IGN)
signal.signal(signal.SIGINT, signal.SIG_IGN)
trainer = self._setup_trainer()
trainer = self._setup_trainer()
signal.signal(signal.SIGTERM, signal.SIG_IGN)
signal.signal(signal.SIGINT, signal.SIG_IGN)
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants