Skip to content

Rewrite sglang integration test with tensor dumps and harden Ray worker env#75

Merged
yubofredwang merged 3 commits intomainfrom
feat/sglang-test-rewrite-and-ray-env
Apr 13, 2026
Merged

Rewrite sglang integration test with tensor dumps and harden Ray worker env#75
yubofredwang merged 3 commits intomainfrom
feat/sglang-test-rewrite-and-ray-env

Conversation

@yubofredwang
Copy link
Copy Markdown
Collaborator

Summary

  • tests/test_sglang_engine_integration.py — Rewrite the sglang integration test with structured test cases (short seqs, long seqs, text prompts) that dump all tensors to .pt files for cross-engine comparison (e.g. vs vllm). Auto-resolves aux layer IDs from model config instead of hardcoding.
  • torchspec/ray/train_group.py — Propagate TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC (default 1800s) and TORCHINDUCTOR_FX_GRAPH_CACHE (enabled) to Ray training workers to prevent NCCL watchdog kills during long training steps and speed up inductor compilations.
  • torchspec/utils/env.py — Update the env key allowlist: drop legacy NCCL_IB_* keys, add the two new torch/inductor env vars.

Test plan

  • Run python tests/test_sglang_engine_integration.py --model Qwen/Qwen3-8B --tp 4 with mooncake services running; verify tensor dumps appear in ./tensor_dumps/
  • Verify Ray training workers inherit the new env vars (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC, TORCHINDUCTOR_FX_GRAPH_CACHE) by inspecting worker logs
  • Confirm no regressions in existing Eagle3 training pipeline

…er env

- Rewrite test_sglang_engine_integration.py with structured tests (short seqs,
  long seqs, text prompts) that dump all tensors to .pt files for cross-engine
  comparison against vllm. Auto-resolve aux layer IDs from model config.
- Propagate TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC and TORCHINDUCTOR_FX_GRAPH_CACHE
  to Ray training workers to avoid NCCL watchdog kills and enable inductor
  FX graph caching.
- Update torchspec env key allowlist: drop legacy NCCL_IB_* keys, add the two
  new torch/inductor env vars.
Copilot AI review requested due to automatic review settings April 13, 2026 22:27
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 313e3c0c78

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates TorchSpec’s sglang integration workflow to support cross-engine tensor dump comparisons and hardens Ray worker environment propagation for long-running training steps.

Changes:

  • Rewrites tests/test_sglang_engine_integration.py into structured scenarios that fetch Mooncake tensors and dump them to .pt files, with aux layer IDs auto-derived from model config.
  • Propagates TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC (default 1800s) and TORCHINDUCTOR_FX_GRAPH_CACHE (default enabled) into Ray training worker runtime_env.
  • Updates the Ray env allowlist to include the two new Torch/TorchInductor env vars and drop legacy NCCL IB keys.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
torchspec/utils/env.py Adjusts the env var allowlist forwarded to Ray actors (adds Torch/NCCL + Inductor knobs).
torchspec/ray/train_group.py Ensures Ray training workers receive safer defaults for NCCL heartbeat + Inductor FX graph cache.
tests/test_sglang_engine_integration.py Reworks the sglang integration script to run multiple scenarios and dump Mooncake tensors for cross-engine comparison.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +31 to 33
if _REPO_ROOT not in sys.path:
sys.path.insert(0, _REPO_ROOT)

Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This module mutates global interpreter state at import time (inserting repo root into sys.path). Since this file lives under tests/ and matches pytest’s default collection pattern (test_*.py), importing it during test discovery can unexpectedly change import resolution for the entire test session. Consider moving the sys.path manipulation into main() / the main block (or renaming/relocating the script so it isn’t collected by pytest).

Suggested change
if _REPO_ROOT not in sys.path:
sys.path.insert(0, _REPO_ROOT)
def _ensure_repo_root_on_sys_path() -> None:
if _REPO_ROOT not in sys.path:
sys.path.insert(0, _REPO_ROOT)
if __name__ == "__main__":
_ensure_repo_root_on_sys_path()

Copilot uses AI. Check for mistakes.
Comment on lines +15 to +17
# Start mooncake master first:
# mooncake_master --port 50051 &
# etcd --listen-client-urls http://0.0.0.0:8090 --advertise-client-urls http://localhost:8090 &
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usage header shows mooncake_master/etcd ports 50051/8090, but the script defaults MOONCAKE_MASTER_PORT/MOONCAKE_METADATA_PORT to 51135/8763 below. This mismatch can confuse users following the instructions. Update the usage text to match the defaults, or explicitly state that the ports are controlled via MOONCAKE_* env vars (and that the shown command is just an example).

Suggested change
# Start mooncake master first:
# mooncake_master --port 50051 &
# etcd --listen-client-urls http://0.0.0.0:8090 --advertise-client-urls http://localhost:8090 &
# Start mooncake services first. The ports are controlled by
# MOONCAKE_MASTER_PORT / MOONCAKE_METADATA_PORT; the commands below use
# this script's defaults:
# mooncake_master --port 51135 &
# etcd --listen-client-urls http://0.0.0.0:8763 --advertise-client-urls http://localhost:8763 &

Copilot uses AI. Check for mistakes.
spec_training_response_length=[5, 10, 8],
sampling_params={"max_new_tokens": 32},
spec_training_data_id=data_ids,
sampling_params={"max_new_tokens": 1},
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script sets sampling_params={"max_new_tokens": 1}, but TorchSpec’s SglEngine spec_training path uses max_new_tokens=0 (prefill-only). Generating extra tokens can change what the engine writes to Mooncake and can make the seq_len/shape assertions in this test incorrect or flaky. Align this to max_new_tokens=0 (or derive seq_len from meta_info["prompt_tokens"] consistently when validating shapes).

Suggested change
sampling_params={"max_new_tokens": 1},
sampling_params={"max_new_tokens": 0},

Copilot uses AI. Check for mistakes.
Comment on lines +208 to 213
results = engine.generate(
prompt=text_prompts,
spec_training_data_id=data_ids,
sampling_params={"max_new_tokens": 1},
return_hidden_states=True,
)
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above: using sampling_params={"max_new_tokens": 1} diverges from TorchSpec’s prefill-only spec_training behavior (max_new_tokens=0) and may change Mooncake tensor shapes/content relative to the asserted prompt token length. Consider setting this to 0 to keep the dump comparable to training-time behavior.

Copilot uses AI. Check for mistakes.
…gured value

MOONCAKE_MASTER_SERVER was always set to {LOCAL_IP}:51135 regardless of
the resolved MOONCAKE_MASTER_PORT value.  Build it from the actual port
so overriding MOONCAKE_MASTER_PORT (e.g. 50051) propagates correctly.
Avoids doubling memory usage when the eval cache checkpoint is large,
since mmap lazily pages in tensors instead of reading the full file.
@yubofredwang yubofredwang merged commit 6b2d682 into main Apr 13, 2026
1 check passed
@yubofredwang yubofredwang deleted the feat/sglang-test-rewrite-and-ray-env branch April 13, 2026 22:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants