Rewrite sglang integration test with tensor dumps and harden Ray worker env#75
Conversation
…er env
- Rewrite `test_sglang_engine_integration.py` with structured tests (short seqs, long seqs, text prompts) that dump all tensors to `.pt` files for cross-engine comparison against vllm. Auto-resolve aux layer IDs from model config.
- Propagate `TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC` and `TORCHINDUCTOR_FX_GRAPH_CACHE` to Ray training workers to avoid NCCL watchdog kills and enable inductor FX graph caching.
- Update torchspec env key allowlist: drop legacy `NCCL_IB_*` keys, add the two new torch/inductor env vars.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 313e3c0c78
Pull request overview
This PR updates TorchSpec’s sglang integration workflow to support cross-engine tensor dump comparisons and hardens Ray worker environment propagation for long-running training steps.
Changes:
- Rewrites `tests/test_sglang_engine_integration.py` into structured scenarios that fetch Mooncake tensors and dump them to `.pt` files, with aux layer IDs auto-derived from model config.
- Propagates `TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC` (default 1800s) and `TORCHINDUCTOR_FX_GRAPH_CACHE` (default enabled) into the Ray training worker `runtime_env`.
- Updates the Ray env allowlist to include the two new Torch/TorchInductor env vars and drop legacy NCCL IB keys.
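The propagation described above might be sketched as follows; the helper name and merge strategy are assumptions for illustration, not the actual `torchspec/ray/train_group.py` code:

```python
import os


def build_runtime_env(base_env=None):
    """Merge safer defaults for the NCCL heartbeat timeout and Inductor FX
    graph caching into the env_vars forwarded to Ray training workers.
    Caller- or environment-provided values take precedence over defaults."""
    env_vars = dict(base_env or {})
    env_vars.setdefault(
        "TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC",
        os.environ.get("TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC", "1800"),
    )
    env_vars.setdefault(
        "TORCHINDUCTOR_FX_GRAPH_CACHE",
        os.environ.get("TORCHINDUCTOR_FX_GRAPH_CACHE", "1"),
    )
    # Ray accepts env vars for workers via runtime_env={"env_vars": {...}}.
    return {"env_vars": env_vars}
```

Raising the heartbeat timeout keeps the NCCL watchdog from killing workers during long training steps, while the FX graph cache avoids recompiling inductor graphs across runs.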
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| torchspec/utils/env.py | Adjusts the env var allowlist forwarded to Ray actors (adds Torch/NCCL + Inductor knobs). |
| torchspec/ray/train_group.py | Ensures Ray training workers receive safer defaults for NCCL heartbeat + Inductor FX graph cache. |
| tests/test_sglang_engine_integration.py | Reworks the sglang integration script to run multiple scenarios and dump Mooncake tensors for cross-engine comparison. |
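The allowlist adjustment in `torchspec/utils/env.py` could look roughly like this sketch; the set contents and helper name are illustrative, not the real module:

```python
import os

# Hypothetical allowlist of env vars forwarded to Ray actors; per the PR,
# legacy NCCL_IB_* keys are dropped and the two new Torch knobs are added.
RAY_ENV_ALLOWLIST = frozenset({
    "TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC",
    "TORCHINDUCTOR_FX_GRAPH_CACHE",
})


def forwarded_env(environ=None):
    """Return only allowlisted environment variables for Ray actors."""
    environ = os.environ if environ is None else environ
    return {k: v for k, v in environ.items() if k in RAY_ENV_ALLOWLIST}
```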
```python
if _REPO_ROOT not in sys.path:
    sys.path.insert(0, _REPO_ROOT)
```
This module mutates global interpreter state at import time (inserting repo root into sys.path). Since this file lives under tests/ and matches pytest’s default collection pattern (test_*.py), importing it during test discovery can unexpectedly change import resolution for the entire test session. Consider moving the sys.path manipulation into main() / the main block (or renaming/relocating the script so it isn’t collected by pytest).
Suggested change:
```diff
-if _REPO_ROOT not in sys.path:
-    sys.path.insert(0, _REPO_ROOT)
+def _ensure_repo_root_on_sys_path() -> None:
+    if _REPO_ROOT not in sys.path:
+        sys.path.insert(0, _REPO_ROOT)
+
+if __name__ == "__main__":
+    _ensure_repo_root_on_sys_path()
```
```python
# Start mooncake master first:
# mooncake_master --port 50051 &
# etcd --listen-client-urls http://0.0.0.0:8090 --advertise-client-urls http://localhost:8090 &
```
The usage header shows mooncake_master/etcd ports 50051/8090, but the script defaults MOONCAKE_MASTER_PORT/MOONCAKE_METADATA_PORT to 51135/8763 below. This mismatch can confuse users following the instructions. Update the usage text to match the defaults, or explicitly state that the ports are controlled via MOONCAKE_* env vars (and that the shown command is just an example).
Suggested change:
```diff
-# Start mooncake master first:
-# mooncake_master --port 50051 &
-# etcd --listen-client-urls http://0.0.0.0:8090 --advertise-client-urls http://localhost:8090 &
+# Start mooncake services first. The ports are controlled by
+# MOONCAKE_MASTER_PORT / MOONCAKE_METADATA_PORT; the commands below use
+# this script's defaults:
+# mooncake_master --port 51135 &
+# etcd --listen-client-urls http://0.0.0.0:8763 --advertise-client-urls http://localhost:8763 &
```
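A hedged sketch of how the script presumably resolves these ports from the environment (the helper name is illustrative; the default values are taken from the review comment above):

```python
import os


def resolve_mooncake_ports(environ=None):
    """Resolve mooncake ports from MOONCAKE_* env vars, falling back to the
    script defaults (51135 for the master, 8763 for etcd metadata)."""
    environ = os.environ if environ is None else environ
    master = int(environ.get("MOONCAKE_MASTER_PORT", "51135"))
    metadata = int(environ.get("MOONCAKE_METADATA_PORT", "8763"))
    return master, metadata
```

With this pattern, users following the usage header can override both ports without editing the script.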
```diff
 spec_training_response_length=[5, 10, 8],
-sampling_params={"max_new_tokens": 32},
 spec_training_data_id=data_ids,
+sampling_params={"max_new_tokens": 1},
```
This script sets sampling_params={"max_new_tokens": 1}, but TorchSpec’s SglEngine spec_training path uses max_new_tokens=0 (prefill-only). Generating extra tokens can change what the engine writes to Mooncake and can make the seq_len/shape assertions in this test incorrect or flaky. Align this to max_new_tokens=0 (or derive seq_len from meta_info["prompt_tokens"] consistently when validating shapes).
Suggested change:
```diff
-sampling_params={"max_new_tokens": 1},
+sampling_params={"max_new_tokens": 0},
```
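A minimal sketch of the validation the comment suggests, assuming the prefill-only convention and that `meta_info["prompt_tokens"]` is available in the generate result (both names are taken from the comment, not verified against the sglang API):

```python
def expected_dump_seq_len(meta_info: dict) -> int:
    """Under the prefill-only spec_training path (max_new_tokens=0), the
    Mooncake-dumped tensors are assumed to span exactly the prompt tokens."""
    return meta_info["prompt_tokens"]


def check_dump_seq_len(tensor_seq_len: int, meta_info: dict) -> bool:
    # Flag dumps whose sequence dimension disagrees with the prompt length,
    # e.g. because extra tokens were generated with max_new_tokens > 0.
    return tensor_seq_len == expected_dump_seq_len(meta_info)
```

Deriving the expected length from `meta_info` keeps the shape assertions stable even if sampling parameters change.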
```python
results = engine.generate(
    prompt=text_prompts,
    spec_training_data_id=data_ids,
    sampling_params={"max_new_tokens": 1},
    return_hidden_states=True,
)
```
Same as above: using sampling_params={"max_new_tokens": 1} diverges from TorchSpec’s prefill-only spec_training behavior (max_new_tokens=0) and may change Mooncake tensor shapes/content relative to the asserted prompt token length. Consider setting this to 0 to keep the dump comparable to training-time behavior.
…gured value
MOONCAKE_MASTER_SERVER was always set to {LOCAL_IP}:51135 regardless of
the resolved MOONCAKE_MASTER_PORT value. Build it from the actual port
so overriding MOONCAKE_MASTER_PORT (e.g. 50051) propagates correctly.
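The fix described above might be sketched like this (the helper name is illustrative; the default port comes from the commit message):

```python
import os


def mooncake_master_server(local_ip: str, environ=None) -> str:
    """Build MOONCAKE_MASTER_SERVER from the resolved port rather than a
    hardcoded 51135, so a MOONCAKE_MASTER_PORT override propagates."""
    environ = os.environ if environ is None else environ
    port = environ.get("MOONCAKE_MASTER_PORT", "51135")
    return f"{local_ip}:{port}"
```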
Avoids doubling memory usage when the eval cache checkpoint is large, since mmap lazily pages in tensors instead of reading the full file.
Summary
- Rewrites the sglang integration test into structured scenarios that dump all tensors to `.pt` files for cross-engine comparison (e.g. vs vllm). Auto-resolves aux layer IDs from model config instead of hardcoding.
- Propagates `TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC` (default 1800s) and `TORCHINDUCTOR_FX_GRAPH_CACHE` (enabled) to Ray training workers to prevent NCCL watchdog kills during long training steps and speed up inductor compilations.
- Updates the env key allowlist: drop legacy `NCCL_IB_*` keys, add the two new torch/inductor env vars.

Test plan
- Ran `python tests/test_sglang_engine_integration.py --model Qwen/Qwen3-8B --tp 4` with mooncake services running; verified tensor dumps appear in `./tensor_dumps/`
- Verified the new env vars reach Ray workers (`TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC`, `TORCHINDUCTOR_FX_GRAPH_CACHE`) by inspecting worker logs
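Once dumps exist from both engines, a comparison pass might look like this sketch; the paths, key layout, and tolerances are assumptions, not part of the test itself:

```python
import torch


def compare_dumps(sglang_path: str, vllm_path: str,
                  rtol: float = 1e-3, atol: float = 1e-3) -> bool:
    """Load two .pt tensor dumps and check every shared key for matching
    shape and values within tolerance; returns True only if all match."""
    a = torch.load(sglang_path, map_location="cpu")
    b = torch.load(vllm_path, map_location="cpu")
    all_match = True
    for key in sorted(set(a) & set(b)):
        same = a[key].shape == b[key].shape and torch.allclose(
            a[key].float(), b[key].float(), rtol=rtol, atol=atol)
        print(f"{key}: shape={tuple(a[key].shape)} match={same}")
        all_match = all_match and same
    return all_match
```

Casting to float before `torch.allclose` sidesteps dtype mismatches when one engine dumps bf16 and the other fp16.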