Conversation
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
📝 WalkthroughWalkthroughUpdates the Gym submodule pointer, adds Docker build support for virtual environment prefetching via new ENV/ARG variables and a Python script that performs dry-run initialization, introduces helper functions to retrieve UV cache and venv directories, and integrates these into the GRPO NeMo Gym configuration flow. Changes
Sequence Diagram(s)sequenceDiagram
actor User
participant prefetch as prefetch_venvs.py
participant Ray
participant NemoGymEnv as NeMo Gym<br/>Environment
User->>prefetch: Execute with config paths
prefetch->>prefetch: Parse CLI arguments
loop For each config file
prefetch->>prefetch: Load YAML config
prefetch->>Ray: ray.init()
prefetch->>NemoGymEnv: Create with dry_run=True<br/>(UV paths injected)
NemoGymEnv->>NemoGymEnv: Initialize environment
prefetch->>NemoGymEnv: Call health_check()
NemoGymEnv-->>prefetch: Health check result
prefetch->>NemoGymEnv: Force kill actor
prefetch->>prefetch: Log success/failure
Ray->>Ray: Shutdown
end
prefetch->>prefetch: Aggregate results
prefetch-->>User: Print summary & exit
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (3)
examples/nemo_gym/prefetch_venvs.py (2)
92-94: Broadexcept Exceptionviolates the coding guidelines
ExceptioncatchesKeyboardInterruptandSystemExitsubclasses aren't excluded, and the try body spans config loading, Ray actor creation, and health-check wait — very different failure modes. Per coding guidelines, the except clause should be limited to the smallest set of errors expected.♻️ Proposed narrowing
- except Exception as e: + except (KeyError, FileNotFoundError, ray.exceptions.RayError, Exception) as e:Or, at a minimum, re-raise
KeyboardInterrupt:except Exception as e: print(f"Error processing {config_path}: {e}") + if isinstance(e, KeyboardInterrupt): + raise failed.append((config_path, str(e)))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/nemo_gym/prefetch_venvs.py` around lines 92 - 94, The current broad "except Exception as e" around the config load / Ray actor creation / health-check is too wide; replace it with targeted except blocks for the specific failures you expect (e.g., FileNotFoundError, json.JSONDecodeError or config parsing errors, ray.exceptions.RayError for actor creation, TimeoutError for health-check waits) and append failures to failed only in those handlers; ensure you do not swallow KeyboardInterrupt or SystemExit by either adding an explicit "except (KeyboardInterrupt, SystemExit): raise" or by not including them in the caught exception tuples; update the except references around config_path and failed so they only run in the specific handlers.
83-87: "DONT MERGE UNTIL FIXED" comment must be resolved before mergingThe developer explicitly flagged this as a merge blocker. If
ray.killis an acceptable long-term solution for this prefetch context (venvs are already written to disk before the kill), the commented-outshutdown()call and the warning comment should simply be removed so the intent is clear.Do you want me to open a new issue to track the underlying
NemoGym.shutdown()hang, so that the workaround here can be committed cleanly?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/nemo_gym/prefetch_venvs.py` around lines 83 - 87, Decide whether ray.kill(nemo_gym) is an acceptable permanent solution; if it is, remove the "TODO: Hangs... (DONT MERGE...)" comment and the commented-out ray.get(nemo_gym.shutdown.remote()) call so the intent is clear and keep the ray.kill(nemo_gym) line (ensure venvs are flushed to disk before killing); if ray.kill is only a temporary workaround, replace ray.kill(nemo_gym) with an actual fix for NemoGym.shutdown() so you call ray.get(nemo_gym.shutdown.remote()) from code (or add a retry/timeout wrapper around NemoGym.shutdown) and open an issue to track the underlying shutdown hang for future follow-up.examples/nemo_gym/run_grpo_nemo_gym.py (1)
212-218: Duplicated UV path injection — consider extracting a shared helperThe same 6-line pattern (get uv_cache_dir / uv_venv_dir + setdefault) also appears in
examples/nemo_gym/prefetch_venvs.pylines 63–68. Extracting it intonemo_rl/environments/nemo_gym.pywould make both call sites a single line and keep the helpers co-located withget_nemo_gym_uv_cache_dir/get_nemo_gym_venv_dir.♻️ Proposed refactor — add helper to nemo_rl/environments/nemo_gym.py
In
nemo_rl/environments/nemo_gym.py:+def enrich_nemo_gym_dict_with_uv_paths(nemo_gym_dict: Dict[str, Any]) -> None: + """Inject UV cache and venv directories into a NeMo Gym config dict when available. + + Args: + nemo_gym_dict: Dict to enrich in-place with uv_cache_dir and uv_venv_dir + when the corresponding environment variables are set. + """ + uv_cache_dir = get_nemo_gym_uv_cache_dir() + if uv_cache_dir is not None: + nemo_gym_dict.setdefault("uv_cache_dir", uv_cache_dir) + uv_venv_dir = get_nemo_gym_venv_dir() + if uv_venv_dir is not None: + nemo_gym_dict.setdefault("uv_venv_dir", uv_venv_dir)Then in both callers:
- uv_cache_dir = get_nemo_gym_uv_cache_dir() - if uv_cache_dir is not None: - nemo_gym_dict.setdefault("uv_cache_dir", uv_cache_dir) - uv_venv_dir = get_nemo_gym_venv_dir() - if uv_venv_dir is not None: - nemo_gym_dict.setdefault("uv_venv_dir", uv_venv_dir) + enrich_nemo_gym_dict_with_uv_paths(nemo_gym_dict)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/nemo_gym/run_grpo_nemo_gym.py` around lines 212 - 218, Duplicate logic that fetches UV paths (get_nemo_gym_uv_cache_dir / get_nemo_gym_venv_dir) and calls nemo_gym_dict.setdefault is repeated; extract this into a single helper function (e.g., inject_nemo_gym_uv_paths(nemo_gym_dict)) placed in nemo_rl/environments/nemo_gym.py that calls get_nemo_gym_uv_cache_dir and get_nemo_gym_venv_dir and sets defaults on the passed nemo_gym_dict, then replace the 6-line blocks in both callers with a single call to inject_nemo_gym_uv_paths(nemo_gym_dict) to keep the helpers co-located and remove duplication.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docker/Dockerfile`:
- Around line 155-162: The hermetic stage runs a prefetch step referencing
examples/nemo_gym/prefetch_venvs.py and NEMO_GYM_PREFETCH_CONFIGS but that
script and nemo_rl subpackages aren't copied into hermetic, causing the build to
fail; remove the prefetch RUN block from the hermetic stage and relocate it into
the release stage (after the full source copy) and re-declare ARG
NEMO_GYM_PREFETCH_CONFIGS there so the prefetch invocation (UV_LINK_MODE=symlink
uv run python examples/nemo_gym/prefetch_venvs.py $NEMO_GYM_PREFETCH_CONFIGS)
executes only when the full source (including nemo_rl and examples) is present.
In `@examples/nemo_gym/prefetch_venvs.py`:
- Line 1: The file's copyright header currently reads "# Copyright (c) 2025,
NVIDIA CORPORATION. All rights reserved."—update that header to use the current
year by replacing "2025" with "2026" so the top-of-file copyright line reflects
2026.
- Around line 79-80: The current blocking call
ray.get(nemo_gym.health_check.remote()) can hang indefinitely during Docker
builds; modify the call to include a generous timeout (e.g., timeout=300) and
wrap it in a try/except to catch ray.exceptions.GetTimeoutError (and other Ray
exceptions) so you can log/raise a clear error or exit cleanly instead of
stalling the build; locate the call to nemo_gym.health_check.remote() in
prefetch_venvs.py and update that invocation and its error handling accordingly.
In `@nemo_rl/environments/nemo_gym.py`:
- Line 38: The subprocess call uses a relative executable name and no timeout;
change the call site where subprocess.check_output(["uv", "cache", "dir"]) is
invoked so it resolves an absolute executable path (use shutil.which to find
"uv" and raise/log if missing) and replace check_output with subprocess.run (or
check_output with timeout) supplying a reasonable timeout and check=True,
capturing stdout/stderr; also handle and log FileNotFoundError and
subprocess.TimeoutExpired so the caller doesn't block or get spoofed by PATH.
---
Nitpick comments:
In `@examples/nemo_gym/prefetch_venvs.py`:
- Around line 92-94: The current broad "except Exception as e" around the config
load / Ray actor creation / health-check is too wide; replace it with targeted
except blocks for the specific failures you expect (e.g., FileNotFoundError,
json.JSONDecodeError or config parsing errors, ray.exceptions.RayError for actor
creation, TimeoutError for health-check waits) and append failures to failed
only in those handlers; ensure you do not swallow KeyboardInterrupt or
SystemExit by either adding an explicit "except (KeyboardInterrupt, SystemExit):
raise" or by not including them in the caught exception tuples; update the
except references around config_path and failed so they only run in the specific
handlers.
- Around line 83-87: Decide whether ray.kill(nemo_gym) is an acceptable
permanent solution; if it is, remove the "TODO: Hangs... (DONT MERGE...)"
comment and the commented-out ray.get(nemo_gym.shutdown.remote()) call so the
intent is clear and keep the ray.kill(nemo_gym) line (ensure venvs are flushed
to disk before killing); if ray.kill is only a temporary workaround, replace
ray.kill(nemo_gym) with an actual fix for NemoGym.shutdown() so you call
ray.get(nemo_gym.shutdown.remote()) from code (or add a retry/timeout wrapper
around NemoGym.shutdown) and open an issue to track the underlying shutdown hang
for future follow-up.
In `@examples/nemo_gym/run_grpo_nemo_gym.py`:
- Around line 212-218: Duplicate logic that fetches UV paths
(get_nemo_gym_uv_cache_dir / get_nemo_gym_venv_dir) and calls
nemo_gym_dict.setdefault is repeated; extract this into a single helper function
(e.g., inject_nemo_gym_uv_paths(nemo_gym_dict)) placed in
nemo_rl/environments/nemo_gym.py that calls get_nemo_gym_uv_cache_dir and
get_nemo_gym_venv_dir and sets defaults on the passed nemo_gym_dict, then
replace the 6-line blocks in both callers with a single call to
inject_nemo_gym_uv_paths(nemo_gym_dict) to keep the helpers co-located and
remove duplication.
| # Prefetch NeMo Gym internal venvs (for gym servers like code_gen, math, etc.) | ||
| if [[ -n "${NEMO_GYM_PREFETCH_CONFIGS:-}" ]]; then | ||
| UV_LINK_MODE=symlink uv run python examples/nemo_gym/prefetch_venvs.py $NEMO_GYM_PREFETCH_CONFIGS | ||
| fi | ||
|
|
||
| # Remove /tmp/ray because the previous script starts up a local ray cluster which creates a session | ||
| # that we can just clean up. | ||
| rm -rf /tmp/ray |
There was a problem hiding this comment.
Prefetch script and nemo_rl source are both absent in the hermetic stage — build breaks by default
Two problems combine to break the Docker build:
-
Missing file:
examples/nemo_gym/prefetch_venvs.pyis never copied in thehermeticstage (the COPY commands at lines 120–126 only bring inpyproject.toml,uv.lock, twonemo_rlentry files, andtools/). Python will exit with "No such file or directory", failing theRUNlayer. -
Incomplete
nemo_rlsource: Even if the file were present, the script importsnemo_rl.environments.nemo_gym,nemo_rl.distributed.virtual_cluster, etc. None of these sub-packages are present inhermetic(onlynemo_rl/__init__.pyandnemo_rl/package_info.pyare copied), so the imports would raiseModuleNotFoundError.
Because once a build argument is declared in a stage it is automatically inherited by child stages, NEMO_GYM_PREFETCH_CONFIGS has its non-empty default value when the if condition is evaluated in hermetic, so this code path executes on every default build.
Recommended fix: Move the prefetch block to the release stage (after line 194, where the full source tree is copied), and re-declare the ARG there — matching the pattern already used for nemo_rl/utils/prefetch_venvs.py.
🐛 Proposed fix — move prefetch to `release` stage
Remove lines 155–162 from the hermetic RUN block, then in the release stage add:
# Re-declare build args for this stage
+ARG NEMO_GYM_PREFETCH_CONFIGS
ARG SKIP_VLLM_BUILD
...
# (after the full COPY at line 194 and after uv sync installs nemo_rl)
+# Prefetch NeMo Gym internal venvs (for gym servers like code_gen, math, etc.)
+RUN if [[ -n "${NEMO_GYM_PREFETCH_CONFIGS:-}" ]]; then \
+ UV_LINK_MODE=symlink uv run python examples/nemo_gym/prefetch_venvs.py \
+ $NEMO_GYM_PREFETCH_CONFIGS; \
+ rm -rf /tmp/ray; \
+ fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docker/Dockerfile` around lines 155 - 162, The hermetic stage runs a prefetch
step referencing examples/nemo_gym/prefetch_venvs.py and
NEMO_GYM_PREFETCH_CONFIGS but that script and nemo_rl subpackages aren't copied
into hermetic, causing the build to fail; remove the prefetch RUN block from the
hermetic stage and relocate it into the release stage (after the full source
copy) and re-declare ARG NEMO_GYM_PREFETCH_CONFIGS there so the prefetch
invocation (UV_LINK_MODE=symlink uv run python
examples/nemo_gym/prefetch_venvs.py $NEMO_GYM_PREFETCH_CONFIGS) executes only
when the full source (including nemo_rl and examples) is present.
| @@ -0,0 +1,134 @@ | |||
| # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. | |||
There was a problem hiding this comment.
Copyright year should be 2026 for this new file
As per coding guidelines, the copyright header for new files should carry the current year (2026).
✏️ Proposed fix
-# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. | |
| # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/nemo_gym/prefetch_venvs.py` at line 1, The file's copyright header
currently reads "# Copyright (c) 2025, NVIDIA CORPORATION. All rights
reserved."—update that header to use the current year by replacing "2025" with
"2026" so the top-of-file copyright line reflects 2026.
| print("Waiting for NeMo Gym to finish initialization...") | ||
| ray.get(nemo_gym.health_check.remote()) |
There was a problem hiding this comment.
ray.get() without a timeout can cause an infinite hang in Docker builds
If NemoGym.__init__ hangs during dry_run initialization (e.g., network issue, external process deadlock), the Docker build layer stalls with no way to recover.
🛡️ Proposed fix — add a generous build-time timeout
- ray.get(nemo_gym.health_check.remote())
+ ray.get(nemo_gym.health_check.remote(), timeout=1800) # 30-minute cap for Docker builds🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/nemo_gym/prefetch_venvs.py` around lines 79 - 80, The current
blocking call ray.get(nemo_gym.health_check.remote()) can hang indefinitely
during Docker builds; modify the call to include a generous timeout (e.g.,
timeout=300) and wrap it in a try/except to catch ray.exceptions.GetTimeoutError
(and other Ray exceptions) so you can log/raise a clear error or exit cleanly
instead of stalling the build; locate the call to nemo_gym.health_check.remote()
in prefetch_venvs.py and update that invocation and its error handling
accordingly.
| """ | ||
| if not os.environ.get("NRL_CONTAINER"): | ||
| return None | ||
| return subprocess.check_output(["uv", "cache", "dir"]).decode().strip() |
There was a problem hiding this comment.
Partial executable path for uv subprocess (S607)
subprocess.check_output(["uv", ...]) resolves uv via PATH, making the call susceptible to PATH-based spoofing. Additionally there is no timeout, so if uv hangs the caller blocks indefinitely.
🛡️ Proposed fix
+import shutil
...
- return subprocess.check_output(["uv", "cache", "dir"]).decode().strip()
+ uv_bin = shutil.which("uv") or "uv"
+ return subprocess.check_output([uv_bin, "cache", "dir"], timeout=30).decode().strip()🧰 Tools
🪛 Ruff (0.15.1)
[error] 38-38: Starting a process with a partial executable path
(S607)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/environments/nemo_gym.py` at line 38, The subprocess call uses a
relative executable name and no timeout; change the call site where
subprocess.check_output(["uv", "cache", "dir"]) is invoked so it resolves an
absolute executable path (use shutil.which to find "uv" and raise/log if
missing) and replace check_output with subprocess.run (or check_output with
timeout) supplying a reasonable timeout and check=True, capturing stdout/stderr;
also handle and log FileNotFoundError and subprocess.TimeoutExpired so the
caller doesn't block or get spoofed by PATH.
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Adds support to cache the venvs of gym inside the container and give the option to configure different venvs if needed.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use thisBefore your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Chores