Conversation
📝 Walkthrough
Introduces GDPO multi-reward training support for math problem solving, including GSM8K dataset integration, dedicated data processors, Ray actor-based math verification environments for multi-reward computation, and configuration files enabling end-to-end GDPO workflows with asynchronous training options.
Sequence Diagram(s)
sequenceDiagram
actor User as User / CLI
participant Script as run_gdpo_gsm8k.py
participant Config as Config Loader
participant DataSetup as setup_data
participant Proc as DataProcessor<br/>(math_gdpo)
participant Env as MathMultiReward<br/>Environment
participant Verify as HFMultiRewardVerify<br/>Worker
participant Train as grpo_train<br/>(with GDPO)
User->>Script: Execute with config
Script->>Config: Load + merge overrides
Config-->>Script: Hydra config
Script->>DataSetup: setup_data(tokenizer, config)
DataSetup->>Proc: Create processor
Proc-->>DataSetup: math_gdpo_processor
DataSetup->>Env: Initialize environment
Env->>Verify: Spawn workers
Verify-->>Env: Ready
DataSetup-->>Script: datasets, envs, processors
Script->>Train: grpo_train(config, ...)
loop Training iterations
Train->>Proc: Process batch
Proc-->>Train: tokenized messages + task
Train->>Env: step(message_log_batch)
Env->>Verify: verify(predictions)
Verify-->>Env: reward1, reward2, ...
Env-->>Train: EnvironmentReturn(multi_rewards)
Train->>Train: _get_reward_component_keys()
Train->>Train: per-component baseline & advantage
Train->>Train: sum advantages<br/>(GDPO)
Train->>Train: normalize combined advantages
end
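To make the GDPO advantage step in the diagram concrete, here is a minimal, self-contained sketch of per-component baselines followed by summed and normalized advantages. The grouping layout, the 1e-6 epsilon, and the global normalization are assumptions for illustration only; the actual logic lives in nemo_rl/algorithms/grpo.py.

```python
import torch


def gdpo_advantages(rewards: torch.Tensor, group_size: int) -> torch.Tensor:
    """Illustrative GDPO-style advantage: per-component baseline, sum, then normalize.

    rewards: (batch, num_components), with samples of the same prompt stored in
    contiguous blocks of `group_size` rows (an assumption of this sketch).
    """
    batch, num_components = rewards.shape
    grouped = rewards.reshape(batch // group_size, group_size, num_components)
    # Per-component baseline: mean over the samples of the same prompt.
    baseline = grouped.mean(dim=1, keepdim=True)
    per_component_adv = grouped - baseline
    # GDPO: sum per-component advantages, then normalize the combined value.
    combined = per_component_adv.sum(dim=-1)
    combined = (combined - combined.mean()) / (combined.std() + 1e-6)
    return combined.reshape(batch)


# Toy usage: 2 prompts x 4 samples, 3 reward components (e.g. correctness, int, format).
adv = gdpo_advantages(torch.rand(8, 3), group_size=4)
print(adv.shape)  # torch.Size([8])
```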
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 1 | ❌ 3
❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 20
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
nemo_rl/experience/rollouts.py (3)
287-309: ⚠️ Potential issue | 🟡 Minor
Remove debug prints/commented-out code in hot path.
Lines 287 and 322 are debug prints, and Lines 305–309 are commented-out debug code; these will spam logs and violate the commented-out code rule. Please remove or gate behind a debug logger.
🧹 Suggested cleanup
- print(f"indices {indices}") @@ - # print(f"task_rewards {task_rewards.shape}") - # print(f"task_rewards[i] {len(task_rewards[i])}") @@ - # print(f"terminateds {terminateds}") @@ - print(f"rewards stack {rewards.shape}")As per coding guidelines, "For commented-out code, include a comment describing its usage and why it is commented out; otherwise remove as debug comment before merging."
Also applies to: 322-322
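If some of these diagnostics are still useful during development, one option the comment alludes to is gating them behind a module-level logger instead of print. A minimal sketch; the logger name and message are placeholders, not the repository's actual logging setup:

```python
import logging

logger = logging.getLogger(__name__)


def log_reward_shapes(rewards) -> None:
    # Only format the message when debug logging is actually enabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("stacked rewards shape: %s", tuple(rewards.shape))
```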
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/experience/rollouts.py` around lines 287 - 309, Remove the stray debug prints and commented-out debug snippets in the hot path: delete the print(f"indices {indices}) and the later print/print-debug comments near the task_rewards/terminateds block, or replace them with proper conditional debug logging (e.g., use logger.debug(...) and guard with logger.isEnabledFor(logging.DEBUG)) so the logic in functions handling indices, all_indices_order, all_rewards, and all_env_observations is not spammed by prints; ensure no commented-out debug lines remain unless accompanied by a short comment explaining why they are kept.
369-455: ⚠️ Potential issue | 🟠 Major
Hard-coded reward count breaks variable/single-reward envs.
Lines 369–455 assume exactly 3 reward components and a 2D reward tensor. Single-reward outputs (1D) or different component counts will fail. Initialize multi_rewards dynamically from env_output.rewards and guard 1D rewards.
✅ Suggested fix (dynamic reward shape)
- number_of_rewards = 3
- multi_rewards = torch.zeros(batch_size, number_of_rewards, dtype=torch.float32)
+ multi_rewards: torch.Tensor | None = None
@@
- multi_rewards[active_indices] += env_output.rewards
- total_rewards[active_indices] += env_output.rewards.sum(dim=1)
+ env_rewards = env_output.rewards
+ if env_rewards.ndim == 1:
+     env_rewards = env_rewards.unsqueeze(1)
+ if multi_rewards is None:
+     multi_rewards = torch.zeros(
+         batch_size, env_rewards.shape[1], dtype=torch.float32
+     )
+ multi_rewards[active_indices] += env_rewards
+ total_rewards[active_indices] += env_rewards.sum(dim=1)
@@
- num_reward_components = multi_rewards.shape[1]
- for i in range(num_reward_components):
-     current_batch[f"reward{i + 1}"] = multi_rewards[:, i].clone()
+ if multi_rewards is not None:
+     for i in range(multi_rewards.shape[1]):
+         current_batch[f"reward{i + 1}"] = multi_rewards[:, i].clone()

Also applies to: 533-536
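As a standalone illustration of the same idea, a small helper can coerce environment rewards to a 2D (batch, components) tensor before accumulation, so single-reward (1D) and multi-reward (2D) environments share one code path. This is a hypothetical helper, not the project's API:

```python
import torch


def as_2d_rewards(env_rewards: torch.Tensor) -> torch.Tensor:
    """Return rewards as (batch, num_components), treating 1D input as one component."""
    if env_rewards.ndim == 1:
        return env_rewards.unsqueeze(1)
    return env_rewards


print(as_2d_rewards(torch.tensor([1.0, 0.0])).shape)  # torch.Size([2, 1])
print(as_2d_rewards(torch.rand(2, 3)).shape)          # torch.Size([2, 3])
```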
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/experience/rollouts.py` around lines 369 - 455, The code assumes three reward components by hardcoding number_of_rewards=3 and indexing multi_rewards, which breaks when calculate_rewards returns a 1D single-reward tensor or a different number of components; change initialization of multi_rewards to derive its shape from the first env_output.rewards (e.g., after the first calculate_rewards call) and reshape/expand 1D rewards to 2D before adding; update the accumulation lines (multi_rewards[active_indices] += env_output.rewards and total_rewards[active_indices] += env_output.rewards.sum(dim=1)) to handle both 1D and 2D cases by detecting rewards.dim() and using .unsqueeze(-1) or .sum(dim=1) accordingly so active_indices, multi_rewards, total_rewards and calculate_rewards/env_output.rewards remain consistent (also apply the same change to the later occurrence around the code noted at lines 533-536).
315-323: ⚠️ Potential issue | 🟠 Major
Remove torch.stack and handle variable reward shapes explicitly.
Line 321 uses torch.stack, which requires all reward tensors to have identical shapes. However, the EnvironmentReturn interface (interfaces.py:48) explicitly permits variable reward shapes with the comment "## This could be of different shape". If different environment types return rewards with different shapes, this will raise RuntimeError at runtime.
Recommendations:
- Pad all rewards to a common shape before stacking, or
- Collect rewards grouped by shape/type, or
- Use torch.cat with flattening if shape variation is expected

Additionally, remove the debug print() statement at line 322 and the commented-out code at line 320 (which lacks a justification comment per coding guidelines).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/experience/rollouts.py` around lines 315 - 323, The code in rollouts.py uses torch.stack on all_rewards (variable name all_rewards, sorted_indices) which fails when reward tensors have different shapes per the EnvironmentReturn interface; replace the torch.stack call by one of: pad each reward to a common shape before stacking, group rewards by shape and keep separate tensors, or keep rewards as a list (no stacking) and handle downstream consumers accordingly; remove the debug print("rewards stack ...") and delete the commented-out rewards line to satisfy guidelines; ensure references to all_env_observations and env_observations remain correctly ordered using sorted_indices and update any downstream code expecting a single tensor to accept the new padded/grouped/list structure.
nemo_rl/algorithms/grpo.py (1)
2272-2284: ⚠️ Potential issue | 🟡 Minor
Async GRPO silently ignores GDPO; add validation or implementation.
Async training (lines 2272-2284) does not check the use_gdpo flag when computing advantages, unlike the sync path (lines 1254-1340) which implements per-component baselines. If both use_gdpo=true and async_grpo.enabled=true are set, GDPO will not be applied without error or warning. Either implement GDPO in async mode, add config validation to prevent this combination, or document it as intentionally unsupported.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/algorithms/grpo.py` around lines 2272 - 2284, Async GRPO path silently ignores the use_gdpo flag when computing advantages; add an explicit configuration validation before the async advantage computation (the block using prompt_only_ids, rewards, calculate_baseline_and_std_per_prompt and producing advantages) to detect the unsupported combination master_config["grpo"]["use_gdpo"] == True and async_grpo.enabled == True and fail fast: raise a clear error or log+exit instructing the user to disable use_gdpo or async_grpo, so GDPO is not silently skipped; alternatively, if you prefer to support it, implement the per-component baseline logic used in the sync path for GDPO into the async branch where baseline/std are computed.
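One possible shape for the fail-fast validation suggested above; the exact config keys (grpo.use_gdpo, async_grpo.enabled) are taken from this comment and may differ from the real schema:

```python
def validate_gdpo_config(master_config: dict) -> None:
    """Reject the currently unsupported use_gdpo + async_grpo combination early."""
    use_gdpo = master_config.get("grpo", {}).get("use_gdpo", False)
    async_enabled = master_config.get("async_grpo", {}).get("enabled", False)
    if use_gdpo and async_enabled:
        raise ValueError(
            "grpo.use_gdpo=true is not supported with async_grpo.enabled=true; "
            "disable one of the two options."
        )
```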
🧹 Nitpick comments (8)
nemo_rl/experience/rollouts.py (1)
902-920: Remove leftover debug print.
Line 919 prints "testing new code" unconditionally; please drop it or guard behind a debug logger.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/experience/rollouts.py` around lines 902 - 920, Remove the leftover unconditional debug print: delete the print("testing new code") statement (located immediately after constructing final_batch_dict in rollouts.py) or replace it with a proper logger.debug call if you need runtime tracing; ensure you reference the same spot where final_batch_dict is built so message_log/total_reward/idx/truncated handling remains unchanged.
examples/configs/grpo_math_1B.yaml (1)
272-276: Confirm WandB logging is intended here.
Line 275 enables WandB; ensure credentials and data governance expectations are documented for this example config.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/configs/grpo_math_1B.yaml` around lines 272 - 276, The example enables WandB via the logger key wandb_enabled: true; confirm whether WandB telemetry is intended for this example and either disable it by setting logger.wandb_enabled to false or add/update documentation/comments describing required WandB credentials, how to provide them, and any data governance/privacy expectations for this config (update the example config comment and relevant README or examples docs to reference wandb_enabled and credential setup).
nemo_rl/algorithms/grpo.py (1)
1250-1257: Avoid implicit defaults for use_gdpo in code.
Line 1254 uses .get(..., False), which hides missing config keys. Prefer requiring the key and setting defaults in YAML.
💡 Suggested fix
- use_gdpo = master_config["grpo"].get("use_gdpo", False)
+ use_gdpo = master_config["grpo"]["use_gdpo"]

As per coding guidelines, "YAML is the single source of truth for configuration defaults; do not set non-None defaults in code for configuration values."
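A hypothetical helper for the same pattern, surfacing a missing key with a pointer to the YAML instead of silently defaulting:

```python
def require_config_key(section: dict, key: str, section_name: str):
    """Fetch a required config value, failing fast with a descriptive error."""
    try:
        return section[key]
    except KeyError as err:
        raise KeyError(
            f"Missing required config key '{section_name}.{key}'; "
            "set it explicitly in the YAML config."
        ) from err


# e.g. use_gdpo = require_config_key(master_config["grpo"], "use_gdpo", "grpo")
```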
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/algorithms/grpo.py` around lines 1250 - 1257, The code currently uses master_config["grpo"].get("use_gdpo", False) which silently falls back to False when the config key is missing; change this to require the key from YAML and fail fast: replace the .get call with an explicit access (e.g., master_config["grpo"]["use_gdpo"]) or add a clear validation step on master_config["grpo"] that raises a descriptive error if "use_gdpo" is missing, so that functions using use_gdpo (the block around rewards = repeated_batch["total_reward"] and the subsequent use_multi_reward_advantages logic that consults _get_reward_component_keys and repeated_batch) never rely on an implicit code default.
examples/run_gdpo_gsm8k.py (2)
181-192: Unused variable cluster.
cluster is unpacked from the setup() return but never used. Prefix with _ to indicate intentional discard.

(
    policy,
    policy_generation,
-   cluster,
+   _cluster,
    dataloader,
    val_dataloader,
    loss_fn,
    logger,
    checkpointer,
    grpo_state,
    master_config,
) = setup(config, tokenizer, dataset, val_dataset)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/run_gdpo_gsm8k.py` around lines 181 - 192, The variable cluster returned from setup(...) is unused; update the unpacking to mark it as intentionally discarded by renaming cluster to _cluster (i.e., change "cluster" to "_cluster" in the tuple on the left side of the assignment where setup(config, tokenizer, dataset, val_dataset) is called) so linters/readers know it’s intentionally unused.
70-74: Use X | None instead of Optional[X] for Python 3.12+.
Per coding guidelines, code should conform to Python 3.12+. Replace Optional[AllTaskProcessedDataset] with AllTaskProcessedDataset | None.
As per coding guidelines: "Conform code to Python 3.12+"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/run_gdpo_gsm8k.py` around lines 70 - 74, The return type uses Optional[AllTaskProcessedDataset]; update the annotation to use modern union syntax AllTaskProcessedDataset | None (e.g., change ") -> tuple[AllTaskProcessedDataset, Optional[AllTaskProcessedDataset], dict[str, EnvironmentInterface], dict[str, EnvironmentInterface]," to ") -> tuple[AllTaskProcessedDataset, AllTaskProcessedDataset | None, dict[str, EnvironmentInterface], dict[str, EnvironmentInterface],"). Also remove or stop using typing.Optional imports if they become unused and update any other occurrences of Optional[AllTaskProcessedDataset] in this module to AllTaskProcessedDataset | None.
nemo_rl/data/datasets/response_datasets/__init__.py (1)
87-91: gsm8kDataset not added to __all__.
Other dataset classes exposed here are listed in __all__ (lines 159–172). gsm8kDataset (or its renamed PascalCase equivalent) should be added for API consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/data/datasets/response_datasets/__init__.py` around lines 87 - 91, The export list (__all__) is missing the gsm8kDataset symbol, causing it not to be publicly exposed; update the __all__ list to include the dataset's exported name (either "gsm8kDataset" or its PascalCase variant if the class is named that, e.g., "Gsm8kDataset") so the symbol returned by the factory (gsm8kDataset in the factory code) is consistently exported; locate the __all__ definition and add the exact identifier used where the class is defined/constructed to that list.
nemo_rl/distributed/ray_actor_environment_registry.py (1)
36-36: Minor formatting: extra space before : in dict entry.
Other entries in the registry use "key": value without a space before the colon. This is inconsistent and violates PEP 8 / Google style.

- "nemo_rl.environments.math_environment.MathMultiRewardEnvironment" : PY_EXECUTABLES.SYSTEM,
+ "nemo_rl.environments.math_environment.MathMultiRewardEnvironment": PY_EXECUTABLES.SYSTEM,

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/distributed/ray_actor_environment_registry.py` at line 36, Fix the minor formatting inconsistency in the environment registry entry for "nemo_rl.environments.math_environment.MathMultiRewardEnvironment": remove the extra space before the colon so the mapping matches the other entries (i.e., use "nemo_rl.environments.math_environment.MathMultiRewardEnvironment": PY_EXECUTABLES.SYSTEM). Locate the dictionary where this key is defined (the registry containing PY_EXECUTABLES) and make the spacing change to comply with PEP8/Google style.
nemo_rl/environments/math_environment.py (1)
266-291: Inner reward functions are re-created on every verify call.
extract_xml_answer, correctness_reward_func, int_reward_func, and format_reward_func are defined as closures inside verify. Since they don't capture any mutable state from the enclosing scope (beyond extract_xml_answer, which is itself stateless), they could be class methods or module-level functions to avoid the repeated allocation overhead on every batch verification.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/environments/math_environment.py` around lines 266 - 291, The inner reward helpers extract_xml_answer, correctness_reward_func, int_reward_func, and format_reward_func should be moved out of the verify function and made either module-level functions or methods on the surrounding class to avoid reallocating them per call; refactor by cutting these definitions from verify, pasting them as top-level functions or class methods (keeping their signatures the same), ensure re is imported at module level, and update verify to call these functions (e.g., self.format_reward_func or format_reward_func) instead of the now-removed inner definitions.
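A sketch of what hoisting these closures to module level could look like; the signatures and the exact regex patterns are assumptions based on this comment, not the file's current contents:

```python
import re


def extract_xml_answer(text: str) -> str:
    """Pull the content of the last <answer>...</answer> block, or "" if absent."""
    matches = re.findall(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else ""


def format_reward_func(response: str) -> float:
    """Reward responses that follow the <think>...</think><answer>...</answer> format."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 0.5 if re.search(pattern, response, flags=re.DOTALL) else 0.0
```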
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/run_gdpo_gsm8k.py`:
- Line 1: Update the copyright header in examples/run_gdpo_gsm8k.py from 2025 to
2026; locate the top-of-file copyright comment (the existing line starting with
"# Copyright (c) 2025, NVIDIA CORPORATION.") and change the year to 2026 so the
header matches current-year policy for new Python files.
In `@nemo_rl/algorithms/grpo.py`:
- Around line 140-143: Add documentation for the new TypedDict key use_gdpo in
nemo_rl/algorithms/grpo.py: explain it's a boolean that toggles GDPO
multi-reward advantage computation (per-component baselines, sum advantages,
then normalize), list valid values (true/false) and recommended default (false).
Update any exemplar GRPO configs under examples/configs/*.yaml to include
use_gdpo: false explicitly so defaults are visible in examples. Ensure the
TypedDict entry (use_gdpo) has a short inline comment or docstring describing
purpose, type, and default and that all example YAMLs and README/config docs
reflect the new key.
In `@nemo_rl/data/datasets/response_datasets/__init__.py`:
- Line 18: Rename the class symbol gsm8kDataset in gsm8k.py to a PascalCase name
(e.g., Gsm8kDataset or GSM8KDataset) and update all references/imports
accordingly; specifically change the import in response_datasets/__init__.py to
import the new class name from nemo_rl.data.datasets.response_datasets.gsm8k and
adjust any usage sites. Also reorder the import lines in __init__.py so the
gsm8k import comes after deepscaler (maintain alphabetical order consistent with
DAPOMath17KDataset and DeepScalerDataset). Ensure the class definition name in
gsm8k.py and the exported import name in __init__.py match exactly.
In `@nemo_rl/data/datasets/response_datasets/gsm8k.py`:
- Around line 43-46: extract_hash_answer currently returns None when "####" is
missing, which then becomes the assistant message content and can break
downstream processing; update extract_hash_answer to raise a clear exception
(e.g., ValueError with context) when "####" is not found, and/or modify the code
that constructs the assistant message (the call site that sets content =
extract_hash_answer(...)) to skip or filter out records where
extract_hash_answer would be None and log a warning; ensure callers handle the
exception or check for a missing delimiter so no assistant message is created
with None content.
- Line 104: Rename the class gsm8kDataset to GSM8KDataset (use PascalCase) in
the gsm8k.py file where class gsm8kDataset(RawDataset) is declared, and update
every reference to that class (imports, instantiations, registrations, type
hints, and any exports) to the new GSM8KDataset identifier so symbols like
gsm8kDataset, any factory/registry entries, and tests are updated accordingly to
avoid broken references.
- Around line 80-81: The inline comment above the GSM8K load is incorrect;
update the comment that precedes the load_dataset call (the line that assigns
val_ds using load_dataset('openai/gsm8k', 'main')["test"]) to accurately state
that it loads the GSM8K test split from the openai/gsm8k dataset instead of
referencing hendrydong/aime24. Keep it short and precise (e.g., "Load GSM8K test
split for validation") and ensure it sits immediately before the val_ds
assignment.
- Around line 23-41: The constant is defined twice—remove the lowercase alias
and keep only SYSTEM_PROMPT; update the SYSTEM_PROMPT string to correct typos:
change "fianl" to "final" and change "after the reasoning steps is done" to
"after the reasoning steps are done"; ensure any references to the removed
symbol system_prompt are updated to use SYSTEM_PROMPT.
- Around line 104-114: The constructor docstring for class gsm8kDataset
incorrectly references "DAPO Math 17K"; update the docstring in
gsm8kDataset.__init__ to accurately describe the GSM8K dataset (e.g.,
"Initialize the GSM8K dataset with train split") so it matches the class name
and the call to prepare_gsm8k_dataset; keep the seed argument description as-is.
In `@nemo_rl/data/processors.py`:
- Around line 213-219: Update the function/class docstring that starts with
"Process a datum dictionary..." to a Google-style docstring (with Args, Returns,
and optional Raises sections) describing parameters (datum_dict: dict), returned
DatumSpec, and any exceptions; remove the commented-out debug line
"print(f'user_message {user_message}')" unless you add a one-line justification
above it explaining when/why it's kept (otherwise delete it). Ensure the
docstring mentions the role of datum_dict, user_message and the produced
DatumSpec and keep references to datum_dict, user_message, and problem to help
locate the code.
- Around line 206-223: math_gdpo_data_processor currently ignores task_data_spec
and hardcodes messages from datum_dict["messages"], leaves a commented debug
print, and mis-annotates the variable returned by apply_chat_template; update
the function to honor configured prompts by checking task_data_spec (use its
system_prompt_file and prompt_file when present) and generate the system/user
content via apply_chat_template(tokenize=False) instead of reading
datum_dict["messages"] directly (replace references to user_message and
extra_env_info accordingly), remove the extraneous commented-out print (or add a
brief why-comment if it must remain), and change the type annotation for the
result of apply_chat_template (the variable named message) from list[str] to str
to match its actual return type; alternatively, if config-driven prompts are not
needed, remove the task_data_spec parameter and related config entries to avoid
confusion.
- Around line 238-250: The variable message is annotated as list[str] but
tokenizer.apply_chat_template(..., tokenize=False) returns a string here; update
the annotation and usage so message is typed as str (remove the type: ignore and
change list[str] to str) and ensure subsequent calls use it as a string (the
tokenizer(...) call and assignments to user_message["token_ids"] and
user_message["content"] remain unchanged but now operate on a str). This keeps
apply_chat_template, message, user_message["token_ids"], and
user_message["content"] consistent and removes the masked type error.
In `@nemo_rl/environments/interfaces.py`:
- Around line 44-48: Remove the inline comment on the public field rewards (the
"## This could be of different shape" text) and instead update the class-level
docstring for the interface in this file to document that rewards (Tensor) may
have different shapes depending on the environment; keep the field signature
rewards: Tensor unchanged and ensure the docstring sits with the class or
Protocol that defines observations, metadata, next_stop_strings, rewards, and
terminateds so external users see the note in the public API documentation.
In `@nemo_rl/environments/math_environment.py`:
- Around line 330-342: The code is incorrectly treating extracted_answer as a
tuple (len==2 unpack) though extract_xml_answer(response) returns a str when
math_verify_impl == "hf_math_verify"; remove the dead branch that handles
return_extracted_answer (the block that asserts len==2, unpacks
extracted_answer, and appends extracted_answers) or replace it with a simple
guard returning/raising since step() already raises NotImplementedError for
return_extracted_answer; locate the logic around extract_xml_answer(response),
the return_extracted_answer flag, and the HFVerifyWorker-style unpacking to
remove or short-circuit that tuple-based handling to avoid the crash.
- Around line 293-294: The hard-coded magic number 3 for number_of_rewards in
HFMultiRewardVerifyWorker.verify and MathMultiRewardEnvironment.step should be
made configurable: add a number_of_rewards field to MathEnvConfig (with a
sensible default), expose it on the MathMultiRewardEnvironment instance (or as a
class attribute), and pass it into HFMultiRewardVerifyWorker when the worker is
constructed/initialized; then replace the literal 3 usages in
HFMultiRewardVerifyWorker.verify and MathMultiRewardEnvironment.step with the
config-backed value (e.g., self.cfg.number_of_rewards or env.number_of_rewards)
so both places derive the same single source of truth.
- Line 642: Fix the typo in the inline comment on the loop that iterates "for
result in results[0]" by changing "corretness" to "correctness" so the comment
reads that index 0 always stores the correctness reward; update the comment
adjacent to the results variable and the loop to reflect the corrected spelling.
- Around line 549-551: Remove the leftover commented-out print debug artifacts
inside the MathMultiRewardEnvironment class (the commented prints around the
constructor/initialization and other methods at the noted locations); either
delete these commented print lines or replace them with proper logging (e.g.,
self.logger.debug(...)) if runtime diagnostic info is needed, ensuring you
reference the MathMultiRewardEnvironment class and the related init/step/helper
methods where those prints appear so you update all occurrences consistently.
- Around line 233-234: Update the file header copyright year from 2025 to 2026
to reflect the new class addition; locate the top-of-file copyright comment in
the same file where the new HFMultiRewardVerifyWorker class is defined and
change the year to 2026 so the header matches the new 2026 content.
- Around line 296-361: Remove the commented-out debug print blocks and the
active print calls in the loop of math_environment.py: delete the commented
prints inside the hf_math_verify branch (around the checks of response and
ground_truth) and remove the two print(f"results ...") lines and the stale
trailing comment before the final return in the block that builds results and
extracted_answers; ensure the logic around correctness_reward_func,
int_reward_func, format_reward_func, extract_xml_answer, and the
return_extracted_answer handling remains unchanged and tests still pass.
- Around line 670-672: batch["is_end"] is 1D while batch["rewards"] is 2D,
causing broadcast/indexing errors; fix by unsqueezing is_end before any
broadcast: replace uses of batch["is_end"] with batch["is_end"].unsqueeze(-1)
when multiplying into batch["rewards"] (so batch["rewards"] = batch["rewards"] *
batch["is_end"].unsqueeze(-1)). For places where a 1D per-prompt pass vector is
required (indexing generation_lengths or calling
calculate_pass_rate_per_prompt), collapse the 2D rewards into a 1D boolean/byte
vector (e.g., batch["rewards"].eq(1).any(dim=1) or .max(dim=1).values as
appropriate) and use that 1D tensor for indexing and for
calculate_pass_rate_per_prompt.
In `@run.sh`:
- Around line 1-2: Add a POSIX shebang and a short copyright/header comment at
the top of run.sh, and either remove or replace the unexplained commented-out
command "# uv run python examples/run_grpo_math.py" with a brief comment
explaining why it is preserved (e.g., deprecated example or alternative
invocation); ensure the remaining actual command "uv run python
examples/run_gdpo_gsm8k.py" remains runnable and that the header follows the
project's copyright and style conventions.
---
Outside diff comments:
In `@nemo_rl/algorithms/grpo.py`:
- Around line 2272-2284: Async GRPO path silently ignores the use_gdpo flag when
computing advantages; add an explicit configuration validation before the async
advantage computation (the block using prompt_only_ids, rewards,
calculate_baseline_and_std_per_prompt and producing advantages) to detect the
unsupported combination master_config["grpo"]["use_gdpo"] == True and
async_grpo.enabled == True and fail fast: raise a clear error or log+exit
instructing the user to disable use_gdpo or async_grpo, so GDPO is not silently
skipped; alternatively, if you prefer to support it, implement the per-component
baseline logic used in the sync path for GDPO into the async branch where
baseline/std are computed.
In `@nemo_rl/experience/rollouts.py`:
- Around line 287-309: Remove the stray debug prints and commented-out debug
snippets in the hot path: delete the print(f"indices {indices}) and the later
print/print-debug comments near the task_rewards/terminateds block, or replace
them with proper conditional debug logging (e.g., use logger.debug(...) and
guard with logger.isEnabledFor(logging.DEBUG)) so the logic in functions
handling indices, all_indices_order, all_rewards, and all_env_observations is
not spammed by prints; ensure no commented-out debug lines remain unless
accompanied by a short comment explaining why they are kept.
- Around line 369-455: The code assumes three reward components by hardcoding
number_of_rewards=3 and indexing multi_rewards, which breaks when
calculate_rewards returns a 1D single-reward tensor or a different number of
components; change initialization of multi_rewards to derive its shape from the
first env_output.rewards (e.g., after the first calculate_rewards call) and
reshape/expand 1D rewards to 2D before adding; update the accumulation lines
(multi_rewards[active_indices] += env_output.rewards and
total_rewards[active_indices] += env_output.rewards.sum(dim=1)) to handle both
1D and 2D cases by detecting rewards.dim() and using .unsqueeze(-1) or
.sum(dim=1) accordingly so active_indices, multi_rewards, total_rewards and
calculate_rewards/env_output.rewards remain consistent (also apply the same
change to the later occurrence around the code noted at lines 533-536).
- Around line 315-323: The code in rollouts.py uses torch.stack on all_rewards
(variable name all_rewards, sorted_indices) which fails when reward tensors have
different shapes per the EnvironmentReturn interface; replace the torch.stack
call by one of: pad each reward to a common shape before stacking, group rewards
by shape and keep separate tensors, or keep rewards as a list (no stacking) and
handle downstream consumers accordingly; remove the debug print("rewards stack
...") and delete the commented-out rewards line to satisfy guidelines; ensure
references to all_env_observations and env_observations remain correctly ordered
using sorted_indices and update any downstream code expecting a single tensor to
accept the new padded/grouped/list structure.
---
Nitpick comments:
In `@examples/configs/grpo_math_1B.yaml`:
- Around line 272-276: The example enables WandB via the logger key
wandb_enabled: true; confirm whether WandB telemetry is intended for this
example and either disable it by setting logger.wandb_enabled to false or
add/update documentation/comments describing required WandB credentials, how to
provide them, and any data governance/privacy expectations for this config
(update the example config comment and relevant README or examples docs to
reference wandb_enabled and credential setup).
In `@examples/run_gdpo_gsm8k.py`:
- Around line 181-192: The variable cluster returned from setup(...) is unused;
update the unpacking to mark it as intentionally discarded by renaming cluster
to _cluster (i.e., change "cluster" to "_cluster" in the tuple on the left side
of the assignment where setup(config, tokenizer, dataset, val_dataset) is
called) so linters/readers know it’s intentionally unused.
- Around line 70-74: The return type uses Optional[AllTaskProcessedDataset];
update the annotation to use modern union syntax AllTaskProcessedDataset | None
(e.g., change ") -> tuple[AllTaskProcessedDataset,
Optional[AllTaskProcessedDataset], dict[str, EnvironmentInterface], dict[str,
EnvironmentInterface]," to ") -> tuple[AllTaskProcessedDataset,
AllTaskProcessedDataset | None, dict[str, EnvironmentInterface], dict[str,
EnvironmentInterface],"). Also remove or stop using typing.Optional imports if
they become unused and update any other occurrences of
Optional[AllTaskProcessedDataset] in this module to AllTaskProcessedDataset |
None.
In `@nemo_rl/algorithms/grpo.py`:
- Around line 1250-1257: The code currently uses
master_config["grpo"].get("use_gdpo", False) which silently falls back to False
when the config key is missing; change this to require the key from YAML and
fail fast: replace the .get call with an explicit access (e.g.,
master_config["grpo"]["use_gdpo"]) or add a clear validation step on
master_config["grpo"] that raises a descriptive error if "use_gdpo" is missing,
so that functions using use_gdpo (the block around rewards =
repeated_batch["total_reward"] and the subsequent use_multi_reward_advantages
logic that consults _get_reward_component_keys and repeated_batch) never rely on
an implicit code default.
In `@nemo_rl/data/datasets/response_datasets/__init__.py`:
- Around line 87-91: The export list (__all__) is missing the gsm8kDataset
symbol, causing it not to be publicly exposed; update the __all__ list to
include the dataset's exported name (either "gsm8kDataset" or its PascalCase
variant if the class is named that, e.g., "Gsm8kDataset") so the symbol returned
by the factory (gsm8kDataset in the factory code) is consistently exported;
locate the __all__ definition and add the exact identifier used where the class
is defined/constructed to that list.
In `@nemo_rl/distributed/ray_actor_environment_registry.py`:
- Line 36: Fix the minor formatting inconsistency in the environment registry
entry for "nemo_rl.environments.math_environment.MathMultiRewardEnvironment":
remove the extra space before the colon so the mapping matches the other entries
(i.e., use "nemo_rl.environments.math_environment.MathMultiRewardEnvironment":
PY_EXECUTABLES.SYSTEM). Locate the dictionary where this key is defined (the
registry containing PY_EXECUTABLES) and make the spacing change to comply with
PEP8/Google style.
In `@nemo_rl/environments/math_environment.py`:
- Around line 266-291: The inner reward helpers extract_xml_answer,
correctness_reward_func, int_reward_func, and format_reward_func should be moved
out of the verify function and made either module-level functions or methods on
the surrounding class to avoid reallocating them per call; refactor by cutting
these definitions from verify, pasting them as top-level functions or class
methods (keeping their signatures the same), ensure re is imported at module
level, and update verify to call these functions (e.g., self.format_reward_func
or format_reward_func) instead of the now-removed inner definitions.
In `@nemo_rl/experience/rollouts.py`:
- Around line 902-920: Remove the leftover unconditional debug print: delete the
print("testing new code") statement (located immediately after constructing
final_batch_dict in rollouts.py) or replace it with a proper logger.debug call
if you need runtime tracing; ensure you reference the same spot where
final_batch_dict is built so message_log/total_reward/idx/truncated handling
remains unchanged.
@@ -0,0 +1,260 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Copyright year should be 2026 for new files.
The copyright header says 2025, but this file is being created in 2026.
As per coding guidelines: "Add the NVIDIA copyright header (with current year) to all Python files and shell scripts, excluding tests"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/run_gdpo_gsm8k.py` at line 1, Update the copyright header in
examples/run_gdpo_gsm8k.py from 2025 to 2026; locate the top-of-file copyright
comment (the existing line starting with "# Copyright (c) 2025, NVIDIA
CORPORATION.") and change the year to 2026 so the header matches current-year
policy for new Python files.
    # Use GDPO multi-reward advantage (per-component baselines, sum advantages, then normalize).
    # When True, reward components are discovered from the batch (keys reward1, reward2, ...).
    # Requires the environment to expose per-reward keys; see nemo_rl/environments/math_environment.py.
    use_gdpo: NotRequired[bool]
Document use_gdpo and add explicit default in exemplar YAMLs.
You added a new GRPO config key but the default/usage isn’t reflected in exemplar YAMLs (e.g., existing GRPO configs). Please document the key’s purpose/valid values/default and add an explicit default in example configs.
As per coding guidelines, "When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/algorithms/grpo.py` around lines 140 - 143, Add documentation for the
new TypedDict key use_gdpo in nemo_rl/algorithms/grpo.py: explain it's a boolean
that toggles GDPO multi-reward advantage computation (per-component baselines,
sum advantages, then normalize), list valid values (true/false) and recommended
default (false). Update any exemplar GRPO configs under examples/configs/*.yaml
to include use_gdpo: false explicitly so defaults are visible in examples.
Ensure the TypedDict entry (use_gdpo) has a short inline comment or docstring
describing purpose, type, and default and that all example YAMLs and
README/config docs reflect the new key.
from nemo_rl.data.datasets.response_datasets.clevr import CLEVRCoGenTDataset
from nemo_rl.data.datasets.response_datasets.dapo_math import DAPOMath17KDataset
from nemo_rl.data.datasets.response_datasets.gsm8k import gsm8kDataset
Class name gsm8kDataset violates PascalCase naming convention for classes.
Per coding guidelines, class names must use PascalCase. gsm8kDataset should be renamed to Gsm8kDataset or GSM8KDataset (consistent with other dataset class names like DAPOMath17KDataset, DeepScalerDataset). This also affects the definition in gsm8k.py.
Also, the import is out of alphabetical order — gsm8k is placed between dapo_math and deepscaler, but should come after deepscaler.
As per coding guidelines: "Use PascalCase for class names"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/data/datasets/response_datasets/__init__.py` at line 18, Rename the
class symbol gsm8kDataset in gsm8k.py to a PascalCase name (e.g., Gsm8kDataset
or GSM8KDataset) and update all references/imports accordingly; specifically
change the import in response_datasets/__init__.py to import the new class name
from nemo_rl.data.datasets.response_datasets.gsm8k and adjust any usage sites.
Also reorder the import lines in __init__.py so the gsm8k import comes after
deepscaler (maintain alphabetical order consistent with DAPOMath17KDataset and
DeepScalerDataset). Ensure the class definition name in gsm8k.py and the
exported import name in __init__.py match exactly.
SYSTEM_PROMPT = system_prompt = """
You are a helpful AI assistant.

For every request, you should carefully think through the math problem step by step, then provide the fianl answer in integer format.

Steps for Each Request:
1. Think: Provide detailed, step-by-step reasoning, calculations, or derivations.
2. Produce Final Answer: After step-by-step reasoning, output the final answer in integer format.

Output Format:
<think>Your thoughts and reasoning</think>
<answer>Final answer in integer format</answer>

Important Notes:
1. You must include your reasoning steps inside <think>.
2. You must always output the Final Answer within <answer> after the reasoning steps is done.
3. You should consistently work through the solution step by step before giving the final answer.
4. The final answer can only be an integer.
"""
Dual-assignment global, typos in the system prompt, and naming.
- Line 23: SYSTEM_PROMPT = system_prompt = """ creates two names for the same value. Remove the lowercase alias per the guideline to use upper snake_case for constants.
- Line 26: "fianl" → "final".
- Line 38: "after the reasoning steps is done" → "after the reasoning steps are done".
Proposed fix
-SYSTEM_PROMPT = system_prompt = """
+SYSTEM_PROMPT = """
You are a helpful AI assistant.
-For every request, you should carefully think through the math problem step by step, then provide the fianl answer in integer format.
+For every request, you should carefully think through the math problem step by step, then provide the final answer in integer format.
Steps for Each Request:
1. Think: Provide detailed, step-by-step reasoning, calculations, or derivations.
...
-2. You must always output the Final Answer within <answer> after the reasoning steps is done.
+2. You must always output the Final Answer within <answer> after the reasoning steps are done.

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-SYSTEM_PROMPT = system_prompt = """
-You are a helpful AI assistant.
-For every request, you should carefully think through the math problem step by step, then provide the fianl answer in integer format.
-Steps for Each Request:
-1. Think: Provide detailed, step-by-step reasoning, calculations, or derivations.
-2. Produce Final Answer: After step-by-step reasoning, output the final answer in integer format.
-Output Format:
-<think>Your thoughts and reasoning</think>
-<answer>Final answer in integer format</answer>
-Important Notes:
-1. You must include your reasoning steps inside <think>.
-2. You must always output the Final Answer within <answer> after the reasoning steps is done.
-3. You should consistently work through the solution step by step before giving the final answer.
-4. The final answer can only be an integer.
-"""
+SYSTEM_PROMPT = """
+You are a helpful AI assistant.
+For every request, you should carefully think through the math problem step by step, then provide the final answer in integer format.
+Steps for Each Request:
+1. Think: Provide detailed, step-by-step reasoning, calculations, or derivations.
+2. Produce Final Answer: After step-by-step reasoning, output the final answer in integer format.
+Output Format:
+<think>Your thoughts and reasoning</think>
+<answer>Final answer in integer format</answer>
+Important Notes:
+1. You must include your reasoning steps inside <think>.
+2. You must always output the Final Answer within <answer> after the reasoning steps are done.
+3. You should consistently work through the solution step by step before giving the final answer.
+4. The final answer can only be an integer.
+"""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/data/datasets/response_datasets/gsm8k.py` around lines 23 - 41, The
constant is defined twice—remove the lowercase alias and keep only
SYSTEM_PROMPT; update the SYSTEM_PROMPT string to correct typos: change "fianl"
to "final" and change "after the reasoning steps is done" to "after the
reasoning steps are done"; ensure any references to the removed symbol
system_prompt are updated to use SYSTEM_PROMPT.
def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()
extract_hash_answer can return None, which propagates as assistant message content.
If a GSM8K answer field doesn't contain ####, extract_hash_answer returns None (line 45). This None becomes the "content" of the assistant message at line 66, which could cause downstream failures (e.g., tokenization errors or type mismatches).
Consider raising an error or filtering such records during dataset preparation.
Also applies to: 64-67
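A sketch of the stricter behavior suggested here, either raising on a malformed record or letting the caller filter it out; the function names and logging call are illustrative, not the module's current API:

```python
import logging

logger = logging.getLogger(__name__)


def extract_hash_answer_strict(text: str) -> str:
    """Like extract_hash_answer, but fails loudly when the '####' delimiter is missing."""
    if "####" not in text:
        raise ValueError(f"GSM8K answer is missing the '####' delimiter: {text[:80]!r}")
    return text.split("####")[1].strip()


def has_hash_answer(answer_text: str) -> bool:
    """Caller-side filter: drop the record and warn instead of raising."""
    if "####" not in answer_text:
        logger.warning("Skipping GSM8K record without '####' delimiter")
        return False
    return True
```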
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/data/datasets/response_datasets/gsm8k.py` around lines 43 - 46,
extract_hash_answer currently returns None when "####" is missing, which then
becomes the assistant message content and can break downstream processing;
update extract_hash_answer to raise a clear exception (e.g., ValueError with
context) when "####" is not found, and/or modify the code that constructs the
assistant message (the call site that sets content = extract_hash_answer(...))
to skip or filter out records where extract_hash_answer would be None and log a
warning; ensure callers handle the exception or check for a missing delimiter so
no assistant message is created with None content.
if return_extracted_answer:
    # Make sure the extracted answer is not None and is a list of two elements
    assert extracted_answer is not None
    assert len(extracted_answer) == 2
    extracted_gold, extracted_prediction = extracted_answer
    # Get the extracted answer with the same logic as in the HFVerifyWorker
    for pred in extracted_prediction:
        if any(grader.verify(gold, pred) for gold in extracted_gold):
            extracted_answers.append(pred)
            break
    else:
        # If no match is found, means all answers are incorrect, just use the first prediction
        extracted_answers.append(extracted_prediction[0][0])
return_extracted_answer=True will crash — extracted_answer is a str, not a tuple.
When math_verify_impl == "hf_math_verify", extracted_answer is assigned from extract_xml_answer(response) (line 313), which returns a plain str. Lines 332–333 then assert len(extracted_answer) == 2 and attempt tuple unpacking, which will always fail for a string that isn't exactly 2 characters long.
This code path appears to be copy-pasted from HFVerifyWorker where extracted_answer comes from self.verify_func (which returns a tuple). Since the step method already raises NotImplementedError for this flag (line 625), consider removing this dead code block to avoid confusion.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/environments/math_environment.py` around lines 330 - 342, The code is
incorrectly treating extracted_answer as a tuple (len==2 unpack) though
extract_xml_answer(response) returns a str when math_verify_impl ==
"hf_math_verify"; remove the dead branch that handles return_extracted_answer
(the block that asserts len==2, unpacks extracted_answer, and appends
extracted_answers) or replace it with a simple guard returning/raising since
step() already raises NotImplementedError for return_extracted_answer; locate
the logic around extract_xml_answer(response), the return_extracted_answer flag,
and the HFVerifyWorker-style unpacking to remove or short-circuit that
tuple-based handling to avoid the crash.
| # print(f"self.workers {self.workers}") | ||
| # print('Init MathMultiRewardEnvironment') | ||
|
|
🛠️ Refactor suggestion | 🟠 Major
More debug artifacts in MathMultiRewardEnvironment.
Lines 549–550, 610–611, 633–634, and 647 contain commented-out print statements with no explanation. Same cleanup applies here.
Also applies to: 610-612, 633-634, 647-648
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/environments/math_environment.py` around lines 549 - 551, Remove the
leftover commented-out print debug artifacts inside the
MathMultiRewardEnvironment class (the commented prints around the
constructor/initialization and other methods at the noted locations); either
delete these commented print lines or replace them with proper logging (e.g.,
self.logger.debug(...)) if runtime diagnostic info is needed, ensuring you
reference the MathMultiRewardEnvironment class and the related init/step/helper
methods where those prints appear so you update all occurrences consistently.
if result
else "Environment: incorrect",
}
for result in results[0]  ## index 0 always store corretness reward
Typo: "corretness" → "correctness".
- for result in results[0] ## index 0 always store corretness reward
+ for result in results[0] # index 0 always stores correctness reward

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- for result in results[0] ## index 0 always store corretness reward
+ for result in results[0] # index 0 always stores correctness reward
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/environments/math_environment.py` at line 642, Fix the typo in the
inline comment on the loop that iterates "for result in results[0]" by changing
"corretness" to "correctness" so the comment reads that index 0 always stores
the correctness reward; update the comment adjacent to the results variable and
the loop to reflect the corrected spelling.
| batch["rewards"] = ( | ||
| batch["rewards"] * batch["is_end"] | ||
| ) # set a reward of 0 for any incorrectly ended sequences |
Shape mismatch: 2D rewards × 1D is_end will fail at runtime.
batch["rewards"] is now shape (B, num_rewards) (set at line 646), but batch["is_end"] is shape (B,). PyTorch broadcasting aligns trailing dimensions, so (B, num_rewards) * (B,) will raise a runtime error when B != num_rewards. You need to unsqueeze is_end.
The same 2D-vs-1D mismatch propagates to lines 673 and 687–689: batch["rewards"] == 1 yields a 2D mask that cannot index generation_lengths (1D), and calculate_pass_rate_per_prompt expects a 1D rewards tensor.
Proposed fix
batch["rewards"] = (
- batch["rewards"] * batch["is_end"]
+ batch["rewards"] * batch["is_end"].unsqueeze(-1)
) # set a reward of 0 for any incorrectly ended sequences
- if (batch["rewards"] == 1).float().sum() > 0:
+ # Use the first reward column (correctness) for scalar metrics
+ correctness_rewards = batch["rewards"][:, 0]
+ if (correctness_rewards == 1).float().sum() > 0:
correct_solution_generation_lengths = (
(batch["generation_lengths"] - batch["prompt_lengths"])[
- batch["rewards"] == 1
+ correctness_rewards == 1
]
.float()
.mean()
.item()
)
...
metrics = {
- "accuracy": batch["rewards"].mean().item(),
+ "accuracy": correctness_rewards.mean().item(),
"pass@samples_per_prompt": calculate_pass_rate_per_prompt(
- batch["text"], batch["rewards"]
+ batch["text"], correctness_rewards
),🤖 Prompt for AI Agents
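A quick standalone check of the shape issue and the unsqueeze fix described above, using toy tensors only:

```python
import torch

rewards = torch.ones(4, 3)                    # (batch, num_rewards)
is_end = torch.tensor([1.0, 0.0, 1.0, 1.0])   # (batch,)

# rewards * is_end would try to broadcast (4, 3) with (4,) and fail,
# because broadcasting aligns trailing dimensions (3 vs 4).
masked = rewards * is_end.unsqueeze(-1)       # (4, 3); rows where is_end == 0 are zeroed
print(masked)
```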
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/environments/math_environment.py` around lines 670 - 672,
batch["is_end"] is 1D while batch["rewards"] is 2D, causing broadcast/indexing
errors; fix by unsqueezing is_end before any broadcast: replace uses of
batch["is_end"] with batch["is_end"].unsqueeze(-1) when multiplying into
batch["rewards"] (so batch["rewards"] = batch["rewards"] *
batch["is_end"].unsqueeze(-1)). For places where a 1D per-prompt pass vector is
required (indexing generation_lengths or calling
calculate_pass_rate_per_prompt), collapse the 2D rewards into a 1D boolean/byte
vector (e.g., batch["rewards"].eq(1).any(dim=1) or .max(dim=1).values as
appropriate) and use that 1D tensor for indexing and for
calculate_pass_rate_per_prompt.
# uv run python examples/run_grpo_math.py
uv run python examples/run_gdpo_gsm8k.py
\ No newline at end of file
Missing shebang and copyright header; contains unexplained commented-out code.
Per the Google Shell Style Guide (referenced in coding guidelines), shell scripts should start with a shebang. Line 1 is a commented-out command with no explanation for why it's there.
Proposed fix
+#!/usr/bin/env bash
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
-# uv run python examples/run_grpo_math.py
uv run python examples/run_gdpo_gsm8k.py

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-# uv run python examples/run_grpo_math.py
-uv run python examples/run_gdpo_gsm8k.py
+#!/usr/bin/env bash
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+uv run python examples/run_gdpo_gsm8k.py
🧰 Tools
🪛 Shellcheck (0.11.0)
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@run.sh` around lines 1 - 2, Add a POSIX shebang and a short copyright/header
comment at the top of run.sh, and either remove or replace the unexplained
commented-out command "# uv run python examples/run_grpo_math.py" with a brief
comment explaining why it is preserved (e.g., deprecated example or alternative
invocation); ensure the remaining actual command "uv run python
examples/run_gdpo_gsm8k.py" remains runnable and that the header follows the
project's copyright and style conventions.
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features