
Gdpo #1986

Open
nbasyl wants to merge 3 commits into NVIDIA-NeMo:main from nbasyl:gdpo

Conversation


@nbasyl nbasyl commented Feb 18, 2026

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

New Features

  • Added GSM8K dataset integration for mathematics training tasks
  • Introduced GDPO (Group Distributional Policy Optimization) training with multi-reward component support and per-component baseline optimization
  • Added advanced math environment with multiple reward verification capabilities
  • Expanded rollout system to handle multiple reward signals during training
  • Provided example scripts and configuration templates for GDPO workflows

@nbasyl nbasyl requested review from a team as code owners February 18, 2026 17:21
Contributor

coderabbitai bot commented Feb 18, 2026

📝 Walkthrough

Introduces GDPO multi-reward training support for math problem solving, including GSM8K dataset integration, dedicated data processors, Ray actor-based math verification environments for multi-reward computation, and configuration files enabling end-to-end GDPO workflows with asynchronous training options.

Changes

Cohort / File(s) Summary
Configuration & Build
.gitignore, run.sh, examples/configs/gdpo_math_1B.yaml, examples/configs/grpo_math_1B.yaml
Added interactive.sh to gitignore; updated run.sh with GDPO example execution; created comprehensive gdpo_math_1B.yaml configuration with GDPO training controls, async options, reward shaping, and multi-component settings; enabled wandb logging in grpo_math_1B.yaml.
GSM8K Dataset Support
nemo_rl/data/datasets/response_datasets/gsm8k.py, nemo_rl/data/datasets/response_datasets/__init__.py
New GSM8K dataset adapter with chat-format preprocessing, answer extraction from hash-marked fields, dataset loading from OpenAI/gsm8k, and RawDataset wrapper; registered gsm8kDataset in response_datasets imports and load_response_dataset function.
GRPO Algorithm & Processing
nemo_rl/algorithms/grpo.py, nemo_rl/data/processors.py
Added use_gdpo flag and per-component reward baseline logic to GRPOConfig and grpo_train; introduced _get_reward_component_keys utility for dynamic multi-reward handling; added math_gdpo_data_processor for OpenMathInstruct2 message-based data with task_name propagation and registered in PROCESSOR_REGISTRY.
Math Verification Environments
nemo_rl/environments/math_environment.py, nemo_rl/environments/interfaces.py, nemo_rl/distributed/ray_actor_environment_registry.py
Introduced HFMultiRewardVerifyWorker and HFMultiRewardEnvironment Ray actors for parallel math verification across multiple reward implementations; added MathMultiRewardEnvironment for chunked multi-reward processing with global_post_process_and_metrics hook; registered MathMultiRewardEnvironment in actor registry; updated EnvironmentReturn.rewards field annotation with shape flexibility note.
Multi-turn Rollout Handling
nemo_rl/experience/rollouts.py
Added per-component reward tensor accumulation (multi_rewards) across multi-turn rollouts; exposed reward1, reward2, ..., rewardN keys in final batch and sample state; updated async rollout functions to include per-component reward signals; maintains single-reward backward compatibility.
GDPO Entry Point
examples/run_gdpo_gsm8k.py
New orchestration script with parse_args CLI, setup_data function for environment and dataset provisioning, main workflow integrating Hydra config loading, tokenizer setup, Ray initialization, and conditional async/sync GRPO training path with feature-flag validation.

Sequence Diagram(s)

sequenceDiagram
    actor User as User / CLI
    participant Script as run_gdpo_gsm8k.py
    participant Config as Config Loader
    participant DataSetup as setup_data
    participant Proc as DataProcessor<br/>(math_gdpo)
    participant Env as MathMultiReward<br/>Environment
    participant Verify as HFMultiRewardVerify<br/>Worker
    participant Train as grpo_train<br/>(with GDPO)

    User->>Script: Execute with config
    Script->>Config: Load + merge overrides
    Config-->>Script: Hydra config
    
    Script->>DataSetup: setup_data(tokenizer, config)
    DataSetup->>Proc: Create processor
    Proc-->>DataSetup: math_gdpo_processor
    DataSetup->>Env: Initialize environment
    Env->>Verify: Spawn workers
    Verify-->>Env: Ready
    DataSetup-->>Script: datasets, envs, processors
    
    Script->>Train: grpo_train(config, ...)
    
    loop Training iterations
        Train->>Proc: Process batch
        Proc-->>Train: tokenized messages + task
        Train->>Env: step(message_log_batch)
        Env->>Verify: verify(predictions)
        Verify-->>Env: reward1, reward2, ...
        Env-->>Train: EnvironmentReturn(multi_rewards)
        Train->>Train: _get_reward_component_keys()
        Train->>Train: per-component baseline & advantage
        Train->>Train: sum advantages<br/>(GDPO)
        Train->>Train: normalize combined advantages
    end
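
For reference, the GDPO advantage step in the loop above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: it assumes rewards of shape (num_prompts * group_size, num_components) with samples for the same prompt stored contiguously, and the function and tensor names are ours.

import torch

def gdpo_advantages(rewards: torch.Tensor, group_size: int, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_prompts * group_size, num_components), one row per sample.
    n, num_components = rewards.shape
    grouped = rewards.view(n // group_size, group_size, num_components)
    # Per-component baseline: mean of each reward component within a prompt group.
    baseline = grouped.mean(dim=1, keepdim=True)
    per_component_adv = grouped - baseline
    # Sum the per-component advantages, then normalize the combined advantage.
    combined = per_component_adv.sum(dim=-1).view(n)
    return (combined - combined.mean()) / (combined.std() + eps)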

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

CI:L1

Suggested reviewers

  • ashors1
  • terrykong
🚥 Pre-merge checks | ✅ 1 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 68.57%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Test Results For Major Changes — ⚠️ Warning: The PR introduces 1,200 lines of code with the GDPO algorithm and new features but lacks any testing documentation, performance metrics, convergence validation, or accuracy benchmarks in the PR description. Resolution: add comprehensive testing documentation including a GDPO vs GRPO baseline comparison, multi-reward validation, regression tests, and experimental setup details with accuracy/loss curves.
  • Title check — ❓ Inconclusive: The title "Gdpo" is extremely vague and generic, providing no meaningful information about the changeset despite substantial additions across multiple domains (GRPO/GDPO training, math environments, datasets, processors, and configuration files). Resolution: replace the title with a more descriptive phrase that captures the main intent, such as "Add GDPO multi-reward support with math environment and GSM8K dataset integration" or "Implement GDPO framework with multi-reward training and math task configuration".

✅ Passed checks (1 passed)

  • Description Check — ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 20

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
nemo_rl/experience/rollouts.py (3)

287-309: ⚠️ Potential issue | 🟡 Minor

Remove debug prints/commented-out code in hot path.

Lines 287 and 322 are debug prints, and Lines 305–309 are commented-out debug code; these will spam logs and violate the commented-out code rule. Please remove or gate behind a debug logger.

🧹 Suggested cleanup
-        print(f"indices {indices}")
@@
-            # print(f"task_rewards {task_rewards.shape}")
-            # print(f"task_rewards[i] {len(task_rewards[i])}")
@@
-            # print(f"terminateds {terminateds}")
@@
-    print(f"rewards stack {rewards.shape}")

As per coding guidelines, "For commented-out code, include a comment describing its usage and why it is commented out; otherwise remove as debug comment before merging."

Also applies to: 322-322
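
For reference, a minimal sketch of the logger-gated alternative mentioned above; the logger name and message are assumptions, not code from the PR.

import logging

logger = logging.getLogger(__name__)

def debug_log_indices(indices: list[int]) -> None:
    # Emits only when DEBUG is enabled for this module, so the rollout hot path
    # is not spammed by unconditional prints.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("indices: %s", indices)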

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/experience/rollouts.py` around lines 287 - 309, Remove the stray
debug prints and commented-out debug snippets in the hot path: delete the
print(f"indices {indices}) and the later print/print-debug comments near the
task_rewards/terminateds block, or replace them with proper conditional debug
logging (e.g., use logger.debug(...) and guard with
logger.isEnabledFor(logging.DEBUG)) so the logic in functions handling indices,
all_indices_order, all_rewards, and all_env_observations is not spammed by
prints; ensure no commented-out debug lines remain unless accompanied by a short
comment explaining why they are kept.

369-455: ⚠️ Potential issue | 🟠 Major

Hard-coded reward count breaks variable/single-reward envs.

Lines 369–455 assume exactly 3 reward components and a 2D reward tensor. Single-reward outputs (1D) or different component counts will fail. Initialize multi_rewards dynamically from env_output.rewards and guard 1D rewards.

✅ Suggested fix (dynamic reward shape)
-    number_of_rewards = 3
-    multi_rewards = torch.zeros(batch_size, number_of_rewards, dtype=torch.float32)
+    multi_rewards: torch.Tensor | None = None
@@
-        multi_rewards[active_indices] += env_output.rewards
-        total_rewards[active_indices] += env_output.rewards.sum(dim=1)
+        env_rewards = env_output.rewards
+        if env_rewards.ndim == 1:
+            env_rewards = env_rewards.unsqueeze(1)
+        if multi_rewards is None:
+            multi_rewards = torch.zeros(
+                batch_size, env_rewards.shape[1], dtype=torch.float32
+            )
+        multi_rewards[active_indices] += env_rewards
+        total_rewards[active_indices] += env_rewards.sum(dim=1)
@@
-    num_reward_components = multi_rewards.shape[1]
-    for i in range(num_reward_components):
-        current_batch[f"reward{i + 1}"] = multi_rewards[:, i].clone()
+    if multi_rewards is not None:
+        for i in range(multi_rewards.shape[1]):
+            current_batch[f"reward{i + 1}"] = multi_rewards[:, i].clone()

Also applies to: 533-536

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/experience/rollouts.py` around lines 369 - 455, The code assumes
three reward components by hardcoding number_of_rewards=3 and indexing
multi_rewards, which breaks when calculate_rewards returns a 1D single-reward
tensor or a different number of components; change initialization of
multi_rewards to derive its shape from the first env_output.rewards (e.g., after
the first calculate_rewards call) and reshape/expand 1D rewards to 2D before
adding; update the accumulation lines (multi_rewards[active_indices] +=
env_output.rewards and total_rewards[active_indices] +=
env_output.rewards.sum(dim=1)) to handle both 1D and 2D cases by detecting
rewards.dim() and using .unsqueeze(-1) or .sum(dim=1) accordingly so
active_indices, multi_rewards, total_rewards and
calculate_rewards/env_output.rewards remain consistent (also apply the same
change to the later occurrence around the code noted at lines 533-536).

315-323: ⚠️ Potential issue | 🟠 Major

Remove torch.stack and handle variable reward shapes explicitly.

Line 321 uses torch.stack which requires all reward tensors to have identical shapes. However, the EnvironmentReturn interface (interfaces.py:48) explicitly permits variable reward shapes with the comment "## This could be of different shape". If different environment types return rewards with different shapes, this will raise RuntimeError at runtime.

Recommendations:

  1. Pad all rewards to a common shape before stacking, or
  2. Collect rewards grouped by shape/type, or
  3. Use torch.cat with flattening if shape variation is expected

Additionally, remove the debug print() statement at line 322 and the commented-out code at line 320 (which lacks a justification comment per coding guidelines).
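
A minimal sketch of option 1 (padding per-sample reward tensors to a common component count before stacking); the function name and the assumption that each element of all_rewards is a scalar or 1D tensor are ours, not the PR's.

import torch
import torch.nn.functional as F

def stack_variable_rewards(all_rewards: list[torch.Tensor]) -> torch.Tensor:
    # Treat scalar rewards as single-component tensors.
    rewards_1d = [r.reshape(-1) for r in all_rewards]
    max_components = max(r.numel() for r in rewards_1d)
    # Right-pad each reward vector with zeros up to the widest component count.
    padded = [F.pad(r, (0, max_components - r.numel())) for r in rewards_1d]
    return torch.stack(padded, dim=0)  # (num_samples, max_components)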

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/experience/rollouts.py` around lines 315 - 323, The code in
rollouts.py uses torch.stack on all_rewards (variable name all_rewards,
sorted_indices) which fails when reward tensors have different shapes per the
EnvironmentReturn interface; replace the torch.stack call by one of: pad each
reward to a common shape before stacking, group rewards by shape and keep
separate tensors, or keep rewards as a list (no stacking) and handle downstream
consumers accordingly; remove the debug print("rewards stack ...") and delete
the commented-out rewards line to satisfy guidelines; ensure references to
all_env_observations and env_observations remain correctly ordered using
sorted_indices and update any downstream code expecting a single tensor to
accept the new padded/grouped/list structure.
nemo_rl/algorithms/grpo.py (1)

2272-2284: ⚠️ Potential issue | 🟡 Minor

Async GRPO silently ignores GDPO; add validation or implementation.

Async training (lines 2272-2284) does not check the use_gdpo flag when computing advantages, unlike the sync path (lines 1254-1340) which implements per-component baselines. If both use_gdpo=true and async_grpo.enabled=true are set, GDPO will not be applied without error or warning. Either implement GDPO in async mode, add config validation to prevent this combination, or document it as intentionally unsupported.
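
One way the suggested fail-fast validation could look; the key paths (grpo.use_gdpo, async_grpo.enabled) follow the review text, but the function name and exact config layout are assumptions.

def validate_gdpo_async_config(master_config: dict) -> None:
    # Reject the unsupported combination instead of silently skipping GDPO.
    use_gdpo = bool(master_config["grpo"].get("use_gdpo", False))
    async_enabled = bool(master_config.get("async_grpo", {}).get("enabled", False))
    if use_gdpo and async_enabled:
        raise ValueError(
            "grpo.use_gdpo=true is not supported with async_grpo.enabled=true; "
            "disable one of the two options."
        )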

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/algorithms/grpo.py` around lines 2272 - 2284, Async GRPO path
silently ignores the use_gdpo flag when computing advantages; add an explicit
configuration validation before the async advantage computation (the block using
prompt_only_ids, rewards, calculate_baseline_and_std_per_prompt and producing
advantages) to detect the unsupported combination
master_config["grpo"]["use_gdpo"] == True and async_grpo.enabled == True and
fail fast: raise a clear error or log+exit instructing the user to disable
use_gdpo or async_grpo, so GDPO is not silently skipped; alternatively, if you
prefer to support it, implement the per-component baseline logic used in the
sync path for GDPO into the async branch where baseline/std are computed.
🧹 Nitpick comments (8)
nemo_rl/experience/rollouts.py (1)

902-920: Remove leftover debug print.

Line 919 prints "testing new code" unconditionally; please drop it or guard behind a debug logger.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/experience/rollouts.py` around lines 902 - 920, Remove the leftover
unconditional debug print: delete the print("testing new code") statement
(located immediately after constructing final_batch_dict in rollouts.py) or
replace it with a proper logger.debug call if you need runtime tracing; ensure
you reference the same spot where final_batch_dict is built so
message_log/total_reward/idx/truncated handling remains unchanged.
examples/configs/grpo_math_1B.yaml (1)

272-276: Confirm WandB logging is intended here.

Line 275 enables WandB; ensure credentials and data governance expectations are documented for this example config.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/configs/grpo_math_1B.yaml` around lines 272 - 276, The example
enables WandB via the logger key wandb_enabled: true; confirm whether WandB
telemetry is intended for this example and either disable it by setting
logger.wandb_enabled to false or add/update documentation/comments describing
required WandB credentials, how to provide them, and any data governance/privacy
expectations for this config (update the example config comment and relevant
README or examples docs to reference wandb_enabled and credential setup).
nemo_rl/algorithms/grpo.py (1)

1250-1257: Avoid implicit defaults for use_gdpo in code.

Line 1254 uses .get(..., False), which hides missing config keys. Prefer requiring the key and setting defaults in YAML.

💡 Suggested fix
-    use_gdpo = master_config["grpo"].get("use_gdpo", False)
+    use_gdpo = master_config["grpo"]["use_gdpo"]

As per coding guidelines, "YAML is the single source of truth for configuration defaults; do not set non-None defaults in code for configuration values."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/algorithms/grpo.py` around lines 1250 - 1257, The code currently uses
master_config["grpo"].get("use_gdpo", False) which silently falls back to False
when the config key is missing; change this to require the key from YAML and
fail fast: replace the .get call with an explicit access (e.g.,
master_config["grpo"]["use_gdpo"]) or add a clear validation step on
master_config["grpo"] that raises a descriptive error if "use_gdpo" is missing,
so that functions using use_gdpo (the block around rewards =
repeated_batch["total_reward"] and the subsequent use_multi_reward_advantages
logic that consults _get_reward_component_keys and repeated_batch) never rely on
an implicit code default.
examples/run_gdpo_gsm8k.py (2)

181-192: Unused variable cluster.

cluster is unpacked from setup() return but never used. Prefix with _ to indicate intentional discard.

     (
         policy,
         policy_generation,
-        cluster,
+        _cluster,
         dataloader,
         val_dataloader,
         loss_fn,
         logger,
         checkpointer,
         grpo_state,
         master_config,
     ) = setup(config, tokenizer, dataset, val_dataset)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/run_gdpo_gsm8k.py` around lines 181 - 192, The variable cluster
returned from setup(...) is unused; update the unpacking to mark it as
intentionally discarded by renaming cluster to _cluster (i.e., change "cluster"
to "_cluster" in the tuple on the left side of the assignment where
setup(config, tokenizer, dataset, val_dataset) is called) so linters/readers
know it’s intentionally unused.

70-74: Use X | None instead of Optional[X] for Python 3.12+.

Per coding guidelines, code should conform to Python 3.12+. Replace Optional[AllTaskProcessedDataset] with AllTaskProcessedDataset | None.

As per coding guidelines: "Conform code to Python 3.12+"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/run_gdpo_gsm8k.py` around lines 70 - 74, The return type uses
Optional[AllTaskProcessedDataset]; update the annotation to use modern union
syntax AllTaskProcessedDataset | None (e.g., change ") ->
tuple[AllTaskProcessedDataset, Optional[AllTaskProcessedDataset], dict[str,
EnvironmentInterface], dict[str, EnvironmentInterface]," to ") ->
tuple[AllTaskProcessedDataset, AllTaskProcessedDataset | None, dict[str,
EnvironmentInterface], dict[str, EnvironmentInterface],"). Also remove or stop
using typing.Optional imports if they become unused and update any other
occurrences of Optional[AllTaskProcessedDataset] in this module to
AllTaskProcessedDataset | None.
nemo_rl/data/datasets/response_datasets/__init__.py (1)

87-91: gsm8kDataset not added to __all__.

Other dataset classes exposed here are listed in __all__ (line 159–172). gsm8kDataset (or its renamed PascalCase equivalent) should be added for API consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/__init__.py` around lines 87 - 91,
The export list (__all__) is missing the gsm8kDataset symbol, causing it not to
be publicly exposed; update the __all__ list to include the dataset's exported
name (either "gsm8kDataset" or its PascalCase variant if the class is named
that, e.g., "Gsm8kDataset") so the symbol returned by the factory (gsm8kDataset
in the factory code) is consistently exported; locate the __all__ definition and
add the exact identifier used where the class is defined/constructed to that
list.
nemo_rl/distributed/ray_actor_environment_registry.py (1)

36-36: Minor formatting: extra space before : in dict entry.

Other entries in the registry use "key": value without a space before the colon. This is inconsistent and violates PEP 8 / Google style.

-    "nemo_rl.environments.math_environment.MathMultiRewardEnvironment" : PY_EXECUTABLES.SYSTEM,
+    "nemo_rl.environments.math_environment.MathMultiRewardEnvironment": PY_EXECUTABLES.SYSTEM,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/distributed/ray_actor_environment_registry.py` at line 36, Fix the
minor formatting inconsistency in the environment registry entry for
"nemo_rl.environments.math_environment.MathMultiRewardEnvironment": remove the
extra space before the colon so the mapping matches the other entries (i.e., use
"nemo_rl.environments.math_environment.MathMultiRewardEnvironment":
PY_EXECUTABLES.SYSTEM). Locate the dictionary where this key is defined (the
registry containing PY_EXECUTABLES) and make the spacing change to comply with
PEP8/Google style.
nemo_rl/environments/math_environment.py (1)

266-291: Inner reward functions are re-created on every verify call.

extract_xml_answer, correctness_reward_func, int_reward_func, and format_reward_func are defined as closures inside verify. Since they don't capture any mutable state from the enclosing scope (beyond extract_xml_answer which is itself stateless), they could be class methods or module-level functions to avoid the repeated allocation overhead on every batch verification.
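
For illustration, the helpers could be hoisted to module level along these lines; the regex patterns and reward values here are guesses for a <think>/<answer> format, not the PR's exact logic.

import re

def extract_xml_answer(text: str) -> str:
    # Pull the content of the last <answer>...</answer> block, if any.
    matches = re.findall(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else ""

def int_reward_func(response: str) -> float:
    # Small shaping reward when the extracted answer parses as an integer.
    answer = extract_xml_answer(response)
    return 0.5 if answer.lstrip("-").isdigit() else 0.0

def format_reward_func(response: str) -> float:
    # Reward for following the <think>...</think><answer>...</answer> format.
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 0.5 if re.search(pattern, response, flags=re.DOTALL) else 0.0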

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/environments/math_environment.py` around lines 266 - 291, The inner
reward helpers extract_xml_answer, correctness_reward_func, int_reward_func, and
format_reward_func should be moved out of the verify function and made either
module-level functions or methods on the surrounding class to avoid reallocating
them per call; refactor by cutting these definitions from verify, pasting them
as top-level functions or class methods (keeping their signatures the same),
ensure re is imported at module level, and update verify to call these functions
(e.g., self.format_reward_func or format_reward_func) instead of the now-removed
inner definitions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/run_gdpo_gsm8k.py`:
- Line 1: Update the copyright header in examples/run_gdpo_gsm8k.py from 2025 to
2026; locate the top-of-file copyright comment (the existing line starting with
"# Copyright (c) 2025, NVIDIA CORPORATION.") and change the year to 2026 so the
header matches current-year policy for new Python files.

In `@nemo_rl/algorithms/grpo.py`:
- Around line 140-143: Add documentation for the new TypedDict key use_gdpo in
nemo_rl/algorithms/grpo.py: explain it's a boolean that toggles GDPO
multi-reward advantage computation (per-component baselines, sum advantages,
then normalize), list valid values (true/false) and recommended default (false).
Update any exemplar GRPO configs under examples/configs/*.yaml to include
use_gdpo: false explicitly so defaults are visible in examples. Ensure the
TypedDict entry (use_gdpo) has a short inline comment or docstring describing
purpose, type, and default and that all example YAMLs and README/config docs
reflect the new key.

In `@nemo_rl/data/datasets/response_datasets/__init__.py`:
- Line 18: Rename the class symbol gsm8kDataset in gsm8k.py to a PascalCase name
(e.g., Gsm8kDataset or GSM8KDataset) and update all references/imports
accordingly; specifically change the import in response_datasets/__init__.py to
import the new class name from nemo_rl.data.datasets.response_datasets.gsm8k and
adjust any usage sites. Also reorder the import lines in __init__.py so the
gsm8k import comes after deepscaler (maintain alphabetical order consistent with
DAPOMath17KDataset and DeepScalerDataset). Ensure the class definition name in
gsm8k.py and the exported import name in __init__.py match exactly.

In `@nemo_rl/data/datasets/response_datasets/gsm8k.py`:
- Around line 43-46: extract_hash_answer currently returns None when "####" is
missing, which then becomes the assistant message content and can break
downstream processing; update extract_hash_answer to raise a clear exception
(e.g., ValueError with context) when "####" is not found, and/or modify the code
that constructs the assistant message (the call site that sets content =
extract_hash_answer(...)) to skip or filter out records where
extract_hash_answer would be None and log a warning; ensure callers handle the
exception or check for a missing delimiter so no assistant message is created
with None content.
- Line 104: Rename the class gsm8kDataset to GSM8KDataset (use PascalCase) in
the gsm8k.py file where class gsm8kDataset(RawDataset) is declared, and update
every reference to that class (imports, instantiations, registrations, type
hints, and any exports) to the new GSM8KDataset identifier so symbols like
gsm8kDataset, any factory/registry entries, and tests are updated accordingly to
avoid broken references.
- Around line 80-81: The inline comment above the GSM8K load is incorrect;
update the comment that precedes the load_dataset call (the line that assigns
val_ds using load_dataset('openai/gsm8k', 'main')["test"]) to accurately state
that it loads the GSM8K test split from the openai/gsm8k dataset instead of
referencing hendrydong/aime24. Keep it short and precise (e.g., "Load GSM8K test
split for validation") and ensure it sits immediately before the val_ds
assignment.
- Around line 23-41: The constant is defined twice—remove the lowercase alias
and keep only SYSTEM_PROMPT; update the SYSTEM_PROMPT string to correct typos:
change "fianl" to "final" and change "after the reasoning steps is done" to
"after the reasoning steps are done"; ensure any references to the removed
symbol system_prompt are updated to use SYSTEM_PROMPT.
- Around line 104-114: The constructor docstring for class gsm8kDataset
incorrectly references "DAPO Math 17K"; update the docstring in
gsm8kDataset.__init__ to accurately describe the GSM8K dataset (e.g.,
"Initialize the GSM8K dataset with train split") so it matches the class name
and the call to prepare_gsm8k_dataset; keep the seed argument description as-is.

In `@nemo_rl/data/processors.py`:
- Around line 213-219: Update the function/class docstring that starts with
"Process a datum dictionary..." to a Google-style docstring (with Args, Returns,
and optional Raises sections) describing parameters (datum_dict: dict), returned
DatumSpec, and any exceptions; remove the commented-out debug line
"print(f'user_message {user_message}')" unless you add a one-line justification
above it explaining when/why it's kept (otherwise delete it). Ensure the
docstring mentions the role of datum_dict, user_message and the produced
DatumSpec and keep references to datum_dict, user_message, and problem to help
locate the code.
- Around line 206-223: math_gdpo_data_processor currently ignores task_data_spec
and hardcodes messages from datum_dict["messages"], leaves a commented debug
print, and mis-annotates the variable returned by apply_chat_template; update
the function to honor configured prompts by checking task_data_spec (use its
system_prompt_file and prompt_file when present) and generate the system/user
content via apply_chat_template(tokenize=False) instead of reading
datum_dict["messages"] directly (replace references to user_message and
extra_env_info accordingly), remove the extraneous commented-out print (or add a
brief why-comment if it must remain), and change the type annotation for the
result of apply_chat_template (the variable named message) from list[str] to str
to match its actual return type; alternatively, if config-driven prompts are not
needed, remove the task_data_spec parameter and related config entries to avoid
confusion.
- Around line 238-250: The variable message is annotated as list[str] but
tokenizer.apply_chat_template(..., tokenize=False) returns a string here; update
the annotation and usage so message is typed as str (remove the type: ignore and
change list[str] to str) and ensure subsequent calls use it as a string (the
tokenizer(...) call and assignments to user_message["token_ids"] and
user_message["content"] remain unchanged but now operate on a str). This keeps
apply_chat_template, message, user_message["token_ids"], and
user_message["content"] consistent and removes the masked type error.

In `@nemo_rl/environments/interfaces.py`:
- Around line 44-48: Remove the inline comment on the public field rewards (the
"## This could be of different shape" text) and instead update the class-level
docstring for the interface in this file to document that rewards (Tensor) may
have different shapes depending on the environment; keep the field signature
rewards: Tensor unchanged and ensure the docstring sits with the class or
Protocol that defines observations, metadata, next_stop_strings, rewards, and
terminateds so external users see the note in the public API documentation.

In `@nemo_rl/environments/math_environment.py`:
- Around line 330-342: The code is incorrectly treating extracted_answer as a
tuple (len==2 unpack) though extract_xml_answer(response) returns a str when
math_verify_impl == "hf_math_verify"; remove the dead branch that handles
return_extracted_answer (the block that asserts len==2, unpacks
extracted_answer, and appends extracted_answers) or replace it with a simple
guard returning/raising since step() already raises NotImplementedError for
return_extracted_answer; locate the logic around extract_xml_answer(response),
the return_extracted_answer flag, and the HFVerifyWorker-style unpacking to
remove or short-circuit that tuple-based handling to avoid the crash.
- Around line 293-294: The hard-coded magic number 3 for number_of_rewards in
HFMultiRewardVerifyWorker.verify and MathMultiRewardEnvironment.step should be
made configurable: add a number_of_rewards field to MathEnvConfig (with a
sensible default), expose it on the MathMultiRewardEnvironment instance (or as a
class attribute), and pass it into HFMultiRewardVerifyWorker when the worker is
constructed/initialized; then replace the literal 3 usages in
HFMultiRewardVerifyWorker.verify and MathMultiRewardEnvironment.step with the
config-backed value (e.g., self.cfg.number_of_rewards or env.number_of_rewards)
so both places derive the same single source of truth.
- Line 642: Fix the typo in the inline comment on the loop that iterates "for
result in results[0]" by changing "corretness" to "correctness" so the comment
reads that index 0 always stores the correctness reward; update the comment
adjacent to the results variable and the loop to reflect the corrected spelling.
- Around line 549-551: Remove the leftover commented-out print debug artifacts
inside the MathMultiRewardEnvironment class (the commented prints around the
constructor/initialization and other methods at the noted locations); either
delete these commented print lines or replace them with proper logging (e.g.,
self.logger.debug(...)) if runtime diagnostic info is needed, ensuring you
reference the MathMultiRewardEnvironment class and the related init/step/helper
methods where those prints appear so you update all occurrences consistently.
- Around line 233-234: Update the file header copyright year from 2025 to 2026
to reflect the new class addition; locate the top-of-file copyright comment in
the same file where the new HFMultiRewardVerifyWorker class is defined and
change the year to 2026 so the header matches the new 2026 content.
- Around line 296-361: Remove the commented-out debug print blocks and the
active print calls in the loop of math_environment.py: delete the commented
prints inside the hf_math_verify branch (around the checks of response and
ground_truth) and remove the two print(f"results ...") lines and the stale
trailing comment before the final return in the block that builds results and
extracted_answers; ensure the logic around correctness_reward_func,
int_reward_func, format_reward_func, extract_xml_answer, and the
return_extracted_answer handling remains unchanged and tests still pass.
- Around line 670-672: batch["is_end"] is 1D while batch["rewards"] is 2D,
causing broadcast/indexing errors; fix by unsqueezing is_end before any
broadcast: replace uses of batch["is_end"] with batch["is_end"].unsqueeze(-1)
when multiplying into batch["rewards"] (so batch["rewards"] = batch["rewards"] *
batch["is_end"].unsqueeze(-1)). For places where a 1D per-prompt pass vector is
required (indexing generation_lengths or calling
calculate_pass_rate_per_prompt), collapse the 2D rewards into a 1D boolean/byte
vector (e.g., batch["rewards"].eq(1).any(dim=1) or .max(dim=1).values as
appropriate) and use that 1D tensor for indexing and for
calculate_pass_rate_per_prompt.

In `@run.sh`:
- Around line 1-2: Add a POSIX shebang and a short copyright/header comment at
the top of run.sh, and either remove or replace the unexplained commented-out
command "# uv run python examples/run_grpo_math.py" with a brief comment
explaining why it is preserved (e.g., deprecated example or alternative
invocation); ensure the remaining actual command "uv run python
examples/run_gdpo_gsm8k.py" remains runnable and that the header follows the
project's copyright and style conventions.

---

Outside diff comments:
In `@nemo_rl/algorithms/grpo.py`:
- Around line 2272-2284: Async GRPO path silently ignores the use_gdpo flag when
computing advantages; add an explicit configuration validation before the async
advantage computation (the block using prompt_only_ids, rewards,
calculate_baseline_and_std_per_prompt and producing advantages) to detect the
unsupported combination master_config["grpo"]["use_gdpo"] == True and
async_grpo.enabled == True and fail fast: raise a clear error or log+exit
instructing the user to disable use_gdpo or async_grpo, so GDPO is not silently
skipped; alternatively, if you prefer to support it, implement the per-component
baseline logic used in the sync path for GDPO into the async branch where
baseline/std are computed.

In `@nemo_rl/experience/rollouts.py`:
- Around line 287-309: Remove the stray debug prints and commented-out debug
snippets in the hot path: delete the print(f"indices {indices}") and the later
print/print-debug comments near the task_rewards/terminateds block, or replace
them with proper conditional debug logging (e.g., use logger.debug(...) and
guard with logger.isEnabledFor(logging.DEBUG)) so the logic in functions
handling indices, all_indices_order, all_rewards, and all_env_observations is
not spammed by prints; ensure no commented-out debug lines remain unless
accompanied by a short comment explaining why they are kept.
- Around line 369-455: The code assumes three reward components by hardcoding
number_of_rewards=3 and indexing multi_rewards, which breaks when
calculate_rewards returns a 1D single-reward tensor or a different number of
components; change initialization of multi_rewards to derive its shape from the
first env_output.rewards (e.g., after the first calculate_rewards call) and
reshape/expand 1D rewards to 2D before adding; update the accumulation lines
(multi_rewards[active_indices] += env_output.rewards and
total_rewards[active_indices] += env_output.rewards.sum(dim=1)) to handle both
1D and 2D cases by detecting rewards.dim() and using .unsqueeze(-1) or
.sum(dim=1) accordingly so active_indices, multi_rewards, total_rewards and
calculate_rewards/env_output.rewards remain consistent (also apply the same
change to the later occurrence around the code noted at lines 533-536).
- Around line 315-323: The code in rollouts.py uses torch.stack on all_rewards
(variable name all_rewards, sorted_indices) which fails when reward tensors have
different shapes per the EnvironmentReturn interface; replace the torch.stack
call by one of: pad each reward to a common shape before stacking, group rewards
by shape and keep separate tensors, or keep rewards as a list (no stacking) and
handle downstream consumers accordingly; remove the debug print("rewards stack
...") and delete the commented-out rewards line to satisfy guidelines; ensure
references to all_env_observations and env_observations remain correctly ordered
using sorted_indices and update any downstream code expecting a single tensor to
accept the new padded/grouped/list structure.

---

Nitpick comments:
In `@examples/configs/grpo_math_1B.yaml`:
- Around line 272-276: The example enables WandB via the logger key
wandb_enabled: true; confirm whether WandB telemetry is intended for this
example and either disable it by setting logger.wandb_enabled to false or
add/update documentation/comments describing required WandB credentials, how to
provide them, and any data governance/privacy expectations for this config
(update the example config comment and relevant README or examples docs to
reference wandb_enabled and credential setup).

In `@examples/run_gdpo_gsm8k.py`:
- Around line 181-192: The variable cluster returned from setup(...) is unused;
update the unpacking to mark it as intentionally discarded by renaming cluster
to _cluster (i.e., change "cluster" to "_cluster" in the tuple on the left side
of the assignment where setup(config, tokenizer, dataset, val_dataset) is
called) so linters/readers know it’s intentionally unused.
- Around line 70-74: The return type uses Optional[AllTaskProcessedDataset];
update the annotation to use modern union syntax AllTaskProcessedDataset | None
(e.g., change ") -> tuple[AllTaskProcessedDataset,
Optional[AllTaskProcessedDataset], dict[str, EnvironmentInterface], dict[str,
EnvironmentInterface]," to ") -> tuple[AllTaskProcessedDataset,
AllTaskProcessedDataset | None, dict[str, EnvironmentInterface], dict[str,
EnvironmentInterface],"). Also remove or stop using typing.Optional imports if
they become unused and update any other occurrences of
Optional[AllTaskProcessedDataset] in this module to AllTaskProcessedDataset |
None.

In `@nemo_rl/algorithms/grpo.py`:
- Around line 1250-1257: The code currently uses
master_config["grpo"].get("use_gdpo", False) which silently falls back to False
when the config key is missing; change this to require the key from YAML and
fail fast: replace the .get call with an explicit access (e.g.,
master_config["grpo"]["use_gdpo"]) or add a clear validation step on
master_config["grpo"] that raises a descriptive error if "use_gdpo" is missing,
so that functions using use_gdpo (the block around rewards =
repeated_batch["total_reward"] and the subsequent use_multi_reward_advantages
logic that consults _get_reward_component_keys and repeated_batch) never rely on
an implicit code default.

In `@nemo_rl/data/datasets/response_datasets/__init__.py`:
- Around line 87-91: The export list (__all__) is missing the gsm8kDataset
symbol, causing it not to be publicly exposed; update the __all__ list to
include the dataset's exported name (either "gsm8kDataset" or its PascalCase
variant if the class is named that, e.g., "Gsm8kDataset") so the symbol returned
by the factory (gsm8kDataset in the factory code) is consistently exported;
locate the __all__ definition and add the exact identifier used where the class
is defined/constructed to that list.

In `@nemo_rl/distributed/ray_actor_environment_registry.py`:
- Line 36: Fix the minor formatting inconsistency in the environment registry
entry for "nemo_rl.environments.math_environment.MathMultiRewardEnvironment":
remove the extra space before the colon so the mapping matches the other entries
(i.e., use "nemo_rl.environments.math_environment.MathMultiRewardEnvironment":
PY_EXECUTABLES.SYSTEM). Locate the dictionary where this key is defined (the
registry containing PY_EXECUTABLES) and make the spacing change to comply with
PEP8/Google style.

In `@nemo_rl/environments/math_environment.py`:
- Around line 266-291: The inner reward helpers extract_xml_answer,
correctness_reward_func, int_reward_func, and format_reward_func should be moved
out of the verify function and made either module-level functions or methods on
the surrounding class to avoid reallocating them per call; refactor by cutting
these definitions from verify, pasting them as top-level functions or class
methods (keeping their signatures the same), ensure re is imported at module
level, and update verify to call these functions (e.g., self.format_reward_func
or format_reward_func) instead of the now-removed inner definitions.

In `@nemo_rl/experience/rollouts.py`:
- Around line 902-920: Remove the leftover unconditional debug print: delete the
print("testing new code") statement (located immediately after constructing
final_batch_dict in rollouts.py) or replace it with a proper logger.debug call
if you need runtime tracing; ensure you reference the same spot where
final_batch_dict is built so message_log/total_reward/idx/truncated handling
remains unchanged.

@@ -0,0 +1,260 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

⚠️ Potential issue | 🟡 Minor

Copyright year should be 2026 for new files.

The copyright header says 2025, but this file is being created in 2026.

As per coding guidelines: "Add the NVIDIA copyright header (with current year) to all Python files and shell scripts, excluding tests"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/run_gdpo_gsm8k.py` at line 1, Update the copyright header in
examples/run_gdpo_gsm8k.py from 2025 to 2026; locate the top-of-file copyright
comment (the existing line starting with "# Copyright (c) 2025, NVIDIA
CORPORATION.") and change the year to 2026 so the header matches current-year
policy for new Python files.

Comment on lines +140 to +143
# Use GDPO multi-reward advantage (per-component baselines, sum advantages, then normalize).
# When True, reward components are discovered from the batch (keys reward1, reward2, ...).
# Requires the environment to expose per-reward keys; see nemo_rl/environments/math_environment.py.
use_gdpo: NotRequired[bool]

⚠️ Potential issue | 🟡 Minor

Document use_gdpo and add explicit default in exemplar YAMLs.

You added a new GRPO config key but the default/usage isn’t reflected in exemplar YAMLs (e.g., existing GRPO configs). Please document the key’s purpose/valid values/default and add an explicit default in example configs.

As per coding guidelines, "When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/algorithms/grpo.py` around lines 140 - 143, Add documentation for the
new TypedDict key use_gdpo in nemo_rl/algorithms/grpo.py: explain it's a boolean
that toggles GDPO multi-reward advantage computation (per-component baselines,
sum advantages, then normalize), list valid values (true/false) and recommended
default (false). Update any exemplar GRPO configs under examples/configs/*.yaml
to include use_gdpo: false explicitly so defaults are visible in examples.
Ensure the TypedDict entry (use_gdpo) has a short inline comment or docstring
describing purpose, type, and default and that all example YAMLs and
README/config docs reflect the new key.


from nemo_rl.data.datasets.response_datasets.clevr import CLEVRCoGenTDataset
from nemo_rl.data.datasets.response_datasets.dapo_math import DAPOMath17KDataset
from nemo_rl.data.datasets.response_datasets.gsm8k import gsm8kDataset

⚠️ Potential issue | 🟡 Minor

Class name gsm8kDataset violates PascalCase naming convention for classes.

Per coding guidelines, class names must use PascalCase. gsm8kDataset should be renamed to Gsm8kDataset or GSM8KDataset (consistent with other dataset class names like DAPOMath17KDataset, DeepScalerDataset). This also affects the definition in gsm8k.py.

Also, the import is out of alphabetical order — gsm8k is placed between dapo_math and deepscaler, but should come after deepscaler.

As per coding guidelines: "Use PascalCase for class names"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/__init__.py` at line 18, Rename the
class symbol gsm8kDataset in gsm8k.py to a PascalCase name (e.g., Gsm8kDataset
or GSM8KDataset) and update all references/imports accordingly; specifically
change the import in response_datasets/__init__.py to import the new class name
from nemo_rl.data.datasets.response_datasets.gsm8k and adjust any usage sites.
Also reorder the import lines in __init__.py so the gsm8k import comes after
deepscaler (maintain alphabetical order consistent with DAPOMath17KDataset and
DeepScalerDataset). Ensure the class definition name in gsm8k.py and the
exported import name in __init__.py match exactly.

Comment on lines +23 to +41
SYSTEM_PROMPT = system_prompt = """
You are a helpful AI assistant.

For every request, you should carefully think through the math problem step by step, then provide the fianl answer in integer format.

Steps for Each Request:
1. Think: Provide detailed, step-by-step reasoning, calculations, or derivations.
2. Produce Final Answer: After step-by-step reasoning, output the final answer in integer format.

Output Format:
<think>Your thoughts and reasoning</think>
<answer>Final answer in integer format</answer>

Important Notes:
1. You must include your reasoning steps inside <think>.
2. You must always output the Final Answer within <answer> after the reasoning steps is done.
3. You should consistently work through the solution step by step before giving the final answer.
4. The final answer can only be an integer.
"""

⚠️ Potential issue | 🟡 Minor

Dual-assignment global, typos in the system prompt, and naming.

  1. Line 23: SYSTEM_PROMPT = system_prompt = """ creates two names for the same value. Remove the lowercase alias per the guideline to use upper snake_case for constants.
  2. Line 26: "fianl" → "final".
  3. Line 38: "after the reasoning steps is done" → "after the reasoning steps are done".
Proposed fix
-SYSTEM_PROMPT = system_prompt = """
+SYSTEM_PROMPT = """
 You are a helpful AI assistant.

-For every request, you should carefully think through the math problem step by step, then provide the fianl answer in integer format.
+For every request, you should carefully think through the math problem step by step, then provide the final answer in integer format.

 Steps for Each Request:
 1. Think: Provide detailed, step-by-step reasoning, calculations, or derivations.
 ...
-2. You must always output the Final Answer within <answer> after the reasoning steps is done.
+2. You must always output the Final Answer within <answer> after the reasoning steps are done.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
SYSTEM_PROMPT = system_prompt = """
You are a helpful AI assistant.
For every request, you should carefully think through the math problem step by step, then provide the fianl answer in integer format.
Steps for Each Request:
1. Think: Provide detailed, step-by-step reasoning, calculations, or derivations.
2. Produce Final Answer: After step-by-step reasoning, output the final answer in integer format.
Output Format:
<think>Your thoughts and reasoning</think>
<answer>Final answer in integer format</answer>
Important Notes:
1. You must include your reasoning steps inside <think>.
2. You must always output the Final Answer within <answer> after the reasoning steps is done.
3. You should consistently work through the solution step by step before giving the final answer.
4. The final answer can only be an integer.
"""
SYSTEM_PROMPT = """
You are a helpful AI assistant.
For every request, you should carefully think through the math problem step by step, then provide the final answer in integer format.
Steps for Each Request:
1. Think: Provide detailed, step-by-step reasoning, calculations, or derivations.
2. Produce Final Answer: After step-by-step reasoning, output the final answer in integer format.
Output Format:
<think>Your thoughts and reasoning</think>
<answer>Final answer in integer format</answer>
Important Notes:
1. You must include your reasoning steps inside <think>.
2. You must always output the Final Answer within <answer> after the reasoning steps are done.
3. You should consistently work through the solution step by step before giving the final answer.
4. The final answer can only be an integer.
"""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/gsm8k.py` around lines 23 - 41, The
constant is defined twice—remove the lowercase alias and keep only
SYSTEM_PROMPT; update the SYSTEM_PROMPT string to correct typos: change "fianl"
to "final" and change "after the reasoning steps is done" to "after the
reasoning steps are done"; ensure any references to the removed symbol
system_prompt are updated to use SYSTEM_PROMPT.

Comment on lines +43 to +46
def extract_hash_answer(text: str) -> str | None:
if "####" not in text:
return None
return text.split("####")[1].strip()

⚠️ Potential issue | 🟠 Major

extract_hash_answer can return None, which propagates as assistant message content.

If a GSM8K answer field doesn't contain ####, extract_hash_answer returns None (line 45). This None becomes the "content" of the assistant message at line 66, which could cause downstream failures (e.g., tokenization errors or type mismatches).

Consider raising an error or filtering such records during dataset preparation.

Also applies to: 64-67
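
A minimal sketch of the stricter behaviour suggested above, assuming the GSM8K "####" answer delimiter described in the comment; the error-message wording is ours.

def extract_hash_answer(text: str) -> str:
    # Fail fast instead of returning None when the delimiter is missing.
    if "####" not in text:
        raise ValueError(
            f"GSM8K answer field is missing the '####' delimiter: {text[:80]!r}"
        )
    return text.split("####")[1].strip()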

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/gsm8k.py` around lines 43 - 46,
extract_hash_answer currently returns None when "####" is missing, which then
becomes the assistant message content and can break downstream processing;
update extract_hash_answer to raise a clear exception (e.g., ValueError with
context) when "####" is not found, and/or modify the code that constructs the
assistant message (the call site that sets content = extract_hash_answer(...))
to skip or filter out records where extract_hash_answer would be None and log a
warning; ensure callers handle the exception or check for a missing delimiter so
no assistant message is created with None content.

Comment on lines +330 to +342
if return_extracted_answer:
# Make sure the extracted answer is not None and is a list of two elements
assert extracted_answer is not None
assert len(extracted_answer) == 2
extracted_gold, extracted_prediction = extracted_answer
# Get the extracted answer with the same logic as in the HFVerifyWorker
for pred in extracted_prediction:
if any(grader.verify(gold, pred) for gold in extracted_gold):
extracted_answers.append(pred)
break
else:
# If no match is found, means all answers are incorrect, just use the first prediction
extracted_answers.append(extracted_prediction[0][0])

⚠️ Potential issue | 🟠 Major

return_extracted_answer=True will crash — extracted_answer is a str, not a tuple.

When math_verify_impl == "hf_math_verify", extracted_answer is assigned from extract_xml_answer(response) (line 313), which returns a plain str. Lines 332–333 then assert len(extracted_answer) == 2 and attempt tuple unpacking, which will always fail for a string that isn't exactly 2 characters long.

This code path appears to be copy-pasted from HFVerifyWorker where extracted_answer comes from self.verify_func (which returns a tuple). Since the step method already raises NotImplementedError for this flag (line 625), consider removing this dead code block to avoid confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/environments/math_environment.py` around lines 330 - 342, The code is
incorrectly treating extracted_answer as a tuple (len==2 unpack) though
extract_xml_answer(response) returns a str when math_verify_impl ==
"hf_math_verify"; remove the dead branch that handles return_extracted_answer
(the block that asserts len==2, unpacks extracted_answer, and appends
extracted_answers) or replace it with a simple guard returning/raising since
step() already raises NotImplementedError for return_extracted_answer; locate
the logic around extract_xml_answer(response), the return_extracted_answer flag,
and the HFVerifyWorker-style unpacking to remove or short-circuit that
tuple-based handling to avoid the crash.

Comment on lines +549 to +551
# print(f"self.workers {self.workers}")
# print('Init MathMultiRewardEnvironment')


🛠️ Refactor suggestion | 🟠 Major

More debug artifacts in MathMultiRewardEnvironment.

Lines 549–550, 610–611, 633–634, and 647 contain commented-out print statements with no explanation. Same cleanup applies here.

Also applies to: 610-612, 633-634, 647-648

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/environments/math_environment.py` around lines 549 - 551, Remove the
leftover commented-out print debug artifacts inside the
MathMultiRewardEnvironment class (the commented prints around the
constructor/initialization and other methods at the noted locations); either
delete these commented print lines or replace them with proper logging (e.g.,
self.logger.debug(...)) if runtime diagnostic info is needed, ensuring you
reference the MathMultiRewardEnvironment class and the related init/step/helper
methods where those prints appear so you update all occurrences consistently.

if result
else "Environment: incorrect",
}
for result in results[0] ## index 0 always store corretness reward

⚠️ Potential issue | 🟡 Minor

Typo: "corretness" → "correctness".

-        for result in results[0] ## index 0 always store corretness reward
+        for result in results[0]  # index 0 always stores correctness reward
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for result in results[0] ## index 0 always store corretness reward
for result in results[0] # index 0 always stores correctness reward
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/environments/math_environment.py` at line 642, Fix the typo in the
inline comment on the loop that iterates "for result in results[0]" by changing
"corretness" to "correctness" so the comment reads that index 0 always stores
the correctness reward; update the comment adjacent to the results variable and
the loop to reflect the corrected spelling.

Comment on lines +670 to +672
batch["rewards"] = (
batch["rewards"] * batch["is_end"]
) # set a reward of 0 for any incorrectly ended sequences

⚠️ Potential issue | 🔴 Critical

Shape mismatch: 2D rewards × 1D is_end will fail at runtime.

batch["rewards"] is now shape (B, num_rewards) (set at line 646), but batch["is_end"] is shape (B,). PyTorch broadcasting aligns trailing dimensions, so (B, num_rewards) * (B,) will raise a runtime error when B != num_rewards. You need to unsqueeze is_end.

The same 2D-vs-1D mismatch propagates to lines 673 and 687–689: batch["rewards"] == 1 yields a 2D mask that cannot index generation_lengths (1D), and calculate_pass_rate_per_prompt expects a 1D rewards tensor.

Proposed fix
         batch["rewards"] = (
-            batch["rewards"] * batch["is_end"]
+            batch["rewards"] * batch["is_end"].unsqueeze(-1)
         )  # set a reward of 0 for any incorrectly ended sequences
-        if (batch["rewards"] == 1).float().sum() > 0:
+        # Use the first reward column (correctness) for scalar metrics
+        correctness_rewards = batch["rewards"][:, 0]
+        if (correctness_rewards == 1).float().sum() > 0:
             correct_solution_generation_lengths = (
                 (batch["generation_lengths"] - batch["prompt_lengths"])[
-                    batch["rewards"] == 1
+                    correctness_rewards == 1
                 ]
                 .float()
                 .mean()
                 .item()
             )
         ...
         metrics = {
-            "accuracy": batch["rewards"].mean().item(),
+            "accuracy": correctness_rewards.mean().item(),
             "pass@samples_per_prompt": calculate_pass_rate_per_prompt(
-                batch["text"], batch["rewards"]
+                batch["text"], correctness_rewards
             ),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/environments/math_environment.py` around lines 670 - 672,
batch["is_end"] is 1D while batch["rewards"] is 2D, causing broadcast/indexing
errors; fix by unsqueezing is_end before any broadcast: replace uses of
batch["is_end"] with batch["is_end"].unsqueeze(-1) when multiplying into
batch["rewards"] (so batch["rewards"] = batch["rewards"] *
batch["is_end"].unsqueeze(-1)). For places where a 1D per-prompt pass vector is
required (indexing generation_lengths or calling
calculate_pass_rate_per_prompt), collapse the 2D rewards into a 1D boolean/byte
vector (e.g., batch["rewards"].eq(1).any(dim=1) or .max(dim=1).values as
appropriate) and use that 1D tensor for indexing and for
calculate_pass_rate_per_prompt.

Comment on lines +1 to +2
# uv run python examples/run_grpo_math.py
uv run python examples/run_gdpo_gsm8k.py No newline at end of file

⚠️ Potential issue | 🟡 Minor

Missing shebang and copyright header; contains unexplained commented-out code.

Per the Google Shell Style Guide (referenced in coding guidelines), shell scripts should start with a shebang. Line 1 is a commented-out command with no explanation for why it's there.

Proposed fix
+#!/usr/bin/env bash
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
-# uv run python examples/run_grpo_math.py
 uv run python examples/run_gdpo_gsm8k.py
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# uv run python examples/run_grpo_math.py
uv run python examples/run_gdpo_gsm8k.py
#!/usr/bin/env bash
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
uv run python examples/run_gdpo_gsm8k.py
🧰 Tools
🪛 Shellcheck (0.11.0)

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@run.sh` around lines 1 - 2, Add a POSIX shebang and a short copyright/header
comment at the top of run.sh, and either remove or replace the unexplained
commented-out command "# uv run python examples/run_grpo_math.py" with a brief
comment explaining why it is preserved (e.g., deprecated example or alternative
invocation); ensure the remaining actual command "uv run python
examples/run_gdpo_gsm8k.py" remains runnable and that the header follows the
project's copyright and style conventions.
