

chore: update transformers to v5 and automodel to latest main in dtensor v2 #1962

Open
hemildesai wants to merge 7 commits into main from hemil/automodel-transformers-v5

Conversation

@hemildesai
Contributor

@hemildesai hemildesai commented Feb 15, 2026

  • Update transformers to v5 just for automodel extra
  • Update Automodel to latest main

Summary by CodeRabbit

Release Notes

  • New Features

    • Added MoE parallelizer configuration options for distributed training
    • Introduced distributed context management for enhanced distributed setup
  • Configuration Updates

    • Updated backend configuration paths for automodel components
    • Added checkpoint save period configuration option
    • Enhanced DTensor configuration with cache clearing and async checkpointing flags
  • Dependencies

    • Relaxed transformers version constraint
    • Updated transformer-engine to latest compatible version
    • Added CUDA and DeepSeek dependencies for improved model support
  • Improvements

    • Simplified distributed training initialization flow
    • Enhanced automodel execution with cache optimization

@github-actions

⚠️ File Consistency Check

Check based on commit: 9dce543 (PR #1962 from hemil/automodel-transformers-v5)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/workers/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/workers/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/workers/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 9dce543 (PR #1962 from hemil/automodel-transformers-v5)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@hemildesai hemildesai marked this pull request as ready for review February 15, 2026 19:52
@hemildesai hemildesai requested review from a team as code owners February 15, 2026 19:52
@hemildesai hemildesai added the CI:L1 Run doctests, unit tests, and functional tests label Feb 15, 2026
@coderabbitai
Contributor

coderabbitai bot commented Feb 15, 2026

📝 Walkthrough


This PR refactors the distributed context management by replacing FSDP2Manager with a new DistributedContext object, removes model_state_dict_keys parameters from checkpoint management, updates backend configuration paths from moe.utils to models.common.utils, and modifies dependency management and workspace configuration in pyproject.toml.
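For orientation, here is a minimal sketch of what the new context object could look like. The field names and call flow below are assumptions inferred from this summary, not the actual definitions in nemo_rl/models/automodel/config.py.

from typing import Any, NamedTuple, Optional

from torch.distributed.device_mesh import DeviceMesh


class DistributedContext(NamedTuple):
    # Hypothetical field set; the real NamedTuple may name or type these differently.
    device_mesh: DeviceMesh            # dense (FSDP/TP/CP) mesh
    moe_mesh: Optional[DeviceMesh]     # expert-parallel mesh, None when ep_size == 1
    fsdp2_config: Any                  # sharding config forwarded to from_pretrained
    moe_config: Optional[Any]          # MoE-specific config, None for dense models
    dp_size: int
    tp_size: int
    cp_size: int
    ep_size: int


# Illustrative call flow (signatures are assumptions):
# ctx = setup_distributed(policy_config)                        # now returns a DistributedContext
# state = setup_model_and_optimizer(policy_config, distributed_context=ctx)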

Changes

Cohort / File(s) Summary
Automodel Submodule & Dependencies
3rdparty/Automodel-workspace/Automodel, pyproject.toml
Updated Automodel submodule pointer; changed workspace setup to path-based editable mode; relaxed transformers version constraint; added nemo-automodel[moe] extra; updated transformer-engine to 2.10.0; added nvidia-cudnn-cu12 and deep_ep dependencies; introduced automodel conflicts with fsdp, mcore, vllm in lint configuration.
Backend Config Path Migration
examples/configs/recipes/llm/*, nemo_rl/models/policy/__init__.py, tests/unit/models/policy/test_automodel_types.py
Updated BackendConfig import paths from nemo_automodel.components.moe.utils.BackendConfig to nemo_automodel.components.models.common.utils.BackendConfig across YAML configs and type definitions; added checkpointing.save_period: 30 in sft config.
Distributed Context Refactoring
nemo_rl/models/automodel/config.py, nemo_rl/models/automodel/setup.py
Introduced new DistributedContext NamedTuple to encapsulate device meshes and distributed configs; replaced setup_distributed to return DistributedContext instead of FSDP2Manager; refactored setup_model_and_optimizer to accept distributed_context parameter and use from_pretrained initialization with device meshes instead of manager-based approach; removed model_state_dict_keys field from ModelAndOptimizerState.
Policy Configuration Types
nemo_rl/models/policy/__init__.py
Added MoEParallelizerOptions TypedDict with fields for MoE parallelizer settings; extended DTensorConfig with clear_cache_every_n_steps, moe_parallelizer, defer_fsdp_grad_sync, expert_parallel_size, and custom_parallel_plan; reordered existing fields for clarity.
Checkpoint Manager Cleanup
nemo_rl/models/automodel/checkpoint.py
Removed model_state_dict_keys constructor parameter, attribute, and related docstrings; removed load_base_model and set_model_state_dict_keys public methods; removed TRANSFORMERS_CACHE import; updated init_checkpointer and update_checkpointer_config calls.
Integration Updates
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py, nemo_rl/distributed/virtual_cluster.py
Updated DTensorPolicyWorkerV2 to use distributed_context instead of distributed_manager; removed model_state_dict_keys from AutomodelCheckpointManager initialization; added is_async flag to checkpoint config; added --no-cache flag to automodel uv command execution.
Test Updates
tests/unit/models/automodel/test_automodel_setup.py, tests/unit/models/automodel/test_automodel_checkpoint.py, tests/unit/models/policy/test_dtensor_worker_v2.py, tests/unit/models/policy/test_automodel_types.py
Updated tests to use new DistributedContext API; removed model_state_dict_keys from checkpoint manager tests; refactored test_automodel_setup.py to verify DistributedContext return and device mesh population; removed use_hf_tp_plan parameter from worker config tests.
Configuration
pyrefly.toml
Added nemo_rl/models/automodel/checkpoint.py to project-includes; removed nemo_rl/utils/automodel_checkpoint.py from project-includes.

Sequence Diagram

sequenceDiagram
    participant Setup as setup_distributed()
    participant Context as DistributedContext
    participant DeviceMesh as create_device_mesh()
    participant ModelSetup as setup_model_and_optimizer()
    participant FromPretrained as model_class.from_pretrained()
    participant Optimizer as OptimizerSetup
    
    Setup->>DeviceMesh: Create device/moe meshes
    DeviceMesh-->>Setup: Return meshes
    Setup->>Context: Construct DistributedContext<br/>(device_mesh, moe_mesh, fsdp2_config, moe_config, sizes)
    Setup-->>ModelSetup: Return DistributedContext
    
    ModelSetup->>ModelSetup: Validate CP/TP/EP interactions
    ModelSetup->>FromPretrained: Call with device_mesh,<br/>moe_mesh, distributed_config
    FromPretrained-->>ModelSetup: Return initialized model
    
    ModelSetup->>ModelSetup: Apply activation checkpointing<br/>and config overrides
    ModelSetup->>Optimizer: Initialize optimizer
    Optimizer-->>ModelSetup: Return optimizer state
    
    ModelSetup-->>ModelSetup: Return ModelAndOptimizerState

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

Run CICD

Suggested reviewers

  • terrykong
  • yuki-97
  • adil-a
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Merge Conflict Detection ⚠️ Warning ❌ Merge conflicts detected (27 files):

⚔️ 3rdparty/Automodel-workspace/Automodel (content)
⚔️ docker/Dockerfile (content)
⚔️ docs/guides/use-custom-vllm.md (content)
⚔️ examples/configs/grpo_math_1B.yaml (content)
⚔️ examples/configs/recipes/llm/grpo-moonlight-16b-automodel-1n8g-ep8.yaml (content)
⚔️ examples/configs/recipes/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.yaml (content)
⚔️ examples/configs/vlm_grpo_3B.yaml (content)
⚔️ examples/configs/vlm_grpo_3B_megatron.yaml (content)
⚔️ examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml (content)
⚔️ nemo_rl/algorithms/grpo.py (content)
⚔️ nemo_rl/distributed/virtual_cluster.py (content)
⚔️ nemo_rl/environments/nemo_gym.py (content)
⚔️ nemo_rl/models/automodel/config.py (content)
⚔️ nemo_rl/models/automodel/setup.py (content)
⚔️ nemo_rl/models/generation/vllm/vllm_worker.py (content)
⚔️ nemo_rl/models/policy/__init__.py (content)
⚔️ nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py (content)
⚔️ pyproject.toml (content)
⚔️ pyrefly.toml (content)
⚔️ tests/functional/grpo_non_colocated.sh (content)
⚔️ tests/unit/algorithms/test_grpo.py (content)
⚔️ tests/unit/environments/test_nemo_gym.py (content)
⚔️ tests/unit/models/automodel/test_automodel_setup.py (content)
⚔️ tests/unit/models/policy/test_automodel_types.py (content)
⚔️ tests/unit/models/policy/test_dtensor_worker_v2.py (content)
⚔️ tools/build-custom-vllm.sh (content)
⚔️ uv.lock (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
Test Results For Major Changes ⚠️ Warning PR contains major breaking changes and dependency upgrades but lacks test results, regression verification, and convergence validation in the description. Add comprehensive testing summary documenting test results, regression testing confirmation, editable install fix verification, and resolution status for identified review issues.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main changes: updating transformers and automodel dependencies in the dtensor v2 workflow.
Docstring Coverage ✅ Passed Docstring coverage is 96.77% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unit/models/policy/test_automodel_types.py (1)

21-21: ⚠️ Potential issue | 🟠 Major

Update import path to match new _target_ reference.

Line 21 imports BackendConfig from the old path nemo_automodel.components.moe.utils, but line 50's _target_ string references the new path nemo_automodel.components.models.common.utils.BackendConfig. Update the import to match:

Proposed fix
-    from nemo_automodel.components.moe.utils import BackendConfig  # noqa: F401
+    from nemo_automodel.components.models.common.utils import BackendConfig  # noqa: F401
🤖 Fix all issues with AI agents
In `@nemo_rl/distributed/virtual_cluster.py`:
- Line 56: AUTOMODEL currently includes the --no-cache flag which forces uv to
bypass its cache on every automodel worker launch; either remove --no-cache from
the AUTOMODEL string to restore normal cached startup behavior or, if it was
intentionally added to workaround dependency/stale-cache issues (e.g.,
transformers v5 transition), add an inline comment next to the AUTOMODEL
definition explaining the rationale, when it can be removed, and any
reproduction steps that justify keeping it; update the AUTOMODEL constant
accordingly to reflect the chosen approach.

In `@nemo_rl/models/automodel/setup.py`:
- Line 465: The unconditional print(model) should be run only on the main
process to avoid repeated logs in distributed runs; wrap the existing
print(model) call with a check using the existing rank variable (e.g., if rank
== 0) so only rank 0 prints the model; locate the print(model) call and guard it
with the rank check (using the same rank identifier already declared earlier) so
other ranks skip printing.
- Around line 449-463: The call to model_class.from_pretrained passes
torch_dtype as str(model_config.torch_dtype), which yields values like
"torch.float32" but the loader expects the actual torch.dtype or a bare string
like "float32"; change the argument to pass the dtype object directly
(torch_dtype=model_config.torch_dtype) in the from_pretrained call inside
setup.py (where model_class.from_pretrained is invoked) so it aligns with the
STRING_TO_DTYPE mapping and test mocks that expect a torch.dtype rather than a
stringified value.

In `@pyproject.toml`:
- Line 165: The path-based editable dependency "nemo-automodel" points to an
empty directory (3rdparty/Automodel-workspace/Automodel) and lacks a
pyproject.toml, so fix by either placing the Automodel source into that
directory or updating the dependency to the correct path; then add a valid
pyproject.toml in that directory (with project metadata and build-backend) so
the editable install for nemo-automodel succeeds and verify the package layout
(package/module files) matches the pyproject configuration.
- Around line 238-239: The comment about the transformer-engine override is
stale and the global override to "transformer-engine[pytorch]==2.10.0" may
unintentionally force TE 2.10.0 into extras like mcore which pins
"transformer-engine[pytorch]==2.8.0"; update the comment to reflect the current
2.10.0 override and the rationale, or verify and ensure mcore compatibility with
TE 2.10.0 (and adjust mcore's pin or the global override accordingly) so
automodel, mcore, and Megatron-Bridge/pyproject.toml are all consistent; search
for the symbols transformer-engine[pytorch], mcore, automodel and the
Megatron-Bridge/pyproject.toml reference to locate the relevant pins and change
either the comment or the pinning strategy to resolve the version conflict.
🧹 Nitpick comments (6)
tests/unit/models/automodel/test_automodel_checkpoint.py (1)

375-388: Redundant local re-imports of AutomodelCheckpointManager.

AutomodelCheckpointManager is already imported at the module level (Line 35). The local re-imports inside each test method (Lines 375, 395, 417, 444, 465, 495, 526, 557) are unnecessary.

nemo_rl/models/policy/__init__.py (1)

88-109: New DTensorConfig keys could use brief inline documentation.

Several newly added keys (expert_parallel_size, custom_parallel_plan, defer_fsdp_grad_sync, moe_parallelizer, clear_cache_every_n_steps) lack purpose/default documentation. The coding guidelines ask that new TypedDict keys document their purpose, valid values, and recommended default.

The grouping comments (lines 92–93, 97, 103, 105, 108) are a good start — consider adding brief per-field comments similar to the style used in AutomodelBackendConfig above.

As per coding guidelines: "When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml".
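As a concrete illustration, a hedged sketch of what per-field documentation could look like. The field names come from the summary above; the types, defaults, and semantics in the comments are assumptions to be checked against the actual DTensorConfig definition.

from typing import Any, Optional, TypedDict


class DTensorConfigSketch(TypedDict, total=False):
    # expert_parallel_size: number of expert-parallel ranks for MoE layers; 1 disables EP.
    expert_parallel_size: int
    # custom_parallel_plan: optional import path to a user-supplied parallel plan; None uses the built-in plan.
    custom_parallel_plan: Optional[str]
    # defer_fsdp_grad_sync: skip FSDP gradient sync on non-final micro-batches; recommended default True.
    defer_fsdp_grad_sync: bool
    # moe_parallelizer: nested MoEParallelizerOptions controlling the MoE parallelizer.
    moe_parallelizer: Optional[dict[str, Any]]
    # clear_cache_every_n_steps: call torch.cuda.empty_cache() every N steps; None disables.
    clear_cache_every_n_steps: Optional[int]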

pyproject.toml (1)

250-250: deep_ep override duplicates the spec already in vllm and mcore extras.

deep_ep is pinned to the same git+commit in vllm (Line 74), mcore (Line 114), and now the global override-dependencies (Line 250). This is fine for ensuring consistent resolution, but consider adding a brief comment explaining why the override is needed (e.g., ensuring automodel also uses this version).

nemo_rl/models/automodel/setup.py (2)

265-265: Hidden non-None default for defer_fsdp_grad_sync.

.get("defer_fsdp_grad_sync", True) introduces a default of True in code. Per coding guidelines, YAML should be the single source of truth for configuration defaults — non-None defaults should not be set in code.

Consider either:

  1. Making defer_fsdp_grad_sync a required field in DTensorConfig, or
  2. Setting the default in the YAML config files and accessing it directly here.

As per coding guidelines: "YAML is the single source of truth for configuration defaults; do not set non-None defaults in code for configuration values".
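A minimal sketch of option 2, assuming the default is moved into the exemplar YAMLs:

# The default (True) lives in examples/configs/*.yaml; a missing key then surfaces as a KeyError here.
defer_fsdp_grad_sync = config["dtensor_cfg"]["defer_fsdp_grad_sync"]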


436-463: Potential key collision between from_pretrained_kwargs and automodel_kwargs.

Both **from_pretrained_kwargs (from hf_config_overrides) and **automodel_kwargs are unpacked into from_pretrained(). If any key exists in both dicts, automodel_kwargs silently wins. This may be intentional, but if not, it could cause subtle config loss.

Consider adding a guard:

overlap = set(from_pretrained_kwargs) & set(automodel_kwargs)
if overlap:
    print(f"[WARNING] Overlapping keys between hf_config_overrides and automodel_kwargs: {overlap}")
tests/unit/models/automodel/test_automodel_setup.py (1)

610-627: Lambda self parameter shadows outer fixture self.

Ruff flags self as unused in the lambda on line 622. The parameter is actually the mock instance receiving __getitem__, but it shadows the fixture's self. Consider renaming to _self or _ for clarity.

♻️ Minor rename to silence Ruff ARG005
-        mock_mesh.__getitem__ = lambda self, key: {
+        mock_mesh.__getitem__ = lambda _self, key: {


# Use NeMo-RL direct dependencies and nemo-automodel.
-AUTOMODEL = f"uv run --locked --extra automodel --directory {git_root}"
+AUTOMODEL = f"uv run --locked --no-cache --extra automodel --directory {git_root}"
Contributor


⚠️ Potential issue | 🟡 Minor

--no-cache will bypass uv's cache on every automodel worker launch — is this intentional?

This flag forces uv to skip its package cache, which means every worker startup will re-resolve and potentially re-install/re-build dependencies. This can significantly increase startup time compared to the other executables (VLLM, FSDP, MCORE, etc.) which don't use --no-cache.

If this was added to work around stale-cache issues during the automodel/transformers v5 transition, consider adding a comment explaining the rationale and whether it should be removed once the transition stabilizes.
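If the flag is kept intentionally, the inline annotation could read something like the following (the wording and removal condition are only placeholders):

# NOTE: --no-cache works around stale uv cache entries observed during the transformers v5 /
# automodel bump; revisit and drop this flag once the transition stabilizes.
AUTOMODEL = f"uv run --locked --no-cache --extra automodel --directory {git_root}"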

🤖 Prompt for AI Agents
In `@nemo_rl/distributed/virtual_cluster.py` at line 56, AUTOMODEL currently
includes the --no-cache flag which forces uv to bypass its cache on every
automodel worker launch; either remove --no-cache from the AUTOMODEL string to
restore normal cached startup behavior or, if it was intentionally added to
workaround dependency/stale-cache issues (e.g., transformers v5 transition), add
an inline comment next to the AUTOMODEL definition explaining the rationale,
when it can be removed, and any reproduction steps that justify keeping it;
update the AUTOMODEL constant accordingly to reflect the chosen approach.

Comment on lines +449 to +463
model = model_class.from_pretrained(
    model_name,
    device_mesh=device_mesh,
    moe_mesh=moe_mesh,
    distributed_config=fsdp2_config,
    moe_config=moe_config if ep_size > 1 else None,
    activation_checkpointing=config["dtensor_cfg"]["activation_checkpointing"],
    peft_config=peft_config,
    attn_implementation=attn_impl,
    torch_dtype=str(model_config.torch_dtype),
    trust_remote_code=True,
    sdpa_method=sdpa_method,
    **from_pretrained_kwargs,
    **automodel_kwargs,
)
Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's find the setup.py file and examine the relevant lines
fd -n "setup.py" -g "**/nemo_rl/models/automodel/*"

Repository: NVIDIA-NeMo/RL

Length of output: 288


🏁 Script executed:

# Check what str(torch.float32) actually produces
python3 << 'EOF'
import torch
print(f"str(torch.float32) = {str(torch.float32)}")
print(f"repr(torch.float32) = {repr(torch.float32)}")
EOF

Repository: NVIDIA-NeMo/RL

Length of output: 175


🏁 Script executed:

# Now let's examine the setup.py file around line 124 and lines 449-463
wc -l nemo_rl/models/automodel/setup.py

Repository: NVIDIA-NeMo/RL

Length of output: 95


🏁 Script executed:

# Read the setup.py file around line 124 to see how model_config.torch_dtype is set
sed -n '120,130p' nemo_rl/models/automodel/setup.py

Repository: NVIDIA-NeMo/RL

Length of output: 431


🏁 Script executed:

# Read lines 449-463 to see the exact context
sed -n '449,463p' nemo_rl/models/automodel/setup.py

Repository: NVIDIA-NeMo/RL

Length of output: 615


🏁 Script executed:

# Search for nemo_automodel's from_pretrained method
rg -n 'def from_pretrained' --type py -g '**/*automodel*' -A 10 | head -100

Repository: NVIDIA-NeMo/RL

Length of output: 40


🏁 Script executed:

# Search for torch_dtype handling in the automodel directory
rg -n 'torch_dtype' nemo_rl/models/automodel/ --type py -B 2 -A 2 | head -120

Repository: NVIDIA-NeMo/RL

Length of output: 888


🏁 Script executed:

# Find where model_class comes from
sed -n '400,450p' nemo_rl/models/automodel/setup.py

Repository: NVIDIA-NeMo/RL

Length of output: 2181


🏁 Script executed:

# Search for nemo_automodel imports and class definitions
rg -n 'nemo_automodel|model_class' nemo_rl/models/automodel/setup.py | head -20

Repository: NVIDIA-NeMo/RL

Length of output: 1196


🏁 Script executed:

# Look for the from_pretrained method in nemo_automodel module
fd -type f -name "*.py" -path "*nemo_automodel*" | head -20

Repository: NVIDIA-NeMo/RL

Length of output: 228


🏁 Script executed:

# Search more broadly for any nemo_automodel implementation
rg -l 'class.*AutoModel|def from_pretrained' --type py nemo_rl/ | head -20

Repository: NVIDIA-NeMo/RL

Length of output: 146


🏁 Script executed:

# Search for any torch_dtype string parsing or conversion logic
rg -n 'torch_dtype.*str|str.*torch_dtype' --type py

Repository: NVIDIA-NeMo/RL

Length of output: 146


🏁 Script executed:

# Look for how torch_dtype is typically used elsewhere in the codebase
rg -n 'torch_dtype' --type py -B 1 -A 1 | grep -v "Binary" | head -80

Repository: NVIDIA-NeMo/RL

Length of output: 4154


🏁 Script executed:

# Check if nemo_automodel is in the repo or external
find . -name "*nemo_automodel*" -o -name "*automodel*" 2>/dev/null | grep -v ".git" | head -20

Repository: NVIDIA-NeMo/RL

Length of output: 1101


🏁 Script executed:

# Look at the resolve_model_class function to understand what models are returned
rg -n 'def resolve_model_class' --type py -A 30 | head -60

Repository: NVIDIA-NeMo/RL

Length of output: 2369


🏁 Script executed:

# Look at the test_automodel_setup.py to understand how torch_dtype is used in tests
sed -n '70,80p' tests/unit/models/automodel/test_automodel_setup.py

Repository: NVIDIA-NeMo/RL

Length of output: 363


🏁 Script executed:

# Check more context around how torch_dtype is used in tests
sed -n '365,455p' tests/unit/models/automodel/test_automodel_setup.py

Repository: NVIDIA-NeMo/RL

Length of output: 3569


🏁 Script executed:

# Search for any dtype string conversion/parsing logic
rg -n 'torch_dtype.*parse|dtype.*parse|str.*torch|torch.*str' --type py -B 2 -A 2

Repository: NVIDIA-NeMo/RL

Length of output: 50371


🏁 Script executed:

# Look for HuggingFace AutoConfig usage to understand what torch_dtype becomes
rg -n 'AutoConfig.from_pretrained' --type py -B 2 -A 5 | head -80

Repository: NVIDIA-NeMo/RL

Length of output: 6099


🏁 Script executed:

# Check what happens when AutoConfig stores torch_dtype - does it keep object or convert?
# Look at how model_config is used after being loaded
sed -n '120,180p' nemo_rl/models/automodel/setup.py

Repository: NVIDIA-NeMo/RL

Length of output: 2451


🏁 Script executed:

# Check if there's any documentation or type hints for nemo_automodel.from_pretrained
rg -n 'class.*AutoModel|def from_pretrained' nemo_rl/models/automodel/ --type py -B 2 -A 15 | head -100

Repository: NVIDIA-NeMo/RL

Length of output: 1637


🏁 Script executed:

# Check if there's dtype conversion happening anywhere in the setup module
rg -n 'dtype|torch_dtype' nemo_rl/models/automodel/setup.py --type py -B 1 -A 1

Repository: NVIDIA-NeMo/RL

Length of output: 1139


🏁 Script executed:

# Look at line 387 more closely to see if "torch.float32" string format is used elsewhere
sed -n '380,395p' nemo_rl/models/automodel/setup.py

Repository: NVIDIA-NeMo/RL

Length of output: 802


🏁 Script executed:

# Check what HuggingFace AutoConfig typically stores for torch_dtype
# by looking at how model_config attributes are accessed/used elsewhere
rg -n 'model_config\.' nemo_rl/models/automodel/setup.py --type py | head -20

Repository: NVIDIA-NeMo/RL

Length of output: 724


🏁 Script executed:

# Look for any dtype conversion utilities in the codebase
rg -n 'STRING_TO_DTYPE|DTYPE_TO_STRING' nemo_rl/ --type py -B 2 -A 5

Repository: NVIDIA-NeMo/RL

Length of output: 1348


Pass torch_dtype as the actual dtype object, not a string representation.

Line 458 converts model_config.torch_dtype to a string via str(). HuggingFace's AutoConfig stores torch_dtype as a torch.dtype object (e.g., torch.float32), so str(torch.float32) produces "torch.float32" with the "torch." prefix. However, nemo_automodel's from_pretrained method likely expects either the actual torch.dtype object or the string format "float32" without the prefix (as evidenced by the STRING_TO_DTYPE mapping in this file and test mock expectations). Pass torch_dtype=model_config.torch_dtype directly instead of torch_dtype=str(model_config.torch_dtype).

🤖 Prompt for AI Agents
In `@nemo_rl/models/automodel/setup.py` around lines 449 - 463, The call to
model_class.from_pretrained passes torch_dtype as str(model_config.torch_dtype),
which yields values like "torch.float32" but the loader expects the actual
torch.dtype or a bare string like "float32"; change the argument to pass the
dtype object directly (torch_dtype=model_config.torch_dtype) in the
from_pretrained call inside setup.py (where model_class.from_pretrained is
invoked) so it aligns with the STRING_TO_DTYPE mapping and test mocks that
expect a torch.dtype rather than a stringified value.

raise AssertionError(
"Context parallel is yet not supported for VLM models. Please set cp_size = 1 to train VLM models."
)
print(model)
Contributor


⚠️ Potential issue | 🟡 Minor

print(model) fires on every rank — consider gating on rank 0.

For distributed training with many GPUs, this will dump the full model architecture once per rank, creating excessive log output. Gate it behind rank == 0 (rank is available at Line 344).

Proposed fix
-    print(model)
+    if rank == 0:
+        print(model)
🤖 Prompt for AI Agents
In `@nemo_rl/models/automodel/setup.py` at line 465, The unconditional
print(model) should be run only on the main process to avoid repeated logs in
distributed runs; wrap the existing print(model) call with a check using the
existing rank variable (e.g., if rank == 0) so only rank 0 prints the model;
locate the print(model) call and guard it with the rank check (using the same
rank identifier already declared earlier) so other ranks skip printing.

[tool.uv.sources]
megatron-core = { workspace = true }
-nemo-automodel = { workspace = true }
+nemo-automodel = { path = "3rdparty/Automodel-workspace/Automodel", editable = true }
Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the Automodel path exists and has a pyproject.toml
ls -la 3rdparty/Automodel-workspace/Automodel/pyproject.toml 2>/dev/null || echo "NOT FOUND: 3rdparty/Automodel-workspace/Automodel/pyproject.toml"

# Check that Automodel-workspace is no longer listed in workspace members
echo "---"
echo "Workspace members:"
rg 'Automodel' pyproject.toml -n

Repository: NVIDIA-NeMo/RL

Length of output: 235


🏁 Script executed:

#!/bin/bash
# Check if the directory exists at all
echo "=== Directory existence check ==="
ls -la 3rdparty/Automodel-workspace/Automodel/ 2>/dev/null || echo "Directory does not exist"

# Show the full [tool.uv.workspace] section
echo "=== Workspace configuration ==="
rg '\[tool\.uv\.workspace\]' -A 20 pyproject.toml

# Check if the Automodel-workspace directory exists
echo "=== Parent directory check ==="
ls -la 3rdparty/Automodel-workspace/ 2>/dev/null || echo "Parent directory does not exist"

Repository: NVIDIA-NeMo/RL

Length of output: 1247


The path-based editable install for nemo-automodel is broken—the referenced directory is empty and contains no pyproject.toml.

The dependency at line 165 points to 3rdparty/Automodel-workspace/Automodel, which exists but is completely empty. A valid pyproject.toml must be added to this directory before the editable install will work. Verify that the Automodel source code has been properly placed in this location. The workspace member configuration has been correctly removed from [tool.uv.workspace].

🤖 Prompt for AI Agents
In `@pyproject.toml` at line 165, The path-based editable dependency
"nemo-automodel" points to an empty directory
(3rdparty/Automodel-workspace/Automodel) and lacks a pyproject.toml, so fix by
either placing the Automodel source into that directory or updating the
dependency to the correct path; then add a valid pyproject.toml in that
directory (with project metadata and build-backend) so the editable install for
nemo-automodel succeeds and verify the package layout (package/module files)
matches the pyproject configuration.

Comment on lines +238 to +239
"transformer-engine[pytorch]==2.10.0",
"nvidia-cudnn-cu12==9.19.0.56",
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, let's examine the relevant sections of pyproject.toml
echo "=== Checking mcore extra (around line 107) ==="
sed -n '100,115p' pyproject.toml

echo -e "\n=== Checking TE override (around line 238-239) ==="
sed -n '230,245p' pyproject.toml

echo -e "\n=== Checking conflicts section (around line 281-296) ==="
sed -n '275,300p' pyproject.toml

Repository: NVIDIA-NeMo/RL

Length of output: 2613


🏁 Script executed:

#!/bin/bash
# Search for any comments about TE version requirements or constraints
echo "=== Searching for TE version comments/requirements ==="
rg -n 'transformer.engine|TE.*version|TE.*2\.[0-9]' pyproject.toml -B 1 -A 1

echo -e "\n=== Checking if mcore is mentioned in conflict entries ==="
grep -n 'mcore\|mbridge\|automodel' pyproject.toml | head -20

Repository: NVIDIA-NeMo/RL

Length of output: 3086


🏁 Script executed:

#!/bin/bash
# Check for any documentation or comments about the override rationale
echo "=== Looking for override comments and rationale ==="
sed -n '230,250p' pyproject.toml

Repository: NVIDIA-NeMo/RL

Length of output: 1381


🏁 Script executed:

#!/bin/bash
# Check if there's any megatron-core source in workspace or requirements
echo "=== Searching for megatron-core version or TE compatibility mentions ==="
rg -i 'megatron.core|megatron-bridge' --type toml -B 2 -A 2 pyproject.toml

echo -e "\n=== Checking for any GitHub issues or comments about TE versions ==="
find . -name '*.md' -o -name '*.txt' | xargs grep -l -i 'transformer.engine.*2\.10\|megatron.*2\.8' 2>/dev/null | head -5

Repository: NVIDIA-NeMo/RL

Length of output: 1555


🏁 Script executed:

#!/bin/bash
# Check PR/commit context for the TE override change
echo "=== Looking for any configuration or documentation about TE/mcore compatibility ==="
find . -name '*README*' -o -name '*INSTALL*' -o -name '*requirements*' | xargs grep -l -i 'transformer.engine\|megatron' 2>/dev/null | head -5

# Also check if there's a Megatron-Bridge workspace reference
ls -la 3rdparty/ 2>/dev/null || echo "3rdparty not found or empty"

Repository: NVIDIA-NeMo/RL

Length of output: 659


🏁 Script executed:

#!/bin/bash
# Check the actual megatron-bridge or megatron-core package to see TE requirements
echo "=== Checking git history for the TE override change ==="
git log --oneline --all -S 'transformer-engine.*2.10.0' -- pyproject.toml 2>/dev/null | head -5

echo -e "\n=== Checking recent changes to pyproject.toml ==="
git log --oneline -20 -- pyproject.toml 2>/dev/null | head -10

Repository: NVIDIA-NeMo/RL

Length of output: 255


The TE 2.10.0 override comment is outdated; verify mcore compatibility.

The override comment at line 233 states "The TE override is needed because automodel/mbridge we are on is still on 2.5.0", but the override is now 2.10.0. Additionally, the global override forces TE 2.10.0 for all extras, including mcore, which explicitly pins transformer-engine[pytorch]==2.8.0 (line 107). While automodel and mcore are in conflict (preventing coinstallation), installing mcore alone will receive TE 2.10.0 due to the global override.

Either update the comment to reflect the new override version and rationale, or verify that mcore is compatible with TE 2.10.0. The comment at line 104-106 explicitly notes that mcore's TE pin "needs to be compatible with the spec in Megatron-Bridge/pyproject.toml", suggesting this should be intentional rather than an unintended side effect.

🤖 Prompt for AI Agents
In `@pyproject.toml` around lines 238 - 239, The comment about the
transformer-engine override is stale and the global override to
"transformer-engine[pytorch]==2.10.0" may unintentionally force TE 2.10.0 into
extras like mcore which pins "transformer-engine[pytorch]==2.8.0"; update the
comment to reflect the current 2.10.0 override and the rationale, or verify and
ensure mcore compatibility with TE 2.10.0 (and adjust mcore's pin or the global
override accordingly) so automodel, mcore, and Megatron-Bridge/pyproject.toml
are all consistent; search for the symbols transformer-engine[pytorch], mcore,
automodel and the Megatron-Bridge/pyproject.toml reference to locate the
relevant pins and change either the comment or the pinning strategy to resolve
the version conflict.

@hemildesai hemildesai removed the CI:L1 Run doctests, unit tests, and functional tests label Feb 16, 2026
@github-actions

⚠️ File Consistency Check

Check based on commit: fdb0374 (PR #1962 from hemil/automodel-transformers-v5)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/workers/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/workers/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/workers/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@hemildesai hemildesai added the CI:L1 Run doctests, unit tests, and functional tests label Feb 16, 2026
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: fdb0374 (PR #1962 from hemil/automodel-transformers-v5)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@hemildesai hemildesai removed the CI:L1 Run doctests, unit tests, and functional tests label Feb 17, 2026
@github-actions

⚠️ File Consistency Check

Check based on commit: 945117f (PR #1962 from hemil/automodel-transformers-v5)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/workers/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/workers/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/workers/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@hemildesai hemildesai added the CI:L1 Run doctests, unit tests, and functional tests label Feb 17, 2026
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 945117f (PR #1962 from hemil/automodel-transformers-v5)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@github-actions

⚠️ File Consistency Check

Check based on commit: d9143bc (PR #1962 from hemil/automodel-transformers-v5)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/workers/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/workers/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/workers/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: d9143bc (PR #1962 from hemil/automodel-transformers-v5)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@hemildesai hemildesai removed the CI:L1 Run doctests, unit tests, and functional tests label Feb 17, 2026
Signed-off-by: Hemil Desai <hemild@nvidia.com>
@hemildesai hemildesai removed the CI:L1 Run doctests, unit tests, and functional tests label Feb 17, 2026
@github-actions

⚠️ File Consistency Check

Check based on commit: 3ef6ad0 (PR #1962 from hemil/automodel-transformers-v5)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/workers/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/workers/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/workers/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@hemildesai hemildesai added the CI:L1 Run doctests, unit tests, and functional tests label Feb 17, 2026
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 3ef6ad0 (PR #1962 from hemil/automodel-transformers-v5)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@hemildesai hemildesai requested review from a team and terrykong as code owners February 18, 2026 06:47
@github-actions github-actions bot added the CI Relating to CI label Feb 18, 2026
@github-actions

ℹ️ File Consistency Check

Check based on commit: d36caba (PR #1962 from hemil/automodel-transformers-v5)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@hemildesai hemildesai added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 18, 2026
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: d36caba (PR #1962 from hemil/automodel-transformers-v5)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@hemildesai hemildesai added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 18, 2026
@github-actions

ℹ️ File Consistency Check

Check based on commit: db6c6ae (PR #1962 from hemil/automodel-transformers-v5)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: db6c6ae (PR #1962 from hemil/automodel-transformers-v5)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Signed-off-by: Hemil Desai <hemild@nvidia.com>
@hemildesai hemildesai force-pushed the hemil/automodel-transformers-v5 branch from db6c6ae to 23b7ccb February 18, 2026 07:04
@github-actions

⚠️ File Consistency Check

Check based on commit: 23b7ccb (PR #1962 from hemil/automodel-transformers-v5)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/workers/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/workers/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/workers/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 23b7ccb (PR #1962 from hemil/automodel-transformers-v5)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@hemildesai hemildesai removed request for a team February 18, 2026 07:05
@hemildesai hemildesai added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 18, 2026