chore: update transformers to v5 and automodel to latest main in dtensor v2 #1962
hemildesai wants to merge 7 commits into main
Conversation
📝 Walkthrough
This PR refactors the distributed context management by replacing …
Changes
Sequence Diagram
```mermaid
sequenceDiagram
    participant Setup as setup_distributed()
    participant Context as DistributedContext
    participant DeviceMesh as create_device_mesh()
    participant ModelSetup as setup_model_and_optimizer()
    participant FromPretrained as model_class.from_pretrained()
    participant Optimizer as OptimizerSetup
    Setup->>DeviceMesh: Create device/moe meshes
    DeviceMesh-->>Setup: Return meshes
    Setup->>Context: Construct DistributedContext<br/>(device_mesh, moe_mesh, fsdp2_config, moe_config, sizes)
    Setup-->>ModelSetup: Return DistributedContext
    ModelSetup->>ModelSetup: Validate CP/TP/EP interactions
    ModelSetup->>FromPretrained: Call with device_mesh,<br/>moe_mesh, distributed_config
    FromPretrained-->>ModelSetup: Return initialized model
    ModelSetup->>ModelSetup: Apply activation checkpointing<br/>and config overrides
    ModelSetup->>Optimizer: Initialize optimizer
    Optimizer-->>ModelSetup: Return optimizer state
    ModelSetup-->>ModelSetup: Return ModelAndOptimizerState
```
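A rough sketch of the context object implied by the diagram; the field names come from the diagram above, while the types, defaults, and per-dimension size names are assumptions rather than the actual NeMo-RL/nemo_automodel definitions:

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class DistributedContext:
    """Illustrative container mirroring the fields named in the sequence diagram."""

    device_mesh: Any                  # a torch.distributed DeviceMesh in practice (assumed)
    moe_mesh: Optional[Any] = None    # separate mesh used when expert parallelism is enabled (assumed)
    fsdp2_config: Optional[dict] = None
    moe_config: Optional[dict] = None
    # "sizes" in the diagram; the exact field names are assumptions.
    dp_size: int = 1
    tp_size: int = 1
    cp_size: int = 1
    ep_size: int = 1
```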
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing touches
🧪 Generate unit tests (beta)
⚔️ Resolve merge conflicts (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unit/models/policy/test_automodel_types.py (1)
21-21: ⚠️ Potential issue | 🟠 Major
Update import path to match new `_target_` reference.
Line 21 imports `BackendConfig` from the old path `nemo_automodel.components.moe.utils`, but line 50's `_target_` string references the new path `nemo_automodel.components.models.common.utils.BackendConfig`. Update the import to match:
Proposed fix
```diff
- from nemo_automodel.components.moe.utils import BackendConfig  # noqa: F401
+ from nemo_automodel.components.models.common.utils import BackendConfig  # noqa: F401
```
🤖 Fix all issues with AI agents
In `@nemo_rl/distributed/virtual_cluster.py`:
- Line 56: AUTOMODEL currently includes the --no-cache flag which forces uv to
bypass its cache on every automodel worker launch; either remove --no-cache from
the AUTOMODEL string to restore normal cached startup behavior or, if it was
intentionally added to workaround dependency/stale-cache issues (e.g.,
transformers v5 transition), add an inline comment next to the AUTOMODEL
definition explaining the rationale, when it can be removed, and any
reproduction steps that justify keeping it; update the AUTOMODEL constant
accordingly to reflect the chosen approach.
In `@nemo_rl/models/automodel/setup.py`:
- Line 465: The unconditional print(model) should be run only on the main
process to avoid repeated logs in distributed runs; wrap the existing
print(model) call with a check using the existing rank variable (e.g., if rank
== 0) so only rank 0 prints the model; locate the print(model) call and guard it
with the rank check (using the same rank identifier already declared earlier) so
other ranks skip printing.
- Around line 449-463: The call to model_class.from_pretrained passes
torch_dtype as str(model_config.torch_dtype), which yields values like
"torch.float32" but the loader expects the actual torch.dtype or a bare string
like "float32"; change the argument to pass the dtype object directly
(torch_dtype=model_config.torch_dtype) in the from_pretrained call inside
setup.py (where model_class.from_pretrained is invoked) so it aligns with the
STRING_TO_DTYPE mapping and test mocks that expect a torch.dtype rather than a
stringified value.
In `@pyproject.toml`:
- Line 165: The path-based editable dependency "nemo-automodel" points to an
empty directory (3rdparty/Automodel-workspace/Automodel) and lacks a
pyproject.toml, so fix by either placing the Automodel source into that
directory or updating the dependency to the correct path; then add a valid
pyproject.toml in that directory (with project metadata and build-backend) so
the editable install for nemo-automodel succeeds and verify the package layout
(package/module files) matches the pyproject configuration.
- Around line 238-239: The comment about the transformer-engine override is
stale and the global override to "transformer-engine[pytorch]==2.10.0" may
unintentionally force TE 2.10.0 into extras like mcore which pins
"transformer-engine[pytorch]==2.8.0"; update the comment to reflect the current
2.10.0 override and the rationale, or verify and ensure mcore compatibility with
TE 2.10.0 (and adjust mcore's pin or the global override accordingly) so
automodel, mcore, and Megatron-Bridge/pyproject.toml are all consistent; search
for the symbols transformer-engine[pytorch], mcore, automodel and the
Megatron-Bridge/pyproject.toml reference to locate the relevant pins and change
either the comment or the pinning strategy to resolve the version conflict.
🧹 Nitpick comments (6)
tests/unit/models/automodel/test_automodel_checkpoint.py (1)
375-388: Redundant local re-imports of `AutomodelCheckpointManager`.
`AutomodelCheckpointManager` is already imported at the module level (Line 35). The local re-imports inside each test method (Lines 375, 395, 417, 444, 465, 495, 526, 557) are unnecessary.
nemo_rl/models/policy/__init__.py (1)
88-109: New `DTensorConfig` keys could use brief inline documentation.
Several newly added keys (`expert_parallel_size`, `custom_parallel_plan`, `defer_fsdp_grad_sync`, `moe_parallelizer`, `clear_cache_every_n_steps`) lack purpose/default documentation. The coding guidelines ask that new TypedDict keys document their purpose, valid values, and recommended default.
The grouping comments (lines 92–93, 97, 103, 105, 108) are a good start — consider adding brief per-field comments similar to the style used in `AutomodelBackendConfig` above; a sketch follows below.
As per coding guidelines: "When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, recommended default, and reflect the default in exemplar YAMLs under `examples/configs/*.yaml`".
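A minimal sketch of what such per-field comments could look like, limited to the newly added keys named above; the types and described defaults are illustrative assumptions, not the actual NeMo-RL definitions:

```python
from typing import Optional, TypedDict


class DTensorConfig(TypedDict, total=False):
    # Number of ranks each MoE expert group is sharded across; 1 disables expert
    # parallelism (type and default are illustrative assumptions).
    expert_parallel_size: int
    # Dotted import path to a user-supplied parallel plan overriding the default
    # sharding strategy (assumed Optional[str]).
    custom_parallel_plan: Optional[str]
    # Whether FSDP gradient synchronization is deferred across micro-batches
    # (assumed bool; the recommended default belongs in examples/configs/*.yaml).
    defer_fsdp_grad_sync: bool
    # Identifier selecting the MoE parallelizer implementation (assumed str).
    moe_parallelizer: str
    # Call torch.cuda.empty_cache() every N steps; 0 or None disables it (assumed).
    clear_cache_every_n_steps: Optional[int]
```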
pyproject.toml (1)
250-250: `deep_ep` override duplicates the spec already in `vllm` and `mcore` extras.
`deep_ep` is pinned to the same git+commit in `vllm` (Line 74), `mcore` (Line 114), and now the global `override-dependencies` (Line 250). This is fine for ensuring consistent resolution, but consider adding a brief comment explaining why the override is needed (e.g., ensuring automodel also uses this version).
nemo_rl/models/automodel/setup.py (2)
265-265: Hidden non-None default for `defer_fsdp_grad_sync`.
`.get("defer_fsdp_grad_sync", True)` introduces a default of `True` in code. Per coding guidelines, YAML should be the single source of truth for configuration defaults — non-None defaults should not be set in code.
Consider either:
- Making `defer_fsdp_grad_sync` a required field in `DTensorConfig`, or
- Setting the default in the YAML config files and accessing it directly here (see the sketch after this list).
As per coding guidelines: "YAML is the single source of truth for configuration defaults; do not set non-None defaults in code for configuration values".
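A minimal sketch of the second option, assuming the exemplar YAMLs under `examples/configs/*.yaml` set the key explicitly; the `config` dict below is a stand-in for the parsed YAML, not the real object:

```python
config = {"dtensor_cfg": {"defer_fsdp_grad_sync": True}}  # stand-in for the parsed YAML config

try:
    # No in-code fallback: the YAML stays the single source of truth for the default.
    defer_fsdp_grad_sync = config["dtensor_cfg"]["defer_fsdp_grad_sync"]
except KeyError as err:
    raise KeyError(
        "dtensor_cfg.defer_fsdp_grad_sync must be set in the YAML config (see examples/configs/*.yaml)"
    ) from err

print(defer_fsdp_grad_sync)
```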
436-463: Potential key collision between `from_pretrained_kwargs` and `automodel_kwargs`.
Both `**from_pretrained_kwargs` (from `hf_config_overrides`) and `**automodel_kwargs` are unpacked into `from_pretrained()`. If any key exists in both dicts, `automodel_kwargs` silently wins. This may be intentional, but if not, it could cause subtle config loss.
Consider adding a guard:
```python
overlap = set(from_pretrained_kwargs) & set(automodel_kwargs)
if overlap:
    print(f"[WARNING] Overlapping keys between hf_config_overrides and automodel_kwargs: {overlap}")
```
tests/unit/models/automodel/test_automodel_setup.py (1)
610-627: Lambda `self` parameter shadows outer fixture `self`.
Ruff flags `self` as unused in the lambda on line 622. The parameter is actually the mock instance receiving `__getitem__`, but it shadows the fixture's `self`. Consider renaming to `_self` or `_` for clarity.
♻️ Minor rename to silence Ruff ARG005
```diff
- mock_mesh.__getitem__ = lambda self, key: {
+ mock_mesh.__getitem__ = lambda _self, key: {
```
```diff
  # Use NeMo-RL direct dependencies and nemo-automodel.
- AUTOMODEL = f"uv run --locked --extra automodel --directory {git_root}"
+ AUTOMODEL = f"uv run --locked --no-cache --extra automodel --directory {git_root}"
```
--no-cache will bypass uv's cache on every automodel worker launch — is this intentional?
This flag forces uv to skip its package cache, which means every worker startup will re-resolve and potentially re-install/re-build dependencies. This can significantly increase startup time compared to the other executables (VLLM, FSDP, MCORE, etc.) which don't use --no-cache.
If this was added to work around stale-cache issues during the automodel/transformers v5 transition, consider adding a comment explaining the rationale and whether it should be removed once the transition stabilizes.
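If the flag is kept, a minimal sketch of the kind of inline rationale being asked for; the wording, removal condition, and `git_root` placeholder are illustrative assumptions:

```python
git_root = "/path/to/NeMo-RL"  # placeholder; the real value is computed elsewhere in the module

# NOTE (assumed rationale): --no-cache works around stale uv cache entries observed while
# moving to transformers v5 and the new automodel pin. It re-resolves dependencies on
# every worker launch, so drop it once startups are stable without it.
AUTOMODEL = f"uv run --locked --no-cache --extra automodel --directory {git_root}"
```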
🤖 Prompt for AI Agents
In `@nemo_rl/distributed/virtual_cluster.py` at line 56, AUTOMODEL currently
includes the --no-cache flag which forces uv to bypass its cache on every
automodel worker launch; either remove --no-cache from the AUTOMODEL string to
restore normal cached startup behavior or, if it was intentionally added to
workaround dependency/stale-cache issues (e.g., transformers v5 transition), add
an inline comment next to the AUTOMODEL definition explaining the rationale,
when it can be removed, and any reproduction steps that justify keeping it;
update the AUTOMODEL constant accordingly to reflect the chosen approach.
```python
model = model_class.from_pretrained(
    model_name,
    device_mesh=device_mesh,
    moe_mesh=moe_mesh,
    distributed_config=fsdp2_config,
    moe_config=moe_config if ep_size > 1 else None,
    activation_checkpointing=config["dtensor_cfg"]["activation_checkpointing"],
    peft_config=peft_config,
    attn_implementation=attn_impl,
    torch_dtype=str(model_config.torch_dtype),
    trust_remote_code=True,
    sdpa_method=sdpa_method,
    **from_pretrained_kwargs,
    **automodel_kwargs,
)
```
🧩 Analysis chain
🏁 Script executed:
```bash
# First, let's find the setup.py file and examine the relevant lines
fd -n "setup.py" -g "**/nemo_rl/models/automodel/*"
```
Repository: NVIDIA-NeMo/RL
Length of output: 288

🏁 Script executed:
```bash
# Check what str(torch.float32) actually produces
python3 << 'EOF'
import torch
print(f"str(torch.float32) = {str(torch.float32)}")
print(f"repr(torch.float32) = {repr(torch.float32)}")
EOF
```
Repository: NVIDIA-NeMo/RL
Length of output: 175

🏁 Script executed:
```bash
# Now let's examine the setup.py file around line 124 and lines 449-463
wc -l nemo_rl/models/automodel/setup.py
```
Repository: NVIDIA-NeMo/RL
Length of output: 95

🏁 Script executed:
```bash
# Read the setup.py file around line 124 to see how model_config.torch_dtype is set
sed -n '120,130p' nemo_rl/models/automodel/setup.py
```
Repository: NVIDIA-NeMo/RL
Length of output: 431

🏁 Script executed:
```bash
# Read lines 449-463 to see the exact context
sed -n '449,463p' nemo_rl/models/automodel/setup.py
```
Repository: NVIDIA-NeMo/RL
Length of output: 615

🏁 Script executed:
```bash
# Search for nemo_automodel's from_pretrained method
rg -n 'def from_pretrained' --type py -g '**/*automodel*' -A 10 | head -100
```
Repository: NVIDIA-NeMo/RL
Length of output: 40

🏁 Script executed:
```bash
# Search for torch_dtype handling in the automodel directory
rg -n 'torch_dtype' nemo_rl/models/automodel/ --type py -B 2 -A 2 | head -120
```
Repository: NVIDIA-NeMo/RL
Length of output: 888

🏁 Script executed:
```bash
# Find where model_class comes from
sed -n '400,450p' nemo_rl/models/automodel/setup.py
```
Repository: NVIDIA-NeMo/RL
Length of output: 2181

🏁 Script executed:
```bash
# Search for nemo_automodel imports and class definitions
rg -n 'nemo_automodel|model_class' nemo_rl/models/automodel/setup.py | head -20
```
Repository: NVIDIA-NeMo/RL
Length of output: 1196

🏁 Script executed:
```bash
# Look for the from_pretrained method in nemo_automodel module
fd -type f -name "*.py" -path "*nemo_automodel*" | head -20
```
Repository: NVIDIA-NeMo/RL
Length of output: 228

🏁 Script executed:
```bash
# Search more broadly for any nemo_automodel implementation
rg -l 'class.*AutoModel|def from_pretrained' --type py nemo_rl/ | head -20
```
Repository: NVIDIA-NeMo/RL
Length of output: 146

🏁 Script executed:
```bash
# Search for any torch_dtype string parsing or conversion logic
rg -n 'torch_dtype.*str|str.*torch_dtype' --type py
```
Repository: NVIDIA-NeMo/RL
Length of output: 146

🏁 Script executed:
```bash
# Look for how torch_dtype is typically used elsewhere in the codebase
rg -n 'torch_dtype' --type py -B 1 -A 1 | grep -v "Binary" | head -80
```
Repository: NVIDIA-NeMo/RL
Length of output: 4154

🏁 Script executed:
```bash
# Check if nemo_automodel is in the repo or external
find . -name "*nemo_automodel*" -o -name "*automodel*" 2>/dev/null | grep -v ".git" | head -20
```
Repository: NVIDIA-NeMo/RL
Length of output: 1101

🏁 Script executed:
```bash
# Look at the resolve_model_class function to understand what models are returned
rg -n 'def resolve_model_class' --type py -A 30 | head -60
```
Repository: NVIDIA-NeMo/RL
Length of output: 2369

🏁 Script executed:
```bash
# Look at the test_automodel_setup.py to understand how torch_dtype is used in tests
sed -n '70,80p' tests/unit/models/automodel/test_automodel_setup.py
```
Repository: NVIDIA-NeMo/RL
Length of output: 363

🏁 Script executed:
```bash
# Check more context around how torch_dtype is used in tests
sed -n '365,455p' tests/unit/models/automodel/test_automodel_setup.py
```
Repository: NVIDIA-NeMo/RL
Length of output: 3569

🏁 Script executed:
```bash
# Search for any dtype string conversion/parsing logic
rg -n 'torch_dtype.*parse|dtype.*parse|str.*torch|torch.*str' --type py -B 2 -A 2
```
Repository: NVIDIA-NeMo/RL
Length of output: 50371

🏁 Script executed:
```bash
# Look for HuggingFace AutoConfig usage to understand what torch_dtype becomes
rg -n 'AutoConfig.from_pretrained' --type py -B 2 -A 5 | head -80
```
Repository: NVIDIA-NeMo/RL
Length of output: 6099

🏁 Script executed:
```bash
# Check what happens when AutoConfig stores torch_dtype - does it keep object or convert?
# Look at how model_config is used after being loaded
sed -n '120,180p' nemo_rl/models/automodel/setup.py
```
Repository: NVIDIA-NeMo/RL
Length of output: 2451

🏁 Script executed:
```bash
# Check if there's any documentation or type hints for nemo_automodel.from_pretrained
rg -n 'class.*AutoModel|def from_pretrained' nemo_rl/models/automodel/ --type py -B 2 -A 15 | head -100
```
Repository: NVIDIA-NeMo/RL
Length of output: 1637

🏁 Script executed:
```bash
# Check if there's dtype conversion happening anywhere in the setup module
rg -n 'dtype|torch_dtype' nemo_rl/models/automodel/setup.py --type py -B 1 -A 1
```
Repository: NVIDIA-NeMo/RL
Length of output: 1139

🏁 Script executed:
```bash
# Look at line 387 more closely to see if "torch.float32" string format is used elsewhere
sed -n '380,395p' nemo_rl/models/automodel/setup.py
```
Repository: NVIDIA-NeMo/RL
Length of output: 802

🏁 Script executed:
```bash
# Check what HuggingFace AutoConfig typically stores for torch_dtype
# by looking at how model_config attributes are accessed/used elsewhere
rg -n 'model_config\.' nemo_rl/models/automodel/setup.py --type py | head -20
```
Repository: NVIDIA-NeMo/RL
Length of output: 724

🏁 Script executed:
```bash
# Look for any dtype conversion utilities in the codebase
rg -n 'STRING_TO_DTYPE|DTYPE_TO_STRING' nemo_rl/ --type py -B 2 -A 5
```
Repository: NVIDIA-NeMo/RL
Length of output: 1348
Pass torch_dtype as the actual dtype object, not a string representation.
Line 458 converts model_config.torch_dtype to a string via str(). HuggingFace's AutoConfig stores torch_dtype as a torch.dtype object (e.g., torch.float32), so str(torch.float32) produces "torch.float32" with the "torch." prefix. However, nemo_automodel's from_pretrained method likely expects either the actual torch.dtype object or the string format "float32" without the prefix (as evidenced by the STRING_TO_DTYPE mapping in this file and test mock expectations). Pass torch_dtype=model_config.torch_dtype directly instead of torch_dtype=str(model_config.torch_dtype).
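A small sketch of the distinction, using only plain PyTorch; the `removeprefix` fallback is an illustration, not a claim about what nemo_automodel accepts beyond the review's findings:

```python
import torch

# str() keeps the module prefix, which name-based dtype lookups typically reject.
assert str(torch.float32) == "torch.float32"

# Preferred per the comment above: pass the dtype object through unchanged.
torch_dtype = torch.float32

# If a bare string is truly required, strip the prefix before looking the dtype up again.
bare_name = str(torch.float32).removeprefix("torch.")  # -> "float32"
assert getattr(torch, bare_name) is torch.float32
```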
🤖 Prompt for AI Agents
In `@nemo_rl/models/automodel/setup.py` around lines 449 - 463, The call to
model_class.from_pretrained passes torch_dtype as str(model_config.torch_dtype),
which yields values like "torch.float32" but the loader expects the actual
torch.dtype or a bare string like "float32"; change the argument to pass the
dtype object directly (torch_dtype=model_config.torch_dtype) in the
from_pretrained call inside setup.py (where model_class.from_pretrained is
invoked) so it aligns with the STRING_TO_DTYPE mapping and test mocks that
expect a torch.dtype rather than a stringified value.
```python
    raise AssertionError(
        "Context parallel is yet not supported for VLM models. Please set cp_size = 1 to train VLM models."
    )
print(model)
```
print(model) fires on every rank — consider gating on rank 0.
For distributed training with many GPUs, this will dump the full model architecture once per rank, creating excessive log output. Gate it behind rank == 0 (rank is available at Line 344).
Proposed fix
```diff
- print(model)
+ if rank == 0:
+     print(model)
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
- print(model)
+ if rank == 0:
+     print(model)
```
🤖 Prompt for AI Agents
In `@nemo_rl/models/automodel/setup.py` at line 465, The unconditional
print(model) should be run only on the main process to avoid repeated logs in
distributed runs; wrap the existing print(model) call with a check using the
existing rank variable (e.g., if rank == 0) so only rank 0 prints the model;
locate the print(model) call and guard it with the rank check (using the same
rank identifier already declared earlier) so other ranks skip printing.
```diff
 [tool.uv.sources]
 megatron-core = { workspace = true }
-nemo-automodel = { workspace = true }
+nemo-automodel = { path = "3rdparty/Automodel-workspace/Automodel", editable = true }
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Verify the Automodel path exists and has a pyproject.toml
ls -la 3rdparty/Automodel-workspace/Automodel/pyproject.toml 2>/dev/null || echo "NOT FOUND: 3rdparty/Automodel-workspace/Automodel/pyproject.toml"
# Check that Automodel-workspace is no longer listed in workspace members
echo "---"
echo "Workspace members:"
rg 'Automodel' pyproject.toml -n
```
Repository: NVIDIA-NeMo/RL
Length of output: 235

🏁 Script executed:
```bash
#!/bin/bash
# Check if the directory exists at all
echo "=== Directory existence check ==="
ls -la 3rdparty/Automodel-workspace/Automodel/ 2>/dev/null || echo "Directory does not exist"
# Show the full [tool.uv.workspace] section
echo "=== Workspace configuration ==="
rg '\[tool\.uv\.workspace\]' -A 20 pyproject.toml
# Check if the Automodel-workspace directory exists
echo "=== Parent directory check ==="
ls -la 3rdparty/Automodel-workspace/ 2>/dev/null || echo "Parent directory does not exist"
```
Repository: NVIDIA-NeMo/RL
Length of output: 1247
The path-based editable install for nemo-automodel is broken—the referenced directory is empty and contains no pyproject.toml.
The dependency at line 165 points to 3rdparty/Automodel-workspace/Automodel, which exists but is completely empty. A valid pyproject.toml must be added to this directory before the editable install will work. Verify that the Automodel source code has been properly placed in this location. The workspace member configuration has been correctly removed from [tool.uv.workspace].
🤖 Prompt for AI Agents
In `@pyproject.toml` at line 165, The path-based editable dependency
"nemo-automodel" points to an empty directory
(3rdparty/Automodel-workspace/Automodel) and lacks a pyproject.toml, so fix by
either placing the Automodel source into that directory or updating the
dependency to the correct path; then add a valid pyproject.toml in that
directory (with project metadata and build-backend) so the editable install for
nemo-automodel succeeds and verify the package layout (package/module files)
matches the pyproject configuration.
| "transformer-engine[pytorch]==2.10.0", | ||
| "nvidia-cudnn-cu12==9.19.0.56", |
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# First, let's examine the relevant sections of pyproject.toml
echo "=== Checking mcore extra (around line 107) ==="
sed -n '100,115p' pyproject.toml
echo -e "\n=== Checking TE override (around line 238-239) ==="
sed -n '230,245p' pyproject.toml
echo -e "\n=== Checking conflicts section (around line 281-296) ==="
sed -n '275,300p' pyproject.toml
```
Repository: NVIDIA-NeMo/RL
Length of output: 2613

🏁 Script executed:
```bash
#!/bin/bash
# Search for any comments about TE version requirements or constraints
echo "=== Searching for TE version comments/requirements ==="
rg -n 'transformer.engine|TE.*version|TE.*2\.[0-9]' pyproject.toml -B 1 -A 1
echo -e "\n=== Checking if mcore is mentioned in conflict entries ==="
grep -n 'mcore\|mbridge\|automodel' pyproject.toml | head -20
```
Repository: NVIDIA-NeMo/RL
Length of output: 3086

🏁 Script executed:
```bash
#!/bin/bash
# Check for any documentation or comments about the override rationale
echo "=== Looking for override comments and rationale ==="
sed -n '230,250p' pyproject.toml
```
Repository: NVIDIA-NeMo/RL
Length of output: 1381

🏁 Script executed:
```bash
#!/bin/bash
# Check if there's any megatron-core source in workspace or requirements
echo "=== Searching for megatron-core version or TE compatibility mentions ==="
rg -i 'megatron.core|megatron-bridge' --type toml -B 2 -A 2 pyproject.toml
echo -e "\n=== Checking for any GitHub issues or comments about TE versions ==="
find . -name '*.md' -o -name '*.txt' | xargs grep -l -i 'transformer.engine.*2\.10\|megatron.*2\.8' 2>/dev/null | head -5
```
Repository: NVIDIA-NeMo/RL
Length of output: 1555

🏁 Script executed:
```bash
#!/bin/bash
# Check PR/commit context for the TE override change
echo "=== Looking for any configuration or documentation about TE/mcore compatibility ==="
find . -name '*README*' -o -name '*INSTALL*' -o -name '*requirements*' | xargs grep -l -i 'transformer.engine\|megatron' 2>/dev/null | head -5
# Also check if there's a Megatron-Bridge workspace reference
ls -la 3rdparty/ 2>/dev/null || echo "3rdparty not found or empty"
```
Repository: NVIDIA-NeMo/RL
Length of output: 659

🏁 Script executed:
```bash
#!/bin/bash
# Check the actual megatron-bridge or megatron-core package to see TE requirements
echo "=== Checking git history for the TE override change ==="
git log --oneline --all -S 'transformer-engine.*2.10.0' -- pyproject.toml 2>/dev/null | head -5
echo -e "\n=== Checking recent changes to pyproject.toml ==="
git log --oneline -20 -- pyproject.toml 2>/dev/null | head -10
```
Repository: NVIDIA-NeMo/RL
Length of output: 255
The TE 2.10.0 override comment is outdated; verify mcore compatibility.
The override comment at line 233 states "The TE override is needed because automodel/mbridge we are on is still on 2.5.0", but the override is now 2.10.0. Additionally, the global override forces TE 2.10.0 for all extras, including mcore, which explicitly pins transformer-engine[pytorch]==2.8.0 (line 107). While automodel and mcore are in conflict (preventing coinstallation), installing mcore alone will receive TE 2.10.0 due to the global override.
Either update the comment to reflect the new override version and rationale, or verify that mcore is compatible with TE 2.10.0. The comment at line 104-106 explicitly notes that mcore's TE pin "needs to be compatible with the spec in Megatron-Bridge/pyproject.toml", suggesting this should be intentional rather than an unintended side effect.
🤖 Prompt for AI Agents
In `@pyproject.toml` around lines 238 - 239, The comment about the
transformer-engine override is stale and the global override to
"transformer-engine[pytorch]==2.10.0" may unintentionally force TE 2.10.0 into
extras like mcore which pins "transformer-engine[pytorch]==2.8.0"; update the
comment to reflect the current 2.10.0 override and the rationale, or verify and
ensure mcore compatibility with TE 2.10.0 (and adjust mcore's pin or the global
override accordingly) so automodel, mcore, and Megatron-Bridge/pyproject.toml
are all consistent; search for the symbols transformer-engine[pytorch], mcore,
automodel and the Megatron-Bridge/pyproject.toml reference to locate the
relevant pins and change either the comment or the pinning strategy to resolve
the version conflict.
ℹ️ File Consistency Check
Check based on commit: d36caba (PR #1962)
✅ DTensor Policy Worker Synchronization Check
Both DTensor policy worker files were modified in this PR:
Please ensure that the changes are consistent between both files where applicable. This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.
ℹ️ File Consistency Check
Check based on commit: db6c6ae (PR #1962)
✅ DTensor Policy Worker Synchronization Check
Both DTensor policy worker files were modified in this PR:
Please ensure that the changes are consistent between both files where applicable. This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.
db6c6ae to 23b7ccb
Summary by CodeRabbit
Release Notes
New Features
Configuration Updates
Dependencies
Improvements