46 changes: 29 additions & 17 deletions examples/megatron_bridge/README.md
@@ -43,7 +43,7 @@ Once inside the container, you need to login with your HuggingFace token to down
Note that the default dataset for pruning and quantization is [`nemotron-post-training-dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2), which is gated.

```bash
-huggingface-cli login --token <your token>
+hf auth login --token <your token>
```

## Pruning
@@ -97,23 +97,35 @@ The [distill.py](distill.py) script loads student and teacher models from HuggingFace
### Data Preparation

The distillation script expects pre-tokenized data in Megatron's binary format (`.bin` / `.idx` files).
You can tokenize your JSONL dataset using the following function:

```python
from modelopt.torch.utils.plugins import megatron_preprocess_data

megatron_preprocess_data(
    input_path="/path/to/your/data.jsonl",
    output_dir="/path/to/tokenized/data",
    tokenizer_name_or_path="Qwen/Qwen3-0.6B",
    json_keys=["text"],  # change to your JSON key if needed
    workers=32,
    log_interval=100000,
    max_sequence_length=256000,  # to avoid rare OOM errors if text is too long
)
```

You can tokenize your JSONL datasets using the following command:

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --jsonl_paths /path/to/data1.jsonl /path/to/data2.jsonl ... \
    --json_keys text \
    --tokenizer Qwen/Qwen3-0.6B \
    --output_dir /path/to/tokenized/data/qwen3 \
    --workers 32 \
    --max_sequence_length 256000
```

Instead of `--jsonl_paths`, you can also pass a directory path to the `--input_dir` argument to tokenize all JSONL files in the directory.

To download and tokenize a dataset directly from the Hugging Face Hub, use the following command (log in with your HuggingFace token first to access gated datasets like the one used here):

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --hf_dataset nvidia/Nemotron-Post-Training-Dataset-v2 \
    --hf_split code \
    --json_keys text \
    --tokenizer Qwen/Qwen3-0.6B \
    --output_dir /path/to/tokenized/data/qwen3 \
    --workers 32 \
    --max_sequence_length 256000
```

If you have multiple JSONL files, you can tokenize them one by one and pass all the paths to the `--data_paths` argument.
If you skip `--hf_name`, all subsets of the dataset are downloaded and tokenized; if you skip `--hf_split`, all splits of the selected subset are.
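After either command finishes, each output shard should be a `.bin`/`.idx` pair. A quick sanity check can confirm no index file is missing (a minimal sketch; the output path is a placeholder matching the commands above):

```python
from pathlib import Path

# Placeholder: point this at the --output_dir you used above
out_dir = Path("/path/to/tokenized/data/qwen3")

bins = sorted(out_dir.glob("**/*.bin"))
for b in bins:
    idx = b.with_suffix(".idx")
    assert idx.exists(), f"missing index file for {b.name}"
print(f"found {len(bins)} .bin/.idx pairs")
```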

### Distillation with Real Data

@@ -124,7 +136,7 @@ torchrun --nnodes 1 --nproc_per_node 8 distill.py \
    --tp_size 8 \
    --teacher_hf_path Qwen/Qwen3-8B \
    --student_hf_path Qwen/Qwen3-4B \
-   --data_paths 1.0 /path/to/tokenized/data \
+   --data_paths 1.0 /path/to/tokenized/data/qwen3 \
    --data_path_to_cache /path/to/cache/dataset_indices_qwen3 \
    --seq_length 8192 \
    --mbs 1 \
8 changes: 5 additions & 3 deletions examples/megatron_bridge/distill.py
@@ -163,7 +163,7 @@ def _build_model_provider(hf_path):
    lr_warmup_iters=args.lr_warmup_iters,
    max_lr=args.lr,
    min_lr=args.min_lr,
-   adam_beta2=0.98,
+   adam_beta2=0.95,
)

# Build dataset config
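For intuition on the `adam_beta2` change in the hunk above: Adam's second moment is an exponential moving average of squared gradients with an effective averaging horizon of roughly 1/(1 - β₂) steps, so lowering 0.98 to 0.95 makes the optimizer adapt faster to recent gradient scale:

```python
# Effective averaging horizon of Adam's second-moment EMA is ~1/(1 - beta2).
for beta2 in (0.98, 0.95):
    horizon = 1.0 / (1.0 - beta2)
    print(f"beta2={beta2}: ~{round(horizon)} steps")
```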
@@ -227,7 +227,7 @@ def _build_model_provider(hf_path):
    save_interval=args.eval_interval,
    save=checkpoint_dir,
    load=checkpoint_dir,  # Resume from this directory (if exists)
-   most_recent_k=3,  # Keeps 3 most recent checkpoints (not metric-based)
+   most_recent_k=5,  # Keeps 5 most recent checkpoints (not metric-based)
    ckpt_format="torch_dist",
    async_save=True,
    fully_parallel_save=True,
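The `most_recent_k` setting above retains checkpoints by recency rather than by validation metric; the retention policy can be sketched as follows (a simplified illustration, not Megatron's actual implementation; names are hypothetical):

```python
def keep_most_recent(ckpt_names, k=5):
    """Keep the k most recent checkpoint names, assuming lexicographic
    order matches iteration order (e.g. iter_0000100 < iter_0000200)."""
    return sorted(ckpt_names)[-k:]

names = [f"iter_{i:07d}" for i in range(100, 900, 100)]  # 8 checkpoints
kept = keep_most_recent(names, k=5)
print(kept[0], kept[-1])  # iter_0000400 iter_0000800
```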
@@ -238,7 +238,9 @@ def _build_model_provider(hf_path):

    print_rank_0("\nStarting distillation...")
    distill(config)
-   print_rank_0(f"\nDistillation done! Saved checkpoint to {checkpoint_dir}\n")
+   print_rank_0(
+       f"\nDistillation done! Saved checkpoint to {checkpoint_dir} in megatron distributed checkpoint format.\n"
+   )


if __name__ == "__main__":
2 changes: 1 addition & 1 deletion examples/nemo_run/common/process_climbmix.py
@@ -67,7 +67,7 @@ def get_args():
    print("Tokenizing ClimbMix dataset...")
    input_paths = [raw_dir / name for name in subset_filenames]
    megatron_preprocess_data(
-       input_paths,
+       jsonl_paths=input_paths,
        output_dir=proc_dir,
        tokenizer_name_or_path=args.tokenizer,
        append_eod=True,
1 change: 1 addition & 0 deletions modelopt/torch/prune/plugins/mcore_minitron.py
@@ -317,6 +317,7 @@ def run_search(self) -> None:
        # Prune homogeneously
        self._prune(export_config, prune_depth=True)

+       # TODO: Rename to hybrid_layer_pattern after https://github.com/NVIDIA/Megatron-LM/pull/3377
        # Update hybrid_override_pattern if pruning is done on a hybrid model
        if isinstance(self.model, MambaModel):
            print_rank_0(f"Original hybrid_override_pattern: {self.model.hybrid_override_pattern}")
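For context on the hunk above: in hybrid (Mamba plus attention) models, `hybrid_override_pattern` is a per-layer string, so depth pruning must subset it to the surviving layers. A simplified illustration of that update, with a hypothetical pattern and hypothetical surviving indices:

```python
# Hypothetical 8-layer hybrid pattern: 'M' = Mamba, '*' = attention, '-' = MLP
pattern = "M*M-M*M-"
layers_kept = [0, 1, 4, 5]  # hypothetical surviving layer indices after pruning

# Subset the per-layer pattern to the layers that remain
new_pattern = "".join(pattern[i] for i in layers_kept)
print(new_pattern)  # M*M*
```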