Reproduce the DeepSeek R1 "aha moment" on a single GPU using GRPO (Group Relative Policy Optimization).
A base pretrained model (Qwen2.5-3B) learns chain-of-thought reasoning through reinforcement learning alone — no supervised fine-tuning, no human-written reasoning traces. The model discovers `<think>...</think><answer>...</answer>` formatting and step-by-step math reasoning purely from reward signals.
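The "group relative" part of GRPO is simple to state: sample several completions per prompt, score each one with the reward functions, and normalize each reward against the group's own statistics instead of a learned value baseline. A minimal sketch of that normalization (standard library only, not the trainer's actual code):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each completion's reward against its group's mean/std.
    This replaces a learned value baseline: no critic network needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard: all-equal rewards → zero std
    return [(r - mean) / std for r in rewards]
```

Completions that beat their group's average get a positive advantage and are reinforced; the rest are pushed down.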
- Single NVIDIA GPU with >= 16 GB VRAM (tested on RTX 5090 32 GB)
- ~4 hours for 1000 training steps (200+ steps, reachable in under 1 hour, is already enough for good results)
```bash
# Install system dependencies (CUDA toolkit, gcc-12)
chmod +x install_deps.sh
./install_deps.sh

# Create venv and install Python packages
python3 -m venv venv
source venv/bin/activate
pip install -r requirements_pretrained.txt
```
```bash
# Train (default: RTX 5090)
python train_pretrained.py

# Or specify a config for your GPU
python train_pretrained.py --config configs/4090.yaml

# Resume from latest checkpoint
python train_pretrained.py --resume

# Resume from a specific checkpoint
python train_pretrained.py --resume outputs_pretrained/checkpoint-200

# Interactive Q&A with the trained model
python train_pretrained.py --play

# Use a specific LoRA adapter
python train_pretrained.py --play path/to/lora
```

- Loads Qwen2.5-3B (base pretrained, no instruction tuning)
- Applies QLoRA (rank 64) for parameter-efficient training
- Trains with GRPO on GSM8K math problems using five reward signals:
- correctness_reward (2.0) — extracted answer matches ground truth
- int_reward (0.5) — answer is a valid integer
- strict_format_reward (0.5) — exact `<think>\n...\n</think>\n<answer>\n...\n</answer>` format
- soft_format_reward (0.5) — loose tag matching
- xmlcount_reward (up to 0.5) — partial credit for individual tags
- Uses vLLM for fast generation, PyTorch for gradient computation
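The extraction-and-scoring logic behind these rewards can be sketched in a few lines. The function names and exact regexes below are illustrative, assumed from the reward descriptions above, not the script's actual implementation:

```python
import re

def extract_answer(text: str) -> str:
    """Pull the contents of the <answer>...</answer> block, if any."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def correctness_reward(completion: str, ground_truth: str) -> float:
    # 2.0 when the extracted answer matches the ground truth exactly
    return 2.0 if extract_answer(completion) == ground_truth else 0.0

def int_reward(completion: str) -> float:
    # 0.5 when the extracted answer parses as an integer
    return 0.5 if re.fullmatch(r"-?\d+", extract_answer(completion)) else 0.0

def strict_format_reward(completion: str) -> float:
    # 0.5 for the exact <think>\n...\n</think>\n<answer>\n...\n</answer> layout
    pattern = r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>\s*$"
    return 0.5 if re.match(pattern, completion, re.DOTALL) else 0.0
```

Because the rewards are additive, a completion that is well-formatted but wrong still earns partial credit, which is what lets format "lock in" before accuracy does.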
```
You: Victor is competing in the CrossFit Open workout 24.1, which has a 15-minute time cap.
He completes burpees over the bar at a rate of 8 per minute for the first 5 minutes, then
slows to 5 per minute for the next 6 minutes, and finishes the last 4 minutes at 3 per
minute. How many total burpees over the bar did Victor complete?

Model:
<think>
Victor completes burpees over the bar at a rate of 8 per minute for the first 5 minutes.
So, the number of burpees he completes in the first 5 minutes is:
8 * 5 = 40 burpees
Then, he slows down to 5 per minute for the next 6 minutes:
5 * 6 = 30 burpees
Finally, he finishes the last 4 minutes at 3 per minute:
3 * 4 = 12 burpees
Therefore, the total number of burpees Victor completes is:
40 + 30 + 12 = 82 burpees
</think>
<answer>82</answer>
```
This question is not in the GSM8K training set — the model learned to reason step-by-step purely from reward signals.
| Steps | What Happens |
|---|---|
| 0-50 | Gibberish, random tokens, no structure |
| 50-100 | Model discovers `<think>`/`<answer>` tags, attempts basic math |
| 100-200 | "Aha moment" — format locks in, correct answers with coherent chain-of-thought start appearing |
| 200-400 | Rapid improvement — correctness reward climbs, reasoning becomes consistent |
| 400+ | Near-perfect format, high accuracy on novel questions outside training set |
| Split | Accuracy |
|---|---|
| GSM8K test (1319) | 75.7% |
| GSM8K train (7473) | 83.2% |
For reference, Qwen2.5-3B-Instruct (fully supervised) scores ~79% on GSM8K. With more training (1000 steps) expect ~78-82% test accuracy. To go higher, use a larger base model (7B+).
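Scoring a run against GSM8K reduces to comparing the model's `<answer>` block with the dataset's gold answer, which follows the final `#### ` marker in each GSM8K answer field. A hedged sketch (helper names are made up for illustration):

```python
import re

def model_answer(completion: str) -> str:
    """Extract the model's <answer>...</answer> contents, or "" if absent."""
    m = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    return m.group(1).strip() if m else ""

def gold_answer(gsm8k_answer_field: str) -> str:
    # GSM8K gold answers end with "#### <number>"
    return gsm8k_answer_field.split("####")[-1].strip()

def accuracy(completions: list[str], answer_fields: list[str]) -> float:
    hits = sum(model_answer(c) == gold_answer(a)
               for c, a in zip(completions, answer_fields))
    return hits / len(completions)
```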
Building packages like vllm, bitsandbytes, or flash-attn from source requires a working CUDA toolkit. Here are common issues and fixes:
The default cuda-toolkit from Ubuntu repos is often too old. Install directly from NVIDIA:
```bash
# Add NVIDIA repo (Ubuntu 24.04 example)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-8
```

Then set the environment:

```bash
export CUDA_HOME=/usr/local/cuda-12.8
export PATH=/usr/local/cuda-12.8/bin:/usr/local/cuda-12.8/nvvm/bin:$PATH
```

RTX 5090 (Blackwell, sm_120a) requires CUDA 12.8+. Older toolkit versions don't recognize this architecture. Solution: install cuda-toolkit-12-8 or newer.
CUDA 12.x requires gcc <= 12. If your system has gcc 13+:

```bash
sudo apt install -y gcc-12 g++-12
export NVCC_PREPEND_FLAGS="--compiler-bindir=/usr/bin/gcc-12"
```

The CUDA intermediate compiler (`cicc`) lives in a non-standard path. Add it:

```bash
export PATH=/usr/local/cuda-12.8/nvvm/bin:$PATH
```

Unsloth's compiled UnslothGRPOTrainer has a bug that crashes training with an error like:
```
RuntimeError: The size of tensor a (793) must match the size of tensor b (768)
at non-singleton dimension 1
```
or similar with numbers like (537) vs (512). The difference between the two numbers equals the left-padding size (typically ~25 tokens, roughly the system prompt length).
Root cause: In compute_loss, the function grpo_accumulated_loss returns coef_1 with shape [batch, logits_to_keep + max_left_pad], but the metrics section uses completion_mask with shape [batch, logits_to_keep]. The max_left_pad extra columns come from left-padding alignment for variable-length prompts in the batch.
The crash happens at:

```python
# Inside UnslothGRPOTrainer.compute_loss (unsloth_compiled_cache/UnslothGRPOTrainer.py)
def masked_batch_mean(x):
    return (x * completion_mask).sum() / completion_token_count
    #  ^ coef_1 has 793 cols   ^ completion_mask has 768 cols → RuntimeError
```

Fix: The training script includes an automatic patch that inserts `coef_1 = coef_1[:, -completion_mask.shape[1]:]` before both `masked_batch_mean` definitions. The patch runs after `PatchFastRL` writes the compiled file, then reloads the module and re-binds the class in `trl`.
Key detail for anyone trying to patch this manually: `PatchFastRL` stores the module as `"UnslothGRPOTrainer"` in `sys.modules` (not `"unsloth_compiled_cache.UnslothGRPOTrainer"`). If you use the wrong name, `importlib.reload` silently does nothing and the fix never takes effect.
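For reference, the whole patch can be sketched as a source rewrite plus a reload. The file path, insertion point, and re-bind target below are assumptions pieced together from the description above, not a drop-in copy of the script's patch:

```python
import importlib
import sys

def insert_coef_trim(source: str) -> str:
    """Insert the coef_1 trim immediately before each masked_batch_mean
    definition, matching that line's indentation."""
    out = []
    for line in source.splitlines(keepends=True):
        if "def masked_batch_mean(" in line:
            indent = line[: len(line) - len(line.lstrip())]
            out.append(indent + "coef_1 = coef_1[:, -completion_mask.shape[1]:]\n")
        out.append(line)
    return "".join(out)

def apply_patch(path="unsloth_compiled_cache/UnslothGRPOTrainer.py"):
    # Assumed path: wherever PatchFastRL wrote the compiled trainer.
    with open(path) as f:
        patched = insert_coef_trim(f.read())
    with open(path, "w") as f:
        f.write(patched)
    # PatchFastRL registers the module under "UnslothGRPOTrainer",
    # not "unsloth_compiled_cache.UnslothGRPOTrainer".
    module = importlib.reload(sys.modules["UnslothGRPOTrainer"])
    import trl
    trl.GRPOTrainer = module.UnslothGRPOTrainer  # re-bind the patched class
```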
```bash
tensorboard --logdir outputs_pretrained/logs
```

- `train_pretrained.py` — main training script
- `configs/5090.yaml` — config for RTX 5090 (32 GB): lora_rank=64, num_gen=8
- `configs/4090.yaml` — config for RTX 4090 (24 GB): lora_rank=32, num_gen=6 (not tested yet)
- `requirements_pretrained.txt` — Python dependencies
- `install_deps.sh` — system-level CUDA toolkit and gcc setup