Crucible

GRPO reasoning trainer. Take a small instruct model and improve its math reasoning with reinforcement learning against a verifiable reward, the "R1-Zero on a laptop" recipe, sized to run on a 48GB Mac via MPS.

The reward is a function, not a learned reward model: parse the model's final answer and check it against the gold answer. Because the check is deterministic, runs are reproducible and there is no learned reward model for the policy to exploit. GSM8K (grade-school math) is the starting domain.
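As a rough illustration of what a function-based reward looks like, here is a minimal sketch. It assumes the gold solution ends with GSM8K's `#### <number>` line and that the model's final answer is the last number in its completion. The names `parse_gold`, `extract_answer`, and `correctness_reward` are hypothetical and not necessarily what rewards.py or data.py use.

```python
import re

# Matches integers and decimals, with optional sign and thousands separators.
NUM_RE = re.compile(r"-?\d[\d,]*\.?\d*")


def normalize(num: str) -> str:
    """Canonicalize a numeric string so "1,000.0" and "1000" compare equal."""
    num = num.replace(",", "").rstrip(".")
    if "." in num:
        num = num.rstrip("0").rstrip(".")
    return num


def parse_gold(solution: str):
    """Return the number after the last '####' marker, or None if absent."""
    _, sep, tail = solution.rpartition("####")
    if not sep:
        return None
    match = NUM_RE.search(tail)
    return normalize(match.group()) if match else None


def extract_answer(completion: str):
    """Assumption: the final answer is the last number in the completion."""
    matches = NUM_RE.findall(completion)
    return normalize(matches[-1]) if matches else None


def correctness_reward(completion: str, gold_solution: str) -> float:
    """1.0 on an exact match of normalized answers, else 0.0."""
    gold = parse_gold(gold_solution)
    pred = extract_answer(completion)
    return 1.0 if gold is not None and pred == gold else 0.0
```

A binary reward like this is deterministic for a given completion, which is what makes it usable as a verifier. A real implementation would likely also handle a dedicated answer tag and a separate format reward.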

See docs/NOTES.md for the GRPO math and docs/ROADMAP.md for where this goes.

Layout

src/crucible/
  config.py    presets (smoke, gsm8k_0p5b, gsm8k_1p5b) and CLI overrides
  data.py      GSM8K loading, prompt template, gold-answer parsing
  rewards.py   answer extraction, correctness + format rewards (verifier)
  model.py     load policy + frozen reference, group sampling
  grpo.py      group-relative advantage, clipped objective, k3 KL
  train.py     rollout, loss, optimizer loop, periodic eval
  eval.py      greedy exact-match accuracy
tests/         verifier and GRPO-math checks, no model download
docs/          NOTES (the math), ROADMAP
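
The three pieces listed for grpo.py can be sketched in plain Python on scalar values. This is a hedged illustration of the standard formulas, not the repository's code: the function names, the use of population standard deviation, and the default `clip_eps` and `eps` values are assumptions. docs/NOTES.md remains the reference for the math.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: standardize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids div by 0
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one token (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


def k3_kl(logp_policy, logp_ref):
    """k3 estimator of KL(policy || ref): exp(d) - d - 1, d = logp_ref - logp_policy."""
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0


# Four samples for one prompt: two correct (reward 1), two wrong (reward 0).
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

If every sample in a group gets the same reward, the advantages are all zero and that prompt contributes no policy gradient, which is why group size and task difficulty matter for this recipe.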

Setup

uv sync                 # core deps (torch, transformers, datasets)
uv run pytest           # verify the verifier and GRPO math, fast, no model

Run

# tiny end-to-end wiring check on a 0.5B model
uv run python -m crucible.train --preset smoke

# the real starter run
uv run python -m crucible.train --preset gsm8k_0p5b

The smoke run downloads Qwen2.5-0.5B-Instruct and GSM8K on first use, then runs two steps so you can confirm the loop trains before committing to a full run. train.py prints a baseline eval accuracy at step 0, then reward and eval accuracy as it goes.

For a full overnight run on MPS, use the supervisor, which checkpoints and auto-resumes to work around MPS memory growth (see docs/MPS_LEAK.md):

scripts/supervise.sh gsm8k_0p5b        # resumes from runs/<preset>/checkpoint.pt
uv run python -m crucible.train --preset gsm8k_0p5b --resume   # one-off resume
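
The checkpoint-and-resume pattern the supervisor relies on can be sketched as a restart loop. This is an assumption about the general shape, not a transcription of scripts/supervise.sh: the functions `build_command` and `supervise`, the `max_restarts` limit, and the "exit code 0 means finished" convention are all hypothetical. Only the `--preset` and `--resume` flags and the runs/<preset>/checkpoint.pt path come from this README.

```python
import subprocess
from pathlib import Path


def build_command(preset, runs_dir="runs"):
    """Build the training command, adding --resume once a checkpoint exists."""
    cmd = ["uv", "run", "python", "-m", "crucible.train", "--preset", preset]
    if (Path(runs_dir) / preset / "checkpoint.pt").exists():
        cmd.append("--resume")
    return cmd


def supervise(preset, runs_dir="runs", max_restarts=20, run=subprocess.call):
    """Re-launch training until it exits cleanly; return the restart count.

    Assumption: a nonzero exit code means the process died (for example
    from memory pressure) and can be resumed from its last checkpoint.
    """
    for attempt in range(max_restarts + 1):
        if run(build_command(preset, runs_dir)) == 0:
            return attempt
    raise RuntimeError(f"{preset}: still failing after {max_restarts} restarts")
```

Restarting the whole process releases any memory the previous one accumulated, so progress is bounded only by how often checkpoints are written.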

Results

Phase 1 (Qwen2.5-0.5B-Instruct, GSM8K, 400 steps on a 48GB M5 Pro):

metric                 value
baseline eval acc      0.125
final eval acc         0.438
relative improvement   3.5×

Eval accuracy rose from 0.125 to 0.438, about 3.5×; KL stayed stable throughout. Full write-up and caveats in docs/PHASE1.md.

Notes

Optional LoRA for the 1.5B preset: uv sync --extra lora.
MPS fallback for unsupported ops is enabled in train.py.
Long runs checkpoint every eval and can --resume; the MPS allocator leak and its workaround are documented in docs/MPS_LEAK.md.
This is standalone. Hardware deployment and an inference engine are separate projects, not dependencies.
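
For reference, the MPS fallback noted above is usually switched on through PyTorch's `PYTORCH_ENABLE_MPS_FALLBACK` environment variable. The snippet below is a minimal sketch of that pattern, not a copy of train.py. The helper `pick_device` is hypothetical, and setting the variable before importing torch follows the commonly recommended ordering.

```python
import os

# Ask PyTorch to run ops that lack an MPS kernel on the CPU instead of failing.
# setdefault keeps any value the user already exported in their shell.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def pick_device() -> str:
    """Return "mps" when the Metal backend is usable, otherwise "cpu"."""
    try:
        import torch  # imported after the environment variable is set
    except ImportError:
        return "cpu"
    return "mps" if torch.backends.mps.is_available() else "cpu"
```

Ops that fall back run on the CPU and can be noticeably slower, so a run that leans heavily on the fallback may need profiling.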
