Reinforcement-learning a Qwen3 model to play chess. Stockfish supplies the reward (centipawn loss); GRPO reinforces moves that beat the group average. Prove the method at 1.7B, then scale to 14B.
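To make "reinforces moves that beat the group average" concrete, here is a minimal sketch of a group-relative advantage computation. It assumes the common GRPO formulation (reward minus the group mean, divided by the group standard deviation). The function name and the sample rewards are hypothetical, and the trainer used by train.py may normalize differently.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples for the same position."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for 4 moves sampled from one position
# (negative centipawn loss, so higher is better).
rewards = [-10.0, -250.0, -40.0, -100.0]
advantages = group_relative_advantages(rewards)
```

Moves with a positive advantage are pushed up and moves with a negative advantage are pushed down, so only the ranking within the group matters, not the absolute reward scale.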
- chess_reward.py: parses the model's move and scores it with Stockfish (the training reward). DEPTH=6.
- chess_eval.py: shared eval. Builds a prompt and scores a move vs Stockfish (EVAL_DEPTH=8). Used by training + benchmark so they match.
- make_dataset.py: generates the TRAIN set (chess_positions/) and a fixed VAL set (val_fens.json).
- train.py: proper 1.7B GRPO run. 400 steps, constant LR, eval on the fixed val set every 25 steps -> eval_history.json. vLLM on (TRITON_ATTN backend).
- benchmark.py: before/after benchmark on the fixed val set (greedy, deterministic).
- plot_eval.py: plots the learning curve from eval_history.json -> eval_curve.png.
- uci_engine.py: wraps the model as a standard UCI engine (reusable for matches, fastchess, lichess-bot).
- elo.py: plays full games via the UCI engine and reports Elo (vs base / random / weak Stockfish).
- chess_reward_wdl.py: alternative win-probability (WDL) reward to A/B against the centipawn reward.
- train_smoke.py: the original 30-step smoke test (kept for reference).
vLLM 0.11.0's default FlashInfer backend asserts (_sm_scale) when unsloth enables LoRA,
so remove it once per pod:
pip uninstall -y flashinfer-python flashinfer

train.py and benchmark.py set VLLM_ATTENTION_BACKEND=TRITON_ATTN at the top, so vLLM
uses the Triton backend (bundled, no extra install) instead.
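As a rough sketch of what "set at the top" looks like, the variable is placed in the environment before anything that initializes vLLM runs. The exact placement inside train.py and benchmark.py is an assumption here; only the variable name and value come from the text above.

```python
import os

# Select the Triton attention backend before vLLM is imported or initialized,
# so the choice is already in place when the engine starts.
os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN"

# Imports that pull in vLLM (directly or via unsloth) would follow here.
```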
python make_dataset.py # train set + fixed val_fens.json
python benchmark.py Qwen/Qwen3-1.7B # baseline on the fixed val set
python train.py # ~30-60 min; prints [eval @ step N] every 25 steps
python plot_eval.py # -> eval_curve.png
python benchmark.py ./qwen3-1.7b-chess-merged # trained, directly comparable to baseline
runpodctl send eval_curve.png # pull the curve to your Mac

On the held-out curve (eval_curve.png) and the before/after benchmark, all measured
on the SAME fixed positions every time, look for:
- top-1 match rises, avg centipawn loss falls, legal-move rate trends toward ~100%.
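The three metrics can be sketched as simple aggregates over per-position results. The record fields (legal, move, best_move, cp_loss) and the choice to average centipawn loss over legal moves only are assumptions for illustration; chess_eval.py may define them differently.

```python
def summarize(results):
    """Aggregate per-position eval records into the three headline metrics."""
    n = len(results)
    legal = [r for r in results if r["legal"]]
    top1 = sum(1 for r in legal if r["move"] == r["best_move"])
    # Assumption: illegal moves are excluded from the centipawn average.
    avg_cp_loss = sum(r["cp_loss"] for r in legal) / len(legal) if legal else float("nan")
    return {
        "legal_rate": len(legal) / n,
        "top1_match": top1 / n,
        "avg_cp_loss": avg_cp_loss,
    }


# Hypothetical results for four validation positions.
results = [
    {"legal": True, "move": "e2e4", "best_move": "e2e4", "cp_loss": 0},
    {"legal": True, "move": "g1f3", "best_move": "d2d4", "cp_loss": 30},
    {"legal": True, "move": "a2a3", "best_move": "e2e4", "cp_loss": 90},
    {"legal": False, "move": "e2e5", "best_move": "e2e4", "cp_loss": None},
]
summary = summarize(results)
```

Because the positions are fixed, any change in these numbers between checkpoints reflects the model rather than the sample.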
uci_engine.py exposes the model through the standard UCI protocol, so the whole
computer-chess toolchain works with it. elo.py runs a gauntlet and reports Elo:
python elo.py ./qwen3-1.7b-chess-merged # vs base (relative gain), random, weak Stockfish

- "vs base" is the headline RL-gain number (relative Elo, no calibration needed).
- Tune game count with ELO_GAMES=20 python elo.py ... (more games = tighter CI, slower).
- Every game is written to elo_games.pgn. Feed it to ordo or bayeselo for official ratings with confidence intervals, or plug uci_engine.py into lichess-bot for a real Lichess rating once the model is strong enough (>~1320).
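For intuition about the relative Elo number, the standard logistic Elo model converts a match score into a rating difference. This is a minimal sketch of that textbook formula, not necessarily how elo.py computes its report; the function name is hypothetical.

```python
import math


def elo_diff(wins, draws, losses):
    """Elo difference implied by a match result under the logistic Elo model."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    if score <= 0.0 or score >= 1.0:
        # A 0% or 100% score implies an unbounded rating gap.
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


even = elo_diff(10, 0, 10)   # 50% score
ahead = elo_diff(15, 0, 5)   # 75% score
```

A 75% score corresponds to roughly +191 Elo. With only 20 games a few results either way move the estimate a lot, which is why more games give a tighter confidence interval.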
The literature favors win-probability (WDL) over centipawn as the strength target.
To try it, point train.py at the WDL reward and compare Elo before deciding:
from chess_reward_wdl import chess_reward_wdl # in train.py
...
reward_funcs = [chess_reward_wdl],

Run both versions, then run python elo.py on each merged model and let Elo pick the winner.
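To show why a win-probability reward can behave differently from raw centipawns, here is a sketch that maps an evaluation to a win probability with a logistic curve and rewards the change in that probability. The function names are hypothetical, the constant is the one Lichess publishes for its win-percentage formula, and chess_reward_wdl.py may use a different mapping (for example Stockfish's own WDL output).

```python
import math


def win_probability(cp, k=0.00368208):
    """Logistic map from a centipawn eval (side to move) to a win probability."""
    return 1.0 / (1.0 + math.exp(-k * cp))


def wdl_reward(cp_best, cp_played):
    """Win probability lost by the played move relative to the best move (<= 0)."""
    return win_probability(cp_played) - win_probability(cp_best)


# The same 100 cp loss costs far less when the position is already winning.
loss_when_winning = wdl_reward(1000, 900)
loss_when_equal = wdl_reward(50, -50)
```

Under this mapping a 100 cp slip in an equal position is penalized several times more than the same slip at +10, whereas a centipawn reward treats both identically.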
- In train.py: apply the [14B] lines (Qwen/Qwen3-14B, load_in_4bit=True). vLLM is already on; the same flashinfer-uninstall + TRITON_ATTN fix applies. Lower gpu_memory_utilization if you hit OOM.
- In make_dataset.py: remove /no_think, bump build_train(n=...).
- In chess_reward.py: set DEPTH=10.