Chess RL — Qwen3 + GRPO + Stockfish

Trains a Qwen3 model to play chess with reinforcement learning. Stockfish supplies the reward (centipawn loss); GRPO reinforces moves that beat the group average. The plan is to prove the method at 1.7B, then scale to 14B.
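As a rough illustration of "beat the group average", here is a minimal sketch of the group-relative advantage in the usual GRPO formulation: each sampled move's reward is centered on its group's mean and scaled by the group's standard deviation. The function name and the sample rewards are hypothetical; the trainer library used by train.py computes its own version internally.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Center each reward on the group mean and scale by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every sample got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards (negative centipawn loss) for four moves
# sampled from the same position.
rewards = [-120.0, -15.0, 0.0, -300.0]
advantages = group_advantages(rewards)
```

Moves with a reward above the group mean get a positive advantage and are reinforced; moves below it are pushed down. If every sample in a group scores the same, all advantages are zero and that group contributes no learning signal.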

Files

chess_reward.py — parses the model's move, scores it with Stockfish (the training reward). DEPTH=6.
chess_eval.py — shared eval: build a prompt, score a move vs Stockfish (EVAL_DEPTH=8). Used by training + benchmark so they match.
make_dataset.py — generates the TRAIN set (chess_positions/) and a fixed VAL set (val_fens.json).
train.py — proper 1.7B GRPO run: 400 steps, constant LR, eval on the fixed val set every 25 steps -> eval_history.json. vLLM on (TRITON_ATTN backend).
benchmark.py — before/after benchmark on the fixed val set (greedy, deterministic).
plot_eval.py — plots the learning curve from eval_history.json -> eval_curve.png.
uci_engine.py — wraps the model as a standard UCI engine (reusable for matches, fastchess, lichess-bot).
elo.py — plays full games via the UCI engine and reports Elo (vs base / random / weak Stockfish).
chess_reward_wdl.py — alternative win-probability (WDL) reward to A/B against the centipawn reward.
train_smoke.py — the original 30-step smoke test (kept for reference).
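
To make the role of chess_reward.py concrete, the sketch below shows one plausible shape for the two pure-Python steps: pulling a move out of the model's text and turning a centipawn loss into a bounded reward. It assumes the model answers in UCI notation (e.g. e2e4); the function names, the cap of 500 and the -1 penalty are hypothetical, and the Stockfish call that produces the centipawn loss is left out.

```python
import re

# UCI moves look like e2e4, or e7e8q for a promotion.
UCI_MOVE = re.compile(r"\b[a-h][1-8][a-h][1-8][qrbn]?\b")


def extract_uci_move(completion):
    """Return the last UCI-looking move in the text, or None if there is none."""
    matches = UCI_MOVE.findall(completion)
    return matches[-1] if matches else None


def reward_from_cp_loss(cp_loss, cap=500.0):
    """Map centipawn loss to [-1, 0]; None (no legal move) gets the worst reward."""
    if cp_loss is None:
        return -1.0
    clipped = min(max(cp_loss, 0.0), cap)
    return -clipped / cap
```

Taking the last match rather than the first is a design choice: if the model reasons about candidate moves before answering, the final move mentioned is more likely to be its answer. Clipping the loss keeps a single blunder from dominating the group statistics.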

One-time environment setup (vLLM fix)

With vLLM 0.11.0, the default FlashInfer attention backend fails an assertion (_sm_scale) when unsloth enables LoRA, so uninstall FlashInfer once per pod:

pip uninstall -y flashinfer-python flashinfer

train.py and benchmark.py set VLLM_ATTENTION_BACKEND=TRITON_ATTN at the top, so vLLM uses the Triton backend (bundled, no extra install) instead.
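
For a new script that also uses vLLM, the same override can be reproduced as below. This is a sketch of the pattern rather than a copy of train.py; the only assumption is that the variable is set before vLLM is imported, which is the safe ordering.

```python
import os

# Select the Triton attention backend before any vLLM import happens.
os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN"

# Import unsloth / vllm only after this point.
```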

Run order (on the H100 pod)

python make_dataset.py                          # train set + fixed val_fens.json
python benchmark.py Qwen/Qwen3-1.7B             # baseline on the fixed val set
python train.py                                 # ~30-60 min; prints [eval @ step N] every 25 steps
python plot_eval.py                             # -> eval_curve.png
python benchmark.py ./qwen3-1.7b-chess-merged   # trained, directly comparable to baseline
runpodctl send eval_curve.png                   # pull the curve to your Mac

What success looks like

On the held-out curve (eval_curve.png) and the before/after benchmark, all measured on the SAME fixed positions every time:

top-1 match rises, avg centipawn loss falls, legal-move rate trends toward ~100%.
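
The three numbers above can be aggregated from per-position results roughly as follows. This is a sketch, not the code in chess_eval.py: the MoveResult record and summarize function are hypothetical, and excluding illegal moves from the centipawn average (instead of assigning them a penalty) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MoveResult:
    legal: bool                # did the model produce a legal move?
    matches_best: bool         # same move as Stockfish's first choice?
    cp_loss: Optional[float]   # centipawn loss; None when the move is illegal


def summarize(results):
    """Aggregate legal-move rate, top-1 match and average centipawn loss."""
    n = len(results)
    legal = [r for r in results if r.legal]
    return {
        "legal_rate": len(legal) / n,
        "top1_match": sum(r.matches_best for r in legal) / n,
        "avg_cp_loss": sum(r.cp_loss for r in legal) / len(legal) if legal else None,
    }


# Hypothetical results for four validation positions.
summary = summarize([
    MoveResult(True, True, 0.0),
    MoveResult(True, False, 50.0),
    MoveResult(True, False, 100.0),
    MoveResult(False, False, None),
])
```

Because illegal moves are left out of the average here, avg centipawn loss should be read together with the legal-move rate: a model that rarely produces a legal move could otherwise look deceptively accurate.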

Measuring Elo

uci_engine.py exposes the model through the standard UCI protocol, so the whole computer-chess toolchain works with it. elo.py runs a gauntlet and reports Elo:

python elo.py ./qwen3-1.7b-chess-merged    # vs base (relative gain), random, weak Stockfish

"vs base" is the headline RL-gain number (relative Elo, no calibration needed).
Tune game count with ELO_GAMES=20 python elo.py ... (more games = tighter CI, slower).
Every game is written to elo_games.pgn. Feed it to ordo or bayeselo for official ratings with confidence intervals, or plug uci_engine.py into lichess-bot for a real Lichess rating once the model is strong enough (above roughly 1320).
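
For reference, a relative Elo number follows from a match score through the standard logistic Elo model. The sketch below shows that conversion; whether elo.py uses exactly this formula is an assumption, and the function name is hypothetical.

```python
import math


def elo_diff(wins, draws, losses):
    """Elo difference implied by a match score under the logistic Elo model."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    if score <= 0.0:
        return -math.inf   # lost every game: difference is unbounded
    if score >= 1.0:
        return math.inf    # won every game: difference is unbounded
    return -400.0 * math.log10(1.0 / score - 1.0)
```

A 50% score maps to 0 Elo and a 75% score to roughly +191. With only 20 games a single result shifts the score by 2.5 to 5 percentage points, which is why a larger ELO_GAMES value gives a tighter interval.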

A/B the reward (centipawn vs WDL)

The literature favors win-probability (WDL) over centipawn as the strength target. To try it, point train.py at the WDL reward and compare Elo before deciding:

from chess_reward_wdl import chess_reward_wdl   # in train.py
...
reward_funcs = [chess_reward_wdl],

Run both versions, then python elo.py on each merged model — let Elo pick the winner.
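
To see why the two rewards can disagree, the sketch below converts centipawn evaluations to a win probability with a logistic curve and measures a move's cost as the drop in win probability. The scale constant is borrowed from Lichess's published win% formula and is an assumption here; chess_reward_wdl.py may use a different mapping (for example Stockfish's own WDL output), and both function names are hypothetical.

```python
import math


def win_prob(cp, scale=0.00368208):
    """Logistic map from a centipawn evaluation to a win probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-scale * cp))


def wdl_loss(best_cp, played_cp):
    """Win probability given up by playing a move worse than the best one."""
    return max(0.0, win_prob(best_cp) - win_prob(played_cp))


# The same 100-centipawn mistake in two different positions.
loss_when_equal = wdl_loss(0, -100)      # about 0.09
loss_when_winning = wdl_loss(800, 700)   # about 0.02
```

A centipawn reward treats both mistakes identically, while a win-probability reward penalizes the one in the balanced position several times more, since it changes the likely result of the game far more.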

Scaling to 14B (after the 1.7B run proves the method)

In train.py: apply the [14B] lines (Qwen/Qwen3-14B, load_in_4bit=True). vLLM is already on; the same flashinfer-uninstall + TRITON_ATTN fix applies. Lower gpu_memory_utilization if you hit OOM.
In make_dataset.py: remove /no_think, bump build_train(n=...).
In chess_reward.py: set DEPTH=10.
