Add Docker support for GPU selfplay training #163
- Dockerfile: 3-stage build (compile lc0 with CUDA, download client binary, minimal runtime)
- .dockerignore: Minimal exclusions
- .gitattributes: Enforce LF line endings for scripts/configs
- README.md: Docker quick-start section

Co-Authored-By: Claude <noreply@anthropic.com>
Pull request overview
This PR adds Docker support for running the LeelaChessZero selfplay training client with NVIDIA GPU acceleration. It implements a three-stage Docker build that compiles lc0 with CUDA support, downloads the training client binary, and packages everything into a minimal runtime image.
Changes:
- Added Dockerfile with multi-stage build for GPU-enabled training client
- Added .dockerignore for build optimization
- Added Docker documentation section to README with build, test, and run examples
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| Dockerfile | Three-stage build: compiles lc0 with CUDA, downloads client binary, creates runtime image with non-root user and /data persistence mount |
| .dockerignore | Excludes development files (.git, .github, docs, scripts, tests) from Docker build context |
| README.md | Adds comprehensive Docker usage documentation including build, test, and runtime examples with GPU support |
    Test GPU access:
    docker run --rm --gpus all nvidia/cuda:12.9.1-base-ubuntu22.04 nvidia-smi
This test command runs nvidia-smi on a different image (nvidia/cuda:12.9.1-base-ubuntu22.04) rather than testing the GPU access in the actual built image (lc0-training-client:test). This doesn't verify that the lc0-training-client image has proper GPU access. Consider changing this to test the actual built image, for example: docker run --rm --gpus all lc0-training-client:test nvidia-smi (though nvidia-smi may need to be installed in the runtime image for this to work), or add a more meaningful test that verifies lc0 can access the GPU.
Suggested change:
    - docker run --rm --gpus all nvidia/cuda:12.9.1-base-ubuntu22.04 nvidia-smi
    + docker run --rm --gpus all lc0-training-client:test nvidia-smi
    .git
    .github
    *.md
    !README.md
The .dockerignore file includes README.md (!README.md on line 4) after excluding all markdown files (*.md on line 3), but the Dockerfile doesn't appear to copy README.md or any other files from the build context. This inclusion is unnecessary and could be removed for clarity.
Suggested change:
    - !README.md
    WORKDIR /build

    # Use a real release tag, not a branch name.
    ARG LC0_VERSION=v0.32.1
ARG LC0_VERSION is declared in the final runtime stage (line 112) but used in the engine_builder stage (line 29). In Docker multi-stage builds, an ARG is scoped to the stage that declares it and does not carry across stages. Either re-declare the ARG at the top of the engine_builder stage, or declare it globally before the first FROM and add a bare ARG LC0_VERSION inside each stage that uses it (a global ARG on its own is visible only to FROM lines). As written, line 29 expands LC0_VERSION to an empty string, even when it is passed with --build-arg.
    curl ca-certificates \
    && rm -rf /var/lib/apt/lists/*

    ARG CLIENT_VERSION=v34
ARG CLIENT_VERSION is declared in the final runtime stage (line 113) but used in the client_downloader stage (line 64). In Docker multi-stage builds, an ARG is scoped to the stage that declares it and does not carry across stages. Either re-declare the ARG at the top of the client_downloader stage, or declare it globally before the first FROM and add a bare ARG CLIENT_VERSION inside each stage that uses it (a global ARG on its own is visible only to FROM lines). As written, line 64 expands CLIENT_VERSION to an empty string, even when it is passed with --build-arg.
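A minimal sketch of the ARG scoping described in the two comments above. It is not the PR's actual Dockerfile: the `ubuntu:22.04` base images and the `echo` commands are placeholders chosen for illustration, while the stage names and default versions are taken from the review comments and diff hunks.

```dockerfile
# Global defaults, declared before the first FROM.
# On their own these are visible only to FROM lines.
ARG LC0_VERSION=v0.32.1
ARG CLIENT_VERSION=v34

# Placeholder base image (assumption, not the PR's real builder image).
FROM ubuntu:22.04 AS engine_builder
# A bare re-declaration brings the global default (or a --build-arg
# override) into this stage so RUN can expand it.
ARG LC0_VERSION
RUN echo "building lc0 ${LC0_VERSION}"

# Placeholder base image (assumption).
FROM ubuntu:22.04 AS client_downloader
ARG CLIENT_VERSION
RUN echo "downloading client ${CLIENT_VERSION}"
```

With this layout, `docker build --build-arg CLIENT_VERSION=v34 .` should override the default in every stage that re-declares the ARG, and stages that omit the re-declaration would still see an empty value.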
    WORKDIR /download

    # Download client binary
    # NOTE: Add checksum verification later if/when upstream publishes checksums.
    RUN curl -fSL --retry 3 --retry-delay 2 -o client \
    "https://github.com/LeelaChessZero/lczero-client/releases/download/${CLIENT_VERSION}/lc0-training-client-linux" \
The binary is downloaded from GitHub releases without checksum verification. While acknowledged in the comment, this creates a security risk as the downloaded binary could be tampered with or corrupted. Consider adding checksum verification or using a specific commit hash to ensure integrity of the downloaded binary.
Suggested change:
    - WORKDIR /download
    - # Download client binary
    - # NOTE: Add checksum verification later if/when upstream publishes checksums.
    - RUN curl -fSL --retry 3 --retry-delay 2 -o client \
    - "https://github.com/LeelaChessZero/lczero-client/releases/download/${CLIENT_VERSION}/lc0-training-client-linux" \
    + ARG CLIENT_SHA256=""
    + WORKDIR /download
    + # Download client binary (optionally verify checksum if CLIENT_SHA256 is set)
    + RUN curl -fSL --retry 3 --retry-delay 2 -o client \
    + "https://github.com/LeelaChessZero/lczero-client/releases/download/${CLIENT_VERSION}/lc0-training-client-linux" \
    + && if [ -n "${CLIENT_SHA256}" ]; then \
    + echo "${CLIENT_SHA256}  client" | sha256sum -c -; \
    + else \
    + echo "WARNING: CLIENT_SHA256 not set; skipping checksum verification." >&2; \
    + fi \
Summary
Adds Docker support for running the selfplay training client with NVIDIA GPUs.
What changed
--entrypoint engine test
Why
Official GHCR image path for farms. Lays groundwork for auto-update story (follow-up PRs).
How to test
    docker build -t lc0-training-client:test .
    docker run --rm lc0-training-client:test --help
    docker run --rm --entrypoint /app/lc0 lc0-training-client:test --help
    docker run --rm --gpus all nvidia/cuda:12.9.1-base-ubuntu22.04 nvidia-smi
Local validation
--help verified
Scope
PR 1 of 5; follow-ups ready after this lands.