Add Docker support for GPU selfplay training #163

Open
iftikharramnandan wants to merge 1 commit into LeelaChessZero:release from iftikharramnandan:pr1

@iftikharramnandan

Summary

Adds Docker support for running the selfplay training client with NVIDIA GPUs.

What changed

  • Dockerfile: 3-stage build (compile lc0 with CUDA, download client binary, minimal runtime)
  • .dockerignore: Minimal exclusions
  • README.md: Docker quick-start section with correct --entrypoint engine test

Why

Gives farms an official GHCR image path and lays the groundwork for the auto-update story (follow-up PRs).

How to test

docker build -t lc0-training-client:test .
docker run --rm lc0-training-client:test --help
docker run --rm --entrypoint /app/lc0 lc0-training-client:test --help
docker run --rm --gpus all nvidia/cuda:12.9.1-base-ubuntu22.04 nvidia-smi

Local validation

  • WSL2 + Docker Desktop + RTX 2060 (Driver 591.74)
  • ✅ Build succeeds
  • ✅ Client + engine --help verified
  • ✅ GPU access verified

Scope

PR 1 of 5; follow-ups ready after this lands.

- Dockerfile: 3-stage build (compile lc0 with CUDA, download client binary, minimal runtime)
- .dockerignore: Minimal exclusions
- .gitattributes: Enforce LF line endings for scripts/configs
- README.md: Docker quick-start section

Co-Authored-By: Claude <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR adds Docker support for running the LeelaChessZero selfplay training client with NVIDIA GPU acceleration. It implements a three-stage Docker build that compiles lc0 with CUDA support, downloads the training client binary, and packages everything into a minimal runtime image.

Changes:

  • Added Dockerfile with multi-stage build for GPU-enabled training client
  • Added .dockerignore for build optimization
  • Added Docker documentation section to README with build, test, and run examples

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| Dockerfile | Three-stage build: compiles lc0 with CUDA, downloads client binary, creates runtime image with non-root user and /data persistence mount |
| .dockerignore | Excludes development files (.git, .github, docs, scripts, tests) from Docker build context |
| README.md | Adds comprehensive Docker usage documentation including build, test, and runtime examples with GPU support |

Comment thread README.md

Test GPU access:

docker run --rm --gpus all nvidia/cuda:12.9.1-base-ubuntu22.04 nvidia-smi

Copilot AI Jan 19, 2026

This test command runs nvidia-smi on a different image (nvidia/cuda:12.9.1-base-ubuntu22.04) rather than testing the GPU access in the actual built image (lc0-training-client:test). This doesn't verify that the lc0-training-client image has proper GPU access. Consider changing this to test the actual built image, for example: docker run --rm --gpus all lc0-training-client:test nvidia-smi (though nvidia-smi may need to be installed in the runtime image for this to work), or add a more meaningful test that verifies lc0 can access the GPU.

Suggested change
- docker run --rm --gpus all nvidia/cuda:12.9.1-base-ubuntu22.04 nvidia-smi
+ docker run --rm --gpus all lc0-training-client:test nvidia-smi
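One caveat on running `nvidia-smi` in the built image: the PR's own engine test uses `--entrypoint /app/lc0`, which suggests the image defines an entrypoint, so an argument placed after the image name would probably be passed to the training client rather than executed. A minimal sketch of a check that overrides the entrypoint follows. The helper name `check_gpu_in_image` and the `DOCKER` variable are illustrative and not part of the PR. The sketch also assumes `nvidia-smi` is reachable inside the container, either injected by the NVIDIA Container Toolkit or installed in the runtime stage.

```shell
#!/usr/bin/env bash
# DOCKER can be overridden, e.g. DOCKER=echo for a dry run that only prints the arguments.
DOCKER="${DOCKER:-docker}"

# check_gpu_in_image is a hypothetical helper, not part of the PR.
# It replaces the image's entrypoint so nvidia-smi runs instead of the client.
check_gpu_in_image() {
  local image="${1:-lc0-training-client:test}"
  "$DOCKER" run --rm --gpus all --entrypoint nvidia-smi "$image"
}
```

After sourcing the snippet, calling `check_gpu_in_image` with no argument targets `lc0-training-client:test`. A non-zero exit status would indicate that either the GPU or the `nvidia-smi` binary is not visible in the container.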

Comment thread .dockerignore
.git
.github
*.md
!README.md

Copilot AI Jan 19, 2026

The .dockerignore file includes README.md (!README.md on line 4) after excluding all markdown files (*.md on line 3), but the Dockerfile doesn't appear to copy README.md or any other files from the build context. This inclusion is unnecessary and could be removed for clarity.

Suggested change
- !README.md

Comment thread Dockerfile
WORKDIR /build

# Use a real release tag, not a branch name.
ARG LC0_VERSION=v0.32.1

Copilot AI Jan 19, 2026

ARG LC0_VERSION is declared in the final runtime stage (line 112) but used in the engine_builder stage (line 29). In Docker multi-stage builds, ARG values don't carry across stages unless they're global (declared before the first FROM) or re-declared in each stage. The ARG should either be declared globally before all FROM statements, or re-declared at the top of the engine_builder stage. Currently, line 29 will use an empty string unless LC0_VERSION is passed as a build argument.
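The scoping rule can be reproduced in isolation. In Docker's documented behavior, an `ARG` declared before the first `FROM` is in scope only for `FROM` lines, so a stage that wants the value in a `RUN` step still needs a value-less `ARG` re-declaration. The sketch below writes a small demonstration Dockerfile. The helper `write_arg_demo`, the variable `DEMO_VERSION`, and the `alpine:3.20` base image are placeholders chosen for illustration and do not come from the PR's Dockerfile.

```shell
#!/usr/bin/env bash
# write_arg_demo is a hypothetical helper; it writes a demo Dockerfile into "$1".
write_arg_demo() {
  local dir="$1"
  mkdir -p "$dir"
  # Quoted heredoc delimiter: ${DEMO_VERSION} is written literally, not expanded by the shell.
  cat > "$dir/Dockerfile" <<'EOF'
# Global ARG: in scope for FROM lines only.
ARG DEMO_VERSION=v0.0.0

FROM alpine:3.20 AS builder
# Re-declared without a value, so this stage inherits the global default.
ARG DEMO_VERSION
RUN echo "builder sees: [${DEMO_VERSION}]" > /seen.txt

FROM alpine:3.20
# COPY keeps the builder stage from being skipped as unused.
COPY --from=builder /seen.txt /seen.txt
# No re-declaration in this stage, so the variable expands to an empty string.
RUN cat /seen.txt && echo "runtime sees: [${DEMO_VERSION}]"
EOF
}
```

Running `write_arg_demo /tmp/arg-demo` and then `docker build --no-cache --progress=plain /tmp/arg-demo` should show the default value in the builder stage and empty brackets in the final stage. Passing `--build-arg DEMO_VERSION=...` changes the builder line only.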

Comment thread Dockerfile
curl ca-certificates \
&& rm -rf /var/lib/apt/lists/*

ARG CLIENT_VERSION=v34

Copilot AI Jan 19, 2026

ARG CLIENT_VERSION is declared in the final runtime stage (line 113) but used in the client_downloader stage (line 64). In Docker multi-stage builds, ARG values don't carry across stages unless they're global (declared before the first FROM) or re-declared in each stage. The ARG should either be declared globally before all FROM statements, or re-declared at the top of the client_downloader stage. Currently, line 64 will use an empty string unless CLIENT_VERSION is passed as a build argument.

Comment thread Dockerfile
Comment on lines +65 to +71

WORKDIR /download

# Download client binary
# NOTE: Add checksum verification later if/when upstream publishes checksums.
RUN curl -fSL --retry 3 --retry-delay 2 -o client \
"https://github.com/LeelaChessZero/lczero-client/releases/download/${CLIENT_VERSION}/lc0-training-client-linux" \

Copilot AI Jan 19, 2026

The binary is downloaded from GitHub releases without checksum verification. While acknowledged in the comment, this creates a security risk as the downloaded binary could be tampered with or corrupted. Consider adding checksum verification or using a specific commit hash to ensure integrity of the downloaded binary.

Suggested change
- WORKDIR /download
- # Download client binary
- # NOTE: Add checksum verification later if/when upstream publishes checksums.
- RUN curl -fSL --retry 3 --retry-delay 2 -o client \
- "https://github.com/LeelaChessZero/lczero-client/releases/download/${CLIENT_VERSION}/lc0-training-client-linux" \
+ ARG CLIENT_SHA256=""
+ WORKDIR /download
+ # Download client binary (optionally verify checksum if CLIENT_SHA256 is set)
+ RUN curl -fSL --retry 3 --retry-delay 2 -o client \
+ "https://github.com/LeelaChessZero/lczero-client/releases/download/${CLIENT_VERSION}/lc0-training-client-linux" \
+ && if [ -n "${CLIENT_SHA256}" ]; then \
+ echo "${CLIENT_SHA256} client" | sha256sum -c -; \
+ else \
+ echo "WARNING: CLIENT_SHA256 not set; skipping checksum verification." >&2; \
+ fi \
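The suggestion above only warns when `CLIENT_SHA256` is empty. A stricter variant would fail the build instead, so an unverified binary never reaches the runtime stage. The following is a minimal sketch of that check as a standalone shell helper. The name `verify_sha256` is hypothetical, and the `--status` flag assumes GNU coreutils `sha256sum`, which an Ubuntu-based build stage would normally provide.

```shell
#!/usr/bin/env bash
# verify_sha256 is a hypothetical helper, not part of the PR.
# Usage: verify_sha256 <file> <expected-hex-digest>
# Returns 0 on match, 1 on mismatch, 2 when no expected digest is given.
verify_sha256() {
  local file="$1" expected="$2"
  if [ -z "$expected" ]; then
    echo "error: no expected SHA-256 given for $file" >&2
    return 2
  fi
  # sha256sum -c reads "<digest>  <filename>" lines (two spaces) from stdin;
  # --status suppresses output and reports the result via the exit code.
  echo "${expected}  ${file}" | sha256sum -c --status -
}
```

In a Dockerfile, the same logic could follow the `curl` step in the same `RUN` instruction, so that a mismatch or a missing digest aborts the build.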
