Improve Metal backend with hybrid CPU+GPU+ANE inference for macOS#1148

Open
ChinChangYang wants to merge 27 commits into lightvector:master from ChinChangYang:coreml-backend

Conversation

@ChinChangYang
Contributor

@ChinChangYang commented Jan 5, 2026

Summary

This PR adds a Metal neural network backend for macOS that leverages Apple's full compute stack—CPU, GPU, and Apple Neural Engine (ANE) simultaneously—through a per-thread multiplexer architecture.

Key Features

  • Per-Thread GPU/ANE Multiplexer: Each NN server thread is dedicated to either GPU (MPSGraph) or ANE (CoreML). Multiple threads run in parallel, saturating all compute units without intra-batch splitting overhead.
  • Native Model Conversion: Uses the katagocoreml C++ library for on-the-fly conversion from KataGo's .bin.gz format to CoreML .mlpackage (no Python dependency).
  • Configurable Precision: FP16 (default) enables both GPU and ANE paths; FP32 uses GPU-only MPSGraph, bypassing CoreML conversion entirely.
  • Dynamic Batch Size: Supports runtime batch sizes from 1 to maxBatchSize without model recompilation.
  • Flexible Dispatch Configuration: Users configure compute unit assignment per thread via metalDeviceToUseThread<N> (0 = GPU, 100 = ANE), allowing GPU-only, ANE-only, or mux mode.

Architecture

The backend uses a multiplexer design where each NN server thread owns exactly one compute handle:

  • GPU threads (metalDeviceToUseThread<N>=0): Use MPSGraph for Metal GPU inference
  • ANE threads (metalDeviceToUseThread<N>=100): Use CoreML for CPU+ANE inference

Example mux configuration (recommended for best throughput):

numNNServerThreadsPerModel = 4
metalDeviceToUseThread0 = 0    # GPU
metalDeviceToUseThread1 = 0    # GPU
metalDeviceToUseThread2 = 100  # ANE
metalDeviceToUseThread3 = 100  # ANE
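The per-thread dispatch can be sketched in plain C++ (a simplified illustration of the multiplexer idea, not the actual backend code; the `ComputeHandle` shape and factory here are hypothetical, though the 0/100 device codes follow the PR's convention):

```cpp
#include <memory>

// Simplified sketch of the per-thread multiplexer: each NN server thread
// owns exactly one compute handle, chosen by its configured device code.
enum class Engine { MPSGraphGPU, CoreMLANE };

struct ComputeHandle {
  Engine engine;
  explicit ComputeHandle(Engine e) : engine(e) {}
};

// Hypothetical factory mirroring metalDeviceToUseThread<N> semantics:
// 0 selects the Metal GPU path, 100 selects the CPU+ANE CoreML path.
std::unique_ptr<ComputeHandle> createComputeHandle(int gpuIdx) {
  switch (gpuIdx) {
    case 0:   return std::make_unique<ComputeHandle>(Engine::MPSGraphGPU);
    case 100: return std::make_unique<ComputeHandle>(Engine::CoreMLANE);
    default:
      // The PR warns on unrecognized values and defaults to GPU mode.
      return std::make_unique<ComputeHandle>(Engine::MPSGraphGPU);
  }
}
```

With the four-thread config above, threads 0–1 would each receive an MPSGraph handle and threads 2–3 a CoreML handle, so both engines stay busy without splitting any single batch.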

Performance

Benchmarked on Apple M3 Max with mux mode (2 GPU + 2 ANE threads), 2048 visits, -half-batch-size:

b18c384nbt (18 blocks, 384 channels):

| Threads | visits/s | nnEvals/s | avgBatchSize |
| --- | --- | --- | --- |
| 8 | 735.68 | 570.97 | 1.46 |
| 16 | 817.86 | 629.23 | 2.70 |
| 32 | 810.71 | 660.98 | 5.02 |

b28c512nbt (28 blocks, 512 channels):

| Threads | visits/s | nnEvals/s | avgBatchSize |
| --- | --- | --- | --- |
| 8 | 325.57 | 256.51 | 1.47 |
| 16 | 335.40 | 263.98 | 2.61 |
| 32 | 319.61 | 255.57 | 4.91 |

Per-thread workload distribution (b18c384nbt, 32 threads):

  • GPU threads: ~24,200 rows processed
  • ANE threads: ~24,572 rows processed

Files Changed

| File | Description |
| --- | --- |
| cpp/neuralnet/metalbackend.cpp | C++ interface: model conversion, batch processing, data marshaling |
| cpp/neuralnet/metalbackend.h | Backend header with compute handle and layer descriptors |
| cpp/neuralnet/metalbackend.swift | CoreML model loading and per-thread compute handle orchestration |
| cpp/neuralnet/metallayers.swift | MPSGraph layer implementations for GPU inference path |
| cpp/CMakeLists.txt | Build configuration for Metal backend (Ninja + Swift) |
| cpp/command/benchmark.cpp | -half-batch-size option for benchmarking |
| cpp/configs/gtp_example.cfg | Metal backend dispatch configuration documentation |
| cpp/configs/analysis_example.cfg | Metal backend dispatch configuration documentation |
| .github/workflows/build.yml | CI job for Metal backend on macOS |
| Compiling.md | Build instructions for Metal backend |

Build Requirements

  • macOS 13.0+
  • Xcode Command Line Tools with Swift 5.9+
  • Ninja build system: brew install ninja
  • katagocoreml library: brew tap chinchangyang/katagocoreml-cpp && brew install katagocoreml
cd cpp
cmake . -G Ninja -DUSE_BACKEND=METAL -DBUILD_DISTRIBUTED=1
ninja

Test Plan

  • ./katago runtests passes
  • ./katago benchmark verifies performance with mux mode (GPU+ANE)
  • GPU-only mode (metalDeviceToUseThread0=0) works correctly
  • ANE-only mode (metalDeviceToUseThread0=100) works correctly
  • FP16 mode: both GPU and ANE paths active
  • FP32 mode: GPU-only MPSGraph execution (CoreML bypassed)
  • Batch size 1: single compute handle works correctly
  • CI build succeeds on GitHub Actions

ChinChangYang and others added 13 commits December 31, 2025 17:15
Add dedicated mask buffer to fix incorrect mask offset calculation in
batched inference. The Swift code assumed mask buffer stride of H*W per
batch element, but was receiving spatial input buffer with stride of
numInputChannels*H*W, causing batch elements > 0 to read garbage data.

Changes:
- Add userInputMaskBuffer to InputBuffers with correct stride
- Copy first channel of spatial input (mask) to dedicated buffer
- Pass mask buffer to Swift instead of reusing spatial buffer

Batched winrate error: 19% → 0.037% (now matches single evaluation)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
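The stride mismatch this commit fixes can be illustrated with a plain C++ sketch (hypothetical buffer layout, not the backend code; `C` stands for numInputChannels): the mask lives in channel 0 of the spatial input at stride C*H*W per batch element, but the consumer expects a dense buffer at stride H*W.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch of the fix: the spatial input is laid out as [batch][C][H*W],
// so the mask (channel 0) sits at stride C*HW per batch element, while
// the Swift side expects a dense mask buffer with stride HW. Reusing the
// spatial buffer directly made batch elements > 0 read garbage.
std::vector<float> extractMaskBuffer(const std::vector<float>& spatialInput,
                                     size_t batchSize, size_t C, size_t HW) {
  std::vector<float> mask(batchSize * HW);
  for (size_t b = 0; b < batchSize; b++) {
    // Channel 0 of batch element b begins at offset b * C * HW.
    const float* src = spatialInput.data() + b * C * HW;
    std::copy(src, src + HW, mask.data() + b * HW);
  }
  return mask;
}
```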
This update enhances the CMake configuration to support the Core ML backend alongside existing options. Key changes include:
- Updated project definition to include Core ML in backend options.
- Added necessary checks for Swift compiler version and generator type.
- Introduced a new library for Core ML and updated target properties.
- Modified output messages to reflect the selected backend during runtime.

This integration allows for improved compatibility and functionality when using Core ML for neural network evaluations.
This update adds an entry for the CoreML backend to the .gitignore file, ensuring that generated files related to the CoreML integration are not tracked by Git. This change helps maintain a cleaner repository by excluding unnecessary build artifacts.
Core ML may return non-contiguous MLMultiArray outputs after GPU computation,
especially for spatial tensors. The previous code used direct dataPointer access
with linear indexing, which read data from wrong memory locations when strides
were non-contiguous.

This fix adds stride-aware extraction that checks MLMultiArray.strides and
handles both contiguous (fast path) and non-contiguous (recursive copy) cases.
Also fixes hard-coded passChannels=2 to use numPolicyChannels.

Before: Policy KL Div ~9.19, Ownership Error ~54c
After:  Policy KL Div ~0.003, Ownership Error ~0.02c

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The CoreML model exports pass policy as "policy_pass" but the code was
looking for "policy_pass_mul2", causing the pass policy buffer to remain
at 0. This resulted in systematically inflated pass move probabilities
after softmax (up to 14% error vs reference).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The CoreML backend now respects the useFP16 config option, allowing users
to choose between FP16 (default, faster, uses Neural Engine) and FP32
(higher precision). FP16 has ~0.87% max winrate error while FP32 achieves
~0.0006% by matching the Eigen reference. Cache keys include precision
suffix to store FP16 and FP32 models separately.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Eliminate Python dependency for CoreML model conversion by using the
native C++ katagocoreml library instead of calling Python subprocess.

Changes:
- CMakeLists.txt: Add pkg-config detection for katagocoreml library
- coremlbackend.cpp: Add CoreMLConversion namespace with native converter
  wrapper, caching logic, and directory management functions
- coremlbackend.swift: Remove CoreMLConverter and ModelCacheManager
  structs, simplify createCoreMLComputeHandle to only load pre-converted
  models

The native converter uses katagocoreml::KataGoConverter::convert() and
caches converted models with a "_native" suffix to distinguish from
previously Python-converted models.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implement a hybrid inference system that runs CoreML on CPU + Neural Engine
and MPSGraph on GPU simultaneously, with adaptive batch sizing:

- Add mpsgraphlayers.swift: Shared MPSGraph layer implementations
- Add HybridComputeHandle: Dispatches work to both backends in parallel
- Add ThroughputTracker: Adaptively adjusts batch split ratio using EMA
- Parallelize CoreML batch processing with DispatchQueue.concurrentPerform
- Optimize data copying with memcpy for inputs and outputs
- Clean up CMakeLists.txt: Remove redundant SOURCES from _swift_generate_cxx_header

Performance: Achieves 577 nnEvals/s at 16 threads (vs ~374 before),
exceeding the 500 nnEvals/s target for CPU+GPU+ANE utilization.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When requireExactNNLen is true (all mask values are 1), skip unnecessary
mask operations in MPSGraph layers:

- BatchNormLayer: Skip output * maskTensor multiplication
- GlobalPoolingLayer: Skip mask-1 trick for max pooling
- MaskSumLayer and derived layers: Use precomputed constants instead of
  computing from mask tensor

The optimization is enabled by passing requireExactNNLen to
MPSGraphModelHandle, which propagates it through the layer hierarchy.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
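The maskSum optimization can be shown with a minimal masked-pooling sketch (an illustration of the idea, not the MPSGraph code): when requireExactNNLen guarantees an all-ones mask, the divisor is the constant H*W and the mask reduction and multiply are skipped.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the requireExactNNLen fast path for mean pooling over a
// masked board. General path: reduce the mask to get the divisor and
// multiply each value by its mask entry. Exact path: the mask is known
// to be all ones, so the divisor is the constant H*W and both the mask
// reduction and the per-element multiply can be skipped.
float maskedMean(const std::vector<float>& x, const std::vector<float>& mask,
                 bool requireExactNNLen) {
  float sum = 0.0f;
  if (requireExactNNLen) {
    for (float v : x) sum += v;            // mask multiply skipped
    return sum / (float)x.size();          // precomputed constant: H*W
  }
  float maskSum = 0.0f;
  for (float m : mask) maskSum += m;       // general path: reduce the mask
  for (size_t i = 0; i < x.size(); i++) sum += x[i] * mask[i];
  return sum / maskSum;
}
```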
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document the CoreML backend as an alternative to Metal for macOS, including
Homebrew installation of the katagocoreml library dependency.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@ChinChangYang
Contributor Author

Cross-Validation Test Report for CoreML Backend

Test Configuration

  • Model: b18c384nbt-uec-20221121b.bin.gz (18-block, 384-channel model)
  • Board Size: 19x19
  • Test Positions: 2,247 positions from built-in test dataset
  • Reference: Eigen backend (CPU, FP32)
  • Test Command: ./katago testgpuerror

Hardware

  • Apple M3 Max
  • macOS with CoreML + MPSGraph hybrid backend

Results

CoreML FP32 vs Eigen FP32 Reference

| Metric | Average | 90th % | 99th % | Max | Threshold (99th/Max) |
| --- | --- | --- | --- | --- | --- |
| winrateError | 0.00006% | 0.00017% | 0.00035% | 0.00066% | 0.45% / 1.35% |
| leadError | 0.00002 | 0.00003 | 0.00011 | 0.00033 | 0.225 / 0.90 |
| scoreMeanError | 0.00002 | 0.00004 | 0.00012 | 0.00030 | 0.225 / 0.90 |
| scoreStdevError | 0.00001 | 0.00001 | 0.00004 | 0.00009 | 0.135 / 0.54 |
| topPolicyDelta | 0.00007% | 0.00017% | 0.00036% | 0.00068% | 0.45% / 1.35% |
| policyKLDiv | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0006 / 0.0012 |
| ownershipError | 0.00003c | 0.00006c | 0.00018c | 0.00174c | |

Status: PASS — CoreML FP32 matches Eigen FP32 with near-zero error.

CoreML FP16 vs Eigen FP32 Reference

| Metric | Average | 90th % | 99th % | Max | Threshold (99th/Max) |
| --- | --- | --- | --- | --- | --- |
| winrateError | 0.0997% | 0.2553% | 0.4857% | 0.9147% | 2.0% / 5.0% |
| leadError | 0.0277 | 0.0615 | 0.1754 | 0.3852 | 1.00 / 3.00 |
| scoreMeanError | 0.0332 | 0.0718 | 0.1721 | 0.3739 | 1.00 / 3.00 |
| scoreStdevError | 0.0129 | 0.0274 | 0.0658 | 0.1646 | 0.60 / 1.80 |
| topPolicyDelta | 0.0944% | 0.2124% | 0.4547% | 0.8814% | 2.50% / 6.00% |
| policyKLDiv | 0.000017 | 0.000035 | 0.000119 | 0.000604 | 0.0020 / 0.0040 |
| ownershipError | 0.0443c | 0.1000c | 0.2420c | 4.3359c | |

Status: PASS — CoreML FP16 is well within acceptable error bounds for half-precision inference.

Summary

| Configuration | Max Winrate Error | Max Policy KL Div | Result |
| --- | --- | --- | --- |
| CoreML FP32 | 0.00066% | 0.000000 | PASS |
| CoreML FP16 | 0.91% | 0.000604 | PASS |

The CoreML backend passes all validation checks:

  • FP32 mode: Numerically equivalent to Eigen reference (errors < 0.001%)
  • FP16 mode: Max winrate error of 0.91% is well below the 5% threshold, consistent with expected half-precision behavior

Conclusion

The hybrid CoreML + MPSGraph backend produces numerically correct results across 2,247 test positions when compared against the Eigen CPU reference implementation. Both FP16 and FP32 precision modes meet KataGo's accuracy requirements for neural network inference.

Comment on lines 100 to 149
build-macos-coreml:
  runs-on: macos-latest
  permissions:
    contents: read

  steps:
  - name: Checkout code
    uses: actions/checkout@v4

  - name: Install dependencies
    run: |
      brew install ninja zlib libzip
      brew tap chinchangyang/katagocoreml-cpp
      brew install katagocoreml

  - name: Cache CMake build
    uses: actions/cache@v4
    with:
      path: |
        cpp/CMakeCache.txt
        cpp/CMakeFiles
        cpp/build.ninja
        cpp/.ninja_deps
        cpp/.ninja_log
      key: ${{ runner.os }}-cmake-coreml-${{ hashFiles('**/CMakeLists.txt') }}
      restore-keys: |
        ${{ runner.os }}-cmake-coreml-

  - name: Configure CMake
    working-directory: cpp
    run: |
      cmake . -G Ninja -DUSE_BACKEND=COREML -DCMAKE_BUILD_TYPE=Release

  - name: Build
    working-directory: cpp
    run: |
      ninja

  - name: Run tests
    working-directory: cpp
    run: |
      ./katago runtests

  - name: Upload artifact
    if: github.event_name == 'push' && github.ref == 'refs/heads/master'
    uses: actions/upload-artifact@v4
    with:
      name: katago-macos-coreml
      path: cpp/katago

Contributor
Is it possible to merge the macOS builds into a single configuration? Some of the build steps appear to overlap.

Contributor Author

Absolutely! It can be merged into a single configuration. Would you like to review it? ChinChangYang#9

ChinChangYang and others added 5 commits January 20, 2026 19:52
- Add runtime batch size support (1 to maxBatchSize) with batch size
  included in model cache key for proper cache invalidation
- Simplify model loading: convert to temp .mlpackage, load via
  MLModel.compileModel(), then delete immediately (CoreML caches internally)
- Remove ~400 lines of complex manual cache management code
- Ensure temp files are cleaned up on conversion/compile failures using
  defer, with warning logged if cleanup itself fails

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When useFP16=false, the CoreML CPU+ANE path runs extremely slowly in
FP32 mode. This adds a GPU-only execution path using MPSGraph that
bypasses CoreML model conversion entirely, providing much faster FP32
inference.

Changes:
- Add createMPSGraphOnlyHandle() function in Swift for direct GPU-only
  handle creation without CoreML conversion
- Add mpsGraphOnlyHandle field to ComputeHandle for FP32 mode
- Add conditional helper functions to create appropriate handle based
  on precision mode (hybridHandle for FP16, mpsGraphOnlyHandle for FP32)
- Modify getCoreMLOutput() to dispatch to correct handle
- Add assertion enforcing mutual exclusivity of handles
- Standardize logging to use "Core ML backend X:" prefix throughout

FP16 mode continues to use hybrid CoreML+MPSGraph execution as before.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The hybrid execution mode splits batches between CoreML (CPU+ANE) and
MPSGraph (GPU), requiring at least 2 samples (1 for each backend).
When maxBatchSize < 2, fall back to MPSGraph-only which provides more
stable latency and avoids CoreML dispatch overhead.

This also enables explicit single-threaded GPU-only execution via
nnMaxBatchSize=1, useful for debugging or deterministic behavior.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Optimize hybrid dispatch convergence with adaptive EMA parameters:
- Lower alpha (0.15) and warm-start ratio (0.47) for faster convergence
- Adaptive warmup phase with variance-based transition
- Remove unnecessary NSLock (thread-safe by design via single-owner access)
- Add diagnostic logging via KATAGO_HYBRID_DIAG environment variable

The thread safety is ensured without locks because:
1. Each server thread owns its own ThroughputTracker instance
2. Concurrent queue access is to disjoint fields only
3. group.wait() provides sequential barrier before shared reads

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
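A plausible reconstruction of the EMA logic described above (this tracker was later removed by the multiplexer refactor; the alpha of 0.15 and warm-start ratio of 0.47 come from the commit, but the specific update rule below, steering the split toward the GPU's share of total measured throughput, is an assumption for illustration):

```cpp
// Sketch of an EMA-based batch-split tracker. Each server thread owns
// its own instance, which is why no lock is needed: all mutation happens
// on the owning thread between hybrid dispatches.
class ThroughputTracker {
public:
  double ratio() const { return gpuRatio; }

  // rows/sec observed on each side during the last hybrid batch.
  void update(double gpuRowsPerSec, double aneRowsPerSec) {
    double target = gpuRowsPerSec / (gpuRowsPerSec + aneRowsPerSec);
    gpuRatio = (1.0 - alpha) * gpuRatio + alpha * target;  // EMA step
  }

private:
  static constexpr double alpha = 0.15;  // EMA smoothing factor
  double gpuRatio = 0.47;                // warm-start GPU share
};
```

A lower alpha smooths out per-batch noise at the cost of slower convergence; the warm start near 0.5 means the first batches are split roughly evenly while measurements accumulate.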
Consolidate the previously separate CoreML and MPSGraph backends into a
unified Metal backend. The hybrid architecture (CPU+GPU+ANE) is preserved
but exposed as a single -DUSE_BACKEND=METAL build option.

Key changes:
- Remove standalone CoreML backend files (coremlbackend.{cpp,h,swift})
- Merge CoreML logic into metalbackend.{cpp,h,swift}
- Rename mpsgraphlayers.swift to metallayers.swift
- Rename CoreMLProcess namespace to MetalProcess
- Update CMakeLists.txt: remove COREML backend, keep METAL only
- Update CI workflow to use METAL backend
- Prefer MPSGraph over CoreML for batch size 1 (better latency)
- Add autoreleasepool to CoreML dispatch path
- Remove unused alpha constant from ThroughputTracker
- Fix minor comment and log prefix inconsistencies

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@ChinChangYang changed the title from "Add CoreML backend with hybrid CPU+GPU+ANE inference for macOS" to "Improve Metal backend with hybrid CPU+GPU+ANE inference for macOS" Jan 24, 2026
ChinChangYang and others added 8 commits February 23, 2026 18:19
Remove intra-batch HybridComputeHandle that split work between CoreML
and MPSGraph within a single thread. Instead, each server thread now
runs exclusively as GPU (MPSGraph, gpuIdx=0) or ANE (CoreML, gpuIdx=100),
configured via metalDeviceToUseThread<N>. This eliminates the
ThroughputTracker, adaptive batch ratio, and dispatch queue complexity
in favor of thread-level multiplexing managed by KataGo's existing
multi-server-thread infrastructure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Validate gpuIdx in createComputeHandle to warn on unrecognized values
and default to GPU mode. Warn when ANE mode is used with FP32 since
CoreML FP32 bypasses ANE and runs on CPU only. Fix printMetalDevices
to accurately describe available modes after the multiplexer refactor.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove non-existent --sweep-backends benchmark reference
- Update FP32+ANE warning to reference correct config keys
  (metalDeviceToUseThread<N>, metalUseFP16)
- Add metalUseFP16 setting to Metal sections in gtp and analysis configs
- Remove outdated Swift doc comment on createMPSGraphOnlyHandle
- Add clarifying comment for unused maxBatchSize parameter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use `hasMPSGraph == hasCoreML` instead of `hasMPSGraph + hasCoreML != 1`
to avoid implicit bool-to-int promotion. The error message now reports
"both" or "neither" instead of a numeric count.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
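A small standalone illustration of this check (hypothetical helper, not the backend code): exactly one of the two handles must exist, and comparing the bools directly both sidesteps the implicit bool-to-int promotion of `hasMPSGraph + hasCoreML != 1` and lets the message name the failure mode.

```cpp
#include <string>

// Exactly one handle must be set. hasMPSGraph == hasCoreML is true
// precisely when both are set or neither is, i.e. in both error cases.
std::string validateHandles(bool hasMPSGraph, bool hasCoreML) {
  if (hasMPSGraph == hasCoreML)
    return hasMPSGraph ? "error: both handles set"
                       : "error: neither handle set";
  return "ok";
}
```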
Switch from graph.run() to graph.encode() with explicit command buffer
management (commit/waitUntilCompleted). This enables GPU error checking
via commandBuffer.error and lays groundwork for future pipelining.

Also consolidates variable declarations and reduces verbosity in the
apply() method for better readability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The COREML backend was unified into METAL, but two leftover references
remained: a dead #ifdef block in benchmark.cpp and an outdated comment
in .gitignore.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Updated the CMakeLists.txt to eliminate the COREML backend condition, retaining only the METAL backend check. Additionally, removed references to COREML backend files from .gitignore to reflect the recent unification of backends.
@ChinChangYang marked this pull request as ready for review February 27, 2026 01:46