Improve Metal backend with hybrid CPU+GPU+ANE inference for macOS#1148

Open
ChinChangYang wants to merge 27 commits into lightvector:master from ChinChangYang:coreml-backend

Conversation

@ChinChangYang
Contributor

@ChinChangYang commented Jan 5, 2026

Summary

This PR adds a Metal neural network backend for macOS that leverages Apple's full compute stack—CPU, GPU, and Apple Neural Engine (ANE) simultaneously—through a per-thread multiplexer architecture.

Key Features

  • Per-Thread GPU/ANE Multiplexer: Each NN server thread is dedicated to either GPU (MPSGraph) or ANE (CoreML). Multiple threads run in parallel, saturating all compute units without intra-batch splitting overhead.
  • Native Model Conversion: Uses the katagocoreml C++ library for on-the-fly conversion from KataGo's .bin.gz format to CoreML .mlpackage (no Python dependency).
  • Configurable Precision: FP16 (default) enables both GPU and ANE paths; FP32 uses GPU-only MPSGraph, bypassing CoreML conversion entirely.
  • Dynamic Batch Size: Supports runtime batch sizes from 1 to maxBatchSize without model recompilation.
  • Flexible Dispatch Configuration: Users configure compute unit assignment per thread via metalDeviceToUseThread<N> (0 = GPU, 100 = ANE), allowing GPU-only, ANE-only, or mux mode.

Architecture

The backend uses a multiplexer design where each NN server thread owns exactly one compute handle:

  • GPU threads (metalDeviceToUseThread<N>=0): Use MPSGraph for Metal GPU inference
  • ANE threads (metalDeviceToUseThread<N>=100): Use CoreML for CPU+ANE inference

Example mux configuration (recommended for best throughput):

numNNServerThreadsPerModel = 4
metalDeviceToUseThread0 = 0    # GPU
metalDeviceToUseThread1 = 0    # GPU
metalDeviceToUseThread2 = 100  # ANE
metalDeviceToUseThread3 = 100  # ANE
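The per-thread dispatch can be sketched in plain C++ (a simplified illustration of the multiplexer idea, not the actual backend code; the `ComputeHandle` shape and factory here are hypothetical, though the 0/100 device codes follow the PR's convention):

```cpp
#include <memory>

// Simplified sketch of the per-thread multiplexer: each NN server thread
// owns exactly one compute handle, chosen by its configured device code.
enum class Engine { MPSGraphGPU, CoreMLANE };

struct ComputeHandle {
  Engine engine;
  explicit ComputeHandle(Engine e) : engine(e) {}
};

// Hypothetical factory mirroring metalDeviceToUseThread<N> semantics:
// 0 selects the Metal GPU path, 100 selects the CPU+ANE CoreML path.
std::unique_ptr<ComputeHandle> createComputeHandle(int gpuIdx) {
  switch (gpuIdx) {
    case 0:   return std::make_unique<ComputeHandle>(Engine::MPSGraphGPU);
    case 100: return std::make_unique<ComputeHandle>(Engine::CoreMLANE);
    default:
      // The PR warns on unrecognized values and defaults to GPU mode.
      return std::make_unique<ComputeHandle>(Engine::MPSGraphGPU);
  }
}
```

With the four-thread config above, threads 0–1 would each receive an MPSGraph handle and threads 2–3 a CoreML handle, so both engines stay busy without splitting any single batch.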

Performance

Benchmarked on Apple M3 Max with mux mode (2 GPU + 2 ANE threads), 2048 visits, -half-batch-size:

b18c384nbt (18 blocks, 384 channels):

| Threads | visits/s | nnEvals/s | avgBatchSize |
| --- | --- | --- | --- |
| 8 | 735.68 | 570.97 | 1.46 |
| 16 | 817.86 | 629.23 | 2.70 |
| 32 | 810.71 | 660.98 | 5.02 |

b28c512nbt (28 blocks, 512 channels):

| Threads | visits/s | nnEvals/s | avgBatchSize |
| --- | --- | --- | --- |
| 8 | 325.57 | 256.51 | 1.47 |
| 16 | 335.40 | 263.98 | 2.61 |
| 32 | 319.61 | 255.57 | 4.91 |

Per-thread workload distribution (b18c384nbt, 32 threads):

  • GPU threads: ~24,200 rows processed
  • ANE threads: ~24,572 rows processed

Files Changed

| File | Description |
| --- | --- |
| cpp/neuralnet/metalbackend.cpp | C++ interface: model conversion, batch processing, data marshaling |
| cpp/neuralnet/metalbackend.h | Backend header with compute handle and layer descriptors |
| cpp/neuralnet/metalbackend.swift | CoreML model loading and per-thread compute handle orchestration |
| cpp/neuralnet/metallayers.swift | MPSGraph layer implementations for GPU inference path |
| cpp/CMakeLists.txt | Build configuration for Metal backend (Ninja + Swift) |
| cpp/command/benchmark.cpp | -half-batch-size option for benchmarking |
| cpp/configs/gtp_example.cfg | Metal backend dispatch configuration documentation |
| cpp/configs/analysis_example.cfg | Metal backend dispatch configuration documentation |
| .github/workflows/build.yml | CI job for Metal backend on macOS |
| Compiling.md | Build instructions for Metal backend |

Build Requirements

  • macOS 13.0+
  • Xcode Command Line Tools with Swift 5.9+
  • Ninja build system: brew install ninja
  • katagocoreml library: brew tap chinchangyang/katagocoreml-cpp && brew install katagocoreml
cd cpp
cmake . -G Ninja -DUSE_BACKEND=METAL -DBUILD_DISTRIBUTED=1
ninja

Test Plan

  • ./katago runtests passes
  • ./katago benchmark verifies performance with mux mode (GPU+ANE)
  • GPU-only mode (metalDeviceToUseThread0=0) works correctly
  • ANE-only mode (metalDeviceToUseThread0=100) works correctly
  • FP16 mode: both GPU and ANE paths active
  • FP32 mode: GPU-only MPSGraph execution (CoreML bypassed)
  • Batch size 1: single compute handle works correctly
  • CI build succeeds on GitHub Actions

ChinChangYang and others added 13 commits December 31, 2025 17:15
Add dedicated mask buffer to fix incorrect mask offset calculation in
batched inference. The Swift code assumed mask buffer stride of H*W per
batch element, but was receiving spatial input buffer with stride of
numInputChannels*H*W, causing batch elements > 0 to read garbage data.

Changes:
- Add userInputMaskBuffer to InputBuffers with correct stride
- Copy first channel of spatial input (mask) to dedicated buffer
- Pass mask buffer to Swift instead of reusing spatial buffer

Batched winrate error: 19% → 0.037% (now matches single evaluation)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
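The stride mismatch this commit fixes can be illustrated with a plain C++ sketch (hypothetical buffer layout, not the backend code; `C` stands for numInputChannels): the mask lives in channel 0 of the spatial input at stride C*H*W per batch element, but the consumer expects a dense buffer at stride H*W.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch of the fix: the spatial input is laid out as [batch][C][H*W],
// so the mask (channel 0) sits at stride C*HW per batch element, while
// the Swift side expects a dense mask buffer with stride HW. Reusing the
// spatial buffer directly made batch elements > 0 read garbage.
std::vector<float> extractMaskBuffer(const std::vector<float>& spatialInput,
                                     size_t batchSize, size_t C, size_t HW) {
  std::vector<float> mask(batchSize * HW);
  for (size_t b = 0; b < batchSize; b++) {
    // Channel 0 of batch element b begins at offset b * C * HW.
    const float* src = spatialInput.data() + b * C * HW;
    std::copy(src, src + HW, mask.data() + b * HW);
  }
  return mask;
}
```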
This update enhances the CMake configuration to support the Core ML backend alongside existing options. Key changes include:
- Updated project definition to include Core ML in backend options.
- Added necessary checks for Swift compiler version and generator type.
- Introduced a new library for Core ML and updated target properties.
- Modified output messages to reflect the selected backend during runtime.

This integration allows for improved compatibility and functionality when using Core ML for neural network evaluations.
This update adds an entry for the CoreML backend to the .gitignore file, ensuring that generated files related to the CoreML integration are not tracked by Git. This change helps maintain a cleaner repository by excluding unnecessary build artifacts.
Core ML may return non-contiguous MLMultiArray outputs after GPU computation,
especially for spatial tensors. The previous code used direct dataPointer access
with linear indexing, which read data from wrong memory locations when strides
were non-contiguous.

This fix adds stride-aware extraction that checks MLMultiArray.strides and
handles both contiguous (fast path) and non-contiguous (recursive copy) cases.
Also fixes hard-coded passChannels=2 to use numPolicyChannels.

Before: Policy KL Div ~9.19, Ownership Error ~54c
After:  Policy KL Div ~0.003, Ownership Error ~0.02c

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The CoreML model exports pass policy as "policy_pass" but the code was
looking for "policy_pass_mul2", causing the pass policy buffer to remain
at 0. This resulted in systematically inflated pass move probabilities
after softmax (up to 14% error vs reference).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The CoreML backend now respects the useFP16 config option, allowing users
to choose between FP16 (default, faster, uses Neural Engine) and FP32
(higher precision). FP16 has ~0.87% max winrate error while FP32 achieves
~0.0006% by matching the Eigen reference. Cache keys include precision
suffix to store FP16 and FP32 models separately.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Eliminate Python dependency for CoreML model conversion by using the
native C++ katagocoreml library instead of calling Python subprocess.

Changes:
- CMakeLists.txt: Add pkg-config detection for katagocoreml library
- coremlbackend.cpp: Add CoreMLConversion namespace with native converter
  wrapper, caching logic, and directory management functions
- coremlbackend.swift: Remove CoreMLConverter and ModelCacheManager
  structs, simplify createCoreMLComputeHandle to only load pre-converted
  models

The native converter uses katagocoreml::KataGoConverter::convert() and
caches converted models with a "_native" suffix to distinguish from
previously Python-converted models.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implement a hybrid inference system that runs CoreML on CPU + Neural Engine
and MPSGraph on GPU simultaneously, with adaptive batch sizing:

- Add mpsgraphlayers.swift: Shared MPSGraph layer implementations
- Add HybridComputeHandle: Dispatches work to both backends in parallel
- Add ThroughputTracker: Adaptively adjusts batch split ratio using EMA
- Parallelize CoreML batch processing with DispatchQueue.concurrentPerform
- Optimize data copying with memcpy for inputs and outputs
- Clean up CMakeLists.txt: Remove redundant SOURCES from _swift_generate_cxx_header

Performance: Achieves 577 nnEvals/s at 16 threads (vs ~374 before),
exceeding the 500 nnEvals/s target for CPU+GPU+ANE utilization.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When requireExactNNLen is true (all mask values are 1), skip unnecessary
mask operations in MPSGraph layers:

- BatchNormLayer: Skip output * maskTensor multiplication
- GlobalPoolingLayer: Skip mask-1 trick for max pooling
- MaskSumLayer and derived layers: Use precomputed constants instead of
  computing from mask tensor

The optimization is enabled by passing requireExactNNLen to
MPSGraphModelHandle, which propagates it through the layer hierarchy.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
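The maskSum optimization can be shown with a minimal masked-pooling sketch (an illustration of the idea, not the MPSGraph code): when requireExactNNLen guarantees an all-ones mask, the divisor is the constant H*W and the mask reduction and multiply are skipped.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the requireExactNNLen fast path for mean pooling over a
// masked board. General path: reduce the mask to get the divisor and
// multiply each value by its mask entry. Exact path: the mask is known
// to be all ones, so the divisor is the constant H*W and both the mask
// reduction and the per-element multiply can be skipped.
float maskedMean(const std::vector<float>& x, const std::vector<float>& mask,
                 bool requireExactNNLen) {
  float sum = 0.0f;
  if (requireExactNNLen) {
    for (float v : x) sum += v;            // mask multiply skipped
    return sum / (float)x.size();          // precomputed constant: H*W
  }
  float maskSum = 0.0f;
  for (float m : mask) maskSum += m;       // general path: reduce the mask
  for (size_t i = 0; i < x.size(); i++) sum += x[i] * mask[i];
  return sum / maskSum;
}
```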
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document the CoreML backend as an alternative to Metal for macOS, including
Homebrew installation of the katagocoreml library dependency.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@ChinChangYang
Contributor Author

Cross-Validation Test Report for CoreML Backend

Test Configuration

  • Model: b18c384nbt-uec-20221121b.bin.gz (18-block, 384-channel model)
  • Board Size: 19x19
  • Test Positions: 2,247 positions from built-in test dataset
  • Reference: Eigen backend (CPU, FP32)
  • Test Command: ./katago testgpuerror

Hardware

  • Apple M3 Max
  • macOS with CoreML + MPSGraph hybrid backend

Results

CoreML FP32 vs Eigen FP32 Reference

| Metric | Average | 90th % | 99th % | Max | Threshold (99th/Max) |
| --- | --- | --- | --- | --- | --- |
| winrateError | 0.00006% | 0.00017% | 0.00035% | 0.00066% | 0.45% / 1.35% |
| leadError | 0.00002 | 0.00003 | 0.00011 | 0.00033 | 0.225 / 0.90 |
| scoreMeanError | 0.00002 | 0.00004 | 0.00012 | 0.00030 | 0.225 / 0.90 |
| scoreStdevError | 0.00001 | 0.00001 | 0.00004 | 0.00009 | 0.135 / 0.54 |
| topPolicyDelta | 0.00007% | 0.00017% | 0.00036% | 0.00068% | 0.45% / 1.35% |
| policyKLDiv | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0006 / 0.0012 |
| ownershipError | 0.00003c | 0.00006c | 0.00018c | 0.00174c | |

Status: PASS — CoreML FP32 matches Eigen FP32 with near-zero error.

CoreML FP16 vs Eigen FP32 Reference

| Metric | Average | 90th % | 99th % | Max | Threshold (99th/Max) |
| --- | --- | --- | --- | --- | --- |
| winrateError | 0.0997% | 0.2553% | 0.4857% | 0.9147% | 2.0% / 5.0% |
| leadError | 0.0277 | 0.0615 | 0.1754 | 0.3852 | 1.00 / 3.00 |
| scoreMeanError | 0.0332 | 0.0718 | 0.1721 | 0.3739 | 1.00 / 3.00 |
| scoreStdevError | 0.0129 | 0.0274 | 0.0658 | 0.1646 | 0.60 / 1.80 |
| topPolicyDelta | 0.0944% | 0.2124% | 0.4547% | 0.8814% | 2.50% / 6.00% |
| policyKLDiv | 0.000017 | 0.000035 | 0.000119 | 0.000604 | 0.0020 / 0.0040 |
| ownershipError | 0.0443c | 0.1000c | 0.2420c | 4.3359c | |

Status: PASS — CoreML FP16 is well within acceptable error bounds for half-precision inference.

Summary

| Configuration | Max Winrate Error | Max Policy KL Div | Result |
| --- | --- | --- | --- |
| CoreML FP32 | 0.00066% | 0.000000 | PASS |
| CoreML FP16 | 0.91% | 0.000604 | PASS |

The CoreML backend passes all validation checks:

  • FP32 mode: Numerically equivalent to Eigen reference (errors < 0.001%)
  • FP16 mode: Max winrate error of 0.91% is well below the 5% threshold, consistent with expected half-precision behavior

Conclusion

The hybrid CoreML + MPSGraph backend produces numerically correct results across 2,247 test positions when compared against the Eigen CPU reference implementation. Both FP16 and FP32 precision modes meet KataGo's accuracy requirements for neural network inference.

Comment on lines 100 to 149
build-macos-coreml:
  runs-on: macos-latest
  permissions:
    contents: read

  steps:
  - name: Checkout code
    uses: actions/checkout@v4

  - name: Install dependencies
    run: |
      brew install ninja zlib libzip
      brew tap chinchangyang/katagocoreml-cpp
      brew install katagocoreml

  - name: Cache CMake build
    uses: actions/cache@v4
    with:
      path: |
        cpp/CMakeCache.txt
        cpp/CMakeFiles
        cpp/build.ninja
        cpp/.ninja_deps
        cpp/.ninja_log
      key: ${{ runner.os }}-cmake-coreml-${{ hashFiles('**/CMakeLists.txt') }}
      restore-keys: |
        ${{ runner.os }}-cmake-coreml-

  - name: Configure CMake
    working-directory: cpp
    run: |
      cmake . -G Ninja -DUSE_BACKEND=COREML -DCMAKE_BUILD_TYPE=Release

  - name: Build
    working-directory: cpp
    run: |
      ninja

  - name: Run tests
    working-directory: cpp
    run: |
      ./katago runtests

  - name: Upload artifact
    if: github.event_name == 'push' && github.ref == 'refs/heads/master'
    uses: actions/upload-artifact@v4
    with:
      name: katago-macos-coreml
      path: cpp/katago

Contributor
Is it possible to merge the macOS builds into a single configuration? Some of the build steps appear to overlap.

Contributor Author

Absolutely! It can be merged into a single configuration. Would you like to review it? ChinChangYang#9

ChinChangYang and others added 5 commits January 20, 2026 19:52
- Add runtime batch size support (1 to maxBatchSize) with batch size
  included in model cache key for proper cache invalidation
- Simplify model loading: convert to temp .mlpackage, load via
  MLModel.compileModel(), then delete immediately (CoreML caches internally)
- Remove ~400 lines of complex manual cache management code
- Ensure temp files are cleaned up on conversion/compile failures using
  defer, with warning logged if cleanup itself fails

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When useFP16=false, the CoreML CPU+ANE path runs extremely slowly in
FP32 mode. This adds a GPU-only execution path using MPSGraph that
bypasses CoreML model conversion entirely, providing much faster FP32
inference.

Changes:
- Add createMPSGraphOnlyHandle() function in Swift for direct GPU-only
  handle creation without CoreML conversion
- Add mpsGraphOnlyHandle field to ComputeHandle for FP32 mode
- Add conditional helper functions to create appropriate handle based
  on precision mode (hybridHandle for FP16, mpsGraphOnlyHandle for FP32)
- Modify getCoreMLOutput() to dispatch to correct handle
- Add assertion enforcing mutual exclusivity of handles
- Standardize logging to use "Core ML backend X:" prefix throughout

FP16 mode continues to use hybrid CoreML+MPSGraph execution as before.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The hybrid execution mode splits batches between CoreML (CPU+ANE) and
MPSGraph (GPU), requiring at least 2 samples (1 for each backend).
When maxBatchSize < 2, fall back to MPSGraph-only which provides more
stable latency and avoids CoreML dispatch overhead.

This also enables explicit single-threaded GPU-only execution via
nnMaxBatchSize=1, useful for debugging or deterministic behavior.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Optimize hybrid dispatch convergence with adaptive EMA parameters:
- Lower alpha (0.15) and warm-start ratio (0.47) for faster convergence
- Adaptive warmup phase with variance-based transition
- Remove unnecessary NSLock (thread-safe by design via single-owner access)
- Add diagnostic logging via KATAGO_HYBRID_DIAG environment variable

The thread safety is ensured without locks because:
1. Each server thread owns its own ThroughputTracker instance
2. Concurrent queue access is to disjoint fields only
3. group.wait() provides sequential barrier before shared reads

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
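A plausible reconstruction of the EMA logic described above (this tracker was later removed by the multiplexer refactor; the alpha of 0.15 and warm-start ratio of 0.47 come from the commit, but the specific update rule below, steering the split toward the GPU's share of total measured throughput, is an assumption for illustration):

```cpp
// Sketch of an EMA-based batch-split tracker. Each server thread owns
// its own instance, which is why no lock is needed: all mutation happens
// on the owning thread between hybrid dispatches.
class ThroughputTracker {
public:
  double ratio() const { return gpuRatio; }

  // rows/sec observed on each side during the last hybrid batch.
  void update(double gpuRowsPerSec, double aneRowsPerSec) {
    double target = gpuRowsPerSec / (gpuRowsPerSec + aneRowsPerSec);
    gpuRatio = (1.0 - alpha) * gpuRatio + alpha * target;  // EMA step
  }

private:
  static constexpr double alpha = 0.15;  // EMA smoothing factor
  double gpuRatio = 0.47;                // warm-start GPU share
};
```

A lower alpha smooths out per-batch noise at the cost of slower convergence; the warm start near 0.5 means the first batches are split roughly evenly while measurements accumulate.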
Consolidate the previously separate CoreML and MPSGraph backends into a
unified Metal backend. The hybrid architecture (CPU+GPU+ANE) is preserved
but exposed as a single -DUSE_BACKEND=METAL build option.

Key changes:
- Remove standalone CoreML backend files (coremlbackend.{cpp,h,swift})
- Merge CoreML logic into metalbackend.{cpp,h,swift}
- Rename mpsgraphlayers.swift to metallayers.swift
- Rename CoreMLProcess namespace to MetalProcess
- Update CMakeLists.txt: remove COREML backend, keep METAL only
- Update CI workflow to use METAL backend
- Prefer MPSGraph over CoreML for batch size 1 (better latency)
- Add autoreleasepool to CoreML dispatch path
- Remove unused alpha constant from ThroughputTracker
- Fix minor comment and log prefix inconsistencies

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@ChinChangYang changed the title from "Add CoreML backend with hybrid CPU+GPU+ANE inference for macOS" to "Improve Metal backend with hybrid CPU+GPU+ANE inference for macOS" Jan 24, 2026
ChinChangYang and others added 8 commits February 23, 2026 18:19
Remove intra-batch HybridComputeHandle that split work between CoreML
and MPSGraph within a single thread. Instead, each server thread now
runs exclusively as GPU (MPSGraph, gpuIdx=0) or ANE (CoreML, gpuIdx=100),
configured via metalDeviceToUseThread<N>. This eliminates the
ThroughputTracker, adaptive batch ratio, and dispatch queue complexity
in favor of thread-level multiplexing managed by KataGo's existing
multi-server-thread infrastructure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Validate gpuIdx in createComputeHandle to warn on unrecognized values
and default to GPU mode. Warn when ANE mode is used with FP32 since
CoreML FP32 bypasses ANE and runs on CPU only. Fix printMetalDevices
to accurately describe available modes after the multiplexer refactor.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove non-existent --sweep-backends benchmark reference
- Update FP32+ANE warning to reference correct config keys
  (metalDeviceToUseThread<N>, metalUseFP16)
- Add metalUseFP16 setting to Metal sections in gtp and analysis configs
- Remove outdated Swift doc comment on createMPSGraphOnlyHandle
- Add clarifying comment for unused maxBatchSize parameter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use `hasMPSGraph == hasCoreML` instead of `hasMPSGraph + hasCoreML != 1`
to avoid implicit bool-to-int promotion. The error message now reports
"both" or "neither" instead of a numeric count.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
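A small standalone illustration of this check (hypothetical helper, not the backend code): exactly one of the two handles must exist, and comparing the bools directly both sidesteps the implicit bool-to-int promotion of `hasMPSGraph + hasCoreML != 1` and lets the message name the failure mode.

```cpp
#include <string>

// Exactly one handle must be set. hasMPSGraph == hasCoreML is true
// precisely when both are set or neither is, i.e. in both error cases.
std::string validateHandles(bool hasMPSGraph, bool hasCoreML) {
  if (hasMPSGraph == hasCoreML)
    return hasMPSGraph ? "error: both handles set"
                       : "error: neither handle set";
  return "ok";
}
```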
Switch from graph.run() to graph.encode() with explicit command buffer
management (commit/waitUntilCompleted). This enables GPU error checking
via commandBuffer.error and lays groundwork for future pipelining.

Also consolidates variable declarations and reduces verbosity in the
apply() method for better readability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The COREML backend was unified into METAL, but two leftover references
remained: a dead #ifdef block in benchmark.cpp and an outdated comment
in .gitignore.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Updated the CMakeLists.txt to eliminate the COREML backend condition, retaining only the METAL backend check. Additionally, removed references to COREML backend files from .gitignore to reflect the recent unification of backends.
@ChinChangYang marked this pull request as ready for review February 27, 2026 01:46