feat: add simdgroup-optimized Metal kernels #172
Open
cluster2600 wants to merge 33 commits into alibaba:main
- backends/detect.py: Hardware detection
- backends/gpu.py: FAISS GPU integration
- backends/quantization.py: Product Quantization
- backends/opq.py: OPQ + Scalar Quantization
- backends/search.py: Search optimization
- backends/hnsw.py: HNSW implementation
- backends/apple_silicon.py: Apple Silicon optimization
- backends/benchmark.py: Benchmarks

Internal sprint work, not for upstream PR.
- ShardManager for vector sharding
- DistributedIndex with scatter-gather queries
- QueryRouter for routing strategies
- ResultMerger for merging results from shards
- Support for hash, range, and random sharding
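As a rough illustration of the scatter-gather flow this commit describes, the sketch below hash-shards vectors, searches each shard, and merges the per-shard top-k lists. The function names (shard_for, search_shard, scatter_gather) are hypothetical and are not taken from the PR's ShardManager, DistributedIndex, or ResultMerger classes; brute-force squared L2 stands in for a real per-shard index.

```python
import heapq
import zlib


def shard_for(vec_id: str, n_shards: int) -> int:
    """Hash sharding: a stable hash of the id picks the shard."""
    return zlib.crc32(vec_id.encode("utf-8")) % n_shards


def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_shard(shard, query, k):
    """Brute-force top-k inside one shard, returned sorted by distance."""
    scored = ((l2_squared(vec, query), vec_id) for vec_id, vec in shard.items())
    return heapq.nsmallest(k, scored)


def scatter_gather(shards, query, k):
    """Scatter the query to every shard, then merge the sorted partial results."""
    partials = [search_shard(shard, query, k) for shard in shards]
    return list(heapq.merge(*partials))[:k]


# Build 4 shards from a tiny illustrative dataset.
data = {f"v{i}": [float(i), float(i)] for i in range(20)}
shards = [{} for _ in range(4)]
for vec_id, vec in data.items():
    shards[shard_for(vec_id, 4)][vec_id] = vec

top3 = scatter_gather(shards, [5.1, 5.1], 3)
```

Because every shard returns its own k best candidates, the merged list is guaranteed to contain the global top k, at the cost of transferring up to k results per shard.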
- Add README.md with full API documentation
- Add BENCHMARK_README.md with benchmark results
- Add test_backends.py with comprehensive tests
- Adjust k to avoid sampling errors
- Simplify k-means implementation
- Fix codebooks shape
Based on cuVS documentation:
- Support for CAGRA, IVF-PQ, HNSW algorithms
- 12x faster builds, 8x lower latency target
- Dynamic batching for CAGRA
Based on cuVS documentation:
- IVF-PQ: 12x faster builds, 8x lower latency
- CAGRA: 10x latency with dynamic batching, 8x throughput
- Both support fallback when cuVS not available
- 9x speedup target vs CPU
- Compatible with DiskANN
Based on arXiv:2401.11324:
- Synthetic clustered data generation
- FAISS CPU/GPU/IVF-PQ benchmarks
- cuVS placeholder benchmarks
- Results output to markdown
S3: GPU-PIM collaboration research
S4: Memory coalescing kernel (2-8x speedup)
S5: Apple ANE optimization guide
S6: ANE vs MPS benchmark
S7: Graph reordering (15% QPS gain)
S8: PIM evaluation framework

All based on scientific papers.
1. cuVS C++ bindings (zvec_cuvs.h)
   - IVFPQ, CAGRA, HNSW index classes
   - Template-based for float/uint8_t/int8_t
2. CUDA coalesced kernels (coalesce.cuh, coalesce.cu)
   - Coalesced L2 distance (2-8x speedup)
   - Warp-level reductions
   - FP16 support
   - Tiled shared memory version
3. Metal MPS kernels (distance.metal)
   - L2 distance with SIMD/NEON
   - FP16 support for Apple Silicon
   - Batch processing
   - Matrix multiplication

All based on scientific papers.
1. SIMD CPU optimization (simd_distance.h)
   - SSE2, AVX2 for x86
   - NEON for ARM/Apple Silicon
   - 4-16x speedup expected
2. CMake build system (CMakeLists.txt)
   - CUDA coalesced kernels
   - Metal shaders
   - SIMD CPU
   - Optional cuVS integration
3. Graph-based ANN (graph_ann.h)
   - CAGRA-like implementation
   - NN-Descent graph construction
   - Hierarchical search
1. FastScan (simd_distance.h)
   - SIMD-optimized Product Quantization
   - AVX2 distance computation
   - Bitonic sort for k-selection
2. Vamana Graph (vamana.h)
   - DiskANN algorithm
   - Robust to search parameters
   - Used in Azure AI Search
3. NUMA-aware (numa.h)
   - Per-NUMA-node allocation
   - Work-stealing thread pool
   - 6-20x speedup on multi-socket

Based on papers:
- Quake (OSDI 2025): NUMA-aware partitioning
- FAISS (2024): FastScan SIMD optimization
- DiskANN: Vamana graph
1. Lock-free concurrent structures (lockfree.h)
   - LockFreeVector (Stroustrup design)
   - AtomicIndex for HNSW
   - Hazard pointer reclamation
2. Memory pool optimizations (memory_pool.h)
   - Aligned allocator (cache-line, huge pages)
   - Object pool
   - Slab allocator
   - SoA layout
3. Batch processing (batch.h)
   - Transposed matrix for PQ (30-50% faster)
   - Loop unrolling
   - AVX-512 support
   - PQ distance tables

Based on:
- FAISS optimization guide
- Stroustrup lock-free vector
- OptiTrust paper (2024)
Add 6 new Metal compute kernels using simdgroup cooperative intrinsics (simd_sum, simd_min, simd_shuffle) for hardware-accelerated reductions across 32 SIMD lanes without shared memory barriers:
- metal_l2_distance_simdgroup: cooperative L2 distance
- metal_inner_product_simdgroup: cooperative dot product
- metal_cosine_similarity_simdgroup: normalized inner product
- metal_topk_simdgroup: per-query top-k selection via simd_min
- metal_matmul_tiled: tiled matmul with threadgroup shared memory
- metal_normalize_simdgroup: in-place L2 normalization

Also fixes existing kernels:
- Replace simd_make_float4 with float4 constructor (MSL compliance)
- Add device address space qualifiers in batch kernel

Tested: compiles cleanly with metal -std=metal3.1 -W -Werror.

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
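To make the reduction pattern concrete, here is a CPU-side Python emulation of how 32 lanes might cooperate on a single L2 distance: each lane accumulates a strided slice of the dimensions, and a stand-in for simd_sum combines the 32 partial sums. This is an assumption about the kernel's structure rather than the PR's Metal source; SIMD_WIDTH, simd_sum_emulated, and l2_distance_simdgroup_emulated are illustrative names, and a real kernel may return the squared distance instead of taking the square root.

```python
import math

SIMD_WIDTH = 32  # lanes per simdgroup, as stated in the commit message


def simd_sum_emulated(lane_values):
    """Stand-in for the simd_sum intrinsic: every lane receives the total."""
    total = math.fsum(lane_values)
    return [total] * len(lane_values)


def l2_distance_simdgroup_emulated(query, db_vec):
    """Each lane handles dimensions lane, lane + 32, lane + 64, ..."""
    dim = len(query)
    partial = [0.0] * SIMD_WIDTH
    for lane in range(SIMD_WIDTH):
        for d in range(lane, dim, SIMD_WIDTH):
            diff = query[d] - db_vec[d]
            partial[lane] += diff * diff
    # One cooperative reduction replaces a threadgroup-memory tree reduction.
    reduced = simd_sum_emulated(partial)
    return math.sqrt(reduced[0])  # on the GPU, a single lane would write this


query = [0.0] * 128
db_vec = [1.0] * 128
dist = l2_distance_simdgroup_emulated(query, db_vec)
```

On the GPU the outer loop over lanes runs in parallel, so the per-vector cost is roughly dim / 32 multiply-adds plus one reduction.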
- cuvs_cagra.py: use cagra.build(IndexParams, dataset) and cagra.search(SearchParams, index, queries, k) instead of the non-existent Index().build() / Index().search() methods
- cuvs_ivf_pq.py: same pattern fix, plus correct import path (cuvs.neighbors.ivf_pq instead of cuvs.ivf_pq)
- Both backends now convert numpy queries to cupy device arrays before search (cuVS requires CUDA-compatible memory)

Tested on RTX 4090:
- cuVS CAGRA: 43K QPS (50K vectors, dim=128)
- cuVS IVF-PQ: 45K QPS (50K vectors, dim=128)
- FAISS GPU: 529K QPS (50K vectors, dim=128, flat)

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
Add CUVS_AVAILABLE and CPP_CUVS_AVAILABLE flags to detect.py.
Update get_optimal_backend() priority chain: C++ cuVS > Python cuVS > FAISS GPU > MPS > FAISS CPU > NumPy

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
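The priority chain above amounts to a first-match scan over an ordered list. The sketch below shows that idea only; the backend labels and the get_optimal_backend_sketch function are hypothetical, and the real get_optimal_backend() in detect.py may probe hardware and imports differently.

```python
# Priority order taken from the commit message; the keys are hypothetical labels.
PRIORITY = ["cpp_cuvs", "python_cuvs", "faiss_gpu", "mps", "faiss_cpu", "numpy"]


def get_optimal_backend_sketch(available: dict) -> str:
    """Return the first available backend in priority order; NumPy is the floor."""
    for name in PRIORITY:
        if available.get(name, False):
            return name
    return "numpy"


choice_mac = get_optimal_backend_sketch({"mps": True, "faiss_cpu": True})
choice_gpu = get_optimal_backend_sketch({"python_cuvs": True, "faiss_gpu": True})
choice_none = get_optimal_backend_sketch({})
```

Keeping the order in one list means a new backend only needs a new entry and an availability flag.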
This was referenced Feb 25, 2026
Contributor (Author): Discussion issue opened: #177. Feedback welcome before review.
Collaborator: @greptile
Greptile Summary
This PR adds hardware-accelerated Metal compute kernels for Apple Silicon, introducing 6 new simdgroup-optimized kernels that leverage cooperative SIMD intrinsics.
Key Changes
Issues Found
The implementation follows proper Metal Shading Language conventions and correctly uses simdgroup cooperative reductions for hardware acceleration.
Confidence Score: 4/5
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
Start[Vector Search Request] --> Detect[Hardware Detection]
Detect --> CheckCppCuvs{C++ cuVS<br/>Available?}
CheckCppCuvs -->|Yes| UseCppCuvs[Use C++ cuVS<br/>CAGRA/IVF-PQ]
CheckCppCuvs -->|No| CheckPyCuvs{Python cuVS<br/>Available?}
CheckPyCuvs -->|Yes| UsePyCuvs[Use Python cuVS<br/>CuPy arrays]
CheckPyCuvs -->|No| CheckFaissGpu{FAISS GPU +<br/>NVIDIA GPU?}
CheckFaissGpu -->|Yes| UseFaissGpu[Use FAISS GPU]
CheckFaissGpu -->|No| CheckMps{Apple Silicon<br/>MPS?}
CheckMps -->|Yes| UseMps[Use Metal Kernels<br/>simdgroup ops]
CheckMps -->|No| CheckFaissCpu{FAISS CPU?}
CheckFaissCpu -->|Yes| UseFaissCpu[Use FAISS CPU<br/>+ Accelerate]
CheckFaissCpu -->|No| UseNumpy[Fallback to NumPy]
UseCppCuvs --> Execute[Execute Search]
UsePyCuvs --> Execute
UseFaissGpu --> Execute
UseMps --> MetalKernels[Metal Kernels:<br/>L2/cosine/topk]
MetalKernels --> Execute
UseFaissCpu --> Execute
UseNumpy --> Execute
Execute --> Results[Return distances<br/>+ indices]
style UseMps fill:#a8dadc
style MetalKernels fill:#a8dadc
style UseCppCuvs fill:#f1faee
style UsePyCuvs fill:#f1faee
Last reviewed commit: b08a835
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
- Fix F821 (undefined np): add module-level numpy imports for type annotations
- Fix PLC0415: add noqa for intentional lazy imports inside functions
- Fix G004: convert f-string logging to lazy % formatting
- Fix NPY002: add noqa for legacy numpy random calls in benchmarks
- Fix ARG001/ARG002: prefix unused args with underscore
- Fix PTH123: use Path.open() instead of open()
- Fix I001: sort imports in __init__.py
- Exclude *.ipynb from ruff (demo/benchmark notebooks)

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
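For readers unfamiliar with these ruff rules, the short example below shows the after-state of three of them (G004, PTH123, ARG001) on made-up code. The logger name, function names, and file name are illustrative and do not come from the PR.

```python
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("backends.example")  # hypothetical logger name


def save_results(path: Path, rows: list) -> int:
    """Write rows to disk using Path.open() (PTH123) and lazy logging (G004)."""
    with path.open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(row + "\n")
    # Lazy % formatting: the message is only built if INFO is enabled.
    logger.info("wrote %d rows to %s", len(rows), path)
    return len(rows)


def search(query: list, _unused_filter: object = None) -> list:
    """ARG001: an intentionally unused argument gets a leading underscore."""
    return sorted(query)


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "results.md"
    written = save_results(out, ["a", "b"])
    content = out.read_text(encoding="utf-8")
```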
Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
The GPU CMakeLists.txt requires CUDA toolkit (nvcc) which is not available on CI runners. The C++ headers are header-only and do not need a separate build system. Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
- Remove spurious .T in asymmetric_distance_computation() that caused a broadcast shape mismatch (10,100) vs (100,10)
- Fix off-by-one in test_distributed_index: assert shard count == 4 instead of checking for non-existent shard index 4
- Skip TestGPUIndex when FAISS is not installed instead of raising RuntimeError

Signed-off-by: Maxime Kawawa-Beaudan <maxkb@meta.com>
Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
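For context on the first fix, asymmetric distance computation for Product Quantization should yield one distance per encoded database vector, which is why a stray transpose breaks the shapes. The pure-Python sketch below illustrates the lookup-table approach under assumed data layouts; it is not the PR's asymmetric_distance_computation(), and build_distance_table and asymmetric_distances are hypothetical names.

```python
def build_distance_table(query, codebooks):
    """table[m][c] = squared L2 between query sub-vector m and centroid c."""
    n_sub = len(codebooks)
    sub_dim = len(query) // n_sub
    table = []
    for m, centroids in enumerate(codebooks):
        q_sub = query[m * sub_dim:(m + 1) * sub_dim]
        table.append([
            sum((q - c) ** 2 for q, c in zip(q_sub, centroid))
            for centroid in centroids
        ])
    return table


def asymmetric_distances(query, codes, codebooks):
    """One distance per encoded vector: sum table lookups over sub-quantizers."""
    table = build_distance_table(query, codebooks)
    return [
        sum(table[m][code] for m, code in enumerate(vec_codes))
        for vec_codes in codes
    ]


# 2 sub-quantizers, 2 centroids each, sub-vector dimension 1.
codebooks = [[[0.0], [1.0]], [[0.0], [2.0]]]
codes = [[0, 0], [1, 1], [1, 0]]  # 3 encoded database vectors
dists = asymmetric_distances([1.0, 2.0], codes, codebooks)
```

The output length always equals the number of encoded vectors, which is a cheap invariant to assert in a regression test.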
Summary
Adds 6 new Metal compute kernels using simdgroup cooperative intrinsics (simd_sum, simd_min, simd_shuffle) for hardware-accelerated reductions across 32 SIMD lanes, with no shared memory barriers needed.
Follow-up to #166 ("Future Work: SIMD optimization").
New kernels
- metal_l2_distance_simdgroup
- metal_inner_product_simdgroup
- metal_cosine_similarity_simdgroup
- metal_topk_simdgroup (simd_min lane voting)
- metal_matmul_tiled
- metal_normalize_simdgroup
Dispatch model
- (n_database, n_queries) threadgroups
Fixes to existing kernels
- simd_make_float4 → float4 constructor (MSL compliance)
- device address space qualifiers in metal_l2_distance_batch
Merge order
Test plan
- metal -std=metal3.1 -W -Werror on macOS with Xcode Metal toolchain
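As one plausible reading of "simd_min lane voting" for top-k selection, the Python emulation below has each of 32 lanes propose its best remaining candidate, lets a simd_min stand-in pick the winner, and advances only the winning lane. This is a guess at the technique, not the PR's kernel: packing (distance, index) into a tuple is a convenience of the emulation, and simd_min_emulated and topk_simdgroup_emulated are hypothetical names.

```python
SIMD_WIDTH = 32
INF = float("inf")


def simd_min_emulated(lane_values):
    """Stand-in for simd_min: every lane learns the smallest proposal."""
    return min(lane_values)


def topk_simdgroup_emulated(distances, k):
    """Select the k smallest distances with k rounds of lane voting."""
    # Each lane owns a strided slice of candidates as sorted (dist, index) pairs.
    lanes = [
        sorted((distances[i], i) for i in range(lane, len(distances), SIMD_WIDTH))
        for lane in range(SIMD_WIDTH)
    ]
    heads = [0] * SIMD_WIDTH  # per-lane cursor into its sorted slice
    result = []
    for _ in range(min(k, len(distances))):
        # Every lane proposes its current best candidate, or a sentinel if empty.
        proposals = [
            lanes[lane][heads[lane]] if heads[lane] < len(lanes[lane]) else (INF, -1)
            for lane in range(SIMD_WIDTH)
        ]
        winner = simd_min_emulated(proposals)
        result.append(winner)
        # The lane that owned the winner advances its cursor.
        heads[winner[1] % SIMD_WIDTH] += 1
    return result


dists = [float((i * 37) % 101) for i in range(100)]
top5 = topk_simdgroup_emulated(dists, 5)
```

Each round costs one reduction across the simdgroup, so this style of selection suits small k; for large k a sort-based approach would likely be preferable.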