Feat/oxk kernels #16
- numa.rs: OXIDIZE_NUMA_REPLICATE=1 copies the mapped GGUF into one MPOL_BIND + MADV_HUGEPAGE buffer per NUMA node at load; decode chunk closures translate their matrix slice to the caller's node-local replica (TLS-cached getcpu; pinned workers are exact). Removes the ~52%-remote weight reads and the Skylake directory-write tax for the cost of one weight copy per node. Falls back silently on single-node hosts or allocation failure.
- spinpool: an idle worker waking on the 50ms park timeout no longer re-enters the spin phase (only a notify does). Two pools sharing pinned cores otherwise bleed several ms of CPU per worker per 50ms, which degraded every process on the box by ~25%.
Qwen3-30B-A3B at native 32K context window on the dual-socket CPU box: 9.7 tok/s short-form, 8.5-9.6 sustained (matched A/B: replication alone +20%, idle fix recovers the cross-server regression). THP on the replicas matters: 4KB anon pages cost ~4.5M TLB entries vs the large folios the page cache uses.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Phase 1-3 of the OXK migration plan (.cursor/plans/xeon-oxk-kernels.md), implemented in Rust std::arch intrinsics rather than C:
- New optional oxidize-kernels crate: Q4_K x Q8_K row dots (scalar reference + AVX2 x1/x4/x8) and a contiguous-range GEMV helper. Bit-exact vs the legacy tensor.rs kernels (exact-equality parity test, plus 131k-range shadow run with 0 mismatches on Xeon Silver).
- OXIDIZE_GEMV=legacy|oxk|shadow choke point in gemv_q4_k_q8_k_fused and the Q4_K expert GEMV paths. Default stays legacy; without --features oxk the build is unchanged.
- Plain-harness microbench (oxk_q4k_bench) for Gate B.
Gate results on 2x Xeon Silver 4110 (AVX2, no VNNI):
- Microbench (1 core, 30s sustained): x8 beats the legacy-style x4 structure +6.2% cache-resident, +1.9% DRAM-resident.
- E2E decode Qwen3-30B-A3B Q4_K_M, interleaved A/B (3 pairs, 28 threads): legacy 7.70/7.77/7.77 vs oxk 7.66/7.73/7.70 tok/s (oxk = 99.4%).
Decode is at the DRAM ceiling, so the Phase 5 flip-default gate (>= 100%) is NOT met; legacy remains default.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
numa.rs now supports multiple replicated regions (sorted, binary-search
translation in local_slice). With OXIDIZE_NUMA_REPLICATE=1, layer-wise
load replicates the whole GGUF only when it fits half the smallest
node's memory; past that it falls back to replicating just the dense
(non-routed-expert) tensors — on nex-n2-pro that is 5.1 GiB per node
carrying ~half the per-token weight reads. OXIDIZE_NUMA_REPLICATE=dense
forces the partial mode.
Nex-n2-pro Q4_K_M (208GB, 2x Xeon Silver 4110) decode, 64tok x 3iter:
- baseline 16T: 2.46-2.56 tok/s (prior production config)
- 28T, no replication: 3.01-3.03 tok/s (idle box; old 16T rule was
measured with two servers sharing cores)
- 28-32T + dense repl: 3.29-3.35 tok/s (+34% total)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
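The sorted, binary-search translation described for local_slice could look roughly like the sketch below. The `Region` struct, its field names, and the `translate` function are hypothetical illustrations, not the code in numa.rs; the sketch only assumes that regions are sorted by source address and do not overlap.

```rust
/// One replicated region: the original (mapped) address range and the base of
/// the node-local copy. Hypothetical layout, for illustration only.
#[derive(Debug, Clone, Copy)]
pub struct Region {
    pub src_start: usize,
    pub len: usize,
    pub replica_start: usize,
}

/// Translate `addr..addr + len` (inside the original mapping) to the matching
/// replica address. `regions` must be sorted by `src_start` and non-overlapping.
/// Returns `None` when the range is not fully covered by a single region, in
/// which case a caller would keep reading the original mapping.
pub fn translate(regions: &[Region], addr: usize, len: usize) -> Option<usize> {
    // Index of the first region that starts strictly after `addr`.
    let idx = regions.partition_point(|r| r.src_start <= addr);
    // The candidate is the region just before it (none if addr precedes all).
    let region = regions.get(idx.checked_sub(1)?)?;
    let offset = addr - region.src_start;
    let end = offset.checked_add(len)?;
    if end <= region.len {
        Some(region.replica_start + offset)
    } else {
        None
    }
}
```

Returning `None` instead of panicking mirrors the partial ("dense") mode in the commit message: tensors that were not replicated simply fall through to the shared mapping.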
…esktop CPUs
The spin pool and identity CPU pinning, both tuned on the dual-socket Nex box, collapsed decode throughput on single-socket consumer parts (Ryzen 6850H, 8C/16T, Qwen3-4B Q4_K_M: 2.96 tok/s vs 9.65 for the pre-spinpool build):
- Spin pool now defaults on only when NUMA nodes > 1. Its always-spinning workers (rayon_threads - 1 of them, on top of the rayon pool itself) starve an 8-core part: 2.96 -> 9.0 tok/s just by disabling it here. OXIDIZE_SPINPOOL=1/0 still forces either way; multi-socket hosts keep it.
- Worker pinning now uses core-first order read from sysfs (first SMT sibling of each core, then the rest). Linux enumerates sibling pairs adjacently on AMD, so the old identity map stacked 8 workers onto 4 physical cores (9.15 -> 12.0 tok/s).
- Default thread count is now the physical core count instead of available_parallelism: decode GEMV is DRAM-bound, SMT siblings only add contention (16T 11.0 vs 8T 11.6 with the other fixes in).
- `oxidize run model "prompt"` (one-shot) no longer auto-starts the background API server: it loaded the model a second time concurrently with prefill and died with the process right after generation.
Benchmarks (Ryzen 7 PRO 6850H, Qwen3-4B Q4_K_M, 128-token decode):
before: 2.96 tok/s
after: 12.0 tok/s
ollama-performance-benchmark harness (768 tokens, load included):
before: 8.60 tok/s
after: 11.12 tok/s (ollama 14.07, llama.cpp 13.60)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
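As an illustration of the core-first pinning order mentioned above, the sketch below builds the order from (logical CPU, core id) pairs. The function name and input shape are assumptions; a real implementation would read the pairs from sysfs topology files and would also need the package id on multi-socket hosts, which this single-package sketch ignores.

```rust
use std::collections::BTreeMap;

/// Build a pinning order that lists the first SMT sibling of every physical
/// core before any second sibling. `cpus` holds (logical_cpu, core_id) pairs.
/// Assumption: all CPUs belong to one package, so `core_id` is unique per core.
pub fn core_first_order(cpus: &[(usize, usize)]) -> Vec<usize> {
    // Group logical CPUs by physical core, with cores in ascending id order.
    let mut by_core: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for &(cpu, core) in cpus {
        by_core.entry(core).or_default().push(cpu);
    }
    for siblings in by_core.values_mut() {
        siblings.sort_unstable();
    }
    let max_smt = by_core.values().map(Vec::len).max().unwrap_or(0);
    let mut order = Vec::with_capacity(cpus.len());
    // Rank 0 = first sibling of each core, rank 1 = second sibling, and so on.
    for rank in 0..max_smt {
        for siblings in by_core.values() {
            if let Some(&cpu) = siblings.get(rank) {
                order.push(cpu);
            }
        }
    }
    order
}
```

With adjacent sibling enumeration (CPUs 0 and 1 on core 0, 2 and 3 on core 1, and so on), the first eight workers of an identity map would share four cores, whereas this order gives each of the first N workers its own physical core.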
…> 0.61x to 0.81x of ollama
Decode profiling (new OXIDIZE_DECODE_PROFILE=1, per-shape GB/s + phase timers at exit) attributed the remaining gap to four causes, fixed here:
- q/k/v and gate/up ran as nested rayon::join regions whose inner par_iters stole work from each other, interleaving weight streams on the same cores: the gate/up matrices measured 19-21 GB/s vs 32+ for the same shape alone (and with the spin pool, the losing join arm ran entirely serial, which is the reason the pool collapsed to 3 tok/s on desktops). New gemv_quantized_multi_f32 runs all same-input projections as ONE flat region sharing one Q8_K quantization; chunk sizes are byte-weighted so mixed Q4_K/Q6_K jobs stay balanced. Bit-identical rows (test included).
- Attention heads now dispatch through run_chunks instead of a raw rayon region, and the parallel threshold drops 128 -> 16 (the old value left attention single-threaded for the entire early context).
- The spin pool is default-on everywhere now: with every decode hot loop on run_chunks it beats rayon's sleep/wake handoff on single-socket too, but only with the submitting thread pinned to slot 0. An unpinned submitter timeshares against spinning workers and loses ~8%.
- `oxidize run`/`serve` default KV dtype q8 -> f32: q8/f16 caches cannot be borrowed by decode attention, so every layer of every token dequantized the WHOLE K/V prefix into workspace buffers (~2 GB of copies over a 768-token run). cpu-optimized clamps ctx to 2048, bounding f32 KV at ~600 MB for a 4B model. Decode glue: 98 -> 28 us/layer.
Also: rayon fallback in run_chunks uses static block partitioning (one contiguous range per worker), MADV_COLLAPSE attempt at model load (no-op on kernels/filesystems without file-THP), GEMV shape microbench test.
Benchmarks (Ryzen 6850H, Qwen3-4B Q4_K_M, 128-tok decode, same-run pairs; absolute numbers swing ~15% with package thermals):
oxidize self-reported: 11.5 -> 12.4 tok/s (was 2.96 at yesterday's defaults, 9.65 for the pre-spinpool build)
vs llama.cpp decode-only: 0.69-0.75x (llama.cpp 13.8-16.7 same-minute)
ollama-performance-benchmark (768 tok, load included, same run): oxidize 9.50 vs ollama 11.78 = 0.81x (was 8.60 vs 14.07 = 0.61x)
Remaining known gaps: f16-KV borrow path for attention (llama.cpp attends f16 natively; f32 doubles late-context KV reads), Q4_K scale-decode SIMD, batched prefill (~30 tok/s vs llama.cpp 72).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
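One plausible reading of "byte-weighted chunk sizes" is sketched below: each projection is cut into row ranges so that every chunk streams roughly the same number of weight bytes, regardless of its quantization type. `Job`, `byte_weighted_chunks`, and the per-worker target are assumptions made for this sketch; the commit does not show how gemv_quantized_multi_f32 actually sizes its chunks.

```rust
/// One projection sharing the same input vector. Hypothetical type.
pub struct Job {
    pub rows: usize,
    /// Bytes of quantized weights per output row (differs for Q4_K vs Q6_K).
    pub bytes_per_row: usize,
}

/// Split all jobs into (job_index, row_start, row_end) chunks whose byte size
/// is close to total_bytes / workers, so a chunk of a denser format covers
/// fewer rows than a chunk of a lighter one.
pub fn byte_weighted_chunks(jobs: &[Job], workers: usize) -> Vec<(usize, usize, usize)> {
    let total_bytes: usize = jobs.iter().map(|j| j.rows * j.bytes_per_row).sum();
    let target = (total_bytes / workers.max(1)).max(1);
    let mut chunks = Vec::new();
    for (job_index, job) in jobs.iter().enumerate() {
        // At least one row per chunk, even if a single row exceeds the target.
        let rows_per_chunk = (target / job.bytes_per_row.max(1)).max(1);
        let mut start = 0;
        while start < job.rows {
            let end = (start + rows_per_chunk).min(job.rows);
            chunks.push((job_index, start, end));
            start = end;
        }
    }
    chunks
}
```

All chunks land in one flat list, which is the property the commit relies on: a single parallel region with no nested join competing for the same cores.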
…llama
Three decode fixes on top of the fused-region work:
- f16 KV cache borrow path: new KvElem trait makes the online-softmax
decode kernel generic over the KV element; u16 rows convert in-kernel
via F16C (AVX2), f32 passes through bit-identically. The KV cache gains
f16_layer_{key,value}_prefix borrows, and decode attention prefers them
before the f32 borrow / dequant-copy paths. `run`/`serve` KV default
f32 -> f16: zero-copy like f32 but half the attention DRAM reads as the
context grows, and half the memory.
- Next-quad prefetch sweep in the Q4_K/Q6_K x4 row kernels for short rows
(blocks_per_row <= 16): 10-block rows restart the hardware prefetcher
every 22 cache lines, which held every 2560-column matrix (gate/up
projections, q/k/v, lm_head) ~10-20% under the DRAM roof. The sweep
walks the next quad's row one quad-time ahead: gate/up 37 -> 45.6 GB/s,
qkv 34 -> 41.9, lm_head 38 -> 44.3, decode 13.3 -> 15.1 tok/s
(same-conditions A/B). Long rows get a deeper in-row T1 sweep instead
(down-proj relative deficit closed in the shape microbench).
- gemm_quantized_f32 now records into the OXIDIZE_DECODE_PROFILE summary
(prefill attribution; batch-43 GEMM measured ~46% of FMA peak).
Benchmarks (Ryzen 6850H, Qwen3-4B Q4_K_M, cool machine, same-run pairs):
decode-only token_forward: 70.1 -> 62.7 ms/token (15.95 tok/s)
oxidize self-reported 512 tok: 13.3 -> 15.1-15.2 tok/s
ollama-performance-benchmark (768 tok, load included, ollama runs
first on the cooler machine): oxidize 14.91 vs ollama 15.72 = 0.95x
(was 0.61x at the start of this effort)
f16 attention matches f32 within half-precision rounding (test included).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
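The idea of a decode kernel that is generic over the KV element can be sketched as follows. Only the trait name `KvElem` comes from the commit message; the method name, the software f16 conversion (the commit uses F16C), and `attend_one_head` are assumptions for illustration. The sketch assumes a non-empty query and row-major keys and values with one row per cached token.

```rust
/// Element type stored in the KV cache. Only the trait name is from the
/// commit message; the method is a hypothetical shape.
pub trait KvElem: Copy {
    fn to_f32(self) -> f32;
}

impl KvElem for f32 {
    #[inline]
    fn to_f32(self) -> f32 {
        self // pass-through, bit-identical
    }
}

/// u16 holds raw IEEE 754 binary16 bits. Software conversion for portability.
impl KvElem for u16 {
    fn to_f32(self) -> f32 {
        let sign = ((self as u32) & 0x8000) << 16;
        let exp = ((self >> 10) & 0x1f) as u32;
        let man = (self & 0x3ff) as u32;
        let bits = match (exp, man) {
            (0, 0) => sign, // signed zero
            (0, m) => {
                // Subnormal: value = m * 2^-24.
                let v = (m as f32) * (1.0 / 16_777_216.0);
                return if sign != 0 { -v } else { v };
            }
            (0x1f, m) => sign | 0x7f80_0000 | (m << 13), // inf / NaN
            (e, m) => sign | ((e + 112) << 23) | (m << 13), // rebias 15 -> 127
        };
        f32::from_bits(bits)
    }
}

/// Single-head decode attention with an online (streaming) softmax.
pub fn attend_one_head<T: KvElem>(q: &[f32], keys: &[T], values: &[T], scale: f32) -> Vec<f32> {
    let d = q.len();
    let mut out = vec![0.0f32; d];
    let mut running_max = f32::NEG_INFINITY;
    let mut denom = 0.0f32;
    for (k_row, v_row) in keys.chunks_exact(d).zip(values.chunks_exact(d)) {
        let score = scale * q.iter().zip(k_row).map(|(a, b)| a * b.to_f32()).sum::<f32>();
        let new_max = running_max.max(score);
        // Rescale what was accumulated under the old maximum (0 on first row).
        let correction = (running_max - new_max).exp();
        let w = (score - new_max).exp();
        denom = denom * correction + w;
        for (o, v) in out.iter_mut().zip(v_row) {
            *o = *o * correction + w * v.to_f32();
        }
        running_max = new_max;
    }
    if denom > 0.0 {
        for o in &mut out {
            *o /= denom;
        }
    }
    out
}
```

Because the conversion happens inside the inner loops, the f16 path reads half as many bytes per cached token without first materializing an f32 copy of the prefix.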
…ining hooks
- GDN gated RMSNorm: near-zero eps (model eps over-floored tiny delta
outputs), gate-after order matching llama.cpp's qwen3next graph, and
L2-normed q/k without 1/sqrt(d)
- canonicalize bare 'blk.N.ssm_a' (no .weight suffix) from llama.cpp
GGUFs; handle both ssm_conv1d layouts ({kernel,channels} vs
{channels,kernel})
- tokenizer: honor tokenizer.ggml.add_bos_token metadata; default BOS
only for SentencePiece (spurious BOS corrupted Qwen forward passes)
- layer-wise: forward_normed_hidden + lm_head_logits_batch batched
training entry points; warm_layer_cache; OXIDIZE_TRACE_VALS debugging
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rts them
auto (new default) routes Q4_K GEMV to oxidize-kernels when the crate is compiled in and AVX2 is available, falling back to legacy intrinsics otherwise. Also checks in the Xeon OXK migration plan.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- parse nextn_predict_layers from GGUF; exclude appended MTP draft blocks from layer_count; load blk.N.nextn.* tensors (MtpWeights)
- MtpGenerationStream: drafts from the last committed token plus its output-normalized hidden state, so prefill provides the first anchor
- CLI uses native MTP automatically when present (--no-mtp to disable, --draft-tokens to tune); accept qwen3_5_text arch aliases
- dflash: GGUF row/col dim handling fixes for draft weight loading
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ntime flags
- open_generation_stream picks DFlash draft (--draft-model, --draft-tokens), native MTP when the GGUF has nextn layers, or standard generation
- new flags: --kv-cache-dtype f32/f16/q8/q4, --threads, --ram-offload-threads, --no-turboquant-kv
- TurboQuant is now the default q4/q8 KV cache quantizer
- warm layer cache at load; qwen chat-template fallback
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- shard-by-shard streaming conversion (plan_stream_outputs + write_gguf_streaming) so model dirs no longer materialize in RAM
- --quantize target on oxidize-convert (Q4_K/Q8_0/... via quantize_linear_4bit and friends in compute/quantization)
- qwen3_5* arch aliases normalized to qwen35 metadata prefix
- gguf_layer_keys inspection bin for normalized per-layer tensor keys
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ter hot paths
- LoRA forward/backward take count rows at once (cache-friendly parallel loops instead of one rayon dispatch per token)
- trainer drives layer_wise forward_normed_hidden + lm_head_logits_batch for whole windows; fused AdamW + batched softmax cross-entropy
- dataset: chat-format JSONL support; CLI: more training knobs
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- gemv/layer/inference benches grow OXIDIZE_GEMV-aware comparison runs and Q4_K coverage
- cuda.rs/numa.rs/build.rs/metrics.rs: rustfmt-only cleanup; lib.rs module reorder
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…dth bench
- cpu.rs: CPUID vendor (Intel/AMD) + ISA summary (AVX2/FMA/AVX-VNNI/AVX512-VNNI) with a per-vendor tuning profile resolved once per process
- prefetch distance and hint are now runtime-tunable: OXIDIZE_OXK_PF (blocks, 0 disables) and OXIDIZE_OXK_PF_HINT=t0|nta
- default stays 4 blocks/T0 for both vendors: a contended 8-thread sweep on Ryzen 6850H (Zen 3+) showed pf 0/2/4 x t0/nta all within noise and pf=8 mildly worse. Zen's HW prefetcher covers the stream, so AMD shares the Xeon-tuned default instead of diverging on noise
- oxk_q4k_bench grows OXK_BENCH_THREADS contended mode (all cores streaming at once, the shape of real decode) and prints the CPU summary
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0c4292169c
22 issues found across 55 files
… contended MT bench
The Q4_K GEMV kernel was measured at ~33 GB/s and assumed "DRAM-latency
bound, kernel exhausted." That conclusion came from a benchmark bug: the
contended-mode harness spawned OS threads per pass, understating throughput
~2x. Fixed to persistent deadline-loop workers (run_mt, OXK_BENCH_MT_ONLY/
OXK_BENCH_MT_KERNEL). With a correct harness, an on-box prefetch sweep
(2x Xeon Silver 4110, Skylake-SP, DDR4-2133; 302 MB fixture, 32T, interleaved
pf in {0..8} x {t0,nta}) showed pf=1/t0 hits 72-74 GB/s = the platform's
pure-read ceiling, vs ~63.5 at the old pf=4 default and ~57 for any NTA hint.
- cpu.rs: Intel software-prefetch default 4 -> 1 block, with the measurement
recorded inline; AMD/Other unchanged (Zen sweep was within noise).
- lib.rs: select_isa() resolved once via OnceLock — it ran inside
gemv_q4k_range (per pool chunk), and the per-call env::var showed up at
>1% of decode samples (libc getenv scans the environment). Adds AVX-512F/BW,
AVX-512 VNNI, and AVX-VNNI dispatch behind runtime detection +
OXIDIZE_OXK_ISA/OXIDIZE_OXK_AVX512 overrides (AVX-512 stays off by default
on Skylake-SP: measured 71 vs 73 GB/s DRAM, 52 vs 80 cache-resident — the
frequency drop loses).
- q4k_avx512.rs: new AVX-512 / VNNI / AVX-VNNI Q4_K x Q8_K kernels, bit-exact
vs scalar (parity tests).
- q4k_avx2.rs: x16 multi-row variant, decode/dot split, vendor-tuned prefetch.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
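The "resolve once via OnceLock" change can be illustrated with the sketch below. The `Isa` enum, `parse_isa`, and the accepted override strings (`scalar`, `avx2`, `avx512`, `auto`) are assumptions; only the function name select_isa and the OXIDIZE_OXK_ISA variable appear in the commit message. AVX-512 detection is stubbed to `false` here, and a forced mode is honored only when the matching capability flag is set.

```rust
use std::sync::OnceLock;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Isa {
    Scalar,
    Avx2,
    Avx512,
}

/// Pure policy: map an optional override string plus detected capabilities to
/// an ISA. The accepted strings are assumptions made for this sketch.
pub fn parse_isa(raw: Option<&str>, avx2_ok: bool, avx512_ok: bool) -> Isa {
    let auto = if avx2_ok { Isa::Avx2 } else { Isa::Scalar };
    match raw.map(str::trim) {
        Some("scalar") => Isa::Scalar,
        Some("avx2") if avx2_ok => Isa::Avx2,
        Some("avx512") if avx512_ok => Isa::Avx512,
        // Unknown, unsupported, or absent override: fall back to auto.
        _ => auto,
    }
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn avx2_available() -> bool {
    std::arch::is_x86_feature_detected!("avx2")
}

#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn avx2_available() -> bool {
    false
}

/// Reads the environment and probes the CPU on the first call only; later
/// calls are a single atomic load, cheap enough for a per-chunk hot path.
pub fn select_isa() -> Isa {
    static ISA: OnceLock<Isa> = OnceLock::new();
    *ISA.get_or_init(|| {
        let raw = std::env::var("OXIDIZE_OXK_ISA").ok();
        // AVX-512 detection is stubbed out in this sketch.
        parse_isa(raw.as_deref(), avx2_available(), false)
    })
}
```

Keeping the policy in a pure function makes the override rules testable without touching the process environment.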
…r-layer gate+up alloc
Qwen3-30B-A3B-Q4_K_M CPU decode went from ~9 to ~12.4 tok/s (28T, interleaved A/B). Profiling (OXIDIZE_DECODE_PROFILE + perf IMC counters) showed decode was NOT bandwidth-bound (only ~20 of 73 GB/s achieved) but stalled on CPU overhead in the serial forward path:
- tensor.rs gemv_f32_cpu: the MoE router projection (every layer, every token) used a scalar `.map().sum()` f32 reduction LLVM can't vectorize (non-associative), i.e. a serial FMA chain. Switched to dot_f32_fast (AVX2 FMA, 4 independent accumulators).
- inference.rs / layer_wise.rs: OXIDIZE_TRACE_FWD/_VALS were read via env::var_os on every layer of every token. Hoisted behind cached trace_fwd_enabled()/trace_vals_enabled() OnceLock helpers.
- inference.rs moe_ffn_forward_weights: the fused gate+up branch heap-allocated `vec![0.0; 2*n_sel*i_size]` and memcpy'd it back into two scratch buffers every layer every token (~14% of main-thread decode samples). Replaced with a thread-local reusable buffer read in place by SwiGLU + down-projection; fused3 GEMV improved 34 -> 39 GB/s. Output verified coherent.
- tokenizer.rs: add the add_bos_token field to two test-only SpecialTokens initializers so the oxidize-core test binary compiles again.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
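The accumulator split behind the router fix can be shown in portable Rust. This is a sketch of the general technique, not the AVX2 dot_f32_fast from the commit: `dot_f32_4acc` is a hypothetical name, and it relies on the compiler to vectorize the four independent lanes.

```rust
/// Dot product with four independent partial sums. A single `sum()` forms one
/// dependency chain because f32 addition is not associative and the compiler
/// may not reorder it; four lanes let the adds overlap.
pub fn dot_f32_4acc(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = [0.0f32; 4];
    let mut chunks_a = a.chunks_exact(4);
    let mut chunks_b = b.chunks_exact(4);
    for (x, y) in (&mut chunks_a).zip(&mut chunks_b) {
        for lane in 0..4 {
            acc[lane] += x[lane] * y[lane];
        }
    }
    // Leftover elements when the length is not a multiple of four.
    let mut tail = 0.0f32;
    for (x, y) in chunks_a.remainder().iter().zip(chunks_b.remainder()) {
        tail += x * y;
    }
    (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail
}
```

Note that the result can differ from a strictly sequential sum in the last bits, since the additions are grouped differently.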
… optimal
"Optimize OXK" investigation outcome: the kernel is already at the right
tiling on the target hardware. A single-threaded microbench suggested x1 beats
the wide tiles on Skylake-SP (4.23 vs 3.76 GB/s — register pressure from 8 Q8
ymm vectors held live across 8-16 row dots), but that bench is L3-resident.
A decisive interleaved e2e A/B on Qwen3-30B-A3B (28T, cold-DRAM expert reads)
showed the opposite and monotone: tile16 11.7/10.0 > tile8 7.5/7.0 >
tile1 4.8/4.3 tok/s. The wide tile's 16 independent outstanding loads hide DRAM
latency, which is what actually limits decode — so narrowing the tile would
have ~halved throughput.
gemv_q4k_range now gates its x16->x8->x4->x1 cascade on a once-resolved
max_tile(), default 16 (== prior behavior, verified no regression), overridable
via OXIDIZE_OXK_TILE={1,4,8,16} for retuning on other parts (e.g. VNNI cores).
Bit-identical regardless of width.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
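A minimal sketch of how an x16->x8->x4->x1 cascade capped by a maximum tile width might carve up a row range is shown below. `tile_plan` is a hypothetical helper that only returns the plan; the real gemv_q4k_range presumably calls the matching kernel for each tile instead.

```rust
/// Return (first_row, tile_width) pairs covering `rows` rows, using the widest
/// allowed tile first and falling through to narrower ones for the remainder.
/// Widths above `max_tile` are skipped; width 1 is always permitted.
pub fn tile_plan(rows: usize, max_tile: usize) -> Vec<(usize, usize)> {
    let mut plan = Vec::new();
    let mut row = 0;
    for tile in [16usize, 8, 4, 1] {
        if tile > max_tile && tile != 1 {
            continue;
        }
        while rows - row >= tile {
            plan.push((row, tile));
            row += tile;
        }
    }
    plan
}
```

Since the width only changes how rows are grouped and each row's dot product is computed independently, the result can stay bit-identical across widths, as the commit message states.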
5 issues found across 11 files (changes from recent commits).
# Conflicts: # oxidize-core/build.rs # oxidize-core/kernels/gemv_f32.cu # oxidize-core/src/backends/cuda.rs # oxidize-core/src/compute/tensor.rs
Spinpool panic-safety:
- P0: submitter catches panics in its own chunk range and still drains worker acks before returning, so workers never call the fat-pointer closure after its borrow ends (use-after-free).
- P1: workers ack even when a chunk panics (and stay alive), so one panicking chunk can no longer deadlock the pool.
oxidize-kernels:
- Forced OXIDIZE_OXK_ISA modes are now gated by the same availability checks as auto, so forcing an unsupported ISA can't execute illegal instructions.
- q4k_avx2 next-tile prefetch no longer double-counts the row offset and uses wrapping_add (was UB via .add past the allocation).
- AVX-VNNI detection reads CPUID leaf 7 subleaf 1 EAX[4] (was subleaf 0 EDX[4]).
- MT x1 bench path runtime-guards the AVX2 kernel.
NUMA:
- Freed replica mappings on a lost REGIONS.set race (was leaking GBs).
- Robust online-node parsing for comma/range lists; node bitmask sized per node id (was capped at 64 nodes / UB shift).
Correctness:
- flash_attention: overflow-checked head q_len so the unsafe per-head output slices can't run past the buffer.
- conversion: a fused gate_up_proj that fails to split is now a hard error (matches the streaming path) instead of emitting a broken MoE GGUF.
- safetensors->gguf: I16 (ggml type 25) byte size is 2; non-index/file conversions now honor target_quantization.
- dflash: dequant fallback transpose used swapped dims, corrupting weights when the quantized GEMV path was skipped; it now mirrors the primary loader.
- quantization: Q4_K_S errors keep their variant (were mislabeled Q4_K_M).
- MTP stream: budget/stop checks run before draining the emit buffer, so a multi-token step can't over-emit past max_new_tokens or past a stop token.
Build / perf / hygiene:
- build.rs: gate PTX compilation on CARGO_FEATURE_CUDA, probe nvcc.exe on Windows, drop the dead OXIDIZE_CUDA_PTX env.
- cuda: GpuState now destroys its cuBLAS handle on drop.
- inference_bench: reuse one layer's weights (was ~22GB OOM at 7B dims).
- fused MoE: don't zero gate/up scratch on the fused early-return path.
- finetuning: serde(default) for backward-compatible configs; skip (not clamp) out-of-range CE targets; avoid full-vector clone before truncation; keep packing-buffer capacity across flushes.
- plan doc: mermaid phase numbering matches the phase text.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
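For the "comma/range lists" item under NUMA, a parser along these lines would handle strings such as `0-1` or `0,2-3,7`, the format Linux uses for node lists. `parse_node_list` and the `MAX_SPAN` sanity cap are assumptions for this sketch, not the code in numa.rs.

```rust
/// Largest range width accepted in one `lo-hi` item. Hypothetical sanity cap
/// so a corrupt input cannot trigger a huge allocation.
const MAX_SPAN: usize = 4096;

/// Parse a node list such as "0", "0-1", or "0,2-3,7" into sorted unique ids.
/// Returns `None` on any malformed item instead of guessing.
pub fn parse_node_list(s: &str) -> Option<Vec<usize>> {
    let mut nodes = Vec::new();
    for part in s.trim().split(',') {
        let part = part.trim();
        if part.is_empty() {
            continue; // tolerate empty input and stray commas
        }
        match part.split_once('-') {
            Some((lo, hi)) => {
                let lo: usize = lo.trim().parse().ok()?;
                let hi: usize = hi.trim().parse().ok()?;
                if lo > hi || hi - lo > MAX_SPAN {
                    return None;
                }
                nodes.extend(lo..=hi);
            }
            None => nodes.push(part.parse().ok()?),
        }
    }
    nodes.sort_unstable();
    nodes.dedup();
    Some(nodes)
}
```

Returning ids rather than a fixed-width bitmask sidesteps the 64-node shift problem the commit mentions; a caller can size its mask from the largest id.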
4 issues found across 19 files (changes from recent commits).
…K kernels
Add diffusion-gemma (Gemma-4 26B-A4B MoE block-diffusion) support, ported faithfully from the llama.cpp diffusion-gemma4 reference graph (PR #24427):
- oxidize-core/src/model/diffusion_gemma.rs: GGUF loader + bidirectional canvas forward (QK-norm, scale-less V-norm, V=K on full layers, dual head dims 256/512, NEOX rope with proportional rope_freqs on full layers, attn scale 1.0), dual dense+routed-MoE FFN (128 experts top-8, fused gate_up split, per-expert/router scales), self-conditioning MLP, layer output scalar, final logit softcap, tied output head. Q5_0 down-projections (unsupported by OXK gemv) are dequantized to f32 at load.
- 48-step entropy-bound denoise loop (linear temp schedule, entropy-bound accept, stable-and-confident stop) matching the reference sampler.
- oxidize-cli bin diffusion_gemma_bench: runs one canvas, reports canvas tok/s + per-step mean-entropy trace.
Build with --features oxk and run with OXIDIZE_GEMV=oxk to exercise the OXK kernels.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…_0->Q8_0
Make the DiffusionGemma OXK path correct *and* fast on CPU:
- Requantize Q5_0 down-projections to Q8_0 at load (OXK gemv has no Q5_0 path) instead of a scalar f32 fallback: near-lossless, ~4x less RAM, stays on the fast SIMD experts kernel.
- Batch the tied output head into one big GEMM over the whole canvas instead of 256 sequential per-token GEMVs.
- Parallelize the scalar hot loops across the 256 canvas tokens (bidirectional attention, full-vocab softmax/entropy/sample) with rayon, avoiding nested parallelism with the kernels' own row-parallelism (nesting measured 2-4x slower; the per-token MoE stays sequential so its inner experts GEMV keeps the single level of parallelism).
Result on the 32-core CPU box (Q4_K_M): ~113 -> ~60 s/step, full core utilization, entropy collapse unchanged (correctness preserved). Remaining gap to llama.cpp's ~12 s/step is its batched mul_mat_id experts kernel.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… bench
Batch the routed MoE across the whole canvas: all nt*N_USED (token, expert) pairs flow through ONE gate_up experts GEMV and ONE down experts GEMV (flattened selections + per-slot strided inputs), giving a single level of rayon parallelism over the full output instead of 256 nested per-token calls.
Result on the 32-core CPU box (Q4_K_M, 'What is the capital of France?'):
- 60 -> 30 s/step; converges in 6 denoising steps (early-stop), 181 s total
- 1.41 canvas tok/s end-to-end (vs llama.cpp reference ~1.0)
- correct, coherent output: 'The capital of France is **Paris**.'
bench now decodes and prints the final canvas via the GGUF tokenizer.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resolve conflicts from PR #15 CUDA performance fixes while keeping OXK kernel paths, NUMA replication, MTP conversion, and speculative decoding. Ignore local deploy/ scripts; set core.filemode=false for external drive. Co-authored-by: Cursor <cursoragent@cursor.com>
bd2a4a6 to
c1958ef
Only declare known binaries explicitly; gate diffusion_gemma_bench behind oxk. Co-authored-by: Cursor <cursoragent@cursor.com>
The OXK crate is a workspace member; Docker smoke tests failed without it. Co-authored-by: Cursor <cursoragent@cursor.com>
Cargo validates [[bench]] paths at manifest load time. Co-authored-by: Cursor <cursoragent@cursor.com>
Unblocks PR #16 CI after the master merge by fixing lint in core, server, CLI, finetuning, and bench targets. Co-authored-by: Cursor <cursoragent@cursor.com>
2 issues found across 11 files (changes from recent commits).
Implements activation-aware one-shot pruning (Wanda, arxiv:2306.11695)
and per-output-row magnitude pruning (Han et al. 2015, with the
per-row comparison group from Wanda Table 7) on top of the existing
tensor-name substring filter. Both methods work on quantized GGUFs:
weights are dequantized to f32, masked, and re-quantized to the
original type (or a joint target via --joint-quantize).
New surface:
- oxidize-core::activation_stats::ActivationStats + CalibrationRunner:
streaming per-input-neuron L2 accumulator (Wanda's X side).
- oxidize-prune::mask::{magnitude_mask, wanda_mask, apply_nm_pattern}:
pure-Rust Wanda + 2:4 / 4:8 N:M structured mask primitives.
- oxidize-prune::wanda::{wanda_prune, magnitude_prune, WandaOptions}:
full GGUF round-trip; reads quantized bytes, masks, requantizes,
writes a new GGUF.
- L2-norms cache: simple text format, one row per linear weight, N
f32 values per row. Loaded via --calibration; validated against
the input GGUF.
- oxidize convert --prune wanda|magnitude: single-pass prune+quantize
on a freshly-converted SafeTensors GGUF.
Tests: 14 in oxidize-prune (mask, wanda, magnitude, calibration
cache roundtrip, N:M patterns, full dequant/quant roundtrip) and 7
in oxidize-core (activation_stats streaming, merge, runner finalize).
All passing.
Plan: ~/.commandcode/plans/make-pruning-and-inference-faster.md
Refs: arxiv:2306.11695 (Wanda), arxiv:2301.00774 (SparseGPT).
Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
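As a rough illustration of the Wanda criterion with a per-output-row comparison group, the sketch below scores each weight as |W_ij| times the L2 norm of input feature j and drops the lowest-scoring fraction in every row. The function `wanda_keep_mask`, its signature, and the boolean-mask output are assumptions for this sketch and are not the oxidize-prune API.

```rust
/// Compute a keep-mask for a row-major weight matrix with `in_norms.len()`
/// columns. Within each output row, the `sparsity` fraction of weights with
/// the smallest score |w| * ||x_j||_2 is marked for pruning (false).
pub fn wanda_keep_mask(weights: &[f32], in_norms: &[f32], sparsity: f32) -> Vec<bool> {
    let cols = in_norms.len();
    assert!(cols > 0 && weights.len() % cols == 0);
    let prune_per_row = ((cols as f32) * sparsity.clamp(0.0, 1.0)).floor() as usize;
    let mut keep = vec![true; weights.len()];
    let mut order: Vec<usize> = Vec::with_capacity(cols);
    for (r, row) in weights.chunks_exact(cols).enumerate() {
        order.clear();
        order.extend(0..cols);
        // total_cmp gives a total order even if a score is NaN.
        order.sort_by(|&a, &b| {
            let score_a = row[a].abs() * in_norms[a];
            let score_b = row[b].abs() * in_norms[b];
            score_a.total_cmp(&score_b)
        });
        for &c in &order[..prune_per_row] {
            keep[r * cols + c] = false;
        }
    }
    keep
}
```

Passing all-ones norms reduces this to per-row magnitude pruning, which is one way to see how the two methods in the commit relate.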
Add `oxidize-core::autotune`: a stateless orchestrator that detects the host (CPU/ISA/RAM/NUMA/GPU/Metal/CUDA/WSL/cgroup memory limit/hugepages), fingerprints the loaded GGUF model (architecture, dims, MoE/MTP, dominant qtype, file size), and produces a `TuningPlan` for the most-relevant inference knobs: threads, ctx_size, kv_cache_dtype, kv_quantization, n_gpu_layers, mmap/mlock/mmap_hugepages/mmap_prefetch, numa_replicate_dense, layer_wise, layer_cache, pipeline (Sequential/Continuous/Paged/Asymmetric), speculative (None/DFlash/Mtp), decode_tile_tokens (FlashDecoding split-K), oxk_isa (Scalar/Avx2/Avx512), oxk_tile (1/4/8/16), and a tok/s estimate.
Rules are an ordered table at `oxidize-core/src/autotune/rules.rs`:
- Tier 0: model-too-big-for-RAM forces layer_wise streaming.
- Tier 1: ISA + Skylake-SP gate (which disables AVX-512 on the regressing uarch; we lift `is_skylake_sp()` to public in `oxidize-kernels::cpu`).
- Tier 2: GPU offload (whole model on GPU when it fits; partial n_gpu_layers sized to 0.85 × usable VRAM per-layer; skip entirely when VRAM < 25% of model size).
- Tier 3: KV cache dtype (F16 on >=16 GiB VRAM, asymmetric INT8 in the 8–16 GiB band, TurboQuant INT4 on low-VRAM / very deep models) + ctx size capped to fit `total_ram * 0.6 - model`.
- Tier 4: layer cache + NUMA replication (NUMA only on dense, non-trivial core count, with a SIMD backend present).
- Tier 5: speculative decoding (MTP if `nextn.*` tensors, DFlash for qwen/llama/lfm2).
- Tier 6: threads (full physical_cores on CPU; clamped to 4–8 when a cgroup memory limit is present or when GPU does the work).
- Tier 7: decode tile (split-K above 1024 KV tokens on AVX2).
- Tier 8: pipeline (Paged on GPU, Continuous on 8+ cores / 64+ GiB / dense, Sequential otherwise).
- tps estimate: `min(per-core tps × cores, RAM bandwidth / model bytes)` calibrated against the existing `results/bench/` numbers.
CLI: `--auto` (default for `run`), `--no-auto`, `--print-plan` (plain or `json`). Plan is applied to the `Args` struct before the model is built: only fields the user didn't explicitly set are touched (the `n_gpu_layers_set` and `kv_cache_dtype_set` internals are derived from a `user_passed_flag` argv scan).
Server: `--auto` (default), `--no-auto`, `--print-plan`, which prints the plan to logs and re-derives server fields; explicit flags win.
Tests: 16 new unit tests in `oxidize-core` (plan() table-driven across desktop-no-GPU, desktop-70B-streaming, A100-32B, A100-70B, MacBook Apple Silicon, MoE-on-low-cores, tiny-box, AVX2-decode-tile).
Smoke-tested locally: detects AMD AVX2 8c/16t 27 GiB no-GPU, plans Qwen3-4B at 8.2 tok/s decode (matches existing benchmark).
Smoke-tested K3 nodes ai-2@192.168.1.152 and ai@192.168.1.68: both Intel Xeon Silver 4110 family 6 model 85 (Skylake-SP, AVX-512 disabled by gate), 32 cores, ai has 325 GiB RAM. Plan: 32 threads, AVX2 x8, sequential → ~30 tok/s decode on Qwen3-4B.
scripts/auto_tune_report.sh runs `oxidize run --no-api --print-plan=json` locally or on a remote K3 node via sshpass and emits a Markdown report.
AGENTS.md updated with the autotune and is_skylake_sp rows.
Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
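The tok/s estimate formula quoted above can be written out directly. This is a sketch with hypothetical parameter names and units (bytes per second, bytes); how the real autotune module calibrates the per-core rate is not shown in the commit.

```rust
/// Estimate decode throughput as the smaller of a compute bound and a memory
/// bound: min(per-core tok/s * cores, RAM bandwidth / model bytes).
/// The memory bound assumes every token streams the whole model once.
pub fn estimate_tps(
    per_core_tps: f64,
    cores: u32,
    ram_bandwidth_bytes_per_s: f64,
    model_bytes: f64,
) -> f64 {
    let compute_bound = per_core_tps * f64::from(cores);
    let memory_bound = ram_bandwidth_bytes_per_s / model_bytes;
    compute_bound.min(memory_bound)
}
```

For example, with 1.5 tok/s per core, 40 GB/s of bandwidth, and a 2.5 GB model, eight cores are compute-bound at 12 tok/s, while 32 cores hit the 16 tok/s memory bound. For MoE models the bytes read per token are lower than the file size, so the memory bound here would be pessimistic.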
13 issues found across 23 files (changes from recent commits).
- Introduced the `oxidize-prune` package with dependencies on `anyhow`, `clap`, and `oxidize-core`.
- Updated `Cargo.toml` to include `oxidize-prune` as a workspace member.
- Modified `Dockerfile.server` to create a model cache directory for the `oxidize` user and changed the exposed port from 3000 to 8080.
- Removed the obsolete `serve.log` file.
- Enhanced `Args` struct in `oxidize-cli` to include `force_dflash` flag for speculative decoding.
- Updated inference configuration in `oxidize-core` to support DeepSeek architecture with new parameters for expert weights scaling and group routing.
- Various code style improvements and adjustments for better readability across multiple files.
13 issues found across 36 files (changes from recent commits).
Introduces a workspace crate to merge two HuggingFace SafeTensors models with linear or SLERP interpolation, per-category blend weights, and mmap-based sharded I/O for large checkpoints. Co-authored-by: Cursor <cursoragent@cursor.com>
- Added `oxidize-prune` dependency to leverage SIMD magnitude and Wanda masks for efficient tensor processing.
- Updated `AGENTS.md` to document the new `oxidize-prune` functionality and its dependencies on `oxidize-kernels`.
- Modified `Cargo.lock` to include `oxidize-kernels` and `rayon` for parallel processing.
- Refactored `oxidize-cli` to streamline command handling and improve usability.
- Cleaned up `continual-learning` state files to reflect recent changes in model handling.
This commit enhances the performance and capabilities of the oxidize framework, particularly in pruning and tensor operations.
12 issues found across 31 files (changes from recent commits).
…PU GEMV Enable AMD inference via hipcc-compiled kernels and unified CUDA/ROCm dispatch, with RDMA ring transport scaffolding and ultra-low-bit quant fast paths for large GGUF models. Co-authored-by: Cursor <cursoragent@cursor.com>
…and CUDA Port hardware autotune, layer-wise/MTP/LoRA inference, draft loading, vision/video, convert/prune/validation tooling, TCP mesh routing, and CUDA backend selection to oxidize-golang with matching Python CLI and runtime wiring plus parity tests. Co-authored-by: Cursor <cursoragent@cursor.com>
- Updated `AGENTS.md` to clarify guidelines for extending Go/Python ports and GPU backend implementations.
- Improved handling of continual learning state files with additional metadata and timestamps.
- Refactored `diffusion_gemma_bench.rs` to ensure proper error handling during model generation.
- Adjusted `lib.rs` and `generate.rs` to enforce stricter Clippy linting rules, enhancing code quality.
- Removed obsolete `tensor.rs` file and reorganized module structure for better clarity.
- Added error handling in `block_pool.rs` and `scheduler.rs` to prevent panics and improve robustness.
These changes collectively enhance the functionality, maintainability, and reliability of the oxidize framework.
@cubic-dev-ai review
@Jackson57279 I have started the AI code review. It will take a few minutes to complete.
4 issues found across 16 files (changes from recent commits).
…ariable' Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Address all outstanding cubic/codex review findings on the OXK-kernels PR.
Correctness / safety:
- spinpool: propagate a worker-chunk panic to the submitter after the ack-drain instead of only logging (no more silent incomplete output).
- kernels/prune: assert_eq! (not debug_assert_eq!) on weight/mask length so release builds don't silently leave weights unzeroed; use total_cmp for a strict weak ordering under NaN.
- merge/blend: guard slerp against near-antipodal vectors (sin_theta → 0) to avoid NaN/Inf weights; tighten the midpoint angle test.
- merge/index: error on conflicting shard metadata instead of silent overwrite; reject non-plain shard names (path-traversal guard).
- merge/writer: fail loudly when a shard referenced by the index is missing.
- finetuning/fused: fail fast on out-of-range targets in both the gradient and loss-only paths (was release-only silent skip vs clamp).
- cuda: don't evict the just-inserted quantized weight in the same budget pass (enforce_budget_protecting); cuBLAS handle lifetime unchanged.
- cli: build the rayon global pool after autotune finalizes --threads so the recommended thread count actually takes effect.
- prune: memory-map the model for calibration validation instead of reading the whole file (OOM on large models).
Autotune:
- detect: pick the highest-capability GPU family deterministically (rank, not nvidia-smi order); Display instead of Debug in --print-hardware.
- rules: KV budget accounts for GPU-offloaded layers; tier6 thread reduction no longer gated on oxk_isa (ARM/Neon); F16 KV rationale wording.
- server: drop layer_wise recommendation for DFlash models before logging.
Cleanup:
- conversion: extract StagedTensor alias, drop file-level type_complexity allow.
- server: collapse MTP if, drop collapsible_if allow; auth keys() returns an iterator (no per-request Vec alloc).
- tensor: move DType/ActivationFn out of errors.rs into types.rs.
- scheduler/block_pool: remove redundant HashMap lookups / id validation.
- prune/filter, merge/recipe: doc + classification fixes; k8s image tag pinned.
- AGENTS.md: clarify CGO is permitted for native GPU bindings.
Remove stray local experiment artifacts that leaked into the PR (personal LAN scripts with a hard-coded SSH password, a k8s manifest, a planning HTML, and a codebase training-data dump): ai2_probe.sh, llama-qwen7b.yaml, kimi-k2-merge-plan-v2.html, training-data/oxidize-codebase.jsonl, scripts/auto_tune_report.sh, scripts/kimi_k2_ai2_*.sh.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
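The slerp guard mentioned under merge/blend can be sketched as below. Falling back to linear interpolation when sin(theta) is close to zero is an assumption about what the guard does; the threshold `1e-6` and the function signature are likewise hypothetical.

```rust
/// Spherical linear interpolation between two weight vectors, with a guard
/// for the degenerate cases where the angle's sine is close to zero (nearly
/// parallel or nearly antipodal inputs) or one input has zero norm.
/// Assumption: these cases fall back to plain linear interpolation.
pub fn slerp(a: &[f32], b: &[f32], t: f32) -> Vec<f32> {
    assert_eq!(a.len(), b.len());
    let lerp = || -> Vec<f32> {
        a.iter().zip(b).map(|(x, y)| (1.0 - t) * x + t * y).collect()
    };
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return lerp();
    }
    // Clamp guards acos against rounding slightly outside [-1, 1].
    let cos_theta = (dot / (norm_a * norm_b)).clamp(-1.0, 1.0);
    let theta = cos_theta.acos();
    let sin_theta = theta.sin();
    if sin_theta.abs() < 1e-6 {
        // Dividing by sin_theta here would produce Inf or NaN weights.
        return lerp();
    }
    let wa = ((1.0 - t) * theta).sin() / sin_theta;
    let wb = (t * theta).sin() / sin_theta;
    a.iter().zip(b).map(|(x, y)| wa * x + wb * y).collect()
}
```

For exactly antipodal inputs the linear fallback at t = 0.5 yields the zero vector, which is finite but may not be what a merge recipe wants; a production guard might instead reject such tensor pairs.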
Summary by cubic
Adds the new `oxidize-kernels` (OXK) CPU kernels and broad inference upgrades across CPU, CUDA, and ROCm/HIP, plus auto‑tuning, dual‑socket NUMA replication, pruning with `oxidize-prune`, SafeTensors merging via `oxidize-merge`, native MTP/speculative decoding, streaming safetensors→GGUF with on‑the‑fly quantize/prune, batched LoRA SFT, DiffusionGemma, and a Go port with parity. Default `OXIDIZE_GEMV=auto` now picks OXK when supported; flash‑attention borrows F16 KV to cut DRAM and memory.
New Features
- … `IQ1_S`, `IQ1_M`, `NVFP4`, `Q4_K`, and F16 KV borrow in flash‑attention; `oxidize-core` exposes `gpu_dispatch` for one entrypoint.
- … `RdmaRingTransport`.
- `oxidize-golang` parity for autotune, inference (incl. MTP/LoRA), convert/prune/validation, TCP mesh, and CUDA selection, with tests.
Bug Fixes
- … `layer_wise` hints for DFlash.
- … `--threads`; minor scheduler/block‑pool and tensor type cleanups.
Written for commit 55b5029. Summary will update on new commits.