
Feat/oxk kernels #16

Merged
Jackson57279 merged 38 commits into master from feat/oxk-kernels on Jun 18, 2026

Jackson57279 (Contributor) commented Jun 12, 2026

Summary by cubic

Adds the new oxidize-kernels (OXK) CPU kernels and broad inference upgrades across CPU, CUDA, and ROCm/HIP. Also adds auto-tuning, dual-socket NUMA replication, pruning with oxidize-prune, SafeTensors merging via oxidize-merge, native MTP/speculative decoding, streaming safetensors→GGUF conversion with on-the-fly quantize/prune, batched LoRA SFT, DiffusionGemma, and a Go port with parity. The default OXIDIZE_GEMV=auto now picks OXK when supported, and flash-attention borrows the F16 KV cache directly, cutting attention DRAM reads and KV memory.

  • New Features

    • ROCm/HIP: new AMD backend with unified CUDA/ROCm dispatch, direct GPU GEMV for IQ1_S, IQ1_M, NVFP4, Q4_K, and F16 KV borrow in flash‑attention; oxidize-core exposes gpu_dispatch for one entrypoint.
    • RDMA mesh: verbs‑probed ring transport scaffolded as RdmaRingTransport.
    • Go port: oxidize-golang parity for autotune, inference (incl. MTP/LoRA), convert/prune/validation, TCP mesh, and CUDA selection, with tests.
  • Bug Fixes

    • Spin‑pool: propagate worker panics to the submitter after ack‑drain; no silent partial output.
    • Pruning/merge: strict length checks for masks, stable cmp under NaN, guarded SLERP to avoid NaNs, conflict checks in shard index/writer; calibration validation uses mmap.
    • Autotune: deterministic best‑GPU pick by family rank, GPU‑offload aware KV budget, threads reduction not gated on ISA, clearer F16 KV rationale; server drops layer_wise hints for DFlash.
    • CLI/runtime: build the rayon pool after autotune resolves --threads; minor scheduler/block‑pool and tensor type cleanups.
    • Hygiene: remove stray local artifacts; conversion refactors (staged tensor alias) without changing behavior.

Written for commit 55b5029. Summary will update on new commits.


Jackson57279 and others added 14 commits June 10, 2026 12:15
- numa.rs: OXIDIZE_NUMA_REPLICATE=1 copies the mapped GGUF into one
  MPOL_BIND + MADV_HUGEPAGE buffer per NUMA node at load; decode chunk
  closures translate their matrix slice to the caller's node-local
  replica (TLS-cached getcpu; pinned workers are exact). Removes the
  ~52%-remote weight reads and the Skylake directory-write tax for the
  cost of one weight copy per node. Falls back silently on single-node
  hosts or allocation failure.
- spinpool: an idle worker waking on the 50ms park timeout no longer
  re-enters the spin phase (only a notify does) — two pools sharing
  pinned cores otherwise bleed several ms of CPU per worker per 50ms,
  which degraded every process on the box by ~25%.

Qwen3-30B-A3B at native 32K context window on the dual-socket CPU box:
9.7 tok/s short-form, 8.5-9.6 sustained (matched A/B: replication alone
+20%, idle fix recovers the cross-server regression). THP on the
replicas matters: 4KB anon pages cost ~4.5M TLB entries vs the large
folios the page cache uses.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Phase 1-3 of the OXK migration plan (.cursor/plans/xeon-oxk-kernels.md),
implemented in Rust std::arch intrinsics rather than C:

- New optional oxidize-kernels crate: Q4_K x Q8_K row dots (scalar
  reference + AVX2 x1/x4/x8) and a contiguous-range GEMV helper.
  Bit-exact vs the legacy tensor.rs kernels (exact-equality parity
  test, plus 131k-range shadow run with 0 mismatches on Xeon Silver).
- OXIDIZE_GEMV=legacy|oxk|shadow choke point in gemv_q4_k_q8_k_fused
  and the Q4_K expert GEMV paths. Default stays legacy; without
  --features oxk the build is unchanged.
- Plain-harness microbench (oxk_q4k_bench) for Gate B.

Gate results on 2x Xeon Silver 4110 (AVX2, no VNNI):
- Microbench (1 core, 30s sustained): x8 beats the legacy-style x4
  structure +6.2% cache-resident, +1.9% DRAM-resident.
- E2E decode Qwen3-30B-A3B Q4_K_M, interleaved A/B (3 pairs, 28
  threads): legacy 7.70/7.77/7.77 vs oxk 7.66/7.73/7.70 tok/s
  (oxk = 99.4%). Decode is at the DRAM ceiling, so the Phase 5
  flip-default gate (>= 100%) is NOT met; legacy remains default.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
numa.rs now supports multiple replicated regions (sorted, binary-search
translation in local_slice). With OXIDIZE_NUMA_REPLICATE=1, layer-wise
load replicates the whole GGUF only when it fits half the smallest
node's memory; past that it falls back to replicating just the dense
(non-routed-expert) tensors — on nex-n2-pro that is 5.1 GiB per node
carrying ~half the per-token weight reads. OXIDIZE_NUMA_REPLICATE=dense
forces the partial mode.
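
The sorted-region lookup described above can be sketched as follows. This is a minimal illustration, not the code in numa.rs: the `Region` struct, its fields, and the `translate` function are hypothetical names, and addresses are plain integers so the logic can be followed without any mmap calls.

```rust
use std::ops::Range;

/// One replicated region (hypothetical layout): the source address range of
/// the mapped file and the base address of each node's private copy.
struct Region {
    src: Range<usize>,
    replica_base: Vec<usize>, // indexed by NUMA node id
}

/// Translate `addr..addr + len` to the replica on `node`.
/// Returns None when the span is not fully inside one replicated region,
/// in which case a caller would keep reading the original mapping.
/// `regions` must be sorted by `src.start` and must not overlap.
fn translate(regions: &[Region], node: usize, addr: usize, len: usize) -> Option<usize> {
    // Binary search: index of the first region starting after `addr`.
    // The only candidate that can contain `addr` is the one before it.
    let idx = regions.partition_point(|r| r.src.start <= addr);
    let region = regions.get(idx.checked_sub(1)?)?;
    let end = addr.checked_add(len)?;
    if end > region.src.end {
        return None;
    }
    let base = *region.replica_base.get(node)?;
    Some(base + (addr - region.src.start))
}
```

Returning `None` for anything outside a region keeps the fallback path trivial: with dense-only replication, routed-expert slices simply miss the table and are read from the shared mapping.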

Nex-n2-pro Q4_K_M (208GB, 2x Xeon Silver 4110) decode, 64tok x 3iter:
- baseline 16T:            2.46-2.56 tok/s (prior production config)
- 28T, no replication:     3.01-3.03 tok/s (idle box; old 16T rule was
                           measured with two servers sharing cores)
- 28-32T + dense repl:     3.29-3.35 tok/s  (+34% total)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…esktop CPUs

The spin pool and identity CPU pinning, both tuned on the dual-socket Nex
box, collapsed decode throughput on single-socket consumer parts (Ryzen
6850H, 8C/16T, Qwen3-4B Q4_K_M: 2.96 tok/s vs 9.65 for the pre-spinpool
build):

- Spin pool now defaults on only when NUMA nodes > 1. Its always-spinning
  workers (rayon_threads - 1 of them, on top of the rayon pool itself)
  starve an 8-core part: 2.96 -> 9.0 tok/s just by disabling it here.
  OXIDIZE_SPINPOOL=1/0 still forces either way; multi-socket hosts keep it.
- Worker pinning now uses core-first order read from sysfs (first SMT
  sibling of each core, then the rest). Linux enumerates sibling pairs
  adjacently on AMD, so the old identity map stacked 8 workers onto 4
  physical cores (9.15 -> 12.0 tok/s).
- Default thread count is now the physical core count instead of
  available_parallelism: decode GEMV is DRAM-bound, SMT siblings only add
  contention (16T 11.0 vs 8T 11.6 with the other fixes in).
- `oxidize run model "prompt"` (one-shot) no longer auto-starts the
  background API server: it loaded the model a second time concurrently
  with prefill and died with the process right after generation.
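
The core-first ordering from the pinning bullet above can be sketched like this. It is an illustration under assumptions: `core_first_order` is a hypothetical name, the `(logical_cpu, core_id)` pairs are passed in rather than read from sysfs, and a real multi-socket host would also need the package id in the key because core ids repeat across sockets.

```rust
use std::collections::BTreeMap;

/// Build a pinning order from `(logical_cpu, physical_core_id)` pairs:
/// the first logical CPU of every physical core comes first, then the
/// remaining SMT siblings.
fn core_first_order(topology: &[(usize, usize)]) -> Vec<usize> {
    // Group logical CPUs by physical core and sort each group.
    let mut by_core: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for &(cpu, core) in topology {
        by_core.entry(core).or_default().push(cpu);
    }
    for cpus in by_core.values_mut() {
        cpus.sort_unstable();
    }
    let max_siblings = by_core.values().map(Vec::len).max().unwrap_or(0);
    let mut order = Vec::with_capacity(topology.len());
    // Round 0 takes the first sibling of each core, round 1 the second, and so on.
    for round in 0..max_siblings {
        for cpus in by_core.values() {
            if let Some(&cpu) = cpus.get(round) {
                order.push(cpu);
            }
        }
    }
    order
}
```

With adjacent sibling enumeration (CPUs 0 and 1 on core 0, 2 and 3 on core 1, and so on), eight workers pinned in this order land on eight distinct physical cores instead of four.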

Benchmarks (Ryzen 7 PRO 6850H, Qwen3-4B Q4_K_M, 128-token decode):
  before: 2.96 tok/s   after: 12.0 tok/s
ollama-performance-benchmark harness (768 tokens, load included):
  before: 8.60 tok/s   after: 11.12 tok/s (ollama 14.07, llama.cpp 13.60)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…> 0.61x to 0.81x of ollama

Decode profiling (new OXIDIZE_DECODE_PROFILE=1, per-shape GB/s + phase
timers at exit) attributed the remaining gap to four causes, fixed here:

- q/k/v and gate/up ran as nested rayon::join regions whose inner par_iters
  stole work from each other, interleaving weight streams on the same cores:
  the gate/up matrices measured 19-21 GB/s vs 32+ for the same shape alone
  (and with the spin pool, the losing join arm ran entirely serial — the
  reason the pool collapsed to 3 tok/s on desktops). New
  gemv_quantized_multi_f32 runs all same-input projections as ONE flat
  region sharing one Q8_K quantization; chunk sizes are byte-weighted so
  mixed Q4_K/Q6_K jobs stay balanced. Bit-identical rows (test included).
- Attention heads now dispatch through run_chunks instead of a raw rayon
  region, and the parallel threshold drops 128 -> 16 (the old value left
  attention single-threaded for the entire early context).
- The spin pool is default-on everywhere now: with every decode hot loop on
  run_chunks it beats rayon's sleep/wake handoff on single-socket too, but
  only with the submitting thread pinned to slot 0 — an unpinned submitter
  timeshares against spinning workers and loses ~8%.
- `oxidize run`/`serve` default KV dtype q8 -> f32: q8/f16 caches cannot be
  borrowed by decode attention, so every layer of every token dequantized
  the WHOLE K/V prefix into workspace buffers (~2 GB of copies over a
  768-token run). cpu-optimized clamps ctx to 2048, bounding f32 KV at
  ~600 MB for a 4B model. Decode glue: 98 -> 28 us/layer.

Also: rayon fallback in run_chunks uses static block partitioning (one
contiguous range per worker), MADV_COLLAPSE attempt at model load (no-op
on kernels/filesystems without file-THP), GEMV shape microbench test.
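
For reference, static block partitioning (one contiguous range per worker) can be sketched as below. `static_blocks` is a hypothetical helper, not the function used by run_chunks, and it weights items equally rather than by bytes.

```rust
use std::ops::Range;

/// Split `0..n_items` into at most `workers` contiguous ranges whose
/// lengths differ by at most one. Empty ranges are omitted.
fn static_blocks(n_items: usize, workers: usize) -> Vec<Range<usize>> {
    if workers == 0 {
        return Vec::new();
    }
    let base = n_items / workers;
    let extra = n_items % workers; // the first `extra` workers get one more item
    let mut out = Vec::with_capacity(workers.min(n_items));
    let mut start = 0;
    for w in 0..workers {
        let len = base + usize::from(w < extra);
        if len == 0 {
            break;
        }
        out.push(start..start + len);
        start += len;
    }
    out
}
```

Because each worker walks one contiguous range, every core streams a single run of weight rows, which is the access pattern a hardware prefetcher handles best.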

Benchmarks (Ryzen 6850H, Qwen3-4B Q4_K_M, 128-tok decode, same-run pairs;
absolute numbers swing ~15% with package thermals):
  oxidize self-reported: 11.5 -> 12.4 tok/s (was 2.96 at yesterday's
  defaults, 9.65 for the pre-spinpool build)
  vs llama.cpp decode-only: 0.69-0.75x (llama.cpp 13.8-16.7 same-minute)
  ollama-performance-benchmark (768 tok, load included, same run):
  oxidize 9.50 vs ollama 11.78 = 0.81x (was 8.60 vs 14.07 = 0.61x)

Remaining known gaps: f16-KV borrow path for attention (llama.cpp attends
f16 natively; f32 doubles late-context KV reads), Q4_K scale-decode SIMD,
batched prefill (~30 tok/s vs llama.cpp 72).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…llama

Three decode fixes on top of the fused-region work:

- f16 KV cache borrow path: new KvElem trait makes the online-softmax
  decode kernel generic over the KV element; u16 rows convert in-kernel
  via F16C (AVX2), f32 passes through bit-identically. The KV cache gains
  f16_layer_{key,value}_prefix borrows, and decode attention prefers them
  before the f32 borrow / dequant-copy paths. `run`/`serve` KV default
  f32 -> f16: zero-copy like f32 but half the attention DRAM reads as the
  context grows, and half the memory.
- Next-quad prefetch sweep in the Q4_K/Q6_K x4 row kernels for short rows
  (blocks_per_row <= 16): 10-block rows restart the hardware prefetcher
  every 22 cache lines, which held every 2560-column matrix (gate/up
  projections, q/k/v, lm_head) ~10-20% under the DRAM roof. The sweep
  walks the next quad's row one quad-time ahead: gate/up 37 -> 45.6 GB/s,
  qkv 34 -> 41.9, lm_head 38 -> 44.3, decode 13.3 -> 15.1 tok/s
  (same-conditions A/B). Long rows get a deeper in-row T1 sweep instead
  (down-proj relative deficit closed in the shape microbench).
- gemm_quantized_f32 now records into the OXIDIZE_DECODE_PROFILE summary
  (prefill attribution; batch-43 GEMM measured ~46% of FMA peak).
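
A rough sketch of a decode-attention kernel that is generic over the KV element type is shown below. It reuses the name `KvElem` from the commit message, but the trait shape, the `F16Bits` newtype, and the `attend` function are assumptions made for illustration; the f16 conversion here is portable scalar code, not the F16C path the commit describes.

```rust
/// Hypothetical element trait: anything that can be widened to f32 on load.
trait KvElem: Copy {
    fn to_f32(self) -> f32;
}

impl KvElem for f32 {
    fn to_f32(self) -> f32 {
        self // pass-through, bit-identical
    }
}

/// IEEE 754 binary16 stored as raw bits.
#[derive(Clone, Copy)]
struct F16Bits(u16);

impl KvElem for F16Bits {
    fn to_f32(self) -> f32 {
        let h = self.0 as u32;
        let sign = (h & 0x8000) << 16;
        let exp = (h >> 10) & 0x1f;
        let man = h & 0x03ff;
        let bits = match (exp, man) {
            (0, 0) => sign, // signed zero
            (0, _) => {
                // Subnormal: value is man * 2^-24.
                let v = man as f32 * (1.0 / 16_777_216.0);
                return if sign != 0 { -v } else { v };
            }
            (0x1f, _) => sign | 0x7f80_0000 | (man << 13), // infinity or NaN
            _ => sign | ((exp + 112) << 23) | (man << 13), // rebias 15 -> 127
        };
        f32::from_bits(bits)
    }
}

/// Single-query attention over a borrowed K/V prefix using online softmax.
/// `keys` and `values` are row-major with `q.len()` elements per row.
fn attend<T: KvElem>(q: &[f32], keys: &[T], values: &[T], scale: f32) -> Vec<f32> {
    let d = q.len();
    assert!(d > 0 && keys.len() % d == 0 && keys.len() == values.len());
    let mut max = f32::NEG_INFINITY;
    let mut denom = 0.0f32;
    let mut acc = vec![0.0f32; d];
    for (k_row, v_row) in keys.chunks_exact(d).zip(values.chunks_exact(d)) {
        let score = scale * q.iter().zip(k_row).map(|(a, b)| a * b.to_f32()).sum::<f32>();
        let new_max = max.max(score);
        let correction = (max - new_max).exp(); // rescales earlier rows; 0.0 on the first row
        let w = (score - new_max).exp();
        denom = denom * correction + w;
        for (o, v) in acc.iter_mut().zip(v_row) {
            *o = *o * correction + w * v.to_f32();
        }
        max = new_max;
    }
    if denom > 0.0 {
        for o in &mut acc {
            *o /= denom;
        }
    }
    acc
}
```

The point of the generic parameter is that the cache rows are read in place at their stored width, so an f16 cache needs no dequantized copy of the prefix before attention runs.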

Benchmarks (Ryzen 6850H, Qwen3-4B Q4_K_M, cool machine, same-run pairs):
  decode-only token_forward: 70.1 -> 62.7 ms/token (15.95 tok/s)
  oxidize self-reported 512 tok: 13.3 -> 15.1-15.2 tok/s
  ollama-performance-benchmark (768 tok, load included, ollama runs
  first on the cooler machine): oxidize 14.91 vs ollama 15.72 = 0.95x
  (was 0.61x at the start of this effort)

f16 attention matches f32 within half-precision rounding (test included).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ining hooks

- GDN gated RMSNorm: near-zero eps (model eps over-floored tiny delta
  outputs), gate-after order matching llama.cpp's qwen3next graph, and
  L2-normed q/k without 1/sqrt(d)
- canonicalize bare 'blk.N.ssm_a' (no .weight suffix) from llama.cpp
  GGUFs; handle both ssm_conv1d layouts ({kernel,channels} vs
  {channels,kernel})
- tokenizer: honor tokenizer.ggml.add_bos_token metadata; default BOS
  only for SentencePiece (spurious BOS corrupted Qwen forward passes)
- layer-wise: forward_normed_hidden + lm_head_logits_batch batched
  training entry points; warm_layer_cache; OXIDIZE_TRACE_VALS debugging

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rts them

auto (new default) routes Q4_K GEMV to oxidize-kernels when the crate
is compiled in and AVX2 is available, falling back to legacy intrinsics
otherwise. Also checks in the Xeon OXK migration plan.
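
The `auto` routing described above might look roughly like the sketch below. The enum, the function names, and the handling of unknown or unsupported values are assumptions; only the env var values (legacy, oxk, shadow, auto) and the "compiled in and AVX2 available" condition come from the commit messages.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GemvMode {
    Legacy,
    Oxk,
    Shadow,
}

/// Resolve the requested mode against what the build and CPU support.
/// `oxk_compiled` stands in for a compile-time feature check and `has_avx2`
/// for runtime detection; both are parameters so the policy is testable.
/// Assumption: a forced mode that cannot run falls back to Legacy.
fn resolve_gemv(requested: Option<&str>, oxk_compiled: bool, has_avx2: bool) -> GemvMode {
    let usable = oxk_compiled && has_avx2;
    match requested.map(str::trim) {
        Some("legacy") => GemvMode::Legacy,
        Some("shadow") if usable => GemvMode::Shadow,
        // "auto", "oxk", unset, and unknown values: OXK when it can run.
        _ if usable => GemvMode::Oxk,
        _ => GemvMode::Legacy,
    }
}

/// Runtime AVX2 detection on x86, false elsewhere.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn host_has_avx2() -> bool {
    std::arch::is_x86_feature_detected!("avx2")
}

#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn host_has_avx2() -> bool {
    false
}
```

A caller would pass `std::env::var("OXIDIZE_GEMV").ok().as_deref()` and `host_has_avx2()` once at startup and cache the result.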

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- parse nextn_predict_layers from GGUF; exclude appended MTP draft
  blocks from layer_count; load blk.N.nextn.* tensors (MtpWeights)
- MtpGenerationStream: drafts from the last committed token plus its
  output-normalized hidden state, so prefill provides the first anchor
- CLI uses native MTP automatically when present (--no-mtp to disable,
  --draft-tokens to tune); accept qwen3_5_text arch aliases
- dflash: GGUF row/col dim handling fixes for draft weight loading

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ntime flags

- open_generation_stream picks DFlash draft (--draft-model,
  --draft-tokens), native MTP when the GGUF has nextn layers, or
  standard generation
- new flags: --kv-cache-dtype f32/f16/q8/q4, --threads,
  --ram-offload-threads, --no-turboquant-kv
- TurboQuant is now the default q4/q8 KV cache quantizer
- warm layer cache at load; qwen chat-template fallback

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- shard-by-shard streaming conversion (plan_stream_outputs +
  write_gguf_streaming) so model dirs no longer materialize in RAM
- --quantize target on oxidize-convert (Q4_K/Q8_0/... via
  quantize_linear_4bit and friends in compute/quantization)
- qwen3_5* arch aliases normalized to qwen35 metadata prefix
- gguf_layer_keys inspection bin for normalized per-layer tensor keys

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ter hot paths

- LoRA forward/backward take count rows at once (cache-friendly
  parallel loops instead of one rayon dispatch per token)
- trainer drives layer_wise forward_normed_hidden +
  lm_head_logits_batch for whole windows; fused AdamW + batched
  softmax cross-entropy
- dataset: chat-format JSONL support; CLI: more training knobs

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- gemv/layer/inference benches grow OXIDIZE_GEMV-aware comparison runs
  and Q4_K coverage
- cuda.rs/numa.rs/build.rs/metrics.rs: rustfmt-only cleanup; lib.rs
  module reorder

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…dth bench

- cpu.rs: CPUID vendor (Intel/AMD) + ISA summary (AVX2/FMA/AVX-VNNI/
  AVX512-VNNI) with a per-vendor tuning profile resolved once per process
- prefetch distance and hint are now runtime-tunable: OXIDIZE_OXK_PF
  (blocks, 0 disables) and OXIDIZE_OXK_PF_HINT=t0|nta
- default stays 4 blocks/T0 for both vendors: a contended 8-thread sweep
  on Ryzen 6850H (Zen 3+) showed pf 0/2/4 x t0/nta all within noise and
  pf=8 mildly worse — Zen's HW prefetcher covers the stream, so AMD
  shares the Xeon-tuned default instead of diverging on noise
- oxk_q4k_bench grows OXK_BENCH_THREADS contended mode (all cores
  streaming at once, the shape of real decode) and prints the CPU summary

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0c4292169c


Comment thread oxidize-core/src/model/generation.rs Outdated

@cubic-dev-ai cubic-dev-ai Bot left a comment


22 issues found across 55 files


Comment thread oxidize-core/src/compute/spinpool.rs Outdated
Comment thread oxidize-core/benches/inference_bench.rs Outdated
Comment thread oxidize-core/build.rs Outdated
Comment thread oxidize-core/src/format/safetensors_to_gguf.rs Outdated
Comment thread oxidize-core/src/format/conversion.rs Outdated
Comment thread oxidize-core/src/backends/cuda.rs
Comment thread oxidize-core/src/compute/quantization.rs Outdated
Comment thread oxidize-finetuning/src/dataset.rs Outdated
Comment thread oxidize-finetuning/src/dataset.rs Outdated
Comment thread oxidize-core/build.rs Outdated
Jackson57279 and others added 4 commits June 12, 2026 22:59
… contended MT bench

The Q4_K GEMV kernel was measured at ~33 GB/s and assumed "DRAM-latency
bound, kernel exhausted." That conclusion came from a benchmark bug: the
contended-mode harness spawned OS threads per pass, understating throughput
~2x. Fixed to persistent deadline-loop workers (run_mt, OXK_BENCH_MT_ONLY/
OXK_BENCH_MT_KERNEL). With a correct harness, an on-box prefetch sweep
(2x Xeon Silver 4110, Skylake-SP, DDR4-2133; 302 MB fixture, 32T, interleaved
pf in {0..8} x {t0,nta}) showed pf=1/t0 hits 72-74 GB/s = the platform's
pure-read ceiling, vs ~63.5 at the old pf=4 default and ~57 for any NTA hint.

- cpu.rs: Intel software-prefetch default 4 -> 1 block, with the measurement
  recorded inline; AMD/Other unchanged (Zen sweep was within noise).
- lib.rs: select_isa() resolved once via OnceLock — it ran inside
  gemv_q4k_range (per pool chunk), and the per-call env::var showed up at
  >1% of decode samples (libc getenv scans the environment). Adds AVX-512F/BW,
  AVX-512 VNNI, and AVX-VNNI dispatch behind runtime detection +
  OXIDIZE_OXK_ISA/OXIDIZE_OXK_AVX512 overrides (AVX-512 stays off by default
  on Skylake-SP: measured 71 vs 73 GB/s DRAM, 52 vs 80 cache-resident — the
  frequency drop loses).
- q4k_avx512.rs: new AVX-512 / VNNI / AVX-VNNI Q4_K x Q8_K kernels, bit-exact
  vs scalar (parity tests).
- q4k_avx2.rs: x16 multi-row variant, decode/dot split, vendor-tuned prefetch.
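
The "resolve once via OnceLock" pattern from the lib.rs bullet can be sketched for the prefetch knobs as follows. The struct, enum, and function names are hypothetical, and the default of 1 block passed in `prefetch_cfg` is only an example value; the env var names are the ones given in the commit messages.

```rust
use std::sync::OnceLock;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PfHint {
    T0,
    Nta,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PrefetchCfg {
    blocks: usize, // 0 disables software prefetch
    hint: PfHint,
}

/// Pure parser so the policy can be tested without touching the environment.
/// Unparseable values fall back to the defaults (an assumption of this sketch).
fn parse_prefetch(pf: Option<&str>, hint: Option<&str>, default_blocks: usize) -> PrefetchCfg {
    let blocks = pf.and_then(|s| s.trim().parse().ok()).unwrap_or(default_blocks);
    let hint = match hint.map(|s| s.trim().to_ascii_lowercase()).as_deref() {
        Some("nta") => PfHint::Nta,
        _ => PfHint::T0,
    };
    PrefetchCfg { blocks, hint }
}

/// Reads the environment on the first call only; later calls are a cheap load,
/// so this is safe to call from inside a per-chunk hot loop.
fn prefetch_cfg() -> PrefetchCfg {
    static CFG: OnceLock<PrefetchCfg> = OnceLock::new();
    *CFG.get_or_init(|| {
        let pf = std::env::var("OXIDIZE_OXK_PF").ok();
        let hint = std::env::var("OXIDIZE_OXK_PF_HINT").ok();
        parse_prefetch(pf.as_deref(), hint.as_deref(), 1)
    })
}
```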

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…r-layer gate+up alloc

Qwen3-30B-A3B-Q4_K_M CPU decode went from ~9 to ~12.4 tok/s (28T, interleaved
A/B). Profiling (OXIDIZE_DECODE_PROFILE + perf IMC counters) showed decode was
NOT bandwidth-bound — only ~20 of 73 GB/s achieved — but stalled on CPU
overhead in the serial forward path:

- tensor.rs gemv_f32_cpu: the MoE router projection (every layer, every token)
  used a scalar `.map().sum()` f32 reduction LLVM can't vectorize
  (non-associative) — a serial FMA chain. Switched to dot_f32_fast (AVX2 FMA,
  4 independent accumulators).
- inference.rs / layer_wise.rs: OXIDIZE_TRACE_FWD/_VALS were read via
  env::var_os on every layer of every token. Hoisted behind cached
  trace_fwd_enabled()/trace_vals_enabled() OnceLock helpers.
- inference.rs moe_ffn_forward_weights: the fused gate+up branch heap-allocated
  `vec![0.0; 2*n_sel*i_size]` and memcpy'd it back into two scratch buffers
  every layer every token (~14% of main-thread decode samples). Replaced with
  a thread-local reusable buffer read in place by SwiGLU + down-projection;
  fused3 GEMV improved 34 -> 39 GB/s. Output verified coherent.
- tokenizer.rs: add the add_bos_token field to two test-only SpecialTokens
  initializers so the oxidize-core test binary compiles again.
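
To illustrate the first bullet, here is a portable sketch of a dot product with four independent accumulators. It is not `dot_f32_fast` itself (which the commit describes as AVX2 FMA); `dot_4acc` is a hypothetical scalar stand-in that shows why splitting the sum helps.

```rust
/// Dot product with four independent partial sums.
/// A single running sum is one serial dependency chain, and the compiler may
/// not reorder it because f32 addition is not associative. Four lanes make the
/// reordering explicit, so the adds can overlap and map onto SIMD.
fn dot_4acc(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = [0.0f32; 4];
    let mut ca = a.chunks_exact(4);
    let mut cb = b.chunks_exact(4);
    for (x, y) in ca.by_ref().zip(cb.by_ref()) {
        for lane in 0..4 {
            acc[lane] += x[lane] * y[lane];
        }
    }
    // Elements left over when the length is not a multiple of four.
    let tail: f32 = ca
        .remainder()
        .iter()
        .zip(cb.remainder())
        .map(|(x, y)| x * y)
        .sum();
    (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail
}
```

The result can differ from a strictly left-to-right sum in the last bits, which is exactly the reordering the compiler is not allowed to introduce on its own.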

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… optimal

"Optimize OXK" investigation outcome: the kernel is already at the right
tiling on the target hardware. A single-threaded microbench suggested x1 beats
the wide tiles on Skylake-SP (4.23 vs 3.76 GB/s — register pressure from 8 Q8
ymm vectors held live across 8-16 row dots), but that bench is L3-resident.
A decisive interleaved e2e A/B on Qwen3-30B-A3B (28T, cold-DRAM expert reads)
showed the opposite and monotone: tile16 11.7/10.0 > tile8 7.5/7.0 >
tile1 4.8/4.3 tok/s. The wide tile's 16 independent outstanding loads hide DRAM
latency, which is what actually limits decode — so narrowing the tile would
have ~halved throughput.

gemv_q4k_range now gates its x16->x8->x4->x1 cascade on a once-resolved
max_tile(), default 16 (== prior behavior, verified no regression), overridable
via OXIDIZE_OXK_TILE={1,4,8,16} for retuning on other parts (e.g. VNNI cores).
Bit-identical regardless of width.
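
The x16->x8->x4->x1 cascade gated on a maximum tile can be sketched as a planning function. `tile_plan` is a hypothetical name; the real gemv_q4k_range dispatches kernels directly rather than building a list, and treating a `max_tile` of 0 as 1 is an assumption of this sketch.

```rust
/// Split `rows` output rows into tile widths drawn from {16, 8, 4, 1},
/// never using a width above `max_tile`.
fn tile_plan(rows: usize, max_tile: usize) -> Vec<usize> {
    let cap = max_tile.max(1);
    let mut plan = Vec::new();
    let mut left = rows;
    for width in [16usize, 8, 4, 1] {
        if width > cap {
            continue;
        }
        // Use the widest allowed tile for as long as enough rows remain.
        while left >= width {
            plan.push(width);
            left -= width;
        }
    }
    plan
}
```

For example, 29 rows with the default cap of 16 become one tile each of 16, 8, 4, and 1; the same rows with a cap of 8 become three tiles of 8, then 4 and 1.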

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment


5 issues found across 11 files (changes from recent commits).


Comment thread oxidize-kernels/benches/oxk_q4k_bench.rs Outdated
Comment thread oxidize-kernels/src/lib.rs Outdated
Comment thread oxidize-kernels/src/q4k_avx2.rs Outdated
Comment thread oxidize-core/src/model/inference.rs
Comment thread oxidize-kernels/src/cpu.rs Outdated
Jackson57279 and others added 2 commits June 13, 2026 21:15
# Conflicts:
#	oxidize-core/build.rs
#	oxidize-core/kernels/gemv_f32.cu
#	oxidize-core/src/backends/cuda.rs
#	oxidize-core/src/compute/tensor.rs
Spinpool panic-safety:
- P0: submitter catches panics in its own chunk range and still drains worker
  acks before returning, so workers never call the fat-pointer closure after
  its borrow ends (use-after-free).
- P1: workers ack even when a chunk panics (and stay alive), so one panicking
  chunk can no longer deadlock the pool.
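
The two panic-safety rules above (always ack, re-raise only after every ack is in) can be sketched with scoped threads. This is a simplified stand-in: the real spin pool uses persistent workers, and `run_chunks_panic_safe` is a hypothetical name.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Run `job(i)` for every chunk index on its own scoped thread.
/// Each worker acknowledges completion even if its chunk panicked, and the
/// first captured panic is re-raised on the submitter only after all workers
/// have finished, so `job` is never called after its borrow ends.
fn run_chunks_panic_safe<F>(chunks: usize, job: F) -> usize
where
    F: Fn(usize) + Sync,
{
    let acks = AtomicUsize::new(0);
    let mut first_panic = None;
    thread::scope(|s| {
        let handles: Vec<_> = (0..chunks)
            .map(|i| {
                let (job, acks) = (&job, &acks);
                s.spawn(move || {
                    let result = panic::catch_unwind(AssertUnwindSafe(|| job(i)));
                    acks.fetch_add(1, Ordering::Release); // ack regardless of outcome
                    result
                })
            })
            .collect();
        for h in handles {
            // The inner Err carries the payload of a chunk that panicked.
            if let Ok(Err(payload)) = h.join() {
                if first_panic.is_none() {
                    first_panic = Some(payload);
                }
            }
        }
    });
    let acked = acks.load(Ordering::Acquire);
    if let Some(payload) = first_panic {
        panic::resume_unwind(payload); // propagate to the submitter
    }
    acked
}
```

The ordering matters more than the mechanism: draining acks before unwinding is what closes the use-after-free window, and acking on failure is what prevents the deadlock.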

oxidize-kernels:
- Forced OXIDIZE_OXK_ISA modes are now gated by the same availability checks as
  auto, so forcing an unsupported ISA can't execute illegal instructions.
- q4k_avx2 next-tile prefetch no longer double-counts the row offset and uses
  wrapping_add (was UB via .add past the allocation).
- AVX-VNNI detection reads CPUID leaf 7 subleaf 1 EAX[4] (was subleaf 0 EDX[4]).
- MT x1 bench path runtime-guards the AVX2 kernel.

NUMA:
- Freed replica mappings on a lost REGIONS.set race (was leaking GBs).
- Robust online-node parsing for comma/range lists; node bitmask sized per node
  id (was capped at 64 nodes / UB shift).
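
A sketch of parsing a comma/range node list and building a bitmask sized from the highest id follows. `parse_id_list`, `node_mask`, and the `MAX_ID` cap are hypothetical; the input format is the one Linux uses in files such as /sys/devices/system/node/online (for example "0-1,4").

```rust
/// Illustrative cap so a malformed range cannot allocate without bound.
const MAX_ID: usize = 4095;

/// Parse a list such as "0-1,4,6-7" into sorted, de-duplicated ids.
/// Returns None on malformed input instead of guessing.
fn parse_id_list(s: &str) -> Option<Vec<usize>> {
    let mut ids = Vec::new();
    for part in s.split(',').map(str::trim).filter(|p| !p.is_empty()) {
        let (lo, hi) = match part.split_once('-') {
            Some((lo, hi)) => (
                lo.trim().parse::<usize>().ok()?,
                hi.trim().parse::<usize>().ok()?,
            ),
            None => {
                let id = part.parse::<usize>().ok()?;
                (id, id)
            }
        };
        if lo > hi || hi > MAX_ID {
            return None;
        }
        ids.extend(lo..=hi);
    }
    ids.sort_unstable();
    ids.dedup();
    Some(ids)
}

/// Bitmask with one bit per node id, sized from the highest id rather than a
/// single fixed 64-bit word, so ids of 64 and above never shift out of range.
fn node_mask(ids: &[usize]) -> Vec<u64> {
    let words = ids.iter().max().map_or(0, |&m| m / 64 + 1);
    let mut mask = vec![0u64; words];
    for &id in ids {
        mask[id / 64] |= 1u64 << (id % 64);
    }
    mask
}
```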

Correctness:
- flash_attention: overflow-checked head q_len so the unsafe per-head output
  slices can't run past the buffer.
- conversion: a fused gate_up_proj that fails to split is now a hard error
  (matches the streaming path) instead of emitting a broken MoE GGUF.
- safetensors->gguf: I16 (ggml type 25) byte size is 2; non-index/file
  conversions now honor target_quantization.
- dflash: dequant fallback transpose used swapped dims, corrupting weights when
  the quantized GEMV path was skipped — now mirrors the primary loader.
- quantization: Q4_K_S errors keep their variant (were mislabeled Q4_K_M).
- MTP stream: budget/stop checks run before draining the emit buffer, so a
  multi-token step can't over-emit past max_new_tokens or past a stop token.
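
The MTP stream fix in the last bullet amounts to clamping each multi-token step before anything is emitted. A minimal sketch is below; `clamp_step` is a hypothetical helper, and whether the stop token itself is emitted is an assumption (here it is not).

```rust
/// Decide which tokens of a multi-token step may be emitted.
/// Returns the tokens to emit and whether generation is finished.
/// Budget and stop checks run per token, before the token is pushed.
fn clamp_step(
    step: &[u32],
    already_emitted: usize,
    max_new_tokens: usize,
    stop: u32,
) -> (Vec<u32>, bool) {
    let budget = max_new_tokens.saturating_sub(already_emitted);
    let mut out = Vec::new();
    for &tok in step {
        if out.len() == budget || tok == stop {
            return (out, true);
        }
        out.push(tok);
    }
    // The step fit entirely; finished only if it used up the budget exactly.
    let done = out.len() == budget;
    (out, done)
}
```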

Build / perf / hygiene:
- build.rs: gate PTX compilation on CARGO_FEATURE_CUDA, probe nvcc.exe on
  Windows, drop the dead OXIDIZE_CUDA_PTX env.
- cuda: GpuState now destroys its cuBLAS handle on drop.
- inference_bench: reuse one layer's weights (was ~22GB OOM at 7B dims).
- fused MoE: don't zero gate/up scratch on the fused early-return path.
- finetuning: serde(default) for backward-compatible configs; skip (not clamp)
  out-of-range CE targets; avoid full-vector clone before truncation; keep
  packing-buffer capacity across flushes.
- plan doc: mermaid phase numbering matches the phase text.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment


4 issues found across 19 files (changes from recent commits).


Comment thread oxidize-core/benches/inference_bench.rs
Comment thread oxidize-core/src/compute/spinpool.rs
Comment thread oxidize-finetuning/src/fused.rs Outdated
Comment thread oxidize-core/src/format/safetensors_to_gguf.rs
Jackson57279 and others added 4 commits June 14, 2026 18:37
…K kernels

Add diffusion-gemma (Gemma-4 26B-A4B MoE block-diffusion) support, ported
faithfully from the llama.cpp diffusion-gemma4 reference graph (PR #24427):

- oxidize-core/src/model/diffusion_gemma.rs: GGUF loader + bidirectional
  canvas forward (QK-norm, scale-less V-norm, V=K on full layers, dual head
  dims 256/512, NEOX rope with proportional rope_freqs on full layers,
  attn scale 1.0), dual dense+routed-MoE FFN (128 experts top-8, fused
  gate_up split, per-expert/router scales), self-conditioning MLP, layer
  output scalar, final logit softcap, tied output head. Q5_0 down-projections
  (unsupported by OXK gemv) are dequantized to f32 at load.
- 48-step entropy-bound denoise loop (linear temp schedule, entropy-bound
  accept, stable-and-confident stop) matching the reference sampler.
- oxidize-cli bin diffusion_gemma_bench: runs one canvas, reports canvas
  tok/s + per-step mean-entropy trace. Build with --features oxk and run
  with OXIDIZE_GEMV=oxk to exercise the OXK kernels.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…_0->Q8_0

Make the DiffusionGemma OXK path correct *and* fast on CPU:

- Requantize Q5_0 down-projections to Q8_0 at load (OXK gemv has no Q5_0 path)
  instead of a scalar f32 fallback: near-lossless, ~4x less RAM, stays on the
  fast SIMD experts kernel.
- Batch the tied output head into one big GEMM over the whole canvas instead of
  256 sequential per-token GEMVs.
- Parallelize the scalar hot loops across the 256 canvas tokens (bidirectional
  attention, full-vocab softmax/entropy/sample) with rayon, avoiding nested
  parallelism with the kernels' own row-parallelism (nesting measured 2-4x
  slower; the per-token MoE stays sequential so its inner experts GEMV keeps the
  single level of parallelism).

Result on the 32-core CPU box (Q4_K_M): ~113 -> ~60 s/step, full core
utilization, entropy collapse unchanged (correctness preserved). Remaining gap
to llama.cpp's ~12 s/step is its batched mul_mat_id experts kernel.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… bench

Batch the routed MoE across the whole canvas: all nt*N_USED (token, expert)
pairs flow through ONE gate_up experts GEMV and ONE down experts GEMV (flattened
selections + per-slot strided inputs), giving a single level of rayon
parallelism over the full output instead of 256 nested per-token calls.

Result on the 32-core CPU box (Q4_K_M, 'What is the capital of France?'):
  - 60 -> 30 s/step; converges in 6 denoising steps (early-stop), 181 s total
  - 1.41 canvas tok/s end-to-end (vs llama.cpp reference ~1.0)
  - correct, coherent output: 'The capital of France is **Paris**.'

bench now decodes and prints the final canvas via the GGUF tokenizer.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resolve conflicts from PR #15 CUDA performance fixes while keeping OXK
kernel paths, NUMA replication, MTP conversion, and speculative decoding.
Ignore local deploy/ scripts; set core.filemode=false for external drive.

Co-authored-by: Cursor <cursoragent@cursor.com>
Only declare known binaries explicitly; gate diffusion_gemma_bench behind oxk.

Co-authored-by: Cursor <cursoragent@cursor.com>
Jackson57279 and others added 3 commits June 15, 2026 15:22
The OXK crate is a workspace member; Docker smoke tests failed without it.

Co-authored-by: Cursor <cursoragent@cursor.com>
Cargo validates [[bench]] paths at manifest load time.

Co-authored-by: Cursor <cursoragent@cursor.com>
Unblocks PR #16 CI after the master merge by fixing lint in core, server, CLI, finetuning, and bench targets.

Co-authored-by: Cursor <cursoragent@cursor.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 11 files (changes from recent commits).


Comment thread oxidize-core/src/format/conversion.rs Outdated
Comment thread oxidize-server/src/runtime/generate.rs Outdated
Jackson57279 and others added 2 commits June 16, 2026 02:27
Implements activation-aware one-shot pruning (Wanda, arxiv:2306.11695)
and per-output-row magnitude pruning (Han et al. 2015, with the
per-row comparison group from Wanda Table 7) on top of the existing
tensor-name substring filter. Both methods work on quantized GGUFs:
weights are dequantized to f32, masked, and re-quantized to the
original type (or a joint target via --joint-quantize).

New surface:
- oxidize-core::activation_stats::ActivationStats + CalibrationRunner:
  streaming per-input-neuron L2 accumulator (Wanda's X side).
- oxidize-prune::mask::{magnitude_mask, wanda_mask, apply_nm_pattern}:
  pure-Rust Wanda + 2:4 / 4:8 N:M structured mask primitives.
- oxidize-prune::wanda::{wanda_prune, magnitude_prune, WandaOptions}:
  full GGUF round-trip; reads quantized bytes, masks, requantizes,
  writes a new GGUF.
- L2-norms cache: simple text format, one row per linear weight, N
  f32 values per row. Loaded via --calibration; validated against
  the input GGUF.
- oxidize convert --prune wanda|magnitude: single-pass prune+quantize
  on a freshly-converted SafeTensors GGUF.
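
As a rough illustration of the per-output-row Wanda criterion, the sketch below scores each weight as |w| times the L2 norm of its input activation and drops the lowest-scoring fraction in every row. `wanda_mask_sketch` and its signature are hypothetical and are not the API of oxidize-prune::mask; passing all-ones norms reduces it to per-row magnitude pruning.

```rust
/// Keep-mask for one weight matrix stored row-major as `rows x cols`.
/// Returns None when lengths do not match or `sparsity` is outside [0, 1].
fn wanda_mask_sketch(
    w: &[f32],
    rows: usize,
    cols: usize,
    act_norm: &[f32],
    sparsity: f32,
) -> Option<Vec<bool>> {
    if w.len() != rows.checked_mul(cols)? || act_norm.len() != cols {
        return None;
    }
    if !(0.0..=1.0).contains(&sparsity) {
        return None;
    }
    let drop_per_row = ((cols as f32 * sparsity).floor() as usize).min(cols);
    let mut keep = vec![true; w.len()];
    let mut order: Vec<usize> = Vec::with_capacity(cols);
    for r in 0..rows {
        let row = &w[r * cols..(r + 1) * cols];
        order.clear();
        order.extend(0..cols);
        // total_cmp is a total order, so the sort stays well-defined even if
        // a score is NaN; ties break by column index for determinism.
        order.sort_by(|&a, &b| {
            let sa = row[a].abs() * act_norm[a];
            let sb = row[b].abs() * act_norm[b];
            sa.total_cmp(&sb).then(a.cmp(&b))
        });
        for &c in &order[..drop_per_row] {
            keep[r * cols + c] = false;
        }
    }
    Some(keep)
}
```

The activation term is what separates the two methods: a small weight on a high-norm input can outrank a larger weight on a quiet input and survive pruning.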

Tests: 14 in oxidize-prune (mask, wanda, magnitude, calibration
cache roundtrip, N:M patterns, full dequant/quant roundtrip) and 7
in oxidize-core (activation_stats streaming, merge, runner finalize).
All passing.

Plan: ~/.commandcode/plans/make-pruning-and-inference-faster.md
Refs: arxiv:2306.11695 (Wanda), arxiv:2301.00774 (SparseGPT).

Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
Add `oxidize-core::autotune` — a stateless orchestrator that
detects the host (CPU/ISA/RAM/NUMA/GPU/Metal/CUDA/WSL/cgroup
memory limit/hugepages), fingerprints the loaded GGUF model
(architecture, dims, MoE/MTP, dominant qtype, file size), and
produces a `TuningPlan` for the most-relevant inference knobs:
threads, ctx_size, kv_cache_dtype, kv_quantization, n_gpu_layers,
mmap/mlock/mmap_hugepages/mmap_prefetch, numa_replicate_dense,
layer_wise, layer_cache, pipeline (Sequential/Continuous/Paged/
Asymmetric), speculative (None/DFlash/Mtp), decode_tile_tokens
(FlashDecoding split-K), oxk_isa (Scalar/Avx2/Avx512), oxk_tile
(1/4/8/16), and a tok/s estimate.

Rules are an ordered table at `oxidize-core/src/autotune/rules.rs`:
- Tier 0: model-too-big-for-RAM forces layer_wise streaming.
- Tier 1: ISA + Skylake-SP gate (which disables AVX-512 on the
  regressing uarch; we lift `is_skylake_sp()` to public in
  `oxidize-kernels::cpu`).
- Tier 2: GPU offload (whole model on GPU when it fits; partial
  n_gpu_layers sized to 0.85 × usable VRAM per-layer; skip
  entirely when VRAM < 25% of model size).
- Tier 3: KV cache dtype (F16 on >=16 GiB VRAM, asymmetric INT8 in
  the 8–16 GiB band, TurboQuant INT4 on low-VRAM / very deep
  models) + ctx size capped to fit `total_ram * 0.6 - model`.
- Tier 4: layer cache + NUMA replication (NUMA only on dense,
  non-trivial core count, with a SIMD backend present).
- Tier 5: speculative decoding (MTP if `nextn.*` tensors, DFlash
  for qwen/llama/lfm2).
- Tier 6: threads (full physical_cores on CPU; clamped to 4–8 when
  a cgroup memory limit is present or when GPU does the work).
- Tier 7: decode tile (split-K above 1024 KV tokens on AVX2).
- Tier 8: pipeline (Paged on GPU, Continuous on 8+ cores / 64+
  GiB / dense, Sequential otherwise).
- tps estimate: `min(per-core tps × cores, RAM bandwidth / model
  bytes)` calibrated against the existing `results/bench/`
  numbers.
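
The tps estimate formula in the last bullet can be written out as a small function. The name `estimate_tps` and the choice to return `None` for degenerate inputs are assumptions; the formula itself is the one quoted above.

```rust
/// tok/s estimate as the smaller of a compute bound and a bandwidth bound:
/// min(per-core tps * cores, RAM bandwidth / model bytes).
/// The bandwidth term assumes each decoded token streams the model once.
fn estimate_tps(
    per_core_tps: f64,
    cores: usize,
    ram_bw_bytes_per_s: f64,
    model_bytes: f64,
) -> Option<f64> {
    if cores == 0 || model_bytes <= 0.0 {
        return None;
    }
    let compute_bound = per_core_tps * cores as f64;
    let bandwidth_bound = ram_bw_bytes_per_s / model_bytes;
    Some(compute_bound.min(bandwidth_bound))
}
```

For instance, 8 cores at 2 tok/s each give a compute bound of 16 tok/s, but a 4 GB model on 40 GB/s of memory bandwidth is capped at 10 tok/s, so the estimate is 10.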

CLI: `--auto` (default for `run`), `--no-auto`, `--print-plan`
(plain or `json`). Plan is applied to the `Args` struct before
the model is built: only fields the user didn't explicitly set
are touched (the `n_gpu_layers_set` and `kv_cache_dtype_set`
internals are derived from a `user_passed_flag` argv scan).
Server: `--auto` (default), `--no-auto`, `--print-plan` —
prints the plan to logs and re-derives server fields; explicit
flags win.
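
The "explicit flags win" rule depends on knowing which flags the user actually typed. A minimal argv scan in that spirit is sketched below; `flag_in_argv` is a hypothetical stand-in and is not claimed to match the real `user_passed_flag`.

```rust
/// True when `flag` (for example "--threads") appears in the arguments either
/// as a separate token or in `--flag=value` form. Scanning stops at a bare
/// `--`, after which everything is positional.
fn flag_in_argv<I, S>(args: I, flag: &str) -> bool
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    for arg in args {
        let arg = arg.as_ref();
        if arg == "--" {
            return false;
        }
        if arg == flag {
            return true;
        }
        // Accept `--flag=value` but not a longer flag sharing the prefix.
        if let Some(rest) = arg.strip_prefix(flag) {
            if rest.starts_with('=') {
                return true;
            }
        }
    }
    false
}
```

A plan field would then be applied only when the matching flag is absent, for example `if !flag_in_argv(std::env::args(), "--threads") { /* take the planned value */ }`.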

Tests: 16 new unit tests in `oxidize-core` (plan() table-driven
across desktop-no-GPU, desktop-70B-streaming, A100-32B,
A100-70B, MacBook Apple Silicon, MoE-on-low-cores, tiny-box,
AVX2-decode-tile). Smoke-tested locally: detects AMD AVX2 8c/16t
27 GiB no-GPU, plans Qwen3-4B at 8.2 tok/s decode (matches
existing benchmark). Smoke-tested K3 nodes ai-2@192.168.1.152
and ai@192.168.1.68: both Intel Xeon Silver 4110 family 6 model
85 (Skylake-SP, AVX-512 disabled by gate), 32 cores, ai has
325 GiB RAM. Plan: 32 threads, AVX2 x8, sequential → ~30 tok/s
decode on Qwen3-4B.

`scripts/auto_tune_report.sh` runs `oxidize run --no-api
--print-plan=json` locally or on a remote K3 node via sshpass and
emits a Markdown report. `AGENTS.md` updated with the autotune
and `is_skylake_sp` rows.

Co-authored-by: CommandCodeBot <noreply@commandcode.ai>

@cubic-dev-ai cubic-dev-ai Bot left a comment

13 issues found across 23 files (changes from recent commits).

Comment thread scripts/auto_tune_report.sh Outdated
Comment thread oxidize-cli/src/main.rs
Comment thread oxidize-prune/src/main.rs Outdated
Comment thread oxidize-core/src/autotune/detect.rs Outdated
Comment thread oxidize-core/src/autotune/detect.rs Outdated
Comment thread scripts/auto_tune_report.sh Outdated
Comment thread oxidize-server/src/runtime/model.rs
Comment thread oxidize-core/src/autotune/rules.rs Outdated
Comment thread oxidize-core/src/autotune/rules.rs Outdated
Comment thread oxidize-core/src/autotune/rules.rs Outdated
- Introduced the `oxidize-prune` package with dependencies on `anyhow`, `clap`, and `oxidize-core`.
- Updated `Cargo.toml` to include `oxidize-prune` as a workspace member.
- Modified `Dockerfile.server` to create a model cache directory for the `oxidize` user and changed the exposed port from 3000 to 8080.
- Removed the obsolete `serve.log` file.
- Enhanced `Args` struct in `oxidize-cli` to include `force_dflash` flag for speculative decoding.
- Updated inference configuration in `oxidize-core` to support DeepSeek architecture with new parameters for expert weights scaling and group routing.
- Various code style improvements and adjustments for better readability across multiple files.

@cubic-dev-ai cubic-dev-ai Bot left a comment

13 issues found across 36 files (changes from recent commits).

Comment thread ai2_probe.sh Outdated
Comment thread llama-qwen7b.yaml Outdated
Comment thread scripts/kimi_k2_ai2_continue_after_k27.sh Outdated
Comment thread oxidize-server/src/auth.rs Outdated
Comment thread oxidize-prune/src/filter.rs
Comment thread llama-qwen7b.yaml Outdated
Comment thread scripts/kimi_k2_ai2_pipeline.sh Outdated
Comment thread scripts/kimi_k2_ai2_pipeline.sh Outdated
Comment thread oxidize-core/src/model/inference.rs Outdated
Comment thread kimi-k2-merge-plan-v2.html Outdated
Jackson57279 and others added 2 commits June 17, 2026 01:40
Introduces a workspace crate to merge two HuggingFace SafeTensors models with linear or SLERP interpolation, per-category blend weights, and mmap-based sharded I/O for large checkpoints.
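
The two blend modes can be sketched as below, assuming each tensor is a flat `f32` slice. This is an illustration only: the function names are hypothetical, and falling back to linear interpolation when `sin(theta)` is near zero is one common guard, not necessarily the one `oxidize-merge` uses.

```rust
/// Linear blend: (1 - t) * a + t * b, element-wise.
pub fn lerp(a: &[f32], b: &[f32], t: f32) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "tensor length mismatch");
    a.iter().zip(b).map(|(x, y)| (1.0 - t) * x + t * y).collect()
}

/// Spherical linear interpolation over the flattened tensors.
/// Falls back to `lerp` for zero-norm, near-parallel, or near-antipodal
/// inputs, where dividing by sin(theta) would produce NaN/Inf weights.
pub fn slerp(a: &[f32], b: &[f32], t: f32) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "tensor length mismatch");
    // Accumulate in f64 to limit rounding error on large tensors.
    let dot: f64 = a.iter().zip(b).map(|(x, y)| *x as f64 * *y as f64).sum();
    let norm_a = a.iter().map(|x| (*x as f64).powi(2)).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| (*x as f64).powi(2)).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return lerp(a, b, t);
    }
    let cos_theta = (dot / (norm_a * norm_b)).clamp(-1.0, 1.0);
    let theta = cos_theta.acos();
    let sin_theta = theta.sin();
    const EPS: f64 = 1e-6; // illustrative threshold
    if sin_theta.abs() < EPS {
        return lerp(a, b, t);
    }
    let t = t as f64;
    let wa = (((1.0 - t) * theta).sin() / sin_theta) as f32;
    let wb = ((t * theta).sin() / sin_theta) as f32;
    a.iter().zip(b).map(|(x, y)| wa * x + wb * y).collect()
}
```

Per-category blend weights would then amount to choosing a different `t` for each tensor category before calling either function.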

Co-authored-by: Cursor <cursoragent@cursor.com>
- Added `oxidize-prune` dependency to leverage SIMD magnitude and Wanda masks for efficient tensor processing.
- Updated `AGENTS.md` to document the new `oxidize-prune` functionality and its dependencies on `oxidize-kernels`.
- Modified `Cargo.lock` to include `oxidize-kernels` and `rayon` for parallel processing.
- Refactored `oxidize-cli` to streamline command handling and improve usability.
- Cleaned up `continual-learning` state files to reflect recent changes in model handling.

This commit enhances the performance and capabilities of the oxidize framework, particularly in pruning and tensor operations.

@cubic-dev-ai cubic-dev-ai Bot left a comment

12 issues found across 31 files (changes from recent commits).

Comment thread oxidize-merge/src/blend.rs
Comment thread oxidize-kernels/src/prune.rs Outdated
Comment thread oxidize-kernels/src/prune.rs Outdated
Comment thread oxidize-core/src/backends/cuda.rs Outdated
Comment thread training-data/oxidize-codebase.jsonl Outdated
Comment thread oxidize-merge/src/merge.rs Outdated
Comment thread oxidize-merge/src/writer.rs
Comment thread oxidize-merge/src/index.rs Outdated
Comment thread oxidize-merge/src/index.rs
Comment thread oxidize-merge/src/blend.rs
…PU GEMV

Enable AMD inference via hipcc-compiled kernels and unified CUDA/ROCm dispatch, with RDMA ring transport scaffolding and ultra-low-bit quant fast paths for large GGUF models.

Co-authored-by: Cursor <cursoragent@cursor.com>
@cubic-dev-ai

cubic-dev-ai Bot commented Jun 17, 2026

You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment @cubic-dev-ai review.

Jackson57279 and others added 2 commits June 17, 2026 03:38
…and CUDA

Port hardware autotune, layer-wise/MTP/LoRA inference, draft loading, vision/video,
convert/prune/validation tooling, TCP mesh routing, and CUDA backend selection to
oxidize-golang with matching Python CLI and runtime wiring plus parity tests.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Updated `AGENTS.md` to clarify guidelines for extending Go/Python ports and GPU backend implementations.
- Improved handling of continual learning state files with additional metadata and timestamps.
- Refactored `diffusion_gemma_bench.rs` to ensure proper error handling during model generation.
- Adjusted `lib.rs` and `generate.rs` to enforce stricter Clippy linting rules, enhancing code quality.
- Removed obsolete `tensor.rs` file and reorganized module structure for better clarity.
- Added error handling in `block_pool.rs` and `scheduler.rs` to prevent panics and improve robustness.

These changes collectively enhance the functionality, maintainability, and reliability of the oxidize framework.
@Jackson57279

@cubic-dev-ai review

@cubic-dev-ai

cubic-dev-ai Bot commented Jun 17, 2026

@Jackson57279 I have started the AI code review. It will take a few minutes to complete.

Comment thread oxidize-golang/core/autotune/rules.go Fixed

@cubic-dev-ai cubic-dev-ai Bot left a comment

4 issues found across 16 files (changes from recent commits).

Comment thread AGENTS.md Outdated
Comment thread oxidize-core/src/paged_attention/scheduler.rs Outdated
Comment thread oxidize-core/src/compute/tensor/errors.rs
Comment thread oxidize-core/src/paged_attention/block_pool.rs
Jackson57279 and others added 2 commits June 17, 2026 05:16
…ariable'

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Address all outstanding cubic/codex review findings on the OXK-kernels PR.

Correctness / safety:
- spinpool: propagate a worker-chunk panic to the submitter after the
  ack-drain instead of only logging (no more silent incomplete output).
- kernels/prune: assert_eq! (not debug_assert_eq!) on weight/mask length so
  release builds don't silently leave weights unzeroed; use total_cmp for a
  strict weak ordering under NaN.
- merge/blend: guard slerp against near-antipodal vectors (sin_theta → 0)
  to avoid NaN/Inf weights; tighten the midpoint angle test.
- merge/index: error on conflicting shard metadata instead of silent
  overwrite; reject non-plain shard names (path-traversal guard).
- merge/writer: fail loudly when a shard referenced by the index is missing.
- finetuning/fused: fail fast on out-of-range targets in both the gradient
  and loss-only paths (was release-only silent skip vs clamp).
- cuda: don't evict the just-inserted quantized weight in the same budget
  pass (enforce_budget_protecting); cuBLAS handle lifetime unchanged.
- cli: build the rayon global pool after autotune finalizes --threads so the
  recommended thread count actually takes effect.
- prune: memory-map the model for calibration validation instead of reading
  the whole file (OOM on large models).
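
The kernels/prune item above can be illustrated with a scalar magnitude mask. This is a sketch under assumptions: `magnitude_mask` and `apply_mask` are hypothetical names, and the real kernels are SIMD and also cover Wanda masks, which this does not attempt.

```rust
/// Keep-mask (true = keep) that marks the `sparsity` fraction of
/// smallest-|w| entries for removal. `total_cmp` gives a total order
/// even when NaN is present; NaN magnitudes sort last and are kept.
pub fn magnitude_mask(weights: &[f32], sparsity: f32) -> Vec<bool> {
    let n_prune = (weights.len() as f32 * sparsity.clamp(0.0, 1.0)).floor() as usize;
    let mut order: Vec<usize> = (0..weights.len()).collect();
    order.sort_by(|&i, &j| weights[i].abs().total_cmp(&weights[j].abs()));
    let mut keep = vec![true; weights.len()];
    for &i in order.iter().take(n_prune) {
        keep[i] = false;
    }
    keep
}

/// Zero every weight whose mask entry is false.
/// `assert_eq!` rather than `debug_assert_eq!`: a length mismatch must
/// also fail in release builds, since `zip` would otherwise stop early
/// and leave the tail of `weights` unpruned.
pub fn apply_mask(weights: &mut [f32], keep: &[bool]) {
    assert_eq!(weights.len(), keep.len(), "weight/mask length mismatch");
    for (w, &k) in weights.iter_mut().zip(keep) {
        if !k {
            *w = 0.0;
        }
    }
}
```

Comparing with `partial_cmp` and unwrapping would panic on NaN, and mapping NaN to `Ordering::Equal` breaks the ordering contract that `sort_by` relies on, which is why `total_cmp` is used here.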

Autotune:
- detect: pick the highest-capability GPU family deterministically (rank,
  not nvidia-smi order); Display instead of Debug in --print-hardware.
- rules: KV budget accounts for GPU-offloaded layers; tier6 thread reduction
  no longer gated on oxk_isa (ARM/Neon); F16 KV rationale wording.
- server: drop layer_wise recommendation for DFlash models before logging.
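
The deterministic GPU pick described under "detect" might look something like the sketch below. The family list, rank values, and tie-break order (VRAM, then lowest device index) are assumptions for illustration; the actual ranking in `detect.rs` is not shown in this PR text.

```rust
use std::cmp::Reverse;

/// Hypothetical GPU families, ordered from least to most capable.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GpuFamilySketch {
    Unknown,
    Pascal,
    Turing,
    Ampere,
    Hopper,
}

impl GpuFamilySketch {
    /// Illustrative capability rank; higher is better.
    fn rank(self) -> u8 {
        match self {
            GpuFamilySketch::Unknown => 0,
            GpuFamilySketch::Pascal => 1,
            GpuFamilySketch::Turing => 2,
            GpuFamilySketch::Ampere => 3,
            GpuFamilySketch::Hopper => 4,
        }
    }
}

/// Hypothetical detected device.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct GpuSketch {
    pub index: u32,
    pub family: GpuFamilySketch,
    pub vram_mib: u64,
}

/// Pick by (family rank, VRAM, lowest index) so the result does not
/// depend on the order in which devices were enumerated.
pub fn pick_best_gpu(gpus: &[GpuSketch]) -> Option<&GpuSketch> {
    gpus.iter()
        .max_by_key(|g| (g.family.rank(), g.vram_mib, Reverse(g.index)))
}
```

Because the key is unique per device index, the same device is chosen however the input slice is ordered.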

Cleanup:
- conversion: extract StagedTensor alias, drop file-level type_complexity allow.
- server: collapse MTP if, drop collapsible_if allow; auth keys() returns an
  iterator (no per-request Vec alloc).
- tensor: move DType/ActivationFn out of errors.rs into types.rs.
- scheduler/block_pool: remove redundant HashMap lookups / id validation.
- prune/filter, merge/recipe: doc + classification fixes; k8s image tag pinned.
- AGENTS.md: clarify CGO is permitted for native GPU bindings.

Remove stray local experiment artifacts that leaked into the PR (personal
LAN scripts with a hard-coded SSH password, a k8s manifest, a planning HTML,
and a codebase training-data dump): ai2_probe.sh, llama-qwen7b.yaml,
kimi-k2-merge-plan-v2.html, training-data/oxidize-codebase.jsonl,
scripts/auto_tune_report.sh, scripts/kimi_k2_ai2_*.sh.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Jackson57279 Jackson57279 merged commit e5791f3 into master Jun 18, 2026
9 of 15 checks passed
@Jackson57279 Jackson57279 deleted the feat/oxk-kernels branch June 18, 2026 02:27