Perf/engine sweep by Jackson57279 · Pull Request #19 · Zapdev-labs/oxidize

Jackson57279 · 2026-06-18T09:42:03Z

Summary by cubic

Whole‑engine perf sweep with AVX‑512/AVX2 upgrades, CUDA GPU‑native forward pass and GPU lm_head, safer I/O, and per‑token OXK benchmarks. Adds DPO/QLoRA/RLHF tools in oxidize-finetuning, a CPU video training/generator pipeline in oxidize-train, importance‑weighted quantization, and a modular CLI.

Performance
- AVX‑512/AVX2 dot/GEMV upgrades: multi‑accumulator FMAs in flash‑attention and 16‑lane dot paths in tensor kernels.
- CUDA: GPU‑native forward for Q4_K/Q6_K GEMV and GPU lm_head GEMV; keeps activations on device and auto‑detects quantization.
- CUDA build: emit compute_75 PTX plus native cubins (SM80/89/90/120) when available to cut JIT time.
- I/O/safety/build: centralized read‑only mmap and volatile reads; removed unsafe F16 payload writes; codegen-units = 1.

New Features
- oxidize-finetuning: DPO trainer, QLoRA (NF4), PPO/RLHF scaffolding, adapter merge (Linear/SLERP/TIES), telemetry; CLI split into SFT/DPO/PPO/Merge.
- oxidize-train: CPU video pipeline with ffmpeg frame extraction, clip classifier, and autoregressive generator (new subcommands).
- OXK full decode‑token benchmarks in Rust/Go/Python, including bench_oxk.py; oxidize-quantize importance‑weighted quantization.
- oxidize-core/oxidize-cli: KV cache rewritten into eviction/storage modules with tests; CLI modularized (server/generation/model resolution/GPU cluster); added release workflow and CODEOWNERS.

Written for commit 8708715. Summary will update on new commits.

Introduce a video pipeline for TikTok-style clip datasets with ffmpeg frame extraction, a trainable patch-embedding classifier, and a prototype subcommand that renders averaged base clips while excluding selected creators. Co-authored-by: Cursor <cursoragent@cursor.com>

Whole-engine perf sweep, round 1. All wins verified against existing bit-exact GEMM/attention tests (AVX2 fallback locally; AVX-512 paths validated on Skylake-SP target).

- flash_attention: dot_product_f32 avx512/avx2 now use 4 independent accumulators + a 16/8-wide remainder loop, breaking the single-chain FMA latency bottleneck on short head_dim (64/96/128) loops. Same for the f16 dot (2 accumulators).
- flash_attention: vectorize the f32 KvElem::axpy (decode V-accumulation) with AVX-512/AVX2 FMA instead of a scalar loop.
- tensor/kernels: add dot4_f32_avx512 / dot_f32_avx512 (16-wide) and dispatch the Q4_K/Q6_K/Q8_0 decode-once GEMM and dot_f32_fast to them when avx512f+vl are present. Doubles dot lanes on the hottest quantized matmul path for AVX-512 hardware.
- tensor/kernels: gemm_f32_cpu inner loop replaced its autovectorization-blocking black_box "prefetch" with the SIMD dot_product_f32 over the (now contiguous) transposed column.
- Cargo.toml: codegen-units = 1 for whole-crate inlining/LTO scope.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
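The multi-accumulator idea in the first bullet can be illustrated without intrinsics. The following is a minimal scalar sketch, not the PR's `dot_product_f32`: the function name `dot_f32_4acc` is hypothetical, and the real kernels use AVX-512/AVX2 FMA registers rather than plain `f32` lanes. It only shows why four independent partial sums shorten the dependency chain.

```rust
/// Scalar sketch of a dot product with four independent accumulators.
/// `dot_f32_4acc` is a hypothetical name used for illustration only.
pub fn dot_f32_4acc(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "length mismatch");
    let mut acc = [0.0f32; 4];
    let mut a_chunks = a.chunks_exact(4);
    let mut b_chunks = b.chunks_exact(4);
    for (ca, cb) in a_chunks.by_ref().zip(b_chunks.by_ref()) {
        // Each lane updates its own accumulator, so no multiply-add has to
        // wait for the result of the previous one in the same iteration.
        for lane in 0..4 {
            acc[lane] += ca[lane] * cb[lane];
        }
    }
    // Remainder loop for lengths that are not a multiple of four.
    let tail: f32 = a_chunks
        .remainder()
        .iter()
        .zip(b_chunks.remainder())
        .map(|(x, y)| x * y)
        .sum();
    // Combine the partial sums once at the end.
    (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail
}
```

Splitting the sum changes the order of floating-point additions, so results can differ in the last bits from a single-accumulator loop; any bit-exact test has to compare against a reference that uses the same summation order.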

Extract safe wrappers for read-only mmap, volatile page reads, and Q8_K bsum decoding so format and kernel call sites avoid scattered unsafe blocks. Co-authored-by: Cursor <cursoragent@cursor.com>

Use shared map_readonly and read_volatile_byte for model loading and NUMA warm-up checksums. Co-authored-by: Cursor <cursoragent@cursor.com>

Replace inline unsafe Mmap::map calls in SafeTensors-to-GGUF conversion and oxidize-merge index reads. Co-authored-by: Cursor <cursoragent@cursor.com>

read_q8_k_bsum now takes a byte slice instead of a raw pointer for safer kernel boundaries. Co-authored-by: Cursor <cursoragent@cursor.com>

load_q8_block helpers take &[u8] with debug_assert bounds checks instead of pointer arithmetic at call sites. Co-authored-by: Cursor <cursoragent@cursor.com>

…ode API. Add quantize_scalar_weighted, export IQ4 block constants, and refresh tensor module docs for GPU dispatch. Co-authored-by: Cursor <cursoragent@cursor.com>

…line. Add streaming weighted quantization options to the quantize tool and serialize hidden-state payloads with safe byte copies in the distributed pipeline. Co-authored-by: Cursor <cursoragent@cursor.com>

Clarify lifetime extension and disjoint slice invariants without changing behavior. Co-authored-by: Cursor <cursoragent@cursor.com>

Measure per-token GEMV throughput across all layers instead of a single isolated matvec. Co-authored-by: Cursor <cursoragent@cursor.com>

Port the full-layer GEMV token benchmark used to compare Rust, Go, and Python throughput. Co-authored-by: Cursor <cursoragent@cursor.com>

Ship a small command entrypoint for the Go OXK benchmark runner. Co-authored-by: Cursor <cursoragent@cursor.com>

Mirror the Go/Rust full-token GEMV benchmark in the pure-Python port. Co-authored-by: Cursor <cursoragent@cursor.com>

Introduce preference optimization training and 4-bit NF4 LoRA support as library building blocks. Co-authored-by: Cursor <cursoragent@cursor.com>

…ing. Support linear/SLERP/TIES adapter merging plus a PPO trainer stub for future RLHF workflows. Co-authored-by: Cursor <cursoragent@cursor.com>

Expose metrics logging, early stopping helpers, and public module surface for DPO/QLoRA/merge/RLHF. Co-authored-by: Cursor <cursoragent@cursor.com>

Replace the single-entry SFT flow with subcommands for supervised training, DPO, PPO stub, and adapter merging. Co-authored-by: Cursor <cursoragent@cursor.com>

…dule. Drop prototype.rs in favor of generator.rs for next-frame video model training and sampling. Co-authored-by: Cursor <cursoragent@cursor.com>

Add gen-train/generate exports, virality-first defaults, and dataset exclude/alias merge options. Co-authored-by: Cursor <cursoragent@cursor.com>

…tooling. Filter excluded usernames before training and support reconstructing preview images from normalized patches. Co-authored-by: Cursor <cursoragent@cursor.com>

Extend the train binary with generator training and sampling subcommands alongside the clip classifier. Co-authored-by: Cursor <cursoragent@cursor.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 962714cf39

chatgpt-codex-connector · 2026-06-18T09:47:45Z

+	blockOut[3] = byte(bits >> 24)
+	qsOff := 4
+	for i, v := range blockIn {
+		q := int(iscale * float64(v))


Use rounding for Q8_K activation quantization

When the Go Q4_K GEMV path quantizes an input vector to Q8_K, this truncates scaled activations toward zero, whereas the Rust kernel rounds before clamping (quantize_block_q8_k_scalar uses scaled.round()). For any non-integer scaled activation, the Go port produces different q8 bytes and bsums, so row-dot outputs no longer match the Rust/ggml reference and accuracy regresses; use math.Round before clamping.
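The divergence described above can be reproduced in a few lines. This is a hedged sketch in Rust, not the repository's `quantize_block_q8_k_scalar`: the helper names and the way `iscale` is supplied are assumptions made for illustration. It contrasts round-then-clamp with truncate-then-clamp and shows that the block sum (the bsum) changes as well.

```rust
/// Round-to-nearest then clamp, the behaviour the review attributes to the Rust kernel.
/// Helper names here are hypothetical.
pub fn quantize_round(v: f32, iscale: f32) -> i8 {
    (iscale * v).round().clamp(-128.0, 127.0) as i8
}

/// Truncate toward zero then clamp, the behaviour of `int(iscale * v)` in the ports.
pub fn quantize_trunc(v: f32, iscale: f32) -> i8 {
    (iscale * v).trunc().clamp(-128.0, 127.0) as i8
}

/// Sum of the quantized values of a block, using the given quantizer.
pub fn block_sum(block: &[f32], iscale: f32, quantize: fn(f32, f32) -> i8) -> i32 {
    block.iter().map(|&v| quantize(v, iscale) as i32).sum()
}
```

For the inputs `[0.6, 1.7, -2.7, 3.2]` with `iscale = 1.0`, rounding gives `1, 2, -3, 3` while truncation gives `0, 1, -2, 3`, so both the quantized bytes and their sum differ.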


chatgpt-codex-connector · 2026-06-18T09:47:45Z

+    struct.pack_into("<f", block_out, 0, d)
+    qs_off = 4
+    for i, v in enumerate(block_in):
+        q = int(iscale * v)


Use rounding for Q8_K activation quantization

When the Python OXK path quantizes activations for Q4_K × Q8_K GEMV, int(...) truncates instead of rounding, unlike the Rust implementation's scaled.round() before clamping. For ordinary inputs whose scaled value is fractional, this changes the Q8_K block and its bsums, so the Python port no longer has the advertised bit-identical math and can return different logits; round the scaled value first.


chatgpt-codex-connector · 2026-06-18T09:47:45Z

+    // reference-model log-prob tracking. Wire DpoTrainer here once the
+    // model-side gradient API stabilises.
+    eprintln!("oxidize-finetuning dpo: full training loop not yet wired — coming soon");
+    Ok(())


Return an error instead of succeeding for unwired DPO

When a user runs the advertised dpo subcommand with a valid JSONL file, this path prints that an output will be written and then exits successfully without loading the model, training, or creating args.output. That makes scripts treat a no-op as a completed fine-tune and can leave callers consuming a missing or stale adapter; either wire the DPO trainer/export here or return a non-zero error until it is implemented.
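One way to make the stub fail loudly is sketched below. It is a standard-library-only illustration under assumptions: `CliError`, `run_dpo`, and `exit_code` are hypothetical names, and the actual CLI appears to use `anyhow`, where `anyhow::bail!` would serve the same purpose.

```rust
use std::fmt;

/// Hypothetical error type for subcommands that are advertised but not wired.
#[derive(Debug, PartialEq)]
pub enum CliError {
    NotImplemented(&'static str),
}

impl fmt::Display for CliError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CliError::NotImplemented(cmd) => {
                write!(f, "{cmd}: training loop not yet wired")
            }
        }
    }
}

impl std::error::Error for CliError {}

/// Stub that reports failure instead of returning `Ok(())`.
pub fn run_dpo() -> Result<(), CliError> {
    Err(CliError::NotImplemented("dpo"))
}

/// Map a subcommand result to a process exit code.
pub fn exit_code(result: &Result<(), CliError>) -> i32 {
    match result {
        Ok(()) => 0,
        Err(_) => 1,
    }
}
```

With this shape, a script that checks the exit status sees a failure rather than a silent no-op, and no caller goes on to read an adapter that was never written.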


chatgpt-codex-connector · 2026-06-18T09:47:45Z

+    // Full PPO requires a reward model and rollout collection loop.
+    // Wire PpoTrainer + RewardModel here once the reward-model API stabilises.
+    eprintln!("oxidize-finetuning ppo: full training loop not yet wired — coming soon");
+    Ok(())


Return an error instead of succeeding for unwired PPO

When a user runs the advertised ppo subcommand, this path reports the target output and then exits Ok(()) without checking the model, collecting rollouts, training, or writing the adapter. Automation will see a successful RLHF run even though no artifact exists, so this should fail non-zero until the PPO loop is actually connected.


cubic-dev-ai

40 issues found across 46 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="oxidize-core/src/util/bytes.rs">

<violation number="1" location="oxidize-core/src/util/bytes.rs:13">
P1: Safe wrapper `map_readonly` wraps `unsafe Mmap::map` but relies on an unenforceable caller invariant (file must stay unmodified). Violating this invariant causes UB, making the safe signature unsound.</violation>

<violation number="2" location="oxidize-core/src/util/bytes.rs:31">
P0: Public safe function `read_volatile_byte` can cause undefined behavior when called with an out-of-bounds offset. A safe Rust function must never expose UB regardless of inputs.</violation>
</file>

<file name="oxidize-python/oxidize_python/core/oxk.py">

<violation number="1" location="oxidize-python/oxidize_python/core/oxk.py:113">
P1: `os.stderr` does not exist — will raise `AttributeError` at runtime when an invalid `OXIDIZE_OXK_PF_HINT` value is encountered. Must be `sys.stderr` with `import sys`.</violation>
</file>

<file name="oxidize-kernels/benches/oxk_token_bench.rs">

<violation number="1" location="oxidize-kernels/benches/oxk_token_bench.rs:1">
P1: Benchmark file added without a `[[bench]]` entry in Cargo.toml — `cargo bench` will not discover it.</violation>

<violation number="2" location="oxidize-kernels/benches/oxk_token_bench.rs:72">
P2: Duplicated fill_pseudo + f16 header taming logic already in oxk_q4k_bench.rs — should be shared (e.g. a test-util module or pub fn in the crate).</violation>
</file>

<file name="oxidize-finetuning/src/dpo.rs">

<violation number="1" location="oxidize-finetuning/src/dpo.rs:311">
P1: Dead `else` branch: `reference_free` flag has no effect — both branches return `(0.0, 0.0)`, so DPO always runs in reference-free mode regardless of the config.</violation>

<violation number="2" location="oxidize-finetuning/src/dpo.rs:404">
P2: Inconsistent `epochs=0` handling: `.max(1)` silently forces at least 1 epoch, while the existing SFT trainer respects `epochs=0` as zero iterations.</violation>
</file>

<file name="oxidize-golang/core/oxk/oxk.go">

<violation number="1" location="oxidize-golang/core/oxk/oxk.go:254">
P1: Exported library functions panic on invalid input instead of returning errors. This can crash host applications and violates repository error-handling conventions.</violation>

<violation number="2" location="oxidize-golang/core/oxk/oxk.go:259">
P2: Quantization loop allocates chunk slices before processing. Replace with index-based slicing to avoid per-call allocation overhead.</violation>

<violation number="3" location="oxidize-golang/core/oxk/oxk.go:291">
P2: Round the scaled activation before clamping when quantizing Q8_K values. Truncating toward zero changes q8 bytes/bsums and can make Go GEMV outputs diverge from the Rust kernel.</violation>
</file>

<file name="oxidize-finetuning/src/merge.rs">

<violation number="1" location="oxidize-finetuning/src/merge.rs:49">
P1: Slerp path accepts more than two adapters but silently ignores all after the first two. This can drop user inputs and return an unintended merge result.</violation>

<violation number="2" location="oxidize-finetuning/src/merge.rs:120">
P1: `slerp_merge` can interpolate adapters with different target/scale because compatibility checks are incomplete. The returned adapter keeps `a` metadata, causing incorrect semantics when mismatched inputs are passed.</violation>

<violation number="3" location="oxidize-finetuning/src/merge.rs:287">
P1: Linear/TIES compatibility checks omit `target` and `scale`, so incompatible adapters can merge successfully. Result keeps first adapter metadata, which can mislabel target and apply incorrect scaling at inference.</violation>
</file>

<file name="oxidize-train/src/main.rs">

<violation number="1" location="oxidize-train/src/main.rs:109">
P1: Clap `bool` field with `default_value_t = true` cannot be set to `false` from the CLI. The default `SetTrue` action means `--merge-aliases` only sets it to `true` (same as default), with no `--no-merge-aliases` negation. Users cannot disable merge_aliases.</violation>

<violation number="2" location="oxidize-train/src/main.rs:255">
P2: Misleading ffmpeg warning in `run_gen_train`: `load_gen_dataset()` hard-errors on missing ffmpeg, so the command always fails. The warning implies degraded-but-working operation like `run_video`, but that's not the case here.</violation>
</file>

<file name="oxidize-golang/cmd/bench_oxk/main.go">

<violation number="1" location="oxidize-golang/cmd/bench_oxk/main.go:131">
P1: Missing lower-bound validation for `tokens`; OXK_BENCH_TOKENS=0 causes no iterations and divide-by-zero in throughput calculations. Rust port guards with `.max(1)`.</violation>
</file>

<file name="oxidize-train/src/video/generator.rs">

<violation number="1" location="oxidize-train/src/video/generator.rs:410">
P2: Validation split leaks clip information because it is pair-level, not clip-level. Reported val_loss can be overly optimistic and unstable for model selection.</violation>

<violation number="2" location="oxidize-train/src/video/generator.rs:424">
P2: batch_size is effectively ignored because optimization runs one sample at a time inside each chunk. This makes the config misleading and prevents expected minibatch training/perf behavior.</violation>

<violation number="3" location="oxidize-train/src/video/generator.rs:435">
P1: val_split=0 makes val_loss constant 0, so best_model locks to epoch 1. Final model can drop later training progress.</violation>
</file>

<file name="oxidize-finetuning/src/main.rs">

<violation number="1" location="oxidize-finetuning/src/main.rs:191">
P2: Default output path "merged-adapter.gguf" is misleading — export_lora_gguf treats it as a directory and writes adapter_manifest.json inside, creating merged-adapter.gguf/ instead of a file.</violation>

<violation number="2" location="oxidize-finetuning/src/main.rs:360">
P1: This path reports that DPO training is not wired but still returns success. Return an error until the implementation is connected so automation does not treat a no-op as a completed run.</violation>

<violation number="3" location="oxidize-finetuning/src/main.rs:384">
P1: This stub prints that PPO is not wired and then returns `Ok(())`. Return a non-zero error until PPO training is implemented to avoid false-success runs.</violation>

<violation number="4" location="oxidize-finetuning/src/main.rs:458">
P2: TIES density is hardcoded to 0.5 with no CLI flag to customize it, making the ties strategy inflexible.</violation>
</file>

<file name="oxidize-train/src/video/config.rs">

<violation number="1" location="oxidize-train/src/video/config.rs:121">
P2: Missing validation: epochs == 0 allows a training run that performs zero iterations — a silent no-op.</violation>

<violation number="2" location="oxidize-train/src/video/config.rs:123">
P2: Missing validation: learning_rate <= 0.0 silently prevents training progress or causes divergence.</violation>
</file>

<file name="oxidize-merge/src/index.rs">

<violation number="1" location="oxidize-merge/src/index.rs:45">
P2: Duplicate `map_readonly` — identical wrapper already exists in oxidize-core/src/util/bytes.rs. Consider depending on oxidize-core or extracting a shared utility crate to avoid diverging SAFETY comments and duplicated unsafe wrappers.</violation>

<violation number="2" location="oxidize-merge/src/index.rs:46">
P2: SAFETY comment doesn't justify the unsafe call — it describes the mapping type and ownership, not the actual invariant (file must not be modified while mapped). The existing `map_readonly` in oxidize-core/src/util/bytes.rs has a correct version of this comment.</violation>
</file>

<file name="bench_oxk.py">

<violation number="1" location="bench_oxk.py:98">
P2: Missing lower-bound guard on `tokens` allows ZeroDivisionError when OXK_BENCH_TOKENS=0, unlike `n_layers` which is clamped with max(1, …).</violation>
</file>

<file name="oxidize-finetuning/src/qlora.rs">

<violation number="1" location="oxidize-finetuning/src/qlora.rs:61">
P2: `NF4Block::scale` is never read and its doc ("alpha / absmax") contradicts its assignment (`absmax`). Dead field + misleading documentation.</violation>

<violation number="2" location="oxidize-finetuning/src/qlora.rs:205">
P2: forward_batch fully dequantizes the frozen base weight into a new Vec every call. Frozen weights should be dequantized once (e.g. on construction or lazily cached) to avoid repeated allocation and recomputation.</violation>

<violation number="3" location="oxidize-finetuning/src/qlora.rs:257">
P2: adam_step accepts beta1/beta2/eps params that are silently ignored. Callers passing non-defaults get no effect — misleading API.</violation>
</file>

<file name="oxidize-quantize/src/main.rs">

<violation number="1" location="oxidize-quantize/src/main.rs:71">
P2: `nval * 4` can overflow usize before `checked_add` on 32-bit targets. Use `checked_mul(4).and_then(|b| cursor.checked_add(b))` instead.</violation>
</file>

<file name="oxidize-train/src/video/frames.rs">

<violation number="1" location="oxidize-train/src/video/frames.rs:188">
P2: patches_to_image indexes into `patches` without bounds checks — a short or malformed slice causes a panic. Add an early length guard or use get()/chunks_exact().</violation>
</file>

<file name="oxidize-cli/src/pipeline.rs">

<violation number="1" location="oxidize-cli/src/pipeline.rs:216">
P2: Per-call `Vec::with_capacity` allocation in `send_hidden` is a performance regression in the hot decode loop. Consider adding a reusable `&mut Vec<u8>` scratch parameter instead of allocating each call.</violation>

<violation number="2" location="oxidize-cli/src/pipeline.rs:256">
P2: Per-call `vec![0u8; nbytes]` allocation in `recv_hidden_payload` is a performance regression. The previously reused `f16_scratch` buffer is now unused; repurpose it (or a new byte scratch param) to avoid per-step allocation.</violation>
</file>

<file name="oxidize-core/src/model/layer_wise.rs">

<violation number="1" location="oxidize-core/src/model/layer_wise.rs:948">
P2: SAFETY comment drops the access invariant ("we only touch kv/ssm state below"). Without it, a future edit adding a cache-mutating `&mut self` call would create UB with no local reminder to check.</violation>
</file>

<file name="oxidize-finetuning/src/telemetry.rs">

<violation number="1" location="oxidize-finetuning/src/telemetry.rs:79">
P2: Returning 0.0 as mean loss for empty history is semantically incorrect — callers cannot distinguish "no data" from "average loss is zero".</violation>

<violation number="2" location="oxidize-finetuning/src/telemetry.rs:81">
P2: n=0 is treated as "all history" instead of "no steps" — a caller passing 0 likely has a bug, but the function silently returns the full-history mean.</violation>
</file>

<file name="oxidize-train/src/video/dataset.rs">

<violation number="1" location="oxidize-train/src/video/dataset.rs:285">
P2: f32 precision loss in split_indices: casting num_clips to f32 truncates large counts before multiplying by val_split. Use f64 for the intermediate calculation.</violation>
</file>

<file name="oxidize-train/src/video/error.rs">

<violation number="1" location="oxidize-train/src/video/error.rs:44">
P2: VideoError::Training variant with #[from] is dead code — never constructed, forcing exhaustive-match arms for an unreachable case in a public enum.</violation>
</file>


cubic-dev-ai · 2026-06-18T09:57:45Z

+#[inline]
+pub fn read_volatile_byte(bytes: &[u8], offset: usize) -> u8 {
+    // SAFETY: `offset` is always page-aligned and in-bounds at call sites.
+    unsafe { std::ptr::read_volatile(bytes.as_ptr().add(offset)) }


P0: Public safe function read_volatile_byte can cause undefined behavior when called with an out-of-bounds offset. A safe Rust function must never expose UB regardless of inputs.

Location: oxidize-core/src/util/bytes.rs, line 31

Suggested change

-    unsafe { std::ptr::read_volatile(bytes.as_ptr().add(offset)) }
+    assert!(offset < bytes.len(), "offset out of bounds");
+    unsafe { std::ptr::read_volatile(bytes.as_ptr().add(offset)) }
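An alternative to asserting is to let the slice do the bounds check and report failure to the caller. The sketch below is one possible shape, not the PR's code: `read_volatile_byte_checked` is a hypothetical name, and returning `Option<u8>` is an assumption about what call sites would accept.

```rust
/// Volatile read of one byte that cannot go out of bounds.
/// Hypothetical variant of `read_volatile_byte`; returns `None` for a bad offset.
pub fn read_volatile_byte_checked(bytes: &[u8], offset: usize) -> Option<u8> {
    // `get` performs the bounds check and yields a valid reference or `None`.
    let byte_ref: &u8 = bytes.get(offset)?;
    // SAFETY: `byte_ref` comes from a live slice, so the pointer is non-null,
    // aligned for `u8`, and points at initialized memory for the whole read.
    Some(unsafe { std::ptr::read_volatile(byte_ref as *const u8) })
}
```

Because the pointer is derived from a checked reference, the `unsafe` block no longer relies on an invariant that only call sites can uphold.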

cubic-dev-ai · 2026-06-18T09:57:45Z

+                elif hint == "t0" or hint == "":
+                    pf_nta = False
+                else:
+                    print(f"OXIDIZE_OXK_PF_HINT={hint} unknown (use t0|nta); using t0", file=os.stderr)


P1: os.stderr does not exist — will raise AttributeError at runtime when an invalid OXIDIZE_OXK_PF_HINT value is encountered. Must be sys.stderr with import sys.

Location: oxidize-python/oxidize_python/core/oxk.py, line 113

cubic-dev-ai · 2026-06-18T09:57:45Z

@@ -0,0 +1,174 @@
+//! OXK full decode-token bench — Qwen3-30B-A3B.


P1: Benchmark file added without a [[bench]] entry in Cargo.toml — cargo bench will not discover it.

Location: oxidize-kernels/benches/oxk_token_bench.rs, line 1

cubic-dev-ai · 2026-06-18T09:57:45Z

+        // (treated as uniform/constant), so the ratio collapses to the policy
+        // ratio alone and the reference terms are 0.
+        let (ref_c, ref_r) = if self.dpo_config.reference_free {
+            (0.0_f32, 0.0_f32)


P1: Dead else branch: reference_free flag has no effect — both branches return (0.0, 0.0), so DPO always runs in reference-free mode regardless of the config.

Location: oxidize-finetuning/src/dpo.rs, line 311
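For context, the flag would only matter if the non-reference-free branch used real reference log-probabilities. The sketch below shows the standard DPO objective with such a switch; it is an illustration under assumptions, not the `DpoTrainer` implementation, and the function names and argument order are hypothetical.

```rust
/// Numerically stable softplus: ln(1 + e^x).
pub fn softplus(x: f32) -> f32 {
    x.max(0.0) + (-x.abs()).exp().ln_1p()
}

/// Standard DPO loss for one preference pair (hypothetical signature).
/// `policy_*` and `ref_*` are summed log-probabilities of the chosen (`c`)
/// and rejected (`r`) responses under the policy and the reference model.
pub fn dpo_loss(
    policy_c: f32,
    policy_r: f32,
    ref_c: f32,
    ref_r: f32,
    beta: f32,
    reference_free: bool,
) -> f32 {
    // In reference-free mode the reference terms are dropped; otherwise they
    // must come from an actual reference model for the flag to have any effect.
    let (ref_c, ref_r) = if reference_free { (0.0, 0.0) } else { (ref_c, ref_r) };
    let margin = beta * ((policy_c - ref_c) - (policy_r - ref_r));
    // -ln(sigmoid(margin)) == softplus(-margin)
    softplus(-margin)
}
```

When the policy and reference agree exactly, the margin is zero and the loss is ln 2; with `reference_free = true` the same inputs give a different loss, which is the behavioural difference the dead branch currently hides.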

cubic-dev-ai · 2026-06-18T09:57:45Z

+// QuantizeQ8KInto quantizes vector (length nBlocks*256) into nBlocks Q8_K blocks.
+func QuantizeQ8KInto(vector []float32, nBlocks int, out []byte) {
+	if len(vector) != nBlocks*QK_K {
+		panic("vector length mismatch")


P1: Exported library functions panic on invalid input instead of returning errors. This can crash host applications and violates repository error-handling conventions.

Location: oxidize-golang/core/oxk/oxk.go, line 254

cubic-dev-ai · 2026-06-18T09:57:46Z

+        &mut self,
+        step: usize,
+        lr: f32,
+        _beta1: f32,


P2: adam_step accepts beta1/beta2/eps params that are silently ignored. Callers passing non-defaults get no effect — misleading API.

Location: oxidize-finetuning/src/qlora.rs, line 257
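A step function that honours its hyperparameters could look like the following. This is a generic Adam update with bias correction, written as a sketch: `AdamState` and this free-function form of `adam_step` are assumptions for illustration and do not mirror the QLoRA module's actual types.

```rust
/// First and second moment estimates for a flat parameter vector (hypothetical type).
pub struct AdamState {
    pub m: Vec<f32>,
    pub v: Vec<f32>,
}

impl AdamState {
    pub fn new(n: usize) -> Self {
        Self { m: vec![0.0; n], v: vec![0.0; n] }
    }
}

/// One Adam update that actually uses `beta1`, `beta2`, and `eps`.
/// `step` is 1-based, as required by the bias-correction terms.
pub fn adam_step(
    params: &mut [f32],
    grads: &[f32],
    state: &mut AdamState,
    step: usize,
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
) {
    assert!(step >= 1, "step must start at 1");
    assert_eq!(params.len(), grads.len());
    assert_eq!(params.len(), state.m.len());
    let bias1 = 1.0 - beta1.powi(step as i32);
    let bias2 = 1.0 - beta2.powi(step as i32);
    for i in 0..params.len() {
        let g = grads[i];
        state.m[i] = beta1 * state.m[i] + (1.0 - beta1) * g;
        state.v[i] = beta2 * state.v[i] + (1.0 - beta2) * g * g;
        let m_hat = state.m[i] / bias1;
        let v_hat = state.v[i] / bias2;
        params[i] -= lr * m_hat / (v_hat.sqrt() + eps);
    }
}
```

If the trainer is only ever meant to run with fixed defaults, the alternative is to drop the three parameters from the signature so that callers cannot pass values that have no effect.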

cubic-dev-ai · 2026-06-18T09:57:47Z

+    let strategy = match args.strategy.to_lowercase().as_str() {
+        "linear" => MergeStrategy::Linear,
+        "slerp" => MergeStrategy::Slerp,
+        "ties" => MergeStrategy::Ties { density: 0.5 },


P2: TIES density is hardcoded to 0.5 with no CLI flag to customize it, making the ties strategy inflexible.

Location: oxidize-finetuning/src/main.rs, line 458
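Threading the density through the strategy parser is a small change. The sketch below is a stand-alone illustration: the `MergeStrategy` enum is redefined locally as a stand-in for the crate's type, `parse_strategy` and its `ties_density` parameter are hypothetical, and the accepted range (0, 1] is an assumption about what a sensible density is.

```rust
/// Local stand-in for the crate's merge strategy enum.
#[derive(Debug, PartialEq)]
pub enum MergeStrategy {
    Linear,
    Slerp,
    Ties { density: f32 },
}

/// Parse a strategy name, taking the TIES density from the caller
/// (for example from a CLI flag) instead of hardcoding it.
pub fn parse_strategy(name: &str, ties_density: f32) -> Result<MergeStrategy, String> {
    match name.to_lowercase().as_str() {
        "linear" => Ok(MergeStrategy::Linear),
        "slerp" => Ok(MergeStrategy::Slerp),
        "ties" => {
            // Assumed valid range: a fraction of parameters to keep, in (0, 1].
            if !(ties_density > 0.0 && ties_density <= 1.0) {
                return Err(format!("ties density must be in (0, 1], got {ties_density}"));
            }
            Ok(MergeStrategy::Ties { density: ties_density })
        }
        other => Err(format!("unknown merge strategy: {other}")),
    }
}
```

Keeping 0.5 as the default value of the flag would preserve current behaviour for existing invocations.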

cubic-dev-ai · 2026-06-18T09:57:47Z

+        if self.history.is_empty() {
+            return 0.0;
+        }
+        let slice = if n == 0 || n >= self.history.len() {


P2: n=0 is treated as "all history" instead of "no steps" — a caller passing 0 likely has a bug, but the function silently returns the full-history mean.

Location: oxidize-finetuning/src/telemetry.rs, line 81
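Both telemetry findings (empty history and `n = 0`) go away if the function returns an `Option`. A minimal sketch, assuming the history is a plain slice of losses and using the hypothetical name `mean_recent_loss`:

```rust
/// Mean of the last `n` recorded losses.
/// Returns `None` when there is nothing to average, so callers can tell
/// "no data" apart from "the average loss is zero".
pub fn mean_recent_loss(history: &[f32], n: usize) -> Option<f32> {
    if history.is_empty() || n == 0 {
        return None;
    }
    let take = n.min(history.len());
    let recent = &history[history.len() - take..];
    Some(recent.iter().sum::<f32>() / take as f32)
}
```

A caller that really wants the full-history mean can pass `history.len()` (or `usize::MAX`) explicitly instead of relying on `0` as a sentinel.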

cubic-dev-ai · 2026-06-18T09:57:47Z

+    NoFrames(PathBuf),
+
+    #[error(transparent)]
+    Training(#[from] TrainingError),


P2: VideoError::Training variant with #[from] is dead code — never constructed, forcing exhaustive-match arms for an unreachable case in a public enum.

Location: oxidize-train/src/video/error.rs, line 44

cubic-dev-ai · 2026-06-18T09:57:47Z

+        .sum()
+}
+
+fn fill_pseudo(bytes: &mut [u8], mut state: u64) {


P2: Duplicated fill_pseudo + f16 header taming logic already in oxk_q4k_bench.rs — should be shared (e.g. a test-util module or pub fn in the crate).

Location: oxidize-kernels/benches/oxk_token_bench.rs, line 72

Keep hidden state resident on GPU across all transformer layers, eliminating per-layer CPU residual-stream round trips. The CUDA path now holds the full 3072-element hidden vector in a pre-allocated `GpuActivationBuffer` and only copies Q/K/V vectors to CPU for RoPE and attention; the wo and FFN projections (gate/up/silu/down) stay entirely on device.

Key changes:

- `gemv_q4k_f32in_kernel`: Q4_K × F32-input GEMV (no Q8K quantization step), one warp per output row
- `gemv_q6k_f32in_kernel`: Q6_K × F32-input GEMV with correct GGML interleaved block layout (two 128-element halves, lo4/hi2 split)
- `GpuActivationBuffer`: pre-allocated F32 device buffers for hidden, normed, ffn_gate, ffn_up, ffn_down_in
- `rms_norm_f32_kernel`: dynamic shared memory fix (pass block_size×4 bytes at launch, not 0)
- `gpu_download_hidden`: synchronize stream before D2H copy to avoid racing the last residual-add kernel
- `layer_can_use_gpu_native`: per-layer eligibility check (Q4K/Q6K weights, no biases, 256-aligned, dense FFN)
- `q4k_or_q6k_bytes`: accepts Q4_K_S / Q4_K_M / Q6_K so that Llama-3.2 Q4_K_M layers (which use Q6_K for attn_v) are eligible
- Auto-detect Q4K vs Q6K from block byte-size (≥200 → Q6K)
- `kv_cache.rs`: add `copy_layer_{key,value}_prefix_values` helpers
- `gguf.rs`: expose `tensor_bytes` / `tensor_mmap` accessors

Measured on RunPod H100 80GB with Llama-3.2-3B-Instruct Q4_K_M:

- baseline (--backend cuda): ~41 tok/s
- GPU-native (--backend cuda): ~39 tok/s

The speedup is neutral because the original CUDA path already ran all GEMVs on GPU; the per-layer D2H sync cost on H100 PCIe 5.0 is ~5 µs (not the 100 µs estimated), so eliminating residual-stream round trips saves only ~1 ms/token while adding ~0.3 ms of attn_out H2D uploads. Real throughput gains require moving attention itself to GPU (next step).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
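The block-size auto-detection mentioned above can be sketched as follows. This is an illustration, not the PR's code: `detect_kquant` and `KQuant` are hypothetical names, and the sketch matches the exact GGML super-block sizes (144 bytes for Q4_K, 210 bytes for Q6_K, each covering 256 weights) instead of the looser "≥200 → Q6K" threshold the commit describes.

```rust
/// Weights per K-quant super-block.
pub const QK_K: usize = 256;
/// Bytes per Q4_K super-block in the GGML layout.
pub const BLOCK_Q4_K_BYTES: usize = 144;
/// Bytes per Q6_K super-block in the GGML layout.
pub const BLOCK_Q6_K_BYTES: usize = 210;

#[derive(Debug, PartialEq)]
pub enum KQuant {
    Q4K,
    Q6K,
}

/// Infer the quantization type of a tensor from its byte size and element count.
/// Returns `None` if the sizes do not correspond to a whole number of
/// Q4_K or Q6_K blocks.
pub fn detect_kquant(weight_bytes: usize, n_elements: usize) -> Option<KQuant> {
    if n_elements == 0 || n_elements % QK_K != 0 {
        return None;
    }
    let n_blocks = n_elements / QK_K;
    if weight_bytes % n_blocks != 0 {
        return None;
    }
    match weight_bytes / n_blocks {
        BLOCK_Q4_K_BYTES => Some(KQuant::Q4K),
        BLOCK_Q6_K_BYTES => Some(KQuant::Q6K),
        _ => None,
    }
}
```

Exact matching rejects tensors of any other quantization type outright, whereas a threshold would classify every block of 200 bytes or more as Q6_K; which behaviour is preferable depends on how strictly eligibility is checked before this point.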

lm_head (output.weight) is a 128256×3072 matrix. Running it on CPU costs 6–16 ms per token (RAM bandwidth bound); on H100 HBM it takes ~100 µs (33–130× faster). The CPU lm_head is the primary bottleneck preventing 100 tok/s on 3B models; this change moves it to the GPU.

Changes:

- cuda.rs: add `gpu_lm_head_quantized`, which uploads output.weight once, dispatches the Q4K or Q6K GEMV kernel, and downloads 512 KB of logits; auto-detects quant type from block byte-size (same pattern as layer GEMVs)
- inference.rs: `final_head_from_workspace` tries the GPU path for Q4K/Q6K output weights (covers all Q4_K_M / Q6_K GGUF models); falls through to CPU on error so the non-CUDA path is unaffected
- build.rs: probe nvcc version and emit native cubins for each Blackwell/Hopper/Ada/Ampere generation the installed toolkit supports, alongside the compute_75 PTX fallback. SM120 (RTX 5090) and SM121 (RTX 5080) now get pre-compiled native code instead of per-run JIT recompilation when CUDA ≥ 12.8 is installed.

Expected result: ~80–120 tok/s on 3B Q4_K_M (from ~40 tok/s baseline) once lm_head moves off CPU; full 100 tok/s parity with L40 7B benchmarks requires GPU attention in a follow-up.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
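The timing figures above are consistent with a simple bandwidth estimate, since a GEMV has to stream the whole weight matrix once per token. The sketch below works through that arithmetic. Everything other than the matrix shape is an assumption made for illustration: the output weight is taken to be Q6_K (210 bytes per 256 weights), and the bandwidth numbers (about 3350 GB/s for H100 HBM, 20 to 50 GB/s for effective CPU memory bandwidth) are round figures, not measurements from this PR.

```rust
/// Bytes occupied by a K-quantized matrix with `block_bytes` per 256 weights.
pub fn quantized_bytes(rows: usize, cols: usize, block_bytes: usize) -> usize {
    rows * cols / 256 * block_bytes
}

/// Time in microseconds to stream `bytes` once at `bandwidth_gb_s` GB/s.
pub fn stream_time_us(bytes: usize, bandwidth_gb_s: f64) -> f64 {
    bytes as f64 / (bandwidth_gb_s * 1e9) * 1e6
}

/// Size of an F32 logits vector in bytes.
pub fn logits_bytes(vocab: usize) -> usize {
    vocab * std::mem::size_of::<f32>()
}
```

Under these assumptions the matrix is 323,205,120 bytes, which takes roughly 96 µs at 3350 GB/s and between about 6.5 ms and 16 ms at 50 and 20 GB/s respectively. The logits download is 128256 × 4 = 513,024 bytes.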

- Updated `oxidize-cli/src/main.rs` to modularize submodules for better maintainability.
- Adjusted `oxidize-core/src/compute/kv_cache.rs` to separate cache functionalities into distinct files.
- Enhanced `oxidize-core/src/backends/cuda.rs` by removing unused structures and functions related to CUDA memory management.
- Updated `oxidize-core/src/model/inference.rs` and `oxidize-core/src/model/layer_wise.rs` to expose necessary functions for external use.
- Modified `continual-learning.json` to reflect updated state values.

These changes aim to streamline the codebase, making it easier to navigate and extend in the future.

cubic-dev-ai

31 issues found across 174 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="oxidize-cli/src/main/gpu_cluster.rs">

<violation number="1" location="oxidize-cli/src/main/gpu_cluster.rs:70">
P2: `--nodes` accepts `0` despite claiming a positive integer is required. This silently changes user input into defaults instead of reporting invalid input.</violation>

<violation number="2" location="oxidize-cli/src/main/gpu_cluster.rs:88">
P2: `--gpus-per-node` accepts `0` even though validation message requires a positive integer. This hides invalid input and falls back to defaults.</violation>
</file>
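
One way to reject `0` for `--nodes` and `--gpus-per-node` is to parse straight into a non-zero integer type. This is a minimal sketch; `parse_positive` is a hypothetical helper and is not the parsing code in `gpu_cluster.rs`.

```rust
use std::num::NonZeroU32;

/// Parse a flag value that must be a positive integer.
/// `NonZeroU32`'s `FromStr` impl rejects "0", negative numbers, and
/// non-numeric input, so invalid values become an error instead of
/// silently falling back to a default.
pub fn parse_positive(flag: &str, raw: &str) -> Result<NonZeroU32, String> {
    raw.trim()
        .parse::<NonZeroU32>()
        .map_err(|_| format!("{flag} requires a positive integer, got {raw:?}"))
}
```

Returning an error keeps the behavior consistent with the validation message the flags already print.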

<file name="oxidize-cli/src/main/command_rewrite.rs">

<violation number="1" location="oxidize-cli/src/main/command_rewrite.rs:145">
P2: `--prompt` flag is not recognized as one-shot mode in rewrite logic. This can incorrectly enable chat/server behavior for prompt-only runs.</violation>
</file>

<file name="oxidize-cli/src/main/inference.rs">

<violation number="1" location="oxidize-cli/src/main/inference.rs:93">
P2: Autotune `turboquant=false` is never applied. This can leave KV quantization in the default mode instead of the planned asymmetric mode.</violation>
</file>

<file name=".omo/ulw-loop/evidence/oxidize-post-gen-status.txt">

<violation number="1" location=".omo/ulw-loop/evidence/oxidize-post-gen-status.txt:3">
P2: Raw prompt text is committed in process-output evidence. Redact user/request payloads before checking logs into git.</violation>

<violation number="2" location=".omo/ulw-loop/evidence/oxidize-post-gen-status.txt:9">
P2: Unfiltered host `ps` output leaks unrelated service details in repository history. Keep evidence scoped to oxidize processes or redact external commands.</violation>
</file>

<file name="oxidize-core/build.rs">

<violation number="1" location="oxidize-core/build.rs:140">
P2: `-ptx` forces PTX-only output, so added native `sm_*` gencode targets do not produce embedded cubins. This makes the new version-gated native-arch path ineffective and the documented perf benefit incorrect.</violation>
</file>

<file name="oxidize-core/src/backends/cuda/gemm.rs">

<violation number="1" location="oxidize-core/src/backends/cuda/gemm.rs:64">
P1: GEMM caches the dynamic activation input instead of the weight matrix. This can return stale results when left-buffer pointers are reused and also defeats intended weight-caching performance.</violation>
</file>

<file name=".pre-commit-config.yaml">

<violation number="1" location=".pre-commit-config.yaml:32">
P1: `cargo-udeps` hook references a non-existent script, so the hook cannot run successfully.</violation>
</file>

<file name=".cursor/hooks/state/continual-learning.json">

<violation number="1" location=".cursor/hooks/state/continual-learning.json:4">
P3: Volatile Cursor hook state was committed in this PR. This is unrelated runtime-local metadata and should be excluded from versioned changes.</violation>
</file>

<file name="oxidize-core/src/compute/quantization/quant_nvfp4.rs">

<violation number="1" location="oxidize-core/src/compute/quantization/quant_nvfp4.rs:14">
P3: This introduces duplicate NVFP4 decode logic that already exists in `tensor/kernels/q_kernels.rs`. Keeping two copies risks silent drift when one implementation changes.</violation>
</file>

<file name="oxidize-core/src/compute/quantization/quant_utils.rs">

<violation number="1" location="oxidize-core/src/compute/quantization/quant_utils.rs:59">
P2: New helper duplicates existing float16 conversion code in two modules. This creates divergence risk and inconsistent edge-case handling.</violation>
</file>

<file name=".omo/ulw-loop/evidence/oxidize-server-log-listener.txt">

<violation number="1" location=".omo/ulw-loop/evidence/oxidize-server-log-listener.txt:1">
P3: This adds a near-duplicate evidence log instead of extending existing evidence artifacts. Duplicate files increase repo noise and drift risk.</violation>

<violation number="2" location=".omo/ulw-loop/evidence/oxidize-server-log-listener.txt:5">
P2: Absolute local model path is committed in the evidence log. Redact host-specific paths before committing logs.</violation>
</file>

<file name="oxidize-core/src/backends/cuda/gpu_state.rs">

<violation number="1" location="oxidize-core/src/backends/cuda/gpu_state.rs:105">
P1: Budget eviction runs too late in `preload_layer`; VRAM allocation can fail before old layers are evicted. Evict headroom before each new upload.</violation>
</file>

<file name="oxidize-cli/src/main/generation.rs">

<violation number="1" location="oxidize-cli/src/main/generation.rs:67">
P2: Per-request full-vocab decode loop adds avoidable generation latency. Cache suppressed token IDs per tokenizer/model instead of recomputing each call.</violation>

<violation number="2" location="oxidize-cli/src/main/generation.rs:212">
P2: DFlash path drops generated output when decoded text is empty. Add the same token-id fallback used by other generation paths.</violation>
</file>

<file name="oxidize-core/src/backends/cuda/gpu_kernels.rs">

<violation number="1" location="oxidize-core/src/backends/cuda/gpu_kernels.rs:70">
P3: This introduces substantial duplicate CUDA launch logic already implemented in `gpu_native_forward.rs`. Keeping both paths in sync increases drift risk for kernel signatures and caching behavior.</violation>

<violation number="2" location="oxidize-core/src/backends/cuda/gpu_kernels.rs:158">
P2: `gpu_silu_mul` can issue an invalid zero-block CUDA launch for empty intermediate buffers. Return early when `n == 0`.</violation>

<violation number="3" location="oxidize-core/src/backends/cuda/gpu_kernels.rs:158">
P2: `gpu_residual_add` can launch with `grid_size == 0` when hidden size is zero. Guard `n == 0` before launching to avoid invalid CUDA launch errors.</violation>
</file>
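
The two zero-size launch issues above share one guard: compute the grid size up front and skip the launch when there is nothing to do. The helper below is a hypothetical sketch, not code from `gpu_kernels.rs`; the caller would treat `Ok(None)` as "return early without launching".

```rust
/// Compute the number of thread blocks needed to cover `n` elements.
/// Returns `Ok(None)` for an empty buffer so the caller can skip the launch,
/// and an error when the block size is zero or the grid would overflow `u32`.
pub fn launch_grid(n: usize, block_size: u32) -> Result<Option<u32>, String> {
    if n == 0 {
        return Ok(None);
    }
    if block_size == 0 {
        return Err("block size must be non-zero".to_string());
    }
    // Ceiling division: the last block may be only partially filled.
    let blocks = n.div_ceil(block_size as usize);
    u32::try_from(blocks)
        .map(Some)
        .map_err(|_| format!("grid size {blocks} does not fit in u32"))
}
```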

<file name="oxidize-cli/src/main/model_resolution.rs">

<violation number="1" location="oxidize-cli/src/main/model_resolution.rs:73">
P1: Unvalidated joined filenames allow path traversal/absolute-path writes outside intended cache directories.</violation>

<violation number="2" location="oxidize-cli/src/main/model_resolution.rs:152">
P1: Cache key generation is non-injective; different model specs can map to the same cache directory and return incorrect artifacts.</violation>

<violation number="3" location="oxidize-cli/src/main/model_resolution.rs:255">
P1: Final GGUF cache writes are non-atomic, so interrupted writes can poison cache with corrupted files.</violation>
</file>
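
For the path-traversal issue, a conservative sketch is to accept only a single plain file-name component before joining it onto the cache directory. `safe_cache_path` is a hypothetical helper; it assumes cached artifacts never need subdirectories, which may be too strict for the real resolver.

```rust
use std::path::{Component, Path, PathBuf};

/// Join `filename` onto `cache_dir` only if it is a single normal path
/// component. Absolute paths, `..`, `.`, nested paths, and empty names
/// are all rejected.
pub fn safe_cache_path(cache_dir: &Path, filename: &str) -> Result<PathBuf, String> {
    let mut components = Path::new(filename).components();
    match (components.next(), components.next()) {
        (Some(Component::Normal(name)), None) => Ok(cache_dir.join(name)),
        _ => Err(format!("unsafe cache filename: {filename:?}")),
    }
}
```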

<file name="oxidize-core/src/backends/cuda/gemv_quantized.rs">

<violation number="1" location="oxidize-core/src/backends/cuda/gemv_quantized.rs:304">
P2: Combined length guard returns wrong error for short `d_input`. Split checks so vector failures return `InvalidVectorLength` with `d_input.len()`.</violation>

<violation number="2" location="oxidize-core/src/backends/cuda/gemv_quantized.rs:310">
P2: Unchecked `as u32` casts can truncate large dimensions and launch kernel with wrapped sizes. Use checked conversions and return a dimension error on overflow.</violation>
</file>
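
The two `gemv_quantized.rs` findings can be addressed together by splitting the length guard and replacing `as u32` with checked conversions. In this sketch only the `InvalidVectorLength` name comes from the review comment; the other variants, the function name, and the choice to count the matrix in elements rather than quantized bytes are illustrative assumptions.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum GemvError {
    InvalidMatrixLength { expected: usize, actual: usize },
    InvalidVectorLength { expected: usize, actual: usize },
    DimensionOverflow { name: &'static str, value: usize },
}

/// Convert a dimension to `u32` without silent truncation.
fn to_u32(name: &'static str, value: usize) -> Result<u32, GemvError> {
    u32::try_from(value).map_err(|_| GemvError::DimensionOverflow { name, value })
}

/// Validate GEMV arguments and return kernel-ready `(rows, cols)`.
/// Matrix and vector lengths are checked separately so each failure
/// reports the buffer that was actually too short.
pub fn validate_gemv(
    matrix_len: usize,
    input_len: usize,
    rows: usize,
    cols: usize,
) -> Result<(u32, u32), GemvError> {
    let rows32 = to_u32("rows", rows)?;
    let cols32 = to_u32("cols", cols)?;
    let expected = rows
        .checked_mul(cols)
        .ok_or(GemvError::DimensionOverflow { name: "rows*cols", value: rows })?;
    if matrix_len < expected {
        return Err(GemvError::InvalidMatrixLength { expected, actual: matrix_len });
    }
    if input_len < cols {
        return Err(GemvError::InvalidVectorLength { expected: cols, actual: input_len });
    }
    Ok((rows32, cols32))
}
```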

<file name="oxidize-core/src/backends/cuda/gemv_f32.rs">

<violation number="1" location="oxidize-core/src/backends/cuda/gemv_f32.rs:29">
P1: F32 weight cache inserts bypass VRAM budget tracking/eviction. Repeated new matrices can accumulate resident VRAM until OOM despite configured cache limits.</violation>

<violation number="2" location="oxidize-core/src/backends/cuda/gemv_f32.rs:151">
P2: Transposed CUDA GEMV does per-call vector allocation instead of using the existing f32 buffer pool. This adds avoidable allocation overhead and hurts throughput.</violation>
</file>

<file name="oxidize-core/src/backends/cuda.rs">

<violation number="1" location="oxidize-core/src/backends/cuda.rs:43">
P1: Removing content identity from cache keys makes weight-cache hits unsafe for mutable/reused host buffers. GEMV/GEMM can return stale results when data changes without pointer/length changes.</violation>
</file>

<file name="oxidize-cli/src/main/conversation.rs">

<violation number="1" location="oxidize-cli/src/main/conversation.rs:231">
P2: Mesh chat accepts input before leader election can complete. First prompts can time out with no streamed tokens.</violation>

<violation number="2" location="oxidize-cli/src/main/conversation.rs:347">
P1: Timeout/disconnect paths cache empty responses. Repeated prompts can return blank cache hits rather than retrying generation.</violation>
</file>
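
For the empty-response caching issue, the cache insert can be gated on the generation having completed with non-empty text. `ResponseCache` below is a hypothetical stand-in for whatever cache `conversation.rs` uses.

```rust
use std::collections::HashMap;

/// Minimal prompt-to-response cache (hypothetical stand-in).
#[derive(Default)]
pub struct ResponseCache {
    entries: HashMap<String, String>,
}

impl ResponseCache {
    /// Store a response only when generation completed and produced text.
    /// Timeouts and disconnects pass `completed = false`, so a later
    /// identical prompt misses the cache and triggers a fresh generation.
    /// Returns whether the entry was stored.
    pub fn store(&mut self, prompt: &str, response: &str, completed: bool) -> bool {
        if !completed || response.trim().is_empty() {
            return false;
        }
        self.entries.insert(prompt.to_owned(), response.to_owned());
        true
    }

    /// Look up a cached response for `prompt`.
    pub fn get(&self, prompt: &str) -> Option<&str> {
        self.entries.get(prompt).map(String::as_str)
    }
}
```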

<file name="oxidize-core/src/backends/cuda/types.rs">

<violation number="1" location="oxidize-core/src/backends/cuda/types.rs:233">
P1: Quantized CUDA weight cache is never cleared by `clear_resident_cache`. This can leak VRAM across model loads and break budget enforcement because `resident_bytes` is reset while `resident_quant` still holds device buffers.</violation>
</file>

_{Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.}

cubic-dev-ai · 2026-06-19T06:24:46Z

+
+    with_gpu(|gpu| {
+        // Cache left matrix (model weights) in VRAM.
+        let left_key = f32_cache_key(left_matrix);


P1: GEMM caches the dynamic activation input instead of the weight matrix. This can return stale results when left-buffer pointers are reused and also defeats intended weight-caching performance.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-core/src/backends/cuda/gemm.rs, line 64: <comment>GEMM caches the dynamic activation input instead of the weight matrix. This can return stale results when left-buffer pointers are reused and also defeats intended weight-caching performance.</comment> <file context> @@ -0,0 +1,120 @@ + + with_gpu(|gpu| { + // Cache left matrix (model weights) in VRAM. + let left_key = f32_cache_key(left_matrix); + if !gpu.resident_f32.contains_key(&left_key) { + let buffer = cust::memory::DeviceBuffer::from_slice(left_matrix).map_err(stringify)?; </file context>

cubic-dev-ai · 2026-06-19T06:24:46Z

+
+      - id: cargo-udeps
+        name: cargo udeps (unused dependencies)
+        entry: ./scripts/check-udeps.sh


P1: cargo-udeps hook references a non-existent script, so the hook cannot run successfully.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At .pre-commit-config.yaml, line 32: <comment>`cargo-udeps` hook references a non-existent script, so the hook cannot run successfully.</comment> <file context> @@ -0,0 +1,63 @@ + + - id: cargo-udeps + name: cargo udeps (unused dependencies) + entry: ./scripts/check-udeps.sh + language: system + pass_filenames: false </file context>

cubic-dev-ai · 2026-06-19T06:24:46Z

+        for (matrix, _rows, _cols) in f32_weights {
+            let key = f32_cache_key(matrix);
+            if !gpu.resident_f32.contains_key(&key) {
+                let buf = cust::memory::DeviceBuffer::from_slice(*matrix).map_err(stringify)?;


P1: Budget eviction runs too late in preload_layer; VRAM allocation can fail before old layers are evicted. Evict headroom before each new upload.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-core/src/backends/cuda/gpu_state.rs, line 105: <comment>Budget eviction runs too late in `preload_layer`; VRAM allocation can fail before old layers are evicted. Evict headroom before each new upload.</comment> <file context> @@ -0,0 +1,150 @@ + for (matrix, _rows, _cols) in f32_weights { + let key = f32_cache_key(matrix); + if !gpu.resident_f32.contains_key(&key) { + let buf = cust::memory::DeviceBuffer::from_slice(*matrix).map_err(stringify)?; + entry.bytes += buf.len() * std::mem::size_of::<f32>(); + gpu.resident_f32.insert(key, buf); </file context>
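
A sketch of "evict headroom before each new upload", using plain byte counters in place of device buffers. `VramBudget` and its oldest-first eviction order are assumptions for illustration; the real `preload_layer` would drop the corresponding `DeviceBuffer`s where the comment indicates.

```rust
use std::collections::VecDeque;

/// Tracks resident layer sizes against a VRAM budget (hypothetical type).
pub struct VramBudget {
    limit_bytes: usize,
    resident_bytes: usize,
    /// Oldest layer first; each entry is (layer index, bytes held).
    layers: VecDeque<(usize, usize)>,
}

impl VramBudget {
    pub fn new(limit_bytes: usize) -> Self {
        Self { limit_bytes, resident_bytes: 0, layers: VecDeque::new() }
    }

    pub fn resident_bytes(&self) -> usize {
        self.resident_bytes
    }

    /// Make room for `incoming` bytes BEFORE the upload happens, evicting
    /// the oldest layers first, then record the new layer as resident.
    /// Returns the indices of the evicted layers.
    pub fn reserve(&mut self, layer: usize, incoming: usize) -> Result<Vec<usize>, String> {
        if incoming > self.limit_bytes {
            return Err(format!("layer {layer} needs {incoming} bytes, budget is {}", self.limit_bytes));
        }
        let mut evicted = Vec::new();
        while self.resident_bytes + incoming > self.limit_bytes {
            // Non-zero resident bytes imply at least one resident layer.
            let (old, bytes) = self.layers.pop_front().expect("resident layer to evict");
            self.resident_bytes -= bytes;
            evicted.push(old); // a real backend would free the device buffers here
        }
        // The actual host-to-device upload would happen at this point.
        self.layers.push_back((layer, incoming));
        self.resident_bytes += incoming;
        Ok(evicted)
    }
}
```

Because the headroom check runs before the allocation, an upload can only fail for a layer that is larger than the whole budget.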

cubic-dev-ai · 2026-06-19T06:24:46Z

+        .map_err(|e| io::Error::other(format!("failed to read GGUF for requantization: {e}")))?;
+    let quantized = quantize_gguf_to_target(&input_bytes, GgufQuantizationType::Q8_0)
+        .map_err(|e| io::Error::other(format!("Q8_0 requantization failed: {e}")))?;
+    std::fs::write(output, &quantized)


P1: Final GGUF cache writes are non-atomic, so interrupted writes can poison cache with corrupted files.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-cli/src/main/model_resolution.rs, line 255: <comment>Final GGUF cache writes are non-atomic, so interrupted writes can poison cache with corrupted files.</comment> <file context> @@ -0,0 +1,324 @@ + .map_err(|e| io::Error::other(format!("failed to read GGUF for requantization: {e}")))?; + let quantized = quantize_gguf_to_target(&input_bytes, GgufQuantizationType::Q8_0) + .map_err(|e| io::Error::other(format!("Q8_0 requantization failed: {e}")))?; + std::fs::write(output, &quantized) + .map_err(|e| io::Error::other(format!("failed to write Q8_0 GGUF: {e}")))?; + eprintln!( </file context>
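
The usual fix for non-atomic cache writes is to write to a temporary file in the same directory and rename it into place. The sketch below uses only the standard library; `write_atomic` and the `.tmp-<pid>` suffix are illustrative choices, not code from the PR.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Write `bytes` to `output` via a temporary sibling file and a rename,
/// so readers never observe a partially written file. The temp file lives
/// in the same directory so the rename stays on one filesystem.
pub fn write_atomic(output: &Path, bytes: &[u8]) -> io::Result<()> {
    let mut tmp_name = output
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "output has no file name"))?
        .to_os_string();
    tmp_name.push(format!(".tmp-{}", std::process::id()));
    let tmp_path = output.with_file_name(tmp_name);

    let result = (|| -> io::Result<()> {
        let mut file = File::create(&tmp_path)?;
        file.write_all(bytes)?;
        // Flush file contents to disk before publishing the final name.
        file.sync_all()?;
        fs::rename(&tmp_path, output)
    })();

    if result.is_err() {
        // Best-effort cleanup; the original error is what the caller sees.
        let _ = fs::remove_file(&tmp_path);
    }
    result
}
```

In the snippet under review, the call `std::fs::write(output, &quantized)` would become `write_atomic(output, &quantized)`, assuming `output` is a path.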

cubic-dev-ai · 2026-06-19T06:24:46Z

+
+pub(super) fn cache_safe_name(spec: &str) -> String {
+    spec.chars()
+        .map(|ch| if ch.is_ascii_alphanumeric() { ch } else { '-' })


P1: Cache key generation is non-injective; different model specs can map to the same cache directory and return incorrect artifacts.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-cli/src/main/model_resolution.rs, line 152: <comment>Cache key generation is non-injective; different model specs can map to the same cache directory and return incorrect artifacts.</comment> <file context> @@ -0,0 +1,324 @@ + +pub(super) fn cache_safe_name(spec: &str) -> String { + spec.chars() + .map(|ch| if ch.is_ascii_alphanumeric() { ch } else { '-' }) + .collect() +} </file context>
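
The collision is easy to see: under the mapping shown above, `org/model` and `org-model` both become `org-model`. One injective alternative is to escape every non-alphanumeric byte as `_` plus two hex digits. This is a sketch of that idea under the same function name, not the fix the PR adopted.

```rust
/// Map a model spec to a filesystem-safe name without collisions.
/// ASCII alphanumerics pass through; every other byte (including `_`
/// itself) becomes `_XX` with two lowercase hex digits, so the encoding
/// can be decoded unambiguously and is therefore injective.
pub fn cache_safe_name(spec: &str) -> String {
    let mut out = String::with_capacity(spec.len());
    for byte in spec.bytes() {
        if byte.is_ascii_alphanumeric() {
            out.push(byte as char);
        } else {
            out.push_str(&format!("_{byte:02x}"));
        }
    }
    out
}
```

The trade-off is longer directory names; hashing the spec would keep names short but makes the cache directory harder to inspect by eye.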

cubic-dev-ai · 2026-06-19T06:24:47Z

+
+    // Give the mesh node a moment to start up and discover peers.
+    rt.block_on(async {
+        tokio::time::sleep(Duration::from_secs(2)).await;


P2: Mesh chat accepts input before leader election can complete. First prompts can time out with no streamed tokens.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-cli/src/main/conversation.rs, line 231: <comment>Mesh chat accepts input before leader election can complete. First prompts can time out with no streamed tokens.</comment> <file context> @@ -0,0 +1,358 @@ + + // Give the mesh node a moment to start up and discover peers. + rt.block_on(async { + tokio::time::sleep(Duration::from_secs(2)).await; + }); + </file context>
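
Instead of a fixed two-second sleep, the chat loop could poll a readiness condition with a deadline and report a clear error if no leader is elected. The sketch below uses `std::thread` so that it is self-contained; the code under review runs on tokio, where the same loop would use `tokio::time::sleep`, and the readiness predicate (for example "a leader is known") is an assumption about what the mesh node exposes.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Poll `ready` until it returns true or `timeout` elapses.
/// Returns true if the condition was met, false on timeout.
pub fn wait_until<F: FnMut() -> bool>(mut ready: F, timeout: Duration, poll: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if ready() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(poll);
    }
}
```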

cubic-dev-ai · 2026-06-19T06:24:47Z

  "version": 1,
  "lastRunAtMs": 1781685502133,
-  "turnsSinceLastRun": 1,
+  "turnsSinceLastRun": 4,


P3: Volatile Cursor hook state was committed in this PR. This is unrelated runtime-local metadata and should be excluded from versioned changes.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At .cursor/hooks/state/continual-learning.json, line 4: <comment>Volatile Cursor hook state was committed in this PR. This is unrelated runtime-local metadata and should be excluded from versioned changes.</comment> <file context> @@ -1,8 +1,8 @@ "version": 1, "lastRunAtMs": 1781685502133, - "turnsSinceLastRun": 1, + "turnsSinceLastRun": 4, "lastTranscriptMtimeMs": 1781685501947.5315, - "lastProcessedGenerationId": "f1a2db2c-d576-4862-9869-f0392e82e294", </file context>

cubic-dev-ai · 2026-06-19T06:24:47Z

+    }
+}
+
+pub fn dequantize_nvfp4_scalar(input: &[u8], output: &mut [f32]) -> Result<(), QuantizationError> {


P3: This introduces duplicate NVFP4 decode logic that already exists in tensor/kernels/q_kernels.rs. Keeping two copies risks silent drift when one implementation changes.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-core/src/compute/quantization/quant_nvfp4.rs, line 14: <comment>This introduces duplicate NVFP4 decode logic that already exists in `tensor/kernels/q_kernels.rs`. Keeping two copies risks silent drift when one implementation changes.</comment> <file context> @@ -0,0 +1,41 @@ + } +} + +pub fn dequantize_nvfp4_scalar(input: &[u8], output: &mut [f32]) -> Result<(), QuantizationError> { + validate_layout( + GgufQuantizationType::NVFP4, </file context>

cubic-dev-ai · 2026-06-19T06:24:47Z

@@ -0,0 +1,23 @@
+backend: cpu (CPU)


P3: This adds a near-duplicate evidence log instead of extending existing evidence artifacts. Duplicate files increase repo noise and drift risk.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At .omo/ulw-loop/evidence/oxidize-server-log-listener.txt, line 1: <comment>This adds a near-duplicate evidence log instead of extending existing evidence artifacts. Duplicate files increase repo noise and drift risk.</comment> <file context> @@ -0,0 +1,23 @@ +backend: cpu (CPU) +api server: starting in background at http://0.0.0.0:18082 (REST /v1/*, WebSocket /v1/realtime) +load progress: 0% stage=starting bytes=0/7381381664 </file context>

cubic-dev-ai · 2026-06-19T06:24:47Z

@@ -0,0 +1,231 @@
+#[allow(unused_imports)]


P3: This introduces substantial duplicate CUDA launch logic already implemented in gpu_native_forward.rs. Keeping both paths in sync increases drift risk for kernel signatures and caching behavior.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-core/src/backends/cuda/gpu_kernels.rs, line 70: <comment>This introduces substantial duplicate CUDA launch logic already implemented in `gpu_native_forward.rs`. Keeping both paths in sync increases drift risk for kernel signatures and caching behavior.</comment> <file context> @@ -0,0 +1,231 @@ +/// cached in `resident_f32` (mmap-stable pointer identity, same as ordinary +/// weight matrices). +#[cfg(feature = "cuda")] +pub fn gpu_rms_norm(weight: &[f32], eps: f32) -> Result<(), String> { + with_gpu(|gpu| { + let ab = gpu </file context>

Jackson57279 and others added 22 commits June 18, 2026 04:25

Add centralized byte/mmap helpers in oxidize-core.

bf318c8

Extract safe wrappers for read-only mmap, volatile page reads, and Q8_K bsum decoding so format and kernel call sites avoid scattered unsafe blocks. Co-authored-by: Cursor <cursoragent@cursor.com>

Route GGUF and SafeTensors mmap through bytes helpers.

745ad02

Use shared map_readonly and read_volatile_byte for model loading and NUMA warm-up checksums. Co-authored-by: Cursor <cursoragent@cursor.com>

Use safe mmap helpers in convert and merge shard loading.

f1b8add

Replace inline unsafe Mmap::map calls in SafeTensors-to-GGUF conversion and oxidize-merge index reads. Co-authored-by: Cursor <cursoragent@cursor.com>

Switch Q4K scalar path to slice-based Q8 block reads.

eb71f3e

read_q8_k_bsum now takes a byte slice instead of a raw pointer for safer kernel boundaries. Co-authored-by: Cursor <cursoragent@cursor.com>

Refactor AVX2/AVX-512 Q4K loaders to use slice inputs.

a573c98

load_q8_block helpers take &[u8] with debug_assert bounds checks instead of pointer arithmetic at call sites. Co-authored-by: Cursor <cursoragent@cursor.com>

Extend quantization with IQ4_XS importance weighting and weighted encode API.

06266bf

Add quantize_scalar_weighted, export IQ4 block constants, and refresh tensor module docs for GPU dispatch. Co-authored-by: Cursor <cursoragent@cursor.com>

Expand oxidize-quantize CLI and remove unsafe F16 wire format in pipeline.

3e26457

Add streaming weighted quantization options to the quantize tool and serialize hidden-state payloads with safe byte copies in the distributed pipeline. Co-authored-by: Cursor <cursoragent@cursor.com>

Tighten SAFETY comments in layer-wise inference hot paths.

f13cec2

Clarify lifetime extension and disjoint slice invariants without changing behavior. Co-authored-by: Cursor <cursoragent@cursor.com>

Add OXK full decode-token benchmarks for Rust and Python.

7680963

Measure per-token GEMV throughput across all layers instead of a single isolated matvec. Co-authored-by: Cursor <cursoragent@cursor.com>

Add Go OXK decode-token benchmark core and tests.

90a7264

Port the full-layer GEMV token benchmark used to compare Rust, Go, and Python throughput. Co-authored-by: Cursor <cursoragent@cursor.com>

Add Go bench_oxk CLI and ignore local Firecrawl artifacts.

53e95a8

Ship a small command entrypoint for the Go OXK benchmark runner. Co-authored-by: Cursor <cursoragent@cursor.com>

Add Python OXK decode-token benchmark module and tests.

b47e9e4

Mirror the Go/Rust full-token GEMV benchmark in the pure-Python port. Co-authored-by: Cursor <cursoragent@cursor.com>

Add DPO trainer and QLoRA NF4 adapter modules to oxidize-finetuning.

f42ca15

Introduce preference optimization training and 4-bit NF4 LoRA support as library building blocks. Co-authored-by: Cursor <cursoragent@cursor.com>

Add LoRA merge strategies and PPO/RLHF scaffolding to oxidize-finetuning.

b35f0f9

Support linear/SLERP/TIES adapter merging plus a PPO trainer stub for future RLHF workflows. Co-authored-by: Cursor <cursoragent@cursor.com>

Wire finetuning telemetry exports and register new training modules.

cfeae49

Expose metrics logging, early stopping helpers, and public module surface for DPO/QLoRA/merge/RLHF. Co-authored-by: Cursor <cursoragent@cursor.com>

Restructure oxidize-finetuning CLI into SFT/DPO/PPO/Merge subcommands.

dea3c64

Replace the single-entry SFT flow with subcommands for supervised training, DPO, PPO stub, and adapter merging. Co-authored-by: Cursor <cursoragent@cursor.com>

Replace video prototype averaging with an autoregressive generator module.

4ab066d

Drop prototype.rs in favor of generator.rs for next-frame video model training and sampling. Co-authored-by: Cursor <cursoragent@cursor.com>

Expose generator APIs and tune default video classifier hyperparameters.

4929bbe

Add gen-train/generate exports, virality-first defaults, and dataset exclude/alias merge options. Co-authored-by: Cursor <cursoragent@cursor.com>

Add creator filtering/alias merge and patch-to-RGB helpers for video tooling.

5941549

Filter excluded usernames before training and support reconstructing preview images from normalized patches. Co-authored-by: Cursor <cursoragent@cursor.com>

Add gen-train/generate CLI commands to oxidize-train.

962714c

Extend the train binary with generator training and sampling subcommands alongside the clip classifier. Co-authored-by: Cursor <cursoragent@cursor.com>

github-code-quality Bot found potential problems Jun 18, 2026

Comment thread oxidize-python/oxidize_python/core/oxk.py Fixed

Comment thread oxidize-python/oxidize_python/core/oxk.py Fixed

chatgpt-codex-connector Bot reviewed Jun 18, 2026

cubic-dev-ai Bot reviewed Jun 18, 2026

Jackson57279 and others added 3 commits June 18, 2026 14:59

cubic-dev-ai Bot reviewed Jun 19, 2026

-	unsafe { std::ptr::read_volatile(bytes.as_ptr().add(offset)) }
+	assert!(offset < bytes.len(), "offset out of bounds");
+	unsafe { std::ptr::read_volatile(bytes.as_ptr().add(offset)) }
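
The fragment above appears to add a bounds check in front of the volatile read. As a complete function it might look like the sketch below; the name `read_volatile_byte` is mentioned in the commit list, but this signature is an assumption.

```rust
/// Read one byte with volatile semantics so the compiler cannot elide the
/// load (useful for touching mmap pages during warm-up).
/// Panics if `offset` is out of bounds.
pub fn read_volatile_byte(bytes: &[u8], offset: usize) -> u8 {
    assert!(offset < bytes.len(), "offset out of bounds");
    // SAFETY: the assert above guarantees `offset` is within `bytes`, so the
    // pointer is valid and aligned for a one-byte read.
    unsafe { std::ptr::read_volatile(bytes.as_ptr().add(offset)) }
}
```

Without the assert, a safe function signature would hide an out-of-bounds read, which is undefined behavior rather than a panic.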

		@@ -0,0 +1,174 @@
		//! OXK full decode-token bench — Qwen3-30B-A3B.

Conversation

Jackson57279 commented Jun 18, 2026 •

edited by cubic-dev-ai Bot
