Perf/engine sweep #19

Open
Jackson57279 wants to merge 25 commits into master from perf/engine-sweep

Jackson57279 commented Jun 18, 2026

Contributor

Summary by cubic

Whole‑engine perf sweep with AVX‑512/AVX2 upgrades, CUDA GPU‑native forward pass and GPU lm_head, safer I/O, and per‑token OXK benchmarks. Adds DPO/QLoRA/RLHF tools in oxidize-finetuning, a CPU video training/generator pipeline in oxidize-train, importance‑weighted quantization, and a modular CLI.

  • Performance

    • AVX‑512/AVX2 dot/GEMV upgrades: multi‑accumulator FMAs in flash‑attention and 16‑lane dot paths in tensor kernels.
    • CUDA: GPU‑native forward for Q4_K/Q6_K GEMV and GPU lm_head GEMV; keeps activations on device and auto‑detects quantization.
    • CUDA build: emit compute_75 PTX plus native cubins (SM80/89/90/120) when available to cut JIT time.
    • I/O/safety/build: centralized read‑only mmap and volatile reads; removed unsafe F16 payload writes; codegen-units = 1.
  • New Features

    • oxidize-finetuning: DPO trainer, QLoRA (NF4), PPO/RLHF scaffolding, adapter merge (Linear/SLERP/TIES), telemetry; CLI split into SFT/DPO/PPO/Merge.
    • oxidize-train: CPU video pipeline with ffmpeg frame extraction, clip classifier, and autoregressive generator (new subcommands).
    • OXK full decode‑token benchmarks in Rust/Go/Python, including bench_oxk.py; oxidize-quantize importance‑weighted quantization.
    • oxidize-core/oxidize-cli: KV cache rewritten into eviction/storage modules with tests; CLI modularized (server/generation/model resolution/GPU cluster); added release workflow and CODEOWNERS.

Written for commit 8708715. Summary will update on new commits.

Jackson57279 and others added 22 commits June 18, 2026 04:25
Introduce a video pipeline for TikTok-style clip datasets with ffmpeg frame extraction, a trainable patch-embedding classifier, and a prototype subcommand that renders averaged base clips while excluding selected creators.

Co-authored-by: Cursor <cursoragent@cursor.com>
Whole-engine perf sweep, round 1. All wins verified against existing
bit-exact GEMM/attention tests (AVX2 fallback locally; AVX-512 paths
validated on Skylake-SP target).

- flash_attention: dot_product_f32 avx512/avx2 now use 4 independent
  accumulators + a 16/8-wide remainder loop, breaking the single-chain
  FMA latency bottleneck on short head_dim (64/96/128) loops. Same for
  the f16 dot (2 accumulators).
- flash_attention: vectorize the f32 KvElem::axpy (decode V-accumulation)
  with AVX-512/AVX2 FMA instead of a scalar loop.
- tensor/kernels: add dot4_f32_avx512 / dot_f32_avx512 (16-wide) and
  dispatch the Q4_K/Q6_K/Q8_0 decode-once GEMM and dot_f32_fast to them
  when avx512f+vl are present. Doubles dot lanes on the hottest quantized
  matmul path for AVX-512 hardware.
- tensor/kernels: gemm_f32_cpu inner loop replaced its autovectorization-
  blocking black_box "prefetch" with the SIMD dot_product_f32 over the
  (now contiguous) transposed column.
- Cargo.toml: codegen-units = 1 for whole-crate inlining/LTO scope.
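
The multi-accumulator idea from the flash_attention bullets can be illustrated in portable scalar Rust. This is a minimal sketch under stated assumptions, not the engine's `dot_product_f32`: the name `dot4_scalar` is hypothetical, and the real kernels use AVX-512/AVX2 intrinsics with 16/8-wide vector accumulators rather than scalar `f32` lanes.

```rust
/// Dot product with four independent accumulators (hypothetical name).
/// Each accumulator forms its own dependency chain, so successive
/// multiply-adds do not have to wait on one another's latency.
pub fn dot4_scalar(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "input lengths must match");
    let mut acc = [0.0f32; 4];
    let mut ca = a.chunks_exact(4);
    let mut cb = b.chunks_exact(4);
    for (x, y) in ca.by_ref().zip(cb.by_ref()) {
        for lane in 0..4 {
            acc[lane] += x[lane] * y[lane];
        }
    }
    // Remainder loop for lengths that are not a multiple of 4.
    let mut tail = 0.0f32;
    for (x, y) in ca.remainder().iter().zip(cb.remainder()) {
        tail += x * y;
    }
    // Combine the partial sums pairwise at the end.
    (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail
}
```

Note that splitting the sum across accumulators changes the floating-point summation order, so a kernel written this way may differ from a single-chain loop in the last bits for general inputs.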

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Extract safe wrappers for read-only mmap, volatile page reads, and Q8_K bsum decoding so format and kernel call sites avoid scattered unsafe blocks.

Co-authored-by: Cursor <cursoragent@cursor.com>
Use shared map_readonly and read_volatile_byte for model loading and NUMA warm-up checksums.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace inline unsafe Mmap::map calls in SafeTensors-to-GGUF conversion and oxidize-merge index reads.

Co-authored-by: Cursor <cursoragent@cursor.com>
read_q8_k_bsum now takes a byte slice instead of a raw pointer for safer kernel boundaries.

Co-authored-by: Cursor <cursoragent@cursor.com>
load_q8_block helpers take &[u8] with debug_assert bounds checks instead of pointer arithmetic at call sites.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ode API.

Add quantize_scalar_weighted, export IQ4 block constants, and refresh tensor module docs for GPU dispatch.

Co-authored-by: Cursor <cursoragent@cursor.com>
…line.

Add streaming weighted quantization options to the quantize tool and serialize hidden-state payloads with safe byte copies in the distributed pipeline.

Co-authored-by: Cursor <cursoragent@cursor.com>
Clarify lifetime extension and disjoint slice invariants without changing behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>
Measure per-token GEMV throughput across all layers instead of a single isolated matvec.

Co-authored-by: Cursor <cursoragent@cursor.com>
Port the full-layer GEMV token benchmark used to compare Rust, Go, and Python throughput.

Co-authored-by: Cursor <cursoragent@cursor.com>
Ship a small command entrypoint for the Go OXK benchmark runner.

Co-authored-by: Cursor <cursoragent@cursor.com>
Mirror the Go/Rust full-token GEMV benchmark in the pure-Python port.

Co-authored-by: Cursor <cursoragent@cursor.com>
Introduce preference optimization training and 4-bit NF4 LoRA support as library building blocks.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ing.

Support linear/SLERP/TIES adapter merging plus a PPO trainer stub for future RLHF workflows.

Co-authored-by: Cursor <cursoragent@cursor.com>
Expose metrics logging, early stopping helpers, and public module surface for DPO/QLoRA/merge/RLHF.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the single-entry SFT flow with subcommands for supervised training, DPO, PPO stub, and adapter merging.

Co-authored-by: Cursor <cursoragent@cursor.com>
…dule.

Drop prototype.rs in favor of generator.rs for next-frame video model training and sampling.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add gen-train/generate exports, virality-first defaults, and dataset exclude/alias merge options.

Co-authored-by: Cursor <cursoragent@cursor.com>
…tooling.

Filter excluded usernames before training and support reconstructing preview images from normalized patches.

Co-authored-by: Cursor <cursoragent@cursor.com>
Extend the train binary with generator training and sampling subcommands alongside the clip classifier.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread oxidize-python/oxidize_python/core/oxk.py Fixed
Comment thread oxidize-python/oxidize_python/core/oxk.py Fixed

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 962714cf39

blockOut[3] = byte(bits >> 24)
qsOff := 4
for i, v := range blockIn {
q := int(iscale * float64(v))

P2: Use rounding for Q8_K activation quantization

When the Go Q4_K GEMV path quantizes an input vector to Q8_K, this truncates scaled activations toward zero, whereas the Rust kernel rounds before clamping (quantize_block_q8_k_scalar uses scaled.round()). For any non-integer scaled activation, the Go port produces different q8 bytes and bsums, so row-dot outputs no longer match the Rust/ggml reference and accuracy regresses; use math.Round before clamping.
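
To make the divergence concrete, here is a small Rust sketch contrasting truncation with round-then-clamp. It is illustrative only: the function names, the `f32` arithmetic, and the `[-128, 127]` clamp range are assumptions, not code taken from the Go port or from `quantize_block_q8_k_scalar`.

```rust
/// Quantize one scaled activation to i8, truncating toward zero
/// (the behavior the review flags in the ported code).
pub fn quantize_trunc(v: f32, iscale: f32) -> i8 {
    ((iscale * v) as i32).clamp(-128, 127) as i8
}

/// Quantize with round-to-nearest before clamping, which the review
/// says the Rust kernel does via `scaled.round()`.
pub fn quantize_round(v: f32, iscale: f32) -> i8 {
    ((iscale * v).round() as i32).clamp(-128, 127) as i8
}

/// Sum of quantized values, standing in for one Q8_K `bsum` entry.
pub fn bsum(values: &[f32], iscale: f32, q: fn(f32, f32) -> i8) -> i32 {
    values.iter().map(|&v| q(v, iscale) as i32).sum()
}
```

Any activation whose scaled value has a fractional part of 0.5 or more in magnitude lands on a different quantized byte under the two schemes, and the per-block sums drift with it.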

struct.pack_into("<f", block_out, 0, d)
qs_off = 4
for i, v in enumerate(block_in):
q = int(iscale * v)

P2: Use rounding for Q8_K activation quantization

When the Python OXK path quantizes activations for Q4_K × Q8_K GEMV, int(...) truncates instead of rounding, unlike the Rust implementation's scaled.round() before clamping. For ordinary inputs whose scaled value is fractional, this changes the Q8_K block and its bsums, so the Python port no longer has the advertised bit-identical math and can return different logits; round the scaled value first.

// reference-model log-prob tracking. Wire DpoTrainer here once the
// model-side gradient API stabilises.
eprintln!("oxidize-finetuning dpo: full training loop not yet wired — coming soon");
Ok(())

P2: Return an error instead of succeeding for unwired DPO

When a user runs the advertised dpo subcommand with a valid JSONL file, this path prints that an output will be written and then exits successfully without loading the model, training, or creating args.output. That makes scripts treat a no-op as a completed fine-tune and can leave callers consuming a missing or stale adapter; either wire the DPO trainer/export here or return a non-zero error until it is implemented.
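
One way to follow this suggestion, sketched with the standard library only: the `NotImplemented` type and this `run_dpo` signature are hypothetical stand-ins for illustration, and the PR's actual handler (which appears to return a different `Result` type) is not reproduced here.

```rust
use std::fmt;

/// Error for a subcommand that is advertised but not implemented yet.
#[derive(Debug, PartialEq)]
pub struct NotImplemented {
    pub subcommand: &'static str,
}

impl fmt::Display for NotImplemented {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}: training loop is not wired yet", self.subcommand)
    }
}

impl std::error::Error for NotImplemented {}

/// Hypothetical stand-in for the `dpo` handler: fail until the trainer exists.
pub fn run_dpo() -> Result<(), NotImplemented> {
    Err(NotImplemented { subcommand: "dpo" })
}
```

When a `main` that returns `Result` propagates such an error, the process exits with a non-zero status, so scripts no longer mistake the stub for a finished run.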

// Full PPO requires a reward model and rollout collection loop.
// Wire PpoTrainer + RewardModel here once the reward-model API stabilises.
eprintln!("oxidize-finetuning ppo: full training loop not yet wired — coming soon");
Ok(())

P2: Return an error instead of succeeding for unwired PPO

When a user runs the advertised ppo subcommand, this path reports the target output and then exits Ok(()) without checking the model, collecting rollouts, training, or writing the adapter. Automation will see a successful RLHF run even though no artifact exists, so this should fail non-zero until the PPO loop is actually connected.

cubic-dev-ai (bot) left a comment

40 issues found across 46 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="oxidize-core/src/util/bytes.rs">

<violation number="1" location="oxidize-core/src/util/bytes.rs:13">
P1: Safe wrapper `map_readonly` wraps `unsafe Mmap::map` but relies on an unenforceable caller invariant (file must stay unmodified). Violating this invariant causes UB, making the safe signature unsound.</violation>

<violation number="2" location="oxidize-core/src/util/bytes.rs:31">
P0: Public safe function `read_volatile_byte` can cause undefined behavior when called with an out-of-bounds offset. A safe Rust function must never expose UB regardless of inputs.</violation>
</file>

<file name="oxidize-python/oxidize_python/core/oxk.py">

<violation number="1" location="oxidize-python/oxidize_python/core/oxk.py:113">
P1: `os.stderr` does not exist — will raise `AttributeError` at runtime when an invalid `OXIDIZE_OXK_PF_HINT` value is encountered. Must be `sys.stderr` with `import sys`.</violation>
</file>

<file name="oxidize-kernels/benches/oxk_token_bench.rs">

<violation number="1" location="oxidize-kernels/benches/oxk_token_bench.rs:1">
P1: Benchmark file added without a `[[bench]]` entry in Cargo.toml — `cargo bench` will not discover it.</violation>

<violation number="2" location="oxidize-kernels/benches/oxk_token_bench.rs:72">
P2: Duplicated fill_pseudo + f16 header taming logic already in oxk_q4k_bench.rs — should be shared (e.g. a test-util module or pub fn in the crate).</violation>
</file>

<file name="oxidize-finetuning/src/dpo.rs">

<violation number="1" location="oxidize-finetuning/src/dpo.rs:311">
P1: Dead `else` branch: `reference_free` flag has no effect — both branches return `(0.0, 0.0)`, so DPO always runs in reference-free mode regardless of the config.</violation>

<violation number="2" location="oxidize-finetuning/src/dpo.rs:404">
P2: Inconsistent `epochs=0` handling: `.max(1)` silently forces at least 1 epoch, while the existing SFT trainer respects `epochs=0` as zero iterations.</violation>
</file>

<file name="oxidize-golang/core/oxk/oxk.go">

<violation number="1" location="oxidize-golang/core/oxk/oxk.go:254">
P1: Exported library functions panic on invalid input instead of returning errors. This can crash host applications and violates repository error-handling conventions.</violation>

<violation number="2" location="oxidize-golang/core/oxk/oxk.go:259">
P2: Quantization loop allocates chunk slices before processing. Replace with index-based slicing to avoid per-call allocation overhead.</violation>

<violation number="3" location="oxidize-golang/core/oxk/oxk.go:291">
P2: Round the scaled activation before clamping when quantizing Q8_K values. Truncating toward zero changes q8 bytes/bsums and can make Go GEMV outputs diverge from the Rust kernel.</violation>
</file>

<file name="oxidize-finetuning/src/merge.rs">

<violation number="1" location="oxidize-finetuning/src/merge.rs:49">
P1: Slerp path accepts more than two adapters but silently ignores all after the first two. This can drop user inputs and return an unintended merge result.</violation>

<violation number="2" location="oxidize-finetuning/src/merge.rs:120">
P1: `slerp_merge` can interpolate adapters with different target/scale because compatibility checks are incomplete. The returned adapter keeps `a` metadata, causing incorrect semantics when mismatched inputs are passed.</violation>

<violation number="3" location="oxidize-finetuning/src/merge.rs:287">
P1: Linear/TIES compatibility checks omit `target` and `scale`, so incompatible adapters can merge successfully. Result keeps first adapter metadata, which can mislabel target and apply incorrect scaling at inference.</violation>
</file>

<file name="oxidize-train/src/main.rs">

<violation number="1" location="oxidize-train/src/main.rs:109">
P1: Clap `bool` field with `default_value_t = true` cannot be set to `false` from the CLI. The default `SetTrue` action means `--merge-aliases` only sets it to `true` (same as default), with no `--no-merge-aliases` negation. Users cannot disable merge_aliases.</violation>

<violation number="2" location="oxidize-train/src/main.rs:255">
P2: Misleading ffmpeg warning in `run_gen_train`: `load_gen_dataset()` hard-errors on missing ffmpeg, so the command always fails. The warning implies degraded-but-working operation like `run_video`, but that's not the case here.</violation>
</file>

<file name="oxidize-golang/cmd/bench_oxk/main.go">

<violation number="1" location="oxidize-golang/cmd/bench_oxk/main.go:131">
P1: Missing lower-bound validation for `tokens`; OXK_BENCH_TOKENS=0 causes no iterations and divide-by-zero in throughput calculations. Rust port guards with `.max(1)`.</violation>
</file>

<file name="oxidize-train/src/video/generator.rs">

<violation number="1" location="oxidize-train/src/video/generator.rs:410">
P2: Validation split leaks clip information because it is pair-level, not clip-level. Reported val_loss can be overly optimistic and unstable for model selection.</violation>

<violation number="2" location="oxidize-train/src/video/generator.rs:424">
P2: batch_size is effectively ignored because optimization runs one sample at a time inside each chunk. This makes the config misleading and prevents expected minibatch training/perf behavior.</violation>

<violation number="3" location="oxidize-train/src/video/generator.rs:435">
P1: val_split=0 makes val_loss constant 0, so best_model locks to epoch 1. Final model can drop later training progress.</violation>
</file>

<file name="oxidize-finetuning/src/main.rs">

<violation number="1" location="oxidize-finetuning/src/main.rs:191">
P2: Default output path "merged-adapter.gguf" is misleading — export_lora_gguf treats it as a directory and writes adapter_manifest.json inside, creating merged-adapter.gguf/ instead of a file.</violation>

<violation number="2" location="oxidize-finetuning/src/main.rs:360">
P1: This path reports that DPO training is not wired but still returns success. Return an error until the implementation is connected so automation does not treat a no-op as a completed run.</violation>

<violation number="3" location="oxidize-finetuning/src/main.rs:384">
P1: This stub prints that PPO is not wired and then returns `Ok(())`. Return a non-zero error until PPO training is implemented to avoid false-success runs.</violation>

<violation number="4" location="oxidize-finetuning/src/main.rs:458">
P2: TIES density is hardcoded to 0.5 with no CLI flag to customize it, making the ties strategy inflexible.</violation>
</file>

<file name="oxidize-train/src/video/config.rs">

<violation number="1" location="oxidize-train/src/video/config.rs:121">
P2: Missing validation: epochs == 0 allows a training run that performs zero iterations — a silent no-op.</violation>

<violation number="2" location="oxidize-train/src/video/config.rs:123">
P2: Missing validation: learning_rate <= 0.0 silently prevents training progress or causes divergence.</violation>
</file>

<file name="oxidize-merge/src/index.rs">

<violation number="1" location="oxidize-merge/src/index.rs:45">
P2: Duplicate `map_readonly` — identical wrapper already exists in oxidize-core/src/util/bytes.rs. Consider depending on oxidize-core or extracting a shared utility crate to avoid diverging SAFETY comments and duplicated unsafe wrappers.</violation>

<violation number="2" location="oxidize-merge/src/index.rs:46">
P2: SAFETY comment doesn't justify the unsafe call — it describes the mapping type and ownership, not the actual invariant (file must not be modified while mapped). The existing `map_readonly` in oxidize-core/src/util/bytes.rs has a correct version of this comment.</violation>
</file>

<file name="bench_oxk.py">

<violation number="1" location="bench_oxk.py:98">
P2: Missing lower-bound guard on `tokens` allows ZeroDivisionError when OXK_BENCH_TOKENS=0, unlike `n_layers` which is clamped with max(1, …).</violation>
</file>

<file name="oxidize-finetuning/src/qlora.rs">

<violation number="1" location="oxidize-finetuning/src/qlora.rs:61">
P2: `NF4Block::scale` is never read and its doc ("alpha / absmax") contradicts its assignment (`absmax`). Dead field + misleading documentation.</violation>

<violation number="2" location="oxidize-finetuning/src/qlora.rs:205">
P2: forward_batch fully dequantizes the frozen base weight into a new Vec every call. Frozen weights should be dequantized once (e.g. on construction or lazily cached) to avoid repeated allocation and recomputation.</violation>

<violation number="3" location="oxidize-finetuning/src/qlora.rs:257">
P2: adam_step accepts beta1/beta2/eps params that are silently ignored. Callers passing non-defaults get no effect — misleading API.</violation>
</file>

<file name="oxidize-quantize/src/main.rs">

<violation number="1" location="oxidize-quantize/src/main.rs:71">
P2: `nval * 4` can overflow usize before `checked_add` on 32-bit targets. Use `checked_mul(4).and_then(|b| cursor.checked_add(b))` instead.</violation>
</file>

<file name="oxidize-train/src/video/frames.rs">

<violation number="1" location="oxidize-train/src/video/frames.rs:188">
P2: patches_to_image indexes into `patches` without bounds checks — a short or malformed slice causes a panic. Add an early length guard or use get()/chunks_exact().</violation>
</file>

<file name="oxidize-cli/src/pipeline.rs">

<violation number="1" location="oxidize-cli/src/pipeline.rs:216">
P2: Per-call `Vec::with_capacity` allocation in `send_hidden` is a performance regression in the hot decode loop. Consider adding a reusable `&mut Vec<u8>` scratch parameter instead of allocating each call.</violation>

<violation number="2" location="oxidize-cli/src/pipeline.rs:256">
P2: Per-call `vec![0u8; nbytes]` allocation in `recv_hidden_payload` is a performance regression. The previously reused `f16_scratch` buffer is now unused; repurpose it (or a new byte scratch param) to avoid per-step allocation.</violation>
</file>

<file name="oxidize-core/src/model/layer_wise.rs">

<violation number="1" location="oxidize-core/src/model/layer_wise.rs:948">
P2: SAFETY comment drops the access invariant ("we only touch kv/ssm state below"). Without it, a future edit adding a cache-mutating `&mut self` call would create UB with no local reminder to check.</violation>
</file>

<file name="oxidize-finetuning/src/telemetry.rs">

<violation number="1" location="oxidize-finetuning/src/telemetry.rs:79">
P2: Returning 0.0 as mean loss for empty history is semantically incorrect — callers cannot distinguish "no data" from "average loss is zero".</violation>

<violation number="2" location="oxidize-finetuning/src/telemetry.rs:81">
P2: n=0 is treated as "all history" instead of "no steps" — a caller passing 0 likely has a bug, but the function silently returns the full-history mean.</violation>
</file>

<file name="oxidize-train/src/video/dataset.rs">

<violation number="1" location="oxidize-train/src/video/dataset.rs:285">
P2: f32 precision loss in split_indices: casting num_clips to f32 truncates large counts before multiplying by val_split. Use f64 for the intermediate calculation.</violation>
</file>

<file name="oxidize-train/src/video/error.rs">

<violation number="1" location="oxidize-train/src/video/error.rs:44">
P2: VideoError::Training variant with #[from] is dead code — never constructed, forcing exhaustive-match arms for an unreachable case in a public enum.</violation>
</file>

#[inline]
pub fn read_volatile_byte(bytes: &[u8], offset: usize) -> u8 {
// SAFETY: `offset` is always page-aligned and in-bounds at call sites.
unsafe { std::ptr::read_volatile(bytes.as_ptr().add(offset)) }

P0: Public safe function read_volatile_byte can cause undefined behavior when called with an out-of-bounds offset. A safe Rust function must never expose UB regardless of inputs.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-core/src/util/bytes.rs, line 31:

<comment>Public safe function `read_volatile_byte` can cause undefined behavior when called with an out-of-bounds offset. A safe Rust function must never expose UB regardless of inputs.</comment>

<file context>
@@ -0,0 +1,32 @@
+#[inline]
+pub fn read_volatile_byte(bytes: &[u8], offset: usize) -> u8 {
+    // SAFETY: `offset` is always page-aligned and in-bounds at call sites.
+    unsafe { std::ptr::read_volatile(bytes.as_ptr().add(offset)) }
+}
</file context>
Suggested change
-    unsafe { std::ptr::read_volatile(bytes.as_ptr().add(offset)) }
+    assert!(offset < bytes.len(), "offset out of bounds");
+    unsafe { std::ptr::read_volatile(bytes.as_ptr().add(offset)) }
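
As an alternative to asserting, the bounds check can be expressed in the return type. The sketch below is an assumption-laden variant, not the PR's code: the name `read_volatile_byte_checked` and the `Option<u8>` signature are hypothetical.

```rust
/// Bounds-checked volatile read; returns `None` when `offset` is out of range.
/// (Returning `Option` is one possible signature, not the PR's.)
pub fn read_volatile_byte_checked(bytes: &[u8], offset: usize) -> Option<u8> {
    let byte_ref: &u8 = bytes.get(offset)?;
    // SAFETY: `byte_ref` is a valid, aligned, initialized `&u8` borrowed from
    // `bytes`, so the volatile read through it is in-bounds for any `offset`.
    Some(unsafe { std::ptr::read_volatile(byte_ref as *const u8) })
}
```

Either form keeps the function sound for every input; the `assert!` version panics on misuse, while this one leaves the decision to the caller.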

elif hint == "t0" or hint == "":
pf_nta = False
else:
print(f"OXIDIZE_OXK_PF_HINT={hint} unknown (use t0|nta); using t0", file=os.stderr)

P1: os.stderr does not exist — will raise AttributeError at runtime when an invalid OXIDIZE_OXK_PF_HINT value is encountered. Must be sys.stderr with import sys.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-python/oxidize_python/core/oxk.py, line 113:

<comment>`os.stderr` does not exist — will raise `AttributeError` at runtime when an invalid `OXIDIZE_OXK_PF_HINT` value is encountered. Must be `sys.stderr` with `import sys`.</comment>

<file context>
@@ -0,0 +1,400 @@
+                elif hint == "t0" or hint == "":
+                    pf_nta = False
+                else:
+                    print(f"OXIDIZE_OXK_PF_HINT={hint} unknown (use t0|nta); using t0", file=os.stderr)
+                _tune_val = OxkTune(pf_bytes=blocks * BLOCK_Q4_K_SIZE, pf_nta=pf_nta)
+    return _tune_val
</file context>

@@ -0,0 +1,174 @@
//! OXK full decode-token bench — Qwen3-30B-A3B.

P1: Benchmark file added without a [[bench]] entry in Cargo.toml — cargo bench will not discover it.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-kernels/benches/oxk_token_bench.rs, line 1:

<comment>Benchmark file added without a `[[bench]]` entry in Cargo.toml — `cargo bench` will not discover it.</comment>

<file context>
@@ -0,0 +1,174 @@
+//! OXK full decode-token bench — Qwen3-30B-A3B.
+//!
+//! Times one *real* decode token's worth of GEMVs (every attention + MoE
</file context>

// (treated as uniform/constant), so the ratio collapses to the policy
// ratio alone and the reference terms are 0.
let (ref_c, ref_r) = if self.dpo_config.reference_free {
(0.0_f32, 0.0_f32)

P1: Dead else branch: reference_free flag has no effect — both branches return (0.0, 0.0), so DPO always runs in reference-free mode regardless of the config.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-finetuning/src/dpo.rs, line 311:

<comment>Dead `else` branch: `reference_free` flag has no effect — both branches return `(0.0, 0.0)`, so DPO always runs in reference-free mode regardless of the config.</comment>

<file context>
@@ -0,0 +1,708 @@
+        // (treated as uniform/constant), so the ratio collapses to the policy
+        // ratio alone and the reference terms are 0.
+        let (ref_c, ref_r) = if self.dpo_config.reference_free {
+            (0.0_f32, 0.0_f32)
+        } else {
+            // Without a separate reference model we approximate the reference
</file context>

// QuantizeQ8KInto quantizes vector (length nBlocks*256) into nBlocks Q8_K blocks.
func QuantizeQ8KInto(vector []float32, nBlocks int, out []byte) {
if len(vector) != nBlocks*QK_K {
panic("vector length mismatch")

P1: Exported library functions panic on invalid input instead of returning errors. This can crash host applications and violates repository error-handling conventions.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-golang/core/oxk/oxk.go, line 254:

<comment>Exported library functions panic on invalid input instead of returning errors. This can crash host applications and violates repository error-handling conventions.</comment>

<file context>
@@ -0,0 +1,475 @@
+// QuantizeQ8KInto quantizes vector (length nBlocks*256) into nBlocks Q8_K blocks.
+func QuantizeQ8KInto(vector []float32, nBlocks int, out []byte) {
+	if len(vector) != nBlocks*QK_K {
+		panic("vector length mismatch")
+	}
+	if len(out) < nBlocks*BLOCK_Q8_K_BYTES {
</file context>

&mut self,
step: usize,
lr: f32,
_beta1: f32,

P2: adam_step accepts beta1/beta2/eps params that are silently ignored. Callers passing non-defaults get no effect — misleading API.
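
For reference, a textbook Adam update that does consume all three hyperparameters looks roughly like the following. This is a sketch on a flat parameter vector: the `Adam` struct and its `step` method are hypothetical and do not mirror the layout of the PR's `adam_step`.

```rust
/// Minimal Adam state for a flat parameter vector (hypothetical type).
pub struct Adam {
    pub m: Vec<f32>, // first-moment estimate
    pub v: Vec<f32>, // second-moment estimate
}

impl Adam {
    pub fn new(n: usize) -> Self {
        Self { m: vec![0.0; n], v: vec![0.0; n] }
    }

    /// One Adam update that actually uses `beta1`, `beta2`, and `eps`.
    /// `step` is 1-based so the bias correction is well defined.
    pub fn step(
        &mut self,
        params: &mut [f32],
        grads: &[f32],
        step: usize,
        lr: f32,
        beta1: f32,
        beta2: f32,
        eps: f32,
    ) {
        assert!(step >= 1, "step must be 1-based");
        assert_eq!(params.len(), grads.len());
        assert_eq!(params.len(), self.m.len());
        let t = step as i32;
        let c1 = 1.0 - beta1.powi(t); // bias correction for m
        let c2 = 1.0 - beta2.powi(t); // bias correction for v
        for i in 0..params.len() {
            let g = grads[i];
            self.m[i] = beta1 * self.m[i] + (1.0 - beta1) * g;
            self.v[i] = beta2 * self.v[i] + (1.0 - beta2) * g * g;
            let m_hat = self.m[i] / c1;
            let v_hat = self.v[i] / c2;
            params[i] -= lr * m_hat / (v_hat.sqrt() + eps);
        }
    }
}
```

If the hyperparameters are meant to stay fixed, the other option the comment implies is to drop them from the signature so callers cannot pass values that have no effect.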

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-finetuning/src/qlora.rs, line 257:

<comment>adam_step accepts beta1/beta2/eps params that are silently ignored. Callers passing non-defaults get no effect — misleading API.</comment>

<file context>
@@ -0,0 +1,486 @@
+        &mut self,
+        step: usize,
+        lr: f32,
+        _beta1: f32,
+        _beta2: f32,
+        _eps: f32,
</file context>

let strategy = match args.strategy.to_lowercase().as_str() {
"linear" => MergeStrategy::Linear,
"slerp" => MergeStrategy::Slerp,
"ties" => MergeStrategy::Ties { density: 0.5 },

P2: TIES density is hardcoded to 0.5 with no CLI flag to customize it, making the ties strategy inflexible.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-finetuning/src/main.rs, line 458:

<comment>TIES density is hardcoded to 0.5 with no CLI flag to customize it, making the ties strategy inflexible.</comment>

<file context>
@@ -197,7 +328,219 @@ fn main() -> Result<()> {
+    let strategy = match args.strategy.to_lowercase().as_str() {
+        "linear" => MergeStrategy::Linear,
+        "slerp" => MergeStrategy::Slerp,
+        "ties" => MergeStrategy::Ties { density: 0.5 },
+        other => {
+            anyhow::bail!(
</file context>

if self.history.is_empty() {
return 0.0;
}
let slice = if n == 0 || n >= self.history.len() {

P2: n=0 is treated as "all history" instead of "no steps" — a caller passing 0 likely has a bug, but the function silently returns the full-history mean.
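
A possible shape for the fix, covering both this comment and the related empty-history one, is to return `Option<f32>`. The free function `mean_last_n` below is a hypothetical sketch; the PR's method lives on a telemetry type whose fields are not shown here.

```rust
/// Mean of the last `n` losses. Returns `None` for an empty history or for
/// `n == 0`, so callers can tell "no data" apart from "mean is zero".
pub fn mean_last_n(history: &[f32], n: usize) -> Option<f32> {
    if history.is_empty() || n == 0 {
        return None;
    }
    // Clamp the window to the available history.
    let start = history.len().saturating_sub(n);
    let window = &history[start..];
    Some(window.iter().sum::<f32>() / window.len() as f32)
}
```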

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-finetuning/src/telemetry.rs, line 81:

<comment>n=0 is treated as "all history" instead of "no steps" — a caller passing 0 likely has a bug, but the function silently returns the full-history mean.</comment>

<file context>
@@ -0,0 +1,455 @@
+        if self.history.is_empty() {
+            return 0.0;
+        }
+        let slice = if n == 0 || n >= self.history.len() {
+            &self.history[..]
+        } else {
</file context>

NoFrames(PathBuf),

#[error(transparent)]
Training(#[from] TrainingError),

P2: VideoError::Training variant with #[from] is dead code — never constructed, forcing exhaustive-match arms for an unreachable case in a public enum.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/error.rs, line 44:

<comment>VideoError::Training variant with #[from] is dead code — never constructed, forcing exhaustive-match arms for an unreachable case in a public enum.</comment>

<file context>
@@ -0,0 +1,48 @@
+    NoFrames(PathBuf),
+
+    #[error(transparent)]
+    Training(#[from] TrainingError),
+
+    #[error("invalid configuration: {0}")]
</file context>

.sum()
}

fn fill_pseudo(bytes: &mut [u8], mut state: u64) {

P2: Duplicated fill_pseudo + f16 header taming logic already in oxk_q4k_bench.rs — should be shared (e.g. a test-util module or pub fn in the crate).

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-kernels/benches/oxk_token_bench.rs, line 72:

<comment>Duplicated fill_pseudo + f16 header taming logic already in oxk_q4k_bench.rs — should be shared (e.g. a test-util module or pub fn in the crate).</comment>

<file context>
@@ -0,0 +1,174 @@
+        .sum()
+}
+
+fn fill_pseudo(bytes: &mut [u8], mut state: u64) {
+    for b in bytes.iter_mut() {
+        state ^= state << 13;
</file context>
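
A sketch of what a single shared helper could look like. Only the first shift (`state << 13`) is visible in the diff context; the remaining xorshift64 steps, the zero-seed guard, and the byte truncation below are assumptions, so the real benches may generate different bytes.

```rust
/// Fill `bytes` with deterministic pseudo-random data from a xorshift64 stream.
/// Intended to live in one shared place (for example a bench/test-util module)
/// so both benches call the same code. Details beyond the first shift are assumed.
pub fn fill_pseudo(bytes: &mut [u8], mut state: u64) {
    // xorshift64 is stuck at zero forever, so substitute a fixed nonzero seed.
    if state == 0 {
        state = 0x9E37_79B9_7F4A_7C15;
    }
    for b in bytes.iter_mut() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        // Keep the low byte of the state as the output byte.
        *b = state as u8;
    }
}
```

Because the output depends only on the seed, both benchmarks would keep producing identical inputs after the move.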

Jackson57279 and others added 3 commits June 18, 2026 14:59
Keep hidden state resident on GPU across all transformer layers,
eliminating per-layer CPU residual-stream round trips.  The CUDA path
now holds the full 3072-element hidden vector in a pre-allocated
`GpuActivationBuffer` and only copies Q/K/V vectors to CPU for RoPE
and attention; the wo and FFN projections (gate/up/silu/down) stay
entirely on device.

Key changes:
- `gemv_q4k_f32in_kernel`: Q4_K × F32-input GEMV (no Q8K quantization
  step), one warp per output row
- `gemv_q6k_f32in_kernel`: Q6_K × F32-input GEMV with correct GGML
  interleaved block layout (two 128-element halves, lo4/hi2 split)
- `GpuActivationBuffer`: pre-allocated F32 device buffers for hidden,
  normed, ffn_gate, ffn_up, ffn_down_in
- `rms_norm_f32_kernel`: dynamic shared memory fix (pass block_size×4
  bytes at launch, not 0)
- `gpu_download_hidden`: synchronize stream before D2H copy to avoid
  racing the last residual-add kernel
- `layer_can_use_gpu_native`: per-layer eligibility check (Q4K/Q6K
  weights, no biases, 256-aligned, dense FFN)
- `q4k_or_q6k_bytes`: accepts Q4_K_S / Q4_K_M / Q6_K so that
  Llama-3.2 Q4_K_M layers (which use Q6_K for attn_v) are eligible
- Auto-detect Q4K vs Q6K from block byte-size (≥200 → Q6K)
- `kv_cache.rs`: add `copy_layer_{key,value}_prefix_values` helpers
- `gguf.rs`: expose `tensor_bytes` / `tensor_mmap` accessors

Measured on RunPod H100 80GB with Llama-3.2-3B-Instruct Q4_K_M:
  baseline (--backend cuda): ~41 tok/s
  GPU-native (--backend cuda): ~39 tok/s
Throughput is effectively unchanged (about 2 tok/s lower) because the
original CUDA path already ran all GEMVs on GPU. The per-layer D2H sync
cost on H100 PCIe 5.0 is ~5 µs (not the 100 µs originally estimated), so
eliminating residual-stream round trips saves only ~1 ms/token while
adding ~0.3 ms of attn_out H2D uploads. Real throughput gains require
moving attention itself to GPU (next step).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
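
The block-size auto-detection described in the commit message above can be sketched as follows. This is an illustration, not the code in `cuda.rs`: the function and enum names are hypothetical, and it matches the exact GGML block sizes (144 bytes for Q4_K, 210 bytes for Q6_K, both per 256 elements) rather than the `≥200` threshold the commit mentions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KQuant {
    Q4K,
    Q6K,
}

const QK_K: usize = 256; // elements per K-quant super-block
const Q4K_BLOCK_BYTES: usize = 144; // 2 + 2 + 12 + 128
const Q6K_BLOCK_BYTES: usize = 210; // 128 + 64 + 16 + 2

/// Infer the quantization type from the tensor's byte length and shape.
/// Returns None for shapes that are not 256-aligned or sizes that match neither format.
pub fn detect_k_quant(total_bytes: usize, rows: usize, cols: usize) -> Option<KQuant> {
    if cols == 0 || cols % QK_K != 0 {
        return None;
    }
    let blocks = rows.checked_mul(cols / QK_K)?;
    if blocks == 0 || total_bytes % blocks != 0 {
        return None;
    }
    match total_bytes / blocks {
        Q4K_BLOCK_BYTES => Some(KQuant::Q4K),
        Q6K_BLOCK_BYTES => Some(KQuant::Q6K),
        _ => None,
    }
}
```

Matching exact sizes rejects unexpected formats instead of routing them to the Q6_K kernel, at the cost of needing an update if another block layout is added.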
lm_head (output.weight) is a 128256×3072 matrix.  Running it on CPU
costs 6–16 ms per token (RAM bandwidth bound); on H100 HBM it takes
~100 µs (33–130× faster).  The CPU lm_head is the primary bottleneck
preventing 100 tok/s on 3B models.

Changes:
- cuda.rs: add `gpu_lm_head_quantized` — uploads output.weight once,
  dispatches Q4K or Q6K GEMV kernel, downloads 512 KB logits; auto-
  detects quant type from block byte-size (same pattern as layer GEMVs)
- inference.rs: `final_head_from_workspace` tries GPU path for Q4K/Q6K
  output weights (covers all Q4_K_M / Q6_K GGUF models); falls through
  to CPU on error so the non-CUDA path is unaffected
- build.rs: probe nvcc version and emit native cubins for each Blackwell/
  Hopper/Ada/Ampere generation the installed toolkit supports, alongside
  the compute_75 PTX fallback.  SM120 (RTX 5090) and SM121 (RTX 5080)
  now get pre-compiled native code instead of per-run JIT recompilation
  when CUDA ≥ 12.8 is installed.

Expected result: ~80–120 tok/s on 3B Q4_K_M (from ~40 tok/s baseline)
once lm_head moves off CPU; full 100 tok/s parity with L40 7B benchmarks
requires GPU attention in a follow-up.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
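
A back-of-the-envelope check of the lm_head figures in the commit message above, as a sketch. The block sizes are the standard GGML ones (144 bytes for Q4_K, 210 bytes for Q6_K per 256 elements); which format `output.weight` actually uses, and the bandwidth values passed in, are assumptions and not measurements from this PR.

```rust
const QK_K: u64 = 256; // elements per K-quant super-block

/// Bytes occupied by a rows x cols K-quant matrix with the given block size.
pub fn k_quant_bytes(rows: u64, cols: u64, block_bytes: u64) -> Option<u64> {
    if cols % QK_K != 0 {
        return None;
    }
    rows.checked_mul(cols / QK_K)?.checked_mul(block_bytes)
}

/// Time in microseconds to stream `bytes` once at `gb_per_s` (10^9 bytes per second).
/// A bandwidth-bound GEMV reads every weight byte exactly once per token.
pub fn stream_time_us(bytes: u64, gb_per_s: f64) -> f64 {
    bytes as f64 / (gb_per_s * 1e9) * 1e6
}
```

For a 128256×3072 Q6_K matrix this gives about 323 MB of weights. Streaming that at an assumed 20 to 50 GB/s of CPU memory bandwidth takes roughly 6.5 to 16 ms, and at an assumed ~3350 GB/s of HBM bandwidth roughly 96 µs, which is in line with the numbers quoted in the commit. The F32 logits are 128256 × 4 = 513,024 bytes, matching the "512 KB" download.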
- Updated `oxidize-cli/src/main.rs` to modularize submodules for better maintainability.
- Adjusted `oxidize-core/src/compute/kv_cache.rs` to separate cache functionalities into distinct files.
- Enhanced `oxidize-core/src/backends/cuda.rs` by removing unused structures and functions related to CUDA memory management.
- Updated `oxidize-core/src/model/inference.rs` and `oxidize-core/src/model/layer_wise.rs` to expose necessary functions for external use.
- Modified `continual-learning.json` to reflect updated state values.

These changes aim to streamline the codebase, making it easier to navigate and extend in the future.

@cubic-dev-ai cubic-dev-ai Bot left a comment

31 issues found across 174 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="oxidize-cli/src/main/gpu_cluster.rs">

<violation number="1" location="oxidize-cli/src/main/gpu_cluster.rs:70">
P2: `--nodes` accepts `0` despite claiming a positive integer is required. This silently changes user input into defaults instead of reporting invalid input.</violation>

<violation number="2" location="oxidize-cli/src/main/gpu_cluster.rs:88">
P2: `--gpus-per-node` accepts `0` even though validation message requires a positive integer. This hides invalid input and falls back to defaults.</violation>
</file>

<file name="oxidize-cli/src/main/command_rewrite.rs">

<violation number="1" location="oxidize-cli/src/main/command_rewrite.rs:145">
P2: `--prompt` flag is not recognized as one-shot mode in rewrite logic. This can incorrectly enable chat/server behavior for prompt-only runs.</violation>
</file>

<file name="oxidize-cli/src/main/inference.rs">

<violation number="1" location="oxidize-cli/src/main/inference.rs:93">
P2: Autotune `turboquant=false` is never applied. This can leave KV quantization in the default mode instead of the planned asymmetric mode.</violation>
</file>

<file name=".omo/ulw-loop/evidence/oxidize-post-gen-status.txt">

<violation number="1" location=".omo/ulw-loop/evidence/oxidize-post-gen-status.txt:3">
P2: Raw prompt text is committed in process-output evidence. Redact user/request payloads before checking logs into git.</violation>

<violation number="2" location=".omo/ulw-loop/evidence/oxidize-post-gen-status.txt:9">
P2: Unfiltered host `ps` output leaks unrelated service details in repository history. Keep evidence scoped to oxidize processes or redact external commands.</violation>
</file>

<file name="oxidize-core/build.rs">

<violation number="1" location="oxidize-core/build.rs:140">
P2: `-ptx` forces PTX-only output, so added native `sm_*` gencode targets do not produce embedded cubins. This makes the new version-gated native-arch path ineffective and the documented perf benefit incorrect.</violation>
</file>

<file name="oxidize-core/src/backends/cuda/gemm.rs">

<violation number="1" location="oxidize-core/src/backends/cuda/gemm.rs:64">
P1: GEMM caches the dynamic activation input instead of the weight matrix. This can return stale results when left-buffer pointers are reused and also defeats intended weight-caching performance.</violation>
</file>

<file name=".pre-commit-config.yaml">

<violation number="1" location=".pre-commit-config.yaml:32">
P1: `cargo-udeps` hook references a non-existent script, so the hook cannot run successfully.</violation>
</file>

<file name=".cursor/hooks/state/continual-learning.json">

<violation number="1" location=".cursor/hooks/state/continual-learning.json:4">
P3: Volatile Cursor hook state was committed in this PR. This is unrelated runtime-local metadata and should be excluded from versioned changes.</violation>
</file>

<file name="oxidize-core/src/compute/quantization/quant_nvfp4.rs">

<violation number="1" location="oxidize-core/src/compute/quantization/quant_nvfp4.rs:14">
P3: This introduces duplicate NVFP4 decode logic that already exists in `tensor/kernels/q_kernels.rs`. Keeping two copies risks silent drift when one implementation changes.</violation>
</file>

<file name="oxidize-core/src/compute/quantization/quant_utils.rs">

<violation number="1" location="oxidize-core/src/compute/quantization/quant_utils.rs:59">
P2: New helper duplicates existing float16 conversion code in two modules. This creates divergence risk and inconsistent edge-case handling.</violation>
</file>

<file name=".omo/ulw-loop/evidence/oxidize-server-log-listener.txt">

<violation number="1" location=".omo/ulw-loop/evidence/oxidize-server-log-listener.txt:1">
P3: This adds a near-duplicate evidence log instead of extending existing evidence artifacts. Duplicate files increase repo noise and drift risk.</violation>

<violation number="2" location=".omo/ulw-loop/evidence/oxidize-server-log-listener.txt:5">
P2: Absolute local model path is committed in the evidence log. Redact host-specific paths before committing logs.</violation>
</file>

<file name="oxidize-core/src/backends/cuda/gpu_state.rs">

<violation number="1" location="oxidize-core/src/backends/cuda/gpu_state.rs:105">
P1: Budget eviction runs too late in `preload_layer`; VRAM allocation can fail before old layers are evicted. Evict headroom before each new upload.</violation>
</file>

<file name="oxidize-cli/src/main/generation.rs">

<violation number="1" location="oxidize-cli/src/main/generation.rs:67">
P2: Per-request full-vocab decode loop adds avoidable generation latency. Cache suppressed token IDs per tokenizer/model instead of recomputing each call.</violation>

<violation number="2" location="oxidize-cli/src/main/generation.rs:212">
P2: DFlash path drops generated output when decoded text is empty. Add the same token-id fallback used by other generation paths.</violation>
</file>

<file name="oxidize-core/src/backends/cuda/gpu_kernels.rs">

<violation number="1" location="oxidize-core/src/backends/cuda/gpu_kernels.rs:70">
P3: This introduces substantial duplicate CUDA launch logic already implemented in `gpu_native_forward.rs`. Keeping both paths in sync increases drift risk for kernel signatures and caching behavior.</violation>

<violation number="2" location="oxidize-core/src/backends/cuda/gpu_kernels.rs:158">
P2: `gpu_silu_mul` can issue an invalid zero-block CUDA launch for empty intermediate buffers. Return early when `n == 0`.</violation>

<violation number="3" location="oxidize-core/src/backends/cuda/gpu_kernels.rs:158">
P2: `gpu_residual_add` can launch with `grid_size == 0` when hidden size is zero. Guard `n == 0` before launching to avoid invalid CUDA launch errors.</violation>
</file>

<file name="oxidize-cli/src/main/model_resolution.rs">

<violation number="1" location="oxidize-cli/src/main/model_resolution.rs:73">
P1: Unvalidated joined filenames allow path traversal/absolute-path writes outside intended cache directories.</violation>

<violation number="2" location="oxidize-cli/src/main/model_resolution.rs:152">
P1: Cache key generation is non-injective; different model specs can map to the same cache directory and return incorrect artifacts.</violation>

<violation number="3" location="oxidize-cli/src/main/model_resolution.rs:255">
P1: Final GGUF cache writes are non-atomic, so interrupted writes can poison cache with corrupted files.</violation>
</file>

<file name="oxidize-core/src/backends/cuda/gemv_quantized.rs">

<violation number="1" location="oxidize-core/src/backends/cuda/gemv_quantized.rs:304">
P2: Combined length guard returns wrong error for short `d_input`. Split checks so vector failures return `InvalidVectorLength` with `d_input.len()`.</violation>

<violation number="2" location="oxidize-core/src/backends/cuda/gemv_quantized.rs:310">
P2: Unchecked `as u32` casts can truncate large dimensions and launch kernel with wrapped sizes. Use checked conversions and return a dimension error on overflow.</violation>
</file>

<file name="oxidize-core/src/backends/cuda/gemv_f32.rs">

<violation number="1" location="oxidize-core/src/backends/cuda/gemv_f32.rs:29">
P1: F32 weight cache inserts bypass VRAM budget tracking/eviction. Repeated new matrices can accumulate resident VRAM until OOM despite configured cache limits.</violation>

<violation number="2" location="oxidize-core/src/backends/cuda/gemv_f32.rs:151">
P2: Transposed CUDA GEMV does per-call vector allocation instead of using the existing f32 buffer pool. This adds avoidable allocation overhead and hurts throughput.</violation>
</file>

<file name="oxidize-core/src/backends/cuda.rs">

<violation number="1" location="oxidize-core/src/backends/cuda.rs:43">
P1: Removing content identity from cache keys makes weight-cache hits unsafe for mutable/reused host buffers. GEMV/GEMM can return stale results when data changes without pointer/length changes.</violation>
</file>

<file name="oxidize-cli/src/main/conversation.rs">

<violation number="1" location="oxidize-cli/src/main/conversation.rs:231">
P2: Mesh chat accepts input before leader election can complete. First prompts can time out with no streamed tokens.</violation>

<violation number="2" location="oxidize-cli/src/main/conversation.rs:347">
P1: Timeout/disconnect paths cache empty responses. Repeated prompts can return blank cache hits rather than retrying generation.</violation>
</file>

<file name="oxidize-core/src/backends/cuda/types.rs">

<violation number="1" location="oxidize-core/src/backends/cuda/types.rs:233">
P1: Quantized CUDA weight cache is never cleared by `clear_resident_cache`. This can leak VRAM across model loads and break budget enforcement because `resident_bytes` is reset while `resident_quant` still holds device buffers.</violation>
</file>
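
Two of the findings above (unchecked `as u32` casts in `gemv_quantized.rs` and zero-block launches in `gpu_kernels.rs`) share a fix pattern: validate the launch dimensions before building the launch configuration. A minimal sketch, with a hypothetical helper name and a `String` error type chosen only for illustration:

```rust
/// Compute (grid_size, n_as_u32) for a 1-D launch over `n` elements.
/// Returns Ok(None) when there is nothing to launch, and an error when `n`
/// does not fit in u32 instead of silently wrapping.
pub fn launch_grid(n: usize, block_size: u32) -> Result<Option<(u32, u32)>, String> {
    if block_size == 0 {
        return Err("block_size must be nonzero".to_string());
    }
    if n == 0 {
        // An empty buffer needs no kernel launch; a zero-sized grid is invalid in CUDA.
        return Ok(None);
    }
    let n32 = u32::try_from(n).map_err(|_| format!("dimension {n} exceeds u32::MAX"))?;
    // Ceiling division; n32 >= 1 here, so the subtraction cannot underflow.
    let grid = (n32 - 1) / block_size + 1;
    Ok(Some((grid, n32)))
}
```

Callers would return early on `Ok(None)` and map the error into whatever dimension-error variant the backend already has.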

Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.
On a pro plan you can use ultrareview for larger PRs.

with_gpu(|gpu| {
// Cache left matrix (model weights) in VRAM.
let left_key = f32_cache_key(left_matrix);

P1: GEMM caches the dynamic activation input instead of the weight matrix. This can return stale results when left-buffer pointers are reused and also defeats intended weight-caching performance.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-core/src/backends/cuda/gemm.rs, line 64:

<comment>GEMM caches the dynamic activation input instead of the weight matrix. This can return stale results when left-buffer pointers are reused and also defeats intended weight-caching performance.</comment>

<file context>
@@ -0,0 +1,120 @@
+
+    with_gpu(|gpu| {
+        // Cache left matrix (model weights) in VRAM.
+        let left_key = f32_cache_key(left_matrix);
+        if !gpu.resident_f32.contains_key(&left_key) {
+            let buffer = cust::memory::DeviceBuffer::from_slice(left_matrix).map_err(stringify)?;
</file context>

Comment thread .pre-commit-config.yaml

- id: cargo-udeps
name: cargo udeps (unused dependencies)
entry: ./scripts/check-udeps.sh

P1: cargo-udeps hook references a non-existent script, so the hook cannot run successfully.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .pre-commit-config.yaml, line 32:

<comment>`cargo-udeps` hook references a non-existent script, so the hook cannot run successfully.</comment>

<file context>
@@ -0,0 +1,63 @@
+
+      - id: cargo-udeps
+        name: cargo udeps (unused dependencies)
+        entry: ./scripts/check-udeps.sh
+        language: system
+        pass_filenames: false
</file context>

for (matrix, _rows, _cols) in f32_weights {
let key = f32_cache_key(matrix);
if !gpu.resident_f32.contains_key(&key) {
let buf = cust::memory::DeviceBuffer::from_slice(*matrix).map_err(stringify)?;

P1: Budget eviction runs too late in preload_layer; VRAM allocation can fail before old layers are evicted. Evict headroom before each new upload.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-core/src/backends/cuda/gpu_state.rs, line 105:

<comment>Budget eviction runs too late in `preload_layer`; VRAM allocation can fail before old layers are evicted. Evict headroom before each new upload.</comment>

<file context>
@@ -0,0 +1,150 @@
+        for (matrix, _rows, _cols) in f32_weights {
+            let key = f32_cache_key(matrix);
+            if !gpu.resident_f32.contains_key(&key) {
+                let buf = cust::memory::DeviceBuffer::from_slice(*matrix).map_err(stringify)?;
+                entry.bytes += buf.len() * std::mem::size_of::<f32>();
+                gpu.resident_f32.insert(key, buf);
</file context>
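
The "evict before upload" ordering can be sketched with a host-side stand-in. Everything here is illustrative: `BudgetedCache` is a hypothetical type, `Vec<u8>` stands in for a device buffer, and oldest-first eviction is an assumption about the policy rather than what `gpu_state.rs` does.

```rust
use std::collections::{HashMap, VecDeque};

/// Byte-budgeted cache that frees space before allocating a new entry.
pub struct BudgetedCache {
    budget_bytes: usize,
    resident_bytes: usize,
    entries: HashMap<u64, Vec<u8>>, // Vec<u8> stands in for a device buffer
    order: VecDeque<u64>,           // insertion order, oldest first
}

impl BudgetedCache {
    pub fn new(budget_bytes: usize) -> Self {
        Self { budget_bytes, resident_bytes: 0, entries: HashMap::new(), order: VecDeque::new() }
    }

    pub fn contains(&self, key: u64) -> bool {
        self.entries.contains_key(&key)
    }

    pub fn resident_bytes(&self) -> usize {
        self.resident_bytes
    }

    /// Evict oldest entries until `incoming` bytes fit within the budget.
    fn make_room(&mut self, incoming: usize) -> Result<(), String> {
        if incoming > self.budget_bytes {
            return Err(format!("{incoming} bytes can never fit in budget {}", self.budget_bytes));
        }
        while self.resident_bytes + incoming > self.budget_bytes {
            let Some(oldest) = self.order.pop_front() else { break };
            if let Some(buf) = self.entries.remove(&oldest) {
                self.resident_bytes -= buf.len();
            }
        }
        Ok(())
    }

    pub fn insert(&mut self, key: u64, data: &[u8]) -> Result<(), String> {
        if self.entries.contains_key(&key) {
            return Ok(());
        }
        self.make_room(data.len())?; // free headroom BEFORE the allocation
        let buf = data.to_vec();     // stand-in for the device upload
        self.resident_bytes += buf.len();
        self.entries.insert(key, buf);
        self.order.push_back(key);
        Ok(())
    }
}
```

The point is the order inside `insert`: the allocation only happens once the budget check has passed, so an upload cannot fail merely because stale layers are still resident.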

.map_err(|e| io::Error::other(format!("failed to read GGUF for requantization: {e}")))?;
let quantized = quantize_gguf_to_target(&input_bytes, GgufQuantizationType::Q8_0)
.map_err(|e| io::Error::other(format!("Q8_0 requantization failed: {e}")))?;
std::fs::write(output, &quantized)

P1: Final GGUF cache writes are non-atomic, so interrupted writes can poison cache with corrupted files.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-cli/src/main/model_resolution.rs, line 255:

<comment>Final GGUF cache writes are non-atomic, so interrupted writes can poison cache with corrupted files.</comment>

<file context>
@@ -0,0 +1,324 @@
+        .map_err(|e| io::Error::other(format!("failed to read GGUF for requantization: {e}")))?;
+    let quantized = quantize_gguf_to_target(&input_bytes, GgufQuantizationType::Q8_0)
+        .map_err(|e| io::Error::other(format!("Q8_0 requantization failed: {e}")))?;
+    std::fs::write(output, &quantized)
+        .map_err(|e| io::Error::other(format!("failed to write Q8_0 GGUF: {e}")))?;
+    eprintln!(
</file context>
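
A common way to make the final write atomic is to write to a temporary file in the same directory and rename it over the destination. The sketch below uses only the standard library; the helper name `write_atomic` and the temp-file naming scheme are assumptions, not code from this PR.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Write `bytes` to `dest` so that readers see either the old file or the
/// complete new file, never a partially written one.
pub fn write_atomic(dest: &Path, bytes: &[u8]) -> io::Result<()> {
    let mut tmp_name = dest
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "destination has no file name"))?
        .to_os_string();
    // Same directory as `dest`, so the rename stays on one filesystem.
    tmp_name.push(format!(".tmp.{}", std::process::id()));
    let tmp = dest.with_file_name(tmp_name);

    let result = (|| -> io::Result<()> {
        let mut file = File::create(&tmp)?;
        file.write_all(bytes)?;
        file.sync_all()?; // flush file contents to disk before publishing
        fs::rename(&tmp, dest)
    })();

    if result.is_err() {
        // Best-effort cleanup; the original error is what the caller needs to see.
        let _ = fs::remove_file(&tmp);
    }
    result
}
```

An interrupted run then leaves at most a stray `.tmp.*` file, which a cache lookup for the final name will not pick up. Syncing the parent directory for full crash durability is omitted here.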


pub(super) fn cache_safe_name(spec: &str) -> String {
spec.chars()
.map(|ch| if ch.is_ascii_alphanumeric() { ch } else { '-' })

P1: Cache key generation is non-injective; different model specs can map to the same cache directory and return incorrect artifacts.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-cli/src/main/model_resolution.rs, line 152:

<comment>Cache key generation is non-injective; different model specs can map to the same cache directory and return incorrect artifacts.</comment>

<file context>
@@ -0,0 +1,324 @@
+
+pub(super) fn cache_safe_name(spec: &str) -> String {
+    spec.chars()
+        .map(|ch| if ch.is_ascii_alphanumeric() { ch } else { '-' })
+        .collect()
+}
</file context>
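
One way to keep the readable name while separating specs that sanitize to the same string is to append a hash of the original spec. This is a sketch: `cache_key` and `fnv1a64` are hypothetical names, and a 64-bit FNV-1a digest makes collisions unlikely rather than impossible. FNV-1a is written out by hand because the standard library's default hasher is not guaranteed to be stable across Rust releases, which matters for on-disk cache names.

```rust
/// 64-bit FNV-1a over the raw bytes of the spec.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Filesystem-safe cache directory name: sanitized spec plus a hash suffix
/// computed from the unsanitized spec.
pub fn cache_key(spec: &str) -> String {
    let readable: String = spec
        .chars()
        .map(|ch| if ch.is_ascii_alphanumeric() { ch } else { '-' })
        .collect();
    format!("{readable}-{:016x}", fnv1a64(spec.as_bytes()))
}
```

With this, `org/model` and `org-model` share the readable prefix `org-model-` but end in different suffixes, so they no longer resolve to the same cache directory.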


// Give the mesh node a moment to start up and discover peers.
rt.block_on(async {
tokio::time::sleep(Duration::from_secs(2)).await;

P2: Mesh chat accepts input before leader election can complete. First prompts can time out with no streamed tokens.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-cli/src/main/conversation.rs, line 231:

<comment>Mesh chat accepts input before leader election can complete. First prompts can time out with no streamed tokens.</comment>

<file context>
@@ -0,0 +1,358 @@
+
+    // Give the mesh node a moment to start up and discover peers.
+    rt.block_on(async {
+        tokio::time::sleep(Duration::from_secs(2)).await;
+    });
+
</file context>
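
Instead of a fixed two-second sleep, the chat loop could wait for an explicit readiness condition with a deadline. The sketch below is a synchronous, standard-library-only illustration; `wait_until` is a hypothetical helper, and what "ready" means (for example, a leader being known) is an assumption about the mesh node's API. In the async code shown in the diff, the same idea would be a polling loop wrapped in `tokio::time::timeout`.

```rust
use std::time::{Duration, Instant};

/// Poll `ready` until it returns true or `timeout` elapses.
/// Returns true if the condition was met, false on timeout.
pub fn wait_until(mut ready: impl FnMut() -> bool, timeout: Duration, poll: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if ready() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        std::thread::sleep(poll);
    }
}
```

On a `false` result the CLI can report that no leader was elected rather than accepting a prompt that is likely to time out.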

  "version": 1,
  "lastRunAtMs": 1781685502133,
- "turnsSinceLastRun": 1,
+ "turnsSinceLastRun": 4,

P3: Volatile Cursor hook state was committed in this PR. This is unrelated runtime-local metadata and should be excluded from versioned changes.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .cursor/hooks/state/continual-learning.json, line 4:

<comment>Volatile Cursor hook state was committed in this PR. This is unrelated runtime-local metadata and should be excluded from versioned changes.</comment>

<file context>
@@ -1,8 +1,8 @@
   "version": 1,
   "lastRunAtMs": 1781685502133,
-  "turnsSinceLastRun": 1,
+  "turnsSinceLastRun": 4,
   "lastTranscriptMtimeMs": 1781685501947.5315,
-  "lastProcessedGenerationId": "f1a2db2c-d576-4862-9869-f0392e82e294",
</file context>

}
}

pub fn dequantize_nvfp4_scalar(input: &[u8], output: &mut [f32]) -> Result<(), QuantizationError> {

P3: This introduces duplicate NVFP4 decode logic that already exists in tensor/kernels/q_kernels.rs. Keeping two copies risks silent drift when one implementation changes.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-core/src/compute/quantization/quant_nvfp4.rs, line 14:

<comment>This introduces duplicate NVFP4 decode logic that already exists in `tensor/kernels/q_kernels.rs`. Keeping two copies risks silent drift when one implementation changes.</comment>

<file context>
@@ -0,0 +1,41 @@
+    }
+}
+
+pub fn dequantize_nvfp4_scalar(input: &[u8], output: &mut [f32]) -> Result<(), QuantizationError> {
+    validate_layout(
+        GgufQuantizationType::NVFP4,
</file context>

@@ -0,0 +1,23 @@
backend: cpu (CPU)

P3: This adds a near-duplicate evidence log instead of extending existing evidence artifacts. Duplicate files increase repo noise and drift risk.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .omo/ulw-loop/evidence/oxidize-server-log-listener.txt, line 1:

<comment>This adds a near-duplicate evidence log instead of extending existing evidence artifacts. Duplicate files increase repo noise and drift risk.</comment>

<file context>
@@ -0,0 +1,23 @@
+backend: cpu (CPU)
+api server: starting in background at http://0.0.0.0:18082 (REST /v1/*, WebSocket /v1/realtime)
+load progress: 0% stage=starting bytes=0/7381381664
</file context>

@@ -0,0 +1,231 @@
#[allow(unused_imports)]

P3: This introduces substantial duplicate CUDA launch logic already implemented in gpu_native_forward.rs. Keeping both paths in sync increases drift risk for kernel signatures and caching behavior.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-core/src/backends/cuda/gpu_kernels.rs, line 70:

<comment>This introduces substantial duplicate CUDA launch logic already implemented in `gpu_native_forward.rs`. Keeping both paths in sync increases drift risk for kernel signatures and caching behavior.</comment>

<file context>
@@ -0,0 +1,231 @@
+/// cached in `resident_f32` (mmap-stable pointer identity, same as ordinary
+/// weight matrices).
+#[cfg(feature = "cuda")]
+pub fn gpu_rms_norm(weight: &[f32], eps: f32) -> Result<(), String> {
+    with_gpu(|gpu| {
+        let ab = gpu
</file context>
