Perf/engine sweep #19
Introduce a video pipeline for TikTok-style clip datasets with ffmpeg frame extraction, a trainable patch-embedding classifier, and a prototype subcommand that renders averaged base clips while excluding selected creators. Co-authored-by: Cursor <cursoragent@cursor.com>
Whole-engine perf sweep, round 1. All wins verified against existing bit-exact GEMM/attention tests (AVX2 fallback locally; AVX-512 paths validated on Skylake-SP target).
- flash_attention: dot_product_f32 avx512/avx2 now use 4 independent accumulators + a 16/8-wide remainder loop, breaking the single-chain FMA latency bottleneck on short head_dim (64/96/128) loops. Same for the f16 dot (2 accumulators).
- flash_attention: vectorize the f32 KvElem::axpy (decode V-accumulation) with AVX-512/AVX2 FMA instead of a scalar loop.
- tensor/kernels: add dot4_f32_avx512 / dot_f32_avx512 (16-wide) and dispatch the Q4_K/Q6_K/Q8_0 decode-once GEMM and dot_f32_fast to them when avx512f+vl are present. Doubles dot lanes on the hottest quantized matmul path for AVX-512 hardware.
- tensor/kernels: gemm_f32_cpu inner loop replaced its autovectorization-blocking black_box "prefetch" with the SIMD dot_product_f32 over the (now contiguous) transposed column.
- Cargo.toml: codegen-units = 1 for whole-crate inlining/LTO scope.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
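The multi-accumulator idea in the flash_attention bullet can be illustrated without intrinsics. The following is a minimal scalar sketch, not the PR's AVX code: `dot_f32_4acc` is a hypothetical name, and the four lanes stand in for four independent SIMD accumulators. Because the partial sums are combined in a different order than a single-chain loop, results for general inputs may differ in the last bits.

```rust
/// Dot product with four independent accumulators (scalar sketch).
/// Each accumulator forms its own dependency chain, so consecutive
/// multiply-adds do not have to wait on one another.
pub fn dot_f32_4acc(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "length mismatch");
    let mut acc = [0.0f32; 4];
    let chunks_a = a.chunks_exact(4);
    let chunks_b = b.chunks_exact(4);
    // Grab the leftover elements before the iterators are consumed.
    let rem_a = chunks_a.remainder();
    let rem_b = chunks_b.remainder();
    for (ca, cb) in chunks_a.zip(chunks_b) {
        acc[0] += ca[0] * cb[0];
        acc[1] += ca[1] * cb[1];
        acc[2] += ca[2] * cb[2];
        acc[3] += ca[3] * cb[3];
    }
    // Remainder loop for lengths that are not a multiple of 4.
    let mut tail = 0.0f32;
    for (x, y) in rem_a.iter().zip(rem_b) {
        tail += x * y;
    }
    (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail
}
```

For a head_dim of 64 this runs 16 loop iterations with four chains of length 16, instead of one chain of length 64.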
Extract safe wrappers for read-only mmap, volatile page reads, and Q8_K bsum decoding so format and kernel call sites avoid scattered unsafe blocks. Co-authored-by: Cursor <cursoragent@cursor.com>
Use shared map_readonly and read_volatile_byte for model loading and NUMA warm-up checksums. Co-authored-by: Cursor <cursoragent@cursor.com>
Replace inline unsafe Mmap::map calls in SafeTensors-to-GGUF conversion and oxidize-merge index reads. Co-authored-by: Cursor <cursoragent@cursor.com>
read_q8_k_bsum now takes a byte slice instead of a raw pointer for safer kernel boundaries. Co-authored-by: Cursor <cursoragent@cursor.com>
load_q8_block helpers take &[u8] with debug_assert bounds checks instead of pointer arithmetic at call sites. Co-authored-by: Cursor <cursoragent@cursor.com>
…ode API. Add quantize_scalar_weighted, export IQ4 block constants, and refresh tensor module docs for GPU dispatch. Co-authored-by: Cursor <cursoragent@cursor.com>
…line. Add streaming weighted quantization options to the quantize tool and serialize hidden-state payloads with safe byte copies in the distributed pipeline. Co-authored-by: Cursor <cursoragent@cursor.com>
Clarify lifetime extension and disjoint slice invariants without changing behavior. Co-authored-by: Cursor <cursoragent@cursor.com>
Measure per-token GEMV throughput across all layers instead of a single isolated matvec. Co-authored-by: Cursor <cursoragent@cursor.com>
Port the full-layer GEMV token benchmark used to compare Rust, Go, and Python throughput. Co-authored-by: Cursor <cursoragent@cursor.com>
Ship a small command entrypoint for the Go OXK benchmark runner. Co-authored-by: Cursor <cursoragent@cursor.com>
Mirror the Go/Rust full-token GEMV benchmark in the pure-Python port. Co-authored-by: Cursor <cursoragent@cursor.com>
Introduce preference optimization training and 4-bit NF4 LoRA support as library building blocks. Co-authored-by: Cursor <cursoragent@cursor.com>
…ing. Support linear/SLERP/TIES adapter merging plus a PPO trainer stub for future RLHF workflows. Co-authored-by: Cursor <cursoragent@cursor.com>
Expose metrics logging, early stopping helpers, and public module surface for DPO/QLoRA/merge/RLHF. Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the single-entry SFT flow with subcommands for supervised training, DPO, PPO stub, and adapter merging. Co-authored-by: Cursor <cursoragent@cursor.com>
…dule. Drop prototype.rs in favor of generator.rs for next-frame video model training and sampling. Co-authored-by: Cursor <cursoragent@cursor.com>
Add gen-train/generate exports, virality-first defaults, and dataset exclude/alias merge options. Co-authored-by: Cursor <cursoragent@cursor.com>
…tooling. Filter excluded usernames before training and support reconstructing preview images from normalized patches. Co-authored-by: Cursor <cursoragent@cursor.com>
Extend the train binary with generator training and sampling subcommands alongside the clip classifier. Co-authored-by: Cursor <cursoragent@cursor.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 962714cf39
blockOut[3] = byte(bits >> 24)
qsOff := 4
for i, v := range blockIn {
    q := int(iscale * float64(v))
Use rounding for Q8_K activation quantization
When the Go Q4_K GEMV path quantizes an input vector to Q8_K, this truncates scaled activations toward zero, whereas the Rust kernel rounds before clamping (quantize_block_q8_k_scalar uses scaled.round()). For any non-integer scaled activation, the Go port produces different q8 bytes and bsums, so row-dot outputs no longer match the Rust/ggml reference and accuracy regresses; use math.Round before clamping.
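To make the difference concrete, here is a small Rust sketch contrasting the two behaviours. The function names and the clamp range of -128..=127 are assumptions for illustration; they are not taken from `quantize_block_q8_k_scalar`.

```rust
/// Round to nearest (halves away from zero), then clamp to the i8 range.
/// The clamp bounds are an assumption for this sketch.
pub fn quantize_round(iscale: f32, v: f32) -> i8 {
    let scaled = iscale * v;
    scaled.round().clamp(-128.0, 127.0) as i8
}

/// Truncating variant, mirroring `int(iscale * v)` in the ports:
/// the float-to-int cast drops the fractional part toward zero.
pub fn quantize_trunc(iscale: f32, v: f32) -> i8 {
    let scaled = iscale * v;
    scaled.clamp(-128.0, 127.0) as i8
}
```

Any scaled value with a fractional part of 0.5 or more yields a different byte under the two variants, which in turn changes the block sums. Note that Rust's `f32::round` and Go's `math.Round` both round halves away from zero, while Python's built-in `round` rounds halves to even, so a Python port may need an explicit half-away-from-zero step to match exactly.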
struct.pack_into("<f", block_out, 0, d)
qs_off = 4
for i, v in enumerate(block_in):
    q = int(iscale * v)
Use rounding for Q8_K activation quantization
When the Python OXK path quantizes activations for Q4_K × Q8_K GEMV, int(...) truncates instead of rounding, unlike the Rust implementation's scaled.round() before clamping. For ordinary inputs whose scaled value is fractional, this changes the Q8_K block and its bsums, so the Python port no longer has the advertised bit-identical math and can return different logits; round the scaled value first.
// reference-model log-prob tracking. Wire DpoTrainer here once the
// model-side gradient API stabilises.
eprintln!("oxidize-finetuning dpo: full training loop not yet wired — coming soon");
Ok(())
Return an error instead of succeeding for unwired DPO
When a user runs the advertised dpo subcommand with a valid JSONL file, this path prints that an output will be written and then exits successfully without loading the model, training, or creating args.output. That makes scripts treat a no-op as a completed fine-tune and can leave callers consuming a missing or stale adapter; either wire the DPO trainer/export here or return a non-zero error until it is implemented.
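One way to follow this suggestion is to have the stub return an error value that the binary turns into a non-zero exit status. The sketch below uses only the standard library; `NotImplemented` and `run_dpo_stub` are hypothetical names, and the real binary may well use a different error type.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error type for subcommands that are advertised but not wired.
#[derive(Debug)]
pub struct NotImplemented(pub &'static str);

impl fmt::Display for NotImplemented {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "subcommand `{}` is not implemented yet", self.0)
    }
}

impl Error for NotImplemented {}

/// Stub entry point: fails loudly instead of returning Ok(()).
pub fn run_dpo_stub() -> Result<(), Box<dyn Error>> {
    Err(Box::new(NotImplemented("dpo")))
}
```

When a `main` returning `Result` propagates this error, the process exits with a failure status, so scripts no longer mistake the no-op for a completed fine-tune.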
// Full PPO requires a reward model and rollout collection loop.
// Wire PpoTrainer + RewardModel here once the reward-model API stabilises.
eprintln!("oxidize-finetuning ppo: full training loop not yet wired — coming soon");
Ok(())
Return an error instead of succeeding for unwired PPO
When a user runs the advertised ppo subcommand, this path reports the target output and then exits Ok(()) without checking the model, collecting rollouts, training, or writing the adapter. Automation will see a successful RLHF run even though no artifact exists, so this should fail non-zero until the PPO loop is actually connected.
40 issues found across 46 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="oxidize-core/src/util/bytes.rs">
<violation number="1" location="oxidize-core/src/util/bytes.rs:13">
P1: Safe wrapper `map_readonly` wraps `unsafe Mmap::map` but relies on an unenforceable caller invariant (file must stay unmodified). Violating this invariant causes UB, making the safe signature unsound.</violation>
<violation number="2" location="oxidize-core/src/util/bytes.rs:31">
P0: Public safe function `read_volatile_byte` can cause undefined behavior when called with an out-of-bounds offset. A safe Rust function must never expose UB regardless of inputs.</violation>
</file>
<file name="oxidize-python/oxidize_python/core/oxk.py">
<violation number="1" location="oxidize-python/oxidize_python/core/oxk.py:113">
P1: `os.stderr` does not exist — will raise `AttributeError` at runtime when an invalid `OXIDIZE_OXK_PF_HINT` value is encountered. Must be `sys.stderr` with `import sys`.</violation>
</file>
<file name="oxidize-kernels/benches/oxk_token_bench.rs">
<violation number="1" location="oxidize-kernels/benches/oxk_token_bench.rs:1">
P1: Benchmark file added without a `[[bench]]` entry in Cargo.toml — `cargo bench` will not discover it.</violation>
<violation number="2" location="oxidize-kernels/benches/oxk_token_bench.rs:72">
P2: Duplicated fill_pseudo + f16 header taming logic already in oxk_q4k_bench.rs — should be shared (e.g. a test-util module or pub fn in the crate).</violation>
</file>
<file name="oxidize-finetuning/src/dpo.rs">
<violation number="1" location="oxidize-finetuning/src/dpo.rs:311">
P1: Dead `else` branch: `reference_free` flag has no effect — both branches return `(0.0, 0.0)`, so DPO always runs in reference-free mode regardless of the config.</violation>
<violation number="2" location="oxidize-finetuning/src/dpo.rs:404">
P2: Inconsistent `epochs=0` handling: `.max(1)` silently forces at least 1 epoch, while the existing SFT trainer respects `epochs=0` as zero iterations.</violation>
</file>
<file name="oxidize-golang/core/oxk/oxk.go">
<violation number="1" location="oxidize-golang/core/oxk/oxk.go:254">
P1: Exported library functions panic on invalid input instead of returning errors. This can crash host applications and violates repository error-handling conventions.</violation>
<violation number="2" location="oxidize-golang/core/oxk/oxk.go:259">
P2: Quantization loop allocates chunk slices before processing. Replace with index-based slicing to avoid per-call allocation overhead.</violation>
<violation number="3" location="oxidize-golang/core/oxk/oxk.go:291">
P2: Round the scaled activation before clamping when quantizing Q8_K values. Truncating toward zero changes q8 bytes/bsums and can make Go GEMV outputs diverge from the Rust kernel.</violation>
</file>
<file name="oxidize-finetuning/src/merge.rs">
<violation number="1" location="oxidize-finetuning/src/merge.rs:49">
P1: Slerp path accepts more than two adapters but silently ignores all after the first two. This can drop user inputs and return an unintended merge result.</violation>
<violation number="2" location="oxidize-finetuning/src/merge.rs:120">
P1: `slerp_merge` can interpolate adapters with different target/scale because compatibility checks are incomplete. The returned adapter keeps `a` metadata, causing incorrect semantics when mismatched inputs are passed.</violation>
<violation number="3" location="oxidize-finetuning/src/merge.rs:287">
P1: Linear/TIES compatibility checks omit `target` and `scale`, so incompatible adapters can merge successfully. Result keeps first adapter metadata, which can mislabel target and apply incorrect scaling at inference.</violation>
</file>
<file name="oxidize-train/src/main.rs">
<violation number="1" location="oxidize-train/src/main.rs:109">
P1: Clap `bool` field with `default_value_t = true` cannot be set to `false` from the CLI. The default `SetTrue` action means `--merge-aliases` only sets it to `true` (same as default), with no `--no-merge-aliases` negation. Users cannot disable merge_aliases.</violation>
<violation number="2" location="oxidize-train/src/main.rs:255">
P2: Misleading ffmpeg warning in `run_gen_train`: `load_gen_dataset()` hard-errors on missing ffmpeg, so the command always fails. The warning implies degraded-but-working operation like `run_video`, but that's not the case here.</violation>
</file>
<file name="oxidize-golang/cmd/bench_oxk/main.go">
<violation number="1" location="oxidize-golang/cmd/bench_oxk/main.go:131">
P1: Missing lower-bound validation for `tokens`; OXK_BENCH_TOKENS=0 causes no iterations and divide-by-zero in throughput calculations. Rust port guards with `.max(1)`.</violation>
</file>
<file name="oxidize-train/src/video/generator.rs">
<violation number="1" location="oxidize-train/src/video/generator.rs:410">
P2: Validation split leaks clip information because it is pair-level, not clip-level. Reported val_loss can be overly optimistic and unstable for model selection.</violation>
<violation number="2" location="oxidize-train/src/video/generator.rs:424">
P2: batch_size is effectively ignored because optimization runs one sample at a time inside each chunk. This makes the config misleading and prevents expected minibatch training/perf behavior.</violation>
<violation number="3" location="oxidize-train/src/video/generator.rs:435">
P1: val_split=0 makes val_loss constant 0, so best_model locks to epoch 1. Final model can drop later training progress.</violation>
</file>
<file name="oxidize-finetuning/src/main.rs">
<violation number="1" location="oxidize-finetuning/src/main.rs:191">
P2: Default output path "merged-adapter.gguf" is misleading — export_lora_gguf treats it as a directory and writes adapter_manifest.json inside, creating merged-adapter.gguf/ instead of a file.</violation>
<violation number="2" location="oxidize-finetuning/src/main.rs:360">
P1: This path reports that DPO training is not wired but still returns success. Return an error until the implementation is connected so automation does not treat a no-op as a completed run.</violation>
<violation number="3" location="oxidize-finetuning/src/main.rs:384">
P1: This stub prints that PPO is not wired and then returns `Ok(())`. Return a non-zero error until PPO training is implemented to avoid false-success runs.</violation>
<violation number="4" location="oxidize-finetuning/src/main.rs:458">
P2: TIES density is hardcoded to 0.5 with no CLI flag to customize it, making the ties strategy inflexible.</violation>
</file>
<file name="oxidize-train/src/video/config.rs">
<violation number="1" location="oxidize-train/src/video/config.rs:121">
P2: Missing validation: epochs == 0 allows a training run that performs zero iterations — a silent no-op.</violation>
<violation number="2" location="oxidize-train/src/video/config.rs:123">
P2: Missing validation: learning_rate <= 0.0 silently prevents training progress or causes divergence.</violation>
</file>
<file name="oxidize-merge/src/index.rs">
<violation number="1" location="oxidize-merge/src/index.rs:45">
P2: Duplicate `map_readonly` — identical wrapper already exists in oxidize-core/src/util/bytes.rs. Consider depending on oxidize-core or extracting a shared utility crate to avoid diverging SAFETY comments and duplicated unsafe wrappers.</violation>
<violation number="2" location="oxidize-merge/src/index.rs:46">
P2: SAFETY comment doesn't justify the unsafe call — it describes the mapping type and ownership, not the actual invariant (file must not be modified while mapped). The existing `map_readonly` in oxidize-core/src/util/bytes.rs has a correct version of this comment.</violation>
</file>
<file name="bench_oxk.py">
<violation number="1" location="bench_oxk.py:98">
P2: Missing lower-bound guard on `tokens` allows ZeroDivisionError when OXK_BENCH_TOKENS=0, unlike `n_layers` which is clamped with max(1, …).</violation>
</file>
<file name="oxidize-finetuning/src/qlora.rs">
<violation number="1" location="oxidize-finetuning/src/qlora.rs:61">
P2: `NF4Block::scale` is never read and its doc ("alpha / absmax") contradicts its assignment (`absmax`). Dead field + misleading documentation.</violation>
<violation number="2" location="oxidize-finetuning/src/qlora.rs:205">
P2: forward_batch fully dequantizes the frozen base weight into a new Vec every call. Frozen weights should be dequantized once (e.g. on construction or lazily cached) to avoid repeated allocation and recomputation.</violation>
<violation number="3" location="oxidize-finetuning/src/qlora.rs:257">
P2: adam_step accepts beta1/beta2/eps params that are silently ignored. Callers passing non-defaults get no effect — misleading API.</violation>
</file>
<file name="oxidize-quantize/src/main.rs">
<violation number="1" location="oxidize-quantize/src/main.rs:71">
P2: `nval * 4` can overflow usize before `checked_add` on 32-bit targets. Use `checked_mul(4).and_then(|b| cursor.checked_add(b))` instead.</violation>
</file>
<file name="oxidize-train/src/video/frames.rs">
<violation number="1" location="oxidize-train/src/video/frames.rs:188">
P2: patches_to_image indexes into `patches` without bounds checks — a short or malformed slice causes a panic. Add an early length guard or use get()/chunks_exact().</violation>
</file>
<file name="oxidize-cli/src/pipeline.rs">
<violation number="1" location="oxidize-cli/src/pipeline.rs:216">
P2: Per-call `Vec::with_capacity` allocation in `send_hidden` is a performance regression in the hot decode loop. Consider adding a reusable `&mut Vec<u8>` scratch parameter instead of allocating each call.</violation>
<violation number="2" location="oxidize-cli/src/pipeline.rs:256">
P2: Per-call `vec![0u8; nbytes]` allocation in `recv_hidden_payload` is a performance regression. The previously reused `f16_scratch` buffer is now unused; repurpose it (or a new byte scratch param) to avoid per-step allocation.</violation>
</file>
<file name="oxidize-core/src/model/layer_wise.rs">
<violation number="1" location="oxidize-core/src/model/layer_wise.rs:948">
P2: SAFETY comment drops the access invariant ("we only touch kv/ssm state below"). Without it, a future edit adding a cache-mutating `&mut self` call would create UB with no local reminder to check.</violation>
</file>
<file name="oxidize-finetuning/src/telemetry.rs">
<violation number="1" location="oxidize-finetuning/src/telemetry.rs:79">
P2: Returning 0.0 as mean loss for empty history is semantically incorrect — callers cannot distinguish "no data" from "average loss is zero".</violation>
<violation number="2" location="oxidize-finetuning/src/telemetry.rs:81">
P2: n=0 is treated as "all history" instead of "no steps" — a caller passing 0 likely has a bug, but the function silently returns the full-history mean.</violation>
</file>
<file name="oxidize-train/src/video/dataset.rs">
<violation number="1" location="oxidize-train/src/video/dataset.rs:285">
P2: f32 precision loss in split_indices: casting num_clips to f32 truncates large counts before multiplying by val_split. Use f64 for the intermediate calculation.</violation>
</file>
<file name="oxidize-train/src/video/error.rs">
<violation number="1" location="oxidize-train/src/video/error.rs:44">
P2: VideoError::Training variant with #[from] is dead code — never constructed, forcing exhaustive-match arms for an unreachable case in a public enum.</violation>
</file>
#[inline]
pub fn read_volatile_byte(bytes: &[u8], offset: usize) -> u8 {
    // SAFETY: `offset` is always page-aligned and in-bounds at call sites.
    unsafe { std::ptr::read_volatile(bytes.as_ptr().add(offset)) }
P0: Public safe function read_volatile_byte can cause undefined behavior when called with an out-of-bounds offset. A safe Rust function must never expose UB regardless of inputs.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-core/src/util/bytes.rs, line 31:
<comment>Public safe function `read_volatile_byte` can cause undefined behavior when called with an out-of-bounds offset. A safe Rust function must never expose UB regardless of inputs.</comment>
<file context>
@@ -0,0 +1,32 @@
+#[inline]
+pub fn read_volatile_byte(bytes: &[u8], offset: usize) -> u8 {
+ // SAFETY: `offset` is always page-aligned and in-bounds at call sites.
+ unsafe { std::ptr::read_volatile(bytes.as_ptr().add(offset)) }
+}
</file context>
Suggested change:
- unsafe { std::ptr::read_volatile(bytes.as_ptr().add(offset)) }
+ assert!(offset < bytes.len(), "offset out of bounds");
+ unsafe { std::ptr::read_volatile(bytes.as_ptr().add(offset)) }
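An alternative to asserting is to let the slice do the bounds check and return `None` for a bad offset. This is a sketch under that assumption; `read_volatile_byte_checked` is a hypothetical name, and whether callers prefer a panic or an `Option` is a design choice the PR does not settle.

```rust
/// Volatile read of one byte, sound for every `offset`.
/// Out-of-bounds offsets yield `None` instead of undefined behaviour.
pub fn read_volatile_byte_checked(bytes: &[u8], offset: usize) -> Option<u8> {
    // `get` performs the bounds check and hands back a valid reference.
    let byte_ref = bytes.get(offset)?;
    // SAFETY: `byte_ref` comes from a live slice, so it is non-null,
    // aligned for u8, and points to initialized memory.
    Some(unsafe { std::ptr::read_volatile(byte_ref) })
}
```

The safety argument now lives entirely inside the function, so no caller invariant is needed to keep the safe signature sound.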
elif hint == "t0" or hint == "":
    pf_nta = False
else:
    print(f"OXIDIZE_OXK_PF_HINT={hint} unknown (use t0|nta); using t0", file=os.stderr)
P1: os.stderr does not exist — will raise AttributeError at runtime when an invalid OXIDIZE_OXK_PF_HINT value is encountered. Must be sys.stderr with import sys.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-python/oxidize_python/core/oxk.py, line 113:
<comment>`os.stderr` does not exist — will raise `AttributeError` at runtime when an invalid `OXIDIZE_OXK_PF_HINT` value is encountered. Must be `sys.stderr` with `import sys`.</comment>
<file context>
@@ -0,0 +1,400 @@
+ elif hint == "t0" or hint == "":
+ pf_nta = False
+ else:
+ print(f"OXIDIZE_OXK_PF_HINT={hint} unknown (use t0|nta); using t0", file=os.stderr)
+ _tune_val = OxkTune(pf_bytes=blocks * BLOCK_Q4_K_SIZE, pf_nta=pf_nta)
+ return _tune_val
</file context>
@@ -0,0 +1,174 @@
//! OXK full decode-token bench — Qwen3-30B-A3B.
P1: Benchmark file added without a [[bench]] entry in Cargo.toml — cargo bench will not discover it.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-kernels/benches/oxk_token_bench.rs, line 1:
<comment>Benchmark file added without a `[[bench]]` entry in Cargo.toml — `cargo bench` will not discover it.</comment>
<file context>
@@ -0,0 +1,174 @@
+//! OXK full decode-token bench — Qwen3-30B-A3B.
+//!
+//! Times one *real* decode token's worth of GEMVs (every attention + MoE
</file context>
// (treated as uniform/constant), so the ratio collapses to the policy
// ratio alone and the reference terms are 0.
let (ref_c, ref_r) = if self.dpo_config.reference_free {
    (0.0_f32, 0.0_f32)
P1: Dead else branch: reference_free flag has no effect — both branches return (0.0, 0.0), so DPO always runs in reference-free mode regardless of the config.
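For context, the standard DPO objective subtracts the reference log-probabilities from the policy log-probabilities before comparing chosen and rejected responses, so a non-zero `ref_c`/`ref_r` pair should change the loss. The sketch below shows that formula in isolation; `dpo_loss` and its parameter names are hypothetical and are not the trainer's actual API.

```rust
/// Numerically stable softplus: ln(1 + e^x).
fn softplus(x: f32) -> f32 {
    if x > 0.0 {
        x + (-x).exp().ln_1p()
    } else {
        x.exp().ln_1p()
    }
}

/// DPO loss for one preference pair, given sequence log-probabilities.
/// `pol_*` come from the policy, `ref_*` from the reference model;
/// passing 0.0 for both reference terms gives the reference-free variant.
pub fn dpo_loss(beta: f32, pol_c: f32, pol_r: f32, ref_c: f32, ref_r: f32) -> f32 {
    let margin = beta * ((pol_c - ref_c) - (pol_r - ref_r));
    // -ln(sigmoid(margin)) == softplus(-margin)
    softplus(-margin)
}
```

If both branches of the `reference_free` check supply `(0.0, 0.0)`, the loss reduces to the reference-free form in every configuration, which is the behaviour the review flags.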
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-finetuning/src/dpo.rs, line 311:
<comment>Dead `else` branch: `reference_free` flag has no effect — both branches return `(0.0, 0.0)`, so DPO always runs in reference-free mode regardless of the config.</comment>
<file context>
@@ -0,0 +1,708 @@
+ // (treated as uniform/constant), so the ratio collapses to the policy
+ // ratio alone and the reference terms are 0.
+ let (ref_c, ref_r) = if self.dpo_config.reference_free {
+ (0.0_f32, 0.0_f32)
+ } else {
+ // Without a separate reference model we approximate the reference
</file context>
// QuantizeQ8KInto quantizes vector (length nBlocks*256) into nBlocks Q8_K blocks.
func QuantizeQ8KInto(vector []float32, nBlocks int, out []byte) {
    if len(vector) != nBlocks*QK_K {
        panic("vector length mismatch")
P1: Exported library functions panic on invalid input instead of returning errors. This can crash host applications and violates repository error-handling conventions.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-golang/core/oxk/oxk.go, line 254:
<comment>Exported library functions panic on invalid input instead of returning errors. This can crash host applications and violates repository error-handling conventions.</comment>
<file context>
@@ -0,0 +1,475 @@
+// QuantizeQ8KInto quantizes vector (length nBlocks*256) into nBlocks Q8_K blocks.
+func QuantizeQ8KInto(vector []float32, nBlocks int, out []byte) {
+ if len(vector) != nBlocks*QK_K {
+ panic("vector length mismatch")
+ }
+ if len(out) < nBlocks*BLOCK_Q8_K_BYTES {
</file context>
&mut self,
step: usize,
lr: f32,
_beta1: f32,
P2: adam_step accepts beta1/beta2/eps params that are silently ignored. Callers passing non-defaults get no effect — misleading API.
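A version that honours the hyperparameters could look like the following sketch of a textbook Adam update with bias correction. `AdamState` and the free-function signature are assumptions made for a self-contained example; the PR's `adam_step` is a method with a different shape.

```rust
/// First- and second-moment buffers, one entry per parameter.
pub struct AdamState {
    pub m: Vec<f32>,
    pub v: Vec<f32>,
}

/// One Adam update that actually uses beta1, beta2 and eps.
/// `step` is 1-based so the bias-correction terms are non-zero.
#[allow(clippy::too_many_arguments)]
pub fn adam_step(
    params: &mut [f32],
    grads: &[f32],
    state: &mut AdamState,
    step: usize,
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
) {
    assert!(step >= 1, "step is 1-based");
    assert_eq!(params.len(), grads.len());
    assert_eq!(params.len(), state.m.len());
    assert_eq!(params.len(), state.v.len());
    let bias1 = 1.0 - beta1.powi(step as i32);
    let bias2 = 1.0 - beta2.powi(step as i32);
    for i in 0..params.len() {
        let g = grads[i];
        state.m[i] = beta1 * state.m[i] + (1.0 - beta1) * g;
        state.v[i] = beta2 * state.v[i] + (1.0 - beta2) * g * g;
        let m_hat = state.m[i] / bias1;
        let v_hat = state.v[i] / bias2;
        params[i] -= lr * m_hat / (v_hat.sqrt() + eps);
    }
}
```

If the hyperparameters are meant to stay fixed, the other consistent option is to remove them from the signature rather than accept and ignore them.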
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-finetuning/src/qlora.rs, line 257:
<comment>adam_step accepts beta1/beta2/eps params that are silently ignored. Callers passing non-defaults get no effect — misleading API.</comment>
<file context>
@@ -0,0 +1,486 @@
+ &mut self,
+ step: usize,
+ lr: f32,
+ _beta1: f32,
+ _beta2: f32,
+ _eps: f32,
</file context>
let strategy = match args.strategy.to_lowercase().as_str() {
    "linear" => MergeStrategy::Linear,
    "slerp" => MergeStrategy::Slerp,
    "ties" => MergeStrategy::Ties { density: 0.5 },
P2: TIES density is hardcoded to 0.5 with no CLI flag to customize it, making the ties strategy inflexible.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-finetuning/src/main.rs, line 458:
<comment>TIES density is hardcoded to 0.5 with no CLI flag to customize it, making the ties strategy inflexible.</comment>
<file context>
@@ -197,7 +328,219 @@ fn main() -> Result<()> {
+ let strategy = match args.strategy.to_lowercase().as_str() {
+ "linear" => MergeStrategy::Linear,
+ "slerp" => MergeStrategy::Slerp,
+ "ties" => MergeStrategy::Ties { density: 0.5 },
+ other => {
+ anyhow::bail!(
</file context>
if self.history.is_empty() {
    return 0.0;
}
let slice = if n == 0 || n >= self.history.len() {
P2: n=0 is treated as "all history" instead of "no steps" — a caller passing 0 likely has a bug, but the function silently returns the full-history mean.
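Both telemetry findings (empty history returning 0.0, and n=0 meaning "everything") can be addressed by returning an `Option`. This is a minimal sketch with a hypothetical free function `mean_recent_loss`; the real method lives on a struct that owns `history`.

```rust
/// Mean of the last `n` losses.
/// Returns `None` when there is nothing to average (empty history or n == 0),
/// so callers can tell "no data" apart from "mean loss is zero".
pub fn mean_recent_loss(history: &[f32], n: usize) -> Option<f32> {
    if n == 0 || history.is_empty() {
        return None;
    }
    // If n exceeds the history length, average everything that exists.
    let start = history.len().saturating_sub(n);
    let window = &history[start..];
    Some(window.iter().sum::<f32>() / window.len() as f32)
}
```

A caller that really wants the full-history mean can pass `history.len()` or `usize::MAX` explicitly.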
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-finetuning/src/telemetry.rs, line 81:
<comment>n=0 is treated as "all history" instead of "no steps" — a caller passing 0 likely has a bug, but the function silently returns the full-history mean.</comment>
<file context>
@@ -0,0 +1,455 @@
+ if self.history.is_empty() {
+ return 0.0;
+ }
+ let slice = if n == 0 || n >= self.history.len() {
+ &self.history[..]
+ } else {
</file context>
NoFrames(PathBuf),

#[error(transparent)]
Training(#[from] TrainingError),
P2: VideoError::Training variant with #[from] is dead code — never constructed, forcing exhaustive-match arms for an unreachable case in a public enum.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/error.rs, line 44:
<comment>VideoError::Training variant with #[from] is dead code — never constructed, forcing exhaustive-match arms for an unreachable case in a public enum.</comment>
<file context>
@@ -0,0 +1,48 @@
+ NoFrames(PathBuf),
+
+ #[error(transparent)]
+ Training(#[from] TrainingError),
+
+ #[error("invalid configuration: {0}")]
</file context>
    .sum()
}

fn fill_pseudo(bytes: &mut [u8], mut state: u64) {
P2: Duplicated fill_pseudo + f16 header taming logic already in oxk_q4k_bench.rs — should be shared (e.g. a test-util module or pub fn in the crate).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-kernels/benches/oxk_token_bench.rs, line 72:
<comment>Duplicated fill_pseudo + f16 header taming logic already in oxk_q4k_bench.rs — should be shared (e.g. a test-util module or pub fn in the crate).</comment>
<file context>
@@ -0,0 +1,174 @@
+ .sum()
+}
+
+fn fill_pseudo(bytes: &mut [u8], mut state: u64) {
+ for b in bytes.iter_mut() {
+ state ^= state << 13;
</file context>
Keep hidden state resident on GPU across all transformer layers,
eliminating per-layer CPU residual-stream round trips. The CUDA path
now holds the full 3072-element hidden vector in a pre-allocated
`GpuActivationBuffer` and only copies Q/K/V vectors to CPU for RoPE
and attention; the wo and FFN projections (gate/up/silu/down) stay
entirely on device.
Key changes:
- `gemv_q4k_f32in_kernel`: Q4_K × F32-input GEMV (no Q8K quantization
step), one warp per output row
- `gemv_q6k_f32in_kernel`: Q6_K × F32-input GEMV with correct GGML
interleaved block layout (two 128-element halves, lo4/hi2 split)
- `GpuActivationBuffer`: pre-allocated F32 device buffers for hidden,
normed, ffn_gate, ffn_up, ffn_down_in
- `rms_norm_f32_kernel`: dynamic shared memory fix (pass block_size×4
bytes at launch, not 0)
- `gpu_download_hidden`: synchronize stream before D2H copy to avoid
racing the last residual-add kernel
- `layer_can_use_gpu_native`: per-layer eligibility check (Q4K/Q6K
weights, no biases, 256-aligned, dense FFN)
- `q4k_or_q6k_bytes`: accepts Q4_K_S / Q4_K_M / Q6_K so that
Llama-3.2 Q4_K_M layers (which use Q6_K for attn_v) are eligible
- Auto-detect Q4K vs Q6K from block byte-size (≥200 → Q6K)
- `kv_cache.rs`: add `copy_layer_{key,value}_prefix_values` helpers
- `gguf.rs`: expose `tensor_bytes` / `tensor_mmap` accessors
Measured on RunPod H100 80GB with Llama-3.2-3B-Instruct Q4_K_M:
baseline (--backend cuda): ~41 tok/s
GPU-native (--backend cuda): ~39 tok/s
There is no net speedup because the original CUDA path already ran all
GEMVs on GPU; the per-layer D2H sync cost on H100 PCIe 5.0 is ~5 µs
(not the 100 µs estimated), so eliminating residual-stream round trips
saves only ~1 ms/token while adding ~0.3 ms of attn_out H2D uploads.
Real throughput gains require moving attention itself to GPU (next step).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lm_head (output.weight) is a 128256×3072 matrix. Running it on CPU costs 6–16 ms per token (RAM bandwidth bound); on H100 HBM it takes ~100 µs (33–130× faster). This single operation is the primary bottleneck preventing 100 tok/s on 3B models.
Changes:
- cuda.rs: add `gpu_lm_head_quantized`, which uploads output.weight once, dispatches the Q4K or Q6K GEMV kernel, and downloads 512 KB of logits; it auto-detects the quant type from block byte-size (same pattern as layer GEMVs)
- inference.rs: `final_head_from_workspace` tries the GPU path for Q4K/Q6K output weights (covers all Q4_K_M / Q6_K GGUF models) and falls through to CPU on error, so the non-CUDA path is unaffected
- build.rs: probe the nvcc version and emit native cubins for each Blackwell/Hopper/Ada/Ampere generation the installed toolkit supports, alongside the compute_75 PTX fallback. SM120 (RTX 5090) and SM121 (RTX 5080) now get pre-compiled native code instead of per-run JIT recompilation when CUDA ≥ 12.8 is installed.
Expected result: ~80–120 tok/s on 3B Q4_K_M (from ~40 tok/s baseline) once lm_head moves off CPU; full 100 tok/s parity with L40 7B benchmarks requires GPU attention in a follow-up.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
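The sizes behind these figures can be checked with a little arithmetic. The sketch below assumes the usual ggml K-quant layout (256 elements per block, 144 bytes per Q4_K block, 210 bytes per Q6_K block), which is consistent with the "≥200 → Q6K" detection rule in the earlier commit; the function names are hypothetical.

```rust
/// Elements per K-quant super-block (ggml layout, assumed here).
pub const QK_K: usize = 256;
/// Bytes per Q4_K block (assumed ggml layout).
pub const BLOCK_Q4_K_BYTES: usize = 144;
/// Bytes per Q6_K block (assumed ggml layout).
pub const BLOCK_Q6_K_BYTES: usize = 210;

/// Bytes needed for a `rows x cols` weight matrix quantized row by row.
/// Returns `None` if a row is not a whole number of blocks or on overflow.
pub fn quantized_bytes(rows: usize, cols: usize, block_bytes: usize) -> Option<usize> {
    if cols % QK_K != 0 {
        return None;
    }
    let blocks = rows.checked_mul(cols / QK_K)?;
    blocks.checked_mul(block_bytes)
}

/// Bytes of f32 logits downloaded per token for a given vocabulary size.
pub fn logits_bytes(vocab: usize) -> Option<usize> {
    vocab.checked_mul(std::mem::size_of::<f32>())
}
```

For the 128256×3072 head this gives about 222 MB of weights in Q4_K or about 323 MB in Q6_K, read once per token, and 513,024 bytes of f32 logits, which matches the roughly 512 KB download mentioned in the commit.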
- Updated `oxidize-cli/src/main.rs` to modularize submodules for better maintainability.
- Adjusted `oxidize-core/src/compute/kv_cache.rs` to separate cache functionalities into distinct files.
- Enhanced `oxidize-core/src/backends/cuda.rs` by removing unused structures and functions related to CUDA memory management.
- Updated `oxidize-core/src/model/inference.rs` and `oxidize-core/src/model/layer_wise.rs` to expose necessary functions for external use.
- Modified `continual-learning.json` to reflect updated state values.
These changes aim to streamline the codebase, making it easier to navigate and extend in the future.
31 issues found across 174 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="oxidize-cli/src/main/gpu_cluster.rs">
<violation number="1" location="oxidize-cli/src/main/gpu_cluster.rs:70">
P2: `--nodes` accepts `0` despite claiming a positive integer is required. This silently changes user input into defaults instead of reporting invalid input.</violation>
<violation number="2" location="oxidize-cli/src/main/gpu_cluster.rs:88">
P2: `--gpus-per-node` accepts `0` even though validation message requires a positive integer. This hides invalid input and falls back to defaults.</violation>
</file>
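One way to make `--nodes` and `--gpus-per-node` reject `0` is to parse into a non-zero integer type, so the "positive integer" requirement is enforced by the parser itself. This is a minimal sketch; `parse_positive` is a hypothetical helper and not the existing code in `gpu_cluster.rs`.

```rust
use std::num::NonZeroU32;

/// Parse a CLI value that must be a strictly positive integer.
/// `NonZeroU32::from_str` already rejects "0", negatives, and non-numeric input.
pub fn parse_positive(flag: &str, raw: &str) -> Result<NonZeroU32, String> {
    raw.trim()
        .parse::<NonZeroU32>()
        .map_err(|_| format!("{flag} requires a positive integer, got '{raw}'"))
}
```

Returning an error here, instead of silently substituting a default, surfaces the invalid input to the user.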
<file name="oxidize-cli/src/main/command_rewrite.rs">
<violation number="1" location="oxidize-cli/src/main/command_rewrite.rs:145">
P2: `--prompt` flag is not recognized as one-shot mode in rewrite logic. This can incorrectly enable chat/server behavior for prompt-only runs.</violation>
</file>
<file name="oxidize-cli/src/main/inference.rs">
<violation number="1" location="oxidize-cli/src/main/inference.rs:93">
P2: Autotune `turboquant=false` is never applied. This can leave KV quantization in the default mode instead of the planned asymmetric mode.</violation>
</file>
<file name=".omo/ulw-loop/evidence/oxidize-post-gen-status.txt">
<violation number="1" location=".omo/ulw-loop/evidence/oxidize-post-gen-status.txt:3">
P2: Raw prompt text is committed in process-output evidence. Redact user/request payloads before checking logs into git.</violation>
<violation number="2" location=".omo/ulw-loop/evidence/oxidize-post-gen-status.txt:9">
P2: Unfiltered host `ps` output leaks unrelated service details in repository history. Keep evidence scoped to oxidize processes or redact external commands.</violation>
</file>
<file name="oxidize-core/build.rs">
<violation number="1" location="oxidize-core/build.rs:140">
P2: `-ptx` forces PTX-only output, so added native `sm_*` gencode targets do not produce embedded cubins. This makes the new version-gated native-arch path ineffective and the documented perf benefit incorrect.</violation>
</file>
<file name="oxidize-core/src/backends/cuda/gemm.rs">
<violation number="1" location="oxidize-core/src/backends/cuda/gemm.rs:64">
P1: GEMM caches the dynamic activation input instead of the weight matrix. This can return stale results when left-buffer pointers are reused and also defeats intended weight-caching performance.</violation>
</file>
<file name=".pre-commit-config.yaml">
<violation number="1" location=".pre-commit-config.yaml:32">
P1: `cargo-udeps` hook references a non-existent script, so the hook cannot run successfully.</violation>
</file>
<file name=".cursor/hooks/state/continual-learning.json">
<violation number="1" location=".cursor/hooks/state/continual-learning.json:4">
P3: Volatile Cursor hook state was committed in this PR. This is unrelated runtime-local metadata and should be excluded from versioned changes.</violation>
</file>
<file name="oxidize-core/src/compute/quantization/quant_nvfp4.rs">
<violation number="1" location="oxidize-core/src/compute/quantization/quant_nvfp4.rs:14">
P3: This introduces duplicate NVFP4 decode logic that already exists in `tensor/kernels/q_kernels.rs`. Keeping two copies risks silent drift when one implementation changes.</violation>
</file>
<file name="oxidize-core/src/compute/quantization/quant_utils.rs">
<violation number="1" location="oxidize-core/src/compute/quantization/quant_utils.rs:59">
P2: New helper duplicates existing float16 conversion code in two modules. This creates divergence risk and inconsistent edge-case handling.</violation>
</file>
<file name=".omo/ulw-loop/evidence/oxidize-server-log-listener.txt">
<violation number="1" location=".omo/ulw-loop/evidence/oxidize-server-log-listener.txt:1">
P3: This adds a near-duplicate evidence log instead of extending existing evidence artifacts. Duplicate files increase repo noise and drift risk.</violation>
<violation number="2" location=".omo/ulw-loop/evidence/oxidize-server-log-listener.txt:5">
P2: Absolute local model path is committed in the evidence log. Redact host-specific paths before committing logs.</violation>
</file>
<file name="oxidize-core/src/backends/cuda/gpu_state.rs">
<violation number="1" location="oxidize-core/src/backends/cuda/gpu_state.rs:105">
P1: Budget eviction runs too late in `preload_layer`; VRAM allocation can fail before old layers are evicted. Evict headroom before each new upload.</violation>
</file>
<file name="oxidize-cli/src/main/generation.rs">
<violation number="1" location="oxidize-cli/src/main/generation.rs:67">
P2: Per-request full-vocab decode loop adds avoidable generation latency. Cache suppressed token IDs per tokenizer/model instead of recomputing each call.</violation>
<violation number="2" location="oxidize-cli/src/main/generation.rs:212">
P2: DFlash path drops generated output when decoded text is empty. Add the same token-id fallback used by other generation paths.</violation>
</file>
<file name="oxidize-core/src/backends/cuda/gpu_kernels.rs">
<violation number="1" location="oxidize-core/src/backends/cuda/gpu_kernels.rs:70">
P3: This introduces substantial duplicate CUDA launch logic already implemented in `gpu_native_forward.rs`. Keeping both paths in sync increases drift risk for kernel signatures and caching behavior.</violation>
<violation number="2" location="oxidize-core/src/backends/cuda/gpu_kernels.rs:158">
P2: `gpu_silu_mul` can issue an invalid zero-block CUDA launch for empty intermediate buffers. Return early when `n == 0`.</violation>
<violation number="3" location="oxidize-core/src/backends/cuda/gpu_kernels.rs:158">
P2: `gpu_residual_add` can launch with `grid_size == 0` when hidden size is zero. Guard `n == 0` before launching to avoid invalid CUDA launch errors.</violation>
</file>
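The zero-block launch issues in `gpu_silu_mul` and `gpu_residual_add` could be guarded by computing the grid size in one place and treating `n == 0` as "nothing to launch". This is a sketch under that assumption; `grid_size_for` is a hypothetical helper and the block size is supplied by the caller.

```rust
/// Compute the 1-D grid size for `n` elements.
/// Returns `Ok(None)` when there is nothing to launch, so callers can return early
/// instead of issuing a launch with zero blocks.
pub fn grid_size_for(n: usize, block_size: u32) -> Result<Option<u32>, String> {
    if n == 0 {
        return Ok(None);
    }
    if block_size == 0 {
        return Err("block_size must be non-zero".to_string());
    }
    let bs = block_size as usize;
    // Ceiling division without risking overflow in `n + bs - 1`.
    let blocks = n / bs + usize::from(n % bs != 0);
    u32::try_from(blocks)
        .map(Some)
        .map_err(|_| format!("grid size {blocks} does not fit in u32"))
}
```

A caller would match on the result and skip the kernel launch entirely for `Ok(None)`.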
<file name="oxidize-cli/src/main/model_resolution.rs">
<violation number="1" location="oxidize-cli/src/main/model_resolution.rs:73">
P1: Unvalidated joined filenames allow path traversal/absolute-path writes outside intended cache directories.</violation>
<violation number="2" location="oxidize-cli/src/main/model_resolution.rs:152">
P1: Cache key generation is non-injective; different model specs can map to the same cache directory and return incorrect artifacts.</violation>
<violation number="3" location="oxidize-cli/src/main/model_resolution.rs:255">
P1: Final GGUF cache writes are non-atomic, so interrupted writes can poison cache with corrupted files.</violation>
</file>
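For the path-traversal finding, one possible guard is to accept only a single plain filename component before joining it onto the cache directory. The sketch below assumes that cached artifacts are always flat files directly under the cache directory; `safe_join` is a hypothetical helper.

```rust
use std::path::{Component, Path, PathBuf};

/// Join `file_name` onto `cache_dir` only if it is a single normal path component.
/// Absolute paths, `..`, `.`, nested paths, and empty names are rejected.
pub fn safe_join(cache_dir: &Path, file_name: &str) -> Result<PathBuf, String> {
    let mut components = Path::new(file_name).components();
    match (components.next(), components.next()) {
        (Some(Component::Normal(name)), None) => Ok(cache_dir.join(name)),
        _ => Err(format!("refusing unsafe cache filename: {file_name:?}")),
    }
}
```

If nested cache layouts are needed, the check would have to walk every component and reject anything other than `Component::Normal`.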
<file name="oxidize-core/src/backends/cuda/gemv_quantized.rs">
<violation number="1" location="oxidize-core/src/backends/cuda/gemv_quantized.rs:304">
P2: Combined length guard returns wrong error for short `d_input`. Split checks so vector failures return `InvalidVectorLength` with `d_input.len()`.</violation>
<violation number="2" location="oxidize-core/src/backends/cuda/gemv_quantized.rs:310">
P2: Unchecked `as u32` casts can truncate large dimensions and launch kernel with wrapped sizes. Use checked conversions and return a dimension error on overflow.</violation>
</file>
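Both `gemv_quantized.rs` findings could be addressed by validating the input vector separately and by replacing `as u32` with checked conversions. The following is a sketch only: `InvalidVectorLength` is the variant named in the review, while `DimensionOverflow`, the field names, and `checked_launch_dims` are assumptions made for this example.

```rust
/// Hypothetical error type; only `InvalidVectorLength` is named in the review.
#[derive(Debug, PartialEq, Eq)]
pub enum GemvError {
    InvalidVectorLength { expected: usize, actual: usize },
    DimensionOverflow { name: &'static str, value: usize },
}

/// Validate the input vector on its own, then convert dimensions without truncation.
pub fn checked_launch_dims(
    rows: usize,
    cols: usize,
    input_len: usize,
) -> Result<(u32, u32), GemvError> {
    // A short input vector gets its own error, carrying the vector's real length.
    if input_len < cols {
        return Err(GemvError::InvalidVectorLength { expected: cols, actual: input_len });
    }
    // `u32::try_from` fails instead of wrapping when a dimension exceeds u32::MAX.
    let rows32 = u32::try_from(rows)
        .map_err(|_| GemvError::DimensionOverflow { name: "rows", value: rows })?;
    let cols32 = u32::try_from(cols)
        .map_err(|_| GemvError::DimensionOverflow { name: "cols", value: cols })?;
    Ok((rows32, cols32))
}
```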
<file name="oxidize-core/src/backends/cuda/gemv_f32.rs">
<violation number="1" location="oxidize-core/src/backends/cuda/gemv_f32.rs:29">
P1: F32 weight cache inserts bypass VRAM budget tracking/eviction. Repeated new matrices can accumulate resident VRAM until OOM despite configured cache limits.</violation>
<violation number="2" location="oxidize-core/src/backends/cuda/gemv_f32.rs:151">
P2: Transposed CUDA GEMV does per-call vector allocation instead of using the existing f32 buffer pool. This adds avoidable allocation overhead and hurts throughput.</violation>
</file>
<file name="oxidize-core/src/backends/cuda.rs">
<violation number="1" location="oxidize-core/src/backends/cuda.rs:43">
P1: Removing content identity from cache keys makes weight-cache hits unsafe for mutable/reused host buffers. GEMV/GEMM can return stale results when data changes without pointer/length changes.</violation>
</file>
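To show what "content identity" in a cache key can mean, the sketch below combines pointer, length, and a hash of the buffer contents. `WeightKey` and `weight_key` are hypothetical, and hashing the full buffer on every call is likely too expensive for large weights; a real fix might hash a sample, or skip the fingerprint only for buffers known to be immutable (for example mmap-backed weights).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical cache key: pointer and length plus a fingerprint of the contents.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct WeightKey {
    ptr: usize,
    len: usize,
    fingerprint: u64,
}

/// Build a key that changes when the data changes, even if the host buffer
/// keeps the same address and length.
pub fn weight_key(data: &[f32]) -> WeightKey {
    let mut hasher = DefaultHasher::new();
    for value in data {
        // Hash the raw bit pattern, since f32 does not implement `Hash`.
        value.to_bits().hash(&mut hasher);
    }
    WeightKey {
        ptr: data.as_ptr() as usize,
        len: data.len(),
        fingerprint: hasher.finish(),
    }
}
```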
<file name="oxidize-cli/src/main/conversation.rs">
<violation number="1" location="oxidize-cli/src/main/conversation.rs:231">
P2: Mesh chat accepts input before leader election can complete. First prompts can time out with no streamed tokens.</violation>
<violation number="2" location="oxidize-cli/src/main/conversation.rs:347">
P1: Timeout/disconnect paths cache empty responses. Repeated prompts can return blank cache hits rather than retrying generation.</violation>
</file>
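The empty-response caching issue could be avoided by only writing to the cache when generation completed with non-empty text. This sketch assumes a simple prompt-to-text map; `Outcome` and `record_response` are hypothetical names, not the types used in `conversation.rs`.

```rust
use std::collections::HashMap;

/// Hypothetical result of one mesh chat request.
pub enum Outcome {
    Completed(String),
    TimedOut,
    Disconnected,
}

/// Cache only successful, non-empty responses.
/// Timeouts and disconnects leave the cache untouched so the next attempt retries.
pub fn record_response(
    cache: &mut HashMap<String, String>,
    prompt: &str,
    outcome: Outcome,
) -> Option<String> {
    match outcome {
        Outcome::Completed(text) if !text.is_empty() => {
            cache.insert(prompt.to_owned(), text.clone());
            Some(text)
        }
        _ => None,
    }
}
```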
<file name="oxidize-core/src/backends/cuda/types.rs">
<violation number="1" location="oxidize-core/src/backends/cuda/types.rs:233">
P1: Quantized CUDA weight cache is never cleared by `clear_resident_cache`. This can leak VRAM across model loads and break budget enforcement because `resident_bytes` is reset while `resident_quant` still holds device buffers.</violation>
</file>
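The `clear_resident_cache` finding comes down to keeping the byte counter and every buffer map in sync. In this sketch the field names `resident_f32`, `resident_quant`, and `resident_bytes` follow the review text, while the `GpuCache` struct and the use of `Vec` as a stand-in for device buffers are assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the GPU state; `Vec`s replace device buffers here.
#[derive(Default)]
pub struct GpuCache {
    pub resident_f32: HashMap<u64, Vec<f32>>,
    pub resident_quant: HashMap<u64, Vec<u8>>,
    pub resident_bytes: usize,
}

impl GpuCache {
    /// Drop every cached buffer before resetting the counter, so that
    /// `resident_bytes == 0` really means no device memory is held.
    pub fn clear_resident_cache(&mut self) {
        self.resident_f32.clear();
        self.resident_quant.clear();
        self.resident_bytes = 0;
    }
}
```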
Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.
P1: GEMM caches the dynamic activation input instead of the weight matrix. This can return stale results when left-buffer pointers are reused and also defeats intended weight-caching performance.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-core/src/backends/cuda/gemm.rs, line 64:
<comment>GEMM caches the dynamic activation input instead of the weight matrix. This can return stale results when left-buffer pointers are reused and also defeats intended weight-caching performance.</comment>
<file context>
@@ -0,0 +1,120 @@
+
+ with_gpu(|gpu| {
+ // Cache left matrix (model weights) in VRAM.
+ let left_key = f32_cache_key(left_matrix);
+ if !gpu.resident_f32.contains_key(&left_key) {
+ let buffer = cust::memory::DeviceBuffer::from_slice(left_matrix).map_err(stringify)?;
</file context>
P1: cargo-udeps hook references a non-existent script, so the hook cannot run successfully.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .pre-commit-config.yaml, line 32:
<comment>`cargo-udeps` hook references a non-existent script, so the hook cannot run successfully.</comment>
<file context>
@@ -0,0 +1,63 @@
+
+ - id: cargo-udeps
+ name: cargo udeps (unused dependencies)
+ entry: ./scripts/check-udeps.sh
+ language: system
+ pass_filenames: false
</file context>
P1: Budget eviction runs too late in preload_layer; VRAM allocation can fail before old layers are evicted. Evict headroom before each new upload.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-core/src/backends/cuda/gpu_state.rs, line 105:
<comment>Budget eviction runs too late in `preload_layer`; VRAM allocation can fail before old layers are evicted. Evict headroom before each new upload.</comment>
<file context>
@@ -0,0 +1,150 @@
+ for (matrix, _rows, _cols) in f32_weights {
+ let key = f32_cache_key(matrix);
+ if !gpu.resident_f32.contains_key(&key) {
+ let buf = cust::memory::DeviceBuffer::from_slice(*matrix).map_err(stringify)?;
+ entry.bytes += buf.len() * std::mem::size_of::<f32>();
+ gpu.resident_f32.insert(key, buf);
</file context>
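The suggested ordering, "evict headroom before each new upload", can be sketched with a small budget-tracked cache. This is an illustration only: `ResidentCache` is hypothetical, eviction here is oldest-first, and the actual device upload (for example `DeviceBuffer::from_slice`) is represented by a comment.

```rust
use std::collections::VecDeque;

/// Hypothetical budget-tracked cache of (key, bytes) entries, oldest first.
pub struct ResidentCache {
    budget_bytes: usize,
    resident_bytes: usize,
    entries: VecDeque<(u64, usize)>,
}

impl ResidentCache {
    pub fn new(budget_bytes: usize) -> Self {
        Self { budget_bytes, resident_bytes: 0, entries: VecDeque::new() }
    }

    pub fn resident_bytes(&self) -> usize {
        self.resident_bytes
    }

    /// Evict oldest entries until `incoming` bytes fit; returns the evicted keys.
    fn make_room(&mut self, incoming: usize) -> Result<Vec<u64>, String> {
        if incoming > self.budget_bytes {
            return Err(format!("{incoming} bytes exceeds budget {}", self.budget_bytes));
        }
        let mut evicted = Vec::new();
        while self.resident_bytes + incoming > self.budget_bytes {
            let (key, bytes) = self
                .entries
                .pop_front()
                .expect("non-zero resident_bytes implies at least one entry");
            self.resident_bytes -= bytes;
            evicted.push(key);
        }
        Ok(evicted)
    }

    /// Make room first, then upload and record the new entry.
    pub fn insert(&mut self, key: u64, bytes: usize) -> Result<Vec<u64>, String> {
        let evicted = self.make_room(bytes)?;
        // The device allocation would happen here, after headroom exists.
        self.entries.push_back((key, bytes));
        self.resident_bytes += bytes;
        Ok(evicted)
    }
}
```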
P1: Final GGUF cache writes are non-atomic, so interrupted writes can poison cache with corrupted files.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-cli/src/main/model_resolution.rs, line 255:
<comment>Final GGUF cache writes are non-atomic, so interrupted writes can poison cache with corrupted files.</comment>
<file context>
@@ -0,0 +1,324 @@
+ .map_err(|e| io::Error::other(format!("failed to read GGUF for requantization: {e}")))?;
+ let quantized = quantize_gguf_to_target(&input_bytes, GgufQuantizationType::Q8_0)
+ .map_err(|e| io::Error::other(format!("Q8_0 requantization failed: {e}")))?;
+ std::fs::write(output, &quantized)
+ .map_err(|e| io::Error::other(format!("failed to write Q8_0 GGUF: {e}")))?;
+ eprintln!(
</file context>
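A common way to make the final cache write atomic is to write to a temporary file in the same directory and rename it into place. The sketch below assumes the temporary file and the target are on the same filesystem, where `std::fs::rename` replaces the destination in a single step on POSIX systems; `write_atomically` and the `.partial` suffix are hypothetical, and concurrent writers to the same target would need unique temporary names.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Write `bytes` to `output` via a sibling temp file plus rename, so readers
/// never observe a half-written file at the final path.
pub fn write_atomically(output: &Path, bytes: &[u8]) -> io::Result<()> {
    let mut tmp_name = output.as_os_str().to_owned();
    tmp_name.push(".partial");
    let tmp_path = PathBuf::from(tmp_name);

    let result = (|| -> io::Result<()> {
        let mut file = fs::File::create(&tmp_path)?;
        file.write_all(bytes)?;
        // Flush data to disk before the rename makes the file visible.
        file.sync_all()?;
        fs::rename(&tmp_path, output)
    })();

    if result.is_err() {
        // Best-effort cleanup; the original error is what the caller needs.
        let _ = fs::remove_file(&tmp_path);
    }
    result
}
```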
P1: Cache key generation is non-injective; different model specs can map to the same cache directory and return incorrect artifacts.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-cli/src/main/model_resolution.rs, line 152:
<comment>Cache key generation is non-injective; different model specs can map to the same cache directory and return incorrect artifacts.</comment>
<file context>
@@ -0,0 +1,324 @@
+
+pub(super) fn cache_safe_name(spec: &str) -> String {
+ spec.chars()
+ .map(|ch| if ch.is_ascii_alphanumeric() { ch } else { '-' })
+ .collect()
+}
</file context>
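With the mapping above, `org/model` and `org-model` both become `org-model`. One possible injective alternative is to escape every non-alphanumeric byte as `-XX` (uppercase hex), including `-` itself, so the encoding can be decoded unambiguously. `escaped_cache_name` is a hypothetical replacement, and on case-insensitive filesystems names that differ only in letter case would still collide.

```rust
/// Escape each byte that is not ASCII alphanumeric as `-XX` (uppercase hex).
/// Because `-` is escaped too, distinct specs always produce distinct names.
pub fn escaped_cache_name(spec: &str) -> String {
    let mut out = String::with_capacity(spec.len());
    for byte in spec.bytes() {
        if byte.is_ascii_alphanumeric() {
            out.push(byte as char);
        } else {
            out.push_str(&format!("-{byte:02X}"));
        }
    }
    out
}
```

A shorter alternative would be to keep the readable name and append a hash of the original spec, which makes collisions very unlikely but is not strictly injective.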
P2: Mesh chat accepts input before leader election can complete. First prompts can time out with no streamed tokens.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-cli/src/main/conversation.rs, line 231:
<comment>Mesh chat accepts input before leader election can complete. First prompts can time out with no streamed tokens.</comment>
<file context>
@@ -0,0 +1,358 @@
+
+ // Give the mesh node a moment to start up and discover peers.
+ rt.block_on(async {
+ tokio::time::sleep(Duration::from_secs(2)).await;
+ });
+
</file context>
P3: Volatile Cursor hook state was committed in this PR. This is unrelated runtime-local metadata and should be excluded from versioned changes.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .cursor/hooks/state/continual-learning.json, line 4:
<comment>Volatile Cursor hook state was committed in this PR. This is unrelated runtime-local metadata and should be excluded from versioned changes.</comment>
<file context>
@@ -1,8 +1,8 @@
"version": 1,
"lastRunAtMs": 1781685502133,
- "turnsSinceLastRun": 1,
+ "turnsSinceLastRun": 4,
"lastTranscriptMtimeMs": 1781685501947.5315,
- "lastProcessedGenerationId": "f1a2db2c-d576-4862-9869-f0392e82e294",
</file context>
P3: This introduces duplicate NVFP4 decode logic that already exists in tensor/kernels/q_kernels.rs. Keeping two copies risks silent drift when one implementation changes.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-core/src/compute/quantization/quant_nvfp4.rs, line 14:
<comment>This introduces duplicate NVFP4 decode logic that already exists in `tensor/kernels/q_kernels.rs`. Keeping two copies risks silent drift when one implementation changes.</comment>
<file context>
@@ -0,0 +1,41 @@
+ }
+}
+
+pub fn dequantize_nvfp4_scalar(input: &[u8], output: &mut [f32]) -> Result<(), QuantizationError> {
+ validate_layout(
+ GgufQuantizationType::NVFP4,
</file context>
P3: This adds a near-duplicate evidence log instead of extending existing evidence artifacts. Duplicate files increase repo noise and drift risk.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .omo/ulw-loop/evidence/oxidize-server-log-listener.txt, line 1:
<comment>This adds a near-duplicate evidence log instead of extending existing evidence artifacts. Duplicate files increase repo noise and drift risk.</comment>
<file context>
@@ -0,0 +1,23 @@
+backend: cpu (CPU)
+api server: starting in background at http://0.0.0.0:18082 (REST /v1/*, WebSocket /v1/realtime)
+load progress: 0% stage=starting bytes=0/7381381664
</file context>
P3: This introduces substantial duplicate CUDA launch logic already implemented in gpu_native_forward.rs. Keeping both paths in sync increases drift risk for kernel signatures and caching behavior.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-core/src/backends/cuda/gpu_kernels.rs, line 70:
<comment>This introduces substantial duplicate CUDA launch logic already implemented in `gpu_native_forward.rs`. Keeping both paths in sync increases drift risk for kernel signatures and caching behavior.</comment>
<file context>
@@ -0,0 +1,231 @@
+/// cached in `resident_f32` (mmap-stable pointer identity, same as ordinary
+/// weight matrices).
+#[cfg(feature = "cuda")]
+pub fn gpu_rms_norm(weight: &[f32], eps: f32) -> Result<(), String> {
+ with_gpu(|gpu| {
+ let ab = gpu
</file context>
Summary by cubic
Whole-engine perf sweep with AVX-512/AVX2 upgrades, CUDA GPU-native forward pass and GPU `lm_head`, safer I/O, and per-token OXK benchmarks. Adds DPO/QLoRA/RLHF tools in `oxidize-finetuning`, a CPU video training/generator pipeline in `oxidize-train`, importance-weighted quantization, and a modular CLI.
Performance
- `lm_head` GEMV; keeps activations on device and auto-detects quantization.
- `compute_75` PTX plus native cubins (SM80/89/90/120) when available to cut JIT time.
- `mmap` and volatile reads; removed unsafe F16 payload writes; `codegen-units = 1`.
New Features
- `oxidize-finetuning`: DPO trainer, QLoRA (NF4), PPO/RLHF scaffolding, adapter merge (Linear/SLERP/TIES), telemetry; CLI split into SFT/DPO/PPO/Merge.
- `oxidize-train`: CPU video pipeline with ffmpeg frame extraction, clip classifier, and autoregressive generator (new subcommands).
- `bench_oxk.py`; `oxidize-quantize` importance-weighted quantization.
- `oxidize-core`/`oxidize-cli`: KV cache rewritten into eviction/storage modules with tests; CLI modularized (server/generation/model resolution/GPU cluster); added release workflow and CODEOWNERS.
Written for commit 8708715. Summary will update on new commits.