
Improve RHT + LoRA decode #360

Open
CC-Yeh wants to merge 4 commits into main from improve_QLoRA_rebased

Conversation

Contributor

CC-Yeh commented Apr 21, 2026

3.3% faster decode on LFM2.5-1.2B-RHT-QLoRA

  • Offline math trick: A_down' = A_down · H at load time (see the sketch below)
  • Fused A_down' into RMSNorm kernel (decode)
  • Fused A_up SG0-tail into QmvFast kernel (decode)
  • Per-rank dispatch (fused at r=16, unfused fallback otherwise)
  • One-line prefill recovery
  • LORA_RANK plumbed as kernel VARIANT
  • CPU LoRA reference + cross-backend tests
  • Shared adapter_up buffer (−22 MB)
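
The "offline math trick" in the first bullet, as a minimal standalone sketch (plain Rust with a made-up 4-point transform, rank-2 adapter, and sign vector; not the repo's code): precomposing A_down' = A_down · H_rht at load time makes A_down' · x equal the original A_down · H_rht(x), so decode no longer has to rotate the input before applying the adapter.

// Hypothetical toy example; H_rht = H · diag(s), i.e. signs applied first, then the butterfly.
fn matvec(m: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum())
        .collect()
}

fn matmul(a: &[Vec<f32>], b: &[Vec<f32>]) -> Vec<Vec<f32>> {
    (0..a.len())
        .map(|i| {
            (0..b[0].len())
                .map(|j| (0..b.len()).map(|k| a[i][k] * b[k][j]).sum())
                .collect()
        })
        .collect()
}

fn main() {
    // 4-point Hadamard matrix H (unnormalized) and a made-up sign vector s.
    let h = vec![
        vec![1.0, 1.0, 1.0, 1.0],
        vec![1.0, -1.0, 1.0, -1.0],
        vec![1.0, 1.0, -1.0, -1.0],
        vec![1.0, -1.0, -1.0, 1.0f32],
    ];
    let s = [1.0, -1.0, -1.0, 1.0f32];
    let h_rht: Vec<Vec<f32>> = h
        .iter()
        .map(|row| row.iter().zip(&s).map(|(a, b)| a * b).collect())
        .collect();

    // Hypothetical rank-2 adapter A_down (2 x 4) and an input x.
    let a_down = vec![vec![0.3, -1.2, 0.5, 0.7], vec![-0.8, 0.1, 1.4, -0.6f32]];
    let x = [0.9, -0.4, 0.2, 1.1f32];

    // Offline (load time): A_down' = A_down · H_rht.
    let a_down_prime = matmul(&a_down, &h_rht);

    // Online (decode): A_down' · x must equal A_down · H_rht(x).
    let fused = matvec(&a_down_prime, &x);
    let reference = matvec(&a_down, &matvec(&h_rht, &x));
    for (f, r) in fused.iter().zip(&reference) {
        assert!((f - r).abs() < 1e-4, "fused {f} vs reference {r}");
    }
}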

CC-Yeh force-pushed the improve_QLoRA_rebased branch from 339ae1c to a855456 on April 21, 2026 16:08
CC-Yeh requested review from eugenebokhan and uuuvn on April 21, 2026 16:09

chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a855456cd4


Comment on lines +80 to +83
uint tid = simd_group * 32 + simd_lane;
if (tid < LORA_RANK) {
    h_lora[tid] = static_cast<float>(h_input[tid]);
}


P1: Offset LoRA intermediate by batch in Metal QmvFast

The fused LoRA path loads h_input into threadgroup memory without applying a batch offset, so every batch_idx > 0 reuses batch 0’s h vector. This produces incorrect LoRA deltas whenever fused QmvFast is used with batch_size > 1 (for example small prefill batches that stay on the matrix-vector path), causing wrong outputs for all nonzero batches.
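
A toy illustration of the requested fix (plain Rust with made-up sizes; not the Metal kernel): h is laid out as [batch, LORA_RANK] in one flat buffer, so the fused load needs a batch_idx * LORA_RANK offset, otherwise every batch reads batch 0's vector.

// Hypothetical sketch of the indexing only.
const LORA_RANK: usize = 4;

fn load_h(h_input: &[f32], batch_idx: usize, apply_offset: bool) -> &[f32] {
    // Without the offset (current fused path), base is always 0.
    let base = if apply_offset { batch_idx * LORA_RANK } else { 0 };
    &h_input[base..base + LORA_RANK]
}

fn main() {
    // Two batches with different intermediate vectors h.
    let h_input = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0f32];
    // Missing offset: batch 1 silently reuses batch 0's h.
    assert_eq!(load_h(&h_input, 1, false), &h_input[..LORA_RANK]);
    // With the offset: batch 1 reads its own h.
    assert_eq!(load_h(&h_input, 1, true), &h_input[LORA_RANK..]);
}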


Comment on lines +67 to 70
let (down_projection, down_input_hadamard_factors, _) = <dyn Linear<B>>::new_extracting_input_fusions(
    &dense_config.linear_config,
    false,
    hidden_dimension,


P1: Avoid fusing A_down for MLP down projection

This call switches the MLP down projection to new_extracting_input_fusions but discards the returned LoRA fusion payload, while new_extracting_input_fusions still enables rms_norm_fuses_a_down for RHT+QLoRA linears. In that mode, QLoRALinearWrapper::encode skips computing adapter_down and reads state.common_aux.lora_intermediate instead; but the down projection has no preceding RMSNorm fusion site to populate h from MlpHidden, so it can consume stale h and apply an incorrect LoRA update.


Comment on lines +49 to +51
pub h_buffer: Option<&'a B::Buffer>,
pub adapter_up: Option<&'a B::Buffer>,
pub lora_scale: f32,
Contributor


Are these always used together / not used together? If so, they should be another struct, with one Option of that struct. As many invariants as possible (like h_buffer.is_some() == adapter_up.is_some(), or lora_scale not being needed or used when LoRA isn't enabled) should be expressed via the type system (e.g. one top-level Option wrapping an inner struct).
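
A minimal sketch of that suggestion (hypothetical struct and field grouping; the stand-in Backend trait is only there so the snippet compiles on its own):

// If the three values only ever appear together, one Option of a struct makes
// the invariant "all present or all absent" hold by construction.
pub trait Backend {
    type Buffer;
}

pub struct LoraFusionArgs<'a, B: Backend> {
    pub h_buffer: &'a B::Buffer,
    pub adapter_up: &'a B::Buffer,
    pub lora_scale: f32,
}

pub struct KernelArgs<'a, B: Backend> {
    // ... other fields ...
    // None = no fused LoRA; Some = h_buffer, adapter_up, and lora_scale are all present and valid.
    pub lora_fusion: Option<LoraFusionArgs<'a, B>>,
}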

input_array_id: ArrayId,
output_array_id: ArrayId,
-) -> Result<(Box<dyn Linear<B>>, Option<B::Buffer>), LinearBlockError<B>> {
+) -> Result<(Box<dyn Linear<B>>, Option<B::Buffer>, Option<LoraFusion<B>>), LinearBlockError<B>> {
Contributor


Big scary tuple; let's make extracted fusions a struct. The struct can have nice helpers for extracting fusions and erroring if something was silently dropped. We could do the same for declaring fusion capabilities (a struct of bools describing which extracted fusions we can handle).
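
Rough sketch of what that could look like (names are hypothetical apart from LoraFusion and rms_norm_fuses_a_down, which appear in this PR; the stand-in Backend trait and the LoraFusion field are made up so the snippet compiles on its own):

pub trait Backend {
    type Buffer;
}

// Assumed shape of the extracted LoRA fusion payload.
pub struct LoraFusion<B: Backend> {
    pub rotated_adapter_down: B::Buffer,
}

// What a linear hands back instead of a growing tuple.
pub struct ExtractedFusions<B: Backend> {
    pub input_hadamard_factors: Option<B::Buffer>,
    pub lora: Option<LoraFusion<B>>,
}

// What a caller declares it can consume (hypothetical flags).
#[derive(Default)]
pub struct FusionCapabilities {
    pub rms_norm_fuses_a_down: bool,
    pub qmv_fuses_a_up_tail: bool,
}

impl<B: Backend> ExtractedFusions<B> {
    /// Take one fusion out; the caller decides how to wire it up.
    pub fn take_lora(&mut self) -> Option<LoraFusion<B>> {
        self.lora.take()
    }

    /// Call after all supported fusions were taken, so nothing is silently dropped.
    pub fn finish(self) -> Result<(), &'static str> {
        if self.input_hadamard_factors.is_some() || self.lora.is_some() {
            Err("extracted fusion was dropped without being consumed")
        } else {
            Ok(())
        }
    }
}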

Comment on lines 27 to -60
@@ -46,34 +63,13 @@ pub struct QLoRALinearWrapper<B: Backend> {
base_linear: QuantizedLinear<B>,
adapter_kernel: RefCell<<B::Kernels as ManualKernels>::MatmulKernel>,
adapter_down: B::Buffer,
adapter_up: B::Buffer,
input_dim: usize,
output_dim: usize,
lora_rank: usize,
lora_scale: f32,
input_array_id: ArrayId,
output_array_id: ArrayId,
}

// TODO: figure out how to make this generic over QLoRAWrapperError::InvalidTensor or make one global "Invalid Tensor" error and make this a common helper
fn validate_tensor<'file, 'context, 'leaf, B: Backend>(
weights_leaf: &ParameterLeaf<'file, 'context, 'leaf, B::Context>,
Contributor


The move ate the todo


ry2009 commented Apr 22, 2026

Noticed a potential issue with the sign/Hadamard ordering in compose_rotated_adapter_down.

The precomposition needs A_down' @ x == A_down @ H_rht(x), which requires applying H_rht^T to the rows of A_down, not H_rht.

Since H_rht = H @ diag(s) (signs first, then butterfly -- matching simdgroup_random_hadamard_transform), the transpose is H_rht^T = diag(s) @ H (butterfly first, then signs).

But compose_rotated_adapter_down calls hadamard_kernel.encode() which applies the standard H @ diag(s) (signs first)... this gives the wrong result.

Quick repro (JAX, same math):

  • Apply H @ diag(s) to rows -> max error vs ground truth: 112.9
  • Apply diag(s) @ H to rows -> max error vs ground truth: 0.033 (f32 noise)

Fix: apply Hadamard butterfly to A_down rows first, then multiply by signs -- instead of the current order... lmk if this is intended though
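
The same check, re-expressed as a tiny standalone Rust sketch (4-point unnormalized transform, made-up adapter row and input; not the repo's kernel code):

// H_rht(x) = butterfly(s ⊙ x); the precomposed row must therefore be
// diag(s) · H · row, i.e. butterfly first, then signs.
fn butterfly4(v: &mut [f32; 4]) {
    let (a, b, c, d) = (v[0], v[1], v[2], v[3]);
    *v = [a + b + c + d, a - b + c - d, a + b - c - d, a - b - c + d];
}

fn main() {
    let s = [1.0, -1.0, -1.0, 1.0f32];
    let a_down_row = [0.3, -1.2, 0.5, 0.7f32];
    let x = [0.9, -0.4, 0.2, 1.1f32];

    // Ground truth: A_down row dotted with H_rht(x).
    let mut hx = [x[0] * s[0], x[1] * s[1], x[2] * s[2], x[3] * s[3]];
    butterfly4(&mut hx);
    let reference: f32 = a_down_row.iter().zip(&hx).map(|(a, b)| a * b).sum();

    // Current kernel order on the row (H · diag(s)): signs first, then butterfly.
    let mut wrong = [a_down_row[0] * s[0], a_down_row[1] * s[1], a_down_row[2] * s[2], a_down_row[3] * s[3]];
    butterfly4(&mut wrong);
    let wrong_dot: f32 = wrong.iter().zip(&x).map(|(a, b)| a * b).sum();

    // Proposed order on the row (diag(s) · H): butterfly first, then signs.
    let mut right = a_down_row;
    butterfly4(&mut right);
    for i in 0..4 {
        right[i] *= s[i];
    }
    let right_dot: f32 = right.iter().zip(&x).map(|(a, b)| a * b).sum();

    println!("reference {reference}, signs-then-butterfly {wrong_dot}, butterfly-then-signs {right_dot}");
    assert!((right_dot - reference).abs() < 1e-4);
    assert!((wrong_dot - reference).abs() > 1e-2);
}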

Contributor Author

CC-Yeh commented Apr 22, 2026

Noticed a potential issue with the sign/Hadamard ordering in compose_rotated_adapter_down.

The precomposition needs A_down' @ x == A_down @ H_rht(x), which requires applying H_rht^T to the rows of A_down, not H_rht.

Since H_rht = H @ diag(s) (signs first, then butterfly -- matching simdgroup_random_hadamard_transform), the transpose is H_rht^T = diag(s) @ H (butterfly first, then signs).

But compose_rotated_adapter_down calls hadamard_kernel.encode() which applies the standard H @ diag(s) (signs first)... this gives the wrong result.

Quick repro (JAX, same math):

  • Apply H @ diag(s) to rows -> max error vs ground truth: 112.9
  • Apply diag(s) @ H to rows -> max error vs ground truth: 0.033 (f32 noise)

Fix: apply Hadamard butterfly to A_down rows first, then multiply by signs -- instead of the current order... lmk if this is intended though

Not intended at all, thanks for catching that!

Contributor

uuuvn commented May 8, 2026

@CC-Yeh what's the status of this pr? We definitely want it after the review fixes

Contributor Author

CC-Yeh commented May 8, 2026

@CC-Yeh what's the status of this pr? We definitely want it after the review fixes

I forgot to compute A_down for the layers that can't be fused (no preceding RMSNorm). After patching that, performance is the same as main; still trying to figure out a way to speed this up. Will spend 1-2 more days on this.

