feat(model): Gemma 2/3/4 architecture support (Rust + Go + Python) #9

Merged: Jackson57279 merged 4 commits into master from feat/gemma-3-4-support on Jun 7, 2026
@Jackson57279 commented Jun 6, 2026

Summary

Adds full Gemma-family inference support across the Rust core and the Go and Python ports. All Gemma behavior is gated behind the Gemma architecture, so other models are unaffected.

What proper Gemma support required (none of this existed before — without it, output is garbage):

  • Dual RoPE theta — local sliding-window layers use rope_theta_swa (10000), global layers use rope_theta (1e6)
  • Interleaved local/global attention — every Nth layer is global (gemma2=2, gemma3/4=6); sliding-window masking via slicing the last-N KV rows (absolute-pos RoPE keeps it exact)
  • Embedding scaling by sqrt(hidden_size) (input only; tied lm_head stays unscaled)
  • Sandwich normalization — post-attention and post-FFN norms before each residual (post_ffw_norm now mapped); previously post_attention_norm was wrongly used as the pre-FFN norm
  • GeGLU (tanh-GELU) activation instead of SwiGLU
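
The first three items above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function names, the 1-based "every Nth layer is global" convention, and the list-based KV cache are assumptions made for the example. Slicing the last N cached rows is a valid window because cached keys were already rotated at their absolute positions, so dropping older rows does not change the relative offsets of the rows that remain.

```python
import math


def layer_is_global(layer_idx: int, pattern: int) -> bool:
    # Assumed convention: with pattern N, layers N-1, 2N-1, ... (0-based) are global.
    return pattern > 0 and (layer_idx + 1) % pattern == 0


def layer_rope_theta(layer_idx: int, pattern: int,
                     rope_theta: float = 1e6,
                     rope_theta_swa: float = 1e4) -> float:
    # Global layers use the large base, local sliding-window layers the small one.
    return rope_theta if layer_is_global(layer_idx, pattern) else rope_theta_swa


def visible_kv(kv_rows: list, window: int, is_global: bool) -> list:
    # Local layers only attend to the most recent `window` cached rows.
    if is_global or window <= 0:
        return kv_rows
    return kv_rows[-window:]


def scale_embedding(embedding: list, hidden_size: int) -> list:
    # Input embeddings are multiplied by sqrt(hidden_size); the tied lm_head is not.
    scale = math.sqrt(hidden_size)
    return [x * scale for x in embedding]
```

With a pattern of 6, this convention marks layers 5 and 11 of the first twelve as global; with a pattern of 2, every odd-indexed layer is global.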

Non-obvious detail: Gemma's (1+weight) RMSNorm is already baked into GGUF norm weights by the converter, so standard rms-norm (plain multiply) is correct.
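
A small sketch of why the plain multiply is sufficient, assuming the converter stores `1 + weight` in place of the raw checkpoint weight. The function names and sample values are illustrative only.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    # Standard RMSNorm: normalize by the root mean square, then multiply by weight.
    inv = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv * w for v, w in zip(x, weight)]


def gemma_rms_norm(x, raw_weight, eps=1e-6):
    # Gemma's formulation on the raw checkpoint weight: multiply by (1 + weight).
    inv = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv * (1.0 + w) for v, w in zip(x, raw_weight)]


raw_weight = [0.1, -0.2, 0.0]                    # illustrative raw weights
baked_weight = [1.0 + w for w in raw_weight]     # what the converter is assumed to store
x = [1.0, 2.0, 3.0]

# Plain RMSNorm on baked weights matches the (1 + weight) form on raw weights.
assert all(
    math.isclose(a, b)
    for a, b in zip(rms_norm(x, baked_weight), gemma_rms_norm(x, raw_weight))
)
```

Applying `(1 + weight)` again at inference time on already-baked weights would double-count the offset.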

Per-language notes

  • Rust (oxidize-core): config fields + layer_is_global/layer_rope_theta/layer_sliding_window helpers; sandwich norms + GeGLU + windowing in both forward_single and forward_batched.
  • Go (oxidize-golang): UsesParallelAttnFFN now Phi-only (Gemma is sequential, not parallel — was a latent bug); decode + batch use the existing windowed flash kernel.
  • Python (oxidize-python): mirrors the above; also fixes pre-existing forward_batch write-back bugs (list slices are copies, so RoPE / attention output / RMSNorms were silently discarded).
  • Mistral's uniform sliding window is preserved in Go/Python (window>0 & pattern==0 ⇒ local).
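
The Python write-back bug described above comes from the fact that slicing a Python list creates a new list. A minimal reproduction, with hypothetical function names that do not come from the PR:

```python
def apply_scale(rows, factor):
    # Mutates `rows` in place.
    for i in range(len(rows)):
        rows[i] *= factor


def buggy_update(hidden, factor):
    # hidden[0:2] is a fresh copy, so the in-place result is thrown away.
    apply_scale(hidden[0:2], factor)


def fixed_update(hidden, factor):
    # Work on the copy, then assign it back into the original list.
    chunk = hidden[0:2]
    apply_scale(chunk, factor)
    hidden[0:2] = chunk
```

The same pattern applies to any in-place kernel (RoPE, attention output, normalization) that is handed a list slice. NumPy basic slices are views and do not have this problem, but plain lists do.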

Tests

  • New Gemma/Mistral layer-pattern unit tests in all three languages.
  • Rust core, Go ./core/..., and the full Python suite pass, apart from pre-existing unrelated failures: TestManualQAScriptShape and two dirty-tree Rust tests touching head_scratch/sampling.

Deferred

Per-local-layer KV ring buffer (memory optimization), Gemma 3 long-context RoPE linear scaling, Gemma 2 logit softcapping.

⚠️ Scope note

This branch also bundles unrelated in-progress work that was already present in the working tree (a Rust↔Go/Python FFI bridge: oxidize-ffi/, generation/dflash changes, ffi.py, rust_model.go, Cargo.*), committed at the author's request to capture everything in the tree. The Gemma changes are the model/config/forward files listed above.

🤖 Generated with Claude Code


Summary by cubic

Adds full Gemma 2/3/4 inference across oxidize-core, oxidize-golang, and oxidize-python, plus a Rust FFI fast path for GEMV and end-to-end forward used by Go/Python benches and CLI. Also improves generation flow and performance (AVX2 prefetch, NumPy paths) without affecting other architectures.

  • New Features

    • Correct Gemma behavior (dual RoPE bases, interleaved global layers gemma2=2 / gemma3/4=6 via KV slicing, input embedding scale sqrt(hidden_size), sandwich norms, GeGLU), gated per-architecture in Rust/Go/Python forward.
    • New oxidize-ffi crate exposing quantized GEMV and full model load/forward; Go cgo and Python ctypes bindings; benches/CLI prefer the Rust fast path when present.
    • Performance: AVX2 L1 prefetch in Rust GEMV; NumPy vectorization for Python GEMV/flash-attn/RoPE; Go worker sizing respects GOMAXPROCS.
    • Generation/bench UX: Go/Python streaming refactor with explicit prefill/decode states and better stop handling; CLI/bench add --max-context to clamp context; integration tests target models/Qwen3-4B-Q4_K_M.gguf.
  • Bug Fixes

    • Rust: fixed SWA gating when no global pattern; logits buffer validation in FFI forward; added a uniform SWA test.
    • Go: Rust GEMV path gated behind cgo,rust_ffi and propagates errors; skip RoPE under ALiBi; legacy run restores 128-token default.
    • Python: deterministic RustModel.close(), safe __del__, sanitized LD_LIBRARY_PATH; fixed suppress_tokens bounds and a generation off-by-one; GGUF token-embedding dim fallback.

Written for commit 218fc86. Summary will update on new commits.

Implements full Gemma-family inference across the Rust core and the Go and
Python ports, gated behind the Gemma architecture so other models are
unaffected:

- Dual RoPE theta: local sliding-window layers use rope_theta_swa (10000),
  global layers use rope_theta (1e6)
- Interleaved local/global attention: every Nth layer is global
  (gemma2=2, gemma3/4=6); sliding-window masking via last-N KV slicing
- Embedding scaling by sqrt(hidden_size)
- Sandwich normalization: post-attention and post-FFN norms before each
  residual (post_ffw_norm mapped); stops post_attention_norm being misused
  as the pre-FFN norm
- GeGLU (tanh-GELU) activation instead of SwiGLU
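
As an illustration of the last two items, here is a hedged sketch of the sandwich-norm ordering and the GeGLU activation. The norm and sublayer arguments are passed in as callables, and their parameter names are assumptions for the example rather than the project's API.

```python
import math


def gelu_tanh(x: float) -> float:
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def geglu(gate: list, up: list) -> list:
    # GeGLU: GELU on the gate projection, elementwise product with the up projection.
    return [gelu_tanh(g) * u for g, u in zip(gate, up)]


def add(a: list, b: list) -> list:
    return [p + q for p, q in zip(a, b)]


def sandwich_layer(x, attn, ffn, attn_norm, post_attention_norm, ffn_norm, post_ffw_norm):
    # Each sublayer is normalized on the way in and again on the way out,
    # and only the post-normed output is added to the residual stream.
    h = post_attention_norm(attn(attn_norm(x)))
    x = add(x, h)
    h = post_ffw_norm(ffn(ffn_norm(x)))
    return add(x, h)
```

SwiGLU differs only in the gate nonlinearity (SiLU instead of GELU), which is why a mix-up between the two still runs but yields wrong activations.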

Go: UsesParallelAttnFFN now Phi-only (Gemma is sequential, not parallel).
Python: also fixes pre-existing forward_batch write-back bugs (list slices
are copies, so RoPE / attention output / norms were silently discarded).

Adds Gemma/Mistral layer-pattern unit tests in all three languages.

Note: this commit also bundles unrelated in-progress work present in the
working tree (Rust<->Go/Python FFI bridge: oxidize-ffi/, generation/dflash
changes, ffi.py, rust_model.go) per request to commit everything.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread oxidize-python/oxidize_python/core/ffi.py Fixed
Comment thread oxidize-python/oxidize_python/core/ffi.py Fixed

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a97203c1a5

Comment thread oxidize-golang/core/quantization/gemv_rust.go
Comment thread oxidize-golang/core/model/gguf_decoder_load.go

@cubic-dev-ai (bot) left a comment

24 issues found across 49 files

Comment thread oxidize-python/oxidize_python/core/ffi.py
Comment thread oxidize-golang/core/model/llama_decoder_forward.go Outdated
Comment thread oxidize-golang/internal/cli/cli.go
Comment thread oxidize-golang/core/model/gguf_decoder_load.go
Comment thread oxidize-golang/core/model/inference.go
Comment thread oxidize-golang/core/quantization/gemv_rust.go
Comment thread oxidize-python/oxidize_python/core/ffi.py Outdated
Comment thread oxidize-python/oxidize_python/core/ffi.py
Comment thread oxidize-python/oxidize_python/core/tensor/gemv.py
Fixes identified by cubic AI review and GitHub CI failures:

Rust:
- Fix sliding-window attention for Mistral/Qwen in layer_is_global
  (separate sliding_window == 0 from sliding_window_pattern == 0)
- Fix oxidize_model_forward to reject logits buffer size mismatches
- Add test for uniform SWA models
- Run cargo fmt --all across workspace
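
The sliding-window gating fix in the Rust list above can be pictured as a three-way decision. The sketch below is written in Python for consistency with the other examples and is an assumption about the intended logic, not a transcription of the Rust code. In particular, the 1-based "every Nth layer is global" convention is assumed.

```python
def layer_is_local(layer_idx: int, sliding_window: int, sliding_window_pattern: int) -> bool:
    if sliding_window == 0:
        # No sliding window configured: every layer attends to the full context.
        return False
    if sliding_window_pattern == 0:
        # Window set but no global pattern: uniform SWA, every layer is windowed.
        return True
    # Interleaved case: every `sliding_window_pattern`-th layer is global.
    return (layer_idx + 1) % sliding_window_pattern != 0
```

Treating the two zero checks as one condition would make a uniform-SWA model either fully global or fully local by accident, which is the kind of case a uniform SWA test is meant to catch.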

Go:
- Fix GemvRust error handling in dflash_weights (don't swallow errors)
- Fix llama_decoder_forward to skip RoPE when ALiBi is enabled
- Fix bench.go to use actual prompt tokens instead of hardcoded BOS
- Fix bench.go zero-guard for average tok/s calculation
- Fix decoder_ffn to reuse normed buffer instead of allocating
- Fix parallelWorkers to use runtime.GOMAXPROCS(0)

Python:
- Fix generation.py suppress_tokens lower bounds check
- Fix cli.py off-by-one token generation loop
- Fix inference_config.py token-embedding dimension fallback
- Fix ruff errors: E501 (line too long), I001 (import sorting),
  F401 (unused imports), E741 (ambiguous variable names)
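
The suppress_tokens lower-bound fix in the Python list above matters because a negative index does not raise in Python; it wraps around and silently masks a token at the end of the vocabulary. A minimal sketch, with a hypothetical function name and signature:

```python
def apply_suppress_tokens(logits: list, suppress_tokens: list) -> list:
    """Set suppressed token logits to -inf, ignoring ids outside [0, vocab)."""
    vocab = len(logits)
    for token_id in suppress_tokens:
        # Checking only `token_id < vocab` would let -1 mask the last token.
        if 0 <= token_id < vocab:
            logits[token_id] = float("-inf")
    return logits
```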

Tests:
- Fix oxidize-cli test expecting wrong binary name in help output
Comment thread oxidize-python/oxidize_python/core/ffi.py Fixed
Comment thread oxidize-python/oxidize_python/core/ffi.py Fixed
- Python: Added deterministic RustModel.close(), safe __del__, and LD_LIBRARY_PATH sanitization.
- Go: Gated CGO GemvRust behind build tags (cgo,rust_ffi).
- Go: Restored 128-token MaxNewTokens default in CLI legacy run.
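
A deterministic `close()` with a defensive `__del__` usually follows the pattern below. This is a generic sketch with a hypothetical `NativeHandle` class and an injected `free_fn`; the real `RustModel` wraps a ctypes binding whose details are not shown in this PR text.

```python
class NativeHandle:
    """Owns a native resource and releases it exactly once."""

    def __init__(self, handle, free_fn):
        self._free_fn = free_fn
        self._handle = handle

    def close(self):
        # Idempotent: clear the handle first so a second call is a no-op.
        handle = getattr(self, "_handle", None)
        self._handle = None
        if handle is not None:
            self._free_fn(handle)

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()

    def __del__(self):
        # __del__ may run on a half-constructed object or during interpreter
        # shutdown, so it must never raise.
        try:
            self.close()
        except Exception:
            pass
```

Callers that want predictable release can use `with NativeHandle(...) as h:` or call `close()` explicitly, leaving `__del__` as a fallback only.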
Comment thread oxidize-python/oxidize_python/core/ffi.py Fixed
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Jackson57279 added a commit that referenced this pull request Jun 7, 2026
Fixes unsafe_op_in_unsafe_fn errors introduced by PR #9:
- tensor.rs: swiglu_avx2, swiglu_avx2_inplace AVX2 intrinsics
- oxidize-ffi/src/lib.rs: all FFI raw pointer operations
Jackson57279 added a commit that referenced this pull request Jun 7, 2026
Adds full Gemma-family inference support across Rust core and Go/Python ports.
Includes dual RoPE theta, interleaved local/global attention, embedding scaling,
sandwich normalization, and GeGLU activation.

Also includes oxidize-ffi crate for Rust↔Go/Python fast path.
Jackson57279 added a commit that referenced this pull request Jun 7, 2026
Fixes multiple issues that were blocking CI on master after merging PR #9:

**Rust 2024 unsafe_op_in_unsafe_fn compliance:**
- Add #[allow(unsafe_op_in_unsafe_fn)] to all SIMD target_feature functions
  in tensor.rs (AVX2, AVX-512, NEON kernels)
- Wrap FFI raw pointer operations in explicit unsafe blocks in oxidize-ffi

**Dead code / unused import cleanup:**
- Remove unused imports across oxidize-server (audit, metrics, routes)
- Fix unused variables in oxidize-finetuning (lora.rs, trainer.rs)
- Allow dead_code for PrefixCache.config and DecoderState::Done
- Allow dead_code for inference_config_from_dflash in bench.rs

**Docker build fixes:**
- Add missing COPY statements for oxidize-train, oxidize-finetuning,
  oxidize-convert, oxidize-ffi workspace members in both Dockerfiles
- Update base Rust image from 1.88 to 1.95-bookworm for AVX-512 stable support

**MLX macOS Send/Sync:**
- Add unsafe impl Send/Sync for MlxComputeBackend (MLX C API is thread-safe)

**Test status:**
- Workspace compiles cleanly with RUSTFLAGS='-D warnings'
- 533 tests pass; 2 pre-existing failures in inference::tests and sampling::tests
  (documented in PR #9, not regressions)
- Docker builds (server + cli) pass successfully
@Jackson57279 merged commit af3016a into master Jun 7, 2026
12 of 17 checks passed
@Jackson57279 deleted the feat/gemma-3-4-support branch June 15, 2026 20:07