feat(model): Gemma 2/3/4 architecture support (Rust + Go + Python) by Jackson57279 · Pull Request #9 · Zapdev-labs/oxidize

Jackson57279 · 2026-06-06T15:56:31Z

Summary

Adds full Gemma-family inference support across the Rust core and the Go and Python ports. All Gemma behavior is gated behind the Gemma architecture, so other models are unaffected.

What proper Gemma support required (none of this existed before — without it, output is garbage):

- Dual RoPE theta: local sliding-window layers use rope_theta_swa (10000), global layers use rope_theta (1e6)
- Interleaved local/global attention: every Nth layer is global (gemma2=2, gemma3/4=6); sliding-window masking via slicing the last-N KV rows (absolute-position RoPE keeps it exact)
- Embedding scaling by sqrt(hidden_size) (input only; tied lm_head stays unscaled)
- Sandwich normalization: post-attention and post-FFN norms before each residual (post_ffw_norm now mapped); previously post_attention_norm was wrongly used as the pre-FFN norm
- GeGLU (tanh-GELU) activation instead of SwiGLU
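The layer pattern, dual RoPE theta, and last-N KV slicing items above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function names are hypothetical, and the exact indexing of "every Nth layer" (here, layers N-1, 2N-1, ... are global) is an assumption.

```python
ROPE_THETA = 1_000_000.0   # global layers (value quoted in the PR description)
ROPE_THETA_SWA = 10_000.0  # local sliding-window layers (value quoted in the PR description)


def is_global_layer(layer: int, pattern: int) -> bool:
    # Assumed indexing: with pattern N, layers N-1, 2N-1, ... are global.
    return (layer + 1) % pattern == 0


def rope_theta_for(layer: int, pattern: int) -> float:
    # Global layers use the large base, local layers the small one.
    return ROPE_THETA if is_global_layer(layer, pattern) else ROPE_THETA_SWA


def visible_kv(layer: int, pattern: int, window: int, kv_rows: list) -> list:
    # Global layers see the whole cache; local layers see only the last
    # `window` rows (window is assumed to be > 0 here).
    if is_global_layer(layer, pattern):
        return kv_rows
    return kv_rows[-window:]
```

With a pattern of 6, this sketch marks layers 5 and 11 of the first twelve as global; a Python slice with a window longer than the cache simply returns the whole cache.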

Non-obvious detail: Gemma's (1+weight) RMSNorm is already baked into GGUF norm weights by the converter, so standard rms-norm (plain multiply) is correct.
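To illustrate why a plain multiply is enough once the offset is baked in, here is a small sketch. The weight values are made up, and the two function names are hypothetical; the only claim taken from the description above is that the converter stores `1 + weight`.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Standard RMSNorm: divide by the root-mean-square, then plain multiply."""
    inv = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv * w for v, w in zip(x, weight)]


def rms_norm_one_plus_weight(x, weight, eps=1e-6):
    """RMSNorm that scales by (1 + weight), as applied to raw checkpoint weights."""
    inv = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv * (1.0 + w) for v, w in zip(x, weight)]


raw = [0.25, -0.5, 0.0]          # hypothetical raw checkpoint norm weights
baked = [1.0 + w for w in raw]   # what the converter is described as storing
x = [1.0, 2.0, 3.0]

# Plain multiply on baked weights matches (1 + weight) on raw weights.
same = all(
    math.isclose(a, b)
    for a, b in zip(rms_norm(x, baked), rms_norm_one_plus_weight(x, raw))
)
```

Applying the `(1 + weight)` form to already-baked weights would add the offset twice, which is the mistake this detail guards against.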

Per-language notes

- Rust (oxidize-core): config fields + layer_is_global/layer_rope_theta/layer_sliding_window helpers; sandwich norms + GeGLU + windowing in both forward_single and forward_batched.
- Go (oxidize-golang): UsesParallelAttnFFN is now Phi-only, fixing a latent bug (Gemma is sequential, not parallel); decode + batch use the existing windowed flash kernel.
- Python (oxidize-python): mirrors the above; also fixes pre-existing forward_batch write-back bugs (list slices are copies, so RoPE / attention output / RMSNorms were silently discarded).
- Mistral's uniform sliding window is preserved in Go/Python (window>0 & pattern==0 ⇒ local).
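The write-back bug class mentioned for Python above comes from the fact that slicing a Python list creates a new list. The following is a generic illustration with hypothetical function names, not the actual forward_batch code:

```python
def scale_rows_buggy(buf, start, end, factor):
    chunk = buf[start:end]       # list slice: a new list, not a view into buf
    for i in range(len(chunk)):
        chunk[i] *= factor       # mutates only the copy
    # The result is never written back, so buf is unchanged.


def scale_rows_fixed(buf, start, end, factor):
    chunk = buf[start:end]
    for i in range(len(chunk)):
        chunk[i] *= factor
    buf[start:end] = chunk       # slice assignment writes the result back
```

Calling `scale_rows_buggy` leaves the caller's buffer untouched with no error raised, which is why such bugs go unnoticed. NumPy basic slices behave differently: they are views, so in-place operations on them do reach the parent array.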

Tests

- New Gemma/Mistral layer-pattern unit tests in all three languages.
- Rust core, Go ./core/..., and the full Python suite pass (pre-existing unrelated failures only: TestManualQAScriptShape, and two dirty-tree Rust tests touching head_scratch/sampling).

Deferred

Per-local-layer KV ring buffer (memory optimization), Gemma 3 long-context RoPE linear scaling, Gemma 2 logit softcapping.

⚠️ Scope note

This branch also bundles unrelated in-progress work that was already present in the working tree (a Rust↔Go/Python FFI bridge: oxidize-ffi/, generation/dflash changes, ffi.py, rust_model.go, Cargo.*), committed at the author's request to capture everything in the tree. The Gemma changes are the model/config/forward files listed above.

🤖 Generated with Claude Code

Summary by cubic

Adds full Gemma 2/3/4 inference across oxidize-core, oxidize-golang, and oxidize-python, plus a Rust FFI fast path for GEMV and end-to-end forward used by Go/Python benches and CLI. Also improves generation flow and performance (AVX2 prefetch, NumPy paths) without affecting other architectures.

New Features
- Correct Gemma behavior (dual RoPE bases, interleaved global layers gemma2=2 / gemma3/4=6 via KV slicing, input embedding scale sqrt(hidden_size), sandwich norms, GeGLU), gated per-architecture in Rust/Go/Python forward.
- New oxidize-ffi crate exposing quantized GEMV and full model load/forward; Go cgo and Python ctypes bindings; benches/CLI prefer the Rust fast path when present.
- Performance: AVX2 L1 prefetch in Rust GEMV; NumPy vectorization for Python GEMV/flash-attn/RoPE; Go worker sizing respects GOMAXPROCS.
- Generation/bench UX: Go/Python streaming refactor with explicit prefill/decode states and better stop handling; CLI/bench add --max-context to clamp context; integration tests target models/Qwen3-4B-Q4_K_M.gguf.

Bug Fixes
- Rust: fixed SWA gating when no global pattern; logits buffer validation in FFI forward; added a uniform SWA test.
- Go: Rust GEMV path gated behind cgo,rust_ffi and propagates errors; skip RoPE under ALiBi; legacy run restores 128-token default.
- Python: deterministic RustModel.close(), safe __del__, sanitized LD_LIBRARY_PATH; fixed suppress_tokens bounds and a generation off-by-one; GGUF token-embedding dim fallback.

Written for commit 218fc86. Summary will update on new commits.

Implements full Gemma-family inference across the Rust core and the Go and Python ports, gated behind the Gemma architecture so other models are unaffected:

- Dual RoPE theta: local sliding-window layers use rope_theta_swa (10000), global layers use rope_theta (1e6)
- Interleaved local/global attention: every Nth layer is global (gemma2=2, gemma3/4=6); sliding-window masking via last-N KV slicing
- Embedding scaling by sqrt(hidden_size)
- Sandwich normalization: post-attention and post-FFN norms before each residual (post_ffw_norm mapped); stops post_attention_norm being misused as the pre-FFN norm
- GeGLU (tanh-GELU) activation instead of SwiGLU

Go: UsesParallelAttnFFN now Phi-only (Gemma is sequential, not parallel).

Python: also fixes pre-existing forward_batch write-back bugs (list slices are copies, so RoPE / attention output / norms were silently discarded).

Adds Gemma/Mistral layer-pattern unit tests in all three languages.

Note: this commit also bundles unrelated in-progress work present in the working tree (Rust<->Go/Python FFI bridge: oxidize-ffi/, generation/dflash changes, ffi.py, rust_model.go) per request to commit everything.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
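The embedding scaling, sandwich normalization, and GeGLU items in the commit message above can be sketched as follows. This is a rough illustration on plain Python lists, not the repository's code: every function and parameter name is hypothetical, and the norms, attention, and FFN are passed in as callables so that only the ordering is shown.

```python
import math


def gelu_tanh(x):
    """tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def geglu(gate, up):
    """GeGLU: GELU(gate) * up, elementwise (SwiGLU would use SiLU on the gate)."""
    return [gelu_tanh(g) * u for g, u in zip(gate, up)]


def scale_embedding(embedding, hidden_size):
    """Scale the input embedding by sqrt(hidden_size)."""
    s = math.sqrt(hidden_size)
    return [v * s for v in embedding]


def sandwich_layer(x, attn, ffn, pre_attn_norm, post_attn_norm,
                   pre_ffn_norm, post_ffn_norm):
    """One decoder layer: each sub-block is normed on the way in and again
    on the way out, before its output is added to the residual."""
    h = [a + b for a, b in zip(x, post_attn_norm(attn(pre_attn_norm(x))))]
    return [a + b for a, b in zip(h, post_ffn_norm(ffn(pre_ffn_norm(h))))]
```

The point of the sketch is the four distinct norm slots: the pre-FFN norm and the post-attention norm are separate weights, so reusing one for the other changes the computation.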

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a97203c1a5


cubic-dev-ai

24 issues found across 49 files


Fixes identified by cubic AI review and GitHub CI failures:

Rust:
- Fix sliding-window attention for Mistral/Qwen in layer_is_global (separate sliding_window == 0 from sliding_window_pattern == 0)
- Fix oxidize_model_forward to reject logits buffer size mismatches
- Add test for uniform SWA models
- Run cargo fmt --all across workspace

Go:
- Fix GemvRust error handling in dflash_weights (don't swallow errors)
- Fix llama_decoder_forward to skip RoPE when ALiBi is enabled
- Fix bench.go to use actual prompt tokens instead of hardcoded BOS
- Fix bench.go zero-guard for average tok/s calculation
- Fix decoder_ffn to reuse normed buffer instead of allocating
- Fix parallelWorkers to use runtime.GOMAXPROCS(0)

Python:
- Fix generation.py suppress_tokens lower bounds check
- Fix cli.py off-by-one token generation loop
- Fix inference_config.py token-embedding dimension fallback
- Fix ruff errors: E501 (line too long), I001 (import sorting), F401 (unused imports), E741 (ambiguous variable names)

Tests:
- Fix oxidize-cli test expecting wrong binary name in help output
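The layer_is_global fix listed under Rust above separates two zero cases that are easy to conflate. A Python sketch of that three-way decision follows; the signature is hypothetical, the real helper is in Rust, and the modulo indexing for the patterned case is an assumption.

```python
def layer_is_global(layer: int, sliding_window: int, sliding_window_pattern: int) -> bool:
    if sliding_window == 0:
        # No window configured: every layer attends to the full context.
        return True
    if sliding_window_pattern == 0:
        # Window set but no pattern: uniform sliding window, every layer is local.
        return False
    # Window and pattern both set: every Nth layer is global
    # (assumed indexing: layers N-1, 2N-1, ...).
    return (layer + 1) % sliding_window_pattern == 0
```

Treating `sliding_window_pattern == 0` as "all global" would silently disable windowing for uniform-SWA models, which is the failure mode the commit message describes.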

- Python: Added deterministic RustModel.close(), safe __del__, and LD_LIBRARY_PATH sanitization.
- Go: Gated CGO GemvRust behind build tags (cgo,rust_ffi).
- Go: Restored 128-token MaxNewTokens default in CLI legacy run.

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

Fixes unsafe_op_in_unsafe_fn errors introduced by PR #9:

- tensor.rs: swiglu_avx2, swiglu_avx2_inplace AVX2 intrinsics
- oxidize-ffi/src/lib.rs: all FFI raw pointer operations

Adds full Gemma-family inference support across Rust core and Go/Python ports. Includes dual RoPE theta, interleaved local/global attention, embedding scaling, sandwich normalization, and GeGLU activation. Also includes oxidize-ffi crate for Rust↔Go/Python fast path.

Fixes multiple issues that were blocking CI on master after merging PR #9:

**Rust 2024 unsafe_op_in_unsafe_fn compliance:**
- Add #[allow(unsafe_op_in_unsafe_fn)] to all SIMD target_feature functions in tensor.rs (AVX2, AVX-512, NEON kernels)
- Wrap FFI raw pointer operations in explicit unsafe blocks in oxidize-ffi

**Dead code / unused import cleanup:**
- Remove unused imports across oxidize-server (audit, metrics, routes)
- Fix unused variables in oxidize-finetuning (lora.rs, trainer.rs)
- Allow dead_code for PrefixCache.config and DecoderState::Done
- Allow dead_code for inference_config_from_dflash in bench.rs

**Docker build fixes:**
- Add missing COPY statements for oxidize-train, oxidize-finetuning, oxidize-convert, oxidize-ffi workspace members in both Dockerfiles
- Update base Rust image from 1.88 to 1.95-bookworm for AVX-512 stable support

**MLX macOS Send/Sync:**
- Add unsafe impl Send/Sync for MlxComputeBackend (MLX C API is thread-safe)

**Test status:**
- Workspace compiles cleanly with RUSTFLAGS='-D warnings'
- 533 tests pass; 2 pre-existing failures in inference::tests and sampling::tests (documented in PR #9, not regressions)
- Docker builds (server + cli) pass successfully

github-code-quality Bot found potential problems Jun 6, 2026

Comment thread oxidize-python/oxidize_python/core/ffi.py Fixed

Comment thread oxidize-python/oxidize_python/core/ffi.py Fixed

chatgpt-codex-connector Bot reviewed Jun 6, 2026

Comment thread oxidize-golang/core/quantization/gemv_rust.go

Comment thread oxidize-golang/core/model/gguf_decoder_load.go

cubic-dev-ai Bot reviewed Jun 6, 2026

github-code-quality Bot found potential problems Jun 6, 2026

Comment thread oxidize-python/oxidize_python/core/ffi.py Fixed

Comment thread oxidize-python/oxidize_python/core/ffi.py Fixed

github-code-quality Bot found potential problems Jun 6, 2026

Comment thread oxidize-python/oxidize_python/core/ffi.py Fixed

Potential fix for pull request finding 'Empty except'

218fc86

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

Jackson57279 merged commit af3016a into master Jun 7, 2026
12 of 17 checks passed

Jackson57279 deleted the feat/gemma-3-4-support branch June 15, 2026 20:07

Jackson57279 merged 4 commits into master from feat/gemma-3-4-support
