feat(model): Gemma 2/3/4 architecture support (Rust + Go + Python) #9
Merged
Implements full Gemma-family inference across the Rust core and the Go and Python ports, gated behind the Gemma architecture so other models are unaffected:

- Dual RoPE theta: local sliding-window layers use rope_theta_swa (10000), global layers use rope_theta (1e6)
- Interleaved local/global attention: every Nth layer is global (gemma2=2, gemma3/4=6); sliding-window masking via last-N KV slicing
- Embedding scaling by sqrt(hidden_size)
- Sandwich normalization: post-attention and post-FFN norms before each residual (post_ffw_norm mapped); stops post_attention_norm being misused as the pre-FFN norm
- GeGLU (tanh-GELU) activation instead of SwiGLU

Go: UsesParallelAttnFFN now Phi-only (Gemma is sequential, not parallel).

Python: also fixes pre-existing forward_batch write-back bugs (list slices are copies, so RoPE / attention output / norms were silently discarded).

Adds Gemma/Mistral layer-pattern unit tests in all three languages.

Note: this commit also bundles unrelated in-progress work present in the working tree (Rust<->Go/Python FFI bridge: oxidize-ffi/, generation/dflash changes, ffi.py, rust_model.go) per request to commit everything.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
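The `forward_batch` write-back bug mentioned above follows from a general Python property: slicing a list creates a new list, so in-place work on the slice never reaches the original buffer. A minimal sketch of the failure and the fix, assuming a flat list buffer; `scale_in_place` and `hidden` are hypothetical stand-ins, not names from the codebase:

```python
def scale_in_place(row, factor):
    """Hypothetical stand-in for an in-place op such as RoPE or a norm."""
    for i in range(len(row)):
        row[i] *= factor


hidden = [1.0, 2.0, 3.0, 4.0]  # flat buffer: 2 tokens x 2 dims (illustrative)

# Buggy pattern: the slice is a copy, so the update is silently discarded.
row = hidden[0:2]
scale_in_place(row, 10.0)
buggy_result = list(hidden)  # buffer is unchanged

# Fixed pattern: write the modified copy back with slice assignment.
row = hidden[0:2]
scale_in_place(row, 10.0)
hidden[0:2] = row
fixed_result = list(hidden)  # first token now holds the scaled values
```

Basic slices of NumPy arrays are views rather than copies, so the same code behaves differently once a buffer moves from a list to an array.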
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a97203c1a5
24 issues found across 49 files
Fixes identified by cubic AI review and GitHub CI failures:

Rust:
- Fix sliding-window attention for Mistral/Qwen in layer_is_global (separate sliding_window == 0 from sliding_window_pattern == 0)
- Fix oxidize_model_forward to reject logits buffer size mismatches
- Add test for uniform SWA models
- Run cargo fmt --all across workspace

Go:
- Fix GemvRust error handling in dflash_weights (don't swallow errors)
- Fix llama_decoder_forward to skip RoPE when ALiBi is enabled
- Fix bench.go to use actual prompt tokens instead of hardcoded BOS
- Fix bench.go zero-guard for average tok/s calculation
- Fix decoder_ffn to reuse normed buffer instead of allocating
- Fix parallelWorkers to use runtime.GOMAXPROCS(0)

Python:
- Fix generation.py suppress_tokens lower bounds check
- Fix cli.py off-by-one token generation loop
- Fix inference_config.py token-embedding dimension fallback
- Fix ruff errors: E501 (line too long), I001 (import sorting), F401 (unused imports), E741 (ambiguous variable names)

Tests:
- Fix oxidize-cli test expecting wrong binary name in help output
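The `layer_is_global` fix above separates two cases that are easy to conflate: no sliding window at all (every layer is global) and a sliding window with no interleaving pattern (every layer is windowed, as in uniform SWA models). A hedged Python sketch of that logic follows; it is not the Rust implementation. The `LayerConfig` class, the example window sizes, the rule that layer `i` is global when `(i + 1) % pattern == 0`, and the choice to apply `rope_theta_swa` only to interleaved models are all assumptions. Only the helper and field names come from this PR.

```python
from dataclasses import dataclass


@dataclass
class LayerConfig:
    """Hypothetical config holder; field names follow the PR text."""
    sliding_window: int = 0          # 0 = no sliding window at all
    sliding_window_pattern: int = 0  # 0 = no local/global interleaving
    rope_theta: float = 1e6          # global layers
    rope_theta_swa: float = 10000.0  # local sliding-window layers

    def layer_is_global(self, layer: int) -> bool:
        if self.sliding_window == 0:
            return True   # no window configured: full attention everywhere
        if self.sliding_window_pattern == 0:
            return False  # uniform SWA: every layer is windowed
        # Assumed convention: every Nth layer (counting from 1) is global.
        return (layer + 1) % self.sliding_window_pattern == 0

    def layer_sliding_window(self, layer: int) -> int:
        return 0 if self.layer_is_global(layer) else self.sliding_window

    def layer_rope_theta(self, layer: int) -> float:
        # Assumption: dual theta applies only to interleaved models.
        if self.sliding_window_pattern == 0 or self.layer_is_global(layer):
            return self.rope_theta
        return self.rope_theta_swa


# Window sizes below are illustrative values, not taken from the PR.
interleaved = LayerConfig(sliding_window=1024, sliding_window_pattern=6)
uniform_swa = LayerConfig(sliding_window=4096, sliding_window_pattern=0)
dense = LayerConfig()

interleaved_globals = [i for i in range(12) if interleaved.layer_is_global(i)]
```

With a pattern of 6, the sketch marks layers 5 and 11 of the first twelve as global, while the uniform SWA config keeps its window on every layer.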
- Python: Added deterministic RustModel.close(), safe __del__, and LD_LIBRARY_PATH sanitization.
- Go: Gated CGO GemvRust behind build tags (cgo,rust_ffi).
- Go: Restored 128-token MaxNewTokens default in CLI legacy run.
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Jackson57279 added a commit that referenced this pull request on Jun 7, 2026

Fixes unsafe_op_in_unsafe_fn errors introduced by PR #9:
- tensor.rs: swiglu_avx2, swiglu_avx2_inplace AVX2 intrinsics
- oxidize-ffi/src/lib.rs: all FFI raw pointer operations
Jackson57279 added a commit that referenced this pull request on Jun 7, 2026
Adds full Gemma-family inference support across Rust core and Go/Python ports. Includes dual RoPE theta, interleaved local/global attention, embedding scaling, sandwich normalization, and GeGLU activation. Also includes oxidize-ffi crate for Rust↔Go/Python fast path.
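For reference, the difference between the GeGLU activation listed above and SwiGLU is only the gating nonlinearity. A minimal scalar sketch using the standard tanh approximation of GELU; the function names are illustrative and the real kernels operate on tensors, not Python lists:

```python
import math


def gelu_tanh(x: float) -> float:
    """tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def silu(x: float) -> float:
    """SiLU (swish), the gate used by SwiGLU."""
    return x / (1.0 + math.exp(-x))


def geglu(gate, up):
    """GeGLU: gelu_tanh(gate) * up, elementwise."""
    return [gelu_tanh(g) * u for g, u in zip(gate, up)]


def swiglu(gate, up):
    """SwiGLU: silu(gate) * up, elementwise."""
    return [silu(g) * u for g, u in zip(gate, up)]
```

In both variants the elementwise product is what feeds the FFN down projection, so swapping the gate function is the only change needed at this step.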
Jackson57279 added a commit that referenced this pull request on Jun 7, 2026

Fixes multiple issues that were blocking CI on master after merging PR #9:

**Rust 2024 unsafe_op_in_unsafe_fn compliance:**
- Add #[allow(unsafe_op_in_unsafe_fn)] to all SIMD target_feature functions in tensor.rs (AVX2, AVX-512, NEON kernels)
- Wrap FFI raw pointer operations in explicit unsafe blocks in oxidize-ffi

**Dead code / unused import cleanup:**
- Remove unused imports across oxidize-server (audit, metrics, routes)
- Fix unused variables in oxidize-finetuning (lora.rs, trainer.rs)
- Allow dead_code for PrefixCache.config and DecoderState::Done
- Allow dead_code for inference_config_from_dflash in bench.rs

**Docker build fixes:**
- Add missing COPY statements for oxidize-train, oxidize-finetuning, oxidize-convert, oxidize-ffi workspace members in both Dockerfiles
- Update base Rust image from 1.88 to 1.95-bookworm for AVX-512 stable support

**MLX macOS Send/Sync:**
- Add unsafe impl Send/Sync for MlxComputeBackend (MLX C API is thread-safe)

**Test status:**
- Workspace compiles cleanly with RUSTFLAGS='-D warnings'
- 533 tests pass; 2 pre-existing failures in inference::tests and sampling::tests (documented in PR #9, not regressions)
- Docker builds (server + cli) pass successfully
Summary

Adds full Gemma-family inference support across the Rust core and the Go and Python ports. All Gemma behavior is gated behind the Gemma architecture, so other models are unaffected.

What proper Gemma support required (none of this existed before; without it, output is garbage):

- Dual RoPE theta: local sliding-window layers use `rope_theta_swa` (10000), global layers use `rope_theta` (1e6)
- Interleaved local/global attention: every Nth layer is global (gemma2=2, gemma3/4=6); sliding-window masking via last-N KV slicing
- Embedding scaling by `sqrt(hidden_size)` (input only; tied lm_head stays unscaled)
- Sandwich normalization: post-attention and post-FFN norms before each residual (`post_ffw_norm` now mapped); previously `post_attention_norm` was wrongly used as the pre-FFN norm
- GeGLU (tanh-GELU) activation instead of SwiGLU

Non-obvious detail: Gemma's `(1+weight)` RMSNorm is already baked into GGUF norm weights by the converter, so standard rms-norm (plain multiply) is correct.

Per-language notes

- Rust (`oxidize-core`): config fields + `layer_is_global` / `layer_rope_theta` / `layer_sliding_window` helpers; sandwich norms + GeGLU + windowing in both `forward_single` and `forward_batched`.
- Go (`oxidize-golang`): `UsesParallelAttnFFN` now Phi-only (Gemma is sequential, not parallel; this was a latent bug); decode + batch use the existing windowed flash kernel.
- Python (`oxidize-python`): mirrors the above; also fixes pre-existing `forward_batch` write-back bugs (list slices are copies, so RoPE / attention output / RMSNorms were silently discarded).

Tests

[…] `./core/...`, and the full Python suite pass (pre-existing unrelated failures only: `TestManualQAScriptShape`, and two dirty-tree Rust tests touching `head_scratch`/sampling).

Deferred

Per-local-layer KV ring buffer (memory optimization), Gemma 3 long-context RoPE linear scaling, Gemma 2 logit softcapping.

This branch also bundles unrelated in-progress work that was already present in the working tree (a Rust↔Go/Python FFI bridge: `oxidize-ffi/`, `generation/dflash` changes, `ffi.py`, `rust_model.go`, `Cargo.*`), committed at the author's request to capture everything in the tree. The Gemma changes are the model/config/forward files listed above.

🤖 Generated with Claude Code
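The `(1+weight)` RMSNorm detail from the summary can be checked numerically: if the converter stores `1 + weight`, a plain-multiply RMSNorm on the stored weights gives the same result as the Gemma-style formula on the raw weights. A small sketch with illustrative values; `rms_norm`, `gemma_rms_norm`, and the epsilon are assumptions for the example, not the project's functions:

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Standard RMSNorm: normalize, then plain multiply by weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]


def gemma_rms_norm(x, raw_weight, eps=1e-6):
    """Gemma-style RMSNorm on raw weights: multiply by (1 + weight)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * (1.0 + w) for v, w in zip(x, raw_weight)]


x = [0.5, -1.0, 2.0]
raw_weight = [0.1, -0.2, 0.3]                 # illustrative values
baked_weight = [1.0 + w for w in raw_weight]  # what the converter is described as storing

same = all(
    math.isclose(a, b)
    for a, b in zip(rms_norm(x, baked_weight), gemma_rms_norm(x, raw_weight))
)
```

Applying the `1 +` offset a second time at inference would therefore double-count it, which is why the plain multiply is the correct choice for GGUF weights.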
Summary by cubic

Adds full Gemma 2/3/4 inference across `oxidize-core`, `oxidize-golang`, and `oxidize-python`, plus a Rust FFI fast path for GEMV and end-to-end forward used by Go/Python benches and CLI. Also improves generation flow and performance (AVX2 prefetch, NumPy paths) without affecting other architectures.

New Features

- `oxidize-ffi` crate exposing quantized GEMV and full model load/forward; Go `cgo` and Python `ctypes` bindings; benches/CLI prefer the Rust fast path when present.
- Go: parallel workers follow `GOMAXPROCS`.
- `--max-context` to clamp context; integration tests target `models/Qwen3-4B-Q4_K_M.gguf`.

Bug Fixes

- Go: the Rust GEMV path is gated behind build tags `cgo,rust_ffi` and propagates errors; skip RoPE under ALiBi; legacy `run` restores 128-token default.
- Python: deterministic `RustModel.close()`, safe `__del__`, sanitized `LD_LIBRARY_PATH`; fixed `suppress_tokens` bounds and a generation off-by-one; GGUF token-embedding dim fallback.

Written for commit 218fc86. Summary will update on new commits.