SIGBUS crash on ARM64/Metal (Apple M4 Pro) after ~400+ embeddings #38

@nirajkvinit

System Info

  • OS: macOS Sequoia 15.x (Apple M4 Pro)
  • Go: 1.25.5
  • gollama.cpp: v0.2.2-llamacpp.b6862
  • Model: nomic-embed-text-v1.5.Q8_0.gguf (138MB)

Description

When generating embeddings for ~400+ text chunks using the embedding API, the process crashes with SIGBUS (bus error). The crash occurs in Metal GPU operations and is reproducible.

Stack Trace

SIGBUS: bus error
PC=0x1913e6388 m=17 sigcode=2 addr=0x1913e6388

goroutine 0 gp=0x2003d8f00 m=17 mp=0x200400d08 [idle]:
runtime.asyncPreempt2()
    /opt/homebrew/Cellar/go/1.25.5/libexec/src/runtime/preempt.go:254 +0x28

Key observation: PC == addr, meaning the faulting address is the program counter itself. That indicates the fault happened while fetching an instruction, which suggests code pointer corruption.
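The PC == addr check can be automated when triaging several crash logs. The following is a minimal sketch that parses the header line the Go runtime prints for a fatal signal and compares the two addresses. The names `faultInfo`, `parseFaultHeader`, and `isInstructionFetchFault` are hypothetical helpers written for illustration; they are not part of gollama.cpp or the Go runtime.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// faultInfo holds the two addresses from a fatal-signal header line
// such as "PC=0x1913e6388 m=17 sigcode=2 addr=0x1913e6388".
type faultInfo struct {
	PC   uint64
	Addr uint64
}

// headerRE captures the hexadecimal PC and addr fields of the header line.
var headerRE = regexp.MustCompile(`PC=(0x[0-9a-fA-F]+).*\baddr=(0x[0-9a-fA-F]+)`)

// parseFaultHeader extracts PC and addr from one line of crash output.
func parseFaultHeader(line string) (faultInfo, error) {
	m := headerRE.FindStringSubmatch(line)
	if m == nil {
		return faultInfo{}, fmt.Errorf("no PC/addr pair found in %q", line)
	}
	// Base 0 lets ParseUint honor the "0x" prefix.
	pc, err := strconv.ParseUint(m[1], 0, 64)
	if err != nil {
		return faultInfo{}, err
	}
	addr, err := strconv.ParseUint(m[2], 0, 64)
	if err != nil {
		return faultInfo{}, err
	}
	return faultInfo{PC: pc, Addr: addr}, nil
}

// isInstructionFetchFault reports whether the faulting address is the
// program counter itself, i.e. the CPU faulted while fetching code.
func (f faultInfo) isInstructionFetchFault() bool {
	return f.PC == f.Addr
}

func main() {
	info, err := parseFaultHeader("PC=0x1913e6388 m=17 sigcode=2 addr=0x1913e6388")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("pc=%#x addr=%#x fetch fault: %v\n",
		info.PC, info.Addr, info.isInstructionFetchFault())
}
```

A header where the two values differ would instead point at a data access from otherwise valid code, which would shift suspicion away from a corrupted function pointer.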

Reproduction Steps

  1. Load model with default params (Metal GPU enabled)
  2. Call Decode() repeatedly in a loop for embedding generation
  3. Process ~400+ text chunks (each ~500 chars)
  4. Crash occurs consistently after processing ~400 files' worth of embeddings

// Simplified reproduction
modelParams := gollama.Model_default_params()
model, _ := gollama.Model_load_from_file("nomic-embed-text-v1.5.Q8_0.gguf", modelParams)

ctxParams := gollama.Context_default_params()
ctxParams.Embeddings = 1
ctxParams.PoolingType = gollama.LLAMA_POOLING_TYPE_MEAN
ctx, _ := gollama.Init_from_model(model, ctxParams)

// Process many texts - crash after ~400 iterations
for _, text := range manyTexts {
    tokens, _ := gollama.Tokenize(model, text, true, false)
    gollama.Memory_clear(ctx, true)
    batch := gollama.Batch_get_one(tokens)
    gollama.Decode(ctx, batch)  // <-- SIGBUS occurs here eventually
    embPtr := gollama.Get_embeddings_ith(ctx, 0)
    // ... copy embeddings
}
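The snippet above leaves `manyTexts` undefined. For anyone trying to reproduce the crash without the original corpus, synthetic input of the same shape (more than 400 chunks of about 500 characters) might be enough. This is a sketch under that assumption: `makeChunks` is a hypothetical helper, and it is not confirmed that the crash depends only on the number of Decode() calls rather than on the text content.

```go
package main

import (
	"fmt"
	"strings"
)

// makeChunks returns n synthetic ASCII text chunks, each exactly size bytes
// long. Every chunk starts with its index so chunks are distinct and the
// last chunk processed before a crash can be identified from logs.
func makeChunks(n, size int) []string {
	const filler = "the quick brown fox jumps over the lazy dog "
	chunks := make([]string, 0, n)
	for i := 0; i < n; i++ {
		prefix := fmt.Sprintf("chunk %04d: ", i)
		// Repeat the filler often enough to exceed size, then truncate.
		body := strings.Repeat(filler, size/len(filler)+1)
		chunks = append(chunks, (prefix + body)[:size])
	}
	return chunks
}

func main() {
	// 500 chunks of 500 characters: comfortably past the ~400 threshold
	// reported in this issue.
	manyTexts := makeChunks(500, 500)
	fmt.Printf("generated %d chunks, first chunk is %d bytes\n",
		len(manyTexts), len(manyTexts[0]))
}
```

Printing the loop index to stderr before each Decode() call would then show exactly which iteration was in flight when the process died, since a SIGBUS raised in foreign code cannot be recovered from inside Go.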

Attempted Workarounds (All Failed)

| Approach | Result |
|---|---|
| NGpuLayers = 0 | Metal still initializes, crash persists |
| runtime.LockOSThread() | Still crashes |
| GODEBUG=asyncpreemptoff=1 | Still crashes |
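For context on the runtime.LockOSThread() attempt: calling it once in the caller only pins that goroutine, so one stricter variant is to funnel every FFI call through a single goroutine that stays locked to one OS thread for its whole lifetime. The sketch below shows that pattern using only the standard library. `ffiWorker`, `newFFIWorker`, `Do`, and `Close` are hypothetical names, the closure body stands in for the Tokenize/Decode/Get_embeddings_ith sequence, and nothing here is confirmed to avoid the crash.

```go
package main

import (
	"fmt"
	"runtime"
)

// ffiWorker runs submitted functions one at a time on a single goroutine
// that is pinned to one OS thread.
type ffiWorker struct {
	jobs chan func()
}

// newFFIWorker starts the pinned goroutine and returns a handle to it.
func newFFIWorker() *ffiWorker {
	w := &ffiWorker{jobs: make(chan func())}
	go func() {
		// Pin this goroutine to its OS thread until the worker is closed.
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		for job := range w.jobs {
			job()
		}
	}()
	return w
}

// Do runs fn on the pinned thread and blocks until fn has returned.
func (w *ffiWorker) Do(fn func()) {
	done := make(chan struct{})
	w.jobs <- func() {
		defer close(done)
		fn()
	}
	<-done
}

// Close stops the worker goroutine. Do must not be called afterwards.
func (w *ffiWorker) Close() {
	close(w.jobs)
}

func main() {
	w := newFFIWorker()
	defer w.Close()

	processed := 0
	for i := 0; i < 400; i++ {
		w.Do(func() {
			// Placeholder: the per-text embedding calls would go here.
			processed++
		})
	}
	fmt.Println("calls executed on the pinned thread:", processed)
}
```

If model loading and context creation also go through `Do`, every call into the native library originates from the same thread, which at least rules out thread affinity as a factor. GODEBUG=asyncpreemptoff=1 is separate: it is read from the environment when the process starts.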

Current Workaround

I am currently falling back to non-neural embeddings entirely, which defeats the purpose of using llama.cpp.

Analysis

The crash appears to be related to:

  1. Metal GPU state corruption during repeated Decode() calls
  2. Possible interaction between Go's async preemption (SIGURG signals) and purego FFI
  3. Memory alignment issues in Metal operations on ARM64

Since NGpuLayers = 0 doesn't disable Metal (it still initializes), there's no way to use CPU-only mode with the pre-built binaries.

Questions

  1. Is there a way to completely disable Metal backend at runtime?
  2. Are CPU-only pre-built binaries available for macOS ARM64?
  3. Is this a known issue with llama.cpp Metal on Apple Silicon?

Environment Details

ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 17179.89 MB

Thank you for this excellent library! Looking forward to any guidance on this issue.
