
UPSTREAM PR #1287: Update ggml to 0.9.7 release #61

Open

loci-dev wants to merge 1 commit into main from loci/pr-1287-update-ggml

Conversation

@loci-dev

Note

Source pull request: leejet/stable-diffusion.cpp#1287

loci-dev deployed to stable-diffusion-cpp-prod February 19, 2026 04:20 with GitHub Actions

loci-review bot commented Feb 19, 2026

Overview

The GGML 0.9.7 library update introduces mixed performance impacts across stable-diffusion.cpp. Analysis of 48,349 total functions reveals 416 modified (0.86%), 52 new, 2 removed, and 47,879 unchanged functions.

Binaries Analyzed:

  • build.bin.sd-server: Power consumption increased 1.23% (515,491 nJ → 521,813 nJ)
  • build.bin.sd-cli: Power consumption increased 1.42% (480,110 nJ → 486,915 nJ)

Overall Impact: Minor performance regression with an estimated 2-4% inference slowdown, driven primarily by quantized matrix-operation regressions that are only partially offset by activation-function improvements.

Function Analysis

Critical Regressions:

ggml_gemm_q6_K_8x8_q8_K_generic (quantized GEMM kernel):

  • sd-server: Response time +381ns (+12.1%), throughput time -3,008ns (-99.1%)
  • sd-cli: Response time +395ns (+12.5%), throughput time -3,008ns (-99.1%)
  • Refactored from inline computation to function delegation, introducing call overhead in performance-critical matrix multiplication operations

ggml_gemv_q6_K_8x8_q8_K_generic (quantized GEMV kernel):

  • sd-server: Response time +343ns (+12.9%), throughput time -2,523ns (-98.9%)
  • sd-cli: Response time +348ns (+13.1%), throughput time -2,520ns (-98.9%)
  • Similar refactoring pattern affecting matrix-vector multiplication operations (see the sketch below)
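
To make the regression mechanism concrete, here is a minimal C sketch of the inline-to-delegation pattern described above. This is not ggml's actual source; the function names (gemm_row_inline, dot_block, gemm_row_delegated) are hypothetical, and real q6_K kernels operate on quantized blocks rather than plain floats.

```c
/* Before: the dot-product loop lives inline in the caller's hot path. */
float gemm_row_inline(const float *a, const float *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        acc += a[i] * b[i];          /* no call boundary inside the loop */
    }
    return acc;
}

/* After: the same work is delegated to a helper. Unless the compiler
 * inlines it across the call site, every invocation pays call/return
 * and register-save overhead -- the kind of per-call cost consistent
 * with the +12-13% response-time regression reported for the q6_K
 * GEMM/GEMV kernels. */
static float dot_block(const float *a, const float *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) acc += a[i] * b[i];
    return acc;
}

float gemm_row_delegated(const float *a, const float *b, int n) {
    return dot_block(a, b, n);       /* function-call boundary per row */
}
```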

Notable Improvements:

Activation Functions (GELU/SiLU):

  • gelu_f16: Response time -40ns (-2.4%), throughput time +521ns (+86.3%)
  • gelu_f32: Response time -55ns (-2.4%), throughput time +507ns (+83.8%)
  • silu_f16: Response time -43ns (-2.0%), throughput time +518ns (+83.3%)
  • Inlining optimizations improve end-to-end performance despite increased self-time (reference definitions below)
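
For reference, these are the standard definitions of the two activations; the tanh approximation of GELU is the one commonly used in ggml-based code. This is a self-contained sketch, not the library's implementation.

```c
#include <math.h>

/* GELU, tanh approximation:
 * gelu(x) = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))) */
static inline float gelu_f32_ref(float x) {
    const float c = 0.7978845608f;   /* sqrt(2/pi) */
    return 0.5f * x * (1.0f + tanhf(c * (x + 0.044715f * x * x * x)));
}

/* SiLU (a.k.a. swish): silu(x) = x * sigmoid(x) = x / (1 + e^-x) */
static inline float silu_f32_ref(float x) {
    return x / (1.0f + expf(-x));
}
```

Marking such helpers static inline is the usual way to trade larger per-function self-time for better end-to-end latency, which matches the profile shift reported above.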

Unary Operations (negation, absolute value, square):

  • All variants: Response time -461ns (-25%), throughput time +4ns (+0.6%)
  • Optimized child function implementations significantly improve tensor operations (illustrative kernels below)
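
Illustrative element-wise unary kernels for the three operations named above; a sketch of the pattern, not ggml's source. Tight, branch-free loops over contiguous data are easy for the compiler to auto-vectorize, which is consistent with the ~25% response-time gain reported.

```c
#include <stddef.h>

void vec_neg_f32(size_t n, float *dst, const float *src) {
    for (size_t i = 0; i < n; i++) dst[i] = -src[i];
}

void vec_abs_f32(size_t n, float *dst, const float *src) {
    for (size_t i = 0; i < n; i++) dst[i] = src[i] < 0.0f ? -src[i] : src[i];
}

void vec_sqr_f32(size_t n, float *dst, const float *src) {
    for (size_t i = 0; i < n; i++) dst[i] = src[i] * src[i];
}
```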

Other analyzed functions, including STL utilities and memory-management operations, showed mixed results with minimal cumulative impact on inference performance.

Additional Findings

The update demonstrates intentional architectural trade-offs in GGML 0.9.7: refactoring complex matrix operations for maintainability while inlining simpler activation functions for performance. Matrix operations (GEMM/GEMV) are the computational backbone of neural network inference, called thousands of times per inference pass. The 12-13% regression in these critical kernels directly impacts overall throughput, particularly for models using Q6_K quantization. Activation function improvements (2-2.5%) and unary operation gains (25%) partially offset these regressions but cannot fully compensate given the dominance of matrix operations in inference workloads. The changes prioritize long-term code organization over short-term raw performance.
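
As a rough Amdahl-style sanity check (the kernel runtime share f below is an assumption, not a measured value): if the regressed q6_K kernels account for a fraction f of total inference time and each slows down by r ≈ 12%, the end-to-end slowdown is approximately f × r. The reported 2-4% estimate then corresponds to

    slowdown ≈ f × r  ⇒  0.02-0.04 ≈ f × 0.12  ⇒  f ≈ 0.17-0.33

before netting out the activation and unary-operation gains.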

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.
