
UPSTREAM PR #1287: Update ggml to 0.9.7 release #61

Open

loci-dev wants to merge 1 commit into main from loci/pr-1287-update-ggml

Conversation

@loci-dev

Note

Source pull request: leejet/stable-diffusion.cpp#1287

loci-dev deployed to stable-diffusion-cpp-prod February 19, 2026 04:20 with GitHub Actions

loci-review bot commented Feb 19, 2026

Overview

The GGML 0.9.7 library update introduces mixed performance impacts across stable-diffusion.cpp. Analysis of 48,349 total functions reveals 416 modified (0.86%), 52 new, 2 removed, and 47,879 unchanged functions.

Binaries Analyzed:

  • build.bin.sd-server: Power consumption increased 1.23% (515,491 nJ → 521,813 nJ)
  • build.bin.sd-cli: Power consumption increased 1.42% (480,110 nJ → 486,915 nJ)

Overall Impact: Minor performance regression with an estimated 2-4% inference slowdown, driven primarily by quantized matrix-operation regressions that are only partially offset by activation-function improvements.

Function Analysis

Critical Regressions:

ggml_gemm_q6_K_8x8_q8_K_generic (quantized GEMM kernel):

  • sd-server: Response time +381ns (+12.1%), throughput time -3,008ns (-99.1%)
  • sd-cli: Response time +395ns (+12.5%), throughput time -3,008ns (-99.1%)
  • Refactored from inline computation to function delegation, introducing call overhead in performance-critical matrix multiplication operations

ggml_gemv_q6_K_8x8_q8_K_generic (quantized GEMV kernel):

  • sd-server: Response time +343ns (+12.9%), throughput time -2,523ns (-98.9%)
  • sd-cli: Response time +348ns (+13.1%), throughput time -2,520ns (-98.9%)
  • Similar refactoring pattern affecting matrix-vector multiplication operations (see the sketch below)
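
To make the regression mechanism concrete, here is a minimal C sketch of the inline-to-delegation pattern described above. This is not ggml's actual source; the function names (gemm_row_inline, dot_block, gemm_row_delegated) are hypothetical, and real q6_K kernels operate on quantized blocks rather than plain floats.

```c
/* Before: the dot-product loop lives inline in the caller's hot path. */
float gemm_row_inline(const float *a, const float *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        acc += a[i] * b[i];          /* no call boundary inside the loop */
    }
    return acc;
}

/* After: the same work is delegated to a helper. Unless the compiler
 * inlines it across the call site, every invocation pays call/return
 * and register-save overhead -- the kind of per-call cost consistent
 * with the +12-13% response-time regression reported for the q6_K
 * GEMM/GEMV kernels. */
static float dot_block(const float *a, const float *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) acc += a[i] * b[i];
    return acc;
}

float gemm_row_delegated(const float *a, const float *b, int n) {
    return dot_block(a, b, n);       /* function-call boundary per row */
}
```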

Notable Improvements:

Activation Functions (GELU/SiLU):

  • gelu_f16: Response time -40ns (-2.4%), throughput time +521ns (+86.3%)
  • gelu_f32: Response time -55ns (-2.4%), throughput time +507ns (+83.8%)
  • silu_f16: Response time -43ns (-2.0%), throughput time +518ns (+83.3%)
  • Inlining optimizations improve end-to-end performance despite increased self-time (reference definitions below)
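
For reference, these are the standard definitions of the two activations; the tanh approximation of GELU is the one commonly used in ggml-based code. This is a self-contained sketch, not the library's implementation.

```c
#include <math.h>

/* GELU, tanh approximation:
 * gelu(x) = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))) */
static inline float gelu_f32_ref(float x) {
    const float c = 0.7978845608f;   /* sqrt(2/pi) */
    return 0.5f * x * (1.0f + tanhf(c * (x + 0.044715f * x * x * x)));
}

/* SiLU (a.k.a. swish): silu(x) = x * sigmoid(x) = x / (1 + e^-x) */
static inline float silu_f32_ref(float x) {
    return x / (1.0f + expf(-x));
}
```

Marking such helpers static inline is the usual way to trade larger per-function self-time for better end-to-end latency, which matches the profile shift reported above.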

Unary Operations (negation, absolute value, square):

  • All variants: Response time -461ns (-25%), throughput time +4ns (+0.6%)
  • Optimized child function implementations significantly improve tensor operations (illustrative kernels below)
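
Illustrative element-wise unary kernels for the three operations named above; a sketch of the pattern, not ggml's source. Tight, branch-free loops over contiguous data are easy for the compiler to auto-vectorize, which is consistent with the ~25% response-time gain reported.

```c
#include <stddef.h>

void vec_neg_f32(size_t n, float *dst, const float *src) {
    for (size_t i = 0; i < n; i++) dst[i] = -src[i];
}

void vec_abs_f32(size_t n, float *dst, const float *src) {
    for (size_t i = 0; i < n; i++) dst[i] = src[i] < 0.0f ? -src[i] : src[i];
}

void vec_sqr_f32(size_t n, float *dst, const float *src) {
    for (size_t i = 0; i < n; i++) dst[i] = src[i] * src[i];
}
```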

Other analyzed functions, including STL utilities and memory-management operations, showed mixed results with minimal cumulative impact on inference performance.

Additional Findings

The update demonstrates intentional architectural trade-offs in GGML 0.9.7: refactoring complex matrix operations for maintainability while inlining simpler activation functions for performance. Matrix operations (GEMM/GEMV) are the computational backbone of neural network inference, called thousands of times per inference pass. The 12-13% regression in these critical kernels directly impacts overall throughput, particularly for models using Q6_K quantization. Activation function improvements (2-2.5%) and unary operation gains (25%) partially offset these regressions but cannot fully compensate given the dominance of matrix operations in inference workloads. The changes prioritize long-term code organization over short-term raw performance.
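
As a rough Amdahl-style sanity check (the kernel runtime share f below is an assumption, not a measured value): if the regressed q6_K kernels account for a fraction f of total inference time and each slows down by r ≈ 12%, the end-to-end slowdown is approximately f × r. The reported 2-4% estimate then corresponds to

    slowdown ≈ f × r  ⇒  0.02-0.04 ≈ f × 0.12  ⇒  f ≈ 0.17-0.33

before netting out the activation and unary-operation gains.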

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.
