UPSTREAM PR #1273: fix: avoid black images if using an invalid VAE (for SDXL) #62

Open
loci-dev wants to merge 1 commit into main from loci/pr-1273-leejet_reorg

@loci-dev
Note

Source pull request: leejet/stable-diffusion.cpp#1273

If we inadvertently provide an invalid VAE file (for example, `--vae sdxl_invalid_vae.sft`), we get a black image later, after some U-Net loops. This can happen because of a typo, an invalid symlink, and so on.
So in that case we now behave as if the option `--force-sdxl-vae-conv-scale` had been given.
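
The fallback described above can be pictured roughly as follows. This is a minimal sketch and not the code from the upstream change: `SdxlVaeOptions`, `resolve_sdxl_vae_options`, and all of their parameters are hypothetical names introduced only for illustration, and how the real code decides that a VAE file is invalid is not shown in this PR description.

```cpp
// Hypothetical option holder; not a type from stable-diffusion.cpp.
struct SdxlVaeOptions {
    bool force_conv_scale;  // stands in for --force-sdxl-vae-conv-scale
};

// Hypothetical helper: decide whether the conv-scale behaviour should be on.
//   is_sdxl       - the loaded model is an SDXL model
//   vae_requested - the user passed an external VAE file (--vae ...)
//   vae_valid     - that file was usable (assumed to be known by the caller)
//   user_forced   - the user passed --force-sdxl-vae-conv-scale explicitly
SdxlVaeOptions resolve_sdxl_vae_options(bool is_sdxl,
                                        bool vae_requested,
                                        bool vae_valid,
                                        bool user_forced) {
    SdxlVaeOptions opts{user_forced};
    // An SDXL model with a requested but unusable VAE is treated as if the
    // flag had been given, instead of decoding to a black image.
    if (is_sdxl && vae_requested && !vae_valid) {
        opts.force_conv_scale = true;
    }
    return opts;
}
```

Under these assumptions, a valid VAE or a non-SDXL model leaves the user's explicit choice untouched; only the "SDXL plus broken VAE" case flips the flag.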

@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod February 19, 2026 04:20 — with GitHub Actions Inactive
@loci-review
loci-review bot commented Feb 19, 2026

Overview

Analysis of 48,313 functions across two binaries reveals mixed performance impact from a single commit implementing VAE validation for SDXL models. Modified functions: 58 (0.12%). New: 0. Removed: 0. Unchanged: 48,255.

Binaries analyzed:

  • build.bin.sd-server: 515,491 nJ → 518,784 nJ (+0.64%)
  • build.bin.sd-cli: 480,110 nJ → 483,568 nJ (+0.72%)

Function Analysis

Critical improvements:

  • ggml_vec_dot_f32 (build.bin.sd-cli): Response time 1966ns → 1771ns (-10.0%, -196ns), throughput time 1950ns → 1755ns (-10.0%, -196ns). This ARM NEON vectorized function is called millions of times per inference, providing substantial cumulative performance gains across all matrix operations.

Concerning regressions:

  • forward_mul_mat (build.bin.sd-server): Response time 15,028ns → 15,736ns (+4.7%, +708ns), throughput time stable at 2,377ns (-0.016%). Regression occurs in child functions rather than core algorithm. Affects quantized matrix multiplication in UNet layers, called thousands of times per inference.

Initialization regressions:

  • std::vector<gguf_kv>::begin() (build.bin.sd-server): Throughput time 61ns → 243ns (+297%, +182ns)
  • std::shared_ptr::_M_destroy for T5CLIPEmbedder (build.bin.sd-cli): Throughput time 105ns → 294ns (+180%, +189ns)
  • make_block_q4_Kx8 (build.bin.sd-server): Response time 8,126ns → 8,768ns (+7.9%, +642ns)

Initialization improvements:

  • std::make_move_iterator (build.bin.sd-server): Throughput time 246ns → 78ns (-68.4%, -169ns)
  • Darts::AutoPool::resize_buf (build.bin.sd-cli): Throughput time 300ns → 247ns (-17.5%, -53ns)

Other analyzed functions showed minor changes in STL operations, swap functions, and container management with negligible cumulative impact.

Additional Findings

The commit modified only src/stable-diffusion.cpp to add VAE validation logic preventing black images in SDXL. Most performance changes stem from compiler optimization variations rather than source modifications. The 10% improvement in ggml_vec_dot_f32 (extremely high call frequency) likely outweighs the 4.7% regression in forward_mul_mat, resulting in net positive inference performance. Initialization regressions add microseconds to multi-second model loading, representing negligible user impact. The correctness improvement justifies minor performance trade-offs.
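
The "likely outweighs" argument is a weighted sum: each function's per-call change multiplied by how often it is called. A minimal sketch of that arithmetic follows. The per-call deltas are the ones reported above, but the call counts in the usage note are illustrative assumptions (the report only says "millions" and "thousands"), and `FnDelta` and `net_delta_ns` are hypothetical names. The sketch also ignores any overlap between a function's response time and the time of its callees.

```cpp
#include <cstdint>

// Hypothetical record: per-call timing change and an assumed call count.
struct FnDelta {
    double delta_ns;      // per-call change in ns (negative = faster)
    std::int64_t calls;   // ASSUMED calls per inference, not measured
};

// Net change per inference in ns: sum of delta_ns * calls over all entries.
double net_delta_ns(const FnDelta* fns, int n) {
    double total = 0.0;
    for (int i = 0; i < n; ++i) {
        total += fns[i].delta_ns * static_cast<double>(fns[i].calls);
    }
    return total;
}
```

For example, assuming 1,000,000 calls to ggml_vec_dot_f32 (-196 ns each) and 1,000 calls to forward_mul_mat (+708 ns each), `net_delta_ns` gives -196,000,000 + 708,000 = -195,292,000 ns, a net saving of about 0.195 s per inference. Different call counts would change the size of the result, so the figure only illustrates the reasoning.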

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.
