UPSTREAM PR #1124: feat: support for cancelling generations #44

Open

loci-dev wants to merge 2 commits into main from loci/pr-1124-sd_cancel

Conversation

@loci-dev loci-dev commented Feb 2, 2026

Note

Source pull request: leejet/stable-diffusion.cpp#1124

Adds an sd_cancel_generation function that can be called asynchronously to interrupt the current generation.

The log handling is still a bit rough around the edges, but I wanted to gather more feedback before polishing it. I've included a flag for finer control over what to cancel: everything, or keep and decode the already-generated latents while cancelling the current and subsequent generations. Would an extra "finish the already-started latent but cancel the batch" mode be useful? Or should I simplify instead and keep just the cancel-everything mode?

The function should be safe to call from the progress or preview callbacks, a separate thread, or a signal handler. I've included a Unix signal handler in main.cpp just to be able to test it: the first Ctrl+C cancels the batch and the current generation but still finishes the already-generated latents, while a second Ctrl+C cancels everything (although at that point it no longer interrupts in the middle of a generation step).
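A minimal sketch of the escalating Ctrl+C handler described above. The integer-mode signature of sd_cancel_generation and the mode constants are assumptions for illustration; the actual API in the PR may differ.

```cpp
#include <csignal>

// Hypothetical cancellation modes mirroring the two Ctrl+C levels:
// cancel the batch but keep finished latents, or cancel everything.
enum { SD_CANCEL_BATCH = 1, SD_CANCEL_ALL = 2 };
extern "C" void sd_cancel_generation(int mode);  // assumed shape of the new API

static volatile std::sig_atomic_t g_sigint_count = 0;

extern "C" void sigint_handler(int) {
    // The PR states sd_cancel_generation is safe to call from a signal
    // handler (it boils down to an atomic store), so this is all we do here.
    g_sigint_count = g_sigint_count + 1;
    sd_cancel_generation(g_sigint_count == 1 ? SD_CANCEL_BATCH : SD_CANCEL_ALL);
}

// In main(), before starting generation:
//   std::signal(SIGINT, sigint_handler);
```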

fixes #1036

loci-review bot commented Feb 2, 2026

Overview

Analysis of 47,950 functions across two binaries reveals minimal net performance impact between versions. Modified functions: 75 (0.16%), new: 59, removed: 31, unchanged: 47,785 (99.66%).

Power Consumption:

  • build.bin.sd-cli: 469,513.77 nJ (base: 469,680.15 nJ, -0.035%)
  • build.bin.sd-server: 502,702.24 nJ (base: 502,761.30 nJ, -0.012%)

Both binaries show negligible power consumption changes, indicating the modifications are performance-neutral overall.

Function Analysis

Most performance variations occur in C++ Standard Library functions and external GGML library code rather than application code. The primary code change—adding atomic-based cancellation support—has minimal direct performance impact.

Significant Regressions:

  • std::_Rb_tree::end() (build.bin.sd-cli): Response time +183.29 ns (+227.95%), throughput time +183.29 ns (+306.60%). STL function regression likely from compiler optimization differences.
  • ggml_view_2d (build.bin.sd-cli): Throughput time +32.12 ns (+47.00%), response time +24.03 ns (+1.16%). Critical tensor reshaping operation used extensively in attention mechanisms.
  • gguf_writer::write (build.bin.sd-cli): Throughput time +85.70 ns (+43.14%), response time +88.51 ns (+0.61%). Affects model serialization, not inference hot paths.
  • ggml_vec_scale_f16 (build.bin.sd-cli): Throughput time +76.95 ns (+8.66%), response time +76.99 ns (+5.62%). SIMD vector scaling operation in inference path.

Significant Improvements:

  • std::_Rb_tree::_M_insert_unique() (build.bin.sd-cli): Throughput time -90.67 ns (-46.04%), response time -91.62 ns (-5.20%). STL red-black tree insertion optimization.
  • std::unordered_map::operator[] (build.bin.sd-cli): Throughput time -62.72 ns (-32.31%), response time -64.65 ns (-1.16%). Benefits LoRA state management operations.
  • std::map::operator[] (build.bin.sd-cli): Throughput time -61.63 ns (-28.93%), response time -62.83 ns (-1.47%). Improves parameter lookup operations.

Other analyzed functions showed minor changes in STL container operations, quantization validation, and memory management, with absolute impacts under 50 ns per call.

Additional Findings

ML tensor operations show modest cumulative regressions. The combination of ggml_view_2d (+32 ns), ggml_vec_scale_f16 (+77 ns), and apply_unary_op (+71 ns) adds approximately 180 ns overhead per operation set. For diffusion models with multiple attention layers and denoising steps, this could accumulate to low milliseconds per generation. However, improvements in LoRA state management and container operations partially offset these regressions. All changes originate from external GGML library or compiler optimizations rather than application code modifications.
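For scale, a back-of-the-envelope check: if a generation executes on the order of 1,000 such operation sets per denoising step (an illustrative figure, not a measured one) across 30 steps, the added overhead is roughly 180 ns × 1,000 × 30 ≈ 5.4 ms per image, consistent with the low-millisecond estimate.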

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 3ad80c4 to 74d69ae on February 12, 2026 04:47
Co-authored-by: donington <jandastroy@gmail.com>
@loci-dev loci-dev force-pushed the loci/pr-1124-sd_cancel branch from d8382d6 to 2f7bae7 on February 14, 2026 04:15
@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod on February 14, 2026 04:15 with GitHub Actions
loci-review bot commented Feb 14, 2026

Overview

Analysis of stable-diffusion.cpp across 48,972 functions (107 modified, 658 new, 0 removed) reveals major performance improvements in server responsiveness through architectural changes implementing asynchronous execution with GPU operation cancellation.

Binaries analyzed:

  • build.bin.sd-cli: +0.703% power consumption (480,109.60 nJ → 483,484.74 nJ)
  • build.bin.sd-server: +1.421% power consumption (515,491.29 nJ → 522,818.11 nJ)

Function Analysis

HTTP Request Handlers (build.bin.sd-server) — Three endpoint handlers show dramatic improvements:

  • Image Edits endpoint (main.cpp__ZZ4mainENKUlRKN7httplib7RequestERNS_8ResponseEbE_clES2_S4_b): Response time reduced from 36,331,944 ns to 4,453,149 ns (-87.74%, -31.88 ms). Throughput time decreased minimally from 3,952.54 ns to 3,842.31 ns (-2.79%). The code changes wrap generate_image() in std::async with 1-second polling for client disconnection, invoking sd_cancel_generation() to abort GPU operations once the client has gone away (see the sketch after this list).

  • Image Variations endpoint (main.cpp__ZZ4mainENKUlRKN7httplib7RequestERNS_8ResponseEE3_clES2_S4_): Response time reduced from 37,797,292 ns to 5,922,231 ns (-84.33%, -31.88 ms). Throughput time decreased from 4,221.51 ns to 4,156.33 ns (-1.54%). Same async execution pattern prevents wasted GPU computation.

  • Image Generations endpoint (main.cpp__ZZ4mainENKUlRKN7httplib7RequestERNS_8ResponseEE2_clES2_S4_): Response time reduced from 39,296,084 ns to 7,424,411 ns (-81.11%, -31.87 ms). Throughput time decreased from 2,603.25 ns to 2,542.30 ns (-2.34%). Primary text-to-image endpoint benefits from cancellation infrastructure.
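The shared pattern behind these three handlers can be sketched as below. This is a hedged illustration, not the PR's exact code: it assumes cpp-httplib's req.is_connection_closed() hook, and generate_image() and the integer-mode sd_cancel_generation() signature are illustrative stand-ins.

```cpp
#include <chrono>
#include <future>
#include "httplib.h"

extern "C" void sd_cancel_generation(int mode);  // assumed shape of the new API
int generate_image();                            // stand-in for the real call

void handle_generations(const httplib::Request &req, httplib::Response &res) {
    // Run the GPU-bound generation off the HTTP thread.
    auto fut = std::async(std::launch::async, [] { return generate_image(); });

    // Poll once per second; if the client has disconnected, abort the GPU
    // work instead of finishing an image nobody will receive.
    while (fut.wait_for(std::chrono::seconds(1)) != std::future_status::ready) {
        if (req.is_connection_closed()) {
            sd_cancel_generation(2 /* hypothetical "cancel everything" mode */);
        }
    }
    // ... on success, encode fut.get() into res; on cancellation, report an error ...
}
```

Continuing to wait after requesting cancellation lets the worker thread unwind cleanly before the handler returns, which is consistent with the near-flat throughput times: the large response-time wins come from the GPU work that no longer runs, not from the handler code itself.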

Standard Library Functions — Multiple STL functions show mixed performance with sub-microsecond absolute impacts. Iterator operations for LoraModel vectors improved 42-48%, while allocator functions regressed 23-307% in throughput time. These changes appear compiler-related rather than code-driven, with negligible real-world impact given absolute times remain under 400 nanoseconds.

Additional Findings

The architectural transformation introduces cancellation check points in three critical GPU loops (denoising, batch processing, VAE decoding) using atomic flags. The minimal throughput time changes in HTTP handlers (-1.5% to -2.8%) confirm improvements stem from eliminated GPU operations in call chains rather than handler code modifications. The 1.075% power consumption increase is negligible compared to prevented GPU computation on abandoned requests, which typically saves 5-25 seconds of inference time per cancelled operation. This represents production-ready resource management for ML inference servers handling long-running stable diffusion workloads.
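A minimal sketch of what such a check point might look like, built on an atomic mode flag; all identifiers here are illustrative rather than the PR's actual names:

```cpp
#include <atomic>

// 0 = run, 1 = cancel batch (finish current latent), 2 = cancel everything.
static std::atomic<int> g_cancel{0};

extern "C" void sd_cancel_generation(int mode) {
    // A lock-free atomic store is what makes this safe to call from progress
    // callbacks, other threads, and signal handlers.
    g_cancel.store(mode, std::memory_order_relaxed);
}

// The same check-point pattern guards the denoising, batch, and VAE-decode loops:
void denoise(int steps) {
    for (int step = 0; step < steps; ++step) {
        if (g_cancel.load(std::memory_order_relaxed) == 2)
            break;  // cancel-all: stop between steps, abandoning this latent
        // ... run one GPU denoising step ...
    }
}
```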

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.
