UPSTREAM PR #1124: feat: support for cancelling generations #44
Conversation
Overview

Analysis of 47,950 functions across two binaries reveals minimal net performance impact between versions. Modified functions: 75 (0.16%), new: 59, removed: 31, unchanged: 47,785 (99.66%).

Power Consumption: Both binaries show negligible power consumption changes, indicating balanced performance across modifications.

Function Analysis

Most performance variations occur in C++ Standard Library functions and external GGML library code rather than in application code. The primary code change—adding atomic-based cancellation support—has minimal direct performance impact.

Significant Regressions:

Significant Improvements:

Other analyzed functions showed minor changes in STL container operations, quantization validation, and memory management, with absolute impacts under 50 ns per call.

Additional Findings

ML tensor operations show modest cumulative regressions.

🔎 Full breakdown: Loci Inspector.
Force-pushed from 68f62a5 to 342c73d
Force-pushed from 3ad80c4 to 74d69ae
Co-authored-by: donington <jandastroy@gmail.com>
Force-pushed from d8382d6 to 2f7bae7
Overview

Analysis of stable-diffusion.cpp across 48,972 functions (107 modified, 658 new, 0 removed) reveals major performance improvements in server responsiveness through architectural changes implementing asynchronous execution with GPU operation cancellation.

Binaries analyzed:

Function Analysis

HTTP Request Handlers (build.bin.sd-server) — Three endpoint handlers show dramatic improvements:

Standard Library Functions — Multiple STL functions show mixed performance with sub-microsecond absolute impacts. Iterator operations for LoraModel vectors improved 42-48%, while allocator functions regressed 23-307% in throughput time. These changes appear compiler-related rather than code-driven, with negligible real-world impact given that absolute times remain under 400 nanoseconds.

Additional Findings

The architectural transformation introduces cancellation checkpoints in three critical GPU loops (denoising, batch processing, VAE decoding) using atomic flags. The minimal throughput-time changes in the HTTP handlers (-1.5% to -2.8%) confirm that the improvements stem from eliminated GPU operations in call chains rather than from handler code modifications. The 1.075% increase in power consumption is negligible compared to the GPU computation prevented on abandoned requests, which typically saves 5-25 seconds of inference time per cancelled operation. This represents production-ready resource management for ML inference servers handling long-running stable-diffusion workloads.

🔎 Full breakdown: Loci Inspector.
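The checkpoint pattern described above can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code: the names (`g_cancelled`, `denoise`, `decode_vae`, `run_batch`) are hypothetical, and only the placement of atomic-flag checks in the three loop levels mirrors the description.

```cpp
#include <atomic>

// Hypothetical cancellation flag; the PR's real symbol may differ.
static std::atomic<bool> g_cancelled{false};

static inline bool cancelled() {
    // A relaxed load suffices for a best-effort early exit and keeps the
    // per-iteration cost of the check negligible.
    return g_cancelled.load(std::memory_order_relaxed);
}

// Checkpoint 1: the per-step denoising loop.
bool denoise(int steps) {
    for (int s = 0; s < steps; ++s) {
        if (cancelled()) return false;  // skip the remaining GPU steps
        // ... one sampler step on the GPU ...
    }
    return true;
}

// Checkpoint 2: VAE decoding of a finished latent.
bool decode_vae() {
    if (cancelled()) return false;
    // ... decode the latent to an image ...
    return true;
}

// Checkpoint 3: the batch loop, which also gates the other two.
int run_batch(int batch, int steps) {
    int images = 0;
    for (int b = 0; b < batch; ++b) {
        if (cancelled()) break;  // skip the remaining batch items
        if (denoise(steps) && decode_vae()) ++images;
    }
    return images;
}
```

Because every abandoned request only needs one store to `g_cancelled` to take effect at the next checkpoint, the whole multi-second GPU call chain downstream of the handler is skipped, which is consistent with the handler-level timing changes reported above.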
Note
Source pull request: leejet/stable-diffusion.cpp#1124
Adds an `sd_cancel_generation` function that can be called asynchronously to interrupt the current generation.

The log handling is still a bit rough around the edges, but I wanted to gather more feedback before polishing it. I've included a flag to allow finer control of what to cancel: either everything, or keep and decode the already-generated latents but cancel the current and next generations. Would an extra "finish the already started latent but cancel the batch" mode be useful? Or should I simplify it instead, keeping just the cancel-everything mode?
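The two modes described above could look roughly like this. Everything here is an illustrative sketch: the enum, flag, and function names are made up for the example, and the exact signature of the real `sd_cancel_generation` is not shown in this PR text.

```cpp
#include <atomic>
#include <functional>

// Hypothetical cancellation modes matching the description above.
enum CancelMode { CANCEL_NONE, CANCEL_KEEP_LATENTS, CANCEL_ALL };

static std::atomic<int> g_cancel_mode{CANCEL_NONE};

// Safe to call from another thread or from a progress callback.
void request_cancel(CancelMode mode) {
    g_cancel_mode.store(mode, std::memory_order_relaxed);
}

// Generates up to `batch` latents of `steps` steps each, invoking a progress
// callback per step. Returns how many latents survive to be decoded.
int generate_batch(int batch, int steps,
                   const std::function<void(int, int)>& on_step) {
    int kept = 0;
    for (int b = 0; b < batch; ++b) {
        bool finished = true;
        for (int s = 0; s < steps; ++s) {
            if (g_cancel_mode.load(std::memory_order_relaxed) != CANCEL_NONE) {
                finished = false;  // both modes abort the in-flight latent
                break;
            }
            on_step(b, s);  // ... one denoising step on the GPU ...
        }
        if (finished) ++kept;
        if (g_cancel_mode.load(std::memory_order_relaxed) != CANCEL_NONE)
            break;  // neither mode starts new generations
    }
    // CANCEL_KEEP_LATENTS still decodes the finished latents;
    // CANCEL_ALL discards everything.
    return g_cancel_mode.load(std::memory_order_relaxed) == CANCEL_ALL ? 0 : kept;
}
```

The proposed extra mode ("finish the already started latent but cancel the batch") would simply move the mode check out of the inner step loop for that mode, so the design question is mostly about how many distinct stopping points are worth exposing.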
The function should be safe to call from the progress or preview callbacks, a separate thread, or a signal handler. I've included a Unix signal handler in `main.cpp` just to be able to test it: the first Ctrl+C cancels the batch and the current generation but still finishes the already-generated latents, while a second Ctrl+C cancels everything (although it won't interrupt it in the middle of a generation step anymore).

fixes #1036
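The escalating Ctrl+C behaviour can be sketched like this. The counter and handler names are hypothetical stand-ins; the real handler in the PR would presumably forward the request via `sd_cancel_generation` rather than expose a raw counter.

```cpp
#include <atomic>
#include <csignal>

// Hypothetical counter of SIGINT deliveries; lock-free atomics are safe
// to touch from a signal handler.
static std::atomic<int> g_sigint_count{0};

extern "C" void on_sigint(int signum) {
    // Re-arm first: ISO C allows the disposition to reset to SIG_DFL on
    // delivery (SysV semantics), and std::signal is async-signal-safe.
    std::signal(signum, on_sigint);
    // A lock-free atomic increment is the only other work done here. The
    // generation loop polls the counter and maps 1 -> cancel the batch
    // but keep finished latents, >= 2 -> cancel everything.
    g_sigint_count.fetch_add(1, std::memory_order_relaxed);
}

void install_sigint_handler() {
    std::signal(SIGINT, on_sigint);
}
```

Keeping the handler down to a single atomic store or increment is what makes it safe to share the same entry point between a signal handler, the preview callbacks, and other threads.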