fix(pool): prevent num_idle underflow that wedges maintenance in a CPU spin#4289
Open
nipunn1313 wants to merge 2 commits into
Open
fix(pool): prevent num_idle underflow that wedges maintenance in a CPU spin#4289nipunn1313 wants to merge 2 commits into
nipunn1313 wants to merge 2 commits into
Conversation
…U spin Fixes transact-rs#3645 - A rare 100% CPU and a hang on shutdown. I made this fix w/ the assistance of an LLM, but I was looking quite carefully to make sure the code changes make sense. I opted to repro via a stress test - which shows the underflow if you run the test without the fix. Leaving an LLM generated description down below in case you find it helpful. --------------------------------------------------------------------- `PoolInner::release` made a returned connection acquirable (push to the idle queue + release its semaphore permit) *before* incrementing `num_idle`. A concurrent `pop_idle` (via `try_acquire`) could pop that connection and run its `num_idle.fetch_sub(1)` in the window before the increment landed; if `num_idle` was 0 at that moment the `usize` wrapped to `usize::MAX`. The background maintenance task builds its sweep range from `for _ in 0..pool.num_idle()`, and once the idle queue drains the loop body is fully synchronous (`try_acquire`/`release` never `.await`). So a single bad read of `usize::MAX` makes the task spin ~10^19 iterations without yielding, pinning a worker thread forever and never observing the pool close. This is wide open on shutdown, where dropping every `PoolConnection` spawns a `release`, racing the maintenance tick. A `MultiDbPool` (one pool per database) wedges several tasks at once, pegging every worker thread — the reported 100%-CPU-on-shutdown symptom. Fixes, in `sqlx-core/src/pool/inner.rs`: - release: increment `num_idle` *before* publishing the connection. With this order each completed `push` is preceded by its `fetch_add`, so `#dec <= #pop <= #push <= #inc` and `num_idle = #inc - #dec >= 0` always. Transient over-counting is harmless (`pop_idle` finds an empty queue and returns the permit without decrementing). The permit is still released after the push, preserving the woken-waiter-sees-the-connection invariant. - pop_idle: saturating decrement, so a stray `fetch_sub` can never wrap even if a future change reintroduces an ordering bug. Adds a multi-threaded stress test that drives the real `try_acquire` and `release` paths on a small, heavily oversubscribed pool (using the in-crate `Any` database with a stub backend, so no DB or runtime is needed) and asserts the sampled `num_idle` never exceeds `max_connections`. It catches the wrapped `usize::MAX` reliably on the original code (5/5 debug, 3/3 release runs) and passes with the fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
nipunn1313
commented
May 30, 2026
| #[derive(Debug)] | ||
| struct StubBackend; | ||
|
|
||
| impl AnyConnectionBackend for StubBackend { |
Contributor
Author
There was a problem hiding this comment.
would take advice on how to write the test here - this is a bit gnarly, but it worked.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #3645 - A rare 100% CPU and a hang on shutdown.
I made this fix w/ the assistance of an LLM, but I was looking quite carefully to make sure the code changes make sense.
I opted to repro via a stress test - which shows the underflow if you run the test without the fix.
At Convex - we run an hourly job that makes heavy use of sqlx with idle connection configuration. Something like once a week or two, we run into this, where it hangs on shutdown with 100% CPU. I was able to profile traces and narrow down the issue.
Leaving an LLM generated description down below in case you find it helpful.
PoolInner::releasemade a returned connection acquirable (push to the idle queue + release its semaphore permit) before incrementingnum_idle. A concurrentpop_idle(viatry_acquire) could pop that connection and run itsnum_idle.fetch_sub(1)in the window before the increment landed; ifnum_idlewas 0 at that moment theusizewrapped tousize::MAX.The background maintenance task builds its sweep range from
for _ in 0..pool.num_idle(), and once the idle queue drains the loop body is fully synchronous (try_acquire/releasenever.await). So a single bad read ofusize::MAXmakes the task spin ~10^19 iterations without yielding, pinning a worker thread forever and never observing the pool close. This is wide open on shutdown, where dropping everyPoolConnectionspawns arelease, racing the maintenance tick. AMultiDbPool(one pool per database) wedges several tasks at once, pegging every worker thread — the reported 100%-CPU-on-shutdown symptom.Fixes, in
sqlx-core/src/pool/inner.rs:num_idlebefore publishing the connection. With this order each completedpushis preceded by itsfetch_add, so#dec <= #pop <= #push <= #incandnum_idle = #inc - #dec >= 0always. Transient over-counting is harmless (pop_idlefinds an empty queue and returns the permit without decrementing). The permit is still released after the push, preserving the woken-waiter-sees-the-connection invariant.fetch_subcan never wrap even if a future change reintroduces an ordering bug.Adds a multi-threaded stress test that drives the real
try_acquireandreleasepaths on a small, heavily oversubscribed pool (using the in-crateAnydatabase with a stub backend, so no DB or runtime is needed) and asserts the samplednum_idlenever exceedsmax_connections. It catches the wrappedusize::MAXreliably on the original code (5/5 debug, 3/3 release runs) and passes with the fix.