
fix: flush pgstat counters from worker so autovacuum sees its writes #254

Merged
utkarash2991 merged 1 commit into master from fix/pgstat-flush-in-worker on May 18, 2026

@utkarash2991 commented May 17, 2026

The `pg_net` background worker performs DML on `net._http_response` and `net.http_request_queue` via
SPI but never calls `pgstat_report_stat()`. As a result, the per-write counters (`n_tup_ins`, `n_tup_del`,
`n_mod_since_analyze`) for the worker's writes never reach shared stats. Autovacuum and autoanalyze
read those counters to decide when to run. While the counters stay at zero, no run is ever scheduled,
and `net._http_response` accumulates bloat indefinitely.
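
To make the "counters stay at zero" point concrete, here is a minimal sketch of the documented autoanalyze trigger: a table is analyzed once `n_mod_since_analyze` exceeds `autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples`. The function names and the 1,000,000-row table are hypothetical and chosen only for illustration; the defaults (50 and 0.1) are the stock PostgreSQL settings. This is arithmetic only, not the code autovacuum actually runs.

```python
# Illustrative arithmetic only; the real check lives inside PostgreSQL's autovacuum code.
# The defaults mirror the stock values of the corresponding settings.


def analyze_threshold(reltuples, base=50, scale_factor=0.1):
    """autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples."""
    return base + scale_factor * reltuples


def needs_autoanalyze(n_mod_since_analyze, reltuples):
    """True once modifications since the last ANALYZE exceed the threshold."""
    return n_mod_since_analyze > analyze_threshold(reltuples)


# Hypothetical table size, used only for this example.
REL_TUPLES = 1_000_000

# Worker writes were never flushed: shared stats still report 0 modifications.
unflushed = needs_autoanalyze(0, REL_TUPLES)

# Same workload with the counters flushed: 250,000 modifications are visible.
flushed = needs_autoanalyze(250_000, REL_TUPLES)
```

With a counter stuck at zero the comparison can never succeed, no matter how many rows the worker has actually written.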
  
User-backend INSERTs into `net.http_request_queue` (via `net.http_get/post/delete`) keep that
table's autovacuum cadence healthy in practice, because user backends flush pgstat automatically from the main
loop in `tcop/postgres.c`. The customer-visible failure is therefore specific to `net._http_response`, which
has no user-backend traffic to compensate. Eventually the bloated `_http_response_created_idx`
makes the worker's expiry query (`ORDER BY created LIMIT $batch`) walk long stretches of dead index
entries, yielding DELETEs that take 20–100 s and IO-wait spikes.

What kind of change does this PR introduce?

Bug fix

What is the current behavior?

Autovacuum and autoanalyze never trigger on `net._http_response`.

Fix

Call `pgstat_report_stat(false)` after each worker transaction commits. Per `pgstat.c`:

"Must be called by processes that performs DML: tcop/postgres.c, logical receiver processes, SPI worker, etc. to flush pending statistics updates to shared memory."

Regular user backends invoke it from the main loop after each query; background workers have no equivalent and must flush themselves.
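
The flush behaviour described above can be pictured with a toy model. This is not PostgreSQL code: `ToyBackend`, `insert`, and `report_stat` are hypothetical names that only mimic the idea that a process accumulates counters locally and that shared stats change only when the process explicitly flushes.

```python
from collections import Counter


class ToyBackend:
    """Toy stand-in for a process with backend-local pending stats (not real PostgreSQL)."""

    def __init__(self, shared_stats):
        self.shared_stats = shared_stats  # what autovacuum would read
        self.pending = Counter()          # local counters, invisible to others

    def insert(self, table, rows=1):
        # DML only bumps the local, pending counter.
        self.pending[(table, "n_tup_ins")] += rows

    def report_stat(self):
        # Analogue of the flush: add pending counts to shared stats, then reset.
        self.shared_stats.update(self.pending)
        self.pending.clear()


shared = Counter()
worker = ToyBackend(shared)
key = ("net._http_response", "n_tup_ins")

worker.insert("net._http_response", rows=30)
before_flush = shared[key]   # shared stats have not seen the worker's writes yet

worker.report_stat()
after_flush = shared[key]    # the 30 inserts are now visible in shared stats
```

In this model, a worker that never calls `report_stat()` leaves the shared counters untouched indefinitely, which mirrors the bug this PR fixes.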

Tests

Two regression tests added to test/test_worker_behavior.py:

  1. `test_worker_writes_increment_pgstat_counters`: drives 30 requests through the worker and asserts `pg_stat_user_tables.n_tup_ins > 0` on `net._http_response`. Before the fix, this stays at 0; with the fix, it reflects the worker's INSERTs.
  2. `test_worker_writes_trigger_autoanalyze_on_http_response`: end-to-end test that proves the customer-impacting symptom is resolved. Configures `autovacuum_naptime=1s` plus low per-table thresholds, drives traffic, and waits for `autoanalyze_count > 0`.
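
The "waits for `autoanalyze_count > 0`" step can be sketched as a small polling helper. This is not the code in `test/test_worker_behavior.py`: `wait_for`, `STATS_SQL`, `SETUP_SQL`, and the specific threshold values are assumptions made for illustration, and the helper takes any zero-argument `fetch` callable so it does not depend on a particular database driver.

```python
import time

# Hypothetical setup statements; the PR only says "autovacuum_naptime=1s plus
# low per-table thresholds", so the per-table values here are assumed.
SETUP_SQL = [
    "ALTER SYSTEM SET autovacuum_naptime = '1s'",
    "SELECT pg_reload_conf()",
    "ALTER TABLE net._http_response SET ("
    "autovacuum_analyze_threshold = 1, autovacuum_analyze_scale_factor = 0)",
]

# Query a test could run on each poll to read the counters under test.
STATS_SQL = """
    SELECT n_tup_ins, autoanalyze_count
    FROM pg_stat_user_tables
    WHERE schemaname = 'net' AND relname = '_http_response'
"""


def wait_for(fetch, predicate, timeout=30.0, interval=0.5):
    """Call fetch() until predicate(result) is true or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        value = fetch()
        if predicate(value):
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s; last value: {value!r}")
        time.sleep(interval)
```

A test would pass a `fetch` that executes `STATS_SQL` on its own connection and a predicate such as `lambda row: row["autoanalyze_count"] > 0`. Polling with a deadline keeps the test from hanging if autoanalyze never fires, which is exactly the pre-fix behaviour.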

Linear: https://linear.app/supabase/issue/PSQL-1216

Background workers performing DML via SPI must call pgstat_report_stat()
themselves; regular user backends get this for free via tcop/postgres.c's
main loop. Without it, per-write counters (n_tup_ins, n_tup_del,
n_mod_since_analyze) for the worker's writes to net._http_response and
net.http_request_queue never reach shared stats. Autovacuum/autoanalyze
never see anything to vacuum, and net._http_response (written only by the
worker) silently bloats - eventually surfacing as 20-100s expiry DELETEs
on a bloated `created` index.

@za-arthur left a comment

Good catch! LGTM. I have just a suggestion below.

Comment thread test/test_worker_behavior.py
@utkarash2991 merged commit 3736d0f into master on May 18, 2026
28 checks passed
@utkarash2991 deleted the fix/pgstat-flush-in-worker branch on May 18, 2026 17:26