carrier: revert workersPerEndpoint to 3 (fixes upload throughput)#143
Merged
Conversation
…ession) Empirically restores v1.5 upload throughput. The 3→4 bump in eeab6fd ("fix(carrier): increase workersPerEndpoint from 3 to 4 for improved parallelism") instead caused a 5× regression in single-session upload, which is what users in #113, #117, and #123 have been complaining about. Bench (1 endpoint, no account labels, same machine), 3 runs each: Metric v1.5 v1.6 this fix throughput_up_8MB_1session 22 MB/s 4.5 MB/s 22 MB/s throughput_up_8MB_4sessions 81 MB/s 18 MB/s 87 MB/s sessions_per_sec 2.8 /s 50 /s 4.5 /s throughput_down_8MB_1session 19 MB/s 20 MB/s 20 MB/s ttfb_p50_p95 ~352 ms ~352 ms ~352 ms The fix Pareto-improves over v1.5 (same upload, slightly higher sessions/sec) and addresses the user-reported headline issue. The sessions/sec drop vs v1.6 is the v1.5 baseline behavior — see below. Why one extra worker hurt this much (root cause): When SOCKS5 dials a new session, the carrier marks synNeeded=true and kicks workers. With 3 workers, the SYN drain races the bench's subsequent conn.Write(N bytes), and *the writes usually land first* — EnqueueInitialData prepends them to txBuf so DrainTx emits a SYN frame carrying the connect_data payload (up to 256 KB) plus follow-up data frames in the same poll. Server processes the upload immediately; parallel polls pipeline the remaining bytes. With 4 workers, the SYN drain *wins* the race more often — the SYN frame ships alone before any write reaches the buffer. The data then has to follow in separate polls, each costing one ActiveDrainWindow (350 ms). For an 8 MB upload that's 4 sequential round-trips instead of one pipelined batch. Same logic harms every TLS handshake, every HTTP request — any flow that does "open connection, immediately send N bytes" loses its connect_data bundling. The real fix is to make EnqueueInitialData/connect_data more reliable (e.g. brief synNeeded hold-off until either data lands or a deadline elapses) — that would give us both this PR's upload throughput AND v1.6's session-open rate. Tracking that separately. This revert is the safe immediate fix for the user-visible bug.
This was referenced May 21, 2026
Kianmhz
added a commit
that referenced
this pull request
May 21, 2026
…iters (#147) EnqueueInitialData was implemented to prepend rather than append, on the assumption it would be called once before the SYN drains. But the SOCKS5 adapter calls it on every Write, and a fast local writer (the bench harness, or any application that issues several small Writes immediately after connecting) can fit many calls before a poll worker drains the SYN. Each prepend pushed the new data in front of what was already buffered, so the on-wire byte order ended up reversed. For any protocol whose leading bytes carry framing — a TLS record header, an HTTP request line, a length prefix — the upstream parses garbage or rejects the stream. The bug has existed since the connect_data optimization landed in v1.4. A specific consequence worth flagging: every upload-throughput bench result ever recorded for this project has been wrong. The bench's sized-sink reads the first 8 bytes as a size header. With prepend, those 8 bytes came from the LAST chunk written before the SYN drained — usually a body chunk full of zeros, decoded as size=0. The sink ACKed instantly without reading the body, and the harness measured the time-to-instant-ACK as upload throughput. That made v1.5 look like 22 MB/s when actual upload was ~5 MB/s. The "v1.5→v1.6 upload regression" narrative behind #143 was a bench artifact; the real per-session ceiling is `maxDrainFramesPerSession × MaxFramePayload / ActiveDrainWindow` ≈ 5.85 MB/s in both versions. The fix is one-line (append instead of prepend); the comment is the expensive part. Test added to catch the regression directly.
Kianmhz
added a commit
that referenced
this pull request
May 21, 2026
…aphore v1.6 changed the worker formula from v1.5's `workersPerEndpoint × len(endpoints)` to `(workersPerEndpoint + idleSlots - 1) × bucketCount`, where bucketCount is the number of distinct Google account labels. For configs without account labels — the most common legacy pattern, since most users never edited the deprecated v1.4 example config — every endpoint shares one empty-string bucket, so worker count collapses: config v1.5 v1.6 (incl #143) this PR ---------------------------------------------------------------------- 5 unlabeled endpoints 15 3 15 5 deployments across 5 labeled accounts 15 15 15 9 deployments across 5 accounts (#113) 27 15 27 #143 already shipped the right per-worker behavior (revert workersPerEndpoint to 3, fixes upload throughput at the bench's 1-endpoint config). But it does nothing about the worker count regression for multi-endpoint configs — those users still run with bucketCount-scaled workers. This PR restores v1.5's endpoint-count scaling AND replaces the global idle-slot counter with a per-account semaphore inside the picker. The semaphore keeps the anti-abuse cap (issue #56 — multiple deployments under one account couldn't sustain the v1.5 worker count's concurrency on the same Google account) while letting workers freely rotate across accounts. Two pickers now: - pickRelayEndpoint: blacklist-aware, no cap. Used for active polls carrying TX, which terminate quickly with the drained batch and don't camp an account's concurrency budget — matches v1.5 behavior. - pickIdleEndpoint: blacklist-aware AND requires a free idle slot in the candidate's bucket. Atomically reserves; pollOnce releases on return. Each unlabeled endpoint gets its own implicit bucket (key = "url:"+url) so legacy multi-endpoint configs no longer share one bucket with 1 idle slot — they get one slot per endpoint, like v1.5. Bench on the 1-endpoint config (the only one the harness can drive): throughput_up_8MB_1session 22.26 MB/s (unchanged from main) throughput_up_8MB_4sessions 87.17 MB/s (unchanged) sessions_per_sec 4.39 /s (unchanged, within noise) No regression on the configs the bench covers. The actual benefit lives in the worker-count table above — verifiable from the math, not the bench. A multi-endpoint bench scenario should be a follow-up so this kind of regression can be caught automatically next time.
Kianmhz
added a commit
that referenced
this pull request
May 21, 2026
…aphore (#146) v1.6 changed the worker formula from v1.5's `workersPerEndpoint × len(endpoints)` to `(workersPerEndpoint + idleSlots - 1) × bucketCount`, where bucketCount is the number of distinct Google account labels. For configs without account labels — the most common legacy pattern, since most users never edited the deprecated v1.4 example config — every endpoint shares one empty-string bucket, so worker count collapses: config v1.5 v1.6 (incl #143) this PR ---------------------------------------------------------------------- 5 unlabeled endpoints 15 3 15 5 deployments across 5 labeled accounts 15 15 15 9 deployments across 5 accounts (#113) 27 15 27 #143 already shipped the right per-worker behavior (revert workersPerEndpoint to 3, fixes upload throughput at the bench's 1-endpoint config). But it does nothing about the worker count regression for multi-endpoint configs — those users still run with bucketCount-scaled workers. This PR restores v1.5's endpoint-count scaling AND replaces the global idle-slot counter with a per-account semaphore inside the picker. The semaphore keeps the anti-abuse cap (issue #56 — multiple deployments under one account couldn't sustain the v1.5 worker count's concurrency on the same Google account) while letting workers freely rotate across accounts. Two pickers now: - pickRelayEndpoint: blacklist-aware, no cap. Used for active polls carrying TX, which terminate quickly with the drained batch and don't camp an account's concurrency budget — matches v1.5 behavior. - pickIdleEndpoint: blacklist-aware AND requires a free idle slot in the candidate's bucket. Atomically reserves; pollOnce releases on return. Each unlabeled endpoint gets its own implicit bucket (key = "url:"+url) so legacy multi-endpoint configs no longer share one bucket with 1 idle slot — they get one slot per endpoint, like v1.5. Bench on the 1-endpoint config (the only one the harness can drive): throughput_up_8MB_1session 22.26 MB/s (unchanged from main) throughput_up_8MB_4sessions 87.17 MB/s (unchanged) sessions_per_sec 4.39 /s (unchanged, within noise) No regression on the configs the bench covers. The actual benefit lives in the worker-count table above — verifiable from the math, not the bench. A multi-endpoint bench scenario should be a follow-up so this kind of regression can be caught automatically next time.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
One-line revert of eeab6fd (
workersPerEndpoint3→4). Bisected against v1.5.0 — this is the source of the upload-throughput complaints in #113, #117, and #123.Bench (1 endpoint, no account labels, same machine, 3 runs each)
throughput_up_8MB_1sessionthroughput_up_8MB_4sessionssessions_per_secthroughput_down_8MB_1sessionttfb_p50_p95This fix Pareto-improves over v1.5: same upload, slightly higher session-open rate. Sessions/sec drops vs v1.6 because v1.6's apparent gain there came from the same root-cause bug — see below.
Root cause (the interesting part)
The 3→4 worker bump unintentionally broke the
connect_dataoptimization. When SOCKS5 opens a new session, the carrier markssynNeeded=trueand kicks workers. With 3 workers, the SYN drain race lets the bench's subsequentconn.Write(N bytes)land in the session buffer before any worker drains.EnqueueInitialDataprepends those bytes totxBuf, soDrainTxemits a SYN frame carrying up to 256 KB of payload plus follow-up data frames — all in one poll. Parallel polls then pipeline the remaining bytes.With 4 workers, the SYN drain wins the race more often. The SYN ships alone before any write reaches the buffer. The data then has to follow in separate polls, each costing one
ActiveDrainWindow(350 ms). For an 8 MB upload that's 4 sequential round-trips instead of one pipelined batch — exactly what the bench measures (~1.78s vs ~360ms).This isn't a bench artifact. Same logic costs:
So the bisect line — "increase workersPerEndpoint from 3 to 4 for improved parallelism" — was a real regression, not just a benchmark quirk.
Follow-up
The proper fix is to make
EnqueueInitialData/connect_datamore reliable instead of relying on race timing — e.g. a brief hold onsynNeededuntil either the buffer gets at least one write or a small deadline (say 1 ms) elapses. That would give us both this PR's upload throughput and v1.6's session-open rate. I'll open a separate issue for tracking.Verification
🤖 Generated with Claude Code