carrier: revert workersPerEndpoint to 3 (fixes upload throughput) by Kianmhz · Pull Request #143 · Kianmhz/GooseRelayVPN

Kianmhz · 2026-05-21T00:08:12Z

Summary

One-line revert of eeab6fd (workersPerEndpoint 3→4). Bisected against v1.5.0 — this is the source of the upload-throughput complaints in #113, #117, and #123.

Bench (1 endpoint, no account labels, same machine, 3 runs each)

Metric	v1.5.0	v1.6.0	this fix
`throughput_up_8MB_1session`	22 MB/s	4.5 MB/s	22 MB/s
`throughput_up_8MB_4sessions`	81 MB/s	18 MB/s	87 MB/s
`sessions_per_sec`	2.8 /s	50 /s	4.5 /s
`throughput_down_8MB_1session`	19 MB/s	20 MB/s	20 MB/s
`ttfb_p50_p95`	~352 ms	~352 ms	~352 ms

This fix Pareto-improves over v1.5: same upload, slightly higher session-open rate. Sessions/sec drops vs v1.6 because v1.6's apparent gain there came from the same root-cause bug — see below.

Root cause (the interesting part)

The 3→4 worker bump unintentionally broke the connect_data optimization. When SOCKS5 opens a new session, the carrier marks synNeeded=true and kicks workers. With 3 workers, the SYN drain race lets the bench's subsequent conn.Write(N bytes) land in the session buffer before any worker drains. EnqueueInitialData prepends those bytes to txBuf, so DrainTx emits a SYN frame carrying up to 256 KB of payload plus follow-up data frames — all in one poll. Parallel polls then pipeline the remaining bytes.

With 4 workers, the SYN drain wins the race more often. The SYN ships alone before any write reaches the buffer. The data then has to follow in separate polls, each costing one ActiveDrainWindow (350 ms). For an 8 MB upload that's 4 sequential round-trips instead of one pipelined batch — exactly what the bench measures (~1.78s vs ~360ms).

This isn't a bench artifact. Same logic costs:

Every TLS handshake (ClientHello within microseconds of TCP connect) loses its bundling — TTFB roughly doubles.
Every HTTP/1.1 request — the request line is sent immediately after open, normally bundled.
Any "open + immediately send" flow.

So the bisect line — "increase workersPerEndpoint from 3 to 4 for improved parallelism" — was a real regression, not just a benchmark quirk.

Follow-up

The proper fix is to make EnqueueInitialData/connect_data more reliable instead of relying on race timing — e.g. a brief hold on synNeeded until either the buffer gets at least one write or a small deadline (say 1 ms) elapses. That would give us both this PR's upload throughput and v1.6's session-open rate. I'll open a separate issue for tracking.

Verification

`go test -count=1 -timeout 90s ./...` — all green
`go vet ./...` — clean
`./bench/bench.sh --baseline v1.6.0` — only `sessions_per_sec` regresses (expected, see above)
`./bench/bench.sh --baseline v1.5.0` — matches v1.5 on upload, beats it on download/sessions/TTFB

🤖 Generated with Claude Code

…ession) Empirically restores v1.5 upload throughput. The 3→4 bump in eeab6fd ("fix(carrier): increase workersPerEndpoint from 3 to 4 for improved parallelism") instead caused a 5× regression in single-session upload, which is what users in #113, #117, and #123 have been complaining about. Bench (1 endpoint, no account labels, same machine), 3 runs each: Metric v1.5 v1.6 this fix throughput_up_8MB_1session 22 MB/s 4.5 MB/s 22 MB/s throughput_up_8MB_4sessions 81 MB/s 18 MB/s 87 MB/s sessions_per_sec 2.8 /s 50 /s 4.5 /s throughput_down_8MB_1session 19 MB/s 20 MB/s 20 MB/s ttfb_p50_p95 ~352 ms ~352 ms ~352 ms The fix Pareto-improves over v1.5 (same upload, slightly higher sessions/sec) and addresses the user-reported headline issue. The sessions/sec drop vs v1.6 is the v1.5 baseline behavior — see below. Why one extra worker hurt this much (root cause): When SOCKS5 dials a new session, the carrier marks synNeeded=true and kicks workers. With 3 workers, the SYN drain races the bench's subsequent conn.Write(N bytes), and *the writes usually land first* — EnqueueInitialData prepends them to txBuf so DrainTx emits a SYN frame carrying the connect_data payload (up to 256 KB) plus follow-up data frames in the same poll. Server processes the upload immediately; parallel polls pipeline the remaining bytes. With 4 workers, the SYN drain *wins* the race more often — the SYN frame ships alone before any write reaches the buffer. The data then has to follow in separate polls, each costing one ActiveDrainWindow (350 ms). For an 8 MB upload that's 4 sequential round-trips instead of one pipelined batch. Same logic harms every TLS handshake, every HTTP request — any flow that does "open connection, immediately send N bytes" loses its connect_data bundling. The real fix is to make EnqueueInitialData/connect_data more reliable (e.g. brief synNeeded hold-off until either data lands or a deadline elapses) — that would give us both this PR's upload throughput AND v1.6's session-open rate. Tracking that separately. This revert is the safe immediate fix for the user-visible bug.

…iters (#147) EnqueueInitialData was implemented to prepend rather than append, on the assumption it would be called once before the SYN drains. But the SOCKS5 adapter calls it on every Write, and a fast local writer (the bench harness, or any application that issues several small Writes immediately after connecting) can fit many calls before a poll worker drains the SYN. Each prepend pushed the new data in front of what was already buffered, so the on-wire byte order ended up reversed. For any protocol whose leading bytes carry framing — a TLS record header, an HTTP request line, a length prefix — the upstream parses garbage or rejects the stream. The bug has existed since the connect_data optimization landed in v1.4. A specific consequence worth flagging: every upload-throughput bench result ever recorded for this project has been wrong. The bench's sized-sink reads the first 8 bytes as a size header. With prepend, those 8 bytes came from the LAST chunk written before the SYN drained — usually a body chunk full of zeros, decoded as size=0. The sink ACKed instantly without reading the body, and the harness measured the time-to-instant-ACK as upload throughput. That made v1.5 look like 22 MB/s when actual upload was ~5 MB/s. The "v1.5→v1.6 upload regression" narrative behind #143 was a bench artifact; the real per-session ceiling is `maxDrainFramesPerSession × MaxFramePayload / ActiveDrainWindow` ≈ 5.85 MB/s in both versions. The fix is one-line (append instead of prepend); the comment is the expensive part. Test added to catch the regression directly.

…aphore v1.6 changed the worker formula from v1.5's `workersPerEndpoint × len(endpoints)` to `(workersPerEndpoint + idleSlots - 1) × bucketCount`, where bucketCount is the number of distinct Google account labels. For configs without account labels — the most common legacy pattern, since most users never edited the deprecated v1.4 example config — every endpoint shares one empty-string bucket, so worker count collapses: config v1.5 v1.6 (incl #143) this PR ---------------------------------------------------------------------- 5 unlabeled endpoints 15 3 15 5 deployments across 5 labeled accounts 15 15 15 9 deployments across 5 accounts (#113) 27 15 27 #143 already shipped the right per-worker behavior (revert workersPerEndpoint to 3, fixes upload throughput at the bench's 1-endpoint config). But it does nothing about the worker count regression for multi-endpoint configs — those users still run with bucketCount-scaled workers. This PR restores v1.5's endpoint-count scaling AND replaces the global idle-slot counter with a per-account semaphore inside the picker. The semaphore keeps the anti-abuse cap (issue #56 — multiple deployments under one account couldn't sustain the v1.5 worker count's concurrency on the same Google account) while letting workers freely rotate across accounts. Two pickers now: - pickRelayEndpoint: blacklist-aware, no cap. Used for active polls carrying TX, which terminate quickly with the drained batch and don't camp an account's concurrency budget — matches v1.5 behavior. - pickIdleEndpoint: blacklist-aware AND requires a free idle slot in the candidate's bucket. Atomically reserves; pollOnce releases on return. Each unlabeled endpoint gets its own implicit bucket (key = "url:"+url) so legacy multi-endpoint configs no longer share one bucket with 1 idle slot — they get one slot per endpoint, like v1.5. Bench on the 1-endpoint config (the only one the harness can drive): throughput_up_8MB_1session 22.26 MB/s (unchanged from main) throughput_up_8MB_4sessions 87.17 MB/s (unchanged) sessions_per_sec 4.39 /s (unchanged, within noise) No regression on the configs the bench covers. The actual benefit lives in the worker-count table above — verifiable from the math, not the bench. A multi-endpoint bench scenario should be a follow-up so this kind of regression can be caught automatically next time.

…aphore (#146) v1.6 changed the worker formula from v1.5's `workersPerEndpoint × len(endpoints)` to `(workersPerEndpoint + idleSlots - 1) × bucketCount`, where bucketCount is the number of distinct Google account labels. For configs without account labels — the most common legacy pattern, since most users never edited the deprecated v1.4 example config — every endpoint shares one empty-string bucket, so worker count collapses: config v1.5 v1.6 (incl #143) this PR ---------------------------------------------------------------------- 5 unlabeled endpoints 15 3 15 5 deployments across 5 labeled accounts 15 15 15 9 deployments across 5 accounts (#113) 27 15 27 #143 already shipped the right per-worker behavior (revert workersPerEndpoint to 3, fixes upload throughput at the bench's 1-endpoint config). But it does nothing about the worker count regression for multi-endpoint configs — those users still run with bucketCount-scaled workers. This PR restores v1.5's endpoint-count scaling AND replaces the global idle-slot counter with a per-account semaphore inside the picker. The semaphore keeps the anti-abuse cap (issue #56 — multiple deployments under one account couldn't sustain the v1.5 worker count's concurrency on the same Google account) while letting workers freely rotate across accounts. Two pickers now: - pickRelayEndpoint: blacklist-aware, no cap. Used for active polls carrying TX, which terminate quickly with the drained batch and don't camp an account's concurrency budget — matches v1.5 behavior. - pickIdleEndpoint: blacklist-aware AND requires a free idle slot in the candidate's bucket. Atomically reserves; pollOnce releases on return. Each unlabeled endpoint gets its own implicit bucket (key = "url:"+url) so legacy multi-endpoint configs no longer share one bucket with 1 idle slot — they get one slot per endpoint, like v1.5. Bench on the 1-endpoint config (the only one the harness can drive): throughput_up_8MB_1session 22.26 MB/s (unchanged from main) throughput_up_8MB_4sessions 87.17 MB/s (unchanged) sessions_per_sec 4.39 /s (unchanged, within noise) No regression on the configs the bench covers. The actual benefit lives in the worker-count table above — verifiable from the math, not the bench. A multi-endpoint bench scenario should be a follow-up so this kind of regression can be caught automatically next time.

Kianmhz mentioned this pull request May 21, 2026

carrier: connect_data optimization is timing-sensitive (race between SYN drain and first write) #144

Closed

Kianmhz merged commit 967c385 into main May 21, 2026
6 checks passed

Kianmhz deleted the fix/workers-per-endpoint-throughput branch May 21, 2026 00:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

carrier: revert workersPerEndpoint to 3 (fixes upload throughput)#143

carrier: revert workersPerEndpoint to 3 (fixes upload throughput)#143
Kianmhz merged 1 commit into
mainfrom
fix/workers-per-endpoint-throughput

Kianmhz commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Kianmhz commented May 21, 2026

Summary

Bench (1 endpoint, no account labels, same machine, 3 runs each)

Root cause (the interesting part)

Follow-up

Verification

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant