Skip to content

carrier: revert workersPerEndpoint to 3 (fixes upload throughput)#143

Merged
Kianmhz merged 1 commit into
mainfrom
fix/workers-per-endpoint-throughput
May 21, 2026
Merged

carrier: revert workersPerEndpoint to 3 (fixes upload throughput)#143
Kianmhz merged 1 commit into
mainfrom
fix/workers-per-endpoint-throughput

Conversation

@Kianmhz
Copy link
Copy Markdown
Owner

@Kianmhz Kianmhz commented May 21, 2026

Summary

One-line revert of eeab6fd (workersPerEndpoint 3→4). Bisected against v1.5.0 — this is the source of the upload-throughput complaints in #113, #117, and #123.

Bench (1 endpoint, no account labels, same machine, 3 runs each)

Metric v1.5.0 v1.6.0 this fix
throughput_up_8MB_1session 22 MB/s 4.5 MB/s 22 MB/s
throughput_up_8MB_4sessions 81 MB/s 18 MB/s 87 MB/s
sessions_per_sec 2.8 /s 50 /s 4.5 /s
throughput_down_8MB_1session 19 MB/s 20 MB/s 20 MB/s
ttfb_p50_p95 ~352 ms ~352 ms ~352 ms

This fix Pareto-improves over v1.5: same upload, slightly higher session-open rate. Sessions/sec drops vs v1.6 because v1.6's apparent gain there came from the same root-cause bug — see below.

Root cause (the interesting part)

The 3→4 worker bump unintentionally broke the connect_data optimization. When SOCKS5 opens a new session, the carrier marks synNeeded=true and kicks workers. With 3 workers, the SYN drain race lets the bench's subsequent conn.Write(N bytes) land in the session buffer before any worker drains. EnqueueInitialData prepends those bytes to txBuf, so DrainTx emits a SYN frame carrying up to 256 KB of payload plus follow-up data frames — all in one poll. Parallel polls then pipeline the remaining bytes.

With 4 workers, the SYN drain wins the race more often. The SYN ships alone before any write reaches the buffer. The data then has to follow in separate polls, each costing one ActiveDrainWindow (350 ms). For an 8 MB upload that's 4 sequential round-trips instead of one pipelined batch — exactly what the bench measures (~1.78s vs ~360ms).

This isn't a bench artifact. Same logic costs:

  • Every TLS handshake (ClientHello within microseconds of TCP connect) loses its bundling — TTFB roughly doubles.
  • Every HTTP/1.1 request — the request line is sent immediately after open, normally bundled.
  • Any "open + immediately send" flow.

So the bisect line — "increase workersPerEndpoint from 3 to 4 for improved parallelism" — was a real regression, not just a benchmark quirk.

Follow-up

The proper fix is to make EnqueueInitialData/connect_data more reliable instead of relying on race timing — e.g. a brief hold on synNeeded until either the buffer gets at least one write or a small deadline (say 1 ms) elapses. That would give us both this PR's upload throughput and v1.6's session-open rate. I'll open a separate issue for tracking.

Verification

  • `go test -count=1 -timeout 90s ./...` — all green
  • `go vet ./...` — clean
  • `./bench/bench.sh --baseline v1.6.0` — only `sessions_per_sec` regresses (expected, see above)
  • `./bench/bench.sh --baseline v1.5.0` — matches v1.5 on upload, beats it on download/sessions/TTFB

🤖 Generated with Claude Code

…ession)

Empirically restores v1.5 upload throughput. The 3→4 bump in eeab6fd
("fix(carrier): increase workersPerEndpoint from 3 to 4 for improved
parallelism") instead caused a 5× regression in single-session upload,
which is what users in #113, #117, and #123 have been complaining about.

Bench (1 endpoint, no account labels, same machine), 3 runs each:

  Metric                          v1.5      v1.6      this fix
  throughput_up_8MB_1session      22 MB/s   4.5 MB/s  22 MB/s
  throughput_up_8MB_4sessions     81 MB/s   18 MB/s   87 MB/s
  sessions_per_sec                2.8 /s    50 /s     4.5 /s
  throughput_down_8MB_1session    19 MB/s   20 MB/s   20 MB/s
  ttfb_p50_p95                    ~352 ms   ~352 ms   ~352 ms

The fix Pareto-improves over v1.5 (same upload, slightly higher
sessions/sec) and addresses the user-reported headline issue. The
sessions/sec drop vs v1.6 is the v1.5 baseline behavior — see below.

Why one extra worker hurt this much (root cause):

When SOCKS5 dials a new session, the carrier marks synNeeded=true and
kicks workers. With 3 workers, the SYN drain races the bench's
subsequent conn.Write(N bytes), and *the writes usually land first* —
EnqueueInitialData prepends them to txBuf so DrainTx emits a SYN frame
carrying the connect_data payload (up to 256 KB) plus follow-up data
frames in the same poll. Server processes the upload immediately;
parallel polls pipeline the remaining bytes.

With 4 workers, the SYN drain *wins* the race more often — the SYN
frame ships alone before any write reaches the buffer. The data then
has to follow in separate polls, each costing one ActiveDrainWindow
(350 ms). For an 8 MB upload that's 4 sequential round-trips instead
of one pipelined batch. Same logic harms every TLS handshake, every
HTTP request — any flow that does "open connection, immediately send
N bytes" loses its connect_data bundling.

The real fix is to make EnqueueInitialData/connect_data more reliable
(e.g. brief synNeeded hold-off until either data lands or a deadline
elapses) — that would give us both this PR's upload throughput AND
v1.6's session-open rate. Tracking that separately. This revert is the
safe immediate fix for the user-visible bug.
@Kianmhz Kianmhz merged commit 967c385 into main May 21, 2026
6 checks passed
@Kianmhz Kianmhz deleted the fix/workers-per-endpoint-throughput branch May 21, 2026 00:12
Kianmhz added a commit that referenced this pull request May 21, 2026
…iters (#147)

EnqueueInitialData was implemented to prepend rather than append, on the
assumption it would be called once before the SYN drains. But the SOCKS5
adapter calls it on every Write, and a fast local writer (the bench harness,
or any application that issues several small Writes immediately after
connecting) can fit many calls before a poll worker drains the SYN. Each
prepend pushed the new data in front of what was already buffered, so the
on-wire byte order ended up reversed.

For any protocol whose leading bytes carry framing — a TLS record header, an
HTTP request line, a length prefix — the upstream parses garbage or rejects
the stream. The bug has existed since the connect_data optimization landed
in v1.4.

A specific consequence worth flagging: every upload-throughput bench result
ever recorded for this project has been wrong. The bench's sized-sink reads
the first 8 bytes as a size header. With prepend, those 8 bytes came from
the LAST chunk written before the SYN drained — usually a body chunk full
of zeros, decoded as size=0. The sink ACKed instantly without reading the
body, and the harness measured the time-to-instant-ACK as upload throughput.
That made v1.5 look like 22 MB/s when actual upload was ~5 MB/s. The
"v1.5→v1.6 upload regression" narrative behind #143 was a bench artifact;
the real per-session ceiling is `maxDrainFramesPerSession × MaxFramePayload /
ActiveDrainWindow` ≈ 5.85 MB/s in both versions.

The fix is one-line (append instead of prepend); the comment is the
expensive part. Test added to catch the regression directly.
Kianmhz added a commit that referenced this pull request May 21, 2026
…aphore

v1.6 changed the worker formula from v1.5's `workersPerEndpoint × len(endpoints)`
to `(workersPerEndpoint + idleSlots - 1) × bucketCount`, where bucketCount is
the number of distinct Google account labels. For configs without account
labels — the most common legacy pattern, since most users never edited the
deprecated v1.4 example config — every endpoint shares one empty-string bucket,
so worker count collapses:

  config                                  v1.5  v1.6 (incl #143)  this PR
  ----------------------------------------------------------------------
  5 unlabeled endpoints                     15                 3       15
  5 deployments across 5 labeled accounts   15                15       15
  9 deployments across 5 accounts (#113)    27                15       27

#143 already shipped the right per-worker behavior (revert workersPerEndpoint
to 3, fixes upload throughput at the bench's 1-endpoint config). But it does
nothing about the worker count regression for multi-endpoint configs — those
users still run with bucketCount-scaled workers.

This PR restores v1.5's endpoint-count scaling AND replaces the global idle-slot
counter with a per-account semaphore inside the picker. The semaphore keeps the
anti-abuse cap (issue #56 — multiple deployments under one account couldn't
sustain the v1.5 worker count's concurrency on the same Google account) while
letting workers freely rotate across accounts.

Two pickers now:
- pickRelayEndpoint: blacklist-aware, no cap. Used for active polls carrying
  TX, which terminate quickly with the drained batch and don't camp an
  account's concurrency budget — matches v1.5 behavior.
- pickIdleEndpoint: blacklist-aware AND requires a free idle slot in the
  candidate's bucket. Atomically reserves; pollOnce releases on return.

Each unlabeled endpoint gets its own implicit bucket (key = "url:"+url) so
legacy multi-endpoint configs no longer share one bucket with 1 idle slot —
they get one slot per endpoint, like v1.5.

Bench on the 1-endpoint config (the only one the harness can drive):
  throughput_up_8MB_1session    22.26 MB/s (unchanged from main)
  throughput_up_8MB_4sessions   87.17 MB/s (unchanged)
  sessions_per_sec              4.39 /s    (unchanged, within noise)

No regression on the configs the bench covers. The actual benefit lives in
the worker-count table above — verifiable from the math, not the bench. A
multi-endpoint bench scenario should be a follow-up so this kind of
regression can be caught automatically next time.
Kianmhz added a commit that referenced this pull request May 21, 2026
…aphore (#146)

v1.6 changed the worker formula from v1.5's `workersPerEndpoint × len(endpoints)`
to `(workersPerEndpoint + idleSlots - 1) × bucketCount`, where bucketCount is
the number of distinct Google account labels. For configs without account
labels — the most common legacy pattern, since most users never edited the
deprecated v1.4 example config — every endpoint shares one empty-string bucket,
so worker count collapses:

  config                                  v1.5  v1.6 (incl #143)  this PR
  ----------------------------------------------------------------------
  5 unlabeled endpoints                     15                 3       15
  5 deployments across 5 labeled accounts   15                15       15
  9 deployments across 5 accounts (#113)    27                15       27

#143 already shipped the right per-worker behavior (revert workersPerEndpoint
to 3, fixes upload throughput at the bench's 1-endpoint config). But it does
nothing about the worker count regression for multi-endpoint configs — those
users still run with bucketCount-scaled workers.

This PR restores v1.5's endpoint-count scaling AND replaces the global idle-slot
counter with a per-account semaphore inside the picker. The semaphore keeps the
anti-abuse cap (issue #56 — multiple deployments under one account couldn't
sustain the v1.5 worker count's concurrency on the same Google account) while
letting workers freely rotate across accounts.

Two pickers now:
- pickRelayEndpoint: blacklist-aware, no cap. Used for active polls carrying
  TX, which terminate quickly with the drained batch and don't camp an
  account's concurrency budget — matches v1.5 behavior.
- pickIdleEndpoint: blacklist-aware AND requires a free idle slot in the
  candidate's bucket. Atomically reserves; pollOnce releases on return.

Each unlabeled endpoint gets its own implicit bucket (key = "url:"+url) so
legacy multi-endpoint configs no longer share one bucket with 1 idle slot —
they get one slot per endpoint, like v1.5.

Bench on the 1-endpoint config (the only one the harness can drive):
  throughput_up_8MB_1session    22.26 MB/s (unchanged from main)
  throughput_up_8MB_4sessions   87.17 MB/s (unchanged)
  sessions_per_sec              4.39 /s    (unchanged, within noise)

No regression on the configs the bench covers. The actual benefit lives in
the worker-count table above — verifiable from the math, not the bench. A
multi-endpoint bench scenario should be a follow-up so this kind of
regression can be caught automatically next time.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant