Adding blog post for architecting for 1b rps #533
@stockholmux @madolson can we post this on May 7th?
@crystalphamLF bumping to get this back into the review queue
stockholmux
left a comment
A few minor issues.
Otherwise:
- The diagrams are weird - they look like a UI and it should be clear that they're really an infographic.
- Make the title more closely align to the content
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (1)
📝 Walkthrough
A new blog post titled "Managing Connection Storms in Valkey at Scale" is added to the content directory. The article explains failure modes from connection storms at fleet scale, describes Valkey's architectural mitigations including I/O threading and TLS offload, covers cluster-internal throttling and proxy-layer considerations, and concludes with operational safeguards from production case studies.
Changes
Managing Connection Storms in Valkey at Scale
Signed-off-by: Allen Helton <allenheltondev@gmail.com>
Signed-off-by: Allen Helton <allenheltondev@gmail.com>
5ab52f4 to e95a43e
Actionable comments posted: 2
🧹 Nitpick comments (1)
content/blog/2026-05-05-architecting-for-1b-rps/index.md (1)
47-47: 💤 Low value: Consider simplifying word choice.
The style checker suggests "exactly" may be overused. Consider simplifying to "which is what this threading work frees up" or "which this threading work directly addresses."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@content/blog/2026-05-05-architecting-for-1b-rps/index.md` at line 47, The phrase "which is exactly what this threading work frees up" is flagged for wordiness—replace "which is exactly what" with a simpler construction such as "which is what this threading work frees up" or "which this threading work directly addresses" in the sentence starting "The main thread is the bottleneck..." to reduce redundancy and improve clarity while keeping the rest of the paragraph unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@content/blog/2026-05-05-architecting-for-1b-rps/index.md`:
- Line 4: The post metadata 'date' field (date = 2026-05-25) doesn't match the
containing directory name (2026-05-05) and earlier PR note about May 7; choose
the intended publication date and make them consistent by either updating the
frontmatter 'date' value to match the directory (e.g., 2026-05-05 or 2026-05-07)
or renaming the directory to match the frontmatter (2026-05-25); ensure any
references to the post URL/slug that derive from the directory are updated
accordingly and verify the final chosen date is correct before merging.
- Around line 106-113: Update the text to remove the incorrect claim that
"Dual-channel replication is on by default in Valkey 9.0" and instead state the
correct default status (it was not enabled by default in 9.0 and the change was
retargeted to 10), and add a note that dual-channel replication requires the
primary to have repl-diskless-sync enabled; specifically, adjust the paragraph
referencing "Dual-channel replication is on by default in Valkey 9.0", keep the
config example for dual-channel-replication-enabled, and append a brief sentence
mentioning the prerequisite repl-diskless-sync on the primary so readers know
both the runtime flag (dual-channel-replication-enabled) and the required
primary setting (repl-diskless-sync).
---
Nitpick comments:
In `@content/blog/2026-05-05-architecting-for-1b-rps/index.md`:
- Line 47: The phrase "which is exactly what this threading work frees up" is
flagged for wordiness—replace "which is exactly what" with a simpler
construction such as "which is what this threading work frees up" or "which this
threading work directly addresses" in the sentence starting "The main thread is
the bottleneck..." to reduce redundancy and improve clarity while keeping the
rest of the paragraph unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: d22c9668-8cf7-40db-a412-26e4b7d79083
📒 Files selected for processing (3)
content/blog/2026-05-05-architecting-for-1b-rps/architecting_for_1b_1.webp
content/blog/2026-05-05-architecting-for-1b-rps/architecting_for_1b_2.webp
content/blog/2026-05-05-architecting-for-1b-rps/index.md
♻️ Duplicate comments (2)
content/blog/2026-05-05-architecting-for-1b-rps/index.md (2)
106-113: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Fix incorrect default-status claim for dual-channel replication in Valkey 9.0.
Line 106 states it is “on by default in Valkey 9.0,” but Valkey 9.0 config defaults set `dual-channel-replication-enabled` to `no`. Please correct this statement and adjust Line 108 wording so it doesn’t imply only pre-9.0 users need to enable it.
Suggested minimal wording update
- Valkey 8.0 introduced [dual-channel replication](https://valkey.io/blog/valkey-8-0-0-rc1/#replication) to address this directly. Rather than a single connection handling both the RDB transfer and the incremental command stream, dual-channel replication opens a dedicated RDB channel for bulk data while maintaining a separate main channel for streaming updates. The replica buffers incremental updates locally while loading the snapshot and applies them once loading completes. COB pressure on the primary drops substantially because the primary no longer needs to buffer the entire write stream for the duration of a large sync. In scenarios with heavy read commands, sync time can be cut by up to 50%. Dual-channel replication is on by default in Valkey 9.0.
+ Valkey 8.0 introduced [dual-channel replication](https://valkey.io/blog/valkey-8-0-0-rc1/#replication) to address this directly. Rather than a single connection handling both the RDB transfer and the incremental command stream, dual-channel replication opens a dedicated RDB channel for bulk data while maintaining a separate main channel for streaming updates. The replica buffers incremental updates locally while loading the snapshot and applies them once loading completes. COB pressure on the primary drops substantially because the primary no longer needs to buffer the entire write stream for the duration of a large sync. In scenarios with heavy read commands, sync time can be cut by up to 50%.
@@
- For teams not yet on Valkey 9.0, enabling dual-channel replication is a single config line. The flag can be toggled at runtime without a restart, which makes it straightforward to validate against a live workload before committing:
+ Enabling dual-channel replication is a single config line. The flag can be toggled at runtime without a restart, which makes it straightforward to validate against a live workload before committing:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@content/blog/2026-05-05-architecting-for-1b-rps/index.md` around lines 106 - 113, The post incorrectly claims "Dual-channel replication is on by default in Valkey 9.0"; update the text to reflect the actual default (it defaults to no) and reword the following sentence so it doesn't imply only pre-9.0 users need to enable it—mention that the default is disabled in 9.0 and that teams can toggle the feature at runtime via the dual-channel-replication-enabled setting (e.g., explain that setting dual-channel-replication-enabled yes enables it without restart).
4-4: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Align publish date with folder slug date.
Line 4 (`2026-05-25`) conflicts with the dated folder path (`2026-05-05-...`), which can create confusing publication metadata/URL expectations. Please make these consistent before merge.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@content/blog/2026-05-05-architecting-for-1b-rps/index.md` at line 4, The frontmatter "date" key currently reads 2026-05-25 but must match the folder slug date 2026-05-05; update the "date" value to 2026-05-05 (or alternatively rename the folder slug to 2026-05-25) so the post's metadata and URL are consistent, targeting the "date" field in the file.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@content/blog/2026-05-05-architecting-for-1b-rps/index.md`:
- Around line 106-113: The post incorrectly claims "Dual-channel replication is
on by default in Valkey 9.0"; update the text to reflect the actual default (it
defaults to no) and reword the following sentence so it doesn't imply only
pre-9.0 users need to enable it—mention that the default is disabled in 9.0 and
that teams can toggle the feature at runtime via the
dual-channel-replication-enabled setting (e.g., explain that setting
dual-channel-replication-enabled yes enables it without restart).
- Line 4: The frontmatter "date" key currently reads 2026-05-25 but must match
the folder slug date 2026-05-05; update the "date" value to 2026-05-05 (or
alternatively rename the folder slug to 2026-05-25) so the post's metadata and
URL are consistent, targeting the "date" field in the file.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 5ce0444d-3615-4a5c-9271-6bfb7a8fe1dc
📒 Files selected for processing (3)
content/blog/2026-05-05-architecting-for-1b-rps/architecting_for_1b_1.webp
content/blog/2026-05-05-architecting-for-1b-rps/architecting_for_1b_2.webp
content/blog/2026-05-05-architecting-for-1b-rps/index.md
Signed-off-by: Allen Helton <allenheltondev@gmail.com>
Got all the requested changes in, thank you for your thoughtful review!
@stockholmux @crystalphamLF can we get this out this week? Happy to change the date to whatever works for the publishing timeline
@madolson @stockholmux can we get this pushed out next Tuesday, June 2?
ksmotiv8
left a comment
- Overall: Really love the tone and technical depth of this piece. It feels grounded in real-world operational lessons instead of theoretical architecture advice. The production anecdotes and failure analysis make it much stronger than the typical infrastructure blog post.
- Line ~17 (Uber runs roughly 1 billion RPS across 2,000 clusters): Uber: the connection-storm lessons stand on their own — let’s just make sure we’re not accidentally implying Valkey adoption anywhere.
- Line ~33 (Uber's immediate fix was to use iptables rules): The iptables workaround is more of an emergency mitigation than a real solution. You already hint at that below, but I’d tighten the wording so nobody interprets this as a recommended steady-state pattern.
- Line ~55 (### The proxy can be the storm too): This currently reads a little anti-proxy, but that’s really a specific proxy's implementation issue, not a proxy issue in general. Well-designed proxies can actually mitigate exactly these classes of failures. I’d reframe the takeaway as: poorly designed proxies amplify storms, while good proxies reduce connection fanout and isolate failures. This could also be a nice place to reference the NxM connection-management discussion.
- Line ~61 (...will look worse in a benchmark...): “will look worse” → “may look worse”
- Line ~67 (...directly reduces the p in the n x p equation): Multiplexing helps reduce connection fanout, but it also introduces head-of-line blocking tradeoffs, especially when a proxy is multiplexing across multiple backend clusters. Worth acknowledging the tradeoff so we don’t oversell it.
- Line ~71 (### Write throttling at 95% CPU): This is essentially load shedding. I’d call that out explicitly because it’s a well-understood systems pattern. The power-grid analogy works really well here too: controlled brownouts are better than a total grid collapse.
- Line ~93 (## What to carry out of this): I’d rework the closing around Werner Vogels’ famous line: “Everything fails, all the time.” Then build toward the idea that benchmarks only measure the happy path, while production systems are defined by how they behave during failure. Planning for the failures you can anticipate is often the difference between a catastrophic outage and graceful degradation. Uber and Snap learned many of these lessons the hard way, in production, often after outages. The real takeaway is not to copy exact mitigations blindly, but to internalize the operational mindset behind them.
Signed-off-by: Allen Helton <allenheltondev@gmail.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@content/blog/2026-05-28-architecting-for-1b-rps/index.md`:
- Line 4: Update the frontmatter `date` value in this post from `2026-05-28` to
the requested publish date `2026-06-02`; locate the `date` field at the top of
the markdown (the frontmatter key `date`) in the file `index.md` and change its
value to `2026-06-02` so the post will be published on the correct Tuesday
schedule.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: f49976c3-3da8-4f20-96a9-8c0e3694b230
📒 Files selected for processing (3)
content/blog/2026-05-28-architecting-for-1b-rps/architecting_for_1b_1.webp
content/blog/2026-05-28-architecting-for-1b-rps/architecting_for_1b_2.webp
content/blog/2026-05-28-architecting-for-1b-rps/index.md
| Filename | Overview |
|---|---|
| content/blog/2026-06-02-architecting-for-1b-rps/index.md | New blog post on connection storms and Valkey scale; two technical claims need verification: the dual-channel activation condition omits a required config flag, and the INFO replication field names cited for monitoring don't appear in official documentation. |
| content/blog/2026-06-02-architecting-for-1b-rps/architecting_for_1b_1.webp | Image asset for connection storm diagram; no issues. |
| content/blog/2026-06-02-architecting-for-1b-rps/architecting_for_1b_2.webp | Image asset for Valkey 7.2 vs 8.0 throughput/latency benchmark chart; no issues. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Node becomes briefly unavailable] --> B[All N service instances detect I/O error]
B --> C[Each instance closes connection pool]
C --> D[NxP simultaneous reconnect attempts]
D --> E{Node CPU}
E -->|Amplified load| F[100% CPU - node stops responding]
F --> G[More timeouts - more retries]
G --> D
D -->|Mitigation: iptables cut| H[Retry amplification stopped]
H --> I[Node recovers]
E -->|Valkey 8.0 IO threading| J[Main thread free for commands]
E -->|Valkey 8.1 TLS offload| K[+300% new connection acceptance]
E -->|Valkey 9.0 cluster throttle| L[Intra-cluster reconnect rate limited]
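
The reconnect step in the flowchart (N service instances each reopening a pool of P connections) can be sized with simple arithmetic. The following is a minimal sketch, not taken from the post: the function names and the fleet numbers are hypothetical, and it assumes the worst case in which every instance drops its whole pool and every retry round times out.

```python
def simultaneous_reconnects(instances: int, pool_size: int) -> int:
    """N instances each reopening a pool of P connections: N x P attempts."""
    if instances < 0 or pool_size < 0:
        raise ValueError("instances and pool_size must be non-negative")
    return instances * pool_size


def attempts_over_retries(instances: int, pool_size: int, retry_rounds: int) -> int:
    """Total attempts if every round times out and the full pool retries again."""
    if retry_rounds < 0:
        raise ValueError("retry_rounds must be non-negative")
    return simultaneous_reconnects(instances, pool_size) * retry_rounds


if __name__ == "__main__":
    # Hypothetical fleet: 1,000 instances with 10 pooled connections each.
    print(simultaneous_reconnects(1_000, 10))
    # Same fleet after three retry rounds that all time out.
    print(attempts_over_retries(1_000, 10, 3))
```

Under these assumptions the load on the recovering node grows linearly with each failed retry round, which is the feedback loop the G --> D edge describes.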
Reviews (2): Last reviewed commit: "change to new date, and updates from cod..."
Signed-off-by: Allen Helton <allenheltondev@gmail.com>
Actionable comments posted: 0
ksmotiv8
left a comment
this looks good - @crystalphamLF wanted to get this out asap as well, so approving for publication.
Description
Adding blog post taking lessons learned from Unlocked San Jose for architecting systems for 1B RPS.
Issues Resolved
#474
Check List
--signoff
By submitting this pull request, I confirm that my contribution is made under the terms of the BSD-3-Clause License.