Adding blog post for architecting for 1b rps #533

Open
allenheltondev wants to merge 5 commits into valkey-io:main from allenheltondev:blog/architecting-for-1b-rps

Conversation

@allenheltondev
Contributor

@allenheltondev allenheltondev commented Apr 29, 2026

Description

Adds a blog post that draws on lessons learned at Unlocked San Jose about architecting systems for 1B RPS.

Issues Resolved

#474

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the BSD-3-Clause License.

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive guide on managing connection storms in Valkey at scale, covering fleet-level failure modes, I/O threading and TLS implications, cluster-internal throttling, proxy design patterns, and operational safeguards derived from production case studies.

Review Change Stack

@crystalphamLF
Contributor

@stockholmux @madolson can we post this on May 7th?

@allenheltondev
Contributor Author

@crystalphamLF bumping to get this back into the review queue

Member

@stockholmux stockholmux left a comment


A few minor issues.

Otherwise:

  • The diagrams are weird - they look like a UI and it should be clear that they're really an infographic.
  • Make the title more closely align to the content

Comment thread content/blog/2026-05-05-architecting-for-1b-rps/index.md Outdated (10 threads)
@allenheltondev allenheltondev requested a review from a team as a code owner May 20, 2026 21:20
@valkeyrie-ops valkeyrie-ops Bot requested a review from a team May 20, 2026 21:20
@coderabbitai

coderabbitai Bot commented May 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: b835a920-f567-4952-9d41-cb421adfe047

📥 Commits

Reviewing files that changed from the base of the PR and between 2551abf and 77bf9bb.

📒 Files selected for processing (3)
  • content/blog/2026-06-02-architecting-for-1b-rps/architecting_for_1b_1.webp
  • content/blog/2026-06-02-architecting-for-1b-rps/architecting_for_1b_2.webp
  • content/blog/2026-06-02-architecting-for-1b-rps/index.md
✅ Files skipped from review due to trivial changes (1)
  • content/blog/2026-06-02-architecting-for-1b-rps/index.md

📝 Walkthrough

Walkthrough

A new blog post titled "Managing Connection Storms in Valkey at Scale" is added to the content directory. The article explains failure modes from connection storms at fleet scale, describes Valkey's architectural mitigations including I/O threading and TLS offload, covers cluster-internal throttling and proxy-layer considerations, and concludes with operational safeguards from production case studies.

Changes

Managing Connection Storms in Valkey at Scale

Layer / File(s) Summary
Complete blog post with metadata and technical content
content/blog/2026-06-02-architecting-for-1b-rps/index.md
Blog post including frontmatter (title, description, publication date, authors, taxonomy, featured image), connection-storm problem scenario and retry amplification at scale, server-side mitigations (I/O threading, TLS offload), cluster and proxy-layer prevention patterns, client-side strategies (caching, deduplication, batching, multiplexing), write throttling and dual-channel replication safeguards, and verification checklist for system design under load.

Possibly related issues

  • Issue #474: The blog post "Managing Connection Storms in Valkey at Scale" directly implements the proposed "Architecting for 1B+ RPS" blog post topic.

Suggested reviewers

  • madolson


Signed-off-by: Allen Helton <allenheltondev@gmail.com>
Signed-off-by: Allen Helton <allenheltondev@gmail.com>
@allenheltondev allenheltondev force-pushed the blog/architecting-for-1b-rps branch from 5ab52f4 to e95a43e May 20, 2026 21:24

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
content/blog/2026-05-05-architecting-for-1b-rps/index.md (1)

47-47: 💤 Low value

Consider simplifying word choice.

The style checker suggests "exactly" may be overused. Consider simplifying to "which is what this threading work frees up" or "which this threading work directly addresses."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@content/blog/2026-05-05-architecting-for-1b-rps/index.md` at line 47, The
phrase "which is exactly what this threading work frees up" is flagged for
wordiness—replace "which is exactly what" with a simpler construction such as
"which is what this threading work frees up" or "which this threading work
directly addresses" in the sentence starting "The main thread is the
bottleneck..." to reduce redundancy and improve clarity while keeping the rest
of the paragraph unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@content/blog/2026-05-05-architecting-for-1b-rps/index.md`:
- Line 4: The post metadata 'date' field (date = 2026-05-25) doesn't match the
containing directory name (2026-05-05) and earlier PR note about May 7; choose
the intended publication date and make them consistent by either updating the
frontmatter 'date' value to match the directory (e.g., 2026-05-05 or 2026-05-07)
or renaming the directory to match the frontmatter (2026-05-25); ensure any
references to the post URL/slug that derive from the directory are updated
accordingly and verify the final chosen date is correct before merging.
- Around line 106-113: Update the text to remove the incorrect claim that
"Dual-channel replication is on by default in Valkey 9.0" and instead state the
correct default status (it was not enabled by default in 9.0 and the change was
retargeted to 10), and add a note that dual-channel replication requires the
primary to have repl-diskless-sync enabled; specifically, adjust the paragraph
referencing "Dual-channel replication is on by default in Valkey 9.0", keep the
config example for dual-channel-replication-enabled, and append a brief sentence
mentioning the prerequisite repl-diskless-sync on the primary so readers know
both the runtime flag (dual-channel-replication-enabled) and the required
primary setting (repl-diskless-sync).

---

Nitpick comments:
In `@content/blog/2026-05-05-architecting-for-1b-rps/index.md`:
- Line 47: The phrase "which is exactly what this threading work frees up" is
flagged for wordiness—replace "which is exactly what" with a simpler
construction such as "which is what this threading work frees up" or "which this
threading work directly addresses" in the sentence starting "The main thread is
the bottleneck..." to reduce redundancy and improve clarity while keeping the
rest of the paragraph unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: d22c9668-8cf7-40db-a412-26e4b7d79083

📥 Commits

Reviewing files that changed from the base of the PR and between 37ee528 and 5ab52f4.

📒 Files selected for processing (3)
  • content/blog/2026-05-05-architecting-for-1b-rps/architecting_for_1b_1.webp
  • content/blog/2026-05-05-architecting-for-1b-rps/architecting_for_1b_2.webp
  • content/blog/2026-05-05-architecting-for-1b-rps/index.md

Comment thread content/blog/2026-05-25-architecting-for-1b-rps/index.md Outdated
Comment thread content/blog/2026-05-05-architecting-for-1b-rps/index.md Outdated

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (2)
content/blog/2026-05-05-architecting-for-1b-rps/index.md (2)

106-113: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix incorrect default-status claim for dual-channel replication in Valkey 9.0.

Line 106 states it is “on by default in Valkey 9.0,” but Valkey 9.0 config defaults set dual-channel-replication-enabled to no. Please correct this statement and adjust Line 108 wording so it doesn’t imply only pre-9.0 users need to enable it.

Suggested minimal wording update
- Valkey 8.0 introduced [dual-channel replication](https://valkey.io/blog/valkey-8-0-0-rc1/#replication) to address this directly. Rather than a single connection handling both the RDB transfer and the incremental command stream, dual-channel replication opens a dedicated RDB channel for bulk data while maintaining a separate main channel for streaming updates. The replica buffers incremental updates locally while loading the snapshot and applies them once loading completes. COB pressure on the primary drops substantially because the primary no longer needs to buffer the entire write stream for the duration of a large sync. In scenarios with heavy read commands, sync time can be cut by up to 50%. Dual-channel replication is on by default in Valkey 9.0.
+ Valkey 8.0 introduced [dual-channel replication](https://valkey.io/blog/valkey-8-0-0-rc1/#replication) to address this directly. Rather than a single connection handling both the RDB transfer and the incremental command stream, dual-channel replication opens a dedicated RDB channel for bulk data while maintaining a separate main channel for streaming updates. The replica buffers incremental updates locally while loading the snapshot and applies them once loading completes. COB pressure on the primary drops substantially because the primary no longer needs to buffer the entire write stream for the duration of a large sync. In scenarios with heavy read commands, sync time can be cut by up to 50%.
@@
- For teams not yet on Valkey 9.0, enabling dual-channel replication is a single config line. The flag can be toggled at runtime without a restart, which makes it straightforward to validate against a live workload before committing:
+ Enabling dual-channel replication is a single config line. The flag can be toggled at runtime without a restart, which makes it straightforward to validate against a live workload before committing:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@content/blog/2026-05-05-architecting-for-1b-rps/index.md` around lines 106 -
113, The post incorrectly claims "Dual-channel replication is on by default in
Valkey 9.0"; update the text to reflect the actual default (it defaults to no)
and reword the following sentence so it doesn't imply only pre-9.0 users need to
enable it—mention that the default is disabled in 9.0 and that teams can toggle
the feature at runtime via the dual-channel-replication-enabled setting (e.g.,
explain that setting dual-channel-replication-enabled yes enables it without
restart).

4-4: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align publish date with folder slug date.

Line 4 (2026-05-25) conflicts with the dated folder path (2026-05-05-...), which can create confusing publication metadata/URL expectations. Please make these consistent before merge.
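
The consistency check requested here could be automated before merge. The following is a minimal Python sketch, not part of the repository's tooling: it assumes the frontmatter date is written as `date = YYYY-MM-DD` (the form quoted in the review) and that the post directory is named `<YYYY-MM-DD>-<slug>`. The function name `date_mismatch` and the sample title are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumed frontmatter form, as quoted in the review comment: `date = 2026-05-25`.
DATE_LINE = re.compile(r"^date\s*=\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE)
# Assumed directory form: `<YYYY-MM-DD>-<slug>`.
DIR_PREFIX = re.compile(r"^(\d{4}-\d{2}-\d{2})-")


def date_mismatch(post_path: str, frontmatter: str):
    """Return (dir_date, frontmatter_date) if they differ, else None."""
    dir_name = PurePosixPath(post_path).parent.name
    dir_match = DIR_PREFIX.match(dir_name)
    fm_match = DATE_LINE.search(frontmatter)
    if dir_match is None or fm_match is None:
        raise ValueError("no date found in the directory name or frontmatter")
    dir_date, fm_date = dir_match.group(1), fm_match.group(1)
    return None if dir_date == fm_date else (dir_date, fm_date)


path = "content/blog/2026-05-05-architecting-for-1b-rps/index.md"
sample = '+++\ntitle = "example"\ndate = 2026-05-25\n+++\n'
print(date_mismatch(path, sample))  # ('2026-05-05', '2026-05-25')
```

A `None` result means the two dates agree; a tuple names both values so the author can decide which one to change.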

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@content/blog/2026-05-05-architecting-for-1b-rps/index.md` at line 4, The
frontmatter "date" key currently reads 2026-05-25 but must match the folder slug
date 2026-05-05; update the "date" value to 2026-05-05 (or alternatively rename
the folder slug to 2026-05-25) so the post's metadata and URL are consistent,
targeting the "date" field in the file.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@content/blog/2026-05-05-architecting-for-1b-rps/index.md`:
- Around line 106-113: The post incorrectly claims "Dual-channel replication is
on by default in Valkey 9.0"; update the text to reflect the actual default (it
defaults to no) and reword the following sentence so it doesn't imply only
pre-9.0 users need to enable it—mention that the default is disabled in 9.0 and
that teams can toggle the feature at runtime via the
dual-channel-replication-enabled setting (e.g., explain that setting
dual-channel-replication-enabled yes enables it without restart).
- Line 4: The frontmatter "date" key currently reads 2026-05-25 but must match
the folder slug date 2026-05-05; update the "date" value to 2026-05-05 (or
alternatively rename the folder slug to 2026-05-25) so the post's metadata and
URL are consistent, targeting the "date" field in the file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 5ce0444d-3615-4a5c-9271-6bfb7a8fe1dc

📥 Commits

Reviewing files that changed from the base of the PR and between 5ab52f4 and e95a43e.

📒 Files selected for processing (3)
  • content/blog/2026-05-05-architecting-for-1b-rps/architecting_for_1b_1.webp
  • content/blog/2026-05-05-architecting-for-1b-rps/architecting_for_1b_2.webp
  • content/blog/2026-05-05-architecting-for-1b-rps/index.md

Signed-off-by: Allen Helton <allenheltondev@gmail.com>
@allenheltondev
Contributor Author

Got all the requested changes in, thank you for your thoughtful review!

@allenheltondev
Contributor Author

@stockholmux @crystalphamLF can we get this out this week? Happy to change the date to whatever works for the publishing timeline

@crystalphamLF
Contributor

@madolson @stockholmux can we get this pushed out next Tuesday, June 2?


@ksmotiv8 ksmotiv8 left a comment


  • Overall: Really love the tone and technical depth of this piece. It feels grounded in real-world operational lessons instead of theoretical architecture advice. The production anecdotes and failure analysis make it much stronger than the typical infrastructure blog post.
  • Line ~17 (Uber runs roughly 1 billion RPS across 2,000 clusters): Uber: the connection-storm lessons stand on their own — let’s just make sure we’re not accidentally implying Valkey adoption anywhere.
  • Line ~33 (Uber's immediate fix was to use iptables rules): The iptables workaround is more of an emergency mitigation than a real solution. You already hint at that below, but I’d tighten the wording so nobody interprets this as a recommended steady-state pattern.
  • Line ~55 (### The proxy can be the storm too): This currently reads a little anti-proxy, but that’s really a specific proxy's implementation issue, not a proxy issue in general. Well-designed proxies can actually mitigate exactly these classes of failures. I’d reframe the takeaway as: poorly designed proxies amplify storms, while good proxies reduce connection fanout and isolate failures. This could also be a nice place to reference the NxM connection-management discussion.
  • Line ~61 (...will look worse in a benchmark...): “will look worse” → “may look worse”
  • Line ~67 (...directly reduces the p in the n x p equation): Multiplexing helps reduce connection fanout, but it also introduces head-of-line blocking tradeoffs, especially when a proxy is multiplexing across multiple backend clusters. Worth acknowledging the tradeoff so we don’t oversell it.
  • Line ~71 (### Write throttling at 95% CPU): This is essentially load shedding. I’d call that out explicitly because it’s a well-understood systems pattern. The power-grid analogy works really well here too: controlled brownouts are better than a total grid collapse.
  • Line ~93 (## What to carry out of this): I’d rework the closing around Werner Vogels’ famous line: “Everything fails, all the time.” Then build toward the idea that benchmarks only measure the happy path, while production systems are defined by how they behave during failure. Planning for the failures you can anticipate is often the difference between a catastrophic outage and graceful degradation. Uber and Snap learned many of these lessons the hard way, in production, often after outages. The real takeaway is not to copy exact mitigations blindly, but to internalize the operational mindset behind them.
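
The load-shedding framing suggested for the write-throttling section can be illustrated with a small admission check. This is a sketch under stated assumptions rather than the mechanism the post describes: the 95% threshold comes from the section heading, while the function name `admit`, the choice to let reads through, and the way CPU utilization is supplied are all hypothetical.

```python
# Threshold taken from the section heading "Write throttling at 95% CPU".
WRITE_THROTTLE_CPU = 0.95


def admit(is_write: bool, cpu_utilization: float,
          threshold: float = WRITE_THROTTLE_CPU) -> bool:
    """Shed writes once CPU reaches the threshold; reads are always admitted.

    Letting reads through is an assumption of this sketch (the controlled
    brownout), not a statement about how the post's systems behave.
    """
    if not 0.0 <= cpu_utilization <= 1.0:
        raise ValueError("cpu_utilization must be a fraction between 0 and 1")
    return not (is_write and cpu_utilization >= threshold)


print(admit(is_write=True, cpu_utilization=0.90))   # True: below the threshold
print(admit(is_write=True, cpu_utilization=0.97))   # False: write is shed
print(admit(is_write=False, cpu_utilization=0.97))  # True: reads still served
```

Rejecting a bounded class of requests early keeps the node responsive for everything else, which is the trade the brownout analogy describes.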

Signed-off-by: Allen Helton <allenheltondev@gmail.com>
@valkeyrie-ops valkeyrie-ops Bot requested a review from a team May 28, 2026 17:05

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@content/blog/2026-05-28-architecting-for-1b-rps/index.md`:
- Line 4: Update the frontmatter `date` value in this post from `2026-05-28` to
the requested publish date `2026-06-02`; locate the `date` field at the top of
the markdown (the frontmatter key `date`) in the file `index.md` and change its
value to `2026-06-02` so the post will be published on the correct Tuesday
schedule.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: f49976c3-3da8-4f20-96a9-8c0e3694b230

📥 Commits

Reviewing files that changed from the base of the PR and between cd448dc and 2551abf.

📒 Files selected for processing (3)
  • content/blog/2026-05-28-architecting-for-1b-rps/architecting_for_1b_1.webp
  • content/blog/2026-05-28-architecting-for-1b-rps/architecting_for_1b_2.webp
  • content/blog/2026-05-28-architecting-for-1b-rps/index.md

Comment thread content/blog/2026-05-28-architecting-for-1b-rps/index.md Outdated
@greptile-apps
Contributor

greptile-apps Bot commented May 28, 2026

Greptile Summary

This PR adds a new blog post titled "Managing Connection Storms in Valkey at Scale," synthesizing lessons from Uber and Snap's talks at Unlocked San Jose into actionable guidance on connection storms, proxy design, load shedding, and replication reliability. The writing and structure are strong, but two technical claims need correction before this goes live.

  • The prose describing dual-channel replication activation says "it engages only when diskless sync is enabled," omitting that dual-channel-replication-enabled yes is also required (it defaults to no). The accompanying code block is correct, but operators reading the prose alone will miss the second flag.
  • The INFO replication field names replicas_repl_buffer_size and replicas_repl_buffer_peak cited for monitoring dual-channel sync progress do not appear in the official Valkey INFO documentation; the closest documented field is mem_replicas_repl_buffer under INFO memory (Valkey 9.1).

Confidence Score: 3/5

The post contains two technical inaccuracies that could mislead operators configuring production Valkey clusters; it should not merge until both are corrected.

The dual-channel replication section tells readers that enabling diskless sync is the key prerequisite, but the official valkey.conf documentation requires a second explicit opt-in flag (dual-channel-replication-enabled yes) that defaults to off. Separately, the INFO replication field names given for monitoring the replica-side buffer don't appear in the documented Valkey API.
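
To make the two-flag requirement concrete, here is a minimal Python sketch that reports which of the two settings named in this review are not confirmed as `yes`. The setting names `dual-channel-replication-enabled` and `repl-diskless-sync` come from the review comments; the helper `missing_dual_channel_settings` and the sample dictionary are hypothetical, and the sketch assumes the effective settings have already been collected into a dictionary by whatever means the operator prefers.

```python
# Setting names as cited in the review comments; the helper is illustrative only.
REQUIRED = {
    "dual-channel-replication-enabled": "yes",
    "repl-diskless-sync": "yes",
}


def missing_dual_channel_settings(config: dict) -> list:
    """Return the required settings that are not confirmed as 'yes'.

    A key that is absent from `config` is treated as unconfirmed; the sketch
    does not assume any server-side default.
    """
    return [name for name, wanted in REQUIRED.items()
            if config.get(name) != wanted]


# Hypothetical node: diskless sync is on, but the opt-in flag is still off.
effective = {"repl-diskless-sync": "yes", "dual-channel-replication-enabled": "no"}
print(missing_dual_channel_settings(effective))
# ['dual-channel-replication-enabled']
```

An empty list means both settings are in place; anything else names the flag an operator reading only the prose would have missed.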

content/blog/2026-06-02-architecting-for-1b-rps/index.md — the dual-channel configuration prose (around line 118) and the INFO replication field names (line 126) both need correction.

Important Files Changed

Filename Overview
content/blog/2026-06-02-architecting-for-1b-rps/index.md New blog post on connection storms and Valkey scale; two technical claims need verification: the dual-channel activation condition omits a required config flag, and the INFO replication field names cited for monitoring don't appear in official documentation.
content/blog/2026-06-02-architecting-for-1b-rps/architecting_for_1b_1.webp Image asset for connection storm diagram; no issues.
content/blog/2026-06-02-architecting-for-1b-rps/architecting_for_1b_2.webp Image asset for Valkey 7.2 vs 8.0 throughput/latency benchmark chart; no issues.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Node becomes briefly unavailable] --> B[All N service instances detect I/O error]
    B --> C[Each instance closes connection pool]
    C --> D[NxP simultaneous reconnect attempts]
    D --> E{Node CPU}
    E -->|Amplified load| F[100% CPU - node stops responding]
    F --> G[More timeouts - more retries]
    G --> D
    D -->|Mitigation: iptables cut| H[Retry amplification stopped]
    H --> I[Node recovers]
    E -->|Valkey 8.0 IO threading| J[Main thread free for commands]
    E -->|Valkey 8.1 TLS offload| K[+300% new connection acceptance]
    E -->|Valkey 9.0 cluster throttle| L[Intra-cluster reconnect rate limited]
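
The NxP step in the flowchart is simple arithmetic, and a short Python sketch shows why it matters. The fleet size and pool size below are hypothetical numbers chosen for illustration, not figures from the post, and `reconnect_attempts` is a made-up helper name.

```python
def reconnect_attempts(instances: int, pool_size: int) -> int:
    """Connections one node sees if every instance rebuilds its whole pool at once."""
    return instances * pool_size


# Hypothetical fleet: 1,000 service instances, each with a pool of 20 connections.
print(reconnect_attempts(1_000, 20))  # 20000

# Same fleet with one multiplexed connection per instance (P reduced to 1).
print(reconnect_attempts(1_000, 1))   # 1000
```

Shrinking P is the lever multiplexing pulls: the number of instances is unchanged, but each one contributes far fewer simultaneous reconnects when a node blips.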

Reviews (2): Last reviewed commit: "change to new date, and updates from cod..."

Comment thread content/blog/2026-05-28-architecting-for-1b-rps/index.md Outdated
Comment thread content/blog/2026-05-28-architecting-for-1b-rps/index.md Outdated
Signed-off-by: Allen Helton <allenheltondev@gmail.com>
@coderabbitai

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@allenheltondev allenheltondev requested a review from ksmotiv8 May 28, 2026 18:48

@ksmotiv8 ksmotiv8 left a comment


this looks good - @crystalphamLF wanted to get this out asap as well, so approving for publication.

@zuiderkwast
Contributor

@hpatro

5 participants