Skip to content

HLD for orchagent notifications queue policy#2334

Open
senthil-nexthop wants to merge 1 commit into
sonic-net:masterfrom
nexthop-ai:senthil.orch-notification-coalesce
Open

HLD for orchagent notifications queue policy#2334
senthil-nexthop wants to merge 1 commit into
sonic-net:masterfrom
nexthop-ai:senthil.orch-notification-coalesce

Conversation

@senthil-nexthop
Copy link
Copy Markdown

@senthil-nexthop senthil-nexthop commented May 14, 2026

Summary

  • Add HLD describing an opt-in LRU-dedup queue policy for swss::NotificationConsumer (Phase 1 of the FDB-flap mitigation work).
  • New NotificationQueueBase strategy interface with two concrete policies: FifoNotificationQueue (default, unchanged behavior) and LruDedupNotificationQueue (opt-in; collapses byte-identical payloads on enqueue, bounding queue depth at the universe of distinct payloads).
  • New 5-arg NotificationConsumer constructor takes a NotificationQueuePolicy enum; original 4-arg constructor preserved as a separate overload so the existing mangled symbol stays in libswsscommon.so and consumers like python3-swsscommon need no rebuild.
  • Initial opt-ins: FdbOrch::m_fdbNotificationConsumer, PortsOrch::m_portStatusNotificationConsumer, PortsOrch::m_portHostTxReadyNotificationConsumer. All other NotificationConsumer callers are unchanged.
  • Out-of-scope items (Phase 2 producer-side policy in syncd, SAI/ASIC rate limiting, redis pubsub backpressure, COUNTERS_DB telemetry) called out in §2.1.

Why

  • NotificationConsumer::m_queue is an unbounded std::queuestd::string. Under sustained FDB-flap or link-flap storms, the producer rate exceeds the consumer drain rate and the queue grows without limit, OOM-killing orchagent.
  • The events on the affected channels (NOTIFICATIONS for FDB, port-state, port-host-tx-ready) are end-state-idempotent — only the most-recent payload per logical key matters. Collapsing duplicates loses no correctness.
  • A blanket change (every NotificationConsumer dedups) is too risky because not all channels are end-state-idempotent (e.g. FLUSHFDBREQUEST is order-sensitive). Opt-in per consumer puts the audit responsibility at the channel boundary.
  • Naming (NotificationQueueBase, NotificationQueuePolicy) is chosen to be neutral so the same strategy types can be reused for the Phase 2 producer-side policy in syncd::NotificationQueue without renaming.

** Related PRs **

Repo PR Title Status
sonic-net/sonic-swss-common Adding Queue Policy Definitions Open
sonic-net/sonic-sairedis Switch SaiRedis to LruDedup Open
sonic-net/sonic-swss Switch FDB and PortsOrch to LruDedup Open
sonic-net/sonic-mgmt Sonic-mgmt tests Open

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

No pipelines are associated with this pull request.

@senthil-nexthop senthil-nexthop changed the title HLD for orchagent notifications queue coalescing HLD for orchagent notifications queue policy May 15, 2026
@senthil-nexthop senthil-nexthop force-pushed the senthil.orch-notification-coalesce branch from e91f65d to 492c78c Compare May 18, 2026 04:03
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

No pipelines are associated with this pull request.

@senthil-nexthop senthil-nexthop marked this pull request as ready for review May 18, 2026 04:05
@senthil-nexthop senthil-nexthop force-pushed the senthil.orch-notification-coalesce branch from 492c78c to 949d79d Compare May 19, 2026 03:35
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

No pipelines are associated with this pull request.

@senthil-nexthop senthil-nexthop force-pushed the senthil.orch-notification-coalesce branch from 949d79d to ee9d71f Compare May 19, 2026 03:36
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

No pipelines are associated with this pull request.

@dgsudharsan
Copy link
Copy Markdown
Contributor

@senthil-nexthop Can you please make sure that the link dampening feature is not affected by this feature? #1071

@prsunny FYI

@senthil-nexthop
Copy link
Copy Markdown
Author

@senthil-nexthop Can you please make sure that the link dampening feature is not affected by this feature? #1071

@prsunny FYI

Having looked at the link dampening feature closely, the consumer queue policy proposed in this HLD is not affected by the dampening feature. The link dampening happens at the producer side but the proposed changes are in the consumer side.

I switched PortsOrch to use the LruDedup policy as well based on this determination.

Signed-off-by: Senthil Krishnamurthy <senthil@nexthop.ai>
@senthil-nexthop senthil-nexthop force-pushed the senthil.orch-notification-coalesce branch from ee9d71f to 0671b0a Compare May 25, 2026 21:08
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

No pipelines are associated with this pull request.

@senthil-nexthop
Copy link
Copy Markdown
Author

senthil-nexthop commented May 26, 2026

@dgsudharsan @prsunny I addressed some of your comments. Could you please add yourself as reviewers and take another look?

Comment thread doc/orchagent/notification_queue_policy_hld.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants