Skip to content

LruDedup queue policy for Notification Consumer#4586

Merged
lguohan merged 1 commit into
sonic-net:masterfrom
nexthop-ai:lrudedup-queue-policy-notification
May 30, 2026
Merged

LruDedup queue policy for Notification Consumer#4586
lguohan merged 1 commit into
sonic-net:masterfrom
nexthop-ai:lrudedup-queue-policy-notification

Conversation

@senthil-nexthop
Copy link
Copy Markdown
Contributor

What I did
Wires every orchagent-owned NotificationConsumer to the swss-common primitives (companion PR) and adds NotificationConsumerStatsOrch to publish per-consumer counters to COUNTERS_DB:NOTIFICATION_CONSUMER_STATS:<name>.

How I did it

Consumer Queue policy Allowlist
FdbOrch:fdb_event LruDedup {fdb_event}
FdbOrch:flush Fifo {ALL, PORT, VLAN, PORTVLAN}
PortsOrch:port_state_change LruDedup {port_state_change}
PortsOrch:port_host_tx_ready (CMIS only) LruDedup {port_host_tx_ready}
BfdOrch:bfd_session_state_change Fifo {bfd_session_state_change}
IcmpOrch:icmp_echo_session_state_change Fifo {icmp_echo_session_state_change}
TwampOrch:twamp_session_event Fifo {twamp_session_event}
P4Orch:port_state_change Fifo {port_state_change}

Each consumer also calls setStatsLabel(...) with the same key it registers under in COUNTERS_DB:NOTIFICATION_CONSUMER_STATS:<name>. The COUNTERS_DB publish includes:

field meaning
channel Redis channel SUBSCRIBE'd to
received total processReply admissions
dropped_allowlist dropped at admission by the op-allowlist
admitted received - dropped_allowlist
admit_ratio_pct integer percent
queue_policy "Fifo" or "LruDedup"
lru_pushed / lru_dedup_hits / lru_dedup_ratio_pct / lru_current_depth / lru_high_watermark LruDedup consumers only

gNotifConsumerStatsOrch is constructed early in OrchDaemon::init and added to m_orchList. All registration call sites null-check the global so unit-test contexts that don't construct an OrchDaemon remain unaffected.

How I verified it
Unit Tests

Details if related
HLD: sonic-net/SONiC#2334
sonic-sairedis PR: sonic-net/sonic-sairedis#1899
sonic-swss-common PR: sonic-net/sonic-swss-common#1191

@senthil-nexthop senthil-nexthop requested a review from prsunny as a code owner May 19, 2026 23:15
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Contributor

@securely1g securely1g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: LruDedup Queue Policy for Notification Consumer

Summary

Wires orchagent's NotificationConsumer instances to the new swss-common LRU-dedup queue primitives (companion PRs in sonic-swss-common and sonic-sairedis). Adds NotificationConsumerStatsOrch that periodically publishes per-consumer admission/dedup counters to COUNTERS_DB:NOTIFICATION_CONSUMER_STATS:<name>.

+392/-3 across 14 files, 1 commit. Clean single-commit structure.


What's Good

  • Well-structured feature flag#ifdef SWSS_NOTIFICATIONCONSUMER_HAS_LRU_DEDUP guards all new code paths, ensuring backward compatibility with older libswsscommon builds
  • Clean separation — stats orch is standalone, no coupling to existing orch logic
  • Good defensive coding — null-checks on gNotifConsumerStatsOrch at every registration site, stub implementations for test contexts
  • Sensible queue policy choices — FDB events and port state changes are LruDedup (end-state-idempotent under byte-identical payloads), BFD/ICMP/TWAMP are FIFO (state transitions are ordering-sensitive)
  • Comprehensive COUNTERS_DB schema — includes both admission stats and LRU-specific dedup counters with ratio percentages
  • Thorough inline comments explaining why LruDedup is safe for FDB (byte-identical = same vlan/mac/port/event_type; distinct events queue separately)

🔴 High Priority Issues

1. Multiple subscribers on the same Redis NOTIFICATIONS channel

PortsOrch, FdbOrch, BfdOrch, IcmpOrch, TwampOrch, and P4Orch all create separate NotificationConsumer instances subscribed to the same "NOTIFICATIONS" channel on ASIC_DB. Each gets a copy of every message published to that channel; the setOpAllowList filters drop non-matching ops. This means:

  • Every notification is delivered to N consumers, admitted by one, dropped by N-1
  • dropped_allowlist will be consistently high for all consumers except the one that matches — this might look alarming in monitoring unless documented
  • The LRU dedup queues are per-consumer-instance, so there's no cross-consumer dedup benefit

This isn't a bug per se (it's how the existing architecture works), but the stats output could be confusing. Consider adding a note in COUNTERS_DB documentation that dropped_allowlist being high is expected for consumers that share a channel.

2. P4Orch uses stack-allocated DBConnector

swss::DBConnector notificationsDb("ASIC_DB", 0);
m_portStatusNotificationConsumer = new swss::NotificationConsumer(&notificationsDb, "NOTIFICATIONS");

The DBConnector is stack-local in P4Orch's constructor. After the constructor returns, the pointer passed to NotificationConsumer dangles. This is a pre-existing bug (not introduced by this PR), but registering this consumer for periodic stats polling now means getStats()/getLruDedupQueue() may access the dead DBConnector indirectly. Worth noting as a follow-up fix.


🟡 Medium Priority Issues

3. Timer interval is hardcoded at 10s

kPublishIntervalSec = 10 is reasonable but not configurable. For high-scale deployments where COUNTERS_DB write volume matters, or for debugging where 1s granularity would help, a CONFIG_DB knob (even if defaulted to 10s) would be useful. Low priority but worth considering.

4. No cleanup of COUNTERS_DB entries on process exit

If orchagent restarts, stale NOTIFICATION_CONSUMER_STATS entries persist in COUNTERS_DB. The stats get overwritten on the next publish cycle after restart, but during the gap (before the timer fires), stale counters could confuse monitoring. Consider writing a TTL or clearing the table on init.

5. getChannel() in fake stub

The fake NotificationConsumer in p4orch tests provides the LRU-dedup-aware constructors but doesn't show a getChannel() implementation. If doTask ever runs against the fake (it won't currently since the global is null), it would need this. Fine for now but fragile for future test expansion.


🟢 Minor / Style

6. The #ifdef blocks add visual noise. If swss-common adoption is expected to be universal soon, consider setting a target release after which the #else legacy paths can be removed.

7. admit_ratio_pct uses integer division (100 * admitted / received). With low event counts this rounds aggressively (e.g., 1/3 = 33% not 33.3%). Fine for operational monitoring but worth a comment.

8. The fake_notificationconsumerstatsorch.cpp file duplicates the stub from notificationconsumerstatsorch.cpp (the #else branch). Could factor into a shared header or inline the stubs, but this is cosmetic.


Questions

  • Is there a plan to consolidate the N separate NotificationConsumer instances on the same "NOTIFICATIONS" channel into a single subscriber with a dispatcher? The allowlist filtering works but N redundant Redis subscriptions is wasteful.
  • The HLD (SONiC/pull/2334) presumably defines the COUNTERS_DB schema — is there a CLI/show command planned (e.g., show notification-consumer stats) or is this purely for external scrapers?
  • What's the expected dedup ratio in production? Under normal operation (no MAC storms), LRU dedup hits should be near zero. Has this been load-tested under storm conditions to validate queue depth bounding?

Dependency Check

This PR depends on:

  • sonic-swss-common: SWSS_NOTIFICATIONCONSUMER_HAS_LRU_DEDUP define + NotificationQueuePolicy, LruDedupNotificationQueue, getStats(), setOpAllowList(), setStatsLabel(), getChannel(), getLruDedupQueue(), DEFAULT_NC_POP_BATCH_SIZE
  • sonic-sairedis PR #1899

Ensure the swss-common PR is merged first or the #ifdef guards will keep this entire feature dead-code.


Verdict: Clean, well-guarded implementation with good backward compatibility. The shared-channel redundancy (issue #1) is architectural debt rather than a bug introduced here. The P4Orch dangling pointer (issue #2) is pre-existing but now slightly more exposed. Overall looks good to merge once the companion swss-common PR lands.

@senthil-nexthop senthil-nexthop force-pushed the lrudedup-queue-policy-notification branch from c5c4a0a to d3d10de Compare May 25, 2026 20:19
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@senthil-nexthop
Copy link
Copy Markdown
Contributor Author

🔴 High Priority Issues

1. Multiple subscribers on the same Redis NOTIFICATIONS channel

Added a note about why dropped_allowlist is expected to be high for consumers that share a channel.

2. P4Orch uses stack-allocated DBConnector

The DBConnector NotificationConsumer::m_db is referenced only in subscribe() and it is not touched by getStats() or bt getLruDedupQueue(), so the risk of the dangling consumer is non-existent.

🟡 Medium Priority Issues

3. Timer interval is hardcoded at 10s

It is reasonable for now, we can make it configuration in future if needed.

4. No cleanup of COUNTERS_DB entries on process exit

This is an existing pattern for stats in COUNTERS_DB. The stats get overwritten when the orchs restart.

🟢 Minor / Style

6. The #ifdef blocks add visual noise. If swss-common adoption is expected to be universal soon, consider setting a target release after which the #else legacy paths can be removed.

The merge-dependency could be a real issue, the #ifdef is a reasonable solution for now. Will bring it up with the maintainers on a good long term solution.

7. admit_ratio_pct uses integer division (100 * admitted / received). With low event counts this rounds aggressively (e.g., 1/3 = 33% not 33.3%). Fine for operational monitoring but worth a comment.

These are post-computed stats, so it should be fine.

Questions

  • Is there a plan to consolidate the N separate NotificationConsumer instances on the same "NOTIFICATIONS" channel into a single subscriber with a dispatcher? The allowlist filtering works but N redundant Redis subscriptions is wasteful.

This was discussed in the HLD review. One possible solution is to create a separate channel for each consumer, it was noted down as a Phase2 task.

  • The HLD (SONiC/pull/2334) presumably defines the COUNTERS_DB schema — is there a CLI/show command planned (e.g., show notification-consumer stats) or is this purely for external scrapers?

Currently none, but it can be exported via GNMI.

  • What's the expected dedup ratio in production? Under normal operation (no MAC storms), LRU dedup hits should be near zero. Has this been load-tested under storm conditions to validate queue depth bounding?

The ratio was roughly 90%+ for FDB events under stress testing.

@senthil-nexthop senthil-nexthop force-pushed the lrudedup-queue-policy-notification branch from d3d10de to e84b6bb Compare May 25, 2026 21:06
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a NotificationConsumerStatsOrch orchestrator to periodically publish per-NotificationConsumer admission/LRU-dedup statistics into COUNTERS_DB, and updates several orchagents to (optionally) configure allowlists, stats labels, and LRU-dedup queue policy when building against the updated swss-common primitives.

Changes:

  • Add NotificationConsumerStatsOrch and wire it into OrchDaemon so consumers can register for periodic stats publishing.
  • Update multiple orchs’ NotificationConsumer instances to set op allowlists / stats labels and (where applicable) use the LruDedup queue policy under SWSS_NOTIFICATIONCONSUMER_HAS_LRU_DEDUP.
  • Extend unit-test build plumbing (including p4orch fakes) to link successfully with the new stats orch and new NotificationConsumer APIs.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/mock_tests/Makefile.am Links the new stats orch into mock test binaries.
orchagent/Makefile.am Builds the new notificationconsumerstatsorch.cpp into orchagent.
orchagent/notificationconsumerstatsorch.h Declares the new stats orch and global pointer.
orchagent/notificationconsumerstatsorch.cpp Implements periodic publish logic + legacy stub path.
orchagent/orchdaemon.cpp Constructs/registers the stats orch and adds it to m_orchList.
orchagent/fdborch.cpp Enables allowlist/stats label and LruDedup for FDB event consumer (plus stats registration).
orchagent/portsorch.cpp Enables allowlist/stats label and LruDedup for port-state related consumers (plus stats registration).
orchagent/bfdorch.cpp Adds allowlist/stats label and stats registration for BFD notifications.
orchagent/icmporch.cpp Adds allowlist/stats label and stats registration for ICMP echo notifications.
orchagent/twamporch.cpp Adds allowlist/stats label and stats registration for TWAMP notifications.
orchagent/p4orch/p4orch.cpp Adds allowlist/stats label and stats registration for P4Orch port-state notifications.
orchagent/p4orch/tests/Makefile.am Adds a fake stats orch to keep p4orch unit tests linkable.
orchagent/p4orch/tests/fake_notificationconsumer.cpp Adds LruDedup/allowlist/stats-label API stubs for tests under the feature macro.
orchagent/p4orch/tests/fake_notificationconsumerstatsorch.cpp Adds minimal no-op NotificationConsumerStatsOrch + global pointer stub for p4orch tests.

Comment thread orchagent/orchdaemon.cpp
Comment thread orchagent/orchdaemon.cpp
Comment thread orchagent/notificationconsumerstatsorch.h Outdated
Comment thread orchagent/notificationconsumerstatsorch.h
Comment thread orchagent/notificationconsumerstatsorch.cpp Outdated
Comment thread orchagent/p4orch/p4orch.cpp Outdated
@senthil-nexthop senthil-nexthop force-pushed the lrudedup-queue-policy-notification branch from e84b6bb to 810f43b Compare May 28, 2026 16:40
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@senthil-nexthop senthil-nexthop force-pushed the lrudedup-queue-policy-notification branch from 810f43b to 31d407c Compare May 28, 2026 18:52
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@senthil-nexthop senthil-nexthop force-pushed the lrudedup-queue-policy-notification branch from 31d407c to 4e6a78e Compare May 29, 2026 16:23
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@senthil-nexthop
Copy link
Copy Markdown
Contributor Author

/azp run

@azure-pipelines
Copy link
Copy Markdown

Commenter does not have sufficient privileges for PR 4586 in repo sonic-net/sonic-swss

@senthil-nexthop senthil-nexthop force-pushed the lrudedup-queue-policy-notification branch from 4e6a78e to 30ef314 Compare May 29, 2026 22:55
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@senthil-nexthop senthil-nexthop force-pushed the lrudedup-queue-policy-notification branch from 30ef314 to 4f05197 Compare May 30, 2026 02:22
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Senthil Krishnamurthy <senthil@nexthop.ai>
@senthil-nexthop senthil-nexthop force-pushed the lrudedup-queue-policy-notification branch from 4f05197 to a0966b4 Compare May 30, 2026 03:38
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@lguohan lguohan merged commit dcd97c9 into sonic-net:master May 30, 2026
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants