LruDedup queue policy for Notification Consumer#4586
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
securely1g
left a comment
There was a problem hiding this comment.
Review: LruDedup Queue Policy for Notification Consumer
Summary
Wires orchagent's NotificationConsumer instances to the new swss-common LRU-dedup queue primitives (companion PRs in sonic-swss-common and sonic-sairedis). Adds NotificationConsumerStatsOrch that periodically publishes per-consumer admission/dedup counters to COUNTERS_DB:NOTIFICATION_CONSUMER_STATS:<name>.
+392/-3 across 14 files, 1 commit. Clean single-commit structure.
What's Good
- Well-structured feature flag —
#ifdef SWSS_NOTIFICATIONCONSUMER_HAS_LRU_DEDUPguards all new code paths, ensuring backward compatibility with older libswsscommon builds - Clean separation — stats orch is standalone, no coupling to existing orch logic
- Good defensive coding — null-checks on
gNotifConsumerStatsOrchat every registration site, stub implementations for test contexts - Sensible queue policy choices — FDB events and port state changes are LruDedup (end-state-idempotent under byte-identical payloads), BFD/ICMP/TWAMP are FIFO (state transitions are ordering-sensitive)
- Comprehensive COUNTERS_DB schema — includes both admission stats and LRU-specific dedup counters with ratio percentages
- Thorough inline comments explaining why LruDedup is safe for FDB (byte-identical = same vlan/mac/port/event_type; distinct events queue separately)
🔴 High Priority Issues
1. Multiple subscribers on the same Redis NOTIFICATIONS channel
PortsOrch, FdbOrch, BfdOrch, IcmpOrch, TwampOrch, and P4Orch all create separate NotificationConsumer instances subscribed to the same "NOTIFICATIONS" channel on ASIC_DB. Each gets a copy of every message published to that channel; the setOpAllowList filters drop non-matching ops. This means:
- Every notification is delivered to N consumers, admitted by one, dropped by N-1
dropped_allowlistwill be consistently high for all consumers except the one that matches — this might look alarming in monitoring unless documented- The LRU dedup queues are per-consumer-instance, so there's no cross-consumer dedup benefit
This isn't a bug per se (it's how the existing architecture works), but the stats output could be confusing. Consider adding a note in COUNTERS_DB documentation that dropped_allowlist being high is expected for consumers that share a channel.
2. P4Orch uses stack-allocated DBConnector
swss::DBConnector notificationsDb("ASIC_DB", 0);
m_portStatusNotificationConsumer = new swss::NotificationConsumer(¬ificationsDb, "NOTIFICATIONS");The DBConnector is stack-local in P4Orch's constructor. After the constructor returns, the pointer passed to NotificationConsumer dangles. This is a pre-existing bug (not introduced by this PR), but registering this consumer for periodic stats polling now means getStats()/getLruDedupQueue() may access the dead DBConnector indirectly. Worth noting as a follow-up fix.
🟡 Medium Priority Issues
3. Timer interval is hardcoded at 10s
kPublishIntervalSec = 10 is reasonable but not configurable. For high-scale deployments where COUNTERS_DB write volume matters, or for debugging where 1s granularity would help, a CONFIG_DB knob (even if defaulted to 10s) would be useful. Low priority but worth considering.
4. No cleanup of COUNTERS_DB entries on process exit
If orchagent restarts, stale NOTIFICATION_CONSUMER_STATS entries persist in COUNTERS_DB. The stats get overwritten on the next publish cycle after restart, but during the gap (before the timer fires), stale counters could confuse monitoring. Consider writing a TTL or clearing the table on init.
5. getChannel() in fake stub
The fake NotificationConsumer in p4orch tests provides the LRU-dedup-aware constructors but doesn't show a getChannel() implementation. If doTask ever runs against the fake (it won't currently since the global is null), it would need this. Fine for now but fragile for future test expansion.
🟢 Minor / Style
6. The #ifdef blocks add visual noise. If swss-common adoption is expected to be universal soon, consider setting a target release after which the #else legacy paths can be removed.
7. admit_ratio_pct uses integer division (100 * admitted / received). With low event counts this rounds aggressively (e.g., 1/3 = 33% not 33.3%). Fine for operational monitoring but worth a comment.
8. The fake_notificationconsumerstatsorch.cpp file duplicates the stub from notificationconsumerstatsorch.cpp (the #else branch). Could factor into a shared header or inline the stubs, but this is cosmetic.
Questions
- Is there a plan to consolidate the N separate
NotificationConsumerinstances on the same"NOTIFICATIONS"channel into a single subscriber with a dispatcher? The allowlist filtering works but N redundant Redis subscriptions is wasteful. - The HLD (SONiC/pull/2334) presumably defines the COUNTERS_DB schema — is there a CLI/show command planned (e.g.,
show notification-consumer stats) or is this purely for external scrapers? - What's the expected dedup ratio in production? Under normal operation (no MAC storms), LRU dedup hits should be near zero. Has this been load-tested under storm conditions to validate queue depth bounding?
Dependency Check
This PR depends on:
- sonic-swss-common:
SWSS_NOTIFICATIONCONSUMER_HAS_LRU_DEDUPdefine +NotificationQueuePolicy,LruDedupNotificationQueue,getStats(),setOpAllowList(),setStatsLabel(),getChannel(),getLruDedupQueue(),DEFAULT_NC_POP_BATCH_SIZE - sonic-sairedis PR #1899
Ensure the swss-common PR is merged first or the #ifdef guards will keep this entire feature dead-code.
Verdict: Clean, well-guarded implementation with good backward compatibility. The shared-channel redundancy (issue #1) is architectural debt rather than a bug introduced here. The P4Orch dangling pointer (issue #2) is pre-existing but now slightly more exposed. Overall looks good to merge once the companion swss-common PR lands.
c5c4a0a to
d3d10de
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Added a note about why
The DBConnector
It is reasonable for now, we can make it configuration in future if needed.
This is an existing pattern for stats in COUNTERS_DB. The stats get overwritten when the orchs restart.
The merge-dependency could be a real issue, the #ifdef is a reasonable solution for now. Will bring it up with the maintainers on a good long term solution.
These are post-computed stats, so it should be fine.
This was discussed in the HLD review. One possible solution is to create a separate channel for each consumer, it was noted down as a Phase2 task.
Currently none, but it can be exported via GNMI.
The ratio was roughly 90%+ for FDB events under stress testing. |
d3d10de to
e84b6bb
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Pull request overview
This PR introduces a NotificationConsumerStatsOrch orchestrator to periodically publish per-NotificationConsumer admission/LRU-dedup statistics into COUNTERS_DB, and updates several orchagents to (optionally) configure allowlists, stats labels, and LRU-dedup queue policy when building against the updated swss-common primitives.
Changes:
- Add
NotificationConsumerStatsOrchand wire it intoOrchDaemonso consumers can register for periodic stats publishing. - Update multiple orchs’
NotificationConsumerinstances to set op allowlists / stats labels and (where applicable) use the LruDedup queue policy underSWSS_NOTIFICATIONCONSUMER_HAS_LRU_DEDUP. - Extend unit-test build plumbing (including p4orch fakes) to link successfully with the new stats orch and new
NotificationConsumerAPIs.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/mock_tests/Makefile.am | Links the new stats orch into mock test binaries. |
| orchagent/Makefile.am | Builds the new notificationconsumerstatsorch.cpp into orchagent. |
| orchagent/notificationconsumerstatsorch.h | Declares the new stats orch and global pointer. |
| orchagent/notificationconsumerstatsorch.cpp | Implements periodic publish logic + legacy stub path. |
| orchagent/orchdaemon.cpp | Constructs/registers the stats orch and adds it to m_orchList. |
| orchagent/fdborch.cpp | Enables allowlist/stats label and LruDedup for FDB event consumer (plus stats registration). |
| orchagent/portsorch.cpp | Enables allowlist/stats label and LruDedup for port-state related consumers (plus stats registration). |
| orchagent/bfdorch.cpp | Adds allowlist/stats label and stats registration for BFD notifications. |
| orchagent/icmporch.cpp | Adds allowlist/stats label and stats registration for ICMP echo notifications. |
| orchagent/twamporch.cpp | Adds allowlist/stats label and stats registration for TWAMP notifications. |
| orchagent/p4orch/p4orch.cpp | Adds allowlist/stats label and stats registration for P4Orch port-state notifications. |
| orchagent/p4orch/tests/Makefile.am | Adds a fake stats orch to keep p4orch unit tests linkable. |
| orchagent/p4orch/tests/fake_notificationconsumer.cpp | Adds LruDedup/allowlist/stats-label API stubs for tests under the feature macro. |
| orchagent/p4orch/tests/fake_notificationconsumerstatsorch.cpp | Adds minimal no-op NotificationConsumerStatsOrch + global pointer stub for p4orch tests. |
e84b6bb to
810f43b
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
810f43b to
31d407c
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
31d407c to
4e6a78e
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Commenter does not have sufficient privileges for PR 4586 in repo sonic-net/sonic-swss |
4e6a78e to
30ef314
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
30ef314 to
4f05197
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Senthil Krishnamurthy <senthil@nexthop.ai>
4f05197 to
a0966b4
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
What I did
Wires every orchagent-owned
NotificationConsumerto the swss-common primitives (companion PR) and addsNotificationConsumerStatsOrchto publish per-consumer counters toCOUNTERS_DB:NOTIFICATION_CONSUMER_STATS:<name>.How I did it
FdbOrch:fdb_event{fdb_event}FdbOrch:flush{ALL, PORT, VLAN, PORTVLAN}PortsOrch:port_state_change{port_state_change}PortsOrch:port_host_tx_ready(CMIS only){port_host_tx_ready}BfdOrch:bfd_session_state_change{bfd_session_state_change}IcmpOrch:icmp_echo_session_state_change{icmp_echo_session_state_change}TwampOrch:twamp_session_event{twamp_session_event}P4Orch:port_state_change{port_state_change}Each consumer also calls
setStatsLabel(...)with the same key it registers under inCOUNTERS_DB:NOTIFICATION_CONSUMER_STATS:<name>. The COUNTERS_DB publish includes:channelreceivedprocessReplyadmissionsdropped_allowlistadmittedreceived - dropped_allowlistadmit_ratio_pctqueue_policy"Fifo"or"LruDedup"lru_pushed/lru_dedup_hits/lru_dedup_ratio_pct/lru_current_depth/lru_high_watermarkgNotifConsumerStatsOrchis constructed early inOrchDaemon::initand added tom_orchList. All registration call sites null-check the global so unit-test contexts that don't construct anOrchDaemonremain unaffected.How I verified it
Unit Tests
Details if related
HLD: sonic-net/SONiC#2334
sonic-sairedis PR: sonic-net/sonic-sairedis#1899
sonic-swss-common PR: sonic-net/sonic-swss-common#1191