[pmon]: HLD: Enhance DPU Robustness in Smart Switch by vvolam · Pull Request #2310 · sonic-net/SONiC

vvolam · 2026-04-27T18:51:13Z

What I did

Add a High Level Design document for DPU failure scenarios on SmartSwitch from the PMON (Platform Monitor) perspective.

Why I did it

SmartSwitch DPU lifecycle management requires clear specification of failure detection, DB state tracking, and recovery actions performed by chassisd and other PMON sub-daemons. This HLD documents all failure and planned operation scenarios to guide implementation.

How I did it

Added doc/smart-switch/pmon/enhance-dpu-robustness.md covering:

DPU software failures: critical process restart, persistent failure, pmon/databasedpu crashes on NPU
DPU hardware failures: complete DPU down, power failure, PCIe failure
NPU/switch-level failures: kernel crash, memory exhaustion
Planned operations: graceful shutdown, cold reboot, full SmartSwitch reboot
New DB fields: ready_status, recovery_status, reset_count, last_down_time, last_ready_time in CHASSIS_STATE_DB
New feature flag: FEATURE|dpu-auto-recovery in CONFIG_DB
DPU recovery state machine: Mermaid diagram with state table
Timers and thresholds: configurable via platform.json
Race condition handling: concurrent operations on the same DPU
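
As a rough illustration of the new per-DPU fields listed above, the sketch below models them as a Python dataclass and flattens them into the string map that a Redis hash would hold. The class name, field types, default values, and timestamp format are assumptions made for illustration; the HLD itself defines the actual schema.

```python
from dataclasses import dataclass, asdict


@dataclass
class DpuRecoveryState:
    """Hypothetical in-memory view of the new CHASSIS_STATE_DB fields."""
    ready_status: str = "false"    # assumed string encoding
    recovery_status: str = "none"  # assumed placeholder value
    reset_count: int = 0
    last_down_time: str = ""       # assumed UTC timestamp string
    last_ready_time: str = ""      # assumed UTC timestamp string

    def to_db_fields(self) -> dict:
        # Redis hashes store strings, so stringify every value.
        return {key: str(value) for key, value in asdict(self).items()}


state = DpuRecoveryState(reset_count=2, last_down_time="2026-04-27T18:51:13Z")
fields = state.to_db_fields()
```

Keeping the five fields in one structure makes it easier to write them in a single hash update, so a reader of the table never sees a half-updated record.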

How to verify it

Review the HLD document for completeness and correctness of failure scenarios, DB state transitions, and recovery actions.

Repo	PR Title / Link	Status
sonic-platform-daemons	[chassisd]: Add DPU recovery state machine and new DB fields
sonic-buildimage	[smartswitch]: Add dpu-auto-recovery feature to SmartSwitch NPU default config
sonic-utilities	[cli]: Add DPU recovery CLI commands for SmartSwitch

Add High Level Design document covering DPU failure scenarios for Smart Switch, including software failures, hardware failures, and NPU/switch level failures. Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>

mssonicbld · 2026-04-27T18:51:21Z

/azp run

azure-pipelines · 2026-04-27T18:51:27Z

No pipelines are associated with this pull request.

DPU control plane, midplane, and data plane states are always 'down' during booting, never 'unknown'. Update terminology, state machine table, and scenario summary accordingly. Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>

mssonicbld · 2026-04-27T21:45:35Z

/azp run

azure-pipelines · 2026-04-27T21:45:40Z

No pipelines are associated with this pull request.

…covery gating

- Replace the ambiguous two-timer model (60s auto-recovery + 180s power-cycle) with a single, clearly-named dpu_auto_recovery_timeout (60s). Update the timer table, state machine edge labels, and all DPU software/hardware failure scenarios to use the consistent name.
- Rename 'Critical process' subsections to 'Process' for accuracy; update TOC anchors and Scope wording accordingly.
- Add ManualIntervention state to the DPU recovery state machine and gate SWFailure/HW-failure transitions on the auto-recovery feature flag. Add a global note plus per-scenario 'When auto-recovery is disabled' bullets so the FEATURE|dpu-auto-recovery=disabled behavior is consistent across every failure scenario.
- Rework NPU Kernel Crash recovery: chassisd unconditionally power-cycles every admin-up DPU via the platform vendor path (power_down/pci_detach/power_up/pci_reattach) instead of using gNOI Reboot RPC against potentially unresponsive DPUs. Admin-down DPUs are left offline. Add reset_count row to the DB transition table with a note about chassisd-restart zeroing.
- Fix 'Table of Content' typo and add Existing/New DB entries sub-entries to the table of contents.
- Replace literal pipe inside backticks in the state table cell with HTML entity so the markdown table renders correctly on GitHub.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
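
The reworked NPU kernel crash recovery described in this commit can be sketched as a loop over DPUs that runs the four vendor steps in the order the message lists them and skips admin-down DPUs. `FakeDpu` and `recover_after_npu_crash` are hypothetical names for illustration; the method names only mirror the step names quoted in the commit message and are not a confirmed platform API.

```python
# Vendor-path steps, in the order given in the commit message.
VENDOR_STEPS = ("power_down", "pci_detach", "power_up", "pci_reattach")


class FakeDpu:
    """Stand-in for a vendor platform module; records the calls it receives."""

    def __init__(self, name, admin_up):
        self.name = name
        self.admin_up = admin_up
        self.calls = []

    def power_down(self):
        self.calls.append("power_down")

    def pci_detach(self):
        self.calls.append("pci_detach")

    def power_up(self):
        self.calls.append("power_up")

    def pci_reattach(self):
        self.calls.append("pci_reattach")


def recover_after_npu_crash(dpus):
    """Power-cycle every admin-up DPU; leave admin-down DPUs offline."""
    cycled = []
    for dpu in dpus:
        if not dpu.admin_up:
            continue  # admin-down DPUs are not touched
        for step in VENDOR_STEPS:
            getattr(dpu, step)()
        cycled.append(dpu.name)
    return cycled


dpus = [FakeDpu("DPU0", admin_up=True), FakeDpu("DPU1", admin_up=False)]
cycled = recover_after_npu_crash(dpus)
```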

mssonicbld · 2026-04-30T00:35:58Z

/azp run

azure-pipelines · 2026-04-30T00:36:04Z

No pipelines are associated with this pull request.

Drop the dpu_auto_recovery_timeout self-heal grace period. chassisd now initiates a DPU power-cycle as soon as it observes dpu_control_plane_state (or dpu_midplane_link_state) as down on its next 10s health poll, regardless of whether the failure is a transient process restart or a persistent crash-loop.

- Remove dpu_auto_recovery_timeout from the timer table; clarify chassisd health poll interval description to state immediate power-cycle on detection.
- Combine 'Process restart on DPU' and 'Process persistently down on DPU' into a single 'Process crash/restart on DPU' section since chassisd applies the same recovery path in both cases. Update TOC and DB transition table accordingly.
- State machine: keep SWFailure as a transient state on control-plane-down, branching directly into PowerCycle (auto-recovery enabled) or ManualIntervention (auto-recovery disabled) without any timer wait. HW-failure path goes directly from Ready to PowerCycle / ManualIntervention.
- Drop 'skipping dpu_auto_recovery_timeout' parentheticals from HW Failure / Power Failure / PCIe Failure scenarios. Update Scenario DB State Summary row for control plane restart to reflect the immediate power-cycle behavior.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>

mssonicbld · 2026-04-30T05:54:16Z

/azp run

azure-pipelines · 2026-04-30T05:54:23Z

No pipelines are associated with this pull request.

gpunathilell · 2026-05-26T17:13:34Z

Regarding the "databasedpu crash on NPU" section: there are two databases involved here. One is CHASSIS_STATE_DB, to which chassisd running on the DPU sends data. That is not the same as the databasedpuN Docker containers running on the switch, which are the ones orchagent on the DPU reads from and writes to; chassisd is not involved there. It should not be possible to detect a databasedpuN issue on its own. If we assume it is caused by a midplane failure, then CHASSIS_STATE_DB is also inaccessible to the DPU either way.

gpunathilell · 2026-05-26T17:25:44Z

We also need a sonic-mgmt section describing how the regular SmartSwitch tests are planned to be executed (with or without auto-recovery).

gpunathilell · 2026-05-26T18:44:15Z

@vvolam It is not clear from the document exactly when auto-recovery is triggered. The control plane also goes down when we execute a shutdown or reboot. Please state in the document the exact scenario in which auto-recovery is triggered, and whether a timeout is configured for it.

- Clarify databasedpu crash detection: chassisd detects indirectly via dpu_control_plane_state going down, not by monitoring databasedpuN Redis instances directly.
- Add auto-recovery trigger disambiguation: document that chassisd checks state_transition_in_progress before triggering recovery, skipping auto-recovery during planned shutdown/reboot operations.
- Add Testing section with sonic-mgmt test plan covering all failure mode scenarios (8 test classes) and test infrastructure details.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
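
The trigger disambiguation in this commit amounts to a small guard: recovery is considered only when the control plane is down and no planned transition is in flight. The sketch below is a minimal version of that check. The function name and the use of a Python boolean for state_transition_in_progress are assumptions; the actual DB encoding of that field is not given here.

```python
def should_auto_recover(control_plane_state: str,
                        state_transition_in_progress: bool,
                        auto_recovery_enabled: bool) -> bool:
    """Hypothetical guard deciding whether chassisd starts auto-recovery."""
    if control_plane_state != "down":
        return False  # DPU is healthy, nothing to recover
    if state_transition_in_progress:
        return False  # planned shutdown/reboot: the down state is expected
    return auto_recovery_enabled  # FEATURE|dpu-auto-recovery gate


planned_reboot = should_auto_recover("down", True, True)
unexpected_failure = should_auto_recover("down", False, True)
```

Checking the transition flag before the feature flag keeps planned operations from ever being counted as failures, regardless of how the feature is configured.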

mssonicbld · 2026-05-26T22:32:23Z

/azp run

azure-pipelines · 2026-05-26T22:32:30Z

No pipelines are associated with this pull request.

vvolam · 2026-05-26T22:35:27Z

@gpunathilell I have updated the document to address the comments. Please review.

- Clarify recovery timing: power-cycle triggered on same poll cycle that detects failure (no additional timeout beyond 10s poll interval).
- Add CLI section: show chassis modules status extended to display ready_status, recovery_status, reset_count, last_down_time, last_ready_time from CHASSIS_STATE_DB.
- Fix PCIe failure recovery: since DPU is already offline, chassisd updates the status and does not power-cycle.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>

mssonicbld · 2026-05-28T04:31:02Z

/azp run

- Add dpu_self_recovery_timeout (300s) for DPU self-recovery grace period
- Consolidate all DPU failure types into single 'DPU Failure' category with unified WaitForSelfRecovery state
- Consolidate NPU failures into 'NPU Ungraceful Reboot' category
- Update state machine: replace SWFailure/WaitForWatchdog with WaitForSelfRecovery state
- Make Key DB Indicators column explicit with exact DB field conditions
- Remove unused auto_restart/high_mem_alert from dpu-auto-recovery feature
- Simplify chassisd health poll interval and dpu_boot_timeout descriptions
- Fix gRPC abbreviation expansion

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
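
A minimal sketch of how the WaitForSelfRecovery grace period might behave, assuming the 300s value from this commit. The function name, the `seconds_down` parameter, and in particular the branching on expiry (PowerCycle when auto-recovery is enabled, ManualIntervention otherwise) are assumptions carried over from the state names used in earlier commits, not a confirmed description of the final state machine.

```python
DPU_SELF_RECOVERY_TIMEOUT_S = 300  # value quoted in the commit message


def next_state(state, dpu_up, seconds_down, auto_recovery_enabled):
    """Return the next recovery state for one DPU (hypothetical helper)."""
    if state not in ("Ready", "WaitForSelfRecovery"):
        return state  # other states are outside the scope of this sketch
    if dpu_up:
        return "Ready"  # the DPU recovered on its own
    if seconds_down < DPU_SELF_RECOVERY_TIMEOUT_S:
        return "WaitForSelfRecovery"  # grace period still running
    # Grace period expired without the DPU coming back (assumed branching).
    return "PowerCycle" if auto_recovery_enabled else "ManualIntervention"


after_expiry = next_state("WaitForSelfRecovery", False, 300, True)
```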

mssonicbld · 2026-05-29T20:17:42Z

/azp run

azure-pipelines · 2026-05-29T20:17:49Z

No pipelines are associated with this pull request.

mssonicbld · 2026-05-29T22:34:52Z

/azp run

azure-pipelines · 2026-05-29T22:34:58Z

No pipelines are associated with this pull request.

mssonicbld · 2026-05-29T23:34:18Z

/azp run

azure-pipelines · 2026-05-29T23:34:24Z

No pipelines are associated with this pull request.

Update the Booting state's Key DB Indicators to use timer-based condition: 'dpu_boot_timeout timer running AND NOT (midplane up AND control plane up)' instead of assuming specific intermediate link states. Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
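
The timer-based Booting condition quoted above translates directly into a boolean predicate. The sketch below is one way to write it; the function and parameter names are hypothetical, and the "up"/"down" strings follow the state values used elsewhere in this PR.

```python
def is_booting(boot_timer_running: bool,
               midplane_state: str,
               control_plane_state: str) -> bool:
    """dpu_boot_timeout timer running AND NOT (midplane up AND control plane up)."""
    fully_up = midplane_state == "up" and control_plane_state == "up"
    return boot_timer_running and not fully_up


partially_up = is_booting(True, "up", "down")
```

Because the predicate only tests for "not fully up", it holds for every intermediate combination of link states, which is the point of the change.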

mssonicbld · 2026-05-30T00:26:14Z

/azp run

azure-pipelines · 2026-05-30T00:26:21Z

No pipelines are associated with this pull request.

Escape the pipe character in the ManualIntervention row's Key DB Indicators column to prevent GitHub Markdown from breaking the table. Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
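
For context, GitHub Flavored Markdown treats an unescaped pipe as a column separator even inside a code span, so a cell that mentions a key such as FEATURE|dpu-auto-recovery splits into two columns. A small helper like the one below (hypothetical, with made-up cell text) shows the backslash form of the fix when table rows are generated programmatically.

```python
def escape_table_cell(text: str) -> str:
    """Escape pipes so the text stays inside a single Markdown table cell."""
    return text.replace("|", "\\|")


# Example cell contents are illustrative, not the HLD's actual wording.
cells = ["ManualIntervention", "`FEATURE|dpu-auto-recovery` = disabled"]
row = "| " + " | ".join(escape_table_cell(cell) for cell in cells) + " |"
```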

mssonicbld · 2026-05-30T01:57:49Z

/azp run

azure-pipelines · 2026-05-30T01:57:56Z

No pipelines are associated with this pull request.

judsonwilson-nvidia · 2026-05-30T19:05:21Z

+  "dpu_data_plane_state":    "up" | "down",
+  "dpu_data_plane_time":     "<UTC timestamp>",
+  "dpu_midplane_link_state": "up" | "down",
+  "dpu_midplane_link_time":  "<UTC timestamp>"


Does dpu_midplane_link_time still exist?

Add Smart Switch DPU Reliability and Availability HLD

5b6e466

vvolam changed the title ~~[pmon]: Add Smart Switch DPU Reliability and Availability HLD~~ [pmon]: Add Smart Switch Enhance DPU Robustness HLD Apr 27, 2026

vvolam changed the title ~~[pmon]: Add Smart Switch Enhance DPU Robustness HLD~~ [pmon]: HLD: Enhance DPU Robustness in Smart Switch Apr 29, 2026

vvolam marked this pull request as ready for review May 4, 2026 22:35

gpunathilell requested review from dgsudharsan, gpunathilell and vivekrnv May 7, 2026 17:20

vvolam requested a review from rameshraghupathy May 7, 2026 19:25

vvolam mentioned this pull request May 19, 2026

[SmartSwitch] Handle DPU failure scenarios - Design and Implementation sonic-net/sonic-buildimage#27450

Open

gpunathilell reviewed May 27, 2026

Comment thread doc/smart-switch/pmon/enhance-dpu-robustness.md Outdated

Comment thread doc/smart-switch/pmon/enhance-dpu-robustness.md

Comment thread doc/smart-switch/pmon/enhance-dpu-robustness.md Outdated

rameshraghupathy reviewed May 29, 2026

Comment thread doc/smart-switch/pmon/enhance-dpu-robustness.md

rameshraghupathy reviewed May 29, 2026

Comment thread doc/smart-switch/pmon/enhance-dpu-robustness.md

rameshraghupathy reviewed May 29, 2026

Comment thread doc/smart-switch/pmon/enhance-dpu-robustness.md

judsonwilson-nvidia reviewed May 29, 2026

Comment thread doc/smart-switch/pmon/enhance-dpu-robustness.md Outdated