[pmon]: HLD: Enhance DPU Robustness in Smart Switch#2310
Add High Level Design document covering DPU failure scenarios for Smart Switch, including software failures, hardware failures, and NPU/switch level failures. Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
/azp run
No pipelines are associated with this pull request.
DPU control plane, midplane, and data plane states are always 'down' during booting, never 'unknown'. Update terminology, state machine table, and scenario summary accordingly. Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
…covery gating
- Replace the ambiguous two-timer model (60s auto-recovery + 180s power-cycle) with a single, clearly-named dpu_auto_recovery_timeout (60s). Update the timer table, state machine edge labels, and all DPU software/hardware failure scenarios to use the consistent name.
- Rename 'Critical process' subsections to 'Process' for accuracy; update TOC anchors and Scope wording accordingly.
- Add ManualIntervention state to the DPU recovery state machine and gate SWFailure/HW-failure transitions on the auto-recovery feature flag. Add a global note plus per-scenario 'When auto-recovery is disabled' bullets so the FEATURE|dpu-auto-recovery=disabled behavior is consistent across every failure scenario.
- Rework NPU Kernel Crash recovery: chassisd unconditionally power-cycles every admin-up DPU via the platform vendor path (power_down/pci_detach/power_up/pci_reattach) instead of using gNOI Reboot RPC against potentially unresponsive DPUs. Admin-down DPUs are left offline. Add reset_count row to the DB transition table with a note about chassisd-restart zeroing.
- Fix 'Table of Content' typo and add Existing/New DB entries sub-entries to the table of contents.
- Replace literal pipe inside backticks in the state table cell with HTML entity so the markdown table renders correctly on GitHub.
Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
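A minimal sketch of the NPU Kernel Crash recovery described in this commit, assuming a simplified DPU object. `FakeDpu`, `run_step`, and `recover_after_npu_crash` are hypothetical names for illustration; only the four step names and the admin-up/admin-down rule come from the commit message.

```python
from dataclasses import dataclass, field

# The four platform-vendor steps named in the commit message, in order.
POWER_CYCLE_STEPS = ("power_down", "pci_detach", "power_up", "pci_reattach")


@dataclass
class FakeDpu:
    """Hypothetical stand-in for a platform DPU module object."""
    name: str
    admin_up: bool
    calls: list = field(default_factory=list)

    def run_step(self, step: str) -> None:
        # A real platform would invoke vendor code here; this only records the step.
        self.calls.append(step)


def recover_after_npu_crash(dpus):
    """Power-cycle every admin-up DPU; leave admin-down DPUs untouched."""
    cycled = []
    for dpu in dpus:
        if not dpu.admin_up:
            continue  # admin-down DPUs are left offline
        for step in POWER_CYCLE_STEPS:
            dpu.run_step(step)
        cycled.append(dpu.name)
    return cycled
```

The point of the sketch is that the loop never probes DPU health first: every admin-up DPU gets the same four steps, which matches the "unconditionally power-cycles" wording.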
Drop the dpu_auto_recovery_timeout self-heal grace period. chassisd now initiates a DPU power-cycle as soon as it observes dpu_control_plane_state (or dpu_midplane_link_state) as down on its next 10s health poll, regardless of whether the failure is a transient process restart or a persistent crash-loop.
- Remove dpu_auto_recovery_timeout from the timer table; clarify chassisd health poll interval description to state immediate power-cycle on detection.
- Combine 'Process restart on DPU' and 'Process persistently down on DPU' into a single 'Process crash/restart on DPU' section since chassisd applies the same recovery path in both cases. Update TOC and DB transition table accordingly.
- State machine: keep SWFailure as a transient state on control-plane-down, branching directly into PowerCycle (auto-recovery enabled) or ManualIntervention (auto-recovery disabled) without any timer wait. HW-failure path goes directly from Ready to PowerCycle / ManualIntervention.
- Drop 'skipping dpu_auto_recovery_timeout' parentheticals from HW Failure / Power Failure / PCIe Failure scenarios. Update Scenario DB State Summary row for control plane restart to reflect the immediate power-cycle behavior.
Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
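The timer-free branching in this revision could look roughly like the sketch below. The function name and the string encoding of states are assumptions; the state names and the two DB fields are taken from the commit message.

```python
def next_recovery_state(control_plane: str, midplane: str,
                        auto_recovery_enabled: bool) -> str:
    """Decide the next state from one health poll, with no grace timer.

    Assumes both fields are reported as the strings "up" or "down".
    """
    if control_plane == "up" and midplane == "up":
        return "Ready"
    # Any observed "down" leads straight to recovery or manual handling.
    return "PowerCycle" if auto_recovery_enabled else "ManualIntervention"
```

Under this model the worst-case detection delay is bounded by the 10s health poll interval, since no extra timeout sits between detection and action.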
Regarding the |
We also need a sonic-mgmt section describing how the regular smartswitch tests are planned to be executed (with or without autorecovery).
@vvolam it is not clear from the document when exactly autorecovery is triggered. The control plane also goes down when we execute shutdown/reboot. Please state in the document the exact scenario in which autorecovery is triggered, and whether a timeout is configured for it.
- Clarify databasedpu crash detection: chassisd detects indirectly via dpu_control_plane_state going down, not by monitoring databasedpuN Redis instances directly.
- Add auto-recovery trigger disambiguation: document that chassisd checks state_transition_in_progress before triggering recovery, skipping auto-recovery during planned shutdown/reboot operations.
- Add Testing section with sonic-mgmt test plan covering all failure mode scenarios (8 test classes) and test infrastructure details.
Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
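The trigger disambiguation can be sketched as a single predicate. The function name, the dictionary shape, and the "True" string encoding of state_transition_in_progress are assumptions made for illustration; the field names come from the commit message.

```python
def should_trigger_auto_recovery(dpu_state: dict,
                                 auto_recovery_enabled: bool) -> bool:
    """Return True only for an unplanned control-plane-down event."""
    if not auto_recovery_enabled:
        return False
    # Assumption: the flag is stored as the string "True" while a planned
    # shutdown/reboot is underway.
    if dpu_state.get("state_transition_in_progress") == "True":
        return False
    return dpu_state.get("dpu_control_plane_state") == "down"
```

Checking the planned-operation flag before the control plane state is what keeps a user-initiated shutdown or reboot from being mistaken for a failure.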
@gpunathilell I have updated the document addressing the comments. Please review |
- Clarify recovery timing: power-cycle triggered on same poll cycle that detects failure (no additional timeout beyond 10s poll interval).
- Add CLI section: show chassis modules status extended to display ready_status, recovery_status, reset_count, last_down_time, last_ready_time from CHASSIS_STATE_DB.
- Fix PCIe failure recovery: since DPU is already offline, chassisd updates the status and does not power-cycle.
Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
- Add dpu_self_recovery_timeout (300s) for DPU self-recovery grace period
- Consolidate all DPU failure types into single 'DPU Failure' category with unified WaitForSelfRecovery state
- Consolidate NPU failures into 'NPU Ungraceful Reboot' category
- Update state machine: replace SWFailure/WaitForWatchdog with WaitForSelfRecovery state
- Make Key DB Indicators column explicit with exact DB field conditions
- Remove unused auto_restart/high_mem_alert from dpu-auto-recovery feature
- Simplify chassisd health poll interval and dpu_boot_timeout descriptions
- Fix gRPC abbreviation expansion
Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
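A rough sketch of how the self-recovery grace period might be evaluated on each poll. The 300s value and the WaitForSelfRecovery state come from the commit message; the function, its parameters, and the behavior after the timeout expires (PowerCycle or ManualIntervention, carried over from earlier revisions) are assumptions.

```python
DPU_SELF_RECOVERY_TIMEOUT_S = 300  # value stated in the commit message


def evaluate_dpu(down_since: float, now: float, healthy: bool,
                 auto_recovery_enabled: bool) -> str:
    """Classify a DPU on one poll, given when it was first seen down.

    Times are in seconds on the same clock (for example time.monotonic()).
    """
    if healthy:
        return "Ready"
    if now - down_since < DPU_SELF_RECOVERY_TIMEOUT_S:
        # Still inside the grace period: give the DPU a chance to self-heal.
        return "WaitForSelfRecovery"
    # Assumed post-timeout behavior, not confirmed by this commit message.
    return "PowerCycle" if auto_recovery_enabled else "ManualIntervention"
```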
Update the Booting state's Key DB Indicators to use timer-based condition: 'dpu_boot_timeout timer running AND NOT (midplane up AND control plane up)' instead of assuming specific intermediate link states. Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
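The timer-based Booting condition reads directly as a boolean expression. In this sketch the function name and the "up"/"down" string encoding are assumptions; the condition itself is the one quoted in the commit message.

```python
def is_booting(boot_timer_running: bool, midplane: str,
               control_plane: str) -> bool:
    """dpu_boot_timeout timer running AND NOT (midplane up AND control plane up)."""
    fully_up = midplane == "up" and control_plane == "up"
    return boot_timer_running and not fully_up
```

Because the predicate only tests for the fully-up case, any intermediate combination (for example midplane up while the control plane is still down) counts as Booting while the timer runs.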
Escape the pipe character in the ManualIntervention row's Key DB Indicators column to prevent GitHub Markdown from breaking the table. Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
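For documents generated or checked by a script, the same escaping can be applied programmatically. This is a small sketch with a hypothetical helper name; it uses the backslash form, which GitHub Flavored Markdown accepts inside table cells, and skips pipes that are already escaped.

```python
import re


def escape_table_cell(text: str) -> str:
    """Backslash-escape unescaped pipes so a Markdown table cell stays intact."""
    # The negative lookbehind leaves pipes that already follow a backslash alone.
    return re.sub(r"(?<!\\)\|", lambda _match: "\\|", text)
```

For example, a cell containing `FEATURE|dpu-auto-recovery` becomes `FEATURE\|dpu-auto-recovery`, and running the helper a second time leaves it unchanged.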
"dpu_data_plane_state": "up" | "down",
"dpu_data_plane_time": "<UTC timestamp>",
"dpu_midplane_link_state": "up" | "down",
"dpu_midplane_link_time": "<UTC timestamp>"
Does dpu_midplane_link_time still exist?
What I did
Add a High Level Design document for DPU failure scenarios on SmartSwitch from the PMON (Platform Monitor) perspective.
Why I did it
SmartSwitch DPU lifecycle management requires clear specification of failure detection, DB state tracking, and recovery actions performed by chassisd and other PMON sub-daemons. This HLD documents all failure and planned operation scenarios to guide implementation.
How I did it
Added doc/smart-switch/pmon/enhance-dpu-robustness.md covering:
- ready_status, recovery_status, reset_count, last_down_time, last_ready_time in CHASSIS_STATE_DB
- FEATURE|dpu-auto-recovery in CONFIG_DB
- platform.json
How to verify it
Review the HLD document for completeness and correctness of failure scenarios, DB state transitions, and recovery actions.