[cli]: Add DPU recovery CLI commands for SmartSwitch #4572

Open
vvolam wants to merge 1 commit into sonic-net:master from vvolam:enhance-dpu-robustness-cli

@vvolam vvolam commented May 28, 2026

What I did

Added CLI support for DPU recovery monitoring on SmartSwitch platforms as specified in the Enhance DPU Robustness HLD:

  1. Extended show chassis modules status with a Ready-Status column on SmartSwitch platforms.
  2. Added a new show chassis modules recovery command to expose detailed DPU recovery state.

How I did it

  • Modified show/chassis_modules.py:

    • On SmartSwitch, the status command now connects to CHASSIS_STATE_DB and reads ready_status from DPU_STATE|DPU<N> for each module, appending it as a new column.
    • Added a new recovery subcommand that reads ready_status, recovery_status, reset_count, last_down_time, and last_ready_time from CHASSIS_STATE_DB: DPU_STATE|DPU<N>.
    • The recovery command is gated to SmartSwitch platforms only.
  • Added unit tests in tests/chassis_modules_test.py:

    • TestChassisModulesRecovery class with 7 test cases covering all DPUs, single DPU filter, non-SmartSwitch guard, no-data scenario, Ready-Status in status command, unrecoverable DPU display, and missing fields handling.
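
The row-building step of the recovery subcommand described above can be sketched as follows. This is a minimal illustration, not the PR's code: a plain dict stands in for CHASSIS_STATE_DB, and the names FAKE_CHASSIS_STATE_DB and recovery_rows are hypothetical. The 'N/A' and '0' defaults for ready_status, recovery_status, and reset_count follow the snippet quoted in the review below; the 'N/A' default for the two timestamps is an assumption.

```python
# Hypothetical stand-in for CHASSIS_STATE_DB: key -> field dict.
FAKE_CHASSIS_STATE_DB = {
    "DPU_STATE|DPU0": {
        "ready_status": "true",
        "recovery_status": "recoverable",
        "reset_count": "0",
        "last_down_time": "Fri May 29 09:26:33 PM UTC 2026",
        "last_ready_time": "Fri May 29 09:26:52 PM UTC 2026",
    },
    # Entry with most fields missing, to exercise the defaults.
    "DPU_STATE|DPU1": {"ready_status": "false"},
}

RECOVERY_HEADER = [
    "Name", "Ready-Status", "Recovery-Status",
    "Reset-Count", "Last-Down-Time", "Last-Ready-Time",
]


def recovery_rows(db, module_name=None):
    """Build one table row per DPU_STATE entry, optionally for one module."""
    rows = []
    for key in sorted(db):
        table, _, name = key.partition("|")
        if table != "DPU_STATE":
            continue
        if module_name is not None and name != module_name:
            continue
        data = db[key]
        rows.append([
            name,
            data.get("ready_status", "N/A"),
            data.get("recovery_status", "N/A"),
            data.get("reset_count", "0"),
            data.get("last_down_time", "N/A"),   # assumed default
            data.get("last_ready_time", "N/A"),  # assumed default
        ])
    return rows


if __name__ == "__main__":
    rows = recovery_rows(FAKE_CHASSIS_STATE_DB)
    if not rows:
        print("No DPU recovery data available")
    for row in [RECOVERY_HEADER] + rows:
        print("  ".join(row))
```

Passing a module name, for example recovery_rows(FAKE_CHASSIS_STATE_DB, "DPU1"), mirrors the single-DPU filter; an unknown name yields an empty list, which corresponds to the no-data message.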

How to verify it

# On a SmartSwitch platform:
show chassis modules status
# Should now include a "Ready-Status" column

show chassis modules recovery
# Should display DPU recovery details

# Run unit tests:
cd src/sonic-utilities
python3 -m pytest tests/chassis_modules_test.py::TestChassisModulesRecovery -v

Previous command output (if the output of a command-line utility has changed)

admin@sonic:~$ show chassis modules status
  Name    Description    Physical-Slot    Oper-Status    Admin-Status    Serial
------  -------------  ---------------  -------------  --------------  --------
  DPU0    <DPU Sku>              N/A         Online              up    <serial>
  DPU1    <DPU Sku>              N/A         Online              up    <serial>

New command output (if the output of a command-line utility has changed)

admin@sonic:~$ show chassis modules status
  Name    Description    Physical-Slot    Oper-Status    Admin-Status    Serial    Ready-Status
------  -------------  ---------------  -------------  --------------  --------  --------------
  DPU0    <DPU Sku>              N/A         Online              up    <serial>            true
  DPU1    <DPU Sku>              N/A         Online              up    <serial>            true

admin@sonic:~$ show chassis modules recovery
  Name    Ready-Status    Recovery-Status    Reset-Count                   Last-Down-Time                  Last-Ready-Time
------  --------------  -----------------  -------------  -------------------------------  -------------------------------
  DPU0            true        recoverable              0  Fri May 29 09:26:33 PM UTC 2026  Fri May 29 09:26:52 PM UTC 2026
  DPU1            true        recoverable              0  Fri May 29 09:26:33 PM UTC 2026  Fri May 29 09:26:52 PM UTC 2026
  DPU2            true        recoverable              0  Fri May 29 09:26:33 PM UTC 2026  Fri May 29 09:26:52 PM UTC 2026
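
When checking this output by hand, the two timestamp columns can be compared to see how long a DPU took to become ready again. The sketch below assumes the timestamps always use the format shown in the sample output above and that the names are English (C locale); TIME_FORMAT and recovery_seconds are hypothetical helpers, not part of the PR.

```python
from datetime import datetime

# Format inferred from the sample output, e.g. "Fri May 29 09:26:33 PM UTC 2026".
# "UTC" is matched literally, so other time zones are not handled here.
TIME_FORMAT = "%a %b %d %I:%M:%S %p UTC %Y"


def recovery_seconds(last_down_time, last_ready_time):
    """Return the seconds elapsed between Last-Down-Time and Last-Ready-Time."""
    down = datetime.strptime(last_down_time, TIME_FORMAT)
    ready = datetime.strptime(last_ready_time, TIME_FORMAT)
    return (ready - down).total_seconds()


if __name__ == "__main__":
    elapsed = recovery_seconds(
        "Fri May 29 09:26:33 PM UTC 2026",
        "Fri May 29 09:26:52 PM UTC 2026",
    )
    print(elapsed)  # 19.0
```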

@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Extend 'show chassis modules status' with a Ready-Status column on
SmartSwitch platforms. Add new 'show chassis modules recovery' command
to expose detailed DPU recovery state (recovery_status, reset_count,
last_down_time, last_ready_time) from CHASSIS_STATE_DB.

Add unit tests for the new CLI commands.

Signed-off-by: Vasundhara Volam <vvolam@nvidia.com>
Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
@vvolam vvolam force-pushed the enhance-dpu-robustness-cli branch from 1831f0e to 0b1de1a Compare May 29, 2026 21:37
@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam vvolam marked this pull request as ready for review May 29, 2026 21:38
Copilot AI review requested due to automatic review settings May 29, 2026 21:38

Copilot AI left a comment

Pull request overview

Adds DPU recovery visibility to the SmartSwitch CLI: a new Ready-Status column on show chassis modules status and a new show chassis modules recovery subcommand that surfaces ready/recovery state, reset counts, and last down/ready timestamps from CHASSIS_STATE_DB:DPU_STATE|DPU<N>.

Changes:

  • Extend show chassis modules status on SmartSwitch with a Ready-Status column sourced from CHASSIS_STATE_DB.
  • Add new show chassis modules recovery [<module>] subcommand (SmartSwitch-only) showing ready/recovery state, reset count, and timestamps.
  • Add TestChassisModulesRecovery unit tests covering all-DPU listing, single-DPU filter, non-SmartSwitch guard, no-data, missing-fields, unrecoverable DPU, and Ready-Status column rendering.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File                           Description
-----------------------------  -----------
show/chassis_modules.py        Adds DPU_STATE constants, CHASSIS_STATE_DB lookups, the Ready-Status column on status, and the new recovery subcommand.
tests/chassis_modules_test.py  Adds TestChassisModulesRecovery test class exercising the new column and recovery command across happy and edge cases.

Comment thread show/chassis_modules.py

ready_status = data_dict.get(DPU_STATE_READY_STATUS_FIELD, 'N/A')
recovery_status = data_dict.get(DPU_STATE_RECOVERY_STATUS_FIELD, 'N/A')
reset_count = data_dict.get(DPU_STATE_RESET_COUNT_FIELD, '0')
Comment thread show/chassis_modules.py
Comment on lines +160 to +162
if not keys:
    click.echo('No DPU recovery data available')
    return
Comment thread show/chassis_modules.py
Comment on lines +82 to +93
# For SmartSwitch, connect to CHASSIS_STATE_DB to read DPU_STATE
dpu_state_data = {}
if smartswitch:
    chassis_state_db = SonicV2Connector(host=CHASSIS_SERVER, port=CHASSIS_SERVER_PORT)
    chassis_state_db.connect(chassis_state_db.CHASSIS_STATE_DB)
    dpu_key_pattern = DPU_STATE_TABLE + '|*'
    dpu_keys = chassis_state_db.keys(chassis_state_db.CHASSIS_STATE_DB, dpu_key_pattern)
    if dpu_keys:
        for dpu_key in dpu_keys:
            dpu_name = dpu_key.split('|')[1]
            dpu_state_data[dpu_name] = chassis_state_db.get_all(
                chassis_state_db.CHASSIS_STATE_DB, dpu_key)
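
To illustrate how the dpu_state_data mapping collected above could feed the Ready-Status column, here is a minimal sketch that runs without a database connection. The helpers index_by_module and append_ready_status are hypothetical, the input dicts are made up, and the 'N/A' fallback for modules without a DPU_STATE entry is an assumption rather than the PR's confirmed behavior.

```python
def index_by_module(db_entries):
    """Map 'DPU_STATE|DPU0'-style keys to {module_name: fields}."""
    indexed = {}
    for key, fields in db_entries.items():
        parts = key.split("|", 1)
        if len(parts) != 2:
            continue  # skip keys without a table separator
        indexed[parts[1]] = fields or {}
    return indexed


def append_ready_status(status_rows, dpu_state_data):
    """Return copies of the status rows with a Ready-Status value appended."""
    out = []
    for row in status_rows:
        name = row[0]
        ready = dpu_state_data.get(name, {}).get("ready_status", "N/A")
        out.append(list(row) + [ready])
    return out


if __name__ == "__main__":
    # Made-up sample data standing in for the DB contents and status rows.
    entries = {"DPU_STATE|DPU0": {"ready_status": "true"}}
    status_rows = [
        ["DPU0", "<DPU Sku>", "N/A", "Online", "up", "<serial>"],
        ["DPU1", "<DPU Sku>", "N/A", "Online", "up", "<serial>"],
    ]
    for row in append_ready_status(status_rows, index_by_module(entries)):
        print(row)
```

Looking the value up with chained .get calls keeps a module that has no DPU_STATE entry, or an entry without ready_status, from raising a KeyError.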