Skip to content

[smartswitch] Record DPU reboot time at command invocation#680

Open
vvolam wants to merge 1 commit into
sonic-net:masterfrom
vvolam:record-reboot-time-on-command
Open

[smartswitch] Record DPU reboot time at command invocation#680
vvolam wants to merge 1 commit into
sonic-net:masterfrom
vvolam:record-reboot-time-on-command

Conversation

@vvolam
Copy link
Copy Markdown
Contributor

@vvolam vvolam commented May 27, 2026

Description

Write prev_reboot_time.txt from module_base.py when reboot() or set_admin_state(down) is issued, rather than relying solely on chassisd to write it when it detects the Online→Offline transition.

Changes:

  • Add _record_dpu_reboot_time() helper to ModuleBase that writes the current UTC timestamp to /host/reboot-cause/module/{module}/prev_reboot_time.txt
  • Add reboot_gracefully(reboot_type) wrapper that records the reboot time then calls reboot()
  • Call _record_dpu_reboot_time() in set_admin_state_gracefully(up=False) at the start of the down path
  • Add unit tests for the new methods

Fixes: sonic-net/sonic-buildimage#24275

Motivation and Context

There is a race condition in the current DPU reboot-cause persistence flow: if config reload is issued shortly after reboot -d DPUX, chassisd restarts before it can detect the Online→Offline transition and write prev_reboot_time.txt. When chassisd comes back, it sees EMPTY→Offline and has no reboot time to compare against, potentially losing the reboot event.

By writing prev_reboot_time.txt at the moment the user issues the reboot/shutdown command (in module_base), the timestamp is already persisted before any race can occur. Chassisd's existing write during Online→Offline detection is preserved as a fallback for spontaneous DPU crashes (not user-initiated).

Related PRs: sonic-net/sonic-platform-daemons#771
sonic-net/sonic-utilities#4571

How Has This Been Tested?

  1. Unit tests — all 5 new test cases pass:

    • test_record_dpu_reboot_time_success — verifies file creation and format
    • test_record_dpu_reboot_time_failure — verifies graceful error handling
    • test_reboot_gracefully_records_time_then_reboots — verifies wrapper calls both methods
    • test_reboot_gracefully_still_reboots_on_time_record_failure — verifies reboot proceeds on time-write failure
    • test_set_admin_state_gracefully_down_records_time — verifies admin-down calls _record_dpu_reboot_time
    • test_set_admin_state_gracefully_up_does_not_record_time — verifies admin-up does NOT record time
  2. Existing module_base_test.py tests updated to patch _record_dpu_reboot_time in down-path tests — all continue to pass.

Additional Information (Optional)

This change is complementary to sonic-net/sonic-platform-daemons#771. The two writes are safe to coexist:

Scenario module_base writes chassisd writes
reboot -d DPU0 (normal) ✅ at command time ✅ at Online→Offline detection
reboot -d DPU0 + config reload (race) ✅ at command time ❌ missed (chassisd restarted)
Spontaneous DPU crash ❌ (no command) ✅ at Online→Offline detection

Write prev_reboot_time.txt from module_base when reboot() or
set_admin_state(down) is issued, rather than waiting for chassisd
to detect the Online→Offline transition. This eliminates the race
condition where config reload restarts chassisd before it observes
the status change.

Changes:
- Add _record_dpu_reboot_time() helper to ModuleBase
- Add reboot_gracefully() wrapper that records time then reboots
- Call _record_dpu_reboot_time() in set_admin_state_gracefully(down)
- Add unit tests for new methods

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Contributor

@gpunathilell gpunathilell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. @vvolam since we are introducing reboot_gracefully we need to call this from the reboot script as well

@vvolam
Copy link
Copy Markdown
Contributor Author

vvolam commented May 28, 2026

Approved. @vvolam since we are introducing reboot_gracefully we need to call this from the reboot script as well

That's right. So, both shutdown and reboot paths are covered. I will raise it as a bug next week, once I have all PRs merged

vvolam added a commit to vvolam/sonic-platform-daemons that referenced this pull request May 28, 2026
prev_reboot_time.txt is now exclusively written by module_base.py
at command invocation time (reboot/set_admin_state), ensuring a
single source of truth and eliminating the race window between
DPU power-off and chassisd's poll detection.

Depends-On: sonic-net/sonic-platform-common#680
Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
@vvolam vvolam requested a review from rameshraghupathy May 28, 2026 16:30
@vvolam vvolam changed the title [module_base] Record DPU reboot time at command invocation [smartswitch] Record DPU reboot time at command invocation May 28, 2026
@rameshraghupathy
Copy link
Copy Markdown
Contributor

@vvolam I think we should pause merging #680 until the overall reboot-cause design across #680, #771, and sonic-utilities#4571 is aligned.

The current SmartSwitch design intentionally latches/persists reboot-cause context when the DPU transitions from known non-Offline → Offline, because the DPU may be dead or unreachable afterward.

#680 by itself looks reasonable for the config-reload race. But together with #771, the design shifts toward command-time tracking and post-online reconciliation, while removing chassisd as the fallback writer.

That only covers command-initiated reboot/shutdown flows and does not holistically address spontaneous crash, kernel panic, dead/unreachable DPU, or cases where the DPU never comes back online.

Also, replacing only chassisd with the #771 version on a working 202511 image already caused reboot -d DPUx failures in our testing, which suggests the current direction is changing active SmartSwitch reboot semantics, not just fixing a corner case.

IMO we should:

  • preserve the existing known non-Offline → Offline latching behavior in chassisd,
  • add minimal EMPTY→Offline handling for config reload,
  • and use module_base.py recording only as additional metadata, not the exclusive reboot-event source.

Copy link
Copy Markdown
Contributor

@rameshraghupathy rameshraghupathy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should pause merging #680 until the overall reboot-cause design across #680, #771, and sonic-utilities#4571 is aligned.

@rameshraghupathy rameshraghupathy self-requested a review May 29, 2026 13:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug:[Smartswitch] DPU reboot cause is not updated if reboot is ongoing and config reload is executed

5 participants