[smartswitch] Record DPU reboot time at command invocation#680
Conversation
Write prev_reboot_time.txt from module_base when reboot() or set_admin_state(down) is issued, rather than waiting for chassisd to detect the Online→Offline transition. This eliminates the race condition where config reload restarts chassisd before it observes the status change. Changes: - Add _record_dpu_reboot_time() helper to ModuleBase - Add reboot_gracefully() wrapper that records time then reboots - Call _record_dpu_reboot_time() in set_admin_state_gracefully(down) - Add unit tests for new methods Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
gpunathilell
left a comment
There was a problem hiding this comment.
Approved. @vvolam since we are introducing reboot_gracefully we need to call this from the reboot script as well
That's right. So, both shutdown and reboot paths are covered. I will raise it as a bug next week, once I have all PRs merged |
prev_reboot_time.txt is now exclusively written by module_base.py at command invocation time (reboot/set_admin_state), ensuring a single source of truth and eliminating the race window between DPU power-off and chassisd's poll detection. Depends-On: sonic-net/sonic-platform-common#680 Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
|
@vvolam I think we should pause merging #680 until the overall reboot-cause design across #680, #771, and sonic-utilities#4571 is aligned. The current SmartSwitch design intentionally latches/persists reboot-cause context when the DPU transitions from known non-Offline → Offline, because the DPU may be dead or unreachable afterward. #680 by itself looks reasonable for the config-reload race. But together with #771, the design shifts toward command-time tracking and post-online reconciliation, while removing chassisd as the fallback writer. That only covers command-initiated reboot/shutdown flows and does not holistically address spontaneous crash, kernel panic, dead/unreachable DPU, or cases where the DPU never comes back online. Also, replacing only chassisd with the #771 version on a working 202511 image already caused reboot -d DPUx failures in our testing, which suggests the current direction is changing active SmartSwitch reboot semantics, not just fixing a corner case. IMO we should:
|
Description
Write
prev_reboot_time.txtfrommodule_base.pywhenreboot()orset_admin_state(down)is issued, rather than relying solely on chassisd to write it when it detects the Online→Offline transition.Changes:
_record_dpu_reboot_time()helper toModuleBasethat writes the current UTC timestamp to/host/reboot-cause/module/{module}/prev_reboot_time.txtreboot_gracefully(reboot_type)wrapper that records the reboot time then callsreboot()_record_dpu_reboot_time()inset_admin_state_gracefully(up=False)at the start of the down pathFixes: sonic-net/sonic-buildimage#24275
Motivation and Context
There is a race condition in the current DPU reboot-cause persistence flow: if
config reloadis issued shortly afterreboot -d DPUX, chassisd restarts before it can detect the Online→Offline transition and writeprev_reboot_time.txt. When chassisd comes back, it sees EMPTY→Offline and has no reboot time to compare against, potentially losing the reboot event.By writing
prev_reboot_time.txtat the moment the user issues the reboot/shutdown command (inmodule_base), the timestamp is already persisted before any race can occur. Chassisd's existing write during Online→Offline detection is preserved as a fallback for spontaneous DPU crashes (not user-initiated).Related PRs: sonic-net/sonic-platform-daemons#771
sonic-net/sonic-utilities#4571
How Has This Been Tested?
Unit tests — all 5 new test cases pass:
test_record_dpu_reboot_time_success— verifies file creation and formattest_record_dpu_reboot_time_failure— verifies graceful error handlingtest_reboot_gracefully_records_time_then_reboots— verifies wrapper calls both methodstest_reboot_gracefully_still_reboots_on_time_record_failure— verifies reboot proceeds on time-write failuretest_set_admin_state_gracefully_down_records_time— verifies admin-down calls_record_dpu_reboot_timetest_set_admin_state_gracefully_up_does_not_record_time— verifies admin-up does NOT record timeExisting
module_base_test.pytests updated to patch_record_dpu_reboot_timein down-path tests — all continue to pass.Additional Information (Optional)
This change is complementary to sonic-net/sonic-platform-daemons#771. The two writes are safe to coexist:
reboot -d DPU0(normal)reboot -d DPU0+config reload(race)