[smartswitch][chassisd] Fix missing DPU reboot cause when config-reload interrupts DPU reboot#771
[smartswitch][chassisd] Fix missing DPU reboot cause when config-reload interrupts DPU reboot#771vvolam wants to merge 14 commits into
Conversation
When chassisd restarts (e.g. during config reload) while a DPU is offline, STATE_DB is flushed so the previous operational status is EMPTY. The existing code only persists reboot cause on a non-EMPTY non-Offline to Offline transition, causing the new reboot cause to be lost if it changed while chassisd was down. Add handling for the EMPTY-to-Offline case: compare the current reboot cause from the platform with the last persisted one, and record it if they differ. Skip the check on first boot (no previous-reboot-cause.json) to avoid duplicates. Fixes: sonic-net/sonic-buildimage#24275 Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Pull request overview
This PR updates sonic-chassisd SmartSwitch (DPU) handling to ensure DPU reboot causes are persisted even when chassisd restarts and STATE_DB has been flushed, covering the previously missed EMPTY → Offline transition scenario.
Changes:
- Add
EMPTY → Offlinetransition logic inSmartSwitchModuleUpdater.module_db_update()to compare platform reboot cause vs. last persisted cause and persist when changed. - Add unit tests validating persist/skip behavior for
EMPTY → Offlinebased on cause changes and first-boot conditions. - Minor whitespace/formatting cleanup in existing code/tests.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
sonic-chassisd/scripts/chassisd |
Implements EMPTY → Offline reboot-cause persistence logic for SmartSwitch DPUs after STATE_DB flush/restart. |
sonic-chassisd/tests/test_chassisd.py |
Adds regression tests for the new transition handling and expected persistence/skip behavior. |
Do not call get_reboot_cause while a DPU is powered off / offline. Instead, persist only the reboot time on the Known->Offline transition and defer the reboot-cause query until the DPU comes back online. For EMPTY->Offline (chassisd restart while DPU is down), defer entirely and only persist if the cause actually changed once the DPU is online. Remove the now-unused _record_reboot helper and its tests. Fixes: sonic-net/sonic-buildimage#26254 Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
The deferred=='reboot' path (Online->Offline->Online) was unconditionally persisting the reboot cause. For back-to-back reboots where the cause and timestamp are within MAX_DPU_REBOOT_DURATION, this created duplicate history entries. Add is_reboot detection to the deferred=='reboot' path: if the current cause matches the stored cause and the time delta is within the threshold, skip persist_dpu_reboot_cause but still update the DB to keep it in sync. Add unit test test_deferred_reboot_skips_back_to_back_same_cause to cover the new behavior. Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@vvolam would this add a reboot cause for first boot of the system? |
@gpunathilell No, first boot is safe. In the This is covered by |
|
test_ss_module_db_update.sh |
Manually patched |
|
The fix introduces new bugs: the After running After: Then run And then run |
|
@vvolam there is still a race condition in this implementation - the reboot -d DPUX is executed but the chassisd is restarted before the reboot_time file is created right i.e. the chassisd has to detect that the dpu is being powered off and only then this would work right? can we have the implementation for this in the module_base implementation (i.e. for |
|
@gpunathilell Good catch — yes, there's a narrow race if Agree that writing the timestamp at the platform layer ( Since this is a platform-API-level change affecting vendor implementations, I'd prefer to track it as a follow-up rather than expanding this PR's scope. Will file a separate issue. The current PR already covers the common case (chassisd detects the transition before config reload), and the remaining gap is the narrow same-cause + config-reload-within-10s edge case. |
|
@vvolam its not narrow, its the complete offline cycle, assuming reboot -d DPUX is executed, chassisd does not detect dpu offline until dpu is shutdown, so technically the window is time taken for DPU to actually go offline + ~10 seconds. But I am okay to have a seperate PR, we can use the same issue to track it: sonic-net/sonic-buildimage#24275, we can close the issue sonic-net/sonic-buildimage#26254 based on this pr. |
@gpunathilell I have also raised sonic-net/sonic-platform-common#680 to complete the scenario |
|
@vvolam please remove the |
prev_reboot_time.txt is now exclusively written by module_base.py at command invocation time (reboot/set_admin_state), ensuring a single source of truth and eliminating the race window between DPU power-off and chassisd's poll detection. Depends-On: sonic-net/sonic-platform-common#680 Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
@gpunathilell Thank you, I have addressed the change. |
|
@vvolam I still think we are mixing two different issues here. The config-reload issue is clear: chassisd restarts while DPU reboot is in progress, STATE_DB is empty, and the transition becomes EMPTY→Offline instead of Online→Offline. That missing transition can be handled directly in SmartSwitchModuleUpdater.module_db_update(), where we already read prev_status via get_module_current_status(key), update the module table, and compare previous/current operational state. IMO the fix should be limited to adding EMPTY→Offline handling in the existing transition logic. I do not agree with changing the broader reboot-cause model to defer get_reboot_cause until the DPU comes back online. The current Online→Offline behavior is intentional. Reboot cause is latched/mirrored through the NPU path specifically so it can be retrieved even when the DPU is offline or dead. That is required for debugging unreachable DPUs and is consistent with the existing design. So the concern is not whether the new deferred flow can be validated. The concern is that it changes the design contract. If a DPU never comes back online, deferring reboot-cause collection defeats the purpose of the latched reboot-cause mechanism. Can we keep the existing Online→Offline reboot-cause persistence behavior unchanged and only add the missing EMPTY→Offline/config-reload case as a minimal targeted change? |
| and prev_status != str(ModuleBase.MODULE_STATUS_OFFLINE) | ||
| and current_status == str(ModuleBase.MODULE_STATUS_OFFLINE)) | ||
|
|
||
| def _is_empty_to_offline(self, prev_status, current_status): |
There was a problem hiding this comment.
@vvolum IMO this can be handled with a minimal addition in the existing module_db_update() logic itself.
Today, module_db_update() already:
- reads prev_status using get_module_current_status(),
- updates the module table,
- and compares previous/current oper-status for transition handling.
The new EMPTY→Offline case is simply another transition condition in that same flow.
The helper functions introduced in this PR ultimately still reduce to checking:
prev_status == EMPTY and current_status == Offline
That does not require introducing a new transition framework, deferred reboot-cause processing, or restructuring the existing reboot-cause handling model.
A couple of targeted lines in the existing transition handling path should be sufficient while preserving the current Online→Offline reboot-cause persistence semantics and overall design behavior.
There was a problem hiding this comment.
@rameshraghupathy after our discussion in the meeting offline, do you have any other suggestions?
|
@vvolam I am not against checking/reading reboot-cause after the DPU powers back on. The important requirement is that the reboot-cause event must be latched/recorded when the DPU transitions from a known non-Offline state to Offline. That transition is the point where we know the DPU went down, and the DPU may never come back online. Reading or reconciling reboot-cause later after power-on is fine as an additional step, but it should not replace the existing down-transition latching behavior. So the design should be:
But the Online path must not be the only place where reboot-cause is captured, because that loses the dead-DPU debugging use case. |
This is a platform specific handling and not the use-case for reboot cause, the get_reboot_cause is used for strictly obtaining the previous reboot cause only so it is implied that is checked when DPU is online, |
|
|
@vvolam I tested this on a 202511 SmartSwitch image where When I replaced only chassisd with the master chassisd containing the #771 changes, This reinforces my concern that #771 is not a safe/minimal fix. The old chassisd persists reboot time and reboot cause immediately on known non-Offline→Offline transition. The #771 code changes that behavior and only records a deferred pending reboot check. That changes the established down-transition reboot-cause model. For the config-reload issue, I still think the fix should be minimal:
Given this test result, I do not think #771 should be merged in its current form. The key counterpoint is already in the SmartSwitch PMON HLD: “The NPU hardware should be capable of providing the DPU reboot-cause even when the DPUs are dead.” So for SmartSwitch, this is not just “previous reboot cause when DPU is online.” The platform was explicitly designed to expose reboot cause even when the DPU is dead/offline. |
|
@rameshraghupathy I have a few concerns with that solution. I tested this on a smartswitch and I think there are three issues we are leaving unresolved.
@rameshraghupathy the "only add EMPTY→Offline handling, leave everything else as is" idea doesnt resolve any of these. The Online→Offline persist you want to preserve is exactly what creates #3 - the cause we capture at offline transition is the cause of the PREVIOUS reboot (per the docstring If we want to capture state at the moment the DPU goes offline (for dead-DPU debugging), thats fine - but its not "reboot cause", its an "operational state change" event. That needs a separate API (an oper-state-change callback, or a different function), not overloading I can suggest a design that solves all three - we can discuss in the meeting |
@rameshraghupathy there are preceding PRs: sonic-net/sonic-platform-common#680 @vvolam Yes, with #680 and sonic-utilities#4571 patched, the specific command-initiated test may pass because reboot -d DPUx will write prev_reboot_time.txt before config reload. But that only covers command-initiated reboot/shutdown. It does not address spontaneous DPU crash/kernel panic/dead-DPU cases where there is no module_base.py command invocation. I am still concerned that #771 removes chassisd as a writer of prev_reboot_time.txt. The original #680 description even said chassisd write is preserved as fallback for spontaneous DPU crashes. But the later #771 commit changes that and makes module_base.py the exclusive writer. So passing the config-reload test with all three PRs does not prove the design is safe. It only proves the command-path race is covered. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…us DPU crashes When module_base.py has not created prev_reboot_time.txt (e.g. spontaneous DPU crash/kernel panic with no command invocation), chassisd generates it as a fallback on the Online->Offline transition. Detection: compare the time in prev_reboot_time.txt with the stored reboot cause timestamp from previous-reboot-cause.json. If the file does not exist or its time is <= the stored cause time (stale from a prior reboot), write the current time. If module_base.py already created it (file time > stored cause time), skip to avoid overwriting the more accurate command-time. Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
251151a to
5403e81
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
The DPU reboot-cause testbed verification script is decoupled into a separate PR to keep this change focused on the core logic fix. Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Description
Refactor DPU reboot-cause tracking in
SmartSwitchModuleUpdaterto correctly handle all operational-status transitions, including edge cases aroundconfig reloadand chassisd restarts.Key changes:
_handle_module_status_change()frommodule_db_update()for clearer transition handlingget_reboot_causequeries until the DPU is online — querying a powered-off DPU is unreliable (fixes #26254)Known→Offline, persist the reboot execution time as fallback (ifmodule_base.pyhas not already created it) and defer the cause queryEMPTY→Offline(chassisd restart while DPU is down), defer entirely until the DPU comes back online, then persist only if the cause actually changedEMPTY→ONLINEwith no deferred state — just repopulateCHASSIS_STATE_DBfrom on-disk history (no reboot happened, chassisd just restarted). On first boot, persist the initial cause.MAX_DPU_REBOOT_DURATION(800s) timeout with direct timestamp comparison: compare the stored cause timestamp against the reboot execution time fromprev_reboot_time.txtto determine whether the cause was already persistedCHASSIS_STATE_DBfrom on-disk history after chassisd restart, even when no new cause is persistedpersist_dpu_reboot_time_if_needed()fallback: on Online→Offline, ifmodule_base.pyhas not already createdprev_reboot_time.txtfor this reboot cycle (detected by comparing file time vs stored cause time), chassisd writes it. This covers spontaneous DPU crashes/kernel panics where no command invocation triggersmodule_base.py.log_infotolog_noticefor reboot-cause persistence messagesTransition Matrix
prev_reboot_time.txtgenerationPrimary source:
module_base.pywritesprev_reboot_time.txtat command invocation time (via sonic-net/sonic-platform-common#680 and sonic-net/sonic-utilities#4571).Fallback (this PR): On Online→Offline transition,
persist_dpu_reboot_time_if_needed()checks whethermodule_base.pyhas already created the file for this reboot cycle by comparing the file timestamp against the stored cause timestamp. If the file is missing or stale, chassisd writes the current time. This ensures spontaneous DPU crashes (no CLI command) still get a reboot time marker.Motivation and Context
Fixes sonic-net/sonic-buildimage#24275
Fixes sonic-net/sonic-buildimage#26254
On SmartSwitch platforms, issuing
config reloadshortly afterreboot -d DPU0causes chassisd to restart. After restart, STATE_DB is empty so chassisd seesEMPTY→Offlineinstead ofOnline→Offline, and the DPU reboot cause was never recorded. Additionally, queryingget_reboot_causeon a powered-off DPU is unreliable and was causing spurious updates.Preceding PRs: sonic-net/sonic-platform-common#680
sonic-net/sonic-utilities#4571
How Has This Been Tested?
Unit tests — 63 tests in
sonic-chassisd/tests/test_chassisd.py, all passing:test_smartswitch_moduleupdater_status_transitions— full ONLINE→OFFLINE→ONLINE cycletest_admin_shutdown_does_not_query_reboot_cause— verifiesget_reboot_causeNOT called while DPU is offtest_persist_dpu_reboot_time_if_needed_creates_file_when_missing— fallback writes when file missingtest_persist_dpu_reboot_time_if_needed_skips_when_file_newer— skips when module_base.py already wrotetest_persist_dpu_reboot_time_if_needed_writes_when_file_stale— writes when file is stale from prior reboottest_online_transition_skips_reboot_update— EMPTY→ONLINE repopulates DB without persistingtest_empty_to_online_first_boot_persists— first boot persists initial causetest_empty_to_online_no_spurious_persist— regression test for config reload spurious updatetest_deferred_reboot_skips_already_persisted— stored_time >= reboot_time → skiptest_deferred_reboot_persists_new_reboot— stored_time < reboot_time → persisttest_empty_to_offline_persists_changed_reboot_cause— deferred restart with changed causetest_empty_to_offline_skips_same_reboot_cause— deferred restart with same cause, DB repopulatedtest_extract_cause_tuple— helper extracts scalar from tuple/list/stringSmartSwitch hardware validation on
NVIDIA SS:reboot -d DPU0followed byconfig reload -yafter 10 secondsTestbed verification script —
sonic-chassisd/tests/testbed/test_dpu_reboot_cause_persistence.sh: end-to-end script covering all 6 DPU reboot-cause transition scenarios on real hardware (normal reboot, config reload during reboot, admin shutdown, history preservation, same-cause deferred check, back-to-back rapid reboot).Additional Information (Optional)
The fix only applies to
SmartSwitchModuleUpdater(SmartSwitch/DPU platforms). It does not affect line card or supervisor module updater logic.