Skip to content

[smartswitch][chassisd] Fix missing DPU reboot cause when config-reload interrupts DPU reboot#771

Open
vvolam wants to merge 14 commits into
sonic-net:masterfrom
vvolam:fix-reboot-cause-config-reload
Open

[smartswitch][chassisd] Fix missing DPU reboot cause when config-reload interrupts DPU reboot#771
vvolam wants to merge 14 commits into
sonic-net:masterfrom
vvolam:fix-reboot-cause-config-reload

Conversation

@vvolam
Copy link
Copy Markdown
Contributor

@vvolam vvolam commented Mar 10, 2026

Description

Refactor DPU reboot-cause tracking in SmartSwitchModuleUpdater to correctly handle all operational-status transitions, including edge cases around config reload and chassisd restarts.

Key changes:

  • Extract _handle_module_status_change() from module_db_update() for clearer transition handling
  • Defer get_reboot_cause queries until the DPU is online — querying a powered-off DPU is unreliable (fixes #26254)
  • On Known→Offline, persist the reboot execution time as fallback (if module_base.py has not already created it) and defer the cause query
  • On EMPTY→Offline (chassisd restart while DPU is down), defer entirely until the DPU comes back online, then persist only if the cause actually changed
  • Short-circuit EMPTY→ONLINE with no deferred state — just repopulate CHASSIS_STATE_DB from on-disk history (no reboot happened, chassisd just restarted). On first boot, persist the initial cause.
  • Replace the arbitrary MAX_DPU_REBOOT_DURATION (800s) timeout with direct timestamp comparison: compare the stored cause timestamp against the reboot execution time from prev_reboot_time.txt to determine whether the cause was already persisted
  • Always repopulate CHASSIS_STATE_DB from on-disk history after chassisd restart, even when no new cause is persisted
  • Add persist_dpu_reboot_time_if_needed() fallback: on Online→Offline, if module_base.py has not already created prev_reboot_time.txt for this reboot cycle (detected by comparing file time vs stored cause time), chassisd writes it. This covers spontaneous DPU crashes/kernel panics where no command invocation triggers module_base.py.
  • Change log_info to log_notice for reboot-cause persistence messages

Transition Matrix

Transition Condition Action
ONLINE→OFFLINE Known, non-offline → offline Persist reboot time (fallback if module_base.py has not), defer cause query
EMPTY→OFFLINE Unknown → offline Defer cause query (no I/O)
EMPTY→ONLINE (no deferred) First boot Persist initial cause + repopulate DB
EMPTY→ONLINE (no deferred) Not first boot Repopulate DB only (no reboot happened)
OFFLINE→ONLINE (deferred=reboot) stored_time >= reboot_time Skip persist, repopulate DB
OFFLINE→ONLINE (deferred=reboot) stored_time < reboot_time Persist new cause + repopulate DB
OFFLINE→ONLINE (deferred=restart) Cause changed Persist new cause + repopulate DB
OFFLINE→ONLINE (deferred=restart) Cause unchanged / first boot Repopulate DB only

prev_reboot_time.txt generation

Primary source: module_base.py writes prev_reboot_time.txt at command invocation time (via sonic-net/sonic-platform-common#680 and sonic-net/sonic-utilities#4571).

Fallback (this PR): On Online→Offline transition, persist_dpu_reboot_time_if_needed() checks whether module_base.py has already created the file for this reboot cycle by comparing the file timestamp against the stored cause timestamp. If the file is missing or stale, chassisd writes the current time. This ensures spontaneous DPU crashes (no CLI command) still get a reboot time marker.

Motivation and Context

Fixes sonic-net/sonic-buildimage#24275
Fixes sonic-net/sonic-buildimage#26254

On SmartSwitch platforms, issuing config reload shortly after reboot -d DPU0 causes chassisd to restart. After restart, STATE_DB is empty so chassisd sees EMPTY→Offline instead of Online→Offline, and the DPU reboot cause was never recorded. Additionally, querying get_reboot_cause on a powered-off DPU is unreliable and was causing spurious updates.

Preceding PRs: sonic-net/sonic-platform-common#680
sonic-net/sonic-utilities#4571

How Has This Been Tested?

  1. Unit tests — 63 tests in sonic-chassisd/tests/test_chassisd.py, all passing:

    • test_smartswitch_moduleupdater_status_transitions — full ONLINE→OFFLINE→ONLINE cycle
    • test_admin_shutdown_does_not_query_reboot_cause — verifies get_reboot_cause NOT called while DPU is off
    • test_persist_dpu_reboot_time_if_needed_creates_file_when_missing — fallback writes when file missing
    • test_persist_dpu_reboot_time_if_needed_skips_when_file_newer — skips when module_base.py already wrote
    • test_persist_dpu_reboot_time_if_needed_writes_when_file_stale — writes when file is stale from prior reboot
    • test_online_transition_skips_reboot_update — EMPTY→ONLINE repopulates DB without persisting
    • test_empty_to_online_first_boot_persists — first boot persists initial cause
    • test_empty_to_online_no_spurious_persist — regression test for config reload spurious update
    • test_deferred_reboot_skips_already_persisted — stored_time >= reboot_time → skip
    • test_deferred_reboot_persists_new_reboot — stored_time < reboot_time → persist
    • test_empty_to_offline_persists_changed_reboot_cause — deferred restart with changed cause
    • test_empty_to_offline_skips_same_reboot_cause — deferred restart with same cause, DB repopulated
    • test_extract_cause_tuple — helper extracts scalar from tuple/list/string
  2. SmartSwitch hardware validation on NVIDIA SS:

    • Ran reboot -d DPU0 followed by config reload -y after 10 seconds
    • With original code: reboot cause was stale (bug reproduced)
    • With patched code: new reboot cause entry correctly recorded
  3. Testbed verification scriptsonic-chassisd/tests/testbed/test_dpu_reboot_cause_persistence.sh: end-to-end script covering all 6 DPU reboot-cause transition scenarios on real hardware (normal reboot, config reload during reboot, admin shutdown, history preservation, same-cause deferred check, back-to-back rapid reboot).

Additional Information (Optional)

The fix only applies to SmartSwitchModuleUpdater (SmartSwitch/DPU platforms). It does not affect line card or supervisor module updater logic.

When chassisd restarts (e.g. during config reload) while a DPU is
offline, STATE_DB is flushed so the previous operational status is
EMPTY. The existing code only persists reboot cause on a non-EMPTY
non-Offline to Offline transition, causing the new reboot cause to
be lost if it changed while chassisd was down.

Add handling for the EMPTY-to-Offline case: compare the current
reboot cause from the platform with the last persisted one, and
record it if they differ. Skip the check on first boot (no
previous-reboot-cause.json) to avoid duplicates.

Fixes: sonic-net/sonic-buildimage#24275
Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam vvolam changed the title [chassisd] Persist DPU reboot cause on EMPTY-to-Offline transition [smartswitch][chassisd] Persist DPU reboot cause on EMPTY-to-Offline transition Mar 10, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates sonic-chassisd SmartSwitch (DPU) handling to ensure DPU reboot causes are persisted even when chassisd restarts and STATE_DB has been flushed, covering the previously missed EMPTY → Offline transition scenario.

Changes:

  • Add EMPTY → Offline transition logic in SmartSwitchModuleUpdater.module_db_update() to compare platform reboot cause vs. last persisted cause and persist when changed.
  • Add unit tests validating persist/skip behavior for EMPTY → Offline based on cause changes and first-boot conditions.
  • Minor whitespace/formatting cleanup in existing code/tests.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
sonic-chassisd/scripts/chassisd Implements EMPTY → Offline reboot-cause persistence logic for SmartSwitch DPUs after STATE_DB flush/restart.
sonic-chassisd/tests/test_chassisd.py Adds regression tests for the new transition handling and expected persistence/skip behavior.

Comment thread sonic-chassisd/tests/test_chassisd.py
Comment thread sonic-chassisd/scripts/chassisd Outdated
Do not call get_reboot_cause while a DPU is powered off / offline.
Instead, persist only the reboot time on the Known->Offline transition
and defer the reboot-cause query until the DPU comes back online.

For EMPTY->Offline (chassisd restart while DPU is down), defer entirely
and only persist if the cause actually changed once the DPU is online.

Remove the now-unused _record_reboot helper and its tests.

Fixes: sonic-net/sonic-buildimage#26254
Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

The deferred=='reboot' path (Online->Offline->Online) was unconditionally
persisting the reboot cause.  For back-to-back reboots where the cause
and timestamp are within MAX_DPU_REBOOT_DURATION, this created duplicate
history entries.

Add is_reboot detection to the deferred=='reboot' path: if the current
cause matches the stored cause and the time delta is within the threshold,
skip persist_dpu_reboot_cause but still update the DB to keep it in sync.

Add unit test test_deferred_reboot_skips_back_to_back_same_cause to cover
the new behavior.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@gpunathilell
Copy link
Copy Markdown
Contributor

@vvolam would this add a reboot cause for first boot of the system?

Comment thread sonic-chassisd/scripts/chassisd Outdated
@vvolam
Copy link
Copy Markdown
Contributor Author

vvolam commented Apr 8, 2026

@vvolam would this add a reboot cause for first boot of the system?

@gpunathilell No, first boot is safe. In the deferred == 'restart' path, the code checks stored_cause is not None before persisting. On first boot there's no previous-reboot-cause.json, so retrieve_dpu_reboot_info() returns (None, None) — the guard fails and we skip persistence entirely.

This is covered by test_empty_to_offline_skips_first_boot in the test file.

@vvolam
Copy link
Copy Markdown
Contributor Author

vvolam commented Apr 8, 2026

test_ss_module_db_update.sh
@gpunathilell I have executed this test script to verify all scenarios on the smartswitch

Comment thread sonic-chassisd/scripts/chassisd Outdated
@chartsai-nvidia
Copy link
Copy Markdown

test_ss_module_db_update.sh

Manually patched chassisd and ran the tests, confirmed all tests passed.
test_script_changes_diff.txt

@chartsai-nvidia
Copy link
Copy Markdown

The fix introduces new bugs: the reboot-cause result is unstable.

After running config reload, DPU0 didn't reboot but the reboot-cause was updated.
Before:

Device    Name                 Cause         Time                             User
--------  -------------------  ------------  -------------------------------  ------
NPU       2026_05_01_20_37_23  reboot        Fri May  1 08:32:15 PM UTC 2026  admin
DPU3      2026_05_04_20_53_07  Non-Hardware  Mon May 04 08:53:07 PM UTC 2026
DPU2      2026_05_04_20_53_06  Non-Hardware  Mon May 04 08:53:06 PM UTC 2026
DPU1      2026_05_04_20_53_06  Non-Hardware  Mon May 04 08:53:06 PM UTC 2026
DPU0      2026_05_02_06_45_53  Non-Hardware  **Sat May 02 06:45:53 AM UTC 2026**

After:

Device    Name                 Cause         Time                             User
--------  -------------------  ------------  -------------------------------  ------
NPU       2026_05_01_20_37_23  reboot        Fri May  1 08:32:15 PM UTC 2026  admin
DPU3      2026_05_04_20_53_07  Non-Hardware  Mon May 04 08:53:07 PM UTC 2026
DPU2      2026_05_04_20_53_06  Non-Hardware  Mon May 04 08:53:06 PM UTC 2026
DPU1      2026_05_04_20_53_06  Non-Hardware  Mon May 04 08:53:06 PM UTC 2026
DPU0      2026_05_04_21_05_08  Non-Hardware  **Mon May 04 09:05:08 PM UTC 2026**

Then runreboot -d DPU0 and config reload again, all other DPUs was updated.

Device    Name                 Cause         Time                             User
--------  -------------------  ------------  -------------------------------  ------
NPU       2026_05_01_20_37_23  reboot        Fri May  1 08:32:15 PM UTC 2026  admin
DPU3      2026_05_04_21_22_45  Non-Hardware  Mon May 04 09:22:45 PM UTC 2026
DPU2      2026_05_04_21_22_44  Non-Hardware  Mon May 04 09:22:44 PM UTC 2026
DPU1      2026_05_04_20_53_37  Non-Hardware  Mon May 04 08:53:37 PM UTC 2026
DPU0      2026_05_04_21_18_30  Non-Hardware  Mon May 04 09:18:30 PM UTC 2026

And then run reboot -d DPU0, the DPU0 didn't update.

admin@bobcat-1250:~$ show reboot-cause all
Device    Name                 Cause         Time                             User
--------  -------------------  ------------  -------------------------------  ------
NPU       2026_05_01_20_37_23  reboot        Fri May  1 08:32:15 PM UTC 2026  admin
DPU3      2026_05_04_21_22_45  Non-Hardware  Mon May 04 09:22:45 PM UTC 2026
DPU2      2026_05_04_21_22_44  Non-Hardware  Mon May 04 09:22:44 PM UTC 2026
DPU1      2026_05_04_20_53_37  Non-Hardware  Mon May 04 08:53:37 PM UTC 2026
DPU0      2026_05_04_21_18_30  Non-Hardware  Mon May 04 09:18:30 PM UTC 2026

@gpunathilell
Copy link
Copy Markdown
Contributor

@vvolam there is still a race condition in this implementation - the reboot -d DPUX is executed but the chassisd is restarted before the reboot_time file is created right i.e. the chassisd has to detect that the dpu is being powered off and only then this would work right? can we have the implementation for this in the module_base implementation (i.e. for admin_state down and the reboot function calls?) this way chassisd is not waiting until dpu is powered off to create the file. This makes it more easier to control, and we can have the reboot cause time aligned with when the user executed the command, not sure if we want to cover as bug fix for this PR. We can discuss offline if requried

@vvolam
Copy link
Copy Markdown
Contributor Author

vvolam commented May 26, 2026

@gpunathilell Good catch — yes, there's a narrow race if config reload happens within the ~10s poll interval before chassisd detects Online→Offline and writes prev_reboot_time.txt. In that case the timestamp-based comparison can't fire and same-cause reboots would be missed.

Agree that writing the timestamp at the platform layer (module_base.reboot() / set_admin_state) is the right long-term fix — it decouples from chassisd's poll timing and aligns the timestamp with the actual user action.

Since this is a platform-API-level change affecting vendor implementations, I'd prefer to track it as a follow-up rather than expanding this PR's scope. Will file a separate issue. The current PR already covers the common case (chassisd detects the transition before config reload), and the remaining gap is the narrow same-cause + config-reload-within-10s edge case.

@gpunathilell
Copy link
Copy Markdown
Contributor

@vvolam its not narrow, its the complete offline cycle, assuming reboot -d DPUX is executed, chassisd does not detect dpu offline until dpu is shutdown, so technically the window is time taken for DPU to actually go offline + ~10 seconds. But I am okay to have a seperate PR, we can use the same issue to track it: sonic-net/sonic-buildimage#24275, we can close the issue sonic-net/sonic-buildimage#26254 based on this pr.

@vvolam
Copy link
Copy Markdown
Contributor Author

vvolam commented May 27, 2026

@vvolam its not narrow, its the complete offline cycle, assuming reboot -d DPUX is executed, chassisd does not detect dpu offline until dpu is shutdown, so technically the window is time taken for DPU to actually go offline + ~10 seconds. But I am okay to have a seperate PR, we can use the same issue to track it: sonic-net/sonic-buildimage#24275, we can close the issue sonic-net/sonic-buildimage#26254 based on this pr.

@gpunathilell I have also raised sonic-net/sonic-platform-common#680 to complete the scenario

@gpunathilell
Copy link
Copy Markdown
Contributor

@vvolam please remove the prev_reboot_time.txt file generation from this file, so that we only have one source for the file creation (from module_base.py) other changes look good. @chartsai-nvidia will test and once we confirm, will approve the PR

prev_reboot_time.txt is now exclusively written by module_base.py
at command invocation time (reboot/set_admin_state), ensuring a
single source of truth and eliminating the race window between
DPU power-off and chassisd's poll detection.

Depends-On: sonic-net/sonic-platform-common#680
Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam
Copy link
Copy Markdown
Contributor Author

vvolam commented May 28, 2026

@vvolam please remove the prev_reboot_time.txt file generation from this file, so that we only have one source for the file creation (from module_base.py) other changes look good. @chartsai-nvidia will test and once we confirm, will approve the PR

@gpunathilell Thank you, I have addressed the change.

@rameshraghupathy
Copy link
Copy Markdown
Contributor

@vvolam I still think we are mixing two different issues here.

The config-reload issue is clear: chassisd restarts while DPU reboot is in progress, STATE_DB is empty, and the transition becomes EMPTY→Offline instead of Online→Offline. That missing transition can be handled directly in SmartSwitchModuleUpdater.module_db_update(), where we already read prev_status via get_module_current_status(key), update the module table, and compare previous/current operational state.

IMO the fix should be limited to adding EMPTY→Offline handling in the existing transition logic.

I do not agree with changing the broader reboot-cause model to defer get_reboot_cause until the DPU comes back online. The current Online→Offline behavior is intentional. Reboot cause is latched/mirrored through the NPU path specifically so it can be retrieved even when the DPU is offline or dead. That is required for debugging unreachable DPUs and is consistent with the existing design.

So the concern is not whether the new deferred flow can be validated. The concern is that it changes the design contract. If a DPU never comes back online, deferring reboot-cause collection defeats the purpose of the latched reboot-cause mechanism.

Can we keep the existing Online→Offline reboot-cause persistence behavior unchanged and only add the missing EMPTY→Offline/config-reload case as a minimal targeted change?

and prev_status != str(ModuleBase.MODULE_STATUS_OFFLINE)
and current_status == str(ModuleBase.MODULE_STATUS_OFFLINE))

def _is_empty_to_offline(self, prev_status, current_status):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvolum IMO this can be handled with a minimal addition in the existing module_db_update() logic itself.

Today, module_db_update() already:

  • reads prev_status using get_module_current_status(),
  • updates the module table,
  • and compares previous/current oper-status for transition handling.

The new EMPTY→Offline case is simply another transition condition in that same flow.

The helper functions introduced in this PR ultimately still reduce to checking:

prev_status == EMPTY and current_status == Offline

That does not require introducing a new transition framework, deferred reboot-cause processing, or restructuring the existing reboot-cause handling model.

A couple of targeted lines in the existing transition handling path should be sufficient while preserving the current Online→Offline reboot-cause persistence semantics and overall design behavior.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy after our discussion in the meeting offline, do you have any other suggestions?

@rameshraghupathy
Copy link
Copy Markdown
Contributor

@vvolam I am not against checking/reading reboot-cause after the DPU powers back on.

The important requirement is that the reboot-cause event must be latched/recorded when the DPU transitions from a known non-Offline state to Offline. That transition is the point where we know the DPU went down, and the DPU may never come back online.

Reading or reconciling reboot-cause later after power-on is fine as an additional step, but it should not replace the existing down-transition latching behavior.

So the design should be:

  1. On known non-Offline → Offline:
    latch/persist the reboot-cause event or at least preserve the latched reboot-cause context.

  2. On Offline/Empty → Online:
    optionally read/reconcile/update reboot-cause if needed.

But the Online path must not be the only place where reboot-cause is captured, because that loses the dead-DPU debugging use case.

@gpunathilell
Copy link
Copy Markdown
Contributor

@vvolam I am not against checking/reading reboot-cause after the DPU powers back on.

The important requirement is that the reboot-cause event must be latched/recorded when the DPU transitions from a known non-Offline state to Offline. That transition is the point where we know the DPU went down, and the DPU may never come back online.

Reading or reconciling reboot-cause later after power-on is fine as an additional step, but it should not replace the existing down-transition latching behavior.

So the design should be:

  1. On known non-Offline → Offline:
    latch/persist the reboot-cause event or at least preserve the latched reboot-cause context.
  2. On Offline/Empty → Online:
    optionally read/reconcile/update reboot-cause if needed.

But the Online path must not be the only place where reboot-cause is captured, because that loses the dead-DPU debugging use case.

This is a platform specific handling and not the use-case for reboot cause, the get_reboot_cause is used for strictly obtaining the previous reboot cause only so it is implied that is checked when DPU is online, Retrieves the cause of the previous reboot of the DPU module as the docstring mentions, so it does not make sense to call it when the DPU is going from any state to Offline. If you want to save/record something while the DPU is powered off (so that it can be used later for actually determining the reboot cause) we can have a callback implemented for operational_status_offline which the platform can override and have its own implementation. you can use that to save information. and use it later when the get_reboot_cause is called. Would that work? @vvolam for viz

@rameshraghupathy
Copy link
Copy Markdown
Contributor

@vvolam I am not against checking/reading reboot-cause after the DPU powers back on.
The important requirement is that the reboot-cause event must be latched/recorded when the DPU transitions from a known non-Offline state to Offline. That transition is the point where we know the DPU went down, and the DPU may never come back online.
Reading or reconciling reboot-cause later after power-on is fine as an additional step, but it should not replace the existing down-transition latching behavior.
So the design should be:

  1. On known non-Offline → Offline:
    latch/persist the reboot-cause event or at least preserve the latched reboot-cause context.
  2. On Offline/Empty → Online:
    optionally read/reconcile/update reboot-cause if needed.

But the Online path must not be the only place where reboot-cause is captured, because that loses the dead-DPU debugging use case.

This is a platform specific handling and not the use-case for reboot cause, the get_reboot_cause is used for strictly obtaining the previous reboot cause only so it is implied that is checked when DPU is online, Retrieves the cause of the previous reboot of the DPU module as the docstring mentions, so it does not make sense to call it when the DPU is going from any state to Offline. If you want to save/record something while the DPU is powered off (so that it can be used later for actually determining the reboot cause) we can have a callback implemented for operational_status_offline which the platform can override and have its own implementation. you can use that to save information. and use it later when the get_reboot_cause is called. Would that work? @vvolam for viz

Thanks Gagan. I understand the generic interpretation of get_reboot_cause() as retrieving the previous reboot cause, and for many platforms that may naturally imply reading it when the module is online.

However, SmartSwitch has an explicit platform requirement that is different from a generic module model. The SmartSwitch PMON HLD says the NPU hardware should be capable of providing the DPU reboot-cause even when the DPUs are dead, and get_reboot_cause returns the current reboot-cause of the module.

So for SmartSwitch, calling get_reboot_cause while the DPU is offline is intentional platform-specific behavior, not an accidental misuse of the API.

I am okay with adding a platform callback like operational_status_offline if we want to make this cleaner architecturally. But the callback must preserve the existing behavior: when the DPU transitions from known non-Offline → Offline, the platform must be able to latch/persist the reboot-cause context immediately.

What I do not think is safe is changing the behavior so reboot-cause is only captured after Offline/Empty → Online. That loses the key SmartSwitch requirement where the DPU may be dead or may never come back online.

Also, the config-reload issue can still be fixed minimally by detecting EMPTY→Offline in the existing module_db_update() transition logic. A larger deferred reboot-cause framework should not be required for that narrow issue.

@rameshraghupathy
Copy link
Copy Markdown
Contributor

rameshraghupathy commented May 29, 2026

@vvolam I tested this on a 202511 SmartSwitch image where reboot -d DPUx works with the original 202511 chassisd.

When I replaced only chassisd with the master chassisd containing the #771 changes, reboot -d DPUx started failing. The reboot path timed out and invoked fallback powercycle:

Pre-shutdown: timed out waiting for DPU 
reboot: status already failed for DPU2

This reinforces my concern that #771 is not a safe/minimal fix.

The old chassisd persists reboot time and reboot cause immediately on known non-Offline→Offline transition. The #771 code changes that behavior and only records a deferred pending reboot check. That changes the established down-transition reboot-cause model.

For the config-reload issue, I still think the fix should be minimal:

  • preserve existing known non-Offline→Offline behavior unchanged,
  • add only the missing EMPTY→Offline handling in module_db_update(),
  • avoid deferring or restructuring the normal down-transition reboot-cause path.

Given this test result, I do not think #771 should be merged in its current form.

The key counterpoint is already in the SmartSwitch PMON HLD:

“The NPU hardware should be capable of providing the DPU reboot-cause even when the DPUs are dead.”
“The get_reboot_cause will return the current reboot-cause of the module.”

So for SmartSwitch, this is not just “previous reboot cause when DPU is online.” The platform was explicitly designed to expose reboot cause even when the DPU is dead/offline.

@rameshraghupathy rameshraghupathy self-requested a review May 29, 2026 13:18
@gpunathilell
Copy link
Copy Markdown
Contributor

@rameshraghupathy I have a few concerns with that solution. I tested this on a smartswitch and I think there are three issues we are leaving unresolved.

  1. 800 seconds timeout doesnt work reliably - this is the only check separating "same reboot we already persisted" from "different reboot we need to record" in the Offline→Online and EMPTY→Online paths. Two problems with it:

    • A real second reboot of the same cause within 800s gets dropped (the bug @chartsai-nvidia reported, partially addressed by 1627248).
    • A spurious reboot cause entry gets created on config reload (verified on smartswitch). DPU was online, ran sudo config reload -y, a new entry appeared in /host/reboot-cause/module/dpuX/history with the current timestamp and the symlink moved to it, even though no reboot happened. The path: STATE_DB flush → new chassisd sees prev_status=EMPTY, current_status=Online → enters EMPTY→Online → previously persisted entry is older than 800s so is_reboot=True → persist runs and writes a fake entry.
  2. Reboot cause gets missed when config reload is executed - this is the original bug this PR was expected to solve. If chassisd is restarted via config reload while a DPU reboot is in progress, STATE_DB is flushed and chassisd loses track of the in-progress reboot, so the actual reboot cause is not recorded.

  3. Two reboot causes get persisted for one reboot if the offline and online queries return different strings - get_reboot_cause is called twice per reboot cycle (line 807 on Online→Offline, line 815 on Offline→Online). The dedup at lines 823-832 only fires when both cause strings match AND delta < 800s. If get_reboot_cause returns anything different across the two calls, we end up with two history entries with two different timestamps for the same reboot.

@rameshraghupathy the "only add EMPTY→Offline handling, leave everything else as is" idea doesnt resolve any of these. The Online→Offline persist you want to preserve is exactly what creates #3 - the cause we capture at offline transition is the cause of the PREVIOUS reboot (per the docstring Retrieves the cause of the previous reboot of the DPU module), not the reboot currently in progress. We then re-query on the online side, may get a different string back, and persist a second time.

If we want to capture state at the moment the DPU goes offline (for dead-DPU debugging), thats fine - but its not "reboot cause", its an "operational state change" event. That needs a separate API (an oper-state-change callback, or a different function), not overloading get_reboot_cause for both. The reboot cause of one reboot event should be one value, captured when its authoritative (DPU online, we can still have the latching mechanism you mentioned).

I can suggest a design that solves all three - we can discuss in the meeting

@vvolam
Copy link
Copy Markdown
Contributor Author

vvolam commented May 29, 2026

@vvolam I tested this on a 202511 SmartSwitch image where reboot -d DPUx works with the original 202511 chassisd.

When I replaced only chassisd with the master chassisd containing the #771 changes, reboot -d DPUx started failing. The reboot path timed out and invoked fallback powercycle:

Pre-shutdown: timed out waiting for DPU 
reboot: status already failed for DPU2

This reinforces my concern that #771 is not a safe/minimal fix.

The old chassisd persists reboot time and reboot cause immediately on known non-Offline→Offline transition. The #771 code changes that behavior and only records a deferred pending reboot check. That changes the established down-transition reboot-cause model.

For the config-reload issue, I still think the fix should be minimal:

  • preserve existing known non-Offline→Offline behavior unchanged,
  • add only the missing EMPTY→Offline handling in module_db_update(),
  • avoid deferring or restructuring the normal down-transition reboot-cause path.

Given this test result, I do not think #771 should be merged in its current form.

The key counterpoint is already in the SmartSwitch PMON HLD:

“The NPU hardware should be capable of providing the DPU reboot-cause even when the DPUs are dead.” “The get_reboot_cause will return the current reboot-cause of the module.”

So for SmartSwitch, this is not just “previous reboot cause when DPU is online.” The platform was explicitly designed to expose reboot cause even when the DPU is dead/offline.

@rameshraghupathy there are preceding PRs: sonic-net/sonic-platform-common#680
sonic-net/sonic-utilities#4571. You have to test all 3 PRs together to have previous behavior

@vvolam Yes, with #680 and sonic-utilities#4571 patched, the specific command-initiated test may pass because reboot -d DPUx will write prev_reboot_time.txt before config reload.

But that only covers command-initiated reboot/shutdown. It does not address spontaneous DPU crash/kernel panic/dead-DPU cases where there is no module_base.py command invocation.

I am still concerned that #771 removes chassisd as a writer of prev_reboot_time.txt. The original #680 description even said chassisd write is preserved as fallback for spontaneous DPU crashes. But the later #771 commit changes that and makes module_base.py the exclusive writer.

So passing the config-reload test with all three PRs does not prove the design is safe. It only proves the command-path race is covered.

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam vvolam changed the title [smartswitch][chassisd] Persist DPU reboot cause on EMPTY-to-Offline transition [smartswitch][chassisd] Fix missing DPU reboot cause when config-reload interrupts DPU reboot May 29, 2026
…us DPU crashes

When module_base.py has not created prev_reboot_time.txt (e.g. spontaneous
DPU crash/kernel panic with no command invocation), chassisd generates it
as a fallback on the Online->Offline transition.

Detection: compare the time in prev_reboot_time.txt with the stored reboot
cause timestamp from previous-reboot-cause.json. If the file does not exist
or its time is <= the stored cause time (stale from a prior reboot), write
the current time. If module_base.py already created it (file time > stored
cause time), skip to avoid overwriting the more accurate command-time.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
@vvolam vvolam force-pushed the fix-reboot-cause-config-reload branch from 251151a to 5403e81 Compare May 29, 2026 22:01
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

The DPU reboot-cause testbed verification script is decoupled into
a separate PR to keep this change focused on the core logic fix.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

8 participants