Skip to content

[smartswitch] Use dpu_halt_services_timeout from platform.json, fallback to HALT_TIMEOUT=60#376

Merged
vvolam merged 11 commits into
sonic-net:masterfrom
rameshraghupathy:fix-dpu-halt-timeout
May 28, 2026
Merged

[smartswitch] Use dpu_halt_services_timeout from platform.json, fallback to HALT_TIMEOUT=60#376
vvolam merged 11 commits into
sonic-net:masterfrom
rameshraghupathy:fix-dpu-halt-timeout

Conversation

@rameshraghupathy
Copy link
Copy Markdown
Contributor

@rameshraghupathy rameshraghupathy commented Apr 16, 2026

Description of PR

Summary:
On SmartSwitch platforms, HALT reboot may require additional time for DPU services and containers to terminate cleanly before the reboot sequence is considered complete.

The current implementation uses a fixed HALT_TIMEOUT=60 seconds, which may not be sufficient on all platforms and hardware combinations.

This change makes the HALT timeout configurable through platform.json using:

"dpu_halt_services_timeout"

while preserving the existing behavior by falling back to:

HALT_TIMEOUT = 60

when:

the key is missing,
the value is null,
the value is invalid,
the file cannot be read,
or the configured value is non-positive.

Type of change

  • [ x] Testbed and Framework(new/improvement)

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

How did you do it

Added helper function get_dpu_halt_services_timeout()
Read dpu_halt_services_timeout from platform.json
Added validation to ensure only positive timeout values are accepted
Added informational logging when fallback timeout is used
Updated HALT reboot flow to use platform-configured timeout
Added unit test coverage for:
missing key in platform.json
stable mocking of timeout helper in HALT reboot tests
Default behavior

If dpu_halt_services_timeout is not configured, the existing default behavior remains unchanged:

HALT_TIMEOUT = 60

How did you verify/test it?

normal HALT reboot flow
HALT timeout failure flow
missing platform.json key handling
invalid/unreadable timeout fallback handling
existing reboot UTs

…ack to HALT_TIMEOUT=60

Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Contributor

@vvolam vvolam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy Changes are fine. Could you fix the coverage?

@rameshraghupathy
Copy link
Copy Markdown
Contributor Author

@rameshraghupathy Changes are fine. Could you fix the coverage?

@vvolam Sure, will do.

Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam
Copy link
Copy Markdown
Contributor

vvolam commented Apr 21, 2026

@rameshraghupathy please fix PR description and use PR template

Copy link
Copy Markdown
Contributor

@vvolam vvolam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1. PR description contradicts the code

The description says "increasing the timeout to 120s for now" but the code reads dpu_halt_services_timeout from platform.json with a 60s fallback — it doesn't hardcode 120. Please update the PR description to accurately explain the platform.json lookup mechanism and the fallback behavior. Also please use the PR template.
@vvolam Fixed it

Comment thread host_modules/reboot.py Outdated
Comment thread host_modules/reboot.py
Comment thread tests/host_modules/reboot_test.py
Comment thread host_modules/reboot.py
@vvolam
Copy link
Copy Markdown
Contributor

vvolam commented Apr 30, 2026

@rameshraghupathy could you address the comments?

@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Contributor

@vvolam vvolam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thanks

@vvolam vvolam merged commit 72fd0b9 into sonic-net:master May 28, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants