
Add ssd health pre-check for warm-reboot#4086

Open
byu343 wants to merge 2 commits into
sonic-net:masterfrom
byu343:ssd-check-for-warm-reboot

Conversation


@byu343 byu343 commented Oct 7, 2025

What I did

Check SSD health using ssdutil before warm-reboot

How I did it

Check the health of the SSD based on the output of ssdutil. Stop the warm-reboot early if the reported health value is 0.

How to verify it

The check is skipped if the ssdutil command returns an error.
The added lines correctly parse ssdutil output in the format "Health : X%" or "Health : X.Y%".
The check blocks fast-reboot/warm-reboot if the extracted health value is 0.
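
The parsing behaviour described above can be illustrated with a minimal sketch. It is not the code from this PR: the helper names `extract_ssd_health` and `ssd_health_is_zero` are hypothetical, and the sketch uses the POSIX class `[[:space:]]` in place of the `\s` shorthand so that it does not depend on GNU grep extensions.

```shell
#!/bin/bash
# Hypothetical helper (not from the PR): print the numeric health value found
# in ssdutil-style output, e.g. "97" or "99.5". Prints nothing if no line matches.
extract_ssd_health() {
    local output="$1"
    local line
    line=$(printf '%s\n' "$output" \
        | grep -E 'Health[[:space:]]*:[[:space:]]*[0-9]+\.?[0-9]*%' \
        | head -n 1 || true)
    [ -n "$line" ] || return 0
    printf '%s\n' "$line" \
        | sed -E 's/.*Health[[:space:]]*:[[:space:]]*([0-9]+\.?[0-9]*)%.*/\1/'
}

# Hypothetical helper: succeed only when the value is present and numerically
# zero. awk does the comparison so that "0" and "0.0" are treated the same.
ssd_health_is_zero() {
    local value="$1"
    [ -n "$value" ] || return 1
    awk -v v="$value" 'BEGIN { exit !(v + 0 == 0) }'
}
```

An empty result from `extract_ssd_health` corresponds to the "skip the check" case, so a caller would only block the reboot when `ssd_health_is_zero` succeeds.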

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@vaibhavhd vaibhavhd requested a review from judyjoseph October 7, 2025 19:38
Comment thread scripts/fast-reboot Outdated
    debug "SSD Health is $health_value% — OK."
else
    error "Warning: Health is $health_value% — Possible drive failure!"
    exit "${EXIT_SDD_HEALTH_FAILURE}"

From Boyang: this is a very basic, minimal check that can be done to prevent the issue.

The core idea is to collect debug information to then implement a solution to really prevent these issues.

Comment thread scripts/fast-reboot
# Check SSD health
if [ -x "${SSD_UTIL}" ]; then
    debug "Checking ssd health before ${REBOOT_TYPE}..."
    health_line=$(${SSD_UTIL} | grep -E "Health\s*:\s*[0-9]+\.?[0-9]*%" || true)

  1. What is the runtime of this new utility? Could it considerably extend the overall warm-reboot runtime?


  1. What is the runtime of this new utility? Could it considerably extend the overall warm-reboot runtime?

Hi, this utility is equivalent to show platform ssdhealth and takes less than a second to run.

Comment thread scripts/fast-reboot
PLATFORM=$(sonic-cfggen -H -v DEVICE_METADATA.localhost.platform)
PLATFORM_PLUGIN="${REBOOT_TYPE}_plugin"
LOG_SSD_HEALTH="/usr/local/bin/log_ssd_health"
SSD_UTIL="/usr/local/bin/ssdutil"

Should we backport these changes?

Comment thread scripts/fast-reboot Outdated

check_db_integrity

# Check SSD health

Can you wrap this in a function, as is done for check_pfc_storm_active, and call it directly below the check_pfc_storm_active call in reboot_pre_check? That will help with ongoing maintenance.

Contributor Author

This is fixed now.
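
For readers following this thread, the requested refactor might look roughly like the sketch below. It is an illustration under stated assumptions, not the code that was pushed: the function name `check_ssd_health`, the stand-in `debug` and `error` helpers, and the numeric value given to `EXIT_SDD_HEALTH_FAILURE` are all hypothetical, and the sketch returns a status where the quoted diff calls `exit`.

```shell
#!/bin/bash
# Stand-ins for the logging helpers used in scripts/fast-reboot (assumed shape).
debug() { echo "DEBUG: $*"; }
error() { echo "ERROR: $*" >&2; }

SSD_UTIL="${SSD_UTIL:-/usr/local/bin/ssdutil}"
REBOOT_TYPE="${REBOOT_TYPE:-warm-reboot}"
EXIT_SDD_HEALTH_FAILURE=25   # placeholder value, not taken from the PR

# Returns 0 when the check passes or is skipped, and the failure code when the
# reported health value is numerically zero.
check_ssd_health() {
    local output health_value
    if [ ! -x "${SSD_UTIL}" ]; then
        debug "ssdutil not available, skipping SSD health check"
        return 0
    fi
    debug "Checking ssd health before ${REBOOT_TYPE}..."
    if ! output=$("${SSD_UTIL}" 2>/dev/null); then
        debug "ssdutil returned an error, skipping SSD health check"
        return 0
    fi
    # Keep only the number from the first "Health : X%" or "Health : X.Y%" line.
    health_value=$(printf '%s\n' "$output" \
        | sed -n -E 's/.*Health[[:space:]]*:[[:space:]]*([0-9]+\.?[0-9]*)%.*/\1/p' \
        | head -n 1)
    if [ -z "$health_value" ]; then
        debug "No health value in ssdutil output, skipping SSD health check"
        return 0
    fi
    if awk -v v="$health_value" 'BEGIN { exit !(v + 0 == 0) }'; then
        error "SSD health is ${health_value}%, possible drive failure"
        return "${EXIT_SDD_HEALTH_FAILURE}"
    fi
    debug "SSD health is ${health_value}%, OK"
}
```

A caller such as reboot_pre_check could then invoke it as `check_ssd_health || exit $?`, which keeps the decision to abort in one place.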

@byu343 byu343 force-pushed the ssd-check-for-warm-reboot branch from 4600d8f to c185913 Compare May 29, 2026 17:59
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Boyang Yu <byu@arista.com>
@byu343 byu343 force-pushed the ssd-check-for-warm-reboot branch from c185913 to f501caa Compare May 29, 2026 18:01
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Boyang Yu <byu@arista.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


5 participants