Add ssd health pre-check for warm-reboot#4086
Open
byu343 wants to merge 2 commits into
Collaborator
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
vaibhavhd
reviewed
Oct 7, 2025
debug "SSD Health is $health_value% — OK."
else
error "Warning: Health is $health_value% — Possible drive failure!"
exit "${EXIT_SDD_HEALTH_FAILURE}"
Contributor
From Boyang: this is a very basic, minimal check that can be done to prevent the issue.
The core idea is to collect debug information first, and then implement a solution that really prevents these issues.
vaibhavhd
reviewed
Oct 7, 2025
# Check SSD health
if [ -x "${SSD_UTIL}" ]; then
debug "Checking ssd health before ${REBOOT_TYPE}..."
health_line=$(${SSD_UTIL} | grep -E "Health\s*:\s*[0-9]+\.?[0-9]*%" || true)
Contributor
- What is the runtime of this new utility? Can it extend the overall warm-reboot runtime considerably?
- What is runtime for this new utility? Can it extend the warm-reboot overall runtime considerably?
Hi, this util is equivalent to show platform ssdhealth and takes less than a second to run.
PLATFORM=$(sonic-cfggen -H -v DEVICE_METADATA.localhost.platform)
PLATFORM_PLUGIN="${REBOOT_TYPE}_plugin"
LOG_SSD_HEALTH="/usr/local/bin/log_ssd_health"
SSD_UTIL="/usr/local/bin/ssdutil"
Contributor
Should we backport these changes?
Ryangwaite
requested changes
May 28, 2026
check_db_integrity

# Check SSD health
Contributor
Can you wrap this in a function, like how it is done in check_pfc_storm_active, and call it directly below the check_pfc_storm_active call in reboot_pre_check? That will help the ongoing maintenance of this.
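A minimal sketch of what that refactor could look like, not the code that was actually pushed. The function name check_ssd_health, the stand-in debug/error helpers, the skip-on-missing-output behaviour, and the placeholder exit-code value are all assumptions made for illustration; only SSD_UTIL, REBOOT_TYPE, EXIT_SDD_HEALTH_FAILURE and the grep pattern come from the diff above.

```shell
#!/bin/bash
# Stand-ins for the script's own logging helpers (assumed to exist there).
debug() { echo "DEBUG: $*"; }
error() { echo "ERROR: $*" >&2; }

SSD_UTIL="${SSD_UTIL:-/usr/local/bin/ssdutil}"
REBOOT_TYPE="${REBOOT_TYPE:-warm-reboot}"
# Placeholder value for illustration only; the real code is defined by the script.
EXIT_SDD_HEALTH_FAILURE="${EXIT_SDD_HEALTH_FAILURE:-1}"

# Hypothetical wrapper: returns 0 when the check passes or is skipped,
# and EXIT_SDD_HEALTH_FAILURE when the reported health is 0.
check_ssd_health() {
    local health_line health_value

    if [ ! -x "${SSD_UTIL}" ]; then
        debug "ssdutil not found, skipping SSD health check."
        return 0
    fi

    debug "Checking ssd health before ${REBOOT_TYPE}..."
    # '|| true' keeps a failing ssdutil or a non-matching grep from aborting.
    health_line=$("${SSD_UTIL}" 2>/dev/null \
        | grep -E 'Health[[:space:]]*:[[:space:]]*[0-9]+\.?[0-9]*%' || true)
    if [ -z "${health_line}" ]; then
        debug "Could not read SSD health, skipping check."
        return 0
    fi

    # Keep only the number between the colon and the percent sign.
    health_value=$(echo "${health_line}" | head -n 1 \
        | sed -E 's/.*:[[:space:]]*([0-9]+\.?[0-9]*)%.*/\1/')

    # awk handles both integer ("97") and decimal ("99.5") values.
    if awk -v v="${health_value}" 'BEGIN { exit !(v + 0 > 0) }'; then
        debug "SSD Health is ${health_value}%: OK."
        return 0
    fi
    error "SSD Health is ${health_value}%: possible drive failure!"
    return "${EXIT_SDD_HEALTH_FAILURE}"
}
```

Inside reboot_pre_check, the call would then sit directly below check_pfc_storm_active, for example as check_ssd_health || exit "$?". Returning a status instead of calling exit inside the function is a design choice of this sketch; it makes the function easier to exercise against a fake ssdutil.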
4600d8f to
c185913
Compare
Collaborator
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Signed-off-by: Boyang Yu <byu@arista.com>
c185913 to
f501caa
Compare
Collaborator
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Signed-off-by: Boyang Yu <byu@arista.com>
f501caa to
f1d95c5
Compare
Collaborator
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
What I did
Check SSD health using ssdutil before warm-reboot
How I did it
Check SSD health based on the output of ssdutil. Stop warm-reboot early if the reported health value is 0.
How to verify it
The check will be skipped if the ssdutil command returns an error
The added lines correctly parse ssdutil output in the format "Health : X%" or "Health : X.Y%"
The check will block fast-reboot/warm-reboot if the extracted health number is 0
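The parsing cases listed above can be tried without real hardware. The following is a rough sketch, not the code in this PR: parse_ssd_health and health_is_zero are hypothetical helper names, and the sample input lines are made up to resemble the "Health : X%" format described here.

```shell
#!/bin/bash
# Hypothetical helper: read ssdutil-style text on stdin and print the numeric
# health value (e.g. "95" or "99.5"); prints nothing if no Health line matches.
parse_ssd_health() {
    grep -E 'Health[[:space:]]*:[[:space:]]*[0-9]+\.?[0-9]*%' \
        | head -n 1 \
        | sed -E 's/.*Health[[:space:]]*:[[:space:]]*([0-9]+\.?[0-9]*)%.*/\1/'
}

# Hypothetical helper: succeed only when a non-empty value is numerically zero,
# so that both "0" and "0.0" would block the reboot.
health_is_zero() {
    [ -n "$1" ] && awk -v v="$1" 'BEGIN { exit !(v + 0 == 0) }'
}

# Illustrative inputs covering the integer, decimal, and unparsable cases.
for sample in 'Health : 95%' 'Health : 99.5%' 'Health : 0%' 'Health : N/A'; do
    value=$(printf '%s\n' "$sample" | parse_ssd_health)
    if [ -z "$value" ]; then
        echo "'$sample' -> no value parsed, check skipped"
    elif health_is_zero "$value"; then
        echo "'$sample' -> $value, reboot blocked"
    else
        echo "'$sample' -> $value, reboot allowed"
    fi
done
```

The sketch uses [[:space:]] rather than \s so the pattern does not depend on GNU grep extensions; the PR itself uses \s.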
Previous command output (if the output of a command-line utility has changed)
New command output (if the output of a command-line utility has changed)