Write SAI failure state to state DB#4622
Draft
prabhataravind wants to merge 4 commits into
Draft
Conversation
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR replaces orchagent’s in-memory SAI failure health flag with a STATE_DB-backed PROCESS_HEALTH|orchagent status so external components/operators can observe and clear the unhealthy state.
Changes:
- Adds SAI failure health table initialization, read, and write helpers.
- Updates SAI failure handling and orchdaemon polling to use STATE_DB health status.
- Adds CI artifact collection for mock test logs on pipeline failure.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
orchagent/saihelper.h |
Declares new SAI failure health table helper APIs. |
orchagent/saihelper.cpp |
Implements STATE_DB-backed health status storage and updates failure handling. |
orchagent/orchdaemon.cpp |
Reads health status from STATE_DB during the main daemon loop. |
orchagent/main.cpp |
Initializes and clears the SAI failure health table on startup. |
tests/mock_tests/mock_orchagent_main.cpp |
Removes the obsolete mock gOrchUnhealthy global. |
.azure-pipelines/build-template.yml |
Collects mock test logs as build artifacts on failure. |
000a064 to
e17a446
Compare
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What I did
Replaced the in-memory
gOrchUnhealthyglobal boolean with a STATE_DB-backed health status table (PROCESS_HEALTH), enabling external reset of the orchagent unhealthy flag without restarting the process.Changes:
bool gOrchUnhealthyglobal variable from main.cpp, orchdaemon.cpp, saihelper.cpp, and mock_orchagent_main.cppPROCESS_HEALTHtable in STATE_DB with key orchagent and fieldsunhealthy("true"/"false") anderror(failure description)initSaiFailureTable()— creates the STATE_DB connection and table objectsetSaiFailureStatus()— writes health status to STATE_DB and updates a local cachegetSaiFailureStatus()— returns cached status; only reads STATE_DB when unhealthy (to detect external resets)handleSaiFailure()callssetSaiFailureStatus(true, errorString)instead of setting globalOrchDaemon::start()loop callsgetSaiFailureStatus()inside the existingSELECT_TIMEOUTperiodic block, throttling the check to ~1/secondmain()callsinitSaiFailureTable()+setSaiFailureStatus(false)after option parsing, immediately before SAI initializationinitSaiFailureTable/setSaiFailureStatus/getSaiFailureStatusround-trip, including external reset detectionWhy I did it
The
gOrchUnhealthyflag had no reset path — once set by a non-fatal SAI failure, it stayedtrueforever, causing the orchdaemon loop to log the error every ~1 second until the process was restarted. Moving the flag to STATE_DB allows operators to clear it externally:This also makes orchagent health status observable by other SONiC components via standard DB subscriptions.
How I verified it
gOrchUnhealthyin the codebasesetSaiFailureStatus()is only called on discrete events (startup and SAI failures), not in the polling loopgetSaiFailureStatus()uses a local cache (gOrchUnhealthyCached) so the healthy path incurs zero Redis I/O; the unhealthy path uses a singleHGETALL(one Redis round-trip) to fetch both fieldsOrchDaemon::start()is inside theSELECT_TIMEOUTguard, so it runs at most once per second even when the select loop spins (e.g., after a SAI failure leavesm_readynon-empty)orchagent -hdoes not require RedisDetails if related
PROCESS_HEALTH|orchagentwith fieldsunhealthyanderrorabort_on_failure=falsecall sites inhandleSaiCreateStatus,handleSaiSetStatus,handleSaiRemoveStatusleave orchagent running unhealthy — these are the paths that benefit from external resetSWITCH_CAPABILITYtable in STATE_DB)sleep_foror blocking calls added to the select loop (avoids the issue that caused PR Fix orchagent 100% CPU spin when gOrchUnhealthy is set #4274 to be reverted)