Write SAI failure state to state DB by prabhataravind · Pull Request #4622 · sonic-net/sonic-swss

prabhataravind · 2026-05-29T20:50:35Z

What I did

Replaced the in-memory gOrchUnhealthy global boolean with a STATE_DB-backed health status table (PROCESS_HEALTH), enabling external reset of the orchagent unhealthy flag without restarting the process.

Changes:

Removed bool gOrchUnhealthy global variable from main.cpp, orchdaemon.cpp, saihelper.cpp, and mock_orchagent_main.cpp
Added PROCESS_HEALTH table in STATE_DB with key orchagent and fields unhealthy ("true"/"false") and error (failure description)
Added three functions in saihelper.h/saihelper.cpp:
- initSaiFailureTable() — creates the STATE_DB connection and table object
- setSaiFailureStatus() — writes health status to STATE_DB and updates a local cache
- getSaiFailureStatus() — returns cached status; only reads STATE_DB when unhealthy (to detect external resets)
handleSaiFailure() now calls setSaiFailureStatus(true, errorString) instead of setting the global flag
OrchDaemon::start() loop calls getSaiFailureStatus() inside the existing SELECT_TIMEOUT periodic block, throttling the check to ~1/second
main() calls initSaiFailureTable() + setSaiFailureStatus(false) after option parsing, immediately before SAI initialization
Added mock tests for initSaiFailureTable/setSaiFailureStatus/getSaiFailureStatus round-trip, including external reset detection
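The set/get behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the code in saihelper.cpp: `FakeHealthTable` and `SaiFailureStatus` are hypothetical names, the table is an in-memory map standing in for the STATE_DB connection, and keeping the unhealthy state when the `unhealthy` field is missing is an assumption of the sketch.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for the PROCESS_HEALTH|orchagent hash in STATE_DB.
struct FakeHealthTable {
    std::map<std::string, std::string> fields;
    int reads = 0;  // number of simulated Redis round-trips

    void set(const std::string& unhealthy, const std::string& error) {
        fields["unhealthy"] = unhealthy;
        fields["error"] = error;
    }

    // Models a single HGETALL that returns every field at once.
    std::map<std::string, std::string> getAll() {
        ++reads;
        return fields;
    }
};

class SaiFailureStatus {
public:
    explicit SaiFailureStatus(FakeHealthTable& table) : table_(table) {}

    // Write-through: update the table first, then the local cache.
    void set(bool unhealthy, const std::string& error = "") {
        // Assumption of this sketch: a healthy status carries no error text.
        assert(unhealthy || error.empty());
        table_.set(unhealthy ? "true" : "false", error);
        cachedUnhealthy_ = unhealthy;
        cachedError_ = error;
    }

    // Returns {unhealthy, error}. Healthy path: cache only, no table read.
    std::pair<bool, std::string> get() {
        if (!cachedUnhealthy_) {
            return {false, std::string()};
        }
        const auto fields = table_.getAll();
        const auto flag = fields.find("unhealthy");
        if (flag != fields.end() && flag->second == "false") {
            // The flag was cleared externally; adopt the reset.
            cachedUnhealthy_ = false;
            cachedError_.clear();
            return {false, std::string()};
        }
        const auto err = fields.find("error");
        if (err != fields.end()) {
            cachedError_ = err->second;
        }
        return {true, cachedError_};
    }

private:
    FakeHealthTable& table_;
    bool cachedUnhealthy_ = false;
    std::string cachedError_;
};
```

The read counter makes the trade-off visible: while the cache says healthy, `get()` never touches the table, and once the cache says unhealthy each call costs one simulated round-trip until an external write of "false" is observed.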

Why I did it

The gOrchUnhealthy flag had no reset path — once set by a non-fatal SAI failure, it stayed true forever, causing the orchdaemon loop to log the error every ~1 second until the process was restarted. Moving the flag to STATE_DB allows operators to clear it externally:

sonic-db-cli STATE_DB HSET "PROCESS_HEALTH|orchagent" "unhealthy" "false"
sonic-db-cli STATE_DB HSET "PROCESS_HEALTH|orchagent" "error" ""

This also makes orchagent health status observable by other SONiC components via standard DB subscriptions.

How I verified it

Verified no remaining references to gOrchUnhealthy in the codebase
Confirmed setSaiFailureStatus() is only called on discrete events (startup and SAI failures), not in the polling loop
Confirmed getSaiFailureStatus() uses a local cache (gOrchUnhealthyCached) so the healthy path incurs zero Redis I/O; the unhealthy path uses a single HGETALL (one Redis round-trip) to fetch both fields
The health check in OrchDaemon::start() is inside the SELECT_TIMEOUT guard, so it runs at most once per second even when the select loop spins (e.g., after a SAI failure leaves m_ready non-empty)
Health table initialization runs after option parsing, so orchagent -h does not require Redis
Added mock tests covering: initial healthy state, set/get unhealthy with error string, reset to healthy, external reset detection via STATE_DB, and successive failure overwrites
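The throttling claim above can be illustrated with a small elapsed-time guard. This is a sketch under assumptions, not the loop in OrchDaemon::start(): `PeriodicGuard` and `countHealthChecks` are hypothetical names, the clock is synthetic so the result is deterministic, and the 1000 ms interval used in the usage note is inferred from "at most once per second".

```cpp
#include <cassert>
#include <chrono>

// Lets periodic work through at most once per interval, however often it is asked.
class PeriodicGuard {
public:
    using Clock = std::chrono::steady_clock;

    PeriodicGuard(std::chrono::milliseconds interval, Clock::time_point start)
        : interval_(interval), last_(start) {
        assert(interval.count() > 0);
    }

    // Returns true when a full interval has elapsed since the last accepted call.
    bool due(Clock::time_point now) {
        if (now - last_ < interval_) {
            return false;
        }
        last_ = now;
        return true;
    }

private:
    std::chrono::milliseconds interval_;
    Clock::time_point last_;
};

// Simulates `iterations` passes of a spinning select loop spaced `step` apart
// and counts how often the health check would run.
int countHealthChecks(int iterations,
                      std::chrono::milliseconds step,
                      std::chrono::milliseconds interval) {
    PeriodicGuard::Clock::time_point now{};  // synthetic clock, starts at the epoch
    PeriodicGuard guard(interval, now);
    int checks = 0;
    for (int i = 0; i < iterations; ++i) {
        now += step;
        if (guard.due(now)) {
            ++checks;  // the real loop would call getSaiFailureStatus() here
        }
    }
    return checks;
}
```

With a 1 ms step and a 1000 ms interval, 3000 simulated iterations yield 3 checks, so the number of status reads follows elapsed time and not the iteration count. No sleep is involved, which matches the stated goal of adding no blocking calls to the select loop.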

Details if related

STATE_DB table: PROCESS_HEALTH|orchagent with fields unhealthy and error
Only abort_on_failure=false call sites in handleSaiCreateStatus, handleSaiSetStatus, handleSaiRemoveStatus leave orchagent running unhealthy — these are the paths that benefit from external reset
Follows existing SONiC patterns (similar to SWITCH_CAPABILITY table in STATE_DB)
No sleep_for or blocking calls added to the select loop (avoids the issue that caused PR #4274, "Fix orchagent 100% CPU spin when gOrchUnhealthy is set", to be reverted)
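The split between fatal and non-fatal failure paths described above might look like the following sketch. The names `HealthState` and `handleFailure` are hypothetical, and the fatal path throws an exception here only so the example stays testable; how the real handleSaiFailure() terminates the process is not shown in this description.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical in-memory stand-in for the PROCESS_HEALTH|orchagent entry.
struct HealthState {
    bool unhealthy = false;
    std::string error;
};

// Fatal failures stop the process (modeled as an exception in this sketch);
// non-fatal failures leave it running and record the latest error.
void handleFailure(HealthState& state, const std::string& what, bool abortOnFailure) {
    assert(!what.empty());
    if (abortOnFailure) {
        throw std::runtime_error(what);
    }
    state.unhealthy = true;
    state.error = what;  // a later failure overwrites an earlier one
}
```

Only the non-fatal branch produces a state that an operator could later clear, which is why those call sites are the ones that benefit from an external reset.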

mssonicbld · 2026-05-29T20:50:43Z

/azp run

azure-pipelines · 2026-05-29T20:50:54Z

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>

Copilot

Pull request overview

This PR replaces orchagent’s in-memory SAI failure health flag with a STATE_DB-backed PROCESS_HEALTH|orchagent status so external components/operators can observe and clear the unhealthy state.

Changes:

Adds SAI failure health table initialization, read, and write helpers.
Updates SAI failure handling and orchdaemon polling to use STATE_DB health status.
Adds CI artifact collection for mock test logs on pipeline failure.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `orchagent/saihelper.h` | Declares new SAI failure health table helper APIs. |
| `orchagent/saihelper.cpp` | Implements STATE_DB-backed health status storage and updates failure handling. |
| `orchagent/orchdaemon.cpp` | Reads health status from STATE_DB during the main daemon loop. |
| `orchagent/main.cpp` | Initializes and clears the SAI failure health table on startup. |
| `tests/mock_tests/mock_orchagent_main.cpp` | Removes the obsolete mock `gOrchUnhealthy` global. |
| `.azure-pipelines/build-template.yml` | Collects mock test logs as build artifacts on failure. |

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.

Write SAI failure state to state DB

e17a446

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>

prabhataravind requested a review from Copilot May 30, 2026 13:13

Copilot started reviewing on behalf of prabhataravind May 30, 2026 13:13

Copilot AI reviewed May 30, 2026

Comment thread orchagent/main.cpp Outdated

Comment thread orchagent/orchdaemon.cpp Outdated

Comment thread orchagent/saihelper.cpp Outdated

prabhataravind force-pushed the paravind/orch_unhealthy_reset branch from 000a064 to e17a446 May 30, 2026 14:04

Fix comments and add tests

35ecce1

Use a single table get instead of 2 hgets

d20cab6

prabhataravind requested a review from Copilot May 30, 2026 14:25

Copilot started reviewing on behalf of prabhataravind May 30, 2026 14:26

Copilot AI reviewed May 30, 2026

Fix bugs and add more tests to cover error cases

df97644
