Skip to content

Write SAI failure state to state DB#4622

Draft
prabhataravind wants to merge 4 commits into
sonic-net:masterfrom
prabhataravind:paravind/orch_unhealthy_reset
Draft

Write SAI failure state to state DB#4622
prabhataravind wants to merge 4 commits into
sonic-net:masterfrom
prabhataravind:paravind/orch_unhealthy_reset

Conversation

@prabhataravind
Copy link
Copy Markdown
Contributor

@prabhataravind prabhataravind commented May 29, 2026

What I did

Replaced the in-memory gOrchUnhealthy global boolean with a STATE_DB-backed health status table (PROCESS_HEALTH), enabling external reset of the orchagent unhealthy flag without restarting the process.

Changes:

  • Removed bool gOrchUnhealthy global variable from main.cpp, orchdaemon.cpp, saihelper.cpp, and mock_orchagent_main.cpp
  • Added PROCESS_HEALTH table in STATE_DB with key orchagent and fields unhealthy ("true"/"false") and error (failure description)
  • Added three functions in saihelper.h/saihelper.cpp:
    • initSaiFailureTable() — creates the STATE_DB connection and table object
    • setSaiFailureStatus() — writes health status to STATE_DB and updates a local cache
    • getSaiFailureStatus() — returns cached status; only reads STATE_DB when unhealthy (to detect external resets)
  • handleSaiFailure() calls setSaiFailureStatus(true, errorString) instead of setting global
  • OrchDaemon::start() loop calls getSaiFailureStatus() inside the existing SELECT_TIMEOUT periodic block, throttling the check to ~1/second
  • main() calls initSaiFailureTable() + setSaiFailureStatus(false) after option parsing, immediately before SAI initialization
  • Added mock tests for initSaiFailureTable/setSaiFailureStatus/getSaiFailureStatus round-trip, including external reset detection

Why I did it

The gOrchUnhealthy flag had no reset path — once set by a non-fatal SAI failure, it stayed true forever, causing the orchdaemon loop to log the error every ~1 second until the process was restarted. Moving the flag to STATE_DB allows operators to clear it externally:

sonic-db-cli STATE_DB HSET "PROCESS_HEALTH|orchagent" "unhealthy" "false"
sonic-db-cli STATE_DB HSET "PROCESS_HEALTH|orchagent" "error" ""

This also makes orchagent health status observable by other SONiC components via standard DB subscriptions.

How I verified it

  • Verified no remaining references to gOrchUnhealthy in the codebase
  • Confirmed setSaiFailureStatus() is only called on discrete events (startup and SAI failures), not in the polling loop
  • Confirmed getSaiFailureStatus() uses a local cache (gOrchUnhealthyCached) so the healthy path incurs zero Redis I/O; the unhealthy path uses a single HGETALL (one Redis round-trip) to fetch both fields
  • The health check in OrchDaemon::start() is inside the SELECT_TIMEOUT guard, so it runs at most once per second even when the select loop spins (e.g., after a SAI failure leaves m_ready non-empty)
  • Health table initialization runs after option parsing, so orchagent -h does not require Redis
  • Added mock tests covering: initial healthy state, set/get unhealthy with error string, reset to healthy, external reset detection via STATE_DB, and successive failure overwrites

Details if related

  • STATE_DB table: PROCESS_HEALTH|orchagent with fields unhealthy and error
  • Only abort_on_failure=false call sites in handleSaiCreateStatus, handleSaiSetStatus, handleSaiRemoveStatus leave orchagent running unhealthy — these are the paths that benefit from external reset
  • Follows existing SONiC patterns (similar to SWITCH_CAPABILITY table in STATE_DB)
  • No sleep_for or blocking calls added to the select loop (avoids the issue that caused PR Fix orchagent 100% CPU spin when gOrchUnhealthy is set #4274 to be reverted)

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR replaces orchagent’s in-memory SAI failure health flag with a STATE_DB-backed PROCESS_HEALTH|orchagent status so external components/operators can observe and clear the unhealthy state.

Changes:

  • Adds SAI failure health table initialization, read, and write helpers.
  • Updates SAI failure handling and orchdaemon polling to use STATE_DB health status.
  • Adds CI artifact collection for mock test logs on pipeline failure.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
orchagent/saihelper.h Declares new SAI failure health table helper APIs.
orchagent/saihelper.cpp Implements STATE_DB-backed health status storage and updates failure handling.
orchagent/orchdaemon.cpp Reads health status from STATE_DB during the main daemon loop.
orchagent/main.cpp Initializes and clears the SAI failure health table on startup.
tests/mock_tests/mock_orchagent_main.cpp Removes the obsolete mock gOrchUnhealthy global.
.azure-pipelines/build-template.yml Collects mock test logs as build artifacts on failure.

Comment thread orchagent/main.cpp Outdated
Comment thread orchagent/orchdaemon.cpp Outdated
Comment thread orchagent/saihelper.cpp Outdated
@prabhataravind prabhataravind force-pushed the paravind/orch_unhealthy_reset branch from 000a064 to e17a446 Compare May 30, 2026 14:04
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants