Fix for PMON restart fails with Error "All critical services should be fully started!" #371
Open
asechoud wants to merge 1 commit into
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Signed-off-by: Aseem Choudhary <asechoud@cisco.com>
5437b2d to f8b97b1
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
anamehra
approved these changes
Apr 1, 2026
Contributor
@saiarcot895 can you help review this change?
bhouse-nexthop
approved these changes
Apr 8, 2026
Contributor
bhouse-nexthop
left a comment
This is a good find; we see failures like this in our automated testing with sonic-mgmt. The change seems reasonable.
Contributor
Are you seeing this even after sonic-net/sonic-utilities#4336?
Contributor
Thanks @saiarcot895! That PR works. We can close the current PR.
Why I did it:
On 202511, the sonic-mgmt test override_config_table.test_override_config_table#test_load_minigraph_with_golden_config fails with the error "All critical services should be fully started!". The cause is that the pmon service hits start-limit-hit and does not come back for 420 seconds. The failure can also be reproduced by running config reload three times in immediate succession.
pmon has StartLimitBurst set to 3 within 20 minutes (the common SONiC setting). The service is controlled by featured as a delayed-start process. Each config reload bumps the restart count by 1. When config reload starts the service again, featured steps in to stop it and start it again as delayed, which bumps the restart count once more. If the restart count reaches the maximum before featured starts the service, the start fails with start-limit-hit.
18:52:43 — pmon restarts (1st time, success)
18:54:32 — pmon receives SIGTERM (~2 min after start), deliberate shutdown
18:55:14 — featured detects pmon is down, holds restart
18:55:46 — featured attempts restart → FAILS with start-limit-hit
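The counting behaviour described above can be illustrated with a toy model. This is a simplified sketch, not systemd's actual implementation: the `StartLimiter` class, its method names, and the timestamps (seconds from an arbitrary origin) are hypothetical. Only the burst of 3 and the 20-minute window come from the description.

```python
class StartLimiter:
    """Toy model of a start rate limit: at most `burst` starts per window."""

    def __init__(self, burst=3, interval_sec=20 * 60):
        self.burst = burst
        self.interval_sec = interval_sec
        self.window_begin = None  # time of the first start in the current window
        self.count = 0            # starts counted in the current window

    def try_start(self, now):
        """Return True if a start at time `now` is allowed, False if rate-limited."""
        # Open a fresh window if none is active or the old one has expired.
        if self.window_begin is None or now - self.window_begin >= self.interval_sec:
            self.window_begin = now
            self.count = 0
        if self.count < self.burst:
            self.count += 1
            return True
        return False  # modelled equivalent of start-limit-hit

    def reset_failed(self):
        """Model of clearing the rate-limit state before a deliberate start."""
        self.window_begin = None
        self.count = 0


limiter = StartLimiter()
# Hypothetical start times from back-to-back reloads, all inside one window.
results = [limiter.try_start(t) for t in (0, 60, 120, 180)]
print(results)  # the fourth start is refused
limiter.reset_failed()
print(limiter.try_start(200))  # allowed again after the reset
```

In this model the fourth start inside the window is refused, and clearing the state first lets the next start through.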
How I did it:
Reset the accumulated start-limit counter before featured starts a service. featured issues a deliberate, externally triggered start for the pmon service, so it is safe to clear the rate-limit state even though systemd's own Restart= loop would normally be subject to StartLimitBurst. reset-failed is idempotent and returns 0 when the service is not in a failed state.
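A minimal sketch of the idea, assuming the start path shells out to `systemctl`. The function name `start_feature_unit` and the injectable `run` helper are hypothetical and are not taken from the actual featured code in this PR.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def _run_cmd(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code (hypothetical helper)."""
    return subprocess.run(list(cmd), check=False).returncode


def start_feature_unit(unit: str, run: Runner = _run_cmd) -> bool:
    """Clear failed/rate-limit state for `unit`, then start it.

    Returns True when `systemctl start` exits with 0.
    """
    # reset-failed is harmless when the unit is not in a failed state,
    # so it can be issued unconditionally before the deliberate start.
    run(["systemctl", "reset-failed", unit])
    return run(["systemctl", "start", unit]) == 0
```

Passing the runner as a parameter keeps the sketch testable without a live systemd; a real caller would rely on the default, for example `start_feature_unit("pmon.service")`.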
How to verify it:
To test manually, run the "config reload" command three or more times in a row and check that pmon comes up. The fix can also be tested by running the sonic-mgmt test named above.
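The manual check could be scripted roughly as follows. This is a sketch under assumptions: the `-y` flag on `config reload`, the unit name `pmon`, and the timeout and polling values are illustrative choices, and `pmon_survives_reloads` is a hypothetical helper name.

```python
import subprocess
import time


def _run_cmd(cmd):
    """Run a command and return its exit code (hypothetical helper)."""
    return subprocess.run(cmd, check=False).returncode


def pmon_survives_reloads(reloads=3, timeout_sec=600, poll_sec=10,
                          run=_run_cmd, sleep=time.sleep):
    """Issue back-to-back config reloads, then wait for pmon to become active."""
    for _ in range(reloads):
        # Assumed invocation; adjust flags to match the image under test.
        run(["sudo", "config", "reload", "-y"])
    waited = 0
    while waited <= timeout_sec:
        # is-active --quiet exits with 0 only when the unit is active.
        if run(["systemctl", "is-active", "--quiet", "pmon"]) == 0:
            return True
        sleep(poll_sec)
        waited += poll_sec
    return False
```

The generous default timeout allows for the delayed start that featured applies to pmon; the exact delay on a given platform is not stated here, so the value should be tuned.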