
Fix for PMON restart fails with Error "All critical services should be fully started!"#371

Open
asechoud wants to merge 1 commit into sonic-net:master from asechoud:fix/pmon-restart-failed-on-config-reload

@asechoud asechoud commented Apr 1, 2026

Why I did it:

On 202511, the sonic-mgmt test override_config_table.test_override_config_table#test_load_minigraph_with_golden_config fails with the error "All critical services should be fully started!". The cause is that the pmon service hits start-limit-hit and does not come back for 420 seconds. The failure can also be reproduced by running config reload three times in immediate succession.
pmon has StartLimitBurst set to 3 within a 20-minute interval, which is the common SONiC setting. The service is managed by featured as a delayed-start process. Each config reload increments the start count by 1. When config reload starts the process again, featured steps in to stop it and start it again as a delayed service, which increments the count once more. If the count reaches the limit before featured starts the process, the start fails with start-limit-hit.
18:52:43 — pmon restarts (1st time, success)
18:54:32 — pmon receives SIGTERM (~2 min after start), deliberate shutdown
18:55:14 — featured detects pmon is down, holds restart
18:55:46 — featured attempts restart → FAILS with start-limit-hit
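
The counting described above can be illustrated with a small model. This is a simplified sketch of a windowed start limit, not systemd's actual implementation. The class name `StartLimit` and the start times are hypothetical; only the burst of 3 and the 20-minute interval come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StartLimit:
    """Simplified model of a per-unit start rate limit (not systemd's code)."""

    burst: int = 3              # StartLimitBurst, per the PR description
    interval_s: int = 20 * 60   # 20-minute interval, per the PR description
    window_start: Optional[float] = None
    count: int = 0

    def try_start(self, now: float) -> bool:
        """Return True if a start at time `now` (in seconds) is allowed."""
        # Open a fresh window on the first start or once the old one expires.
        if self.window_start is None or now - self.window_start > self.interval_s:
            self.window_start = now
            self.count = 0
        if self.count >= self.burst:
            return False  # models start-limit-hit
        self.count += 1
        return True

    def reset(self) -> None:
        """Model clearing the start counter, as `systemctl reset-failed` does."""
        self.window_start = None
        self.count = 0


limit = StartLimit()
# Hypothetical start times in seconds. Each config reload causes two starts:
# one from the reload itself and one from featured's delayed start.
results = [limit.try_start(t) for t in (0, 60, 120, 180)]
print(results)
```

In this model the fourth start inside the window is refused, which matches the symptom where featured's start attempt fails. Calling `limit.reset()` before the next `try_start` lets it succeed again.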

How I did it:

Reset the accumulated start-limit counter before starting a service in featured. featured issues a deliberate, externally triggered start for the pmon service, so it is safe to clear the rate-limit state even though systemd's own Restart= loop would normally be subject to StartLimitBurst. reset-failed is idempotent and returns 0 when the service is not in a failed state.
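
A minimal sketch of the reset-then-start sequence, assuming featured shells out to `systemctl`. This is not the diff from this PR: the names `run_command` and `start_with_reset` are hypothetical, and the injectable `run` parameter exists only so the sequence can be exercised without touching a real system.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def start_with_reset(unit: str, run: Runner = run_command) -> bool:
    """Clear the unit's start-limit state, then start it.

    Returns True if `systemctl start` exits with 0.
    """
    # Best effort: per the PR description, reset-failed returns 0 even when
    # the unit is not in a failed state, so its exit code is not checked.
    run(["systemctl", "reset-failed", unit])
    return run(["systemctl", "start", unit]) == 0
```

Calling `start_with_reset("pmon.service")` on a device would issue the two `systemctl` commands in order. The reset is placed immediately before the start so that starts accumulated by earlier reloads do not count against the deliberate one.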

How to verify it:

To test manually, run the "config reload" command 3 or more times in a row and check that pmon comes up. The change can also be tested by running the sonic-mgmt test mentioned above.
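
The manual check could be scripted roughly as follows. This is a sketch under assumptions: the function name, the timeout and the poll interval are hypothetical, the script is assumed to run as root on the device, and a single successful `is-active` poll is only a rough signal, since featured may still stop and restart pmon afterwards.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def pmon_survives_reloads(
    reloads: int = 3,
    timeout_s: int = 600,   # hypothetical wait budget
    poll_s: int = 10,       # hypothetical poll interval
    run: Callable[[Sequence[str]], int] = run_command,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run config reload several times in a row, then wait for pmon."""
    for _ in range(reloads):
        if run(["config", "reload", "-y"]) != 0:
            return False  # the reload itself failed
    waited = 0
    while waited <= timeout_s:
        # is-active exits with 0 only when the unit is active.
        if run(["systemctl", "is-active", "--quiet", "pmon.service"]) == 0:
            return True
        sleep(poll_s)
        waited += poll_s
    return False
```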

@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Aseem Choudhary <asechoud@cisco.com>
@asechoud asechoud force-pushed the fix/pmon-restart-failed-on-config-reload branch from 5437b2d to f8b97b1 Compare April 1, 2026 22:29
@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@anamehra
Contributor

anamehra commented Apr 1, 2026

Hi @arlakshm, @prgeor, please review or assign reviewers. Thanks

@gechiang gechiang requested a review from mlok-nokia April 8, 2026 16:03
@bhouse-nexthop bhouse-nexthop self-requested a review April 8, 2026 16:04
@gechiang

gechiang commented Apr 8, 2026

@saiarcot895 can you help review this change?
Thanks!

Contributor

@bhouse-nexthop bhouse-nexthop left a comment

This is a good find; we see failures like this in our automated testing using sonic-mgmt. The change seems reasonable.

@saiarcot895
Contributor

Are you seeing this even after sonic-net/sonic-utilities#4336?

@anamehra
Contributor

> Are you seeing this even after sonic-net/sonic-utilities#4336?

Thanks @saiarcot895! That PR works. We can close the current PR.

@anamehra anamehra self-requested a review April 10, 2026 21:58
