Fix route_check.py: performance, dict bug, and intersection retry#4576
Open
mike-dubrovsky wants to merge 3 commits into
Open
Fix route_check.py: performance, dict bug, and intersection retry#4576mike-dubrovsky wants to merge 3 commits into
mike-dubrovsky wants to merge 3 commits into
Conversation
…lity fetch_routes() was appending bare prefix strings to missing_routes, but mitigate_installed_not_offloaded_frr_routes() accesses entry['prefix'] and entry['protocol'], causing TypeError: string indices must be integers whenever suppress-fib-pending triggers mitigation. The bug was introduced in a9543cb ("Fix route_check.py to not hog a lot of memory", PR sonic-net#4205) which moved route filtering into fetch_routes() but switched from appending the full JSON entry dict to appending only the prefix string. Fix: return {'prefix': prefix, 'protocol': route_entry.get('protocol', '')} to match the format expected by mitigate_installed_not_offloaded_frr_routes(). Signed-off-by: mike-dubrovsky <mdubrovs@cisco.com>
## What I did Replaced the ijson-based JSON parser in fetch_routes() with a line-by-line scan of vtysh plain-text output. This is 5-9x faster and keeps peak RSS flat regardless of route table size. ## Why text output is faster vtysh generates "show ip route json" by serialising the full in-memory route table to JSON before sending any bytes. "show ip route" (text) is produced incrementally — vtysh prints each route as it iterates — so Python never holds more than one line in RAM. ## How text output encodes offload state FRR zebra_vty.c assembles each route line as three fixed-position chars (zebra_route_char + selected_char + re_status_output_char): B>* 10.0.0.0/24 selected, offloaded (normal / healthy) B>q 10.0.0.0/24 queued, not yet offloaded (suppress-fib-pending) B>o 10.0.0.0/24 offload failure The regex r'^B>([qo])\s+(\S+)\s' matches only the two problematic cases directly. On a fully-converged device nothing matches and the function returns empty lists without any heap allocations. The ijson dependency is no longer needed and is removed. Signed-off-by: mike-dubrovsky <mdubrovs@cisco.com>
## What I did
Replaced the overwrite-on-each-iteration retry loop with an intersection
accumulator so that only routes stuck in *every* poll window are mitigated.
## Problem with the previous logic
The retry loop called get_frr_routes_parallel() up to FRR_CHECK_RETRIES
times and simply overwrote missed_rt on every iteration:
for i in range(retries):
missed_rt, failed_rt = get_frr_routes_parallel(namespace)
if not missed_rt and not failed_rt:
break
time.sleep(FRR_WAIT_TIME)
# mitigate whatever happened to be in missed_rt at loop exit
Consequence: a route stuck only in the last iteration is
mitigated even though it may be in the middle of normal BGP convergence
(false positive / premature mitigation).
## New logic
Accumulate a running intersection across iterations. On the first iteration
the accumulator is seeded with the full result set. On each subsequent
iteration, any prefix no longer present is removed.
iter 0: {A, B, C, D} <- seed
iter 1: {A, B, C} <- D removed (converged)
iter 2: {A, B} <- C removed (converged)
result: {A, B} <- truly stuck, safe to mitigate
If the current iteration returns nothing, the intersection becomes empty and
the loop exits early — same fast-path as before for the healthy case.
Signed-off-by: mike-dubrovsky <mdubrovs@cisco.com>
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What I did
Three fixes to
route_check.py:Bug fix (
fetch_routesreturns dicts):fetch_routes()was appendingbare prefix strings to
missing_routes, butmitigate_installed_not_offloaded_frr_routes()expects dicts withentry['prefix']andentry['protocol'], causing aTypeErrorwheneversuppress-fib-pending triggered mitigation.
Performance (replace ijson JSON scan with vtysh text streaming):
Replace the ijson-based parser with a streaming line-by-line scan of
vtysh show ip/ipv6 routeplain-text output and a focused regex thatmatches only queued (
q) or offload-failed (o) routes.Correctness (intersection-based retry in
check_frr_pending_routes):Replace the overwrite-on-each-iteration retry loop with an intersection
accumulator so that only routes stuck in every poll window are mitigated,
avoiding premature mitigation of routes that are still converging.
How I did it
Bug fix:
fetch_routes()now returns{'prefix': prefix, 'protocol': 'bgp'}dicts instead of bare strings, matching the format expected by the mitigation
function.
Performance:
vtysh show ip route jsonfor a large route table serialisesthe entire in-memory RIB to JSON before sending a single byte, producing
~200 MB that must then be parsed by Python. The plain-text output is streamed
incrementally by vtysh and scanned line-by-line with a single regex — no JSON
heap allocation. On a healthy device (no stuck routes) the function returns
empty lists without any heap allocations at all.
FRR
zebra_vty.cassembles each route line as three fixed-position characters(
zebra_route_char+selected_char+re_status_output_char):The regex
r'^B>([qo])\s+(\S+)\s'matches only the two cases that requireaction.
Correctness: the previous loop overwrote
missed_rton every iteration,so a route that appeared stuck only in the last iteration could be mitigated
even though it was still converging. The new logic seeds an accumulator on the
first iteration and intersects it with each subsequent result, returning only
prefixes that were stuck in every poll.
Why this matters — timeout risk on T2 chassis
See bug sonic-net/sonic-buildimage#27612
In the latest sonic-mgmt, suppress-fib-pending is enabled on T2 chassis,
and
check_frr_pending_routesruns unconditionally on everyroute_check.pyinvocation.route_check.pyruns under two hard time limits:/etc/monit/conf.d/sonic-host)On a T2 chassis a line card can have 3 ASICs, each running an independent
FRR instance.
sonic-mgmttests use ~51k IPv4 + ~51k IPv6 routes per ASIC.Route download speed is ~1k prefixes/sec, so BGP converging during a test run are
common.
With the old JSON-based implementation, if BGP was flapping at the time
route_check.pyran, the script would enter the retry loop (up to 3iterations × 15 s sleep + ~70 s JSON parse + vtysh fetch per ASIC), easily
exceeding the 2-minute and 5-minute thresholds and getting killed before it
could report or mitigate anything.
Measured results (88-LC0-36FH, 3 ASICs, 51k IPv4 + 51k IPv6 per ASIC)
How to verify it
Deploy
route_check.pyto a T2 line card with suppress-fib-pending enabledand a large BGP route table. Confirm that:
retries to confirm they are persistently stuck, and sends mitigation
notifications — after which
vtysh show ip routeno longer showsB>qlines for those prefixes.
Previous command output (if the output of a command-line utility has changed)
N/A — no CLI changes.
New command output (if the output of a command-line utility has changed)
N/A — no CLI changes.