Skip to content

Fix route_check.py: performance, dict bug, and intersection retry#4576

Open
mike-dubrovsky wants to merge 3 commits into
sonic-net:masterfrom
mike-dubrovsky:route_check
Open

Fix route_check.py: performance, dict bug, and intersection retry#4576
mike-dubrovsky wants to merge 3 commits into
sonic-net:masterfrom
mike-dubrovsky:route_check

Conversation

@mike-dubrovsky
Copy link
Copy Markdown

@mike-dubrovsky mike-dubrovsky commented May 29, 2026

What I did

Three fixes to route_check.py:

  1. Bug fix (fetch_routes returns dicts): fetch_routes() was appending
    bare prefix strings to missing_routes, but
    mitigate_installed_not_offloaded_frr_routes() expects dicts with
    entry['prefix'] and entry['protocol'], causing a TypeError whenever
    suppress-fib-pending triggered mitigation.

  2. Performance (replace ijson JSON scan with vtysh text streaming):
    Replace the ijson-based parser with a streaming line-by-line scan of
    vtysh show ip/ipv6 route plain-text output and a focused regex that
    matches only queued (q) or offload-failed (o) routes.

  3. Correctness (intersection-based retry in check_frr_pending_routes):
    Replace the overwrite-on-each-iteration retry loop with an intersection
    accumulator so that only routes stuck in every poll window are mitigated,
    avoiding premature mitigation of routes that are still converging.

How I did it

Bug fix: fetch_routes() now returns {'prefix': prefix, 'protocol': 'bgp'}
dicts instead of bare strings, matching the format expected by the mitigation
function.

Performance: vtysh show ip route json for a large route table serialises
the entire in-memory RIB to JSON before sending a single byte, producing
~200 MB that must then be parsed by Python. The plain-text output is streamed
incrementally by vtysh and scanned line-by-line with a single regex — no JSON
heap allocation. On a healthy device (no stuck routes) the function returns
empty lists without any heap allocations at all.

FRR zebra_vty.c assembles each route line as three fixed-position characters
(zebra_route_char + selected_char + re_status_output_char):

B>* 10.0.0.0/24   selected, offloaded (healthy)
B>q 10.0.0.0/24   selected, queued    (suppress-fib-pending)
B>o 10.0.0.0/24   selected, offload failed

The regex r'^B>([qo])\s+(\S+)\s' matches only the two cases that require
action.

Correctness: the previous loop overwrote missed_rt on every iteration,
so a route that appeared stuck only in the last iteration could be mitigated
even though it was still converging. The new logic seeds an accumulator on the
first iteration and intersects it with each subsequent result, returning only
prefixes that were stuck in every poll.

Why this matters — timeout risk on T2 chassis

See bug sonic-net/sonic-buildimage#27612

In the latest sonic-mgmt, suppress-fib-pending is enabled on T2 chassis,
and check_frr_pending_routes runs unconditionally on every
route_check.py invocation.

route_check.py runs under two hard time limits:

  • SIGALRM at 2 minutes (TIMEOUT_SECONDS timer set inside the script itself)
  • monit kill at 5 minutes (/etc/monit/conf.d/sonic-host)

On a T2 chassis a line card can have 3 ASICs, each running an independent
FRR instance. sonic-mgmt tests use ~51k IPv4 + ~51k IPv6 routes per ASIC.
Route download speed is ~1k prefixes/sec, so BGP converging during a test run are
common.

With the old JSON-based implementation, if BGP was flapping at the time
route_check.py ran, the script would enter the retry loop (up to 3
iterations × 15 s sleep + ~70 s JSON parse + vtysh fetch per ASIC), easily
exceeding the 2-minute and 5-minute thresholds and getting killed before it
could report or mitigate anything.

Measured results (88-LC0-36FH, 3 ASICs, 51k IPv4 + 51k IPv6 per ASIC)

Scenario Old wall time New wall time
Healthy device (0 stuck) ~120 s ~31 s
All routes stuck, 3 retries ~350–400 s ~85 s

How to verify it

Deploy route_check.py to a T2 line card with suppress-fib-pending enabled
and a large BGP route table. Confirm that:

  1. The script completes within the 2-minute window on a healthy device.
  2. When routes are stuck (queued), the script detects them, waits through
    retries to confirm they are persistently stuck, and sends mitigation
    notifications — after which vtysh show ip route no longer shows B>q
    lines for those prefixes.
  3. Routes that converge between retry iterations are not mitigated.

Previous command output (if the output of a command-line utility has changed)

N/A — no CLI changes.

New command output (if the output of a command-line utility has changed)

N/A — no CLI changes.

…lity

fetch_routes() was appending bare prefix strings to missing_routes, but
mitigate_installed_not_offloaded_frr_routes() accesses entry['prefix'] and
entry['protocol'], causing TypeError: string indices must be integers whenever
suppress-fib-pending triggers mitigation.

The bug was introduced in a9543cb ("Fix route_check.py to not hog a lot of
memory", PR sonic-net#4205) which moved route filtering into fetch_routes() but
switched from appending the full JSON entry dict to appending only the
prefix string.

Fix: return {'prefix': prefix, 'protocol': route_entry.get('protocol', '')}
to match the format expected by mitigate_installed_not_offloaded_frr_routes().

Signed-off-by: mike-dubrovsky <mdubrovs@cisco.com>
## What I did

Replaced the ijson-based JSON parser in fetch_routes() with a line-by-line
scan of vtysh plain-text output.  This is 5-9x faster and keeps peak RSS
flat regardless of route table size.

## Why text output is faster

vtysh generates "show ip route json" by serialising the full in-memory
route table to JSON before sending any bytes.  "show ip route" (text) is
produced incrementally — vtysh prints each route as it iterates — so
Python never holds more than one line in RAM.

## How text output encodes offload state

FRR zebra_vty.c assembles each route line as three fixed-position chars
(zebra_route_char + selected_char + re_status_output_char):

  B>* 10.0.0.0/24   selected, offloaded      (normal / healthy)
  B>q 10.0.0.0/24   queued, not yet offloaded (suppress-fib-pending)
  B>o 10.0.0.0/24   offload failure

The regex r'^B>([qo])\s+(\S+)\s' matches only the two problematic cases
directly.  On a fully-converged device nothing matches and the function
returns empty lists without any heap allocations.

The ijson dependency is no longer needed and is removed.

Signed-off-by: mike-dubrovsky <mdubrovs@cisco.com>
## What I did

Replaced the overwrite-on-each-iteration retry loop with an intersection
accumulator so that only routes stuck in *every* poll window are mitigated.

## Problem with the previous logic

The retry loop called get_frr_routes_parallel() up to FRR_CHECK_RETRIES
times and simply overwrote missed_rt on every iteration:

  for i in range(retries):
      missed_rt, failed_rt = get_frr_routes_parallel(namespace)
      if not missed_rt and not failed_rt:
          break
      time.sleep(FRR_WAIT_TIME)
  # mitigate whatever happened to be in missed_rt at loop exit

Consequence: a route stuck only in the last iteration is
mitigated even though it may be in the middle of normal BGP convergence
(false positive / premature mitigation).

## New logic

Accumulate a running intersection across iterations.  On the first iteration
the accumulator is seeded with the full result set.  On each subsequent
iteration, any prefix no longer present is removed.

  iter 0: {A, B, C, D}   <- seed
  iter 1: {A, B, C}       <- D removed (converged)
  iter 2: {A, B}          <- C removed (converged)
  result: {A, B}          <- truly stuck, safe to mitigate

If the current iteration returns nothing, the intersection becomes empty and
the loop exits early — same fast-path as before for the healthy case.

Signed-off-by: mike-dubrovsky <mdubrovs@cisco.com>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mike-dubrovsky mike-dubrovsky changed the title Route check Fix route_check.py: performance, dict bug, and intersection retry May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants