Skip to content

fix(workload): prevent endpoint/service BPF map divergence on partial…#1680

Open
OmAmbole009 wants to merge 5 commits into
kmesh-net:mainfrom
OmAmbole009:fix/endpoint-service-bpf-map-consistency
Open

fix(workload): prevent endpoint/service BPF map divergence on partial…#1680
OmAmbole009 wants to merge 5 commits into
kmesh-net:mainfrom
OmAmbole009:fix/endpoint-service-bpf-map-consistency

Conversation

@OmAmbole009
Copy link
Copy Markdown

Summary

This PR fixes a transactional consistency issue during endpoint registration/removal where partial failures could leave EndpointMap, ServiceMap, and EndpointCache in divergent states.

The fix ensures:

endpoint visibility is only exposed to the datapath after successful service count updates
failed updates correctly roll back intermediate state
cache state remains consistent with BPF-visible state
endpoint deletion ordering avoids transient duplicate backend visibility

Fixes #1679

Root Cause

  1. addWorkloadToService() mutated state before all writes succeeded

Previously:

sv.EndpointCount was incremented before EndpointUpdate
EndpointCache was updated before ServiceUpdate

This created two failure windows:

Failure window A

If EndpointUpdate() failed:

sv.EndpointCount was already incremented
in-memory state diverged from the BPF service map
Failure window B

If ServiceUpdate() failed:

endpoint entry already existed in the BPF endpoint map
service count was not incremented
endpoint became unreachable to the datapath
cache state incorrectly reflected a successful add

This could lead to silent traffic inconsistencies.

  1. deleteEndpoint() exposed transient duplicate backend visibility

The original ordering was:

Swap -> decrement count -> delete last entry

Between swap and count decrement, the swapped backend could temporarily become visible at two indices simultaneously.

Fix Strategy

addWorkloadToService()

New ordering:

EndpointUpdate -> increment count + ServiceUpdate -> cache update

Behavior:

backend index computed without mutating state
endpoint inserted first while still invisible to datapath
service count updated only after successful endpoint insertion
rollback performed if ServiceUpdate() fails
cache updated only after all BPF updates succeed
deleteEndpoint()

New ordering:

decrement count -> ServiceUpdate -> swap -> delete last entry

Behavior:

last slot becomes unreachable before swap
prevents transient duplicate backend selection
rollback attempted if swap fails
deleting the unreachable last entry becomes safe/non-critical

Tests Added

addWorkloadToService
TestAddWorkloadToService_EndpointUpdateFailure
TestAddWorkloadToService_ServiceUpdateFailure
TestAddWorkloadToService_EndpointCountConsistency
TestAddWorkloadToService_RetryAfterFailureSucceeds
deleteEndpoint
TestDeleteEndpoint_CountDecrementedBeforeSwap
TestDeleteEndpoint_DeleteLastEntry
TestDeleteEndpoint_EndpointCountServiceMapConsistency

These tests validate:

rollback correctness
endpoint/service consistency
cache consistency
retry behavior
endpoint count invariants
map compactness after deletion

Result

This change improves consistency guarantees between:

userspace workload state
EndpointMap
ServiceMap
datapath-visible topology

and prevents silent routing inconsistencies during partial BPF update failures.

Copilot AI review requested due to automatic review settings May 6, 2026 21:27
@kmesh-bot kmesh-bot requested review from hzxuzhonghu and kwb0523 May 6, 2026 21:27
@kmesh-bot
Copy link
Copy Markdown
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hzxuzhonghu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses critical transactional consistency issues in the workload controller's BPF map management. By reordering operations and introducing robust rollback logic, it ensures that the userspace state, BPF maps, and the eBPF datapath remain synchronized even when partial failures occur during endpoint registration or deletion. These changes significantly improve the reliability of service routing and prevent silent traffic inconsistencies.

Highlights

  • Transactional Consistency: Refactored addWorkloadToService and deleteEndpoint to ensure atomic-like updates to BPF maps, preventing state divergence between EndpointMap, ServiceMap, and EndpointCache during partial failures.
  • Improved Error Handling: Implemented rollback mechanisms for failed BPF updates, ensuring that if a service update fails, the endpoint insertion is reverted and in-memory state remains consistent.
  • Datapath Visibility: Adjusted the ordering of operations to ensure endpoints are only visible to the eBPF datapath after successful registration, and unreachable before removal, preventing transient traffic inconsistencies.
  • Comprehensive Testing: Added a suite of unit tests to validate consistency invariants, rollback behavior, and retry logic under simulated failure conditions.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.


In maps where packets find their way, We keep the state from going astray. With order strict and rollback smart, Consistency is now the heart.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@kmesh-bot
Copy link
Copy Markdown
Collaborator

Welcome @OmAmbole009! It looks like this is your first PR to kmesh-net/kmesh 🎉

@OmAmbole009 OmAmbole009 force-pushed the fix/endpoint-service-bpf-map-consistency branch from 6acdbb7 to 689bf6d Compare May 6, 2026 21:30
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the BPF map update logic in addWorkloadToService and deleteEndpoint to provide stronger consistency guarantees and ordering between endpoint and service map writes. It introduces a swap-then-shrink algorithm for deletions and comprehensive unit tests for failure and rollback scenarios. Feedback highlights critical safety issues, including missing validation for the priority parameter and potential uint32 underflow when decrementing endpoint counts. Additionally, the reviewer noted that the EndpointCache must be updated during endpoint swaps to maintain consistency with the BPF map state.

Comment thread pkg/controller/workload/workload_processor.go
Comment thread pkg/controller/workload/workload_processor.go
Comment thread pkg/controller/workload/workload_processor.go
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses transactional consistency between userspace state and the eBPF EndpointMap/ServiceMap during workload endpoint add/remove operations, preventing partial failures from leaving endpoint/service counts and caches diverged (Fixes #1679).

Changes:

  • Reorders addWorkloadToService() updates so the endpoint entry is written before incrementing/persisting the service EndpointCount, with rollback on partial failures and cache updates only after BPF writes succeed.
  • Reorders deleteEndpoint() to decrement/persist service EndpointCount before swapping/deleting endpoint entries to avoid transient duplicate backend visibility.
  • Adds targeted unit tests covering failure windows, rollback behavior, and endpoint/service/cache consistency invariants.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
pkg/controller/workload/workload_processor.go Makes add/remove endpoint operations more transactional via ordering changes and rollback behavior.
pkg/controller/workload/workload_processor_test.go Adds tests validating rollback, consistency, and deletion compactness/invariants.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/controller/workload/workload_processor.go
Comment thread pkg/controller/workload/workload_processor.go Outdated
Comment thread pkg/controller/workload/workload_processor_test.go Outdated
Comment thread pkg/controller/workload/workload_processor.go Outdated
Copilot AI review requested due to automatic review settings May 6, 2026 21:55
@OmAmbole009 OmAmbole009 force-pushed the fix/endpoint-service-bpf-map-consistency branch from 1673bd8 to 3acb2bd Compare May 6, 2026 21:55
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

Comment thread pkg/controller/workload/workload_processor.go
Comment thread pkg/controller/workload/workload_processor.go Outdated
Comment thread pkg/controller/workload/workload_processor.go
Comment thread pkg/controller/workload/workload_processor.go
Comment thread pkg/controller/workload/workload_processor_test.go Outdated
@OmAmbole009
Copy link
Copy Markdown
Author

@hzxuzhonghu @kwb0523

I’ve addressed the latest review feedback and pushed the follow-up fixes/tests. The PR should now be up to date with all discussed changes.

Could you please take a look when you get a chance? Thanks!

@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

Codecov Report

❌ Patch coverage is 42.42424% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 39.72%. Comparing base (c9b38d3) to head (4ed998c).
⚠️ Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
pkg/controller/workload/workload_processor.go 42.42% 13 Missing and 6 partials ⚠️

❌ Your patch check has failed because the patch coverage (42.42%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Files with missing lines Coverage Δ
pkg/controller/workload/workload_processor.go 60.29% <42.42%> (+2.53%) ⬆️

... and 1 file with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3d77877...4ed998c. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

// Step 1: write the endpoint entry. The service EndpointCount still points to the
// previous last entry, so the new entry is invisible to the eBPF program until step 2.
if err := p.bpf.EndpointUpdate(&ek, &ev); err != nil {
log.Errorf("Update endpoint map failed, err:%s", err)
Comment on lines +1245 to 1249
BackendIndex: lastIndex,
}
if err := p.bpf.EndpointDelete(lastKey); err != nil {
log.Errorf("EndpointDelete [%#v] failed: %v", lastKey, err)
return err
log.Errorf("EndpointDelete [%#v] failed (ghost entry, non-fatal): %v", lastKey, err)
}
Comment on lines +1222 to +1239
// Step 2: overwrite the deleted slot with the (now unreachable) last entry so
// the remaining entries stay contiguous.
lastIndex := newCount + 1 // index of the entry that is now unreachable
if err := p.bpf.EndpointSwap(ek.BackendIndex, ev.BackendUid, lastIndex, sk.ServiceId, ek.Prio); err != nil {
log.Errorf("swap workload endpoint index failed: %s", err)
// Attempt to restore the count so the service map is not left with a count
// that is one less than the actual number of reachable entries.
sv.EndpointCount[ek.Prio] = lastIndex
if restoreErr := p.bpf.ServiceUpdate(&sk, &sv); restoreErr != nil {
log.Errorf("restore EndpointCount after swap failure failed: %v", restoreErr)
return fmt.Errorf("swap workload endpoint index failed: %w; restoring EndpointCount also failed: %v", err, restoreErr)
}
return err
}

// Step 3: delete the now-unreachable last entry. A failure here leaves a ghost
// entry that is invisible to the eBPF program because EndpointCount was already
// decremented in step 1. The entry wastes one map slot but does not affect
Comment on lines +1408 to +1418
// Close the endpoint map. deleteEndpoint will:
// step 1: decrement count to 0, ServiceUpdate (service map open) — succeeds.
// step 2: EndpointSwap(1, wlID, 1, svcID, 0) — currentIndex==lastIndex==1,
// lastIndex<=1 branch → no-op, returns nil (no endpoint map access).
// step 3: EndpointDelete(lastKey) — endpoint map closed → fails → logs error,
// returns nil (ghost entry is non-fatal).
workloadMap.KmEndpoint.Close()

err := p.deleteEndpoint(ek, ev, sv, sk)
assert.NoError(t, err,
"deleteEndpoint must return nil even when EndpointDelete fails (ghost entry is non-fatal)")
Comment on lines +1232 to +1245
// TestAddWorkloadToService_ServiceUpdateFailure_RollbackDeleteAlsoFails documents
// an unreachable branch in the current fake map infrastructure.
//
// To trigger this path, addWorkloadToService must observe:
// 1. EndpointUpdate succeeds (endpoint map open)
// 2. ServiceUpdate fails (service map closed)
// 3. rollback EndpointDelete fails (endpoint map closed)
//
// With the current synchronous fake maps, we cannot transition endpoint-map state
// between (1) and (3) inside a single addWorkloadToService call without changing
// production code or fake map internals, so this path is intentionally skipped.
func TestAddWorkloadToService_ServiceUpdateFailure_RollbackDeleteAlsoFails(t *testing.T) {
t.Skip("requires mid-call endpoint map state transition not possible with current fake map implementation")
}
… update failure

addWorkloadToService previously incremented sv.EndpointCount before the
endpoint BPF map write succeeded, and updated EndpointCache before the
service BPF map write succeeded.  This created two failure windows:

1. If EndpointUpdate failed, sv.EndpointCount was already incremented in
   the caller's copy, leaving the in-memory service value out of sync with
   the BPF service map on the next ServiceLookup.

2. If ServiceUpdate failed after a successful EndpointUpdate, the endpoint
   entry existed in the BPF map but EndpointCount was not incremented, so
   the eBPF program could never select the new endpoint (silent traffic
   loss).  EndpointCache was also updated despite the failure, diverging
   from the actual datapath state.

deleteEndpoint previously decremented EndpointCount after the EndpointSwap,
meaning the eBPF program could briefly observe the swapped-in backend at
two indices simultaneously (duplicate traffic window).

Fix addWorkloadToService:
- Compute the new BackendIndex without mutating sv.EndpointCount.
- Write the endpoint entry first (invisible to eBPF until count grows).
- Increment EndpointCount and call ServiceUpdate second.
- On ServiceUpdate failure, delete the just-inserted endpoint entry and
  restore sv.EndpointCount so the caller's copy is not left dirty.
- Update EndpointCache only after both BPF writes succeed.

Fix deleteEndpoint:
- Decrement EndpointCount and call ServiceUpdate first, making the last
  slot unreachable to the eBPF program before any data is moved.
- Perform EndpointSwap second (both slots hold the same backend but the
  last slot is already unreachable, so no duplicate traffic).
- On EndpointSwap failure, attempt to restore EndpointCount.
- Delete the now-unreachable last entry last; a failure here leaves a
  ghost entry that is invisible to the datapath and will be cleaned up
  on the next restart via handleRemovedAddressesDuringRestart.

Add six unit tests covering:
- EndpointUpdate failure: sv not mutated, cache not updated.
- ServiceUpdate failure: endpoint rolled back, cache not updated.
- EndpointCount always equals actual BPF endpoint entries after adds.
- deleteEndpoint: count decremented, map compact, deleted entry absent.
- deleteEndpoint: last-entry deletion leaves count 0 and map empty.
- Mixed add/delete sequence: BPF count, map count, cache count consistent.
- Retry after removal succeeds with correct state.

Signed-off-by: OmAmbole009 <omambole09@gmail.com>
…nsistency

Four targeted fixes based on Copilot review:

1. Add underflow/state guard in deleteEndpoint()
   Decrementing sv.EndpointCount[ek.Prio] when the count is already 0,
   or when BackendIndex is 0 or exceeds the current count, would write
   an invalid count to the BPF service map and corrupt the datapath
   state.  Add explicit checks before the decrement and return a
   descriptive error so the caller can log and skip the entry rather
   than persisting bad state.

2. Fix misleading ghost-entry cleanup comment in deleteEndpoint()
   The previous comment claimed the ghost entry 'will be cleaned up on
   the next restart via handleRemovedAddressesDuringRestart'.  That
   function cleans up backend/service entries whose UIDs are absent from
   the userspace cache; it does not explicitly iterate unreachable
   endpoint slots.  Replace the comment with an accurate description:
   the entry is invisible to the eBPF program because EndpointCount was
   already decremented, and it will remain until the next explicit
   cleanup of the owning service or workload.

3. Fix inaccurate test comment in TestAddWorkloadToService_EndpointUpdateFailure
   The old comment described the failure injection as 'using a service
   that does not exist in the BPF map'.  The actual mechanism is closing
   workloadMap.KmEndpoint so that the EndpointUpdate call inside
   addWorkloadToService returns an error.  Update the comment to
   accurately describe what is being tested.

4. Fix outdated function-level rollback comment in addWorkloadToService()
   The previous doc comment said the rollback restores 'the in-memory
   EndpointCache'.  EndpointCache is now updated only after both BPF
   writes succeed, so no cache rollback is ever needed.  Rewrite the
   comment to reflect the actual behaviour: on ServiceUpdate failure the
   endpoint BPF entry is deleted and sv.EndpointCount is restored; the
   cache is untouched because it was never written.

Signed-off-by: OmAmbole009 <omambole09@gmail.com>
… and deleteEndpoint

sv.EndpointCount is a fixed-size array of length bpf.PrioCount (7).
Passing a priority value >= PrioCount causes an out-of-bounds array
access and a runtime panic before any BPF map write occurs.

addWorkloadToService: add an explicit bounds check on the priority
parameter before the first sv.EndpointCount[priority] read.

deleteEndpoint: add an explicit bounds check on ek.Prio before the
existing underflow and BackendIndex guards, which themselves access
sv.EndpointCount[ek.Prio].  The ordering is:
  1. ek.Prio < bpf.PrioCount  (new, prevents out-of-bounds)
  2. EndpointCount[ek.Prio] > 0  (existing, prevents underflow)
  3. BackendIndex in [1, EndpointCount[ek.Prio]]  (existing, prevents
     out-of-range swap)

Both checks return a descriptive error so callers can log and skip the
entry without writing invalid state to the BPF service map.

No logic changes beyond the new guard conditions.

Signed-off-by: OmAmbole009 <omambole09@gmail.com>
…fix test comment

Three targeted fixes based on Copilot review:

1. Rename 'swap-then-shrink' to 'shrink-then-swap' in deleteEndpoint doc
   The implemented order is: decrement EndpointCount + ServiceUpdate
   (shrink) first, then EndpointSwap (swap) second.  The previous
   algorithm name was inverted relative to the actual code.

2. Return aggregated error when EndpointSwap rollback also fails
   When EndpointSwap fails and the subsequent ServiceUpdate to restore
   EndpointCount also fails, the function previously returned only the
   original swap error, silently discarding the restore failure.  Now
   both errors are surfaced in the returned error so operators can
   distinguish a swap-only failure from a swap + rollback failure.
   The existing log line and rollback attempt are preserved unchanged.

3. Rename test and fix misleading comment
   TestAddWorkloadToService_RetryAfterFailureSucceeds used a different
   workload name and IP for 'wl2', making it a distinct workload rather
   than a retry of 'wl1'.  The test name and comment implied retry
   semantics that were not present.  Rename to
   TestAddWorkloadToService_NewWorkloadAfterDeletionSucceeds and update
   the comment to accurately describe what is tested: adding a new
   workload after a previous one was deleted leaves the service in a
   consistent state.

No logic changes to the transactional ordering or datapath guarantees.

Signed-off-by: OmAmbole009 <omambole09@gmail.com>
Signed-off-by: OmAmbole009 <omambole09@gmail.com>
@OmAmbole009 OmAmbole009 force-pushed the fix/endpoint-service-bpf-map-consistency branch from 13338c7 to ed73d0f Compare May 7, 2026 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Potential BPF map consistency issue during partial endpoint update failures

3 participants