fix(workload): prevent endpoint/service BPF map divergence on partial…#1680
fix(workload): prevent endpoint/service BPF map divergence on partial…#1680OmAmbole009 wants to merge 5 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses critical transactional consistency issues in the workload controller's BPF map management. By reordering operations and introducing robust rollback logic, it ensures that the userspace state, BPF maps, and the eBPF datapath remain synchronized even when partial failures occur during endpoint registration or deletion. These changes significantly improve the reliability of service routing and prevent silent traffic inconsistencies. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. In maps where packets find their way, We keep the state from going astray. With order strict and rollback smart, Consistency is now the heart. Footnotes
|
|
Welcome @OmAmbole009! It looks like this is your first PR to kmesh-net/kmesh 🎉 |
6acdbb7 to
689bf6d
Compare
There was a problem hiding this comment.
Code Review
This pull request refactors the BPF map update logic in addWorkloadToService and deleteEndpoint to provide stronger consistency guarantees and ordering between endpoint and service map writes. It introduces a swap-then-shrink algorithm for deletions and comprehensive unit tests for failure and rollback scenarios. Feedback highlights critical safety issues, including missing validation for the priority parameter and potential uint32 underflow when decrementing endpoint counts. Additionally, the reviewer noted that the EndpointCache must be updated during endpoint swaps to maintain consistency with the BPF map state.
There was a problem hiding this comment.
Pull request overview
This PR addresses transactional consistency between userspace state and the eBPF EndpointMap/ServiceMap during workload endpoint add/remove operations, preventing partial failures from leaving endpoint/service counts and caches diverged (Fixes #1679).
Changes:
- Reorders
addWorkloadToService()updates so the endpoint entry is written before incrementing/persisting the serviceEndpointCount, with rollback on partial failures and cache updates only after BPF writes succeed. - Reorders
deleteEndpoint()to decrement/persist serviceEndpointCountbefore swapping/deleting endpoint entries to avoid transient duplicate backend visibility. - Adds targeted unit tests covering failure windows, rollback behavior, and endpoint/service/cache consistency invariants.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
pkg/controller/workload/workload_processor.go |
Makes add/remove endpoint operations more transactional via ordering changes and rollback behavior. |
pkg/controller/workload/workload_processor_test.go |
Adds tests validating rollback, consistency, and deletion compactness/invariants. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
1673bd8 to
3acb2bd
Compare
|
I’ve addressed the latest review feedback and pushed the follow-up fixes/tests. The PR should now be up to date with all discussed changes. Could you please take a look when you get a chance? Thanks! |
Codecov Report❌ Patch coverage is
❌ Your patch check has failed because the patch coverage (42.42%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
... and 1 file with indirect coverage changes Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
| // Step 1: write the endpoint entry. The service EndpointCount still points to the | ||
| // previous last entry, so the new entry is invisible to the eBPF program until step 2. | ||
| if err := p.bpf.EndpointUpdate(&ek, &ev); err != nil { | ||
| log.Errorf("Update endpoint map failed, err:%s", err) |
| BackendIndex: lastIndex, | ||
| } | ||
| if err := p.bpf.EndpointDelete(lastKey); err != nil { | ||
| log.Errorf("EndpointDelete [%#v] failed: %v", lastKey, err) | ||
| return err | ||
| log.Errorf("EndpointDelete [%#v] failed (ghost entry, non-fatal): %v", lastKey, err) | ||
| } |
| // Step 2: overwrite the deleted slot with the (now unreachable) last entry so | ||
| // the remaining entries stay contiguous. | ||
| lastIndex := newCount + 1 // index of the entry that is now unreachable | ||
| if err := p.bpf.EndpointSwap(ek.BackendIndex, ev.BackendUid, lastIndex, sk.ServiceId, ek.Prio); err != nil { | ||
| log.Errorf("swap workload endpoint index failed: %s", err) | ||
| // Attempt to restore the count so the service map is not left with a count | ||
| // that is one less than the actual number of reachable entries. | ||
| sv.EndpointCount[ek.Prio] = lastIndex | ||
| if restoreErr := p.bpf.ServiceUpdate(&sk, &sv); restoreErr != nil { | ||
| log.Errorf("restore EndpointCount after swap failure failed: %v", restoreErr) | ||
| return fmt.Errorf("swap workload endpoint index failed: %w; restoring EndpointCount also failed: %v", err, restoreErr) | ||
| } | ||
| return err | ||
| } | ||
|
|
||
| // Step 3: delete the now-unreachable last entry. A failure here leaves a ghost | ||
| // entry that is invisible to the eBPF program because EndpointCount was already | ||
| // decremented in step 1. The entry wastes one map slot but does not affect |
| // Close the endpoint map. deleteEndpoint will: | ||
| // step 1: decrement count to 0, ServiceUpdate (service map open) — succeeds. | ||
| // step 2: EndpointSwap(1, wlID, 1, svcID, 0) — currentIndex==lastIndex==1, | ||
| // lastIndex<=1 branch → no-op, returns nil (no endpoint map access). | ||
| // step 3: EndpointDelete(lastKey) — endpoint map closed → fails → logs error, | ||
| // returns nil (ghost entry is non-fatal). | ||
| workloadMap.KmEndpoint.Close() | ||
|
|
||
| err := p.deleteEndpoint(ek, ev, sv, sk) | ||
| assert.NoError(t, err, | ||
| "deleteEndpoint must return nil even when EndpointDelete fails (ghost entry is non-fatal)") |
| // TestAddWorkloadToService_ServiceUpdateFailure_RollbackDeleteAlsoFails documents | ||
| // an unreachable branch in the current fake map infrastructure. | ||
| // | ||
| // To trigger this path, addWorkloadToService must observe: | ||
| // 1. EndpointUpdate succeeds (endpoint map open) | ||
| // 2. ServiceUpdate fails (service map closed) | ||
| // 3. rollback EndpointDelete fails (endpoint map closed) | ||
| // | ||
| // With the current synchronous fake maps, we cannot transition endpoint-map state | ||
| // between (1) and (3) inside a single addWorkloadToService call without changing | ||
| // production code or fake map internals, so this path is intentionally skipped. | ||
| func TestAddWorkloadToService_ServiceUpdateFailure_RollbackDeleteAlsoFails(t *testing.T) { | ||
| t.Skip("requires mid-call endpoint map state transition not possible with current fake map implementation") | ||
| } |
… update failure addWorkloadToService previously incremented sv.EndpointCount before the endpoint BPF map write succeeded, and updated EndpointCache before the service BPF map write succeeded. This created two failure windows: 1. If EndpointUpdate failed, sv.EndpointCount was already incremented in the caller's copy, leaving the in-memory service value out of sync with the BPF service map on the next ServiceLookup. 2. If ServiceUpdate failed after a successful EndpointUpdate, the endpoint entry existed in the BPF map but EndpointCount was not incremented, so the eBPF program could never select the new endpoint (silent traffic loss). EndpointCache was also updated despite the failure, diverging from the actual datapath state. deleteEndpoint previously decremented EndpointCount after the EndpointSwap, meaning the eBPF program could briefly observe the swapped-in backend at two indices simultaneously (duplicate traffic window). Fix addWorkloadToService: - Compute the new BackendIndex without mutating sv.EndpointCount. - Write the endpoint entry first (invisible to eBPF until count grows). - Increment EndpointCount and call ServiceUpdate second. - On ServiceUpdate failure, delete the just-inserted endpoint entry and restore sv.EndpointCount so the caller's copy is not left dirty. - Update EndpointCache only after both BPF writes succeed. Fix deleteEndpoint: - Decrement EndpointCount and call ServiceUpdate first, making the last slot unreachable to the eBPF program before any data is moved. - Perform EndpointSwap second (both slots hold the same backend but the last slot is already unreachable, so no duplicate traffic). - On EndpointSwap failure, attempt to restore EndpointCount. - Delete the now-unreachable last entry last; a failure here leaves a ghost entry that is invisible to the datapath and will be cleaned up on the next restart via handleRemovedAddressesDuringRestart. Add six unit tests covering: - EndpointUpdate failure: sv not mutated, cache not updated. - ServiceUpdate failure: endpoint rolled back, cache not updated. - EndpointCount always equals actual BPF endpoint entries after adds. - deleteEndpoint: count decremented, map compact, deleted entry absent. - deleteEndpoint: last-entry deletion leaves count 0 and map empty. - Mixed add/delete sequence: BPF count, map count, cache count consistent. - Retry after removal succeeds with correct state. Signed-off-by: OmAmbole009 <omambole09@gmail.com>
…nsistency Four targeted fixes based on Copilot review: 1. Add underflow/state guard in deleteEndpoint() Decrementing sv.EndpointCount[ek.Prio] when the count is already 0, or when BackendIndex is 0 or exceeds the current count, would write an invalid count to the BPF service map and corrupt the datapath state. Add explicit checks before the decrement and return a descriptive error so the caller can log and skip the entry rather than persisting bad state. 2. Fix misleading ghost-entry cleanup comment in deleteEndpoint() The previous comment claimed the ghost entry 'will be cleaned up on the next restart via handleRemovedAddressesDuringRestart'. That function cleans up backend/service entries whose UIDs are absent from the userspace cache; it does not explicitly iterate unreachable endpoint slots. Replace the comment with an accurate description: the entry is invisible to the eBPF program because EndpointCount was already decremented, and it will remain until the next explicit cleanup of the owning service or workload. 3. Fix inaccurate test comment in TestAddWorkloadToService_EndpointUpdateFailure The old comment described the failure injection as 'using a service that does not exist in the BPF map'. The actual mechanism is closing workloadMap.KmEndpoint so that the EndpointUpdate call inside addWorkloadToService returns an error. Update the comment to accurately describe what is being tested. 4. Fix outdated function-level rollback comment in addWorkloadToService() The previous doc comment said the rollback restores 'the in-memory EndpointCache'. EndpointCache is now updated only after both BPF writes succeed, so no cache rollback is ever needed. Rewrite the comment to reflect the actual behaviour: on ServiceUpdate failure the endpoint BPF entry is deleted and sv.EndpointCount is restored; the cache is untouched because it was never written. Signed-off-by: OmAmbole009 <omambole09@gmail.com>
… and deleteEndpoint
sv.EndpointCount is a fixed-size array of length bpf.PrioCount (7).
Passing a priority value >= PrioCount causes an out-of-bounds array
access and a runtime panic before any BPF map write occurs.
addWorkloadToService: add an explicit bounds check on the priority
parameter before the first sv.EndpointCount[priority] read.
deleteEndpoint: add an explicit bounds check on ek.Prio before the
existing underflow and BackendIndex guards, which themselves access
sv.EndpointCount[ek.Prio]. The ordering is:
1. ek.Prio < bpf.PrioCount (new, prevents out-of-bounds)
2. EndpointCount[ek.Prio] > 0 (existing, prevents underflow)
3. BackendIndex in [1, EndpointCount[ek.Prio]] (existing, prevents
out-of-range swap)
Both checks return a descriptive error so callers can log and skip the
entry without writing invalid state to the BPF service map.
No logic changes beyond the new guard conditions.
Signed-off-by: OmAmbole009 <omambole09@gmail.com>
…fix test comment Three targeted fixes based on Copilot review: 1. Rename 'swap-then-shrink' to 'shrink-then-swap' in deleteEndpoint doc The implemented order is: decrement EndpointCount + ServiceUpdate (shrink) first, then EndpointSwap (swap) second. The previous algorithm name was inverted relative to the actual code. 2. Return aggregated error when EndpointSwap rollback also fails When EndpointSwap fails and the subsequent ServiceUpdate to restore EndpointCount also fails, the function previously returned only the original swap error, silently discarding the restore failure. Now both errors are surfaced in the returned error so operators can distinguish a swap-only failure from a swap + rollback failure. The existing log line and rollback attempt are preserved unchanged. 3. Rename test and fix misleading comment TestAddWorkloadToService_RetryAfterFailureSucceeds used a different workload name and IP for 'wl2', making it a distinct workload rather than a retry of 'wl1'. The test name and comment implied retry semantics that were not present. Rename to TestAddWorkloadToService_NewWorkloadAfterDeletionSucceeds and update the comment to accurately describe what is tested: adding a new workload after a previous one was deleted leaves the service in a consistent state. No logic changes to the transactional ordering or datapath guarantees. Signed-off-by: OmAmbole009 <omambole09@gmail.com>
Signed-off-by: OmAmbole009 <omambole09@gmail.com>
13338c7 to
ed73d0f
Compare
Summary
This PR fixes a transactional consistency issue during endpoint registration/removal where partial failures could leave EndpointMap, ServiceMap, and EndpointCache in divergent states.
The fix ensures:
endpoint visibility is only exposed to the datapath after successful service count updates
failed updates correctly roll back intermediate state
cache state remains consistent with BPF-visible state
endpoint deletion ordering avoids transient duplicate backend visibility
Fixes #1679
Root Cause
Previously:
sv.EndpointCount was incremented before EndpointUpdate
EndpointCache was updated before ServiceUpdate
This created two failure windows:
Failure window A
If EndpointUpdate() failed:
sv.EndpointCount was already incremented
in-memory state diverged from the BPF service map
Failure window B
If ServiceUpdate() failed:
endpoint entry already existed in the BPF endpoint map
service count was not incremented
endpoint became unreachable to the datapath
cache state incorrectly reflected a successful add
This could lead to silent traffic inconsistencies.
The original ordering was:
Swap -> decrement count -> delete last entry
Between swap and count decrement, the swapped backend could temporarily become visible at two indices simultaneously.
Fix Strategy
addWorkloadToService()
New ordering:
EndpointUpdate -> increment count + ServiceUpdate -> cache update
Behavior:
backend index computed without mutating state
endpoint inserted first while still invisible to datapath
service count updated only after successful endpoint insertion
rollback performed if ServiceUpdate() fails
cache updated only after all BPF updates succeed
deleteEndpoint()
New ordering:
decrement count -> ServiceUpdate -> swap -> delete last entry
Behavior:
last slot becomes unreachable before swap
prevents transient duplicate backend selection
rollback attempted if swap fails
deleting the unreachable last entry becomes safe/non-critical
Tests Added
addWorkloadToService
TestAddWorkloadToService_EndpointUpdateFailure
TestAddWorkloadToService_ServiceUpdateFailure
TestAddWorkloadToService_EndpointCountConsistency
TestAddWorkloadToService_RetryAfterFailureSucceeds
deleteEndpoint
TestDeleteEndpoint_CountDecrementedBeforeSwap
TestDeleteEndpoint_DeleteLastEntry
TestDeleteEndpoint_EndpointCountServiceMapConsistency
These tests validate:
rollback correctness
endpoint/service consistency
cache consistency
retry behavior
endpoint count invariants
map compactness after deletion
Result
This change improves consistency guarantees between:
userspace workload state
EndpointMap
ServiceMap
datapath-visible topology
and prevents silent routing inconsistencies during partial BPF update failures.