feat: granular debug ready status#302
Conversation
|
@vanshika2720: The label(s) DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
✅ Deploy Preview for kmesh-net ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Welcome @vanshika2720! It looks like this is your first PR to kmesh-net/website 🎉 |
There was a problem hiding this comment.
Code Review
This pull request implements a comprehensive status and control plane for Kmesh, including a BpfLoader for eBPF management, ADS and Workload controllers for XDS synchronization, and a StatusServer for administrative tasks and health monitoring. Key feedback points include addressing a potential deadlock caused by holding a mutex during network calls, preventing a runtime panic in reflection logic within the health check, and transitioning from formatted strings to structured data in status responses to improve compatibility with external monitoring tools.
| c.mu.Lock() | ||
| if c.grpcConn, err = nets.GrpcConnect(c.xdsConfig.DiscoveryAddress); err != nil { | ||
| c.mu.Unlock() | ||
| return fmt.Errorf("grpc connect failed: %s", err) | ||
| } | ||
|
|
||
| c.client = discoveryv3.NewAggregatedDiscoveryServiceClient(c.grpcConn) | ||
| c.lastConnect = time.Now() | ||
| c.mu.Unlock() |
There was a problem hiding this comment.
The c.mu lock is held while performing the network operation nets.GrpcConnect. This can lead to lock contention and block other calls like IsReady() or GetGrpcState() if the connection attempt takes a long time (e.g., due to network timeouts or a slow control plane). It is better to perform the dial outside the lock and then update the client fields.
| c.mu.Lock() | |
| if c.grpcConn, err = nets.GrpcConnect(c.xdsConfig.DiscoveryAddress); err != nil { | |
| c.mu.Unlock() | |
| return fmt.Errorf("grpc connect failed: %s", err) | |
| } | |
| c.client = discoveryv3.NewAggregatedDiscoveryServiceClient(c.grpcConn) | |
| c.lastConnect = time.Now() | |
| c.mu.Unlock() | |
| conn, err := nets.GrpcConnect(c.xdsConfig.DiscoveryAddress) | |
| if err != nil { | |
| return fmt.Errorf("grpc connect failed: %s", err) | |
| } | |
| c.mu.Lock() | |
| c.grpcConn = conn | |
| c.client = discoveryv3.NewAggregatedDiscoveryServiceClient(c.grpcConn) | |
| c.lastConnect = time.Now() | |
| c.mu.Unlock() |
| } | ||
|
|
||
| func formatLinkStatus(link interface{}) string { | ||
| if link == nil || reflect.ValueOf(link).IsNil() { |
There was a problem hiding this comment.
Using reflect.ValueOf(link).IsNil() can panic if the interface link contains a non-nillable type (such as a struct). While most eBPF link implementations are pointers, it is safer to check the Kind before calling IsNil() to avoid potential runtime panics in the health check endpoint.
| if link == nil || reflect.ValueOf(link).IsNil() { | |
| if link == nil { | |
| return "not attached" | |
| } | |
| v := reflect.ValueOf(link) | |
| switch v.Kind() { | |
| case reflect.Chan, reflect.Func, reflect.Map, reflect.Ptr, reflect.UnsafePointer, reflect.Interface, reflect.Slice: | |
| if v.IsNil() { | |
| return "not attached" | |
| } | |
| } | |
| return "ok" |
| func (c *XdsClient) GetXdsStreamStability() string { | ||
| if c == nil { | ||
| return "not initialized" | ||
| } | ||
| c.mu.RLock() | ||
| defer c.mu.RUnlock() | ||
| return fmt.Sprintf("reconnects: %d, last_connect: %v", c.reconnectCount, c.lastConnect.Format(time.RFC3339)) | ||
| } |
There was a problem hiding this comment.
Returning a formatted string for stream stability makes it harder for programmatic consumers (like the Headlamp plugin mentioned in the PR description) to use this data. It is better to return structured data (reconnect count and last connect time) so the UI can format it as needed.
| func (c *XdsClient) GetXdsStreamStability() string { | |
| if c == nil { | |
| return "not initialized" | |
| } | |
| c.mu.RLock() | |
| defer c.mu.RUnlock() | |
| return fmt.Sprintf("reconnects: %d, last_connect: %v", c.reconnectCount, c.lastConnect.Format(time.RFC3339)) | |
| } | |
| func (c *XdsClient) GetXdsStreamStability() (uint64, time.Time) { | |
| if c == nil { | |
| return 0, time.Time{} | |
| } | |
| c.mu.RLock() | |
| defer c.mu.RUnlock() | |
| return c.reconnectCount, c.lastConnect | |
| } |
| type XdsStatus struct { | ||
| State string `json:"state"` | ||
| Controller string `json:"controller"` | ||
| StreamStability string `json:"stream_stability"` | ||
| } |
There was a problem hiding this comment.
To support the Headlamp plugin's visual indicators, the XdsStatus should provide structured fields for reconnect counts and connection timestamps instead of a single formatted string.
| type XdsStatus struct { | |
| State string `json:"state"` | |
| Controller string `json:"controller"` | |
| StreamStability string `json:"stream_stability"` | |
| } | |
| type XdsStatus struct { | |
| State string `json:"state"` | |
| Controller string `json:"controller"` | |
| ReconnectCount uint64 `json:"reconnect_count"` | |
| LastConnectTime time.Time `json:"last_connect_time"` | |
| } |
e06815a to
e82bee9
Compare
Signed-off-by: vanshika2720 <pahalvanshikaa@gmail.com>
e82bee9 to
6bb27e2
Compare
What type of PR is this?
/kind feature
What this PR does / why we need it
This PR expands the:
endpoint to provide granular health visibility for:
These enhancements improve operational observability and enable:
in the Headlamp plugin.
This allows users to verify:
Previously, the readiness endpoint only exposed coarse readiness state, making it difficult to diagnose partial failures or unstable control plane connectivity.
Key changes
Granular BPF status reporting
Expanded
BpfLoaderreadiness reporting to include:This improves low-level dataplane observability.
XDS stream stability tracking
Added thread-safe XDS connection stability tracking in:
XdsClientincluding:
This provides better visibility into:
Expanded readiness response
Enhanced the JSON payload returned by:
to expose detailed component-level readiness information for:
This makes the endpoint more useful for:
Controller readiness integration
Integrated readiness checks into:
AdsControllerWorkloadControllerto provide centralized readiness reporting across core mesh components.
Which issue(s) this PR fixes
Fixes #
Special notes for your reviewer
Thread safety
Introduced:
in:
XdsClientto ensure safe concurrent access during readiness and status reporting.
Test updates
Updated:
to validate the new granular readiness response format.
Formatting
Applied:
to all modified files.
Why this matters
These changes improve:
Users can now identify:
without relying on logs or deep internal debugging.
Does this PR introduce a user-facing change?