
feat: granular debug ready status#302

Closed
vanshika2720 wants to merge 1 commit into
kmesh-net:mainfrom
vanshika2720:feat/granular-debug-ready

Conversation

@vanshika2720

What type of PR is this?

/kind feature


What this PR does / why we need it

This PR expands the /debug/ready endpoint to provide granular health visibility for:

  • eBPF programs
  • eBPF maps
  • XDS stream stability

These enhancements improve operational observability and enable visual indicators of mesh status in the Headlamp plugin.

This allows users to verify:

  • Whether BPF programs are correctly attached
  • Whether required BPF maps are healthy
  • Whether the XDS control plane connection is stable

Previously, the readiness endpoint only exposed coarse readiness state, making it difficult to diagnose partial failures or unstable control plane connectivity.


Key changes

Granular BPF status reporting

Expanded BpfLoader readiness reporting to include:

  • Individual eBPF program attachment status
  • eBPF map readiness information
  • Detailed component-level health visibility

This improves low-level dataplane observability.


XDS stream stability tracking

Added thread-safe XDS connection stability tracking in XdsClient, including:

  • Reconnect counts
  • Last successful connect time
  • Stream stability metadata

This provides better visibility into:

  • Control plane health
  • ADS stream reliability
  • Reconnection behavior

Expanded readiness response

Enhanced the JSON payload returned by /debug/ready to expose detailed component-level readiness information for:

  • BPF programs
  • Maps
  • XDS connectivity
  • Controller readiness

This makes the endpoint more useful for:

  • Headlamp UI integration
  • Monitoring systems
  • Operational debugging
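
As an illustration only, a component-level payload of this kind might be assembled as below. Every type name, JSON key, component name, and the rule "ready only when all components report ok and the XDS state is READY" is an assumption for this sketch; the PR description does not show the actual schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// componentStatus and readyReport are hypothetical shapes, not the PR's schema.
type componentStatus struct {
	Name   string `json:"name"`
	Status string `json:"status"` // for example "ok" or "not attached"
}

type readyReport struct {
	Ready    bool              `json:"ready"`
	Programs []componentStatus `json:"bpf_programs"`
	Maps     []componentStatus `json:"bpf_maps"`
	XdsState string            `json:"xds_state"`
}

// buildReport marks the agent ready only when every program and map
// reports "ok" and the XDS state string equals "READY".
func buildReport(programs, maps []componentStatus, xdsState string) readyReport {
	ready := xdsState == "READY"
	for _, group := range [][]componentStatus{programs, maps} {
		for _, c := range group {
			if c.Status != "ok" {
				ready = false
			}
		}
	}
	return readyReport{Ready: ready, Programs: programs, Maps: maps, XdsState: xdsState}
}

func main() {
	report := buildReport(
		[]componentStatus{{Name: "example_prog_a", Status: "ok"}, {Name: "example_prog_b", Status: "not attached"}},
		[]componentStatus{{Name: "example_map", Status: "ok"}},
		"READY",
	)
	out, err := json.MarshalIndent(report, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Keeping the per-component entries alongside a single aggregate flag lets a UI show which part failed while a simple probe can still key off one boolean.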

Controller readiness integration

Integrated readiness checks into AdsController and WorkloadController to provide centralized readiness reporting across core mesh components.


Which issue(s) this PR fixes

Fixes #

(Please add the issue number here if applicable)

Special notes for your reviewer

Thread safety

Introduced sync.RWMutex in XdsClient and the controllers to ensure safe concurrent access during readiness and status reporting.


Test updates

Updated pkg/status/ready_test.go to validate the new granular readiness response format.
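
The contents of that test file are not shown in this conversation. As a rough sketch of how a response-format check can be written with the standard `net/http/httptest` package, the example below drives a handler in-process and verifies that expected top-level keys exist. `readyHandler`, `checkReadyResponse`, and the JSON keys are all hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// readyHandler is a hypothetical stand-in for the real /debug/ready handler.
func readyHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(map[string]interface{}{
		"bpf_programs": map[string]string{"example_prog": "ok"},
		"xds":          map[string]interface{}{"state": "READY", "reconnect_count": 0},
	})
}

// checkReadyResponse calls the handler with a recorded request and verifies
// the status code and the presence of each expected top-level JSON key.
func checkReadyResponse(h http.HandlerFunc, wantKeys ...string) error {
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/debug/ready", nil))
	if rec.Code != http.StatusOK {
		return fmt.Errorf("unexpected status %d", rec.Code)
	}
	var body map[string]json.RawMessage
	if err := json.Unmarshal(rec.Body.Bytes(), &body); err != nil {
		return fmt.Errorf("decode body: %w", err)
	}
	for _, k := range wantKeys {
		if _, ok := body[k]; !ok {
			return fmt.Errorf("missing key %q", k)
		}
	}
	return nil
}

func main() {
	if err := checkReadyResponse(readyHandler, "bpf_programs", "xds"); err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println("response format ok")
}
```

In a real `_test.go` file the same check would typically live inside a `func TestXxx(t *testing.T)` and report failures through `t.Fatalf`.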


Formatting

Applied go fmt to all modified files.


Why this matters

These changes improve:

  • Mesh observability
  • Readiness diagnostics
  • Control plane visibility
  • Headlamp integration capabilities

Users can now identify:

  • Missing BPF attachments
  • Map initialization issues
  • Unstable XDS streams
  • Partial readiness failures

without relying on logs or deep internal debugging.


Does this PR introduce a user-facing change?

Expanded the /debug/ready endpoint to include granular status for eBPF programs, maps, and XDS stream stability for better mesh observability.

Copilot AI review requested due to automatic review settings May 14, 2026 15:08
@kmesh-bot
Collaborator

@vanshika2720: The label(s) kind/feature cannot be applied, because the repository doesn't have them.

@netlify

netlify Bot commented May 14, 2026

Deploy Preview for kmesh-net ready!

Name Link
🔨 Latest commit 6bb27e2
🔍 Latest deploy log https://app.netlify.com/projects/kmesh-net/deploys/6a06af115a48bf0008d1e697
😎 Deploy Preview https://deploy-preview-302--kmesh-net.netlify.app

@kmesh-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hzxuzhonghu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kmesh-bot
Collaborator

Welcome @vanshika2720! It looks like this is your first PR to kmesh-net/website 🎉


Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a comprehensive status and control plane for Kmesh, including a BpfLoader for eBPF management, ADS and Workload controllers for XDS synchronization, and a StatusServer for administrative tasks and health monitoring. Key feedback points include addressing a potential deadlock caused by holding a mutex during network calls, preventing a runtime panic in reflection logic within the health check, and transitioning from formatted strings to structured data in status responses to improve compatibility with external monitoring tools.

Comment thread pkg/controller/client.go
Comment on lines +83 to +91
c.mu.Lock()
if c.grpcConn, err = nets.GrpcConnect(c.xdsConfig.DiscoveryAddress); err != nil {
	c.mu.Unlock()
	return fmt.Errorf("grpc connect failed: %s", err)
}

c.client = discoveryv3.NewAggregatedDiscoveryServiceClient(c.grpcConn)
c.lastConnect = time.Now()
c.mu.Unlock()

high

The c.mu lock is held while performing the network operation nets.GrpcConnect. This can lead to lock contention and block other calls like IsReady() or GetGrpcState() if the connection attempt takes a long time (e.g., due to network timeouts or a slow control plane). It is better to perform the dial outside the lock and then update the client fields.

Suggested change
- c.mu.Lock()
- if c.grpcConn, err = nets.GrpcConnect(c.xdsConfig.DiscoveryAddress); err != nil {
- 	c.mu.Unlock()
- 	return fmt.Errorf("grpc connect failed: %s", err)
- }
- c.client = discoveryv3.NewAggregatedDiscoveryServiceClient(c.grpcConn)
- c.lastConnect = time.Now()
- c.mu.Unlock()
+ conn, err := nets.GrpcConnect(c.xdsConfig.DiscoveryAddress)
+ if err != nil {
+ 	return fmt.Errorf("grpc connect failed: %s", err)
+ }
+ c.mu.Lock()
+ c.grpcConn = conn
+ c.client = discoveryv3.NewAggregatedDiscoveryServiceClient(c.grpcConn)
+ c.lastConnect = time.Now()
+ c.mu.Unlock()

Comment thread pkg/bpf/bpf.go Outdated
}

func formatLinkStatus(link interface{}) string {
	if link == nil || reflect.ValueOf(link).IsNil() {

medium

Using reflect.ValueOf(link).IsNil() can panic if the interface link contains a non-nillable type (such as a struct). While most eBPF link implementations are pointers, it is safer to check the Kind before calling IsNil() to avoid potential runtime panics in the health check endpoint.

Suggested change
- if link == nil || reflect.ValueOf(link).IsNil() {
+ if link == nil {
+ 	return "not attached"
+ }
+ v := reflect.ValueOf(link)
+ switch v.Kind() {
+ case reflect.Chan, reflect.Func, reflect.Map, reflect.Ptr, reflect.UnsafePointer, reflect.Interface, reflect.Slice:
+ 	if v.IsNil() {
+ 		return "not attached"
+ 	}
+ }
+ return "ok"
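
The behavior the reviewer describes is easy to reproduce outside the PR. The following self-contained sketch uses a hypothetical `fakeLink` type and an `isNilLink` helper (neither is from the Kmesh code) to show why the `Kind` check matters: `reflect.Value.IsNil` panics for kinds such as struct, while a nil pointer stored in an interface is not equal to `nil`.

```go
package main

import (
	"fmt"
	"reflect"
)

// fakeLink is a hypothetical link type used only for this demonstration.
type fakeLink struct{}

// isNilLink reports whether v is nil or wraps a nil value of a nillable kind.
// Checking Kind first avoids the panic that IsNil raises on structs, ints, etc.
func isNilLink(v interface{}) bool {
	if v == nil {
		return true
	}
	rv := reflect.ValueOf(v)
	switch rv.Kind() {
	case reflect.Chan, reflect.Func, reflect.Map, reflect.Ptr,
		reflect.UnsafePointer, reflect.Interface, reflect.Slice:
		return rv.IsNil()
	}
	return false
}

func main() {
	var typedNil *fakeLink
	fmt.Println(isNilLink(nil))         // true
	fmt.Println(isNilLink(typedNil))    // true: non-nil interface holding a nil pointer
	fmt.Println(isNilLink(&fakeLink{})) // false
	fmt.Println(isNilLink(fakeLink{}))  // false, and no panic for the struct kind
}
```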

Comment thread pkg/controller/client.go
Comment on lines +255 to +262
func (c *XdsClient) GetXdsStreamStability() string {
	if c == nil {
		return "not initialized"
	}
	c.mu.RLock()
	defer c.mu.RUnlock()
	return fmt.Sprintf("reconnects: %d, last_connect: %v", c.reconnectCount, c.lastConnect.Format(time.RFC3339))
}

medium

Returning a formatted string for stream stability makes it harder for programmatic consumers (like the Headlamp plugin mentioned in the PR description) to use this data. It is better to return structured data (reconnect count and last connect time) so the UI can format it as needed.

Suggested change
- func (c *XdsClient) GetXdsStreamStability() string {
- 	if c == nil {
- 		return "not initialized"
- 	}
- 	c.mu.RLock()
- 	defer c.mu.RUnlock()
- 	return fmt.Sprintf("reconnects: %d, last_connect: %v", c.reconnectCount, c.lastConnect.Format(time.RFC3339))
- }
+ func (c *XdsClient) GetXdsStreamStability() (uint64, time.Time) {
+ 	if c == nil {
+ 		return 0, time.Time{}
+ 	}
+ 	c.mu.RLock()
+ 	defer c.mu.RUnlock()
+ 	return c.reconnectCount, c.lastConnect
+ }

Comment on lines +506 to +510
type XdsStatus struct {
	State           string `json:"state"`
	Controller      string `json:"controller"`
	StreamStability string `json:"stream_stability"`
}

medium

To support the Headlamp plugin's visual indicators, the XdsStatus should provide structured fields for reconnect counts and connection timestamps instead of a single formatted string.

Suggested change
- type XdsStatus struct {
- 	State           string `json:"state"`
- 	Controller      string `json:"controller"`
- 	StreamStability string `json:"stream_stability"`
- }
+ type XdsStatus struct {
+ 	State           string    `json:"state"`
+ 	Controller      string    `json:"controller"`
+ 	ReconnectCount  uint64    `json:"reconnect_count"`
+ 	LastConnectTime time.Time `json:"last_connect_time"`
+ }

@vanshika2720 vanshika2720 force-pushed the feat/granular-debug-ready branch from e06815a to e82bee9 Compare May 14, 2026 17:27
Signed-off-by: vanshika2720 <pahalvanshikaa@gmail.com>
Copilot AI review requested due to automatic review settings May 15, 2026 05:28
@vanshika2720 vanshika2720 force-pushed the feat/granular-debug-ready branch from e82bee9 to 6bb27e2 Compare May 15, 2026 05:28

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
