Skip to content

Bugfix: collector unreachable causing system OOM#130

Open
maxming2333 wants to merge 1 commit intoapache:masterfrom
maxming2333:master
Open

Bugfix: collector unreachable causing system OOM#130
maxming2333 wants to merge 1 commit intoapache:masterfrom
maxming2333:master

Conversation

@maxming2333
Copy link
Copy Markdown

@maxming2333 maxming2333 commented Mar 27, 2026

Summary

Fix a memory leak that causes OOM when the SkyWalking OAP collector is
persistently unreachable (e.g. ECONNREFUSED / UNAVAILABLE).

Root Cause

reportFunction is scheduled every 1 s. On each tick, regardless of
connectivity, it:

  1. Unconditionally emits segments-sent at the very top, resetting
    SpanContext.nActiveSegments to 0. This re-opens the gate for new
    segments immediately, even though nothing was delivered.

  2. Calls this.reporterClient.collect() with a deadline of
    now + traceTimeout (default 10 000 ms). When the channel is in
    TRANSIENT_FAILURE, gRPC-js accepts the call and pushes it onto its
    internal pickQueue (internal-channel.js → queueCallForPick).
    Every stream.write() keeps a SegmentObjectAdapter reference alive
    inside that queued call until the deadline expires.

  3. Clears this.buffer — giving a false impression that memory is freed,
    while the actual data is still retained inside pickQueue.

With a 1 s interval and a 10 s deadline, ~10 calls accumulate
concurrently
in pickQueue. Each holds up to maxBufferSize (1 000)
segment objects. Because segments-sent is always emitted, new segments
keep flowing in without bound → heap grows linearly → OOM.

This is visible in V8 heap snapshots as a large count of ConsOneByteString
objects from the repeated "Error: 14 UNAVAILABLE: No connection established"
stack-trace string, retained by gRPC interceptor closure chains.

Fix

Before creating any gRPC stream, check this.isConnected (which internally
calls getConnectivityState(true)).

If not connected:

  • Skip collect() — no new call is pushed onto pickQueue.
  • Skip segments-sentSpanContext.nActiveSegments stays non-zero,
    so SpanContext returns DummyContext for new spans. This applies
    natural backpressure and stops further segment production during an outage.
  • Clear this.buffer — releases any segments already queued at the
    application layer.
  • getConnectivityState(true) also triggers exitIdle() when the channel
    is IDLE, so reconnection is automatic and tracing resumes within the
    next 1 s tick once the collector recovers.

If connected (READY): behaviour is identical to before.

Behaviour Comparison

Scenario Before After
Collector reachable Normal Normal (unchanged)
Collector unreachable OOM over time Memory stable, spans become no-ops
Collector recovers N/A Automatically resumes within ≤1 s

Testing

Start the agent pointing at an unreachable address, generate continuous
traffic, and observe process.memoryUsage().heapUsed over time.

Before fix: RSS grows ~50 MB/min at ~200 req/s.
After fix: RSS remains stable; heap usage does not grow.

Once the collector address is restored, tracing resumes automatically
without restarting the process.

Copilot AI review requested due to automatic review settings March 27, 2026 03:19
@tkNobug
Copy link
Copy Markdown

tkNobug commented Mar 27, 2026 via email

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes an OOM scenario in the gRPC trace reporter when the SkyWalking collector is unreachable, by avoiding gRPC collect() calls that can accumulate queued writes and retain large error/stack objects in memory.

Changes:

  • Skip collect() when the gRPC channel isn’t READY to prevent queued/pending streaming calls from accumulating.
  • Withhold the segments-sent event while disconnected to keep existing SpanContext backpressure active.
  • Minor adjustment to the buffer-full warning message formatting.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

// reconnection automatically when the channel is IDLE, so recovery
// is transparent once the collector comes back.
if (!this.isConnected) {
this.buffer.length = 0;
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clearing this.buffer when the channel isn't READY drops all already-finished segments immediately, even for short transient disconnects / initial CONNECTING state, which defeats the purpose of having a bounded buffer and makes recovery unable to flush the backlog once the collector is back. Consider keeping the buffer intact (it’s already capped by maxBufferSize) and simply skipping collect() + not emitting segments-sent while disconnected; only drop segments when the buffer cap is hit (or after a separate time-based policy).

Suggested change
this.buffer.length = 0;

Copilot uses AI. Check for mistakes.
@maxming2333
Copy link
Copy Markdown
Author

Close: apache/skywalking#13764

@wu-sheng wu-sheng added this to the 0.8.0 milestone Mar 27, 2026
@wu-sheng wu-sheng added the enhancement New feature or request label Mar 27, 2026
@wu-sheng wu-sheng requested a review from kezhenxu94 March 27, 2026 03:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants