Bugfix: collector unreachable causing system OOM by maxming2333 · Pull Request #130 · apache/skywalking-nodejs

maxming2333 · 2026-03-27T03:19:27Z

Summary

Fix a memory leak that causes OOM when the SkyWalking OAP collector is
persistently unreachable (e.g. ECONNREFUSED / UNAVAILABLE).

Root Cause

reportFunction is scheduled every 1 s. On each tick, regardless of
connectivity, it:

Unconditionally emits segments-sent at the very top, resetting
SpanContext.nActiveSegments to 0. This re-opens the gate for new
segments immediately, even though nothing was delivered.
Calls this.reporterClient.collect() with a deadline of
now + traceTimeout (default 10 000 ms). When the channel is in
TRANSIENT_FAILURE, gRPC-js accepts the call and pushes it onto its
internal pickQueue (internal-channel.js → queueCallForPick).
Every stream.write() keeps a SegmentObjectAdapter reference alive
inside that queued call until the deadline expires.
Clears this.buffer — giving a false impression that memory is freed,
while the actual data is still retained inside pickQueue.

With a 1 s interval and a 10 s deadline, ~10 calls accumulate
concurrently in pickQueue. Each holds up to maxBufferSize (1 000)
segment objects. Because segments-sent is always emitted, new segments
keep flowing in without bound → heap grows linearly → OOM.

This is visible in V8 heap snapshots as a large count of ConsOneByteString
objects from the repeated "Error: 14 UNAVAILABLE: No connection established"
stack-trace string, retained by gRPC interceptor closure chains.

Fix

Before creating any gRPC stream, check this.isConnected (which internally
calls getConnectivityState(true)).

If not connected:

Skip collect() — no new call is pushed onto pickQueue.
Skip segments-sent — SpanContext.nActiveSegments stays non-zero,
so SpanContext returns DummyContext for new spans. This applies
natural backpressure and stops further segment production during an outage.
Clear this.buffer — releases any segments already queued at the
application layer.
getConnectivityState(true) also triggers exitIdle() when the channel
is IDLE, so reconnection is automatic and tracing resumes within the
next 1 s tick once the collector recovers.

If connected (READY): behaviour is identical to before.

Behaviour Comparison

Scenario	Before	After
Collector reachable	Normal	Normal (unchanged)
Collector unreachable	OOM over time	Memory stable, spans become no-ops
Collector recovers	N/A	Automatically resumes within ≤1 s

Testing

Start the agent pointing at an unreachable address, generate continuous
traffic, and observe process.memoryUsage().heapUsed over time.

Before fix: RSS grows ~50 MB/min at ~200 req/s.
After fix: RSS remains stable; heap usage does not grow.

Once the collector address is restored, tracing resumes automatically
without restarting the process.

tkNobug · 2026-03-27T03:20:04Z

这是来自QQ邮箱的假期自动回复邮件。您好，我最近正在休假中，无法亲自回复您的邮件。我将在假期结束后，尽快给您回复。

Copilot

Pull request overview

Fixes an OOM scenario in the gRPC trace reporter when the SkyWalking collector is unreachable, by avoiding gRPC collect() calls that can accumulate queued writes and retain large error/stack objects in memory.

Changes:

Skip collect() when the gRPC channel isn’t READY to prevent queued/pending streaming calls from accumulating.
Withhold the segments-sent event while disconnected to keep existing SpanContext backpressure active.
Minor adjustment to the buffer-full warning message formatting.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-27T03:24:28Z

src/agent/protocol/grpc/clients/TraceReportClient.ts

+      // reconnection automatically when the channel is IDLE, so recovery
+      // is transparent once the collector comes back.
+      if (!this.isConnected) {
+        this.buffer.length = 0;


Clearing this.buffer when the channel isn't READY drops all already-finished segments immediately, even for short transient disconnects / initial CONNECTING state, which defeats the purpose of having a bounded buffer and makes recovery unable to flush the backlog once the collector is back. Consider keeping the buffer intact (it’s already capped by maxBufferSize) and simply skipping collect() + not emitting segments-sent while disconnected; only drop segments when the buffer cap is hit (or after a separate time-based policy).

Suggested change

this.buffer.length = 0;

maxming2333 · 2026-03-27T03:26:10Z

Close: apache/skywalking#13764

fix: collector unreachable causing system OOM

658cc99

Copilot AI review requested due to automatic review settings March 27, 2026 03:19

Copilot started reviewing on behalf of maxming2333 March 27, 2026 03:19 View session

Copilot AI reviewed Mar 27, 2026

View reviewed changes

wu-sheng added this to the 0.8.0 milestone Mar 27, 2026

wu-sheng added the enhancement New feature or request label Mar 27, 2026

wu-sheng requested a review from kezhenxu94 March 27, 2026 03:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bugfix: collector unreachable causing system OOM#130

Bugfix: collector unreachable causing system OOM#130
maxming2333 wants to merge 1 commit intoapache:masterfrom
maxming2333:master

maxming2333 commented Mar 27, 2026 •

edited

Loading

Uh oh!

tkNobug commented Mar 27, 2026 via email

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 27, 2026

Uh oh!

maxming2333 commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

maxming2333 commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root Cause

Fix

Behaviour Comparison

Testing

Uh oh!

tkNobug commented Mar 27, 2026 via email

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

maxming2333 commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

maxming2333 commented Mar 27, 2026 •

edited

Loading