Bugfix: collector unreachable causing system OOM #130

maxming2333 wants to merge 1 commit into apache:master from
Conversation
[Translated from Chinese] This is an automated vacation reply from QQ Mail. Hello, I am currently on vacation and unable to reply to your email in person. I will respond as soon as possible once my vacation ends.
Pull request overview
Fixes an OOM scenario in the gRPC trace reporter when the SkyWalking collector is unreachable, by avoiding gRPC collect() calls that can accumulate queued writes and retain large error/stack objects in memory.
Changes:
- Skip `collect()` when the gRPC channel isn't `READY`, to prevent queued/pending streaming calls from accumulating.
- Withhold the `segments-sent` event while disconnected to keep existing SpanContext backpressure active.
- Minor adjustment to the buffer-full warning message formatting.
```typescript
// reconnection automatically when the channel is IDLE, so recovery
// is transparent once the collector comes back.
if (!this.isConnected) {
  this.buffer.length = 0;
```
Clearing this.buffer when the channel isn't READY drops all already-finished segments immediately, even for short transient disconnects / initial CONNECTING state, which defeats the purpose of having a bounded buffer and makes recovery unable to flush the backlog once the collector is back. Consider keeping the buffer intact (it’s already capped by maxBufferSize) and simply skipping collect() + not emitting segments-sent while disconnected; only drop segments when the buffer cap is hit (or after a separate time-based policy).
Close: apache/skywalking#13764
Summary
Fix a memory leak that causes OOM when the SkyWalking OAP collector is
persistently unreachable (e.g. `ECONNREFUSED` / `UNAVAILABLE`).

Root Cause

`reportFunction` is scheduled every 1 s. On each tick, regardless of
connectivity, it:

1. Unconditionally emits `segments-sent` at the very top, resetting
   `SpanContext.nActiveSegments` to 0. This re-opens the gate for new
   segments immediately, even though nothing was delivered.
2. Calls `this.reporterClient.collect()` with a deadline of
   `now + traceTimeout` (default 10 000 ms). When the channel is in
   `TRANSIENT_FAILURE`, gRPC-js accepts the call and pushes it onto its
   internal `pickQueue` (`internal-channel.js` → `queueCallForPick`).
   Every `stream.write()` keeps a `SegmentObjectAdapter` reference alive
   inside that queued call until the deadline expires.
3. Clears `this.buffer`, giving a false impression that memory is freed,
   while the actual data is still retained inside `pickQueue`.

With a 1 s interval and a 10 s deadline, ~10 calls accumulate
concurrently in `pickQueue`. Each holds up to `maxBufferSize` (1 000)
segment objects. Because `segments-sent` is always emitted, new segments
keep flowing in without bound, so the heap grows linearly until OOM.

This is visible in V8 heap snapshots as a large count of
`ConsOneByteString` objects from the repeated
`"Error: 14 UNAVAILABLE: No connection established"` stack-trace string,
retained by gRPC interceptor closure chains.
Fix

Before creating any gRPC stream, check `this.isConnected` (which
internally calls `getConnectivityState(true)`).

If not connected:

- Skip `collect()`: no new call is pushed onto `pickQueue`.
- Don't emit `segments-sent`: `SpanContext.nActiveSegments` stays
  non-zero, so `SpanContext` returns `DummyContext` for new spans. This
  applies natural backpressure and stops further segment production
  during an outage.
- Clear `this.buffer`: releases any segments already queued at the
  application layer.

`getConnectivityState(true)` also triggers `exitIdle()` when the channel
is `IDLE`, so reconnection is automatic and tracing resumes within the
next 1 s tick once the collector recovers.

If connected (`READY`): behaviour is identical to before.
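The `getConnectivityState(true)` behaviour the fix relies on can be modeled in a few lines. `ChannelModel` is a toy stand-in for the grpc-js channel, not its real API; only the observable semantics (querying with `tryToConnect=true` kicks an `IDLE` channel out of idle) are taken from the description above.

```typescript
// Toy model: asking for the state with tryToConnect=true makes an IDLE
// channel attempt a connection (exitIdle), so a recovered collector is
// picked up on the next report tick without any explicit reconnect logic.
enum State { IDLE, CONNECTING, READY, TRANSIENT_FAILURE }

class ChannelModel {
  constructor(public state: State, private collectorUp: () => boolean) {}

  getConnectivityState(tryToConnect: boolean): State {
    if (tryToConnect && this.state === State.IDLE) {
      // exitIdle(): start connecting; this toy model resolves instantly
      this.state = this.collectorUp() ? State.READY : State.TRANSIENT_FAILURE;
    }
    return this.state;
  }
}

const isConnected = (ch: ChannelModel): boolean =>
  ch.getConnectivityState(true) === State.READY;
```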
Behaviour Comparison
Testing

Start the agent pointing at an unreachable address, generate continuous
traffic, and observe `process.memoryUsage().heapUsed` over time.

Before fix: RSS grows ~50 MB/min at ~200 req/s.
After fix: RSS remains stable; heap usage does not grow.

Once the collector address is restored, tracing resumes automatically
without restarting the process.
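The heap observation above can be scripted with two small helpers; the function names and MB units are illustrative, not part of the PR.

```typescript
// Sample heapUsed in MB and measure growth between the first and last
// sample; run this alongside continuous traffic against an unreachable
// collector to reproduce the before/after comparison.
function sampleHeapMB(): number {
  return process.memoryUsage().heapUsed / (1024 * 1024);
}

function heapGrowthMB(samples: number[]): number {
  return samples[samples.length - 1] - samples[0];
}
```

Collect one sample per minute via `setInterval`; a steadily positive `heapGrowthMB` reproduces the leak, while a flat curve confirms the fix.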