Skip to content

[Doc] Make docs clear that clients have a static LogCallback *logCallback and a crash can occur when using multiple clients with log handlers #519

@justintime4tea

Description

@justintime4tea

Search before reporting

  • I searched in the issues and found nothing similar.

What issue do you find in Pulsar docs?

Ran into a rather nasty to debug bug when writing integration tests for a node-js application that uses "pulsar-client" NPM package. Essentially, the issue is though it seems like the "pulsar-client" supports multiple instances of a Pulsar client, the C++ side of things doesn't fully isolate these instances and a single static LogCallback *logCallback is created for ALL clients with each client overwriting this static callback with whatever handler the last executed Pulsar.Client.setLogHandler or Pulsar.Client constructor is given. The fact that you can only have a single log handler isn't documented other than some comments I see in the C++ code nor is the potential issue when setting multiple with multiple log handlers clients.

Why this is an issue

This static LogCallback *logCallback being shared results in a potential race-condition whereby some final message from one of the multiple C++ pulsar client is attempted to be logged and a crash occurs. Since neither the Close or ~Client() destructor clear/set static Client::logCallback to null on the C++ side, it's possible that static Client::logCallback becomes a dangling pointer pointing to a ThreadSafeFunction that was part of a previous client, which has since been destroyed.

Repro

Looking at the Client::Close method and the Client::~Client destructor:

  • Client::~Client() {} (The destructor is empty)
  • Napi::Value Client::Close(const Napi::CallbackInfo &info) (This function only calls pulsar_client_close_async. It does not touch Client::logCallback.)

Conceptual repro steps:

  1. You create Client A, which sets the static Client::logCallback to point to its logger.
  2. You create Client B, which replaces the static Client::logCallback to point to its logger. (Client A's original logger function is now abandoned).
  3. You call await ClientB.close(). The Close method is called, and eventually ~Client() is called.
  4. Neither Close nor the destructor clears the logCallback.
  5. logCallback is now a dangling pointer, pointing to a ThreadSafeFunction that was part of Client B, which is now destroyed.
  6. A background thread from Client A tries to log a message. It uses the static logCallback, accesses the dangling pointer, and the process crashes.

What is your suggestion?

I suggest that, at bare minimum and baring the C++ code being "corrected", documentation is updated so that consumers of pulsar client libraries are made aware of this. I'm not sure exactly how, because it doesn't seem like you'd be safe even just setting the handler a single time due to the nature of the "close" and "destructor" code in C++.

Any reference?

Diagnostics
Though I don't have a code repro I do have a breakdown of how I stumbled upon this - In integration tests where there are many pulsar clients being created and closed, as well as producers/consumers created/closed, along with the underlying Node JS app being tested creating/closing it's own pulsar client and consumers/producers, the race condition is triggered and a SIGABRT is displayed near the end of the test output with the following message:

libc++abi: terminating due to uncaught exception of type Napi::Error:
 ✖  SIGABRT

Running the tests using lldb to capture the coredump results in the following which lead me to the issue:

    frame apache/pulsar#8: 0x00000001922fbbd8 libc++abi.dylib`__cxa_throw + 92
    frame apache/pulsar#9: 0x0000000151a485d4 pulsar.node`Napi::Function::Call(std::initializer_list<napi_value__*> const&) const + 196
    frame apache/pulsar#10: 0x0000000151a4fc10 pulsar.node`LogMessageProxy(Napi::Env, Napi::Function, LogMessage*) + 272
    frame apache/pulsar#11: 0x0000000151a4c1ec pulsar.node`Napi::ThreadSafeFunction::CallJS(napi_env__*, napi_value__*, void*, void*) + 68
    frame apache/pulsar#12: 0x00000001000c6b94 node`v8impl::(anonymous namespace)::ThreadSafeFunction::AsyncCb(uv_async_s*) + 472

Workarounds

  • Explicitly calling Pulsar.Client.setLogHandler(null) in "cleanup" stage of tests and "graceful shutdown" stage of the NodeJS application bypasses the issue
  • For whatever reason, this only occurs on node versions below v24.
    • I tested and repro'ed in many v22.x versions and v23.x versions
    • I can't seem to reproduce this issue with node v24.

Full Coredump:

* thread apache/pulsar#1
  * frame #0: 0x000000019230a388 libsystem_kernel.dylib`__pthread_kill + 8
    frame apache/pulsar#1: 0x000000019234388c libsystem_pthread.dylib`pthread_kill + 296
    frame apache/pulsar#2: 0x000000019224ca3c libsystem_c.dylib`abort + 124
    frame apache/pulsar#3: 0x00000001922f9384 libc++abi.dylib`abort_message + 132
    frame apache/pulsar#4: 0x00000001922e7cd8 libc++abi.dylib`demangling_terminate_handler() + 316
    frame apache/pulsar#5: 0x0000000191f6cde4 libobjc.A.dylib`_objc_terminate() + 172
    frame apache/pulsar#6: 0x00000001922f8698 libc++abi.dylib`std::__terminate(void (*)()) + 16
    frame apache/pulsar#7: 0x00000001922fbc30 libc++abi.dylib`__cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 88
    frame apache/pulsar#8: 0x00000001922fbbd8 libc++abi.dylib`__cxa_throw + 92
    frame apache/pulsar#9: 0x0000000151a485d4 pulsar.node`Napi::Function::Call(std::initializer_list<napi_value__*> const&) const + 196
    frame apache/pulsar#10: 0x0000000151a4fc10 pulsar.node`LogMessageProxy(Napi::Env, Napi::Function, LogMessage*) + 272
    frame apache/pulsar#11: 0x0000000151a4c1ec pulsar.node`Napi::ThreadSafeFunction::CallJS(napi_env__*, napi_value__*, void*, void*) + 68
    frame apache/pulsar#12: 0x00000001000c6b94 node`v8impl::(anonymous namespace)::ThreadSafeFunction::AsyncCb(uv_async_s*) + 472
    frame apache/pulsar#13: 0x0000000100f01478 node`uv__async_io + 304
    frame apache/pulsar#14: 0x0000000100f1524c node`uv__io_poll + 1512
    frame apache/pulsar#15: 0x0000000100f019e0 node`uv_run + 408
    frame apache/pulsar#16: 0x0000000100093004 node`node::Environment::CleanupHandles() + 188
    frame apache/pulsar#17: 0x000000010009377c node`node::Environment::RunCleanup() + 292
    frame apache/pulsar#18: 0x0000000100009ff0 node`node::FreeEnvironment(node::Environment*) + 128
    frame apache/pulsar#19: 0x000000010014d26c node`node::NodeMainInstance::Run() + 304
    frame apache/pulsar#20: 0x00000001000bf5f4 node`node::Start(int, char**) + 600
    frame apache/pulsar#21: 0x0000000191fa2b98 dyld`start + 6076
  thread apache/pulsar#2
    frame #0: 0x0000000192307d04 libsystem_kernel.dylib`kevent + 8
    frame apache/pulsar#1: 0x0000000100f14f74 node`uv__io_poll + 784
    frame apache/pulsar#2: 0x0000000100f019e0 node`uv_run + 408
    frame apache/pulsar#3: 0x00000001001825f0 node`node::WorkerThreadsTaskRunner::DelayedTaskScheduler::Run() + 352
    frame apache/pulsar#4: 0x0000000192343c0c libsystem_pthread.dylib`_pthread_start + 136
  thread apache/pulsar#3
    frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame apache/pulsar#1: 0x00000001923440e0 libsystem_pthread.dylib`_pthread_cond_wait + 984
    frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40
    frame apache/pulsar#3: 0x00000001001827d8 node`node::TaskQueue<v8::Task>::BlockingPop() + 60
    frame apache/pulsar#4: 0x000000010018028c node`node::(anonymous namespace)::PlatformWorkerThread(void*) + 384
    frame apache/pulsar#5: 0x0000000192343c0c libsystem_pthread.dylib`_pthread_start + 136
  thread apache/pulsar#4
    frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame apache/pulsar#1: 0x00000001923440e0 libsystem_pthread.dylib`_pthread_cond_wait + 984
    frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40
    frame apache/pulsar#3: 0x00000001001827d8 node`node::TaskQueue<v8::Task>::BlockingPop() + 60
    frame apache/pulsar#4: 0x000000010018028c node`node::(anonymous namespace)::PlatformWorkerThread(void*) + 384
    frame apache/pulsar#5: 0x0000000192343c0c libsystem_pthread.dylib`_pthread_start + 136
  thread apache/pulsar#5
    frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame apache/pulsar#1: 0x00000001923440e0 libsystem_pthread.dylib`_pthread_cond_wait + 984
    frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40
    frame apache/pulsar#3: 0x00000001001827d8 node`node::TaskQueue<v8::Task>::BlockingPop() + 60
    frame apache/pulsar#4: 0x000000010018028c node`node::(anonymous namespace)::PlatformWorkerThread(void*) + 384
    frame apache/pulsar#5: 0x0000000192343c0c libsystem_pthread.dylib`_pthread_start + 136
  thread apache/pulsar#6
    frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame apache/pulsar#1: 0x00000001923440e0 libsystem_pthread.dylib`_pthread_cond_wait + 984
    frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40
    frame apache/pulsar#3: 0x00000001001827d8 node`node::TaskQueue<v8::Task>::BlockingPop() + 60
    frame apache/pulsar#4: 0x000000010018028c node`node::(anonymous namespace)::PlatformWorkerThread(void*) + 384
    frame apache/pulsar#5: 0x0000000192343c0c libsystem_pthread.dylib`_pthread_start + 136
  thread apache/pulsar#7
    frame #0: 0x0000000192301bb0 libsystem_kernel.dylib`semaphore_wait_trap + 8
    frame apache/pulsar#1: 0x0000000100f0fed0 node`uv_sem_wait + 24
    frame apache/pulsar#2: 0x00000001002508bc node`node::inspector::(anonymous namespace)::StartIoThreadMain(void*) + 56
    frame apache/pulsar#3: 0x0000000192343c0c libsystem_pthread.dylib`_pthread_start + 136
  thread apache/pulsar#8
    frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame apache/pulsar#1: 0x00000001923440e0 libsystem_pthread.dylib`_pthread_cond_wait + 984
    frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40
    frame apache/pulsar#3: 0x0000000100efdbe4 node`worker + 392
    frame apache/pulsar#4: 0x0000000192343c0c libsystem_pthread.dylib`_pthread_start + 136
  thread apache/pulsar#9
    frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame apache/pulsar#1: 0x00000001923440e0 libsystem_pthread.dylib`_pthread_cond_wait + 984
    frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40
    frame apache/pulsar#3: 0x0000000100efdbe4 node`worker + 392
    frame apache/pulsar#4: 0x0000000192343c0c libsystem_pthread.dylib`_pthread_start + 136
  thread apache/pulsar#10
    frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame apache/pulsar#1: 0x00000001923440e0 libsystem_pthread.dylib`_pthread_cond_wait + 984
    frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40
    frame apache/pulsar#3: 0x0000000100efdbe4 node`worker + 392
    frame apache/pulsar#4: 0x0000000192343c0c libsystem_pthread.dylib`_pthread_start + 136
  thread apache/pulsar#11
    frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame apache/pulsar#1: 0x00000001923440e0 libsystem_pthread.dylib`_pthread_cond_wait + 984
    frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40
    frame apache/pulsar#3: 0x0000000100efdbe4 node`worker + 392
    frame apache/pulsar#4: 0x0000000192343c0c libsystem_pthread.dylib`_pthread_start + 136
  thread apache/pulsar#12
    frame #0: 0x00000001923038b0 libsystem_kernel.dylib`__workq_kernreturn + 8

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions