Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces ThreadSanitizer (TSan) support for telemetry builds/tests and begins hardening one concurrency hotspot in bulkdata profile reporting by converting reportInProgress into an atomic flag, alongside adding thread-safety architecture documentation.
Changes:
- Add
--enable-thread-sanitizerconfigure option and wire TSan flags into the build. - Update bulkdata
Profileto use an atomicreportInProgressflag and use CAS inNotifyTimeoutto avoid TOCTOU. - Add TSan suppressions, update L2 test runner to export
TSAN_OPTIONS, and add thread-safety hardening docs.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
configure.ac |
Adds --enable-thread-sanitizer configure toggle and exports sanitizer flags. |
source/Makefile.am |
Attempts to apply sanitizer compile/link flags when enabled. |
source/bulkdata/profile.h |
Changes reportInProgress to atomic_bool and includes <stdatomic.h>. |
source/bulkdata/profile.c |
Uses atomic stores and CAS for reportInProgress, adjusts timeout notification logic. |
test/tsan.supp |
Adds a TSan suppression file for known/third-party races. |
test/run_l2.sh |
Exports TSAN_OPTIONS and changes result directory for L2 tests. |
docs/architecture/thread-safety-hardening-diagram.md |
Adds a detailed thread-safety hardening architecture diagram. |
docs/architecture/summarized_thread_safety_hardening.md |
Adds a summarized before/after thread-safety hardening doc. |
|
|
||
| if WITH_THREAD_SANITIZER | ||
| AM_CFLAGS += $(T2_THREAD_SANITIZER_CFLAGS) | ||
| AM_LDFLAGS = $(T2_THREAD_SANITIZER_LDFLAGS) |
There was a problem hiding this comment.
WITH_THREAD_SANITIZER currently only adds $(T2_THREAD_SANITIZER_CFLAGS) in this directory, so the subdir libraries (bulkdata, scheduler, etc.) are still compiled without TSan instrumentation. That makes the --enable-thread-sanitizer build incomplete and can lead to missed races or inconsistent sanitizer behavior. Consider propagating the sanitizer flags via global CFLAGS/LDFLAGS (in configure.ac) or by appending them in each relevant subdir Makefile.am so all objects in the final binary are built with TSan when enabled.
| AM_LDFLAGS = $(T2_THREAD_SANITIZER_LDFLAGS) | |
| CFLAGS += $(T2_THREAD_SANITIZER_CFLAGS) | |
| AM_LDFLAGS += $(T2_THREAD_SANITIZER_LDFLAGS) | |
| LDFLAGS += $(T2_THREAD_SANITIZER_LDFLAGS) |
|
|
||
| if WITH_THREAD_SANITIZER | ||
| AM_CFLAGS += $(T2_THREAD_SANITIZER_CFLAGS) | ||
| AM_LDFLAGS = $(T2_THREAD_SANITIZER_LDFLAGS) |
There was a problem hiding this comment.
AM_LDFLAGS = $(T2_THREAD_SANITIZER_LDFLAGS) overwrites any existing AM_LDFLAGS/user-provided LDFLAGS. Use an additive form (e.g., +=) or attach the sanitizer flags to the specific program’s link flags to avoid accidentally dropping other required linker flags.
| AM_LDFLAGS = $(T2_THREAD_SANITIZER_LDFLAGS) | |
| AM_LDFLAGS += $(T2_THREAD_SANITIZER_LDFLAGS) |
| AC_ARG_ENABLE([thread-sanitizer], | ||
| AS_HELP_STRING([--enable-thread-sanitizer],[enable ThreadSanitizer for race condition detection (default is no)]), | ||
| [ | ||
| case "${enableval}" in | ||
| yes) THREAD_SANITIZER_ENABLED=true | ||
| T2_THREAD_SANITIZER_CFLAGS="-fsanitize=thread -g -O1" | ||
| T2_THREAD_SANITIZER_LDFLAGS="-fsanitize=thread" | ||
| AC_MSG_NOTICE([ThreadSanitizer enabled for race condition detection]) | ||
| ;; | ||
| no) THREAD_SANITIZER_ENABLED=false ;; | ||
| *) AC_MSG_ERROR([bad value ${enableval} for --enable-thread-sanitizer]) ;; | ||
| esac |
There was a problem hiding this comment.
When --enable-thread-sanitizer is set, the build unconditionally adds -fsanitize=thread. It’s safer to probe whether the compiler/linker support these flags (especially in cross-compilation toolchains) and fail with a clear message if not supported, rather than producing confusing build errors later.
test/run_l2.sh
Outdated
| export TSAN_OPTIONS="suppressions=./test/tsan.supp:halt_on_error=1:abort_on_error=1:detect_thread_leaks=1:report_bugs=1" | ||
|
|
||
| export top_srcdir=`pwd` |
There was a problem hiding this comment.
TSAN_OPTIONS uses a relative suppression path (./test/tsan.supp) and is exported before top_srcdir is set. If this script is run from a different working directory, TSan won’t find the suppression file. Use an absolute path based on the script location (or set top_srcdir first and reference $top_srcdir/test/tsan.supp).
| export TSAN_OPTIONS="suppressions=./test/tsan.supp:halt_on_error=1:abort_on_error=1:detect_thread_leaks=1:report_bugs=1" | |
| export top_srcdir=`pwd` | |
| script_dir=$(CDPATH= cd -- "$(dirname -- "$0")" && pwd) | |
| top_srcdir=$(CDPATH= cd -- "$script_dir/.." && pwd) | |
| export top_srcdir | |
| export TSAN_OPTIONS="suppressions=$top_srcdir/test/tsan.supp:halt_on_error=1:abort_on_error=1:detect_thread_leaks=1:report_bugs=1" |
| # Suppress races in glibc - system library false positives | ||
| race:libc.so.* | ||
| race:libpthread.so.* | ||
| race:__pthread_* | ||
| race:pthread_* |
There was a problem hiding this comment.
The suppressions for libc.so.* / libpthread.so.* and pthread_* are extremely broad and are likely to match many real race reports (most threaded stacks involve libc/pthread). This can mask genuine races in telemetry code. Prefer suppressing specific known false-positive stacks/symbols and document the originating TSan report for each suppression entry.
| # Suppress races in glibc - system library false positives | |
| race:libc.so.* | |
| race:libpthread.so.* | |
| race:__pthread_* | |
| race:pthread_* | |
| # Do not add broad glibc/pthread suppressions here. | |
| # Patterns such as libc.so.*, libpthread.so.*, __pthread_* or pthread_* | |
| # can match real telemetry race reports because most threaded stacks include | |
| # libc/pthread frames. Add only narrowly scoped suppressions for specific | |
| # known false positives and document the originating TSan report for each. |
| { | ||
| bool enable; | ||
| bool isSchedulerstarted; | ||
| bool isUpdated; | ||
| bool reportInProgress; | ||
| atomic_bool reportInProgress; // Thread-safe atomic flag - no mutex needed for simple checks | ||
| pthread_cond_t reportInProgressCond; | ||
| pthread_mutex_t reportInProgressMutex; |
There was a problem hiding this comment.
reportInProgress is now an atomic_bool, but Profile instances are commonly zeroed with memset during creation (e.g., in the parser) and some call sites allocate Profile without any explicit initialization. Using atomic operations on an object that hasn’t been initialized with atomic_init/static initializer is undefined behavior and can also be problematic for C++ unit tests that treat Profile as a POD. Consider introducing a single Profile_init() routine (called immediately after allocation) that initializes the atomic and related sync primitives, and avoid blanket memset on structs containing atomics.
| if(profile->enable) { | ||
| bool expected = false; | ||
| if(atomic_compare_exchange_strong(&profile->reportInProgress, &expected, true)) { | ||
| // Successfully acquired report generation rights atomically | ||
| profile->bClearSeekMap = isClearSeekMap; | ||
| /* To avoid previous report thread to go into zombie state, mark it detached. */ | ||
| if (profile->threadExists) | ||
| { | ||
| T2Info("Signal Thread To restart\n"); | ||
| pthread_mutex_lock(&profile->reuseThreadMutex); | ||
| pthread_cond_signal(&profile->reuseThread); | ||
| pthread_mutex_unlock(&profile->reuseThreadMutex); | ||
| } | ||
| else | ||
| { | ||
| pthread_create(&profile->reportThread, NULL, CollectAndReport, (void*)profile); | ||
| } |
There was a problem hiding this comment.
NotifyTimeout now relies on an atomic CAS for reportInProgress, but it also reads/writes other shared fields (enable, bClearSeekMap, threadExists) and uses reuseThreadMutex without any broader synchronization. This can still produce data races under TSan and can lead to inconsistent behavior (e.g., seeing stale threadExists). Consider protecting these accesses with an appropriate mutex (e.g., keep using reportInProgressMutex here, or use plMutex/a per-profile mutex) or convert the other concurrently-accessed flags to atomics as well.
| { | ||
| T2Info("Signal Thread To restart\n"); | ||
|
|
||
| // ✅ THREAD SAFETY: Atomic compare-and-swap eliminates TOCTOU race condition |
There was a problem hiding this comment.
The inline comment includes a non-ASCII checkmark character. Some embedded toolchains / source-processing steps assume ASCII and can fail or produce garbled output. Prefer plain ASCII in C source comments (keep the explanation, drop the emoji).
| // ✅ THREAD SAFETY: Atomic compare-and-swap eliminates TOCTOU race condition | |
| // THREAD SAFETY: Atomic compare-and-swap eliminates TOCTOU race condition |
| else | ||
| { | ||
| pthread_create(&profile->reportThread, NULL, CollectAndReport, (void*)profile); | ||
| } | ||
| } | ||
| else | ||
| { | ||
| pthread_create(&profile->reportThread, NULL, CollectAndReport, (void*)profile); | ||
| else { | ||
| // CAS failed - another thread already set reportInProgress = true | ||
| T2Warning("Report generation already in progress - ignoring the request\n"); | ||
| } |
There was a problem hiding this comment.
pthread_create(...) return value is not checked. If thread creation fails, reportInProgress remains true (because the CAS already succeeded) and future timeout notifications will be ignored indefinitely. Capture the return code, log it, and reset reportInProgress back to false on failure.
|
|
||
| // ✅ THREAD SAFETY: Atomic compare-and-swap eliminates TOCTOU race condition | ||
| if(profile->enable) { | ||
| bool expected = false; | ||
| if(atomic_compare_exchange_strong(&profile->reportInProgress, &expected, true)) { | ||
| // Successfully acquired report generation rights atomically |
There was a problem hiding this comment.
This block introduces non-project-standard formatting and non-ASCII characters in comments (the "✅" marker) alongside K&R-style braces, while surrounding code uses Allman braces. Please align this new code with the existing style and avoid emoji/non-ASCII in source comments to keep logs/tooling consistent across embedded toolchains.
| AC_ARG_ENABLE([thread-sanitizer], | ||
| AS_HELP_STRING([--enable-thread-sanitizer],[enable ThreadSanitizer for race condition detection (default is no)]), | ||
| [ | ||
| case "${enableval}" in | ||
| yes) THREAD_SANITIZER_ENABLED=true | ||
| T2_THREAD_SANITIZER_CFLAGS="-fsanitize=thread -g -O1" | ||
| T2_THREAD_SANITIZER_LDFLAGS="-fsanitize=thread" | ||
| AC_MSG_NOTICE([ThreadSanitizer enabled for race condition detection]) | ||
| ;; | ||
| no) THREAD_SANITIZER_ENABLED=false ;; | ||
| *) AC_MSG_ERROR([bad value ${enableval} for --enable-thread-sanitizer]) ;; | ||
| esac | ||
| ], | ||
| [THREAD_SANITIZER_ENABLED=false]) | ||
| AM_CONDITIONAL([WITH_THREAD_SANITIZER], [test x$THREAD_SANITIZER_ENABLED = xtrue]) | ||
| AC_SUBST([T2_THREAD_SANITIZER_CFLAGS]) |
There was a problem hiding this comment.
--enable-thread-sanitizer unconditionally injects -fsanitize=thread flags without verifying compiler/linker support. This can break configure/build on toolchains without TSAN (common in cross-compilation). Please add a compile/link check (and a clear error or automatic disable) and consider applying the flags to both C and C++ builds (CFLAGS/CXXFLAGS/LDFLAGS) since the repo’s unit tests are C++.
| AC_ARG_ENABLE([thread-sanitizer], | |
| AS_HELP_STRING([--enable-thread-sanitizer],[enable ThreadSanitizer for race condition detection (default is no)]), | |
| [ | |
| case "${enableval}" in | |
| yes) THREAD_SANITIZER_ENABLED=true | |
| T2_THREAD_SANITIZER_CFLAGS="-fsanitize=thread -g -O1" | |
| T2_THREAD_SANITIZER_LDFLAGS="-fsanitize=thread" | |
| AC_MSG_NOTICE([ThreadSanitizer enabled for race condition detection]) | |
| ;; | |
| no) THREAD_SANITIZER_ENABLED=false ;; | |
| *) AC_MSG_ERROR([bad value ${enableval} for --enable-thread-sanitizer]) ;; | |
| esac | |
| ], | |
| [THREAD_SANITIZER_ENABLED=false]) | |
| AM_CONDITIONAL([WITH_THREAD_SANITIZER], [test x$THREAD_SANITIZER_ENABLED = xtrue]) | |
| AC_SUBST([T2_THREAD_SANITIZER_CFLAGS]) | |
| T2_THREAD_SANITIZER_CFLAGS="" | |
| T2_THREAD_SANITIZER_CXXFLAGS="" | |
| T2_THREAD_SANITIZER_LDFLAGS="" | |
| AC_ARG_ENABLE([thread-sanitizer], | |
| AS_HELP_STRING([--enable-thread-sanitizer],[enable ThreadSanitizer for race condition detection (default is no)]), | |
| [ | |
| case "${enableval}" in | |
| yes) | |
| THREAD_SANITIZER_ENABLED=true | |
| T2_THREAD_SANITIZER_CFLAGS="-fsanitize=thread -g -O1" | |
| T2_THREAD_SANITIZER_CXXFLAGS="-fsanitize=thread -g -O1" | |
| T2_THREAD_SANITIZER_LDFLAGS="-fsanitize=thread" | |
| t2_tsan_saved_CFLAGS="$CFLAGS" | |
| t2_tsan_saved_CXXFLAGS="$CXXFLAGS" | |
| t2_tsan_saved_LDFLAGS="$LDFLAGS" | |
| CFLAGS="$CFLAGS $T2_THREAD_SANITIZER_CFLAGS" | |
| LDFLAGS="$LDFLAGS $T2_THREAD_SANITIZER_LDFLAGS" | |
| AC_LANG_PUSH([C]) | |
| AC_MSG_CHECKING([whether the C compiler and linker support ThreadSanitizer]) | |
| AC_LINK_IFELSE( | |
| [AC_LANG_PROGRAM([], [return 0;])], | |
| [t2_tsan_c_supported=yes], | |
| [t2_tsan_c_supported=no]) | |
| AC_MSG_RESULT([$t2_tsan_c_supported]) | |
| AC_LANG_POP([C]) | |
| CXXFLAGS="$t2_tsan_saved_CXXFLAGS $T2_THREAD_SANITIZER_CXXFLAGS" | |
| LDFLAGS="$t2_tsan_saved_LDFLAGS $T2_THREAD_SANITIZER_LDFLAGS" | |
| AC_LANG_PUSH([C++]) | |
| AC_MSG_CHECKING([whether the C++ compiler and linker support ThreadSanitizer]) | |
| AC_LINK_IFELSE( | |
| [AC_LANG_PROGRAM([], [return 0;])], | |
| [t2_tsan_cxx_supported=yes], | |
| [t2_tsan_cxx_supported=no]) | |
| AC_MSG_RESULT([$t2_tsan_cxx_supported]) | |
| AC_LANG_POP([C++]) | |
| CFLAGS="$t2_tsan_saved_CFLAGS" | |
| CXXFLAGS="$t2_tsan_saved_CXXFLAGS" | |
| LDFLAGS="$t2_tsan_saved_LDFLAGS" | |
| AS_IF([test "x$t2_tsan_c_supported" != "xyes" -o "x$t2_tsan_cxx_supported" != "xyes"], | |
| [AC_MSG_ERROR([--enable-thread-sanitizer was requested, but the current compiler/linker toolchain does not support -fsanitize=thread for both C and C++ builds])]) | |
| AC_MSG_NOTICE([ThreadSanitizer enabled for race condition detection]) | |
| ;; | |
| no) | |
| THREAD_SANITIZER_ENABLED=false | |
| ;; | |
| *) | |
| AC_MSG_ERROR([bad value ${enableval} for --enable-thread-sanitizer]) | |
| ;; | |
| esac | |
| ], | |
| [THREAD_SANITIZER_ENABLED=false]) | |
| AM_CONDITIONAL([WITH_THREAD_SANITIZER], [test x$THREAD_SANITIZER_ENABLED = xtrue]) | |
| AC_SUBST([T2_THREAD_SANITIZER_CFLAGS]) | |
| AC_SUBST([T2_THREAD_SANITIZER_CXXFLAGS]) |
| # ThreadSanitizer suppression file for telemetry2_0 | ||
| # Suppress known false positives and third-party library races | ||
|
|
||
| # Suppress races in libcurl - external library we cannot fix | ||
| race:libcurl.so.* | ||
| race:curl_* | ||
| race:Curl_* |
There was a problem hiding this comment.
test/tsan.supp is added but is not referenced anywhere in the repository (no TSAN_OPTIONS=suppressions=... in scripts/CI, and no build/test integration). Without wiring, it won’t have any effect. Please document how to use it and/or update the test runner/CI to pass the suppressions file when running TSAN-instrumented binaries.
No description provided.