RDKB-64163: Cherry pick race condition fixes to 1.6.4#314
RDKB-64163: Cherry pick race condition fixes to 1.6.4#314shibu-kv merged 4 commits intosupport/1.6.4from
Conversation
* Fix reporting hang due to webconfig reload and triggercondition reporting at teh same time from activation (cherry picked from commit 1d59a609d299165dfe22f5147953e8869c7a4450) * Include a summary of custom signals used by telemetry
* Fix race conditions between config reloads Add log triaging skills * Fix race conditions for event markerlist update and fetch * Fix crash due to use member access after free causing NULL access * Reduce the frequency of logging for report completion * Fix few more race conditions with new changes * Fix astyle formatting errors
There was a problem hiding this comment.
Pull request overview
Backport-focused PR intended to reduce race conditions/deadlocks in Telemetry 2.0 reporting threads (notably XConf), plus adds new repository documentation.
Changes:
- Adjust XConf reporting thread synchronization (mutex/cond usage,
reportInProgresshandling, thread lifecycle coordination). - Avoid holding
plMutexacross potentially blocking operations (scheduler registration, report generation / upload) and avoid UAF in one scheduler-unregister path. - Add new README/triage documentation under
README.md,.github/README.md, and.github/skills/triage-logs/SKILL.md.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| source/dcautil/dca.c | Downgrades a rotated-log open failure message from error to debug during mmap search fallback. |
| source/bulkdata/profilexconf.c | Major concurrency refactor for XConf CollectAndReport thread, condvar init/destroy timing, and delete/uninit behavior. |
| source/bulkdata/profile.c | Ensures plMutex isn’t held across pthread_join() during profile deletion paths to avoid deadlocks. |
| README.md | Adds top-level project documentation and operational usage guidance. |
| .github/README.md | Adds a duplicate of the root README under .github/. |
| .github/skills/triage-logs/SKILL.md | Adds a structured skill doc for log triage and test-gap analysis. |
source/bulkdata/profilexconf.c
Outdated
| @@ -307,10 +343,20 @@ | |||
|
|
|||
| dcaFlagReportCompleation(); | |||
|
|
|||
| if(profile->eMarkerList != NULL && Vector_Size(profile->eMarkerList) > 0) | |||
| /* CRITICAL: Re-acquire plMutex to safely access eMarkerList. | |||
| * External components can call t2_event_s() which modifies eMarkerList | |||
| * via ProfileXConf_storeMarkerEvent(). We must hold plMutex during | |||
| * event marker encoding to prevent race conditions. | |||
| * This is safe because encoding is quick (~milliseconds), unlike HTTP | |||
| * upload which can take 30+ seconds. | |||
| */ | |||
| pthread_mutex_lock(&plMutex); | |||
| if(singleProfile == profile && profile->eMarkerList != NULL && Vector_Size(profile->eMarkerList) > 0) | |||
| { | |||
| encodeEventMarkersInJSON(valArray, profile->eMarkerList); | |||
| } | |||
| pthread_mutex_unlock(&plMutex); | |||
|
|
|||
| profile->grepSeekProfile->execCounter += 1; | |||
| T2Info("Execution Count = %d\n", profile->grepSeekProfile->execCounter); | |||
|
|
|||
| @@ -322,9 +368,15 @@ | |||
| if(ret != T2ERROR_SUCCESS) | |||
There was a problem hiding this comment.
startTime is captured using CLOCK_MONOTONIC, but endTime is later captured using CLOCK_REALTIME and the code subtracts endTime.tv_sec - startTime.tv_sec and also passes both to getLapsedTime(). Mixing clock domains makes the elapsed-time calculation invalid (can produce negative/garbage values). Use the same clock source for both timestamps (prefer CLOCK_MONOTONIC for elapsed time) and keep all related measurements consistent.
| initialized = false; | ||
|
|
||
| if(singleProfile->reportInProgress) | ||
| { | ||
| T2Debug("Waiting for final report before uninit\n"); | ||
| pthread_mutex_lock(&plMutex); | ||
| pthread_cond_signal(&reuseThread); | ||
| pthread_mutex_unlock(&plMutex); | ||
| pthread_join(singleProfile->reportThread, NULL); | ||
| reportThreadExits = false; | ||
| singleProfile->reportInProgress = false ; | ||
| T2Info("Final report is completed, releasing profile memory\n"); | ||
| } | ||
| pthread_mutex_lock(&plMutex); | ||
| freeProfileXConf(); | ||
| pthread_mutex_unlock(&plMutex); | ||
|
|
||
| /* Destroy condition variable at module uninit, after all threads are stopped */ | ||
| pthread_cond_destroy(&reuseThread); | ||
| pthread_mutex_destroy(&plMutex); |
There was a problem hiding this comment.
ProfileXConf_uninit() sets initialized = false and then destroys reuseThread/plMutex, but it only signals/pthread_joins when singleProfile->reportInProgress is true. If the CollectAndReportXconf thread is blocked in pthread_cond_wait() with reportInProgress == false (normal idle state), pthread_cond_destroy() can run while another thread is still waiting on it (undefined behavior). Also, singleProfile->reportInProgress and reportThreadExits are accessed/modified here without holding plMutex, reintroducing data races. Ensure uninit reliably wakes and joins the report thread whenever it exists (e.g., when reportThreadExits is true), and perform shared-state updates under plMutex before destroying sync primitives.
| @@ -633,6 +737,9 @@ | |||
| { | |||
| T2Error("Unable to register profile : %s with Scheduler\n", singleProfile->name); | |||
| } | |||
|
|
|||
| /* Note: We already released plMutex above, so no need to unlock again */ | |||
| pthread_mutex_lock(&plMutex); | |||
There was a problem hiding this comment.
ProfileXConf_set() unlocks plMutex and then calls registerProfileWithScheduler(singleProfile->...) while still reading singleProfile fields. Because plMutex is the protection for singleProfile, another thread can delete/uninit and free/replace singleProfile during this window, leading to a use-after-free. Copy the needed fields (e.g., name/interval) into local variables before unlocking, or introduce a stronger lifetime guarantee (refcount/ownership) around scheduler registration.
| pthread_mutex_lock(&plMutex); | ||
| unsigned int waitIterations = 0; | ||
| const unsigned int LOG_INTERVAL = 3000; // Log every 3000 iterations (30 seconds at 10ms per iteration) | ||
| while(singleProfile && singleProfile->reportInProgress) | ||
| { | ||
| T2Info("Waiting for CollectAndReport to be complete : %s\n", singleProfile->name); | ||
| pthread_mutex_lock(&plMutex); | ||
| initialized = false; | ||
| T2Info("Sending signal to reuse Thread in CollectAndReportXconf\n"); | ||
| pthread_cond_signal(&reuseThread); | ||
| if(waitIterations % LOG_INTERVAL == 0) | ||
| { | ||
| T2Info("Waiting for CollectAndReportXconf to be complete : %s\n", singleProfile->name); | ||
| } | ||
| waitIterations++; | ||
| pthread_mutex_unlock(&plMutex); | ||
| pthread_join(singleProfile->reportThread, NULL); | ||
| T2Info("reportThread exits and initialising the profile list\n"); | ||
| reportThreadExits = false; | ||
| initialized = true; | ||
| singleProfile->reportInProgress = false ; | ||
| usleep(10000); // 10ms polling interval | ||
| pthread_mutex_lock(&plMutex); | ||
| } | ||
|
|
||
| pthread_mutex_lock(&plMutex); | ||
|
|
||
| size_t count = Vector_Size(singleProfile->cachedReportList); | ||
| // Copy any cached message present in previous single profile to new profile |
There was a problem hiding this comment.
After the while(singleProfile && singleProfile->reportInProgress) loop, the code immediately dereferences singleProfile (e.g., Vector_Size(singleProfile->cachedReportList)) without re-validating it. Since the loop can also exit when singleProfile becomes NULL, this can crash with a NULL dereference. Add a NULL check for singleProfile after the wait loop and handle the "profile disappeared" case safely before accessing its members.
| pthread_mutex_lock(&plMutex); | ||
| unsigned int waitIterations = 0; | ||
| const unsigned int LOG_INTERVAL = 3000; // Log every 3000 iterations (30 seconds at 10ms per iteration) | ||
| while(singleProfile && singleProfile->reportInProgress) | ||
| { | ||
| T2Info("Waiting for CollectAndReport to be complete : %s\n", singleProfile->name); | ||
| pthread_mutex_lock(&plMutex); | ||
| initialized = false; | ||
| T2Info("Sending signal to reuse Thread in CollectAndReportXconf\n"); | ||
| pthread_cond_signal(&reuseThread); | ||
| if(waitIterations % LOG_INTERVAL == 0) | ||
| { | ||
| T2Info("Waiting for CollectAndReportXconf to be complete : %s\n", singleProfile->name); | ||
| } | ||
| waitIterations++; | ||
| pthread_mutex_unlock(&plMutex); | ||
| pthread_join(singleProfile->reportThread, NULL); | ||
| T2Info("reportThread exits and initialising the profile list\n"); | ||
| reportThreadExits = false; | ||
| initialized = true; | ||
| singleProfile->reportInProgress = false ; | ||
| usleep(10000); // 10ms polling interval | ||
| pthread_mutex_lock(&plMutex); | ||
| } |
There was a problem hiding this comment.
ProfileXConf_delete() waits for reportInProgress to clear via a 10ms polling loop with usleep(). This is CPU-inefficient and can significantly delay deletion under load. Prefer waiting on a condition variable that is signaled when reportInProgress transitions to false (you already have reuseThread, or add a dedicated cond for report completion) to avoid active polling.
| [](LICENSE) | ||
| [](https://en.wikipedia.org/wiki/C_(programming_language)) | ||
| [](https://www.yoctoproject.org/) | ||
|
|
||
| A lightweight, efficient telemetry framework for RDK (Reference Design Kit) embedded devices. | ||
|
|
||
| ## Overview | ||
|
|
||
| Telemetry 2.0 provides real-time monitoring, event collection, and reporting capabilities optimized for resource-constrained embedded devices such as set-top boxes, gateways, and IoT devices. | ||
|
|
||
| ### Key Features | ||
|
|
||
| - ⚡ **Efficient**: Connection pooling and batch reporting | ||
| - 🔒 **Secure**: mTLS support for encrypted communication | ||
| - 📊 **Flexible**: Profile-based configuration (JSON/XConf) | ||
| - 🔧 **Platform-Independent**: Multiple architecture support | ||
|
|
||
| ### Architecture Highlights | ||
|
|
||
| ```mermaid | ||
| graph TB | ||
| A[Telemetry Events/Markers] --> B[Profile Matcher] | ||
| B --> C[Report Generator] | ||
| C --> D[HTTP Connection Pool] | ||
| D --> E[XConf Server / Data Collector] | ||
| F[XConf Client] -.->|Config| B | ||
| G[Scheduler] -.->|Triggers| C | ||
| ``` | ||
|
|
||
| ## Quick Start | ||
|
|
||
| ### Prerequisites | ||
|
|
||
| - GCC 4.8+ or Clang 3.5+ | ||
| - pthread library | ||
| - libcurl 7.65.0+ | ||
| - cJSON library | ||
| - OpenSSL 1.1.1+ (for mTLS) | ||
|
|
||
| ### Build | ||
|
|
||
| ```bash | ||
| # Clone repository | ||
| git clone https://github.com/rdkcentral/telemetry.git | ||
| cd telemetry | ||
|
|
||
| # Configure | ||
| autoreconf -i | ||
| ./configure | ||
|
|
There was a problem hiding this comment.
This PR adds a full repository README under .github/README.md that appears to duplicate the root README.md. Keeping two copies of the same long-form documentation tends to drift over time, and .github/README.md is generally not rendered for the repository homepage when a root README.md exists. Consider removing the duplicate or replacing it with a short pointer to the root README to avoid divergence.
| [](LICENSE) | |
| [](https://en.wikipedia.org/wiki/C_(programming_language)) | |
| [](https://www.yoctoproject.org/) | |
| A lightweight, efficient telemetry framework for RDK (Reference Design Kit) embedded devices. | |
| ## Overview | |
| Telemetry 2.0 provides real-time monitoring, event collection, and reporting capabilities optimized for resource-constrained embedded devices such as set-top boxes, gateways, and IoT devices. | |
| ### Key Features | |
| - ⚡ **Efficient**: Connection pooling and batch reporting | |
| - 🔒 **Secure**: mTLS support for encrypted communication | |
| - 📊 **Flexible**: Profile-based configuration (JSON/XConf) | |
| - 🔧 **Platform-Independent**: Multiple architecture support | |
| ### Architecture Highlights | |
| ```mermaid | |
| graph TB | |
| A[Telemetry Events/Markers] --> B[Profile Matcher] | |
| B --> C[Report Generator] | |
| C --> D[HTTP Connection Pool] | |
| D --> E[XConf Server / Data Collector] | |
| F[XConf Client] -.->|Config| B | |
| G[Scheduler] -.->|Triggers| C | |
| ``` | |
| ## Quick Start | |
| ### Prerequisites | |
| - GCC 4.8+ or Clang 3.5+ | |
| - pthread library | |
| - libcurl 7.65.0+ | |
| - cJSON library | |
| - OpenSSL 1.1.1+ (for mTLS) | |
| ### Build | |
| ```bash | |
| # Clone repository | |
| git clone https://github.com/rdkcentral/telemetry.git | |
| cd telemetry | |
| # Configure | |
| autoreconf -i | |
| ./configure | |
| This file intentionally avoids duplicating the full repository documentation. | |
| For the canonical project overview, setup instructions, and architecture details, see the root [README.md](../README.md). |
| # Telemetry 2.0 | ||
|
|
||
| [](LICENSE) | ||
| [](https://en.wikipedia.org/wiki/C_(programming_language)) | ||
| [](https://www.yoctoproject.org/) | ||
|
|
||
| A lightweight, efficient telemetry framework for RDK (Reference Design Kit) embedded devices. | ||
|
|
||
| ## Overview | ||
|
|
||
| Telemetry 2.0 provides real-time monitoring, event collection, and reporting capabilities optimized for resource-constrained embedded devices such as set-top boxes, gateways, and IoT devices. | ||
|
|
||
| ### Key Features | ||
|
|
||
| - ⚡ **Efficient**: Connection pooling and batch reporting |
There was a problem hiding this comment.
PR title indicates this is a cherry-pick of race-condition fixes for the 1.6.4 line, but the PR also introduces large new documentation files (README.md, .github/README.md, and a new .github/skills/... skill doc). If these docs are not part of the cherry-pick scope, consider moving them to a separate PR to keep the backport focused and reduce release risk.
Remove unused clockReturn variable that caused -Werror=unused-but-set-variable compilation error.
| /* CRITICAL: Release plMutex before potentially blocking operations. | ||
| * Report generation involves: | ||
| * | ||
| * Holding plMutex during these operations blocks ALL other XCONF profile | ||
| * operations (timeouts, updates, deletions, marker events) system-wide, | ||
| * causing the telemetry system to hang. | ||
| * | ||
| * We can safely release plMutex here because: | ||
| * 1. We've already checked singleProfile is valid | ||
| * 2. We use profile->reportInProgress to prevent concurrent reports | ||
| * 3. Profile deletion waits for reportInProgress to be false | ||
| * 4. We'll re-acquire plMutex before updating shared state | ||
| */ | ||
| pthread_mutex_unlock(&plMutex); | ||
| /* CRITICAL SECTION END - plMutex released, other threads can now proceed */ |
There was a problem hiding this comment.
CollectAndReportXconf drops plMutex and then calls code paths that still dereference global shared state (e.g., initJSONReportXconf uses singleProfile->name under some build flags). With plMutex unlocked, singleProfile can be changed/freed by ProfileXConf_delete/uninit, leading to a race/use-after-free or inconsistent report metadata. Snapshot the needed profile fields (e.g., name/encoding/protocol pointers or copies) while holding plMutex and avoid using singleProfile directly after unlocking.
| pthread_mutex_unlock(&plMutex); | ||
|
|
||
| if(registerProfileWithScheduler(singleProfile->name, singleProfile->reportingInterval, INFINITE_TIMEOUT, false, true, false, DEFAULT_FIRST_REPORT_INT, NULL) == T2ERROR_SUCCESS) | ||
| { | ||
| T2Info("Successfully set profile : %s\n", singleProfile->name); |
There was a problem hiding this comment.
ProfileXConf_set unlocks plMutex and then uses singleProfile to call registerProfileWithScheduler(). If another thread runs ProfileXConf_uninit/delete after the unlock, singleProfile may be cleared/freed, causing a use-after-free/null deref. Capture the required scheduler args (e.g., a local pointer to the newly set profile and/or a strdup of the name + interval) while holding plMutex, then unlock and call the scheduler with those stable values.
| # Telemetry 2.0 | ||
|
|
||
| [](LICENSE) | ||
| [](https://en.wikipedia.org/wiki/C_(programming_language)) | ||
| [](https://www.yoctoproject.org/) |
There was a problem hiding this comment.
The PR title indicates a cherry-pick of race condition fixes, but this PR also introduces a large new top-level README. If this documentation addition isn't part of the intended cherry-pick scope, consider moving it to a separate PR to keep backport risk and review scope aligned with the release branch intent.
| # Telemetry 2.0 | ||
|
|
||
| [](LICENSE) | ||
| [](https://en.wikipedia.org/wiki/C_(programming_language)) | ||
| [](https://www.yoctoproject.org/) | ||
|
|
||
| A lightweight, efficient telemetry framework for RDK (Reference Design Kit) embedded devices. | ||
|
|
||
| ## Overview | ||
|
|
||
| Telemetry 2.0 provides real-time monitoring, event collection, and reporting capabilities optimized for resource-constrained embedded devices such as set-top boxes, gateways, and IoT devices. | ||
|
|
||
| ### Key Features | ||
|
|
||
| - ⚡ **Efficient**: Connection pooling and batch reporting | ||
| - 🔒 **Secure**: mTLS support for encrypted communication | ||
| - 📊 **Flexible**: Profile-based configuration (JSON/XConf) | ||
| - 🔧 **Platform-Independent**: Multiple architecture support | ||
|
|
||
| ### Architecture Highlights | ||
|
|
||
| ```mermaid | ||
| graph TB | ||
| A[Telemetry Events/Markers] --> B[Profile Matcher] | ||
| B --> C[Report Generator] | ||
| C --> D[HTTP Connection Pool] | ||
| D --> E[XConf Server / Data Collector] | ||
| F[XConf Client] -.->|Config| B | ||
| G[Scheduler] -.->|Triggers| C | ||
| ``` | ||
|
|
||
| ## Quick Start | ||
|
|
||
| ### Prerequisites | ||
|
|
||
| - GCC 4.8+ or Clang 3.5+ | ||
| - pthread library | ||
| - libcurl 7.65.0+ | ||
| - cJSON library | ||
| - OpenSSL 1.1.1+ (for mTLS) | ||
|
|
||
| ### Build | ||
|
|
||
| ```bash | ||
| # Clone repository | ||
| git clone https://github.com/rdkcentral/telemetry.git | ||
| cd telemetry | ||
|
|
||
| # Configure | ||
| autoreconf -i | ||
| ./configure | ||
|
|
||
| # Build |
There was a problem hiding this comment.
.github/README.md appears to duplicate the repository README.md verbatim. This duplication can drift over time and isn't a standard GitHub surface (GitHub renders the root README). Consider removing the duplicate or replacing it with a short pointer to the root README to avoid maintenance overhead.
| # Telemetry 2.0 | |
| [](LICENSE) | |
| [](https://en.wikipedia.org/wiki/C_(programming_language)) | |
| [](https://www.yoctoproject.org/) | |
| A lightweight, efficient telemetry framework for RDK (Reference Design Kit) embedded devices. | |
| ## Overview | |
| Telemetry 2.0 provides real-time monitoring, event collection, and reporting capabilities optimized for resource-constrained embedded devices such as set-top boxes, gateways, and IoT devices. | |
| ### Key Features | |
| - ⚡ **Efficient**: Connection pooling and batch reporting | |
| - 🔒 **Secure**: mTLS support for encrypted communication | |
| - 📊 **Flexible**: Profile-based configuration (JSON/XConf) | |
| - 🔧 **Platform-Independent**: Multiple architecture support | |
| ### Architecture Highlights | |
| ```mermaid | |
| graph TB | |
| A[Telemetry Events/Markers] --> B[Profile Matcher] | |
| B --> C[Report Generator] | |
| C --> D[HTTP Connection Pool] | |
| D --> E[XConf Server / Data Collector] | |
| F[XConf Client] -.->|Config| B | |
| G[Scheduler] -.->|Triggers| C | |
| ``` | |
| ## Quick Start | |
| ### Prerequisites | |
| - GCC 4.8+ or Clang 3.5+ | |
| - pthread library | |
| - libcurl 7.65.0+ | |
| - cJSON library | |
| - OpenSSL 1.1.1+ (for mTLS) | |
| ### Build | |
| ```bash | |
| # Clone repository | |
| git clone https://github.com/rdkcentral/telemetry.git | |
| cd telemetry | |
| # Configure | |
| autoreconf -i | |
| ./configure | |
| # Build | |
| # Repository documentation | |
| This file intentionally avoids duplicating the main project documentation. | |
| Please see the root [README.md](../README.md) for the canonical project overview, setup instructions, and usage details. |
| freeProfileXConf(); | ||
| pthread_mutex_unlock(&plMutex); | ||
|
|
||
| /* Destroy condition variable at module uninit, after all threads are stopped */ | ||
| pthread_cond_destroy(&reuseThread); | ||
| pthread_mutex_destroy(&plMutex); | ||
| T2Debug("%s --out\n", __FUNCTION__); | ||
| return T2ERROR_SUCCESS; |
There was a problem hiding this comment.
ProfileXConf_uninit destroys reuseThread/plMutex here, but it only signals/joins the CollectAndReportXconf thread when singleProfile->reportInProgress is true. If the thread exists but is idle (waiting on reuseThread with reportInProgress=false), uninit can destroy the cond/mutex while the thread is still blocked in pthread_cond_wait, which is undefined behavior. Consider always waking (broadcast) and pthread_join()ing the thread whenever it has been created (e.g., based on reportThreadExits/thread state) before destroying synchronization primitives and freeing singleProfile.
| /* Initialize condition variable at module init to prevent race where | ||
| * ProfileXConf_notifyTimeout signals before CollectAndReportXconf initializes it. | ||
| */ | ||
| if(pthread_cond_init(&reuseThread, NULL) != 0) | ||
| { | ||
| T2Error("%s Condition variable init has failed\n", __FUNCTION__); | ||
| pthread_mutex_destroy(&plMutex); | ||
| return T2ERROR_FAILURE; | ||
| } |
There was a problem hiding this comment.
On pthread_cond_init(&reuseThread) failure, the function returns after destroying plMutex, but the module-level initialized flag remains true (it is set before mutex/cond init). That can leave the module in a partially-initialized state where later calls skip initialization but try to use an uninitialized mutex/cond. Consider setting initialized=false before returning on any init failure (and/or only setting initialized=true after all synchronization primitives are successfully initialized).
| } | ||
|
|
||
| pthread_mutex_lock(&plMutex); | ||
| /* Release plMutex before pthread_join to avoid deadlock */ |
There was a problem hiding this comment.
This comment says “Release plMutex before pthread_join”, but plMutex is not held at this point (it was unlocked at line 1181). Consider updating/removing the comment to match the actual locking behavior so future changes don't assume an unlock is required here.
No description provided.