RDKB-64163: [Field-Reported][XLE-activation] T2 reports are not reaching XDP | telemetry process is hung and recovers with selfheal#313
Conversation
* Fix reporting hang due to webconfig reload and triggercondition reporting at teh same time from activation (cherry picked from commit 1d59a609d299165dfe22f5147953e8869c7a4450) * Include a summary of custom signals used by telemetry (cherry picked from commit 221189a) RDKB-64163: Fix race conditions with config reload (#312) * Fix race conditions between config reloads Add log triaging skills * Fix race conditions for event markerlist update and fetch * Fix crash due to use member access after free causing NULL access * Reduce the frequency of logging for report completion * Fix few more race conditions with new changes * Fix astyle formatting errors (cherry picked from commit a232134)
There was a problem hiding this comment.
Pull request overview
This PR aims to address field-reported Telemetry 2.0 reporting hangs during heavy parallel operations (notably XConf reload + report generation), by reducing lock contention and adjusting thread/cleanup coordination, along with adding/expanding repository documentation for troubleshooting and onboarding.
Changes:
- Refactors XConf report generation thread coordination/locking in
profilexconf.cto reduce time holdingplMutexand to better coordinate deletion vs. in-flight reporting. - Adjusts profile deletion paths in
profile.cto avoid holdingplMutexacross potentially blockingpthread_joincalls. - Adds new docs (root/
.githubREADME and a log-triage “skill” document) and tweaks DCA rotated-log open logging level.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
source/dcautil/dca.c |
Lowers severity of a rotated-log open failure message to reduce log noise. |
source/bulkdata/profilexconf.c |
Major concurrency refactor around XConf report thread lifecycle, mutex usage, and deletion coordination. |
source/bulkdata/profile.c |
Avoids holding plMutex while joining profile report threads to reduce deadlock risk. |
README.md |
Adds a comprehensive top-level project README (build/run/troubleshooting). |
.github/skills/triage-logs/SKILL.md |
Adds a structured log-triage guide for investigating telemetry behavioral issues. |
.github/README.md |
Adds a GitHub-facing README mirroring the project overview and usage docs. |
| T2Debug("Top markers report generated but is empty possabliy the memory value is changed"); | ||
| } | ||
| Vector_Destroy(topMarkerResultList, freeGResult); | ||
| processTopPattern(profile->name, profile->topMarkerList, count); |
There was a problem hiding this comment.
processTopPattern is declared/defined with 4 parameters (profileName, topMarkerList, out_grepResultList, execCount) (see source/dcautil/dca.h:34 / source/dcautil/dca.c:168), but it’s being called here with only 3 args. This will not compile. Update the call to match the existing API (or update the declaration/definition + all call sites consistently).
| processTopPattern(profile->name, profile->topMarkerList, count); | |
| processTopPattern(profile->name, profile->topMarkerList, profile->gMarkerList, count); |
| Vector_Destroy(topMarkerResultList, freeGResult); | ||
| processTopPattern(profile->name, profile->topMarkerList, count); | ||
| T2Info("Top markers report is completed\n"); | ||
| encodeTopResultInJSON(valArray, profile->topMarkerList); |
There was a problem hiding this comment.
encodeTopResultInJSON is used here but there is no definition or declaration anywhere in the repo (only this call site shows up in search). This will cause a build/link error. Either add the implementation + header declaration, or keep using the existing encodeGrepResultInJSON flow (with a topMarkerResultList output vector) like source/bulkdata/profile.c does.
| encodeTopResultInJSON(valArray, profile->topMarkerList); | |
| encodeGrepResultInJSON(valArray, profile->topMarkerList); |
| if(profile->gMarkerList != NULL && Vector_Size(profile->gMarkerList) > 0) | ||
| { | ||
| getGrepResults(&(profile->grepSeekProfile), profile->gMarkerList, &grepResultList, profile->bClearSeekMap, checkRotated, customLogPath); // Passing 5th argument as true to check rotated logs only in case of single profile | ||
| getGrepResults(&(profile->grepSeekProfile), profile->gMarkerList, profile->bClearSeekMap, checkRotated, customLogPath); // Passing 4th argument as true to check rotated logs only in case of single profile | ||
| T2Info("Grep complete for %lu markers \n", (unsigned long)Vector_Size(profile->gMarkerList)); | ||
| encodeGrepResultInJSON(valArray, grepResultList); | ||
| Vector_Destroy(grepResultList, freeGResult); | ||
| encodeGrepResultInJSON(valArray, profile->gMarkerList); | ||
| } |
There was a problem hiding this comment.
getGrepResults currently requires an out-parameter for grepResultList (T2ERROR getGrepResults(..., Vector **grepResultList, ...) in source/dcautil/dcautil.h:54 / source/dcautil/dcautil.c:44), but this call no longer passes it. This is a compile-time API mismatch; either restore the grepResultList argument (and the corresponding free) or change the function signature and update all call sites + header accordingly.
| getGrepResults(&(profile->grepSeekProfile), profile->gMarkerList, profile->bClearSeekMap, checkRotated, customLogPath); // Passing 4th argument as true to check rotated logs only in case of single profile | ||
| T2Info("Grep complete for %lu markers \n", (unsigned long)Vector_Size(profile->gMarkerList)); | ||
| encodeGrepResultInJSON(valArray, grepResultList); | ||
| Vector_Destroy(grepResultList, freeGResult); | ||
| encodeGrepResultInJSON(valArray, profile->gMarkerList); | ||
| } |
There was a problem hiding this comment.
encodeGrepResultInJSON expects a vector of GrepResult entries (see source/reportgen/reportgen.c:392), but this call now passes profile->gMarkerList (a marker-definition list). Even if compilation is fixed, this would encode the wrong type and can lead to invalid reads/crashes. Keep encoding the grepResultList returned by getGrepResults, as is done in source/bulkdata/profile.c:472-474.
| { | ||
| T2Debug("Abort upload is not yet set.\n"); | ||
| ret = sendReportOverHTTP(profile->t2HTTPDest->URL, jsonReport, &xconfReportPid); | ||
| ret = sendReportOverHTTP(profile->t2HTTPDest->URL, jsonReport); |
There was a problem hiding this comment.
sendReportOverHTTP is declared as sendReportOverHTTP(char *httpUrl, char* payload, pid_t* outForkedPid) (source/protocol/http/curlinterface.h:42), but it’s being called with only 2 arguments here. This will not compile. Either pass the third argument (as before) or provide a new wrapper / update the function signature and all other call sites (source/bulkdata/profile.c also calls it with 3 args).
| ret = sendReportOverHTTP(profile->t2HTTPDest->URL, jsonReport); | |
| ret = sendReportOverHTTP(profile->t2HTTPDest->URL, jsonReport, NULL); |
| initialized = false; | ||
|
|
||
| if(singleProfile->reportInProgress) | ||
| { | ||
| T2Debug("Waiting for final report before uninit\n"); | ||
| pthread_mutex_lock(&plMutex); | ||
| pthread_cond_signal(&reuseThread); | ||
| pthread_mutex_unlock(&plMutex); | ||
| pthread_join(singleProfile->reportThread, NULL); | ||
| reportThreadExits = false; | ||
| singleProfile->reportInProgress = false ; | ||
| T2Info("Final report is completed, releasing profile memory\n"); | ||
| } |
There was a problem hiding this comment.
ProfileXConf_uninit sets initialized = false but only signals/joins the report thread when singleProfile->reportInProgress is true. If the report thread exists and is currently blocked in pthread_cond_wait (common case when idle), it will never be woken to observe initialized == false, and freeProfileXConf() / pthread_cond_destroy(&reuseThread) can run while the thread is still waiting -> hang or use-after-free. Signal the condition and pthread_join the report thread whenever it was created (e.g., when reportThreadExits is true / thread handle is valid), not only when reportInProgress is true.
| /* Release plMutex before calling scheduler API to avoid potential | ||
| * blocking while holding the mutex. Scheduler operations may involve | ||
| * timer management and other operations that shouldn't block profile access. | ||
| */ | ||
| pthread_mutex_unlock(&plMutex); | ||
|
|
||
| if(registerProfileWithScheduler(singleProfile->name, singleProfile->reportingInterval, INFINITE_TIMEOUT, false, true, false, DEFAULT_FIRST_REPORT_INT, NULL) == T2ERROR_SUCCESS) | ||
| { | ||
| T2Info("Successfully set profile : %s\n", singleProfile->name); |
There was a problem hiding this comment.
plMutex is intentionally released before calling registerProfileWithScheduler, but the code still dereferences singleProfile after the unlock (singleProfile->name, singleProfile->reportingInterval). That can race with a concurrent delete/uninit path and becomes a potential use-after-free. Copy the needed fields (e.g., name/reportingInterval) to locals while holding plMutex, then unlock and call the scheduler using those locals.
| #ifdef GTEST_ENABLE | ||
| void test_set_reportThreadExits(bool value) | ||
| { | ||
|
|
||
| T2ERROR ret = T2ERROR_FAILURE; | ||
|
|
||
| if(!singleProfile) | ||
| { | ||
| T2Error("Xconf profile is not set.\n"); | ||
| return ret; | ||
| } | ||
|
|
||
| // Check whether any XconfReport is in progress | ||
| if(singleProfile->reportInProgress) | ||
| { | ||
| isAbortTriggered = true; | ||
| // Check if a child pid is still alive | ||
| if((xconfReportPid > 0) && !kill(xconfReportPid, 0)) | ||
| { | ||
| T2Info("Report upload in progress, terminating the forked reporting child : %d \n", xconfReportPid); | ||
| if(!kill(xconfReportPid, SIGKILL)) | ||
| { | ||
| ret = T2ERROR_SUCCESS; | ||
| } | ||
| } | ||
| else | ||
| { | ||
| T2Info(" Report upload has net yet started, set the abort flag \n"); | ||
| ret = T2ERROR_SUCCESS; | ||
| } | ||
| } | ||
| else | ||
| { | ||
| T2Info("No report generation in progress. No further action required for abort.\n"); | ||
| } | ||
|
|
||
| return ret; | ||
|
|
||
| reportThreadExits = value; | ||
| } | ||
|
|
||
| #endif |
There was a problem hiding this comment.
ProfileXConf_terminateReport is still declared in profilexconf.h and called from production code (source/bulkdata/reportprofiles.c:437), but its implementation was removed here. This will cause an undefined reference at link time and also breaks existing unit tests that call it. Either restore/keep the implementation (possibly updated for the new send/abort approach) or remove the declaration and update all callers to the new mechanism.
|
|
||
| ```bash | ||
| # Publish DCM reload event | ||
| rbuscli publish Device.X_RDKCENTREL-COM.Reloadconfig |
There was a problem hiding this comment.
The RBUS event name uses X_RDKCENTREL here, but the codebase consistently uses X_RDKCENTRAL (e.g., source/ccspinterface/rbusInterface.c:1415). This looks like a typo and would mislead users during troubleshooting. Update the docs to Device.X_RDKCENTRAL-COM.Reloadconfig (or to the exact value used by the platform, if different).
| rbuscli publish Device.X_RDKCENTREL-COM.Reloadconfig | |
| rbuscli publish Device.X_RDKCENTRAL-COM.Reloadconfig |
|
|
||
| ```bash | ||
| # Publish DCM reload event | ||
| rbuscli publish Device.X_RDKCENTREL-COM.Reloadconfig |
There was a problem hiding this comment.
The RBUS event name uses X_RDKCENTREL here, but the codebase consistently uses X_RDKCENTRAL (e.g., source/ccspinterface/rbusInterface.c:1415). This looks like a typo and would mislead users during troubleshooting. Update the docs to Device.X_RDKCENTRAL-COM.Reloadconfig (or to the exact value used by the platform, if different).
| rbuscli publish Device.X_RDKCENTREL-COM.Reloadconfig | |
| rbuscli publish Device.X_RDKCENTRAL-COM.Reloadconfig |
RDKB-64163: Fix for reporting hang with heavy parallel operations (#307)
(cherry picked from commit 1d59a609d299165dfe22f5147953e8869c7a4450)
(cherry picked from commit 221189a)
RDKB-64163: Fix race conditions with config reload (#312)
Add log triaging skills
Fix race conditions for event markerlist update and fetch
Fix crash due to use member access after free causing NULL access
Reduce the frequency of logging for report completion
Fix few more race conditions with new changes
Fix astyle formatting errors
(cherry picked from commit a232134)