Skip to content

RDKB-64163: [Field-Reported][XLE-activation] T2 reports are not reaching XDP | telemetry process is hung and recovers with selfheal#313

Open
tabbas651 wants to merge 1 commit intosupport/1.6.4from
topic/cherrypick-RDKB-64163
Open

RDKB-64163: [Field-Reported][XLE-activation] T2 reports are not reaching XDP | telemetry process is hung and recovers with selfheal#313
tabbas651 wants to merge 1 commit intosupport/1.6.4from
topic/cherrypick-RDKB-64163

Conversation

@tabbas651
Copy link
Copy Markdown
Contributor

RDKB-64163: Fix for reporting hang with heavy parallel operations (#307)

  • Fix reporting hang due to webconfig reload and triggercondition reporting at teh same time from activation

(cherry picked from commit 1d59a609d299165dfe22f5147953e8869c7a4450)

  • Include a summary of custom signals used by telemetry

(cherry picked from commit 221189a)

RDKB-64163: Fix race conditions with config reload (#312)

  • Fix race conditions between config reloads

Add log triaging skills

  • Fix race conditions for event markerlist update and fetch

  • Fix crash due to use member access after free causing NULL access

  • Reduce the frequency of logging for report completion

  • Fix few more race conditions with new changes

  • Fix astyle formatting errors

(cherry picked from commit a232134)

* Fix reporting hang due to webconfig reload and triggercondition reporting at teh same time from activation

(cherry picked from commit 1d59a609d299165dfe22f5147953e8869c7a4450)

* Include a summary of custom signals used by telemetry

(cherry picked from commit 221189a)

RDKB-64163: Fix race conditions with config reload (#312)

* Fix race conditions between config reloads

Add log triaging skills

* Fix race conditions for event markerlist update and fetch

* Fix crash due to use member access after free causing NULL access

* Reduce the frequency of logging for report completion

* Fix few more race conditions with new changes

* Fix astyle formatting errors

(cherry picked from commit a232134)
Copilot AI review requested due to automatic review settings April 3, 2026 15:49
@tabbas651 tabbas651 requested a review from a team as a code owner April 3, 2026 15:49
@tabbas651 tabbas651 changed the title RDKB-64163: [Field-Reported][XLE-activation] [Field-Reported][XLE-activation] T2 reports are not reaching XDP | telemetry process is hung and recovers with selfheal RDKB-64163: [Field-Reported][XLE-activation] T2 reports are not reaching XDP | telemetry process is hung and recovers with selfheal Apr 3, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to address field-reported Telemetry 2.0 reporting hangs during heavy parallel operations (notably XConf reload + report generation), by reducing lock contention and adjusting thread/cleanup coordination, along with adding/expanding repository documentation for troubleshooting and onboarding.

Changes:

  • Refactors XConf report generation thread coordination/locking in profilexconf.c to reduce time holding plMutex and to better coordinate deletion vs. in-flight reporting.
  • Adjusts profile deletion paths in profile.c to avoid holding plMutex across potentially blocking pthread_join calls.
  • Adds new docs (root/.github README and a log-triage “skill” document) and tweaks DCA rotated-log open logging level.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
source/dcautil/dca.c Lowers severity of a rotated-log open failure message to reduce log noise.
source/bulkdata/profilexconf.c Major concurrency refactor around XConf report thread lifecycle, mutex usage, and deletion coordination.
source/bulkdata/profile.c Avoids holding plMutex while joining profile report threads to reduce deadlock risk.
README.md Adds a comprehensive top-level project README (build/run/troubleshooting).
.github/skills/triage-logs/SKILL.md Adds a structured log-triage guide for investigating telemetry behavioral issues.
.github/README.md Adds a GitHub-facing README mirroring the project overview and usage docs.

T2Debug("Top markers report generated but is empty possabliy the memory value is changed");
}
Vector_Destroy(topMarkerResultList, freeGResult);
processTopPattern(profile->name, profile->topMarkerList, count);
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

processTopPattern is declared/defined with 4 parameters (profileName, topMarkerList, out_grepResultList, execCount) (see source/dcautil/dca.h:34 / source/dcautil/dca.c:168), but it’s being called here with only 3 args. This will not compile. Update the call to match the existing API (or update the declaration/definition + all call sites consistently).

Suggested change
processTopPattern(profile->name, profile->topMarkerList, count);
processTopPattern(profile->name, profile->topMarkerList, profile->gMarkerList, count);

Copilot uses AI. Check for mistakes.
Vector_Destroy(topMarkerResultList, freeGResult);
processTopPattern(profile->name, profile->topMarkerList, count);
T2Info("Top markers report is completed\n");
encodeTopResultInJSON(valArray, profile->topMarkerList);
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encodeTopResultInJSON is used here but there is no definition or declaration anywhere in the repo (only this call site shows up in search). This will cause a build/link error. Either add the implementation + header declaration, or keep using the existing encodeGrepResultInJSON flow (with a topMarkerResultList output vector) like source/bulkdata/profile.c does.

Suggested change
encodeTopResultInJSON(valArray, profile->topMarkerList);
encodeGrepResultInJSON(valArray, profile->topMarkerList);

Copilot uses AI. Check for mistakes.
Comment on lines 323 to 328
if(profile->gMarkerList != NULL && Vector_Size(profile->gMarkerList) > 0)
{
getGrepResults(&(profile->grepSeekProfile), profile->gMarkerList, &grepResultList, profile->bClearSeekMap, checkRotated, customLogPath); // Passing 5th argument as true to check rotated logs only in case of single profile
getGrepResults(&(profile->grepSeekProfile), profile->gMarkerList, profile->bClearSeekMap, checkRotated, customLogPath); // Passing 4th argument as true to check rotated logs only in case of single profile
T2Info("Grep complete for %lu markers \n", (unsigned long)Vector_Size(profile->gMarkerList));
encodeGrepResultInJSON(valArray, grepResultList);
Vector_Destroy(grepResultList, freeGResult);
encodeGrepResultInJSON(valArray, profile->gMarkerList);
}
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getGrepResults currently requires an out-parameter for grepResultList (T2ERROR getGrepResults(..., Vector **grepResultList, ...) in source/dcautil/dcautil.h:54 / source/dcautil/dcautil.c:44), but this call no longer passes it. This is a compile-time API mismatch; either restore the grepResultList argument (and the corresponding free) or change the function signature and update all call sites + header accordingly.

Copilot uses AI. Check for mistakes.
Comment on lines +325 to 328
getGrepResults(&(profile->grepSeekProfile), profile->gMarkerList, profile->bClearSeekMap, checkRotated, customLogPath); // Passing 4th argument as true to check rotated logs only in case of single profile
T2Info("Grep complete for %lu markers \n", (unsigned long)Vector_Size(profile->gMarkerList));
encodeGrepResultInJSON(valArray, grepResultList);
Vector_Destroy(grepResultList, freeGResult);
encodeGrepResultInJSON(valArray, profile->gMarkerList);
}
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encodeGrepResultInJSON expects a vector of GrepResult entries (see source/reportgen/reportgen.c:392), but this call now passes profile->gMarkerList (a marker-definition list). Even if compilation is fixed, this would encode the wrong type and can lead to invalid reads/crashes. Keep encoding the grepResultList returned by getGrepResults, as is done in source/bulkdata/profile.c:472-474.

Copilot uses AI. Check for mistakes.
{
T2Debug("Abort upload is not yet set.\n");
ret = sendReportOverHTTP(profile->t2HTTPDest->URL, jsonReport, &xconfReportPid);
ret = sendReportOverHTTP(profile->t2HTTPDest->URL, jsonReport);
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sendReportOverHTTP is declared as sendReportOverHTTP(char *httpUrl, char* payload, pid_t* outForkedPid) (source/protocol/http/curlinterface.h:42), but it’s being called with only 2 arguments here. This will not compile. Either pass the third argument (as before) or provide a new wrapper / update the function signature and all other call sites (source/bulkdata/profile.c also calls it with 3 args).

Suggested change
ret = sendReportOverHTTP(profile->t2HTTPDest->URL, jsonReport);
ret = sendReportOverHTTP(profile->t2HTTPDest->URL, jsonReport, NULL);

Copilot uses AI. Check for mistakes.
Comment on lines 686 to 697
initialized = false;

if(singleProfile->reportInProgress)
{
T2Debug("Waiting for final report before uninit\n");
pthread_mutex_lock(&plMutex);
pthread_cond_signal(&reuseThread);
pthread_mutex_unlock(&plMutex);
pthread_join(singleProfile->reportThread, NULL);
reportThreadExits = false;
singleProfile->reportInProgress = false ;
T2Info("Final report is completed, releasing profile memory\n");
}
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ProfileXConf_uninit sets initialized = false but only signals/joins the report thread when singleProfile->reportInProgress is true. If the report thread exists and is currently blocked in pthread_cond_wait (common case when idle), it will never be woken to observe initialized == false, and freeProfileXConf() / pthread_cond_destroy(&reuseThread) can run while the thread is still waiting -> hang or use-after-free. Signal the condition and pthread_join the report thread whenever it was created (e.g., when reportThreadExits is true / thread handle is valid), not only when reportInProgress is true.

Copilot uses AI. Check for mistakes.
Comment on lines +729 to 737
/* Release plMutex before calling scheduler API to avoid potential
* blocking while holding the mutex. Scheduler operations may involve
* timer management and other operations that shouldn't block profile access.
*/
pthread_mutex_unlock(&plMutex);

if(registerProfileWithScheduler(singleProfile->name, singleProfile->reportingInterval, INFINITE_TIMEOUT, false, true, false, DEFAULT_FIRST_REPORT_INT, NULL) == T2ERROR_SUCCESS)
{
T2Info("Successfully set profile : %s\n", singleProfile->name);
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plMutex is intentionally released before calling registerProfileWithScheduler, but the code still dereferences singleProfile after the unlock (singleProfile->name, singleProfile->reportingInterval). That can race with a concurrent delete/uninit path and becomes a potential use-after-free. Copy the needed fields (e.g., name/reportingInterval) to locals while holding plMutex, then unlock and call the scheduler using those locals.

Copilot uses AI. Check for mistakes.
Comment on lines +1136 to +1141
#ifdef GTEST_ENABLE
void test_set_reportThreadExits(bool value)
{

T2ERROR ret = T2ERROR_FAILURE;

if(!singleProfile)
{
T2Error("Xconf profile is not set.\n");
return ret;
}

// Check whether any XconfReport is in progress
if(singleProfile->reportInProgress)
{
isAbortTriggered = true;
// Check if a child pid is still alive
if((xconfReportPid > 0) && !kill(xconfReportPid, 0))
{
T2Info("Report upload in progress, terminating the forked reporting child : %d \n", xconfReportPid);
if(!kill(xconfReportPid, SIGKILL))
{
ret = T2ERROR_SUCCESS;
}
}
else
{
T2Info(" Report upload has net yet started, set the abort flag \n");
ret = T2ERROR_SUCCESS;
}
}
else
{
T2Info("No report generation in progress. No further action required for abort.\n");
}

return ret;

reportThreadExits = value;
}

#endif
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ProfileXConf_terminateReport is still declared in profilexconf.h and called from production code (source/bulkdata/reportprofiles.c:437), but its implementation was removed here. This will cause an undefined reference at link time and also breaks existing unit tests that call it. Either restore/keep the implementation (possibly updated for the new send/abort approach) or remove the declaration and update all callers to the new mechanism.

Copilot uses AI. Check for mistakes.

```bash
# Publish DCM reload event
rbuscli publish Device.X_RDKCENTREL-COM.Reloadconfig
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RBUS event name uses X_RDKCENTREL here, but the codebase consistently uses X_RDKCENTRAL (e.g., source/ccspinterface/rbusInterface.c:1415). This looks like a typo and would mislead users during troubleshooting. Update the docs to Device.X_RDKCENTRAL-COM.Reloadconfig (or to the exact value used by the platform, if different).

Suggested change
rbuscli publish Device.X_RDKCENTREL-COM.Reloadconfig
rbuscli publish Device.X_RDKCENTRAL-COM.Reloadconfig

Copilot uses AI. Check for mistakes.

```bash
# Publish DCM reload event
rbuscli publish Device.X_RDKCENTREL-COM.Reloadconfig
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RBUS event name uses X_RDKCENTREL here, but the codebase consistently uses X_RDKCENTRAL (e.g., source/ccspinterface/rbusInterface.c:1415). This looks like a typo and would mislead users during troubleshooting. Update the docs to Device.X_RDKCENTRAL-COM.Reloadconfig (or to the exact value used by the platform, if different).

Suggested change
rbuscli publish Device.X_RDKCENTREL-COM.Reloadconfig
rbuscli publish Device.X_RDKCENTRAL-COM.Reloadconfig

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants