diff --git a/.github/README.md b/.github/README.md new file mode 100644 index 00000000..eb0cfcae --- /dev/null +++ b/.github/README.md @@ -0,0 +1,397 @@ +# Telemetry 2.0 + +[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE) +[![C](https://img.shields.io/badge/Language-C-blue.svg)](https://en.wikipedia.org/wiki/C_(programming_language)) +[![Platform](https://img.shields.io/badge/Platform-Embedded%20Linux-orange.svg)](https://www.yoctoproject.org/) + +A lightweight, efficient telemetry framework for RDK (Reference Design Kit) embedded devices. + +## Overview + +Telemetry 2.0 provides real-time monitoring, event collection, and reporting capabilities optimized for resource-constrained embedded devices such as set-top boxes, gateways, and IoT devices. + +### Key Features + +- ⚡ **Efficient**: Connection pooling and batch reporting +- 🔒 **Secure**: mTLS support for encrypted communication +- 📊 **Flexible**: Profile-based configuration (JSON/XConf) +- 🔧 **Platform-Independent**: Multiple architecture support + +### Architecture Highlights + +```mermaid +graph TB + A[Telemetry Events/Markers] --> B[Profile Matcher] + B --> C[Report Generator] + C --> D[HTTP Connection Pool] + D --> E[XConf Server / Data Collector] + F[XConf Client] -.->|Config| B + G[Scheduler] -.->|Triggers| C +``` + +## Quick Start + +### Prerequisites + +- GCC 4.8+ or Clang 3.5+ +- pthread library +- libcurl 7.65.0+ +- cJSON library +- OpenSSL 1.1.1+ (for mTLS) + +### Build + +```bash +# Clone repository +git clone https://github.com/rdkcentral/telemetry.git +cd telemetry + +# Configure +autoreconf -i +./configure + +# Build +make + +# Install +sudo make install +``` + +### Docker Development + +Refer to the provided Docker container for a consistent development environment: + +https://github.com/rdkcentral/docker-device-mgt-service-test + + +See [Build Setup Guide](docs/integration/build-setup.md) for detailed build options. 
+ +### Basic Usage + +```c +#include "telemetry2_0.h" + +int main(void) { + // Initialize telemetry + if (t2_init() != 0) { + fprintf(stderr, "Failed to initialize telemetry\n"); + return -1; + } + + // Send a marker event + t2_event_s("SYS_INFO_DeviceBootup", "Device started successfully"); + + // Cleanup + t2_uninit(); + return 0; +} +``` + +Compile: `gcc -o myapp myapp.c -ltelemetry` + +## Documentation + +📚 **[Complete Documentation](docs/README.md)** + +### Key Documents + +- **[Architecture Overview](docs/architecture/overview.md)** - System design and components +- **[API Reference](docs/api/public-api.md)** - Public API documentation +- **[Developer Guide](docs/integration/developer-guide.md)** - Getting started +- **[Build Setup](docs/integration/build-setup.md)** - Build configuration +- **[Testing Guide](docs/integration/testing.md)** - Test procedures + +### Component Documentation + +Individual component documentation is in [`source/docs/`](source/docs/): + +- [Bulk Data System](source/docs/bulkdata/README.md) - Profile and marker management +- [HTTP Protocol](source/docs/protocol/README.md) - Communication layer +- [Scheduler](source/docs/scheduler/README.md) - Report scheduling +- [XConf Client](source/docs/xconf-client/README.md) - Configuration retrieval + +## Project Structure + +``` +telemetry/ +├── source/ # Source code +│ ├── bulkdata/ # Profile and marker management +│ ├── protocol/ # HTTP/RBUS communication +│ ├── scheduler/ # Report scheduling +│ ├── xconf-client/ # Configuration retrieval +│ ├── dcautil/ # Log marker utilities +│ └── test/ # Unit tests (gtest/gmock) +├── include/ # Public headers +├── config/ # Configuration files +├── docs/ # Documentation +├── containers/ # Docker development environment +└── test/ # Functional tests +``` + +## Configuration + +### Profile Configuration + +Telemetry uses JSON profiles to define what data to collect: + +```json +{ + "Profile": "RDKB_BasicProfile", + "Version": "1.0.0", + "Protocol": 
"HTTP", + "EncodingType": "JSON", + "ReportingInterval": 300, + "Parameters": [ + { + "type": "dataModel", + "name": "Device.DeviceInfo.Manufacturer" + }, + { + "type": "event", + "eventName": "bootup_complete" + } + ] +} +``` + +See [Profile Configuration Guide](docs/integration/profile-configuration.md) for details. + +### Environment Variables + +| Variable | Description | Default | +|----------|-------------|---------| +| `T2_ENABLE_DEBUG` | Enable debug logging | `0` | +| `T2_PROFILE_PATH` | Default profile directory | `/etc/DefaultT2Profile.json` | +| `T2_XCONF_URL` | XConf server URL | - | +| `T2_REPORT_URL` | Report upload URL | - | + +## Runtime Operations + +### Signal Handling + +The Telemetry 2.0 daemon responds to the following signals for runtime control: + +| Signal | Value | Purpose | +|--------|-------|---------| +| **SIGTERM** | 15 | Gracefully terminate the daemon, cleanup resources and exit | +| **SIGINT** | 2 | Interrupt signal - uninitialize services, cleanup and exit | +| **SIGUSR1** | 10 | Trigger log upload with seekmap reset | +| **SIGUSR2** | 12 | Reload configuration from XConf server | +| **LOG_UPLOAD** | 10 | Custom signal to trigger log upload and reset retain seekmap flag | +| **EXEC_RELOAD** | 12 | Custom signal to reload XConf configuration and restart XConf client | +| **LOG_UPLOAD_ONDEMAND** | 29 | Custom signal for on-demand log upload without seekmap reset | +| **SIGIO** | - | I/O signal - repurposed for on-demand log upload | + +**Examples:** + +```bash +# Gracefully stop telemetry +kill -SIGTERM $(pidof telemetry2_0) + +# Trigger log upload +kill -10 $(pidof telemetry2_0) + +# Reload configuration +kill -12 $(pidof telemetry2_0) + +# On-demand log upload +kill -29 $(pidof telemetry2_0) +``` + +**Notes:** +- Custom signal values (10, 12, 29) are defined to avoid conflicts with standard system signals +- Signals SIGUSR1, SIGUSR2, LOG_UPLOAD, EXEC_RELOAD, LOG_UPLOAD_ONDEMAND, and SIGIO are blocked during signal handler execution 
to prevent race conditions +- Child processes ignore signals other than SIGCHLD, SIGPIPE, SIGALRM, and the log upload/reload signals + +### WebConfig/Profile Reload + +Telemetry 2.0 supports multiple mechanisms for dynamically reloading report profiles and configuration: + +#### 1. Signal-Based XConf Reload + +Trigger an XConf configuration reload by sending a signal: + +```bash +# EXEC_RELOAD (signal value 12) +kill -12 $(pidof telemetry2_0) +``` + +This stops the XConf client and restarts it to fetch updated configuration from the XConf server. + +#### 2. RBUS-Based Profile Updates + +For WebConfig integration, profiles can be set directly via RBUS (requires `rbuscli`): + +```bash +# Load a temporary profile (JSON format) +rbuscli setv "Device.X_RDKCENTRAL-COM_T2.Temp_ReportProfiles" string '{"profiles":[...]}' + +# Set persistent profiles +rbuscli setv "Device.X_RDKCENTRAL-COM_T2.ReportProfiles" string '{"profiles":[...]}' + +# Set profiles in MessagePack binary format +rbuscli setv "Device.X_RDKCENTRAL-COM_T2.ReportProfilesMsgPack" bytes + +# Clear all profiles +rbuscli setv "Device.X_RDKCENTRAL-COM_T2.ReportProfiles" string '{"profiles":[]}' +``` + +#### 3. DCM Event-Based Reload + +Publish the DCM reload event via RBUS to trigger a reload (typically done by the WebConfig framework): + +```bash +# Publish DCM reload event +rbuscli publish Device.X_RDKCENTREL-COM.Reloadconfig +``` + +#### 4. 
Using Test Utilities + +The project includes a convenience script for testing profile updates: + +```bash +# Load example profile +./test/set_report_profile.sh example + +# Load DOCSIS reference profile +./test/set_report_profile.sh docsis + +# Clear all profiles +./test/set_report_profile.sh empty + +# Load custom JSON profile +./test/set_report_profile.sh '{"profiles":[...]}' +``` + +**Available RBUS Parameters:** + +- `Device.X_RDKCENTRAL-COM_T2.ReportProfiles` - Persistent report profiles (JSON) +- `Device.X_RDKCENTRAL-COM_T2.ReportProfilesMsgPack` - Persistent profiles (MessagePack binary) +- `Device.X_RDKCENTRAL-COM_T2.Temp_ReportProfiles` - Temporary profiles (JSON) +- `Device.X_RDKCENTRAL-COM_T2.UploadDCMReport` - Trigger on-demand report upload +- `Device.X_RDKCENTRAL-COM_T2.AbortDCMReport` - Abort ongoing report upload + +## Development + +### Running Tests + +```bash +# Unit tests +make check + +# Functional tests +cd test +./run_ut.sh + +# Code coverage +./cov_build.sh +``` + +### Development Container + +Use the provided Docker container for a consistent development environment: + https://github.com/rdkcentral/docker-device-mgt-service-test + +```bash +cd docker-device-mgt-service-test +docker compose up -d +``` + +The directory one level above the current directory is mounted as a volume at `/mnt/L2_CONTAINER_SHARED_VOLUME`. +Log in to the container as follows: +```bash +docker exec -it native-platform /bin/bash +cd /mnt/L2_CONTAINER_SHARED_VOLUME/telemetry +sh test/run_ut.sh +``` + +See [Docker Development Guide](containers/README.md) for more details. + +## Platform Support + +Telemetry 2.0 is designed to be platform-independent and has been tested on: + +- **RDK-B** (Broadband devices) +- **RDK-V** (Video devices) +- **Linux** (x86_64, ARM, ARM64) +- **Yocto Project** builds + +See [Platform Porting Guide](docs/integration/platform-porting.md) for porting to new platforms. + + +## Contributing + +We welcome contributions! 
Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines. + +### Development Workflow + +1. Fork the repository +2. Create a feature branch (`git checkout -b feature/amazing-feature`) +3. Make your changes +4. Add tests for new functionality +5. Ensure all L1 (unit) and L2 (functional) tests pass +6. Commit your changes (`git commit -m 'Add amazing feature'`) +7. Push to the branch (`git push origin feature/amazing-feature`) +8. Open a Pull Request + +### Code Style + +- Follow the existing C code style and make sure the astyle formatting checks pass by running the commands below: +```bash + find . -name '*.c' -o -name '*.h' | xargs astyle --options=.astylerc + find . -name '*.orig' -type f -delete +``` + +- Use descriptive variable names +- Document all public APIs +- Add unit tests for new functions +- Add functional tests for new features + +See [Coding Guidelines](.github/instructions/c-embedded.instructions.md) for details. + +## Troubleshooting + +### Common Issues + +**Q: Telemetry not sending reports** +- Check network connectivity +- Verify XConf URL configuration +- Review logs in `/var/log/telemetry/` + +**Q: High memory usage** + +- Reduce number of active profiles +- Decrease reporting intervals +- Check for memory leaks with valgrind + +**Q: Build errors** + +- Ensure all dependencies are installed +- Check compiler version (GCC 4.8+) +- Review build logs for missing libraries + +See [Troubleshooting Guide](docs/troubleshooting/common-errors.md) for more solutions. + +## License + +This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details. + +## Acknowledgments + +- RDK Management LLC +- RDK Community Contributors +- Open Source Community + +## Contact + +- **Repository**: https://github.com/rdkcentral/telemetry +- **Issues**: https://github.com/rdkcentral/telemetry/issues +- **RDK Central**: https://rdkcentral.com + +## Changelog + +See [CHANGELOG.md](CHANGELOG.md) for version history and release notes. 
+ +--- + +**Built for the RDK Community** diff --git a/.github/skills/triage-logs/SKILL.md b/.github/skills/triage-logs/SKILL.md new file mode 100644 index 00000000..d11d6be4 --- /dev/null +++ b/.github/skills/triage-logs/SKILL.md @@ -0,0 +1,303 @@ +--- +name: triage-logs +description: > + Triage any Telemetry 2.0 behavioral issue on RDK devices by correlating device + log bundles with source code. Covers hangs, under-reporting, over-reporting, + duplicate reports, CPU/memory spikes, scheduler anomalies, rbus problems, and + test gap analysis. The user states the issue; this skill guides systematic + root-cause analysis regardless of issue type. +--- + +# Telemetry 2.0 Issue Triage Skill + +## Purpose + +Systematically correlate device log bundles with Telemetry 2.0 source code to +identify root causes, characterize impact, and propose unit-test and +functional-test reproduction scenarios — for **any** behavioral anomaly reported +by the user. + +--- + +## Usage + +Invoke this skill when: +- A device log bundle is available under `logs/` (or attached separately) +- The user describes a behavioral anomaly (examples: daemon stuck, reports + missing, too many reports sent, reports arriving late, high CPU, high memory, + unexpected profile activation, marker counts wrong) +- You need to write a reproduction scenario for an existing or proposed fix + +**The user's stated issue drives the investigation.** Do not assume a specific +failure mode — read the issue description first, then follow the steps below. 
+ +--- + +## Step 1: Orient to the Log Bundle + +**Log bundle layout** (typical RDK device): +``` +logs///logs/ + telemetry2_0.txt.0 ← Primary T2 daemon log (start here) + GatewayManagerLog.txt.0 ← WAN/gateway state machine + WanManager*.txt.0 ← WAN interface transitions + PAMlog.txt.0 ← Platform/parameter management + SelfHeal*.txt.0 ← Watchdog and recovery events + top_log.txt.0 ← CPU/memory snapshots (useful for perf issues) + messages.txt.0 ← Kernel and system messages +``` + +Include any log files surfaced by the user's issue description (e.g., `cellular*.txt.0` +for connectivity issues, `syslog` for OOM events). + +**Log timestamp prefix format**: `YYMMDD-HH:MM:SS.uuuuuu` +- Session folder names are **local-time snapshots** (format: `MM-DD-YY-HH:MMxM`) +- Log lines inside use device local time — always confirm via `[Time]` field + in telemetry reports (`"Time":"2026-03-06 07:24:23"`) +- Report JSON `"timestamp"` fields are Unix epoch UTC + +**Session ordering**: Sort session folders chronologically. Multiple sessions may +represent reboots. Alphabetical sort does NOT equal chronological order. 
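The session-ordering rule above can be sketched as a small shell helper. This is a minimal illustration, not part of the Telemetry 2.0 tooling: the function names `session_sort_key` and `sort_sessions` are hypothetical, and the code assumes that the `x` in `MM-DD-YY-HH:MMxM` is `A` or `P` on a 12-hour clock (for example `03-06-26-07:24PM`). Adjust the pattern if your bundle uses a different convention.

```bash
# Hypothetical helper: turn a session folder name (assumed shape
# MM-DD-YY-HH:MMAM or MM-DD-YY-HH:MMPM) into a sortable key YYMMDDhhmm
# on a 24-hour clock. The subshell body keeps the variables local.
session_sort_key() (
    name=$1
    # Reject anything that does not have the assumed shape.
    case $name in
        [0-9][0-9]-[0-9][0-9]-[0-9][0-9]-[0-9][0-9]:[0-9][0-9][AP]M) ;;
        *) return 1 ;;
    esac
    mm=${name%%-*};  rest=${name#*-}
    dd=${rest%%-*};  rest=${rest#*-}
    yy=${rest%%-*};  rest=${rest#*-}
    hh=${rest%%:*};  rest=${rest#*:}
    min=${rest%[AP]M}
    ap=${rest#"$min"}
    # Strip a leading zero so "08" and "09" are not read as octal.
    hour=$(( ${hh#0} % 12 ))          # 12AM -> 0, 12PM -> 0 (+12 below)
    if [ "$ap" = "PM" ]; then
        hour=$(( hour + 12 ))
    fi
    printf '%s%s%s%02d%s\n' "$yy" "$mm" "$dd" "$hour" "$min"
)

# Read session folder names on stdin (one per line) and print them in
# chronological order. Names that do not match the assumed shape are skipped.
sort_sessions() {
    while IFS= read -r name; do
        key=$(session_sort_key "$name") || continue
        printf '%s %s\n' "$key" "$name"
    done | sort -k1,1 | cut -d' ' -f2-
}
```

Piping the output of `ls` for the directory that holds the session folders into `sort_sessions` lists the sessions oldest first. A plain alphabetical sort would place `03-06-26-01:00PM` before `03-06-26-11:00AM`, which is the ordering mistake this step warns about.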
+ +--- + +## Step 2: Map Profiles and Threads + +Read the startup section of `telemetry2_0.txt.0` (first ~50 lines) to identify: + +| What to find | Log pattern | +|---|---| +| Profile name | `Profile Name : ` | +| Reporting interval | `Waiting for sec for next TIMEOUT` | +| Timeout thread TID | `TIMEOUT for profile - ` (first occurrence) | +| CollectAndReport TID | `CollectAndReport ++in profileName : ` (first occurrence) | +| Send mechanism | `methodName = Device.X_RDK_Xmidt.SendData` (rbus) or `HTTP_CODE` (curl) | + +**Thread role map** (look for TID in `TimeoutThread` context): +- `TimeoutThread` per profile — fires `TIMEOUT for profile` log lines +- `CollectAndReport` / `CollectAndReportXconf` — one per profile, generates/sends reports +- `asyncMethodHandler` — short-lived rbus handler thread, called when `SendData` is dispatched + +--- + +## Step 3: Identify the Anomaly Window + +Based on the **user's stated issue**, search for the relevant evidence pattern: + +### Hang / Stuck Daemon +A reporting hang manifests as a **timestamp gap** between `CollectAndReport ++in` and +the next report-related log line from the same TID. +``` +grep -n "CollectAndReport" telemetry2_0.txt.0 | head -40 +``` +Gap > 1 reporting interval = anomaly. During the gap, check: +- Is `asyncMethodHandler` ever logged? (no → rbus provider unresponsive) +- Does `TIMEOUT for profile` still fire? (yes → TimeoutThread alive but CollectAndReport stuck) + +### Under-Reporting / Missing Reports +Look for expected `TIMEOUT for profile` events that never trigger a `CollectAndReport`: +``` +grep -n "TIMEOUT for profile\|CollectAndReport ++in\|Return status" telemetry2_0.txt.0 +``` +- Count `TIMEOUT` events vs. 
`CollectAndReport` entries over a time window +- Check for `SendInterruptToTimeoutThread` logged as failed (EBUSY path) — signals silently dropped +- Check for profile deactivation or reload during expected report window + +### Over-Reporting / Duplicate Reports +Look for multiple `CollectAndReport ++in` within a single interval: +``` +grep -n "CollectAndReport ++in\|TIMEOUT for profile" telemetry2_0.txt.0 +``` +- Multiple `TIMEOUT` signals in one interval → concurrent interrupt and scheduler fire +- Report-on-condition (`T2ERROR_SUCCESS` after a marker event) firing alongside periodic report +- Check `signalrecived_and_executing` global flag race (concurrent profile callbacks) + +### CPU / Memory Spikes +Correlate `top_log.txt.0` timestamps with T2 activity: +``` +grep -n "telemetry2" top_log.txt.0 +``` +- Identify what T2 was doing (profile scan, DCA grep, report generation, rbus call) at spike time +- Check DCA log grep operations (`dca.c`, `dcautil.c`) for large log files causing high CPU +- Check marker accumulation in `t2markers.c` for memory growth +- Check if multiple profiles overlap their `CollectAndReport` window + +### Profile / Configuration Anomalies +- Unexpected profile changes: `grep -n "profile\|xconf" telemetry2_0.txt.0 | grep -i "receiv\|updat\|activ"` +- Marker count mismatches: compare report JSON marker values against grep patterns in `dca.c` +- Wrong reporting interval: confirm `Waiting for sec` matches profile definition + +--- + +## Step 4: Correlate with Other Component Logs + +Based on the anomaly window identified in Step 3, cross-reference with other logs: + +| Issue Type | Companion Log | What to Look For | +|---|---|---| +| Hang / rbus block | `GatewayManagerLog.txt.0` | WAN/interface state changes within hang window | +| Hang / rbus block | `WanManager*.txt.0` | Interface up/down transitions | +| Under-reporting | `SelfHeal*.txt.0` | Watchdog restarts of telemetry2_0 process | +| Over-reporting | `PAMlog.txt.0` | Parameter changes 
triggering report-on-condition | +| CPU spike | `top_log.txt.0` | CPU% at anomaly timestamps | +| Memory growth | `messages.txt.0` | OOM killer events, slab usage | +| Profile changes | Any xconf response log | Profile push or xconf poll activity | + +A tight coupling between an external event (state change, parameter update, restart) +and the T2 anomaly window is the primary indicator of cause vs. coincidence. + +--- + +## Step 5: Locate the Code Path + +Navigate to the relevant source based on the anomaly type. Key modules: + +### Scheduler (`source/scheduler/scheduler.c`) + +Controls when profiles fire. Key paths: +- **`TimeoutThread`** — per-profile thread; calls `timeoutNotificationCb` while holding `tMutex` +- **`SendInterruptToTimeoutThread`** — uses `pthread_mutex_trylock`; if `tMutex` is held + (callback in progress), the interrupt is **silently dropped** (EBUSY returns `T2ERROR_FAILURE`) +- **`signalrecived_and_executing`** — global flag with no atomic protection; susceptible + to concurrent-write races under multi-profile load + +### Profile / Report Generation (`source/bulkdata/profile.c`, `profilexconf.c`, `reportprofiles.c`) + +- `CollectAndReport` / `CollectAndReportXconf` hold `plMutex` or `reuseThreadMutex` + for the entire report lifecycle (collection + send) +- **rbus send** (`rbusMethod_Invoke` / `rbusMethod_InvokeAsync`) has **no timeout** — + a blocked rbus provider blocks the entire thread indefinitely +- Report-on-condition logic in `reportprofiles.c` can fire concurrently with a + periodic send if synchronization is missing + +### Data Collection / CPU (`source/dcautil/dca.c`, `dcautil.c`, `dcacpu.c`, `dcamem.c`) + +- DCA log-grep is I/O and CPU intensive; large log files can cause CPU spikes +- `dcacpu.c` and `dcamem.c` sample system resources; misreads can cause false markers +- Marker accumulation without cleanup (`t2markers.c`) can grow heap over time + +### Profile Configuration (`source/t2parser/`, `source/bulkdata/profilexconf.c`) + 
+- Profile reception, parsing, and activation path for xconf-sourced profiles +- Incorrect interval parsing or duplicate profile names can cause + over-scheduling or silent deactivation + +### Transport Layer (`source/protocol/http/`, `source/protocol/rbusMethod/`) + +- HTTP send failures, retry logic, and cached-report replay +- rbus method provider registration and response handling + +--- + +## Step 6: Characterize Root Cause + +Use this matrix to classify the issue based on observed evidence: + +| Observed Pattern | Issue Class | Primary Code Location | +|---|---|---| +| rbus call blocks > 10s, no `asyncMethodHandler` logged | Rbus provider unresponsive | `profile.c` / rbus transport | +| `Signal Thread To restart` logged but no report follows | Interrupt signal dropped (EBUSY on `tMutex`) | `scheduler.c:SendInterruptToTimeoutThread` | +| `TIMEOUT` fires but `CollectAndReport` never starts | Thread pool exhausted or profile in error state | `scheduler.c`, `reportprofiles.c` | +| `TIMEOUT` entries missing for > 2 intervals | TimeoutThread stuck, exited, or profile deregistered | `scheduler.c:TimeoutThread` | +| Multiple `CollectAndReport ++in` within one interval | Over-scheduling: concurrent interrupt + periodic fire | `scheduler.c`, `reportprofiles.c` | +| Long gap between `++in` and `--out` with HTTP errors | Network failure; cached report retry loop | `profilexconf.c`, HTTP transport | +| Report JSON marker counts lower than expected | DCA grep miss, log rotation during scan, or marker not registered | `dca.c`, `t2markers.c` | +| Report JSON marker counts higher than expected | Duplicate marker registration or over-counting in DCA | `dca.c`, `t2markers.c` | +| `signalrecived_and_executing` logic inconsistency | Unsynchronized global flag race | `scheduler.c` (global variable) | +| CPU spike during report window | Large log file DCA grep or concurrent profile collection | `dcautil.c`, `dca.c` | +| Memory growth over sessions | Marker list not freed, profile 
not cleaned up on deregister | `t2markers.c`, `profile.c` | +| Profile activated/deactivated unexpectedly | xconf push race or profile name collision | `profilexconf.c`, `t2parser/` | + +--- + +## Step 7: Assess L1 (Unit) Test Coverage + +**Location**: `source/test/` + +**Existing coverage** (representative): +- `schedulerTest.cpp`: basic `SendInterruptToTimeoutThread`, `TimeoutThread` single-run, + profile register/unregister lifecycle +- `profileTest.cpp`: profile creation, marker accumulation, basic report generation +- `dcaTest.cpp`: grep pattern matching, marker extraction + +**Identify gaps relevant to the issue**. For each gap, write a test template: + +``` +Test Name: +Setup: +Action: +Assert: +File: source/test// +``` + +**Common gap areas** (match to the issue class): +- Scheduler signal dropped when `tMutex` held during callback (EBUSY path) +- `CollectAndReport` blocked while scheduler fires multiple subsequent timeouts +- DCA grep on a large/rotating log file — correct marker counts +- Profile re-activation during an active `CollectAndReport` — no double-send +- Memory freed correctly when profile is deregistered mid-cycle +- `signalrecived_and_executing` flag read/write under concurrent profile load + +--- + +## Step 8: Assess L2 (Functional) Test Coverage + +**Location**: `test/functional-tests/features/` + +**Existing scenarios** (from `.feature` files): +- `telemetry_process_singleprofile.feature` — caching on send failure +- `telemetry_process_multiprofile.feature` — multi-profile interaction +- `telemetry_bootup_sequence.feature`, `telemetry_runs_as_daemon.feature` +- `telemetry_process_tempProfile.feature`, `telemetry_xconf_communication.feature` + +**Identify the missing scenario** that would catch the reported issue. Write a +Gherkin outline covering: +1. The precondition (profile active, network state, system load) +2. The triggering event (external state change, concurrent interrupt, large log file, etc.) +3. 
The correct observable outcome (report sent within interval, no duplicate, CPU within bounds) +4. The failure observable outcome (what the bug produces vs. what is expected) + +```gherkin +Feature: + + Scenario: + Given + And + When + Then + And +``` + +--- + +## Step 9: Document Findings + +Produce a triage report with: +1. **Issue restatement**: confirm back the user's stated problem in one sentence +2. **Device context**: MAC, firmware, session timestamp(s) examined +3. **Anomaly timeline**: exact timestamps, thread IDs, duration or frequency +4. **Root cause chain**: numbered steps, each with log evidence + source code reference +5. **L1 test gap**: which test file, test name, and what assertion it makes +6. **L2 test gap**: Gherkin scenario outline +7. **Proposed fix**: minimum-scope change — file, function, and what to change + +--- + +## Common Pitfalls + +- **Timestamp confusion**: Log header `260306-HH:MM:SS` = `2026-03-06`; report JSON + `"timestamp":"177xxxxxxx.xx"` is Unix epoch UTC — do not mix them +- **Session folder order**: Alphabetical sort does NOT equal chronological order +- **`signalrecived_and_executing`**: This global has a typo in the source ("recived") — + search for it exactly as spelled +- **EBUSY ≠ deadlock**: The `trylock` in `SendInterruptToTimeoutThread` prevents + deadlock but causes **silent signal loss** — the thread is not stuck, the interrupt + was simply never delivered +- **`asyncMethodHandler` absence**: No log of this thread during an rbus call means + the rbus provider never received the request — distinguish from a network-only issue +- **Double-log artifact**: T2 logs `methodName = ...` twice per send in some builds — + this is a logging artifact, not two actual sends; verify by counting `Return status` lines +- **Profile count vs. 
report count**: A profile may be registered but never reach + `CollectAndReport` if upstream conditions are not met — trace from `TIMEOUT` forward +- **DCA grep on rotated logs**: If a log file rotates mid-scan, DCA may return 0 for + a marker that was incremented — correlates to under-reporting without any error log diff --git a/README.md b/README.md new file mode 100644 index 00000000..eb0cfcae --- /dev/null +++ b/README.md @@ -0,0 +1,397 @@ +# Telemetry 2.0 + +[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE) +[![C](https://img.shields.io/badge/Language-C-blue.svg)](https://en.wikipedia.org/wiki/C_(programming_language)) +[![Platform](https://img.shields.io/badge/Platform-Embedded%20Linux-orange.svg)](https://www.yoctoproject.org/) + +A lightweight, efficient telemetry framework for RDK (Reference Design Kit) embedded devices. + +## Overview + +Telemetry 2.0 provides real-time monitoring, event collection, and reporting capabilities optimized for resource-constrained embedded devices such as set-top boxes, gateways, and IoT devices. 
+ +### Key Features + +- ⚡ **Efficient**: Connection pooling and batch reporting +- 🔒 **Secure**: mTLS support for encrypted communication +- 📊 **Flexible**: Profile-based configuration (JSON/XConf) +- 🔧 **Platform-Independent**: Multiple architecture support + +### Architecture Highlights + +```mermaid +graph TB + A[Telemetry Events/Markers] --> B[Profile Matcher] + B --> C[Report Generator] + C --> D[HTTP Connection Pool] + D --> E[XConf Server / Data Collector] + F[XConf Client] -.->|Config| B + G[Scheduler] -.->|Triggers| C +``` + +## Quick Start + +### Prerequisites + +- GCC 4.8+ or Clang 3.5+ +- pthread library +- libcurl 7.65.0+ +- cJSON library +- OpenSSL 1.1.1+ (for mTLS) + +### Build + +```bash +# Clone repository +git clone https://github.com/rdkcentral/telemetry.git +cd telemetry + +# Configure +autoreconf -i +./configure + +# Build +make + +# Install +sudo make install +``` + +### Docker Development + +Refer to the provided Docker container for a consistent development environment: + +https://github.com/rdkcentral/docker-device-mgt-service-test + + +See [Build Setup Guide](docs/integration/build-setup.md) for detailed build options. 
+ +### Basic Usage + +```c +#include "telemetry2_0.h" + +int main(void) { + // Initialize telemetry + if (t2_init() != 0) { + fprintf(stderr, "Failed to initialize telemetry\n"); + return -1; + } + + // Send a marker event + t2_event_s("SYS_INFO_DeviceBootup", "Device started successfully"); + + // Cleanup + t2_uninit(); + return 0; +} +``` + +Compile: `gcc -o myapp myapp.c -ltelemetry` + +## Documentation + +📚 **[Complete Documentation](docs/README.md)** + +### Key Documents + +- **[Architecture Overview](docs/architecture/overview.md)** - System design and components +- **[API Reference](docs/api/public-api.md)** - Public API documentation +- **[Developer Guide](docs/integration/developer-guide.md)** - Getting started +- **[Build Setup](docs/integration/build-setup.md)** - Build configuration +- **[Testing Guide](docs/integration/testing.md)** - Test procedures + +### Component Documentation + +Individual component documentation is in [`source/docs/`](source/docs/): + +- [Bulk Data System](source/docs/bulkdata/README.md) - Profile and marker management +- [HTTP Protocol](source/docs/protocol/README.md) - Communication layer +- [Scheduler](source/docs/scheduler/README.md) - Report scheduling +- [XConf Client](source/docs/xconf-client/README.md) - Configuration retrieval + +## Project Structure + +``` +telemetry/ +├── source/ # Source code +│ ├── bulkdata/ # Profile and marker management +│ ├── protocol/ # HTTP/RBUS communication +│ ├── scheduler/ # Report scheduling +│ ├── xconf-client/ # Configuration retrieval +│ ├── dcautil/ # Log marker utilities +│ └── test/ # Unit tests (gtest/gmock) +├── include/ # Public headers +├── config/ # Configuration files +├── docs/ # Documentation +├── containers/ # Docker development environment +└── test/ # Functional tests +``` + +## Configuration + +### Profile Configuration + +Telemetry uses JSON profiles to define what data to collect: + +```json +{ + "Profile": "RDKB_BasicProfile", + "Version": "1.0.0", + "Protocol": 
"HTTP", + "EncodingType": "JSON", + "ReportingInterval": 300, + "Parameters": [ + { + "type": "dataModel", + "name": "Device.DeviceInfo.Manufacturer" + }, + { + "type": "event", + "eventName": "bootup_complete" + } + ] +} +``` + +See [Profile Configuration Guide](docs/integration/profile-configuration.md) for details. + +### Environment Variables + +| Variable | Description | Default | +|----------|-------------|---------| +| `T2_ENABLE_DEBUG` | Enable debug logging | `0` | +| `T2_PROFILE_PATH` | Default profile directory | `/etc/DefaultT2Profile.json` | +| `T2_XCONF_URL` | XConf server URL | - | +| `T2_REPORT_URL` | Report upload URL | - | + +## Runtime Operations + +### Signal Handling + +The Telemetry 2.0 daemon responds to the following signals for runtime control: + +| Signal | Value | Purpose | +|--------|-------|---------| +| **SIGTERM** | 15 | Gracefully terminate the daemon, cleanup resources and exit | +| **SIGINT** | 2 | Interrupt signal - uninitialize services, cleanup and exit | +| **SIGUSR1** | 10 | Trigger log upload with seekmap reset | +| **SIGUSR2** | 12 | Reload configuration from XConf server | +| **LOG_UPLOAD** | 10 | Custom signal to trigger log upload and reset retain seekmap flag | +| **EXEC_RELOAD** | 12 | Custom signal to reload XConf configuration and restart XConf client | +| **LOG_UPLOAD_ONDEMAND** | 29 | Custom signal for on-demand log upload without seekmap reset | +| **SIGIO** | - | I/O signal - repurposed for on-demand log upload | + +**Examples:** + +```bash +# Gracefully stop telemetry +kill -SIGTERM $(pidof telemetry2_0) + +# Trigger log upload +kill -10 $(pidof telemetry2_0) + +# Reload configuration +kill -12 $(pidof telemetry2_0) + +# On-demand log upload +kill -29 $(pidof telemetry2_0) +``` + +**Notes:** +- Custom signal values (10, 12, 29) are defined to avoid conflicts with standard system signals +- Signals SIGUSR1, SIGUSR2, LOG_UPLOAD, EXEC_RELOAD, LOG_UPLOAD_ONDEMAND, and SIGIO are blocked during signal handler execution 
to prevent race conditions +- Child processes ignore signals other than SIGCHLD, SIGPIPE, SIGALRM, and the log upload/reload signals + +### WebConfig/Profile Reload + +Telemetry 2.0 supports multiple mechanisms for dynamically reloading report profiles and configuration: + +#### 1. Signal-Based XConf Reload + +Trigger an XConf configuration reload by sending a signal: + +```bash +# EXEC_RELOAD (signal value 12) +kill -12 $(pidof telemetry2_0) +``` + +This stops the XConf client and restarts it to fetch updated configuration from the XConf server. + +#### 2. RBUS-Based Profile Updates + +For WebConfig integration, profiles can be set directly via RBUS (requires `rbuscli`): + +```bash +# Load a temporary profile (JSON format) +rbuscli setv "Device.X_RDKCENTRAL-COM_T2.Temp_ReportProfiles" string '{"profiles":[...]}' + +# Set persistent profiles +rbuscli setv "Device.X_RDKCENTRAL-COM_T2.ReportProfiles" string '{"profiles":[...]}' + +# Set profiles in MessagePack binary format +rbuscli setv "Device.X_RDKCENTRAL-COM_T2.ReportProfilesMsgPack" bytes + +# Clear all profiles +rbuscli setv "Device.X_RDKCENTRAL-COM_T2.ReportProfiles" string '{"profiles":[]}' +``` + +#### 3. DCM Event-Based Reload + +Publish the DCM reload event via RBUS to trigger a reload (typically done by the WebConfig framework): + +```bash +# Publish DCM reload event +rbuscli publish Device.X_RDKCENTREL-COM.Reloadconfig +``` + +#### 4. 
Using Test Utilities + +The project includes a convenience script for testing profile updates: + +```bash +# Load example profile +./test/set_report_profile.sh example + +# Load DOCSIS reference profile +./test/set_report_profile.sh docsis + +# Clear all profiles +./test/set_report_profile.sh empty + +# Load custom JSON profile +./test/set_report_profile.sh '{"profiles":[...]}' +``` + +**Available RBUS Parameters:** + +- `Device.X_RDKCENTRAL-COM_T2.ReportProfiles` - Persistent report profiles (JSON) +- `Device.X_RDKCENTRAL-COM_T2.ReportProfilesMsgPack` - Persistent profiles (MessagePack binary) +- `Device.X_RDKCENTRAL-COM_T2.Temp_ReportProfiles` - Temporary profiles (JSON) +- `Device.X_RDKCENTRAL-COM_T2.UploadDCMReport` - Trigger on-demand report upload +- `Device.X_RDKCENTRAL-COM_T2.AbortDCMReport` - Abort ongoing report upload + +## Development + +### Running Tests + +```bash +# Unit tests +make check + +# Functional tests +cd test +./run_ut.sh + +# Code coverage +./cov_build.sh +``` + +### Development Container + +Use the provided Docker container for consistent development: + https://github.com/rdkcentral/docker-device-mgt-service-test + +```bash +cd docker-device-mgt-service-test +docker compose up -d +``` + +The parent of the current directory is mounted as a volume at `/mnt/L2_CONTAINER_SHARED_VOLUME`. +Log in to the container and run the tests as follows: +```bash +docker exec -it native-platform /bin/bash +cd /mnt/L2_CONTAINER_SHARED_VOLUME/telemetry +sh test/run_ut.sh +``` + +See [Docker Development Guide](containers/README.md) for more details. + +## Platform Support + +Telemetry 2.0 is designed to be platform-independent and has been tested on: + +- **RDK-B** (Broadband devices) +- **RDK-V** (Video devices) +- **Linux** (x86_64, ARM, ARM64) +- **Yocto Project** builds + +See [Platform Porting Guide](docs/integration/platform-porting.md) for porting to new platforms. + + +## Contributing + +We welcome contributions!
Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines. + +### Development Workflow + +1. Fork the repository +2. Create a feature branch (`git checkout -b feature/amazing-feature`) +3. Make your changes +4. Add tests for new functionality +5. Ensure all L1 and L2 tests pass +6. Commit your changes (`git commit -m 'Add amazing feature'`) +7. Push to the branch (`git push origin feature/amazing-feature`) +8. Open a Pull Request + +### Code Style + +- Follow the existing C code style and make sure the astyle formatting checks pass by running the commands below: +```bash + find . -name '*.c' -o -name '*.h' | xargs astyle --options=.astylerc + find . -name '*.orig' -type f -delete + ``` + +- Use descriptive variable names +- Document all public APIs +- Add unit tests for new functions +- Add functional tests for new features + +See [Coding Guidelines](.github/instructions/c-embedded.instructions.md) for details. + +## Troubleshooting + +### Common Issues + +**Q: Telemetry not sending reports** +- Check network connectivity +- Verify XConf URL configuration +- Review logs in `/var/log/telemetry/` + +**Q: High memory usage** + +- Reduce the number of active profiles +- Decrease reporting intervals +- Check for memory leaks with valgrind + +**Q: Build errors** + +- Ensure all dependencies are installed +- Check compiler version (GCC 4.8+) +- Review build logs for missing libraries + +See [Troubleshooting Guide](docs/troubleshooting/common-errors.md) for more solutions. + +## License + +This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details. + +## Acknowledgments + +- RDK Management LLC +- RDK Community Contributors +- Open Source Community + +## Contact + +- **Repository**: https://github.com/rdkcentral/telemetry +- **Issues**: https://github.com/rdkcentral/telemetry/issues +- **RDK Central**: https://rdkcentral.com + +## Changelog + +See [CHANGELOG.md](CHANGELOG.md) for version history and release notes.
+ +--- + +**Built for the RDK Community** diff --git a/source/bulkdata/profile.c b/source/bulkdata/profile.c index 6d7c094e..3e5c866b 100644 --- a/source/bulkdata/profile.c +++ b/source/bulkdata/profile.c @@ -1185,7 +1185,7 @@ T2ERROR deleteAllProfiles(bool delFromDisk) T2Error("Profile : %s failed to unregister from scheduler\n", tempProfile->name); } - pthread_mutex_lock(&plMutex); + /* Release plMutex before pthread_join to avoid deadlock */ if (tempProfile->threadExists) { pthread_mutex_lock(&tempProfile->reuseThreadMutex); @@ -1194,6 +1194,9 @@ T2ERROR deleteAllProfiles(bool delFromDisk) pthread_join(tempProfile->reportThread, NULL); tempProfile->threadExists = false; } + + /* Re-acquire plMutex for profile cleanup */ + pthread_mutex_lock(&plMutex); if(tempProfile->grepSeekProfile) { freeGrepSeekProfile(tempProfile->grepSeekProfile); @@ -1285,6 +1288,14 @@ T2ERROR deleteProfile(const char *profileName) } pthread_mutex_unlock(&profile->reportInProgressMutex); + /* Release plMutex before pthread_join to avoid deadlock. + * pthread_join can block indefinitely if the CollectAndReport thread + * is stuck (e.g., waiting on rbusMethodMutex). Holding plMutex during + * pthread_join prevents other threads (timeout callbacks, other profile + * operations) from making progress, creating a deadlock. 
+ */ + pthread_mutex_unlock(&plMutex); + if (profile->threadExists) { pthread_mutex_lock(&profile->reuseThreadMutex); @@ -1294,6 +1305,9 @@ T2ERROR deleteProfile(const char *profileName) profile->threadExists = false; } + /* Re-acquire plMutex for profile cleanup operations */ + pthread_mutex_lock(&plMutex); + if(Vector_Size(profile->triggerConditionList) > 0) { rbusT2ConsumerUnReg(profile->triggerConditionList); diff --git a/source/bulkdata/profilexconf.c b/source/bulkdata/profilexconf.c index d101020e..bc3c5953 100644 --- a/source/bulkdata/profilexconf.c +++ b/source/bulkdata/profilexconf.c @@ -206,21 +206,35 @@ static T2ERROR initJSONReportXconf(cJSON** jsonObj, cJSON **valArray) static void* CollectAndReportXconf(void* data) { (void) data;// To fix compiler warning + + /* Set reportThreadExits flag under mutex to prevent data race with + * ProfileXConf_notifyTimeout which reads this flag under plMutex. + */ pthread_mutex_lock(&plMutex); - ProfileXConf* profile = singleProfile; - if(profile == NULL) - { - T2Error("profile is NULL\n"); - pthread_mutex_unlock(&plMutex); - return NULL; - } - pthread_cond_init(&reuseThread, NULL); reportThreadExits = true; - //GrepSeekProfile *GPF = profile->grepSeekProfile; + pthread_mutex_unlock(&plMutex); + do { + /* CRITICAL SECTION START: Acquire plMutex to check/access singleProfile */ + pthread_mutex_lock(&plMutex); + T2Info("%s while Loop -- START \n", __FUNCTION__); - profile = singleProfile; + if(singleProfile == NULL) + { + T2Error("%s is called with empty profile, profile reload might be in-progress, skip the request\n", __FUNCTION__); + goto reportXconfThreadEnd; + } + + ProfileXConf* profile = singleProfile; + + /* Set reportInProgress flag to prevent concurrent report generation + * and profile deletion while we're working. This must be done under + * plMutex to prevent races with ProfileXConf_notifyTimeout() and + * ProfileXConf_delete(). 
+ */ + profile->reportInProgress = true; + Vector *profileParamVals = NULL; Vector *grepResultList = NULL; cJSON *valArray = NULL; @@ -243,6 +257,21 @@ static void* CollectAndReportXconf(void* data) T2Info("%s ++in profileName : %s\n", __FUNCTION__, profile->name); } + /* CRITICAL: Release plMutex before potentially blocking operations. + * Report generation involves: + * + * Holding plMutex during these operations blocks ALL other XCONF profile + * operations (timeouts, updates, deletions, marker events) system-wide, + * causing the telemetry system to hang. + * + * We can safely release plMutex here because: + * 1. We've already checked singleProfile is valid + * 2. We use profile->reportInProgress to prevent concurrent reports + * 3. Profile deletion waits for reportInProgress to be false + * 4. We'll re-acquire plMutex before updating shared state + */ + pthread_mutex_unlock(&plMutex); + /* CRITICAL SECTION END - plMutex released, other threads can now proceed */ clock_gettime(CLOCK_REALTIME, &startTime); if(profile->encodingType != NULL && !strcmp(profile->encodingType, "JSON")) @@ -250,9 +279,15 @@ static void* CollectAndReportXconf(void* data) if(T2ERROR_SUCCESS != initJSONReportXconf(&profile->jsonReportObj, &valArray)) { T2Error("Failed to initialize JSON Report\n"); - profile->reportInProgress = false; - //pthread_mutex_unlock(&plMutex); - //return NULL; + /* Re-acquire plMutex before updating profile state. + * CRITICAL: Keep plMutex locked before goto because reportXconfThreadEnd + * calls pthread_cond_wait which requires the mutex to be locked. + */ + pthread_mutex_lock(&plMutex); + if(singleProfile == profile) + { + profile->reportInProgress = false; + } goto reportXconfThreadEnd; } @@ -307,10 +342,20 @@ static void* CollectAndReportXconf(void* data) dcaFlagReportCompleation(); - if(profile->eMarkerList != NULL && Vector_Size(profile->eMarkerList) > 0) + /* CRITICAL: Re-acquire plMutex to safely access eMarkerList. 
+ * External components can call t2_event_s() which modifies eMarkerList + * via ProfileXConf_storeMarkerEvent(). We must hold plMutex during + * event marker encoding to prevent race conditions. + * This is safe because encoding is quick (~milliseconds), unlike HTTP + * upload which can take 30+ seconds. + */ + pthread_mutex_lock(&plMutex); + if(singleProfile == profile && profile->eMarkerList != NULL && Vector_Size(profile->eMarkerList) > 0) { encodeEventMarkersInJSON(valArray, profile->eMarkerList); } + pthread_mutex_unlock(&plMutex); + profile->grepSeekProfile->execCounter += 1; T2Info("Execution Count = %d\n", profile->grepSeekProfile->execCounter); @@ -322,9 +367,15 @@ static void* CollectAndReportXconf(void* data) if(ret != T2ERROR_SUCCESS) { T2Error("Unable to generate report for : %s\n", profile->name); - profile->reportInProgress = false; - //pthread_mutex_unlock(&plMutex); - //return NULL; + /* Re-acquire plMutex before updating profile state. + * CRITICAL: Keep plMutex locked before goto because reportXconfThreadEnd + * calls pthread_cond_wait which requires the mutex to be locked. + */ + pthread_mutex_lock(&plMutex); + if(singleProfile == profile) + { + profile->reportInProgress = false; + } goto reportXconfThreadEnd; } long size = strlen(jsonReport); @@ -350,13 +401,19 @@ static void* CollectAndReportXconf(void* data) free(thirdCachedReport); } Vector_PushBack(profile->cachedReportList, strdup(jsonReport)); - profile->reportInProgress = false; + /* Re-acquire plMutex before updating profile state. + * CRITICAL: Keep plMutex locked before goto because reportXconfThreadEnd + * calls pthread_cond_wait which requires the mutex to be locked. 
+ */ + pthread_mutex_lock(&plMutex); + if(singleProfile == profile) + { + profile->reportInProgress = false; + } /* CID 187010: Dereference before null check */ free(jsonReport); jsonReport = NULL; T2Debug("%s --out\n", __FUNCTION__); - //pthread_mutex_unlock(&plMutex); - //return NULL; goto reportXconfThreadEnd; } if(size > DEFAULT_MAX_REPORT_SIZE) @@ -490,17 +547,45 @@ static void* CollectAndReportXconf(void* data) isAbortTriggered = false ; } - profile->reportInProgress = false; - //pthread_mutex_unlock(&plMutex); + /* CRITICAL SECTION START: Re-acquire plMutex before updating profile state. + * pthread_cond_wait requires us to hold plMutex, so we acquire it here + * and hold it through the state update and into the cond_wait. + */ + pthread_mutex_lock(&plMutex); + if(singleProfile == profile) + { + profile->reportInProgress = false; + } reportXconfThreadEnd : T2Info("%s while Loop -- END \n", __FUNCTION__); - T2Info("%s --out\n", __FUNCTION__); - pthread_cond_wait(&reuseThread, &plMutex); + /* CRITICAL: Check wait condition in a loop to handle spurious wakeups. + * pthread_cond_wait can wake up spuriously without an actual signal. We must verify the actual + * condition (timeout notification pending) before proceeding. + * + * Wait while: profile exists AND no timeout pending AND not shutting down. + * Exit loop when: timeout arrives (reportInProgress=true) OR shutdown (initialized=false). + * + * pthread_cond_wait atomically releases plMutex while waiting. + * When signaled or spuriously woken, it re-acquires plMutex before returning. + */ + while(singleProfile && !singleProfile->reportInProgress && initialized) + { + pthread_cond_wait(&reuseThread, &plMutex); + } + /* After cond_wait loop exits, we hold plMutex again. Release it before + * the next loop iteration (which will re-acquire it). + */ + pthread_mutex_unlock(&plMutex); } while(initialized); + + /* Thread is exiting. 
We don't hold plMutex here, so acquire it to + * update the reportThreadExits flag, then release it. + */ + pthread_mutex_lock(&plMutex); reportThreadExits = false; pthread_mutex_unlock(&plMutex); - pthread_cond_destroy(&reuseThread); + T2Info("%s --out exiting the CollectAndReportXconf thread \n", __FUNCTION__); return NULL; } @@ -520,6 +605,15 @@ T2ERROR ProfileXConf_init(bool checkPreviousSeek) T2Error("%s Mutex init has failed\n", __FUNCTION__); return T2ERROR_FAILURE; } + /* Initialize condition variable at module init to prevent race where + * ProfileXConf_notifyTimeout signals before CollectAndReportXconf initializes it. + */ + if(pthread_cond_init(&reuseThread, NULL) != 0) + { + T2Error("%s Condition variable init has failed\n", __FUNCTION__); + pthread_mutex_destroy(&plMutex); + return T2ERROR_FAILURE; + } Vector_Create(&configList); fetchLocalConfigs(XCONFPROFILE_PERSISTENCE_PATH, configList); @@ -600,6 +694,8 @@ T2ERROR ProfileXConf_uninit() freeProfileXConf(); pthread_mutex_unlock(&plMutex); + /* Destroy condition variable at module uninit, after all threads are stopped */ + pthread_cond_destroy(&reuseThread); pthread_mutex_destroy(&plMutex); T2Debug("%s --out\n", __FUNCTION__); return T2ERROR_SUCCESS; @@ -624,6 +720,13 @@ T2ERROR ProfileXConf_set(ProfileXConf *profile) eMarker = (EventMarker *)Vector_At(singleProfile->eMarkerList, emIndex); addT2EventMarker(eMarker->markerName, eMarker->compName, singleProfile->name, eMarker->skipFreq); } + + /* Release plMutex before calling scheduler API to avoid potential + * blocking while holding the mutex. Scheduler operations may involve + * timer management and other operations that shouldn't block profile access. 
+ */ + pthread_mutex_unlock(&plMutex); + if(registerProfileWithScheduler(singleProfile->name, singleProfile->reportingInterval, INFINITE_TIMEOUT, false, true, false, DEFAULT_FIRST_REPORT_INT, NULL) == T2ERROR_SUCCESS) { T2Info("Successfully set profile : %s\n", singleProfile->name); @@ -633,6 +736,9 @@ T2ERROR ProfileXConf_set(ProfileXConf *profile) { T2Error("Unable to register profile : %s with Scheduler\n", singleProfile->name); } + + /* Re-acquire plMutex, which was released above before the scheduler call */ + pthread_mutex_lock(&plMutex); } else { @@ -708,7 +814,16 @@ T2ERROR ProfileXConf_delete(ProfileXConf *profile) return T2ERROR_FAILURE; } + /* Copy profile name before unlocking to prevent use-after-free if + * another thread deletes/frees singleProfile after we release plMutex. + */ + char profileNameCopy[256] = {0}; + if(!isNameEqual && singleProfile && singleProfile->name) + { + strncpy(profileNameCopy, singleProfile->name, sizeof(profileNameCopy) - 1); + } pthread_mutex_unlock(&plMutex); + if(isNameEqual) { T2Info("Profile exists already, updating the config in file system\n"); @@ -719,29 +834,39 @@ T2ERROR ProfileXConf_delete(ProfileXConf *profile) } else { - if(T2ERROR_SUCCESS != unregisterProfileFromScheduler(singleProfile->name)) + /* Use copied profile name instead of singleProfile->name to avoid + * use-after-free since we no longer hold plMutex. + */ + if(profileNameCopy[0] != '\0') { - T2Error("Profile : %s failed to unregister from scheduler\n", singleProfile->name); + if(T2ERROR_SUCCESS != unregisterProfileFromScheduler(profileNameCopy)) + { + T2Error("Profile : %s failed to unregister from scheduler\n", profileNameCopy); + } } } - if(singleProfile->reportInProgress) + + /* CRITICAL: Actually wait for report to complete before deletion.
+ * Without this wait, the profile could be deleted while CollectAndReportXconf() + * is still running, causing a use-after-free when that thread touches profile + * members during its brief plMutex holds (event encoding, etc). + */ + pthread_mutex_lock(&plMutex); + unsigned int waitIterations = 0; + const unsigned int LOG_INTERVAL = 3000; // Log every 3000 iterations (30 seconds at 10ms per iteration) + while(singleProfile && singleProfile->reportInProgress) { - T2Info("Waiting for CollectAndReport to be complete : %s\n", singleProfile->name); - pthread_mutex_lock(&plMutex); - initialized = false; - T2Info("Sending signal to reuse Thread in CollectAndReportXconf\n"); - pthread_cond_signal(&reuseThread); + if(waitIterations % LOG_INTERVAL == 0) + { + T2Info("Waiting for CollectAndReportXconf to be complete : %s\n", singleProfile->name); + } + waitIterations++; pthread_mutex_unlock(&plMutex); - pthread_join(singleProfile->reportThread, NULL); - T2Info("reportThread exits and initialising the profile list\n"); - reportThreadExits = false; - initialized = true; - singleProfile->reportInProgress = false ; + usleep(10000); // 10ms polling interval + pthread_mutex_lock(&plMutex); } - pthread_mutex_lock(&plMutex); - size_t count = Vector_Size(singleProfile->cachedReportList); // Copy any cached message present in previous single profile to new profile if(isNameEqual) diff --git a/source/dcautil/dca.c b/source/dcautil/dca.c index 7aa2218e..18ad0e06 100644 --- a/source/dcautil/dca.c +++ b/source/dcautil/dca.c @@ -885,7 +885,7 @@ static FileDescriptor* getFileDeltaInMemMapAndSearch(const int fd, const off_t s } else { - T2Error("Error opening rotated file. Start search in current file\n"); + T2Debug("Error opening rotated file. Start search in current file\n"); T2Debug("File size rounded to nearest page size used for offset read: %jd bytes\n", (intmax_t)offset_in_page_size_multiple); if(seek_value < sb.st_size) {
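
Reviewer note: the locking discipline this change moves toward can be seen more easily in isolation. The sketch below is not code from this repository; every name in it (`lock`, `wake`, `idle`, `work_pending`, `running`, `request_report`, `run_demo`) is hypothetical. It only illustrates three ideas from the hunks above, under the assumption of a single reusable worker thread: wait on a condition variable inside a predicate loop, drop the mutex around slow work, and never hold the mutex across `pthread_join`.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for plMutex, reuseThread and the profile flags. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wake = PTHREAD_COND_INITIALIZER; /* work arrived or shutdown */
static pthread_cond_t idle = PTHREAD_COND_INITIALIZER; /* worker finished a request */
static bool work_pending = false; /* plays the role of reportInProgress */
static bool running = true;       /* plays the role of initialized */
static int reports_done = 0;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    for (;;) {
        /* Predicate loop: a wakeup alone proves nothing, so re-check state. */
        while (!work_pending && running) {
            pthread_cond_wait(&wake, &lock);
        }
        if (!work_pending) {
            break; /* woken only because of shutdown */
        }
        pthread_mutex_unlock(&lock);
        /* Slow work (collection, upload) would run here without the lock. */
        pthread_mutex_lock(&lock);
        work_pending = false;
        reports_done++;
        pthread_cond_signal(&idle);
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void request_report(void)
{
    pthread_mutex_lock(&lock);
    while (work_pending) { /* let the previous request finish first */
        pthread_cond_wait(&idle, &lock);
    }
    work_pending = true;
    pthread_cond_signal(&wake);
    pthread_mutex_unlock(&lock);
}

/* Starts the worker, issues `requests` reports, shuts down, returns the count. */
int run_demo(int requests)
{
    pthread_t tid;

    work_pending = false;
    running = true;
    reports_done = 0;
    if (pthread_create(&tid, NULL, worker, NULL) != 0) {
        return -1;
    }
    for (int i = 0; i < requests; i++) {
        request_report();
    }
    pthread_mutex_lock(&lock);
    running = false;
    pthread_cond_signal(&wake);
    pthread_mutex_unlock(&lock); /* release before joining, or the worker cannot exit */
    pthread_join(tid, NULL);
    return reports_done;
}
```

Calling `run_demo(3)` from a `main` and linking with `-pthread` should process all three requests before the join returns. Unlike the polling loop added to `ProfileXConf_delete`, this sketch waits for completion on a second condition variable; that is a design alternative shown for comparison, not a description of what the patch does.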