feat(perf): add PSI collectors for system and container pressure monitoring #246
Closed
jra3 wants to merge 7 commits into
…tion
Add Pressure Stall Information (PSI) collector to monitor resource contention (CPU, memory, I/O) for performance optimization and workload scheduling.
Changes:
- Add PSIStats, PSIResourceStats, CgroupPSIStats types
- Add MetricTypePSI and MetricTypeCgroupPSI constants
- Implement system-level PSI collector reading /proc/pressure/*
- Parse 'some' and 'full' metrics with avg10/avg60/avg300/total
- Kernel 4.20+ requirement with graceful PSI availability check
Partial implementation of issue #88.
Remaining work:
- System PSI tests
- Cgroup PSI collector
- Cgroup PSI tests
Reference: https://www.kernel.org/doc/html/latest/accounting/psi.html
Co-Authored-By: Claude <noreply@anthropic.com>
Add comprehensive unit tests for PSI (Pressure Stall Information) collector covering all parsing scenarios, error handling, and edge cases.
Test coverage:
- Constructor validation (valid config, missing PSI directory, invalid paths)
- PSI collection (all metrics, zero values, high pressure)
- Error handling (missing files, malformed data, empty files)
- Parsing edge cases (whitespace, missing fields, decimal precision, negative values)
- Real-world scenarios (idle system, memory/IO constrained)
- CPU-specific behavior (full metric always zero)
- File permissions
- Large uint64 values
All tests passing.
Partial implementation of issue #88. Remaining: cgroup PSI collector.
Co-Authored-By: Claude <noreply@anthropic.com>
Add OpenTelemetry transformer to convert PSI (Pressure Stall Information) metrics to OTEL format for observability platforms.
PSI Metric Mapping:
- avg10/avg60/avg300 (percentages) → Float64Gauge
- total (microseconds) → Int64Counter (cumulative)
Attributes for dimensionality:
- resource: cpu|memory|io
- stall_type: some|full
- container.id, cgroup.path (for cgroup PSI)
Implementation:
- transformPSIStats() for system-level PSI
- transformCgroupPSIStats() for per-container PSI
- recordPSIResourceMetrics() shared helper for both
Metric names following OTEL semantic conventions:
- system.psi.pressure.avg10/avg60/avg300 (gauge, %)
- system.psi.pressure.total (counter, microseconds)
Partial implementation of issue #88.
Co-Authored-By: Claude <noreply@anthropic.com>
…oring
Add cgroup-level PSI collector to monitor resource contention per container.
Implementation:
- Uses containers.Discovery for automatic container detection
- Supports cgroup v2 (standard PSI) and v1 (optional PSI)
- Reads {cpu,memory,io}.pressure from each container's cgroup
- Graceful degradation when PSI files unavailable
- Reuses ParsePSIData() from system PSI collector
Features:
- Per-container CPU/memory/IO pressure metrics
- Fault isolation (errors don't stop other containers)
- Returns []*CgroupPSIStats slice
- Context cancellation support
- OTEL transformer already implemented
Tests:
- Constructor validation
- Cgroup v2 collection (single/multiple containers)
- Partial PSI file availability
- Error handling (malformed data, missing files)
Note: Tests need mock filesystem refinement for Discovery integration.
Cgroup PSI collector code is production-ready.
Completes issue #88.
Co-Authored-By: Claude <noreply@anthropic.com>
- Update types_test.go to expect 18 collectors (added MetricTypePSI and MetricTypeCgroupPSI)
- Fix cgroup_psi_test.go container IDs to use valid 12+ char hex strings (required by container discovery)
- Improve ParsePSIData to distinguish between empty files (graceful degradation with zero values) and invalid content (returns error)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
acae89a to
48957df
… allocations
Use explicit slice pre-allocation with known capacity instead of append(), which may allocate a new backing array when capacity is exceeded. This is especially important in transformCgroupPSIStats, which loops over containers.
Changes:
- Pre-allocate stallAttrs in recordPSIResourceMetrics with cap+1
- Reuse backing array for "some" and "full" stall types
- Pre-allocate containerAttrs per iteration with maxExtraAttrs capacity
- Use three-index slice expressions to control capacity during append
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add parsePSIPercentage helper that validates avg10/avg60/avg300 values are valid percentages. This catches corrupted reads or unexpected data and provides clear error messages for debugging.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Summary
Implements Pressure Stall Information (PSI) collectors for both system-level and per-container pressure monitoring. PSI quantifies resource contention (CPU, memory, I/O) to enable performance optimization and intelligent workload scheduling decisions.
Motivation
Resource pressure metrics are critical for:
PSI provides kernel-level visibility into when tasks are stalled waiting for resources, distinguishing between "some" stalls (at least one task affected) and "full" stalls (all non-idle tasks blocked).
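For reference, the kernel's PSI documentation describes each pressure file as one `some` line and one `full` line, each carrying `avg10=`, `avg60=`, `avg300=`, and `total=` fields. The following is a minimal sketch of parsing one such line; the `psiLine` type and `parsePSILine` function are hypothetical names for illustration and are not the PR's `ParsePSIData` or `parsePSIPercentage`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// psiLine holds one "some" or "full" line from a pressure file.
// Hypothetical type for this sketch, not one of the PR's types.
type psiLine struct {
	Kind   string  // "some" or "full"
	Avg10  float64 // percent of time stalled, 10s window
	Avg60  float64 // percent of time stalled, 60s window
	Avg300 float64 // percent of time stalled, 300s window
	Total  uint64  // cumulative stall time in microseconds
}

// parsePSILine parses a line such as
// "some avg10=0.12 avg60=0.05 avg300=0.01 total=123456".
func parsePSILine(line string) (psiLine, error) {
	fields := strings.Fields(line)
	if len(fields) != 5 {
		return psiLine{}, fmt.Errorf("expected 5 fields, got %d", len(fields))
	}
	if fields[0] != "some" && fields[0] != "full" {
		return psiLine{}, fmt.Errorf("unknown stall type %q", fields[0])
	}
	out := psiLine{Kind: fields[0]}
	seen := map[string]bool{}
	for _, kv := range fields[1:] {
		key, val, ok := strings.Cut(kv, "=")
		if !ok || seen[key] {
			return psiLine{}, fmt.Errorf("malformed or duplicate field %q", kv)
		}
		seen[key] = true
		if key == "total" {
			n, err := strconv.ParseUint(val, 10, 64)
			if err != nil {
				return psiLine{}, fmt.Errorf("bad total %q: %w", val, err)
			}
			out.Total = n
			continue
		}
		pct, err := strconv.ParseFloat(val, 64)
		// The negated range check also rejects NaN.
		if err != nil || !(pct >= 0 && pct <= 100) {
			return psiLine{}, fmt.Errorf("bad percentage %q for %s", val, key)
		}
		switch key {
		case "avg10":
			out.Avg10 = pct
		case "avg60":
			out.Avg60 = pct
		case "avg300":
			out.Avg300 = pct
		default:
			return psiLine{}, fmt.Errorf("unknown key %q", key)
		}
	}
	return out, nil
}

func main() {
	got, err := parsePSILine("some avg10=0.12 avg60=0.05 avg300=0.01 total=123456")
	fmt.Printf("%+v %v\n", got, err)
}
```

Rejecting out-of-range averages up front follows the intent stated in the commit log (validating that the averages are valid percentages); the exact error wording here is an assumption.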
Implementation Details
System-Level PSI Collector
File: pkg/performance/collectors/psi.go (181 lines)
Collects pressure metrics from /proc/pressure/{cpu,memory,io}:
Container-Level PSI Collector
File: pkg/performance/collectors/cgroup_psi.go (140 lines)
Collects per-container pressure from cgroup PSI files:
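A rough sketch of the per-container read, assuming each container's cgroup directory exposes `cpu.pressure`, `memory.pressure`, and `io.pressure` as described in the commit log. It takes an `fs.FS` so it can run against an in-memory filesystem; `readCgroupPressure`, `demoFS`, and the example cgroup path are hypothetical and not taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// readCgroupPressure returns the raw contents of the pressure files found
// under cgroupDir, keyed by resource. Missing files are skipped so that a
// cgroup without PSI support yields an empty map rather than an error.
func readCgroupPressure(fsys fs.FS, cgroupDir string) (map[string]string, error) {
	out := make(map[string]string)
	for _, res := range []string{"cpu", "memory", "io"} {
		data, err := fs.ReadFile(fsys, path.Join(cgroupDir, res+".pressure"))
		if errors.Is(err, fs.ErrNotExist) {
			continue // PSI not exposed for this resource
		}
		if err != nil {
			return nil, fmt.Errorf("read %s.pressure: %w", res, err)
		}
		out[res] = string(data)
	}
	return out, nil
}

// demoFS builds an in-memory cgroup tree with io.pressure deliberately
// missing. The path layout is an assumption made for this example.
func demoFS() fs.FS {
	const line = "some avg10=0.00 avg60=0.00 avg300=0.00 total=0\n" +
		"full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
	return fstest.MapFS{
		"kubepods/abc123def456/cpu.pressure":    {Data: []byte(line)},
		"kubepods/abc123def456/memory.pressure": {Data: []byte(line)},
	}
}

func main() {
	got, err := readCgroupPressure(demoFS(), "kubepods/abc123def456")
	fmt.Println(len(got), err)
}
```

Against a real host, the same function could be called with `os.DirFS` rooted at the cgroup mount point. A collector that loops over containers would typically log a per-container error and continue, which matches the fault isolation described in the commit log.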
OTEL Metric Transformation
File: internal/metrics/consumers/otel/transformer.go (+113 lines)
OpenTelemetry metric mapping:
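The mapping from the commit log (percentage averages as gauges, the cumulative total as a counter, with `resource` and `stall_type` attributes) can be sketched without the OpenTelemetry SDK. The `stallStats` and `dataPoint` types below are plain stand-ins invented for this example; they are not SDK types and not the PR's `recordPSIResourceMetrics`.

```go
package main

import "fmt"

// stallStats mirrors one "some" or "full" line. Hypothetical type.
type stallStats struct {
	Avg10, Avg60, Avg300 float64 // percentages
	Total                uint64  // cumulative microseconds
}

// dataPoint is a stand-in for an OTEL measurement, not an SDK type.
type dataPoint struct {
	Name       string
	Instrument string  // "gauge" or "counter"
	Float      float64 // set for gauges
	Int        int64   // set for counters
	Attrs      map[string]string
}

// psiDataPoints emits four points for one resource and stall type:
// three gauges for the averages and one counter for the total.
func psiDataPoints(resource, stallType string, s stallStats) []dataPoint {
	// All four points share the same attribute map.
	attrs := map[string]string{"resource": resource, "stall_type": stallType}
	const prefix = "system.psi.pressure."
	return []dataPoint{
		{Name: prefix + "avg10", Instrument: "gauge", Float: s.Avg10, Attrs: attrs},
		{Name: prefix + "avg60", Instrument: "gauge", Float: s.Avg60, Attrs: attrs},
		{Name: prefix + "avg300", Instrument: "gauge", Float: s.Avg300, Attrs: attrs},
		// The uint64 total is narrowed to int64 to fit an integer counter.
		{Name: prefix + "total", Instrument: "counter", Int: int64(s.Total), Attrs: attrs},
	}
}

func main() {
	for _, p := range psiDataPoints("memory", "some", stallStats{Avg10: 2.5, Total: 9000}) {
		fmt.Printf("%s %s float=%v int=%d attrs=%v\n", p.Name, p.Instrument, p.Float, p.Int, p.Attrs)
	}
}
```

For cgroup PSI, the commit log indicates that `container.id` and `cgroup.path` attributes are added on top of these two.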
Metric names:
- system.psi.pressure.avg10 (gauge, %)
- system.psi.pressure.avg60 (gauge, %)
- system.psi.pressure.avg300 (gauge, %)
- system.psi.pressure.total (counter, microseconds)
Type System
File: pkg/performance/types.go (+49 lines)
Added structured types:
- PSIStats: System-level pressure for all resources
- PSIResourceStats: Per-resource metrics (some/full × avg10/60/300/total)
- CgroupPSIStats: Per-container pressure with container metadata
- MetricTypePSI and MetricTypeCgroupPSI constants
Enabled in DefaultCollectionConfig for automatic collection.
Testing
System PSI Tests
File: pkg/performance/collectors/psi_test.go (468 lines)
Comprehensive coverage:
Container PSI Tests
File: pkg/performance/collectors/cgroup_psi_test.go (236 lines)
Test scenarios:
All tests use mock filesystems for isolation and reproducibility.
Graceful Degradation
The implementation handles various environments:
This ensures the agent works across diverse kernel configurations.
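One way to probe for PSI support before collecting is to check which pressure files exist. This is a sketch under the assumption that a missing file or directory simply means "PSI unavailable"; `availablePSIResources` and the two in-memory filesystems are hypothetical helpers for illustration.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// availablePSIResources lists which of cpu, memory, and io have a pressure
// file under dir. An empty result means PSI is not available (for example,
// an older kernel or PSI disabled), and a caller can skip collection.
func availablePSIResources(fsys fs.FS, dir string) []string {
	var found []string
	for _, res := range []string{"cpu", "memory", "io"} {
		info, err := fs.Stat(fsys, path.Join(dir, res))
		if err != nil || info.IsDir() {
			continue // treat any stat failure as "not available"
		}
		found = append(found, res)
	}
	return found
}

// psiKernelFS imitates a host that exposes all three pressure files.
func psiKernelFS() fs.FS {
	line := []byte("some avg10=0.00 avg60=0.00 avg300=0.00 total=0\n")
	return fstest.MapFS{
		"proc/pressure/cpu":    {Data: line},
		"proc/pressure/memory": {Data: line},
		"proc/pressure/io":     {Data: line},
	}
}

// legacyKernelFS imitates a host without a pressure directory.
func legacyKernelFS() fs.FS {
	return fstest.MapFS{"proc/meminfo": {Data: []byte("MemTotal: 1 kB\n")}}
}

func main() {
	fmt.Println(availablePSIResources(psiKernelFS(), "proc/pressure"))
	fmt.Println(availablePSIResources(legacyKernelFS(), "proc/pressure"))
}
```

Paths passed to an `fs.FS` are unrooted, so a real check would use something like `os.DirFS("/")` with `proc/pressure`.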
Kernel Compatibility
CI Improvements
File: .github/workflows/build_and_test.yml
Added linux-tools-generic package to ensure bpftool availability for vmlinux.h generation during eBPF builds. This fixes intermittent CI failures when building eBPF programs.
Related Issue
Closes #88
Files Changed
Total: +1,192 lines across 8 files
Testing Instructions
Unit Tests
All PSI collector tests use mock filesystems and run on any platform.
Integration Tests (Linux required)
Manual Testing
Deploy to a Kubernetes cluster and verify metrics:
Expected output:
Performance Impact
DefaultCollectionConfig
PSI files are kernel-maintained, and reading them does not trigger expensive computations.
Future Enhancements
Potential follow-up work:
Related