WIP: NO-JIRA: stream karpenter pod logs during HC teardown#8467
WIP: NO-JIRA: stream karpenter pod logs during HC teardown#8467maxcao13 wants to merge 1 commit intoopenshift:mainfrom
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@maxcao13: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
📝 WalkthroughWalkthroughThe changes start background log streaming during hosted control plane (HCP) teardown. teardownHostedCluster computes the HCP namespace and calls streamHCPPodLogs, which spawns goroutines that follow container logs for pods matching app=karpenter and app=karpenter-operator, writing each container's stream to artifactDir/teardown-logs/-.log. A stop function cancels the streams and waits for goroutines; teardownHostedCluster defers this stop so logs are captured through HCP deletion. Sequence DiagramsequenceDiagram
participant Test as Test Harness
participant Framework as Teardown Framework
participant K8s as Kubernetes API
participant Stream as Log Streamer
participant Pod as Pod/Container
participant FS as File System
Test->>Framework: teardownHostedCluster()
Framework->>Framework: determine HCP namespace
Framework->>Stream: streamHCPPodLogs(namespace, artifactDir)
Stream->>Stream: create cancellable context
Stream->>K8s: build client from rest config
Stream->>FS: ensure artifactDir/teardown-logs exists
Stream->>K8s: list pods by labels (app=karpenter, app=karpenter-operator)
par follow logs
Stream->>Pod: open pod container log stream (Follow: true)
Pod-->>Stream: stream log bytes
Stream->>FS: write to <pod>-<container>.log
and delete cluster
Framework->>K8s: delete hosted cluster resources
K8s->>Pod: pods/containers terminate
end
Framework->>Stream: stopStreaming() (deferred)
Stream->>Stream: cancel context
Stream->>Stream: wait for goroutines to finish
Stream-->>Framework: streaming stopped
Framework-->>Test: teardownHostedCluster() complete
🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: maxcao13 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/test e2e-aws-autonode |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #8467 +/- ##
=======================================
Coverage 37.53% 37.53%
=======================================
Files 751 751
Lines 92026 92026
=======================================
Hits 34544 34544
Misses 54841 54841
Partials 2641 2641
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/util/teardown_log_streamer.go`:
- Around line 66-67: The followContainerLog goroutine currently swallows errors
(file creation/stream setup/copy) so teardown log failures are silent; change
followContainerLog to accept an error channel (e.g., errCh chan error) or return
an error via a channel instead of dropping it, send any non-nil errors from file
creation/stream setup/copy into that channel, and update callers (the goroutine
invocation that passes streamCtx, &wg, kc, namespace, pod.Name, container.Name,
logDir) to collect errors from errCh, aggregate/log them, and fail or surface
the error (e.g., t.Fatalf or test logger) after wg completes. Ensure the wg.Done
handling remains correct and close/cleanup of errCh is handled to avoid leaks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 6dc4b3d4-e49a-4472-9d7b-9bd3299288b7
📒 Files selected for processing (2)
test/e2e/util/hypershift_framework.gotest/e2e/util/teardown_log_streamer.go
Point-in-time dumps miss the teardown window because karpenter pods are killed during HC deletion before the retry dump runs. Start Follow streams on karpenter-operator and karpenter pods before the destroy loop so the full finalizer-cleanup sequence is captured in teardown-logs/. Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Max Cao <macao@redhat.com>
f451e5d to
17ff9e9
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (1)
test/e2e/util/teardown_log_streamer.go (1)
104-104:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon’t ignore log copy failures.
Line 104 drops
io.Copyerrors, so broken/truncated streams can still look “successful” in teardown artifacts.Suggested fix
- io.Copy(f, stream) + if _, err := io.Copy(f, stream); err != nil && ctx.Err() == nil { + t.Logf("Teardown log streaming: copy failed for %s/%s: %v", podName, containerName, err) + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/util/teardown_log_streamer.go` at line 104, The io.Copy(f, stream) call in teardown_log_streamer.go currently ignores errors; change it to capture the returned (n, err) or just err, check if err != nil, and handle it by returning the error (or logging it to the test/teardown logger) so broken or truncated streams are surfaced; update the surrounding function (the method that performs the stream-to-file write where io.Copy is invoked) to propagate or record the error instead of dropping it.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@test/e2e/util/teardown_log_streamer.go`:
- Line 104: The io.Copy(f, stream) call in teardown_log_streamer.go currently
ignores errors; change it to capture the returned (n, err) or just err, check if
err != nil, and handle it by returning the error (or logging it to the
test/teardown logger) so broken or truncated streams are surfaced; update the
surrounding function (the method that performs the stream-to-file write where
io.Copy is invoked) to propagate or record the error instead of dropping it.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: bdfc741e-2334-4b01-8c9d-942085271a36
📒 Files selected for processing (2)
test/e2e/util/hypershift_framework.gotest/e2e/util/teardown_log_streamer.go
🚧 Files skipped from review as they are similar to previous changes (1)
- test/e2e/util/hypershift_framework.go
|
/test e2e-aws-autonode |
|
/test e2e-aws-autonode |
|
Now I have the complete picture. Let me verify the exact import ordering issue by looking at the file's imports against the GCI config: The file has imports in this order:
According to the GCI config with
The file has k8s.io imports before the hypershift import — they should be swapped. Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThe new file Root CauseTwo lint violations in the newly added file 1. 2.
The file currently has imports ordered as: stdlib → k8s.io → hypershift. The correct order per the config is: stdlib → hypershift → k8s.io. The RecommendationsFix 1 — Handle the // In followContainerLog(), line 104, replace:
io.Copy(f, stream)
// With:
if _, err := io.Copy(f, stream); err != nil {
t.Logf("Teardown log streaming: error copying logs for %s/%s: %v", podName, containerName, err)
}Fix 2 — Reorder imports (gci): import (
"context"
"fmt"
"io"
"os"
"path/filepath"
"sync"
"testing"
"github.com/openshift/hypershift/cmd/util"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
kubeclient "k8s.io/client-go/kubernetes"
)Alternatively, run Evidence
|
|
/test e2e-aws-autonode |
|
@maxcao13: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Test Resultse2e-aws
|
What this PR does / why we need it:
Point-in-time dumps miss the teardown window because karpenter pods are killed during HC deletion before the retry dump runs. Start Follow streams on karpenter-operator and karpenter pods before the destroy loop so the full finalizer-cleanup sequence is captured in teardown-logs/.
Which issue(s) this PR fixes:
Fixes
Special notes for your reviewer:
Checklist:
Summary by CodeRabbit