Skip to content

WIP: NO-JIRA: stream karpenter pod logs during HC teardown#8467

Draft
maxcao13 wants to merge 1 commit intoopenshift:mainfrom
maxcao13:karpenter-teardown-log-streamer
Draft

WIP: NO-JIRA: stream karpenter pod logs during HC teardown#8467
maxcao13 wants to merge 1 commit intoopenshift:mainfrom
maxcao13:karpenter-teardown-log-streamer

Conversation

@maxcao13
Copy link
Copy Markdown
Member

@maxcao13 maxcao13 commented May 8, 2026

What this PR does / why we need it:

Point-in-time dumps miss the teardown window because karpenter pods are killed during HC deletion before the retry dump runs. Start Follow streams on karpenter-operator and karpenter pods before the destroy loop so the full finalizer-cleanup sequence is captured in teardown-logs/.

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Tests
    • Real-time streaming of hosted control plane pod logs during teardown so logs are captured continuously rather than only in point-in-time dumps.
    • New background log-streaming utilities to collect per-pod/container log files into teardown artifacts for improved diagnostics and faster root-cause analysis.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 8, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@maxcao13: This pull request explicitly references no jira issue.

Details

In response to this:

What this PR does / why we need it:

Point-in-time dumps miss the teardown window because karpenter pods are killed during HC deletion before the retry dump runs. Start Follow streams on karpenter-operator and karpenter pods before the destroy loop so the full finalizer-cleanup sequence is captured in teardown-logs/.

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 8, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-area labels May 8, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 8, 2026

📝 Walkthrough

Walkthrough

The changes start background log streaming during hosted control plane (HCP) teardown. teardownHostedCluster computes the HCP namespace and calls streamHCPPodLogs, which spawns goroutines that follow container logs for pods matching app=karpenter and app=karpenter-operator, writing each container's stream to artifactDir/teardown-logs/-.log. A stop function cancels the streams and waits for goroutines; teardownHostedCluster defers this stop so logs are captured through HCP deletion.

Sequence Diagram

sequenceDiagram
    participant Test as Test Harness
    participant Framework as Teardown Framework
    participant K8s as Kubernetes API
    participant Stream as Log Streamer
    participant Pod as Pod/Container
    participant FS as File System

    Test->>Framework: teardownHostedCluster()
    Framework->>Framework: determine HCP namespace
    Framework->>Stream: streamHCPPodLogs(namespace, artifactDir)

    Stream->>Stream: create cancellable context
    Stream->>K8s: build client from rest config
    Stream->>FS: ensure artifactDir/teardown-logs exists
    Stream->>K8s: list pods by labels (app=karpenter, app=karpenter-operator)

    par follow logs
        Stream->>Pod: open pod container log stream (Follow: true)
        Pod-->>Stream: stream log bytes
        Stream->>FS: write to <pod>-<container>.log
    and delete cluster
        Framework->>K8s: delete hosted cluster resources
        K8s->>Pod: pods/containers terminate
    end

    Framework->>Stream: stopStreaming() (deferred)
    Stream->>Stream: cancel context
    Stream->>Stream: wait for goroutines to finish
    Stream-->>Framework: streaming stopped
    Framework-->>Test: teardownHostedCluster() complete
Loading
🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main change: streaming karpenter pod logs during HC teardown, which aligns with the pull request objectives and code changes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR adds utility functions for log streaming. No Ginkgo test definitions found in modified files. Check is not applicable.
Test Structure And Quality ✅ Passed Custom check for Ginkgo test structure is not applicable. PR contains utility helper functions, not Ginkgo test code with It blocks or test assertions.
Microshift Test Compatibility ✅ Passed No new Ginkgo e2e tests are added in this PR. Only utility functions for log streaming are added. The MicroShift compatibility check applies only to new test declarations, not utility functions.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No new Ginkgo e2e tests added. Changes are limited to utility functions in test infrastructure. Check not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed Check not applicable. PR modifies only test utilities (test/e2e/util/) with no deployment manifests, operators, controllers, or scheduling constraints introduced.
Ote Binary Stdout Contract ✅ Passed No stdout contract violations detected. All logging uses t.Logf() (test framework) or file writes (os.Create). No process-level initialization code writes to stdout.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No new Ginkgo e2e tests were added. PR only adds utility functions for test infrastructure (streamHCPPodLogs, followContainerLog) with no IPv4 assumptions or external connectivity requirements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 8, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: maxcao13
Once this PR has been reviewed and has the lgtm label, please assign cblecker for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@maxcao13
Copy link
Copy Markdown
Member Author

maxcao13 commented May 8, 2026

/test e2e-aws-autonode

@openshift-ci openshift-ci Bot added area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels May 8, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 37.53%. Comparing base (bded456) to head (17ff9e9).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8467   +/-   ##
=======================================
  Coverage   37.53%   37.53%           
=======================================
  Files         751      751           
  Lines       92026    92026           
=======================================
  Hits        34544    34544           
  Misses      54841    54841           
  Partials     2641     2641           
Flag Coverage Δ
cmd-support 32.76% <ø> (ø)
cpo-hostedcontrolplane 36.77% <ø> (ø)
cpo-other 37.76% <ø> (ø)
hypershift-operator 47.93% <ø> (ø)
other 27.77% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/util/teardown_log_streamer.go`:
- Around line 66-67: The followContainerLog goroutine currently swallows errors
(file creation/stream setup/copy) so teardown log failures are silent; change
followContainerLog to accept an error channel (e.g., errCh chan error) or return
an error via a channel instead of dropping it, send any non-nil errors from file
creation/stream setup/copy into that channel, and update callers (the goroutine
invocation that passes streamCtx, &wg, kc, namespace, pod.Name, container.Name,
logDir) to collect errors from errCh, aggregate/log them, and fail or surface
the error (e.g., t.Fatalf or test logger) after wg completes. Ensure the wg.Done
handling remains correct and close/cleanup of errCh is handled to avoid leaks.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 6dc4b3d4-e49a-4472-9d7b-9bd3299288b7

📥 Commits

Reviewing files that changed from the base of the PR and between bded456 and f451e5d.

📒 Files selected for processing (2)
  • test/e2e/util/hypershift_framework.go
  • test/e2e/util/teardown_log_streamer.go

Comment thread test/e2e/util/teardown_log_streamer.go Outdated
Point-in-time dumps miss the teardown window because karpenter
pods are killed during HC deletion before the retry dump runs.
Start Follow streams on karpenter-operator and karpenter pods
before the destroy loop so the full finalizer-cleanup sequence
is captured in teardown-logs/.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Max Cao <macao@redhat.com>
@maxcao13 maxcao13 force-pushed the karpenter-teardown-log-streamer branch from f451e5d to 17ff9e9 Compare May 8, 2026 20:05
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
test/e2e/util/teardown_log_streamer.go (1)

104-104: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t ignore log copy failures.

Line 104 drops io.Copy errors, so broken/truncated streams can still look “successful” in teardown artifacts.

Suggested fix
-	io.Copy(f, stream)
+	if _, err := io.Copy(f, stream); err != nil && ctx.Err() == nil {
+		t.Logf("Teardown log streaming: copy failed for %s/%s: %v", podName, containerName, err)
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/util/teardown_log_streamer.go` at line 104, The io.Copy(f, stream)
call in teardown_log_streamer.go currently ignores errors; change it to capture
the returned (n, err) or just err, check if err != nil, and handle it by
returning the error (or logging it to the test/teardown logger) so broken or
truncated streams are surfaced; update the surrounding function (the method that
performs the stream-to-file write where io.Copy is invoked) to propagate or
record the error instead of dropping it.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@test/e2e/util/teardown_log_streamer.go`:
- Line 104: The io.Copy(f, stream) call in teardown_log_streamer.go currently
ignores errors; change it to capture the returned (n, err) or just err, check if
err != nil, and handle it by returning the error (or logging it to the
test/teardown logger) so broken or truncated streams are surfaced; update the
surrounding function (the method that performs the stream-to-file write where
io.Copy is invoked) to propagate or record the error instead of dropping it.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: bdfc741e-2334-4b01-8c9d-942085271a36

📥 Commits

Reviewing files that changed from the base of the PR and between f451e5d and 17ff9e9.

📒 Files selected for processing (2)
  • test/e2e/util/hypershift_framework.go
  • test/e2e/util/teardown_log_streamer.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/util/hypershift_framework.go

@maxcao13
Copy link
Copy Markdown
Member Author

maxcao13 commented May 8, 2026

/test e2e-aws-autonode
/test e2e-aws

@maxcao13
Copy link
Copy Markdown
Member Author

maxcao13 commented May 8, 2026

/test e2e-aws-autonode

@hypershift-jira-solve-ci
Copy link
Copy Markdown

Now I have the complete picture. Let me verify the exact import ordering issue by looking at the file's imports against the GCI config:

The file has imports in this order:

  1. context, fmt, io, os, path/filepath, sync, testing (standard library)
  2. corev1, metav1, kubeclient (k8s.io — should be section 7)
  3. github.com/openshift/hypershift/cmd/util (should be section 3)

According to the GCI config with custom-order: true, the expected order is:

  1. standard
  2. dot
  3. github.com/openshift/hypershift
  4. github.com/openshift
  5. github.com/aws
  6. github.com/Azure
  7. k8s.io
  8. sigs.k8s.io
  9. default

The file has k8s.io imports before the hypershift import — they should be swapped.

Test Failure Analysis Complete

Job Information

  • Prow Job: lint / Lint (GitHub Actions)
  • Build ID: 25576912179 / Job 75086210743
  • PR: WIP: NO-JIRA: stream karpenter pod logs during HC teardown #8467WIP: NO-JIRA: stream karpenter pod logs during HC teardown
  • Workflow: lint-reusable.yamlmake lintgolangci-lint v2.11.4
  • Failed File: test/e2e/util/teardown_log_streamer.go

Test Failure Analysis

Error

test/e2e/util/teardown_log_streamer.go:104:9: Error return value of `io.Copy` is not checked (errcheck)
	io.Copy(f, stream)
	       ^
test/e2e/util/teardown_log_streamer.go:11:1: File is not properly formatted (gci)

2 issues:
* errcheck: 1
* gci: 1
make: *** [Makefile:120: lint] Error 1

Summary

The new file test/e2e/util/teardown_log_streamer.go introduced in this PR has two golangci-lint violations: (1) the return value of io.Copy(f, stream) on line 104 is silently discarded without checking for errors (errcheck linter), and (2) the import block ordering does not follow the project's custom GCI section order — k8s.io/ imports appear before github.com/openshift/hypershift/ imports, but the project's .golangci.yml requires hypershift imports first (gci linter).

Root Cause

Two lint violations in the newly added file test/e2e/util/teardown_log_streamer.go:

1. errcheck violation (line 104):
The followContainerLog function calls io.Copy(f, stream) but discards the returned (int64, error). The errcheck linter enforces that error return values must be checked. Even though this is a "fire-and-forget" log streaming goroutine and the error may not be actionable, the linter requires explicit handling — either checking and logging the error, or assigning to _ with a comment.

2. gci violation (line 11, import block formatting):
The project's .golangci.yml configures gci with custom-order: true and this section order:

  1. standard (stdlib)
  2. dot
  3. prefix(github.com/openshift/hypershift)
  4. prefix(github.com/openshift)
  5. prefix(github.com/aws)
  6. prefix(github.com/Azure)
  7. prefix(k8s.io)
  8. prefix(sigs.k8s.io)
  9. default

The file currently has imports ordered as: stdlib → k8s.io → hypershift. The correct order per the config is: stdlib → hypershift → k8s.io. The k8s.io/ imports (section 7) must come after the github.com/openshift/hypershift/ import (section 3).

Recommendations

Fix 1 — Handle the io.Copy error (errcheck):

// In followContainerLog(), line 104, replace:
io.Copy(f, stream)

// With:
if _, err := io.Copy(f, stream); err != nil {
    t.Logf("Teardown log streaming: error copying logs for %s/%s: %v", podName, containerName, err)
}

Fix 2 — Reorder imports (gci):

import (
    "context"
    "fmt"
    "io"
    "os"
    "path/filepath"
    "sync"
    "testing"

    "github.com/openshift/hypershift/cmd/util"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    kubeclient "k8s.io/client-go/kubernetes"
)

Alternatively, run golangci-lint run --fix locally to auto-fix the gci ordering.

Evidence
Evidence Detail
Lint error 1 errcheck: io.Copy return value not checked at teardown_log_streamer.go:104:9
Lint error 2 gci: Import ordering wrong at teardown_log_streamer.go:11:1 — k8s.io imports placed before hypershift imports
GCI config .golangci.yml defines custom order: standardhypershiftopenshiftawsAzurek8s.iosigs.k8s.iodefault
Current import order stdlib → k8s.io → hypershift (violates config: hypershift must precede k8s.io)
Linter version golangci-lint v2.11.4, active linters for main module: errcheck, gci, govet, ineffassign, misspell, staticcheck, unparam, unused
API lint Passed (0 issues) — only the main module lint failed
Exit code make lint exited with code 1 (2 issues found, 924 total before processing → 2 after filtering)

@maxcao13
Copy link
Copy Markdown
Member Author

maxcao13 commented May 9, 2026

/test e2e-aws-autonode
/test e2e-aws

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 9, 2026

@maxcao13: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@cwbotbot
Copy link
Copy Markdown

cwbotbot commented May 9, 2026

Test Results

e2e-aws

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/testing Indicates the PR includes changes for e2e testing do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants