Skip to content

feat: compress termination messages to fit more results in 4KB limit#9682

Open
waveywaves wants to merge 1 commit intotektoncd:mainfrom
waveywaves:feat/compress-termination-results
Open

feat: compress termination messages to fit more results in 4KB limit#9682
waveywaves wants to merge 1 commit intotektoncd:mainfrom
waveywaves:feat/compress-termination-results

Conversation

@waveywaves
Copy link
Copy Markdown
Member

@waveywaves waveywaves commented Mar 28, 2026

Changes

Add optional flate compression for termination messages, gated behind the
enable-termination-message-compression alpha feature flag. When enabled, results
are compressed with Go stdlib compress/flate and base64-encoded before writing
to the termination message path, effectively increasing the usable capacity from
~33 to ~187 results (5.7x) for typical image digest payloads.

The problem

Kubernetes limits termination messages to 4KB per container, and the total across
all containers in a pod is capped at 12KB (divided by container count). With multiple
steps, each container may get as little as 1KB for results (#4808). The sidecar logs
method (TEP-0127) bypasses this but adds a sidecar container to every TaskRun,
which is wasteful when most tasks don't need large results (#8448).

The solution

Compress the JSON payload before writing to the termination message path. The parser
auto-detects compressed messages via a tknz: prefix, maintaining full backward
compatibility with uncompressed messages.

Step writes result → entrypoint reads → JSON → flate compress → base64 → termination message
                                                                              ↓
Reconciler reads ← JSON ← flate decompress ← base64 decode ← termination message

E2E test results (kind cluster, k8s v1.29.2)

Results Without Compression With Compression
30 3776 bytes ✅
35 >4096 bytes ❌ 899 bytes ✅
100 2241 bytes ✅
150 3307 bytes ✅
180 3952 bytes ✅
185 4061 bytes ✅
200 >4096 bytes ❌

Max capacity: ~33 → ~187 results (5.7x improvement)

Implementation

  • Feature flag: enable-termination-message-compression (alpha, default: false)
    • Uses PerFeatureFlag with AlphaAPIFields stability
    • Documented in config/config-feature-flags.yaml
    • Only meaningful when results-from: termination-message (default)
  • Write path: pkg/termination/write.go compresses with flate.BestCompression + base64.RawStdEncoding, prefixed with tknz:
    • Smart fallback: if compressed output is larger than plain JSON, falls back automatically
    • Fixed pre-existing O_TRUNC bug in file rewrite
  • Parse path: pkg/termination/parse.go auto-detects tknz: prefix, decompresses transparently
  • Decompression safety: io.LimitReader with 1MB cap + explicit size check
  • Entrypoint wiring: ConfigMap → pkg/pod/pod.gocmd/entrypoint/main.goEntrypointer
  • Zero new dependencies — Go stdlib only

Test coverage

  • 18 unit tests: round-trip, size comparison, append, auto-detect, corrupt data, capacity benchmark
  • Feature flag tests (default, all-flags-set, invalid value)
  • Example TaskRun and Pipeline in examples/v1/*/alpha/

Related Issues

/kind feature

Submitter Checklist

  • Has Docs if any changes are user facing
  • Has Tests included if any functionality added or changed
  • pre-commit Passed
  • Follows the commit message standard
  • Meets the Tekton contributor standards
  • Has a kind label
  • Release notes block below has been updated

Release Notes

Added optional termination message compression (alpha feature flag
enable-termination-message-compression) that uses flate compression to fit
approximately 5.7x more results in the 4KB Kubernetes termination message limit.
The parser auto-detects compressed messages for full backward compatibility.
Zero new dependencies — uses Go stdlib only.

@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. labels Mar 28, 2026
@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 28, 2026
@waveywaves
Copy link
Copy Markdown
Member Author

E2E Test Results — Kind Cluster (k8s v1.29.2)

Deployed Tekton from this branch via ko apply with KO_DOCKER_REPO=kind.local.

Without compression (baseline)

Results Termination Message Status
30 3776 bytes PASS
35 >4096 bytes FAILTermination message is above max allowed size 4096, caused by large task result.

Max capacity without compression: ~33 results

With compression (enable-termination-message-compression: "true")

Results Termination Message Status
35 899 bytes PASS
100 2241 bytes PASS
150 3307 bytes PASS
175 3849 bytes PASS
180 3952 bytes PASS
185 4061 bytes PASS
200 >4096 bytes FAIL

Max capacity with compression: ~187 results

Summary

Result type: ~100 byte image digest (gcr.io/project/image-NNN@sha256:64-hex-chars)
Without compression: ~33 results
With compression:    ~187 results
Improvement:         5.7x capacity increase

All results verified intact after round-trip (first/last result values checked, result count matched). The tknz: prefix was present in the termination message, confirming compression was active.

@waveywaves waveywaves requested review from aThorp96 and vdemeester and removed request for pritidesai March 28, 2026 16:39
Copy link
Copy Markdown
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things to address before this can move forward.

Should fix

  1. No e2e test with the flag enabled. The unit tests in pkg/termination/ are thorough, but there's no integration test proving the full loop on a real cluster: feature flag → pod builder → entrypoint → compressed write → reconciler parse → results in .status.taskResults. At minimum, add one test to the alpha e2e suite that creates a TaskRun with enough results to exceed 4KB uncompressed and asserts they all appear in status.

  2. No tests in pkg/pod/pod_test.go or pkg/entrypoint/entrypointer_test.go for the new flag plumbing. The termination package is well-covered but the integration points aren't.

  3. OpenAPI spec not regenerated. EnableTerminationMessageCompression was added to the FeatureFlags struct but openapi_generated.go wasn't updated. Run ./hack/update-openapigen.sh.

  4. CRD schema entries missing description. The enableTerminationMessageCompression field is added in 8 places across the CRD YAMLs but without a description, unlike all other feature flags there.

Strategic context

I want to be transparent about the bigger picture here. There is a TEP in progress (TEP-0164) that addresses the root cause — storing results in external backends (OCI/S3/GCS) so the 4KB limit doesn't apply at all. That TEP also unifies results with TEP-0147 artifacts and converges with TEP-0139 trusted artifacts.

That said, TEP-0164 is still in proposal stage and won't land for a while. Users are hitting the 4KB wall today (as the issues you reference show), and this is a reasonable opt-in stop-gap: zero new deps, backward compatible, meaningful capacity improvement.

I'd suggest two things to make the relationship clear:

  1. Add a note to the feature flag description (in config-feature-flags.yaml and docs) that this is a short-term measure, and that external result storage (TEP-0164) will address the underlying limitation.
  2. Reference TEP-0164 in the PR description alongside the other related issues, so the community can see the full picture.

With the technical items above addressed, I'm supportive of this going in as an alpha feature.

Comment thread pkg/pod/pod.go
}

if featureFlags.EnableTerminationMessageCompression {
commonExtraEntrypointArgs = append(commonExtraEntrypointArgs, "-compress_termination_message=true")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must fix: This is not gated on the result extraction method. When results-from: sidecar-logs is active, the sidecar reads results from disk, not the termination message — compression is meaningless and adds confusion.

Suggested change
commonExtraEntrypointArgs = append(commonExtraEntrypointArgs, "-compress_termination_message=true")
if featureFlags.EnableTerminationMessageCompression && !sidecarLogsResultsEnabled {
commonExtraEntrypointArgs = append(commonExtraEntrypointArgs, "-compress_termination_message=true")
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added && !sidecarLogsResultsEnabled guard. Also added TestPodBuild_CompressTerminationMessage with 3 subtests covering: enabled+termination-message (arg present), enabled+sidecar-logs (arg skipped), disabled (arg absent).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @waveywaves - please make sure that this is also documented in the feature flag docs.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth logging a warning for the operator to catch as this would be an inconsistent configuration that needs to be detected

Comment thread pkg/entrypoint/entrypointer.go Outdated
Comment on lines +204 to +210
var wErr error
if e.CompressTerminationMessage {
wErr = termination.WriteCompressedMessage(e.TerminationPath, output)
} else {
wErr = termination.WriteMessage(e.TerminationPath, output)
}
if wErr != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must fix: This exact if compress { WriteCompressed } else { Write } block is duplicated at L478-485 in readResultsFromDisk. Extract a helper to avoid divergence:

func (e Entrypointer) writeTerminationMessage(path string, results []result.RunResult) error {
	if e.CompressTerminationMessage {
		return termination.WriteCompressedMessage(path, results)
	}
	return termination.WriteMessage(path, results)
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Extracted writeTerminationMessage helper method. Both call sites now use it. Added TestWriteTerminationMessage_Compression with 2 subtests (plain vs compressed output).

Comment thread go.mod
github.com/prometheus/client_golang v1.23.2 // indirect
github.com/prometheus/client_model v0.6.2 // indirect
github.com/prometheus/common v0.67.5 // indirect
github.com/prometheus/common v0.67.5
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Unrelated change — github.com/prometheus/common went from // indirect to direct but nothing in the codebase imports it. Looks like an accidental go mod tidy artifact, please revert.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted.

Comment thread pkg/termination/parse.go Outdated
if err := json.Unmarshal([]byte(msg), &r); err != nil {
return nil, fmt.Errorf("parsing message json: %w, msg: %s", err, msg)
if err := json.Unmarshal([]byte(jsonMsg), &r); err != nil {
return nil, fmt.Errorf("parsing message json: %w, msg: %s", err, jsonMsg)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: On JSON parse failure after decompression, this includes the full decompressed payload in the error message. For large payloads that could produce enormous log entries. Consider truncating to first N bytes.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Truncated to 256 bytes with ...(truncated) suffix.

Comment thread pkg/apis/config/feature_flags.go Outdated
Comment on lines +234 to +235
EnableWaitExponentialBackoff bool `json:"enableWaitExponentialBackoff,omitempty"`
EnableTerminationMessageCompression bool `json:"enableTerminationMessageCompression,omitempty"`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Inconsistent alignment — these two fields have extra whitespace padding that doesn't match the rest of the struct.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@waveywaves waveywaves force-pushed the feat/compress-termination-results branch from e4cec69 to bf1e5f0 Compare March 31, 2026 09:02
@waveywaves waveywaves marked this pull request as ready for review April 2, 2026 14:00
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 2, 2026
@waveywaves waveywaves force-pushed the feat/compress-termination-results branch from bf1e5f0 to b57cd0e Compare April 2, 2026 23:08
@vdemeester vdemeester self-assigned this Apr 7, 2026
Comment thread pkg/termination/write.go Outdated

// maxDecompressedSize is the upper bound for decompressed termination messages.
// The pre-compression JSON was at most 4KB, but with compression we allow
// larger payloads. 1MB is a safe limit that prevents decompression bombs
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't 1MB too generous? How much max payload are we expecting? 128 KB? Or is 1 MB really possible except in case of attack?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. The pre-compression input is capped at 4KB by Kubernetes (termination message limit). With flate, the theoretical max decompressed output from 4KB compressed is about 64KB for typical JSON.

1MB is a safety net against decompression bombs. 128KB would work for legitimate use — happy to lower it. The key constraint: must be bounded (not unlimited) and larger than any legitimate 4KB-compressed payload could produce.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's put it to 128 KB

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — changed to 128KB.

Comment thread pkg/termination/write.go
existing, err := parseExisting(fileContents)
if err == nil {
pro = append(existing, pro...)
}
Copy link
Copy Markdown
Contributor

@khrm khrm Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why weren't we surfacing error earlier? Is it still valid now to not surface it if error vectors have increased leading to loss of results?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pre-existing code silently swallows parse errors from the existing file (line 57-59) — this was intentional so a corrupt termination message doesn't prevent a step from completing and writing its own results.

With compression, the error vectors haven't increased: compressed messages auto-detect via the tknz: prefix, and if decompression fails, the same fallback applies (treat as corrupt, overwrite with new results).

If you think we should log a warning when dropping corrupt existing results, I can add that — though it would be a behavior change for the non-compressed path too.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated — now logs a warning when existing termination message can't be parsed:

slog.Warn("Failed to parse existing termination message, previous results will be lost",
    "error", parseErr, "path", path)

This applies to both compressed and non-compressed paths. Results are still written (step doesn't fail), but the data loss is no longer silent.

@waveywaves waveywaves force-pushed the feat/compress-termination-results branch from b57cd0e to 23c80a4 Compare April 9, 2026 09:47
@waveywaves
Copy link
Copy Markdown
Member Author

/retest

@waveywaves waveywaves force-pushed the feat/compress-termination-results branch from 23c80a4 to bf8451c Compare April 14, 2026 09:33
@tekton-robot tekton-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 14, 2026
Adds optional zlib compression (gated by enable-termination-message-compression
feature flag) to the entrypoint binary's termination message writer, with
decompression on the controller side. Falls back to uncompressed format
when compression is disabled or the controller encounters uncompressed data.

This increases effective result capacity from ~33 to ~187 results in the
4KB termination message limit (5.7x improvement).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@waveywaves waveywaves force-pushed the feat/compress-termination-results branch from bf8451c to 9a62b83 Compare April 16, 2026 11:12
@tekton-robot tekton-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 16, 2026
@waveywaves
Copy link
Copy Markdown
Member Author

/retest

Copy link
Copy Markdown
Member

@afrittoli afrittoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @waveywaves - it looks good, just a few minor comments.

/approve

Comment thread pkg/pod/pod.go
}

if featureFlags.EnableTerminationMessageCompression {
commonExtraEntrypointArgs = append(commonExtraEntrypointArgs, "-compress_termination_message=true")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @waveywaves - please make sure that this is also documented in the feature flag docs.

Comment on lines +10 to +12
# The task generates 20 image digest results. Without compression, this
# produces ~5KB of JSON, exceeding the 4KB Kubernetes termination message
# limit. With compression enabled (~40-50% reduction), it fits comfortably.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: wondering why is this in the no-ci folder? Couldn't we setup the test cluster with the flag enabled?

Comment thread pkg/pod/pod.go
}

if featureFlags.EnableTerminationMessageCompression {
commonExtraEntrypointArgs = append(commonExtraEntrypointArgs, "-compress_termination_message=true")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth logging a warning for the operator to catch as this would be an inconsistent configuration that needs to be detected

@tekton-robot
Copy link
Copy Markdown
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants