feat: compress termination messages to fit more results in 4KB limit#9682
feat: compress termination messages to fit more results in 4KB limit#9682waveywaves wants to merge 1 commit intotektoncd:mainfrom
Conversation
E2E Test Results — Kind Cluster (k8s v1.29.2)Deployed Tekton from this branch via Without compression (baseline)
Max capacity without compression: ~33 results With compression (
|
| Results | Termination Message | Status |
|---|---|---|
| 35 | 899 bytes | PASS |
| 100 | 2241 bytes | PASS |
| 150 | 3307 bytes | PASS |
| 175 | 3849 bytes | PASS |
| 180 | 3952 bytes | PASS |
| 185 | 4061 bytes | PASS |
| 200 | >4096 bytes | FAIL |
Max capacity with compression: ~187 results
Summary
Result type: ~100 byte image digest (gcr.io/project/image-NNN@sha256:64-hex-chars)
Without compression: ~33 results
With compression: ~187 results
Improvement: 5.7x capacity increase
All results verified intact after round-trip (first/last result values checked, result count matched). The tknz: prefix was present in the termination message, confirming compression was active.
There was a problem hiding this comment.
A few things to address before this can move forward.
Should fix
-
No e2e test with the flag enabled. The unit tests in
pkg/termination/are thorough, but there's no integration test proving the full loop on a real cluster: feature flag → pod builder → entrypoint → compressed write → reconciler parse → results in.status.taskResults. At minimum, add one test to the alpha e2e suite that creates a TaskRun with enough results to exceed 4KB uncompressed and asserts they all appear in status. -
No tests in
pkg/pod/pod_test.goorpkg/entrypoint/entrypointer_test.gofor the new flag plumbing. The termination package is well-covered but the integration points aren't. -
OpenAPI spec not regenerated.
EnableTerminationMessageCompressionwas added to theFeatureFlagsstruct butopenapi_generated.gowasn't updated. Run./hack/update-openapigen.sh. -
CRD schema entries missing
description. TheenableTerminationMessageCompressionfield is added in 8 places across the CRD YAMLs but without adescription, unlike all other feature flags there.
Strategic context
I want to be transparent about the bigger picture here. There is a TEP in progress (TEP-0164) that addresses the root cause — storing results in external backends (OCI/S3/GCS) so the 4KB limit doesn't apply at all. That TEP also unifies results with TEP-0147 artifacts and converges with TEP-0139 trusted artifacts.
That said, TEP-0164 is still in proposal stage and won't land for a while. Users are hitting the 4KB wall today (as the issues you reference show), and this is a reasonable opt-in stop-gap: zero new deps, backward compatible, meaningful capacity improvement.
I'd suggest two things to make the relationship clear:
- Add a note to the feature flag description (in
config-feature-flags.yamland docs) that this is a short-term measure, and that external result storage (TEP-0164) will address the underlying limitation. - Reference TEP-0164 in the PR description alongside the other related issues, so the community can see the full picture.
With the technical items above addressed, I'm supportive of this going in as an alpha feature.
| } | ||
|
|
||
| if featureFlags.EnableTerminationMessageCompression { | ||
| commonExtraEntrypointArgs = append(commonExtraEntrypointArgs, "-compress_termination_message=true") |
There was a problem hiding this comment.
Must fix: This is not gated on the result extraction method. When results-from: sidecar-logs is active, the sidecar reads results from disk, not the termination message — compression is meaningless and adds confusion.
| commonExtraEntrypointArgs = append(commonExtraEntrypointArgs, "-compress_termination_message=true") | |
| if featureFlags.EnableTerminationMessageCompression && !sidecarLogsResultsEnabled { | |
| commonExtraEntrypointArgs = append(commonExtraEntrypointArgs, "-compress_termination_message=true") | |
| } |
There was a problem hiding this comment.
Done. Added && !sidecarLogsResultsEnabled guard. Also added TestPodBuild_CompressTerminationMessage with 3 subtests covering: enabled+termination-message (arg present), enabled+sidecar-logs (arg skipped), disabled (arg absent).
There was a problem hiding this comment.
Thanks @waveywaves - please make sure that this is also documented in the feature flag docs.
There was a problem hiding this comment.
Might be worth logging a warning for the operator to catch as this would be an inconsistent configuration that needs to be detected
| var wErr error | ||
| if e.CompressTerminationMessage { | ||
| wErr = termination.WriteCompressedMessage(e.TerminationPath, output) | ||
| } else { | ||
| wErr = termination.WriteMessage(e.TerminationPath, output) | ||
| } | ||
| if wErr != nil { |
There was a problem hiding this comment.
Must fix: This exact if compress { WriteCompressed } else { Write } block is duplicated at L478-485 in readResultsFromDisk. Extract a helper to avoid divergence:
func (e Entrypointer) writeTerminationMessage(path string, results []result.RunResult) error {
if e.CompressTerminationMessage {
return termination.WriteCompressedMessage(path, results)
}
return termination.WriteMessage(path, results)
}There was a problem hiding this comment.
Done. Extracted writeTerminationMessage helper method. Both call sites now use it. Added TestWriteTerminationMessage_Compression with 2 subtests (plain vs compressed output).
| github.com/prometheus/client_golang v1.23.2 // indirect | ||
| github.com/prometheus/client_model v0.6.2 // indirect | ||
| github.com/prometheus/common v0.67.5 // indirect | ||
| github.com/prometheus/common v0.67.5 |
There was a problem hiding this comment.
Nit: Unrelated change — github.com/prometheus/common went from // indirect to direct but nothing in the codebase imports it. Looks like an accidental go mod tidy artifact, please revert.
| if err := json.Unmarshal([]byte(msg), &r); err != nil { | ||
| return nil, fmt.Errorf("parsing message json: %w, msg: %s", err, msg) | ||
| if err := json.Unmarshal([]byte(jsonMsg), &r); err != nil { | ||
| return nil, fmt.Errorf("parsing message json: %w, msg: %s", err, jsonMsg) |
There was a problem hiding this comment.
Nit: On JSON parse failure after decompression, this includes the full decompressed payload in the error message. For large payloads that could produce enormous log entries. Consider truncating to first N bytes.
There was a problem hiding this comment.
Done. Truncated to 256 bytes with ...(truncated) suffix.
| EnableWaitExponentialBackoff bool `json:"enableWaitExponentialBackoff,omitempty"` | ||
| EnableTerminationMessageCompression bool `json:"enableTerminationMessageCompression,omitempty"` |
There was a problem hiding this comment.
Nit: Inconsistent alignment — these two fields have extra whitespace padding that doesn't match the rest of the struct.
e4cec69 to
bf1e5f0
Compare
bf1e5f0 to
b57cd0e
Compare
|
|
||
| // maxDecompressedSize is the upper bound for decompressed termination messages. | ||
| // The pre-compression JSON was at most 4KB, but with compression we allow | ||
| // larger payloads. 1MB is a safe limit that prevents decompression bombs |
There was a problem hiding this comment.
Isn't 1MB too generous? How much max payload are we expecting? 128 KB? Or is 1 MB really possible except in case of attack?
There was a problem hiding this comment.
Good question. The pre-compression input is capped at 4KB by Kubernetes (termination message limit). With flate, the theoretical max decompressed output from 4KB compressed is about 64KB for typical JSON.
1MB is a safety net against decompression bombs. 128KB would work for legitimate use — happy to lower it. The key constraint: must be bounded (not unlimited) and larger than any legitimate 4KB-compressed payload could produce.
There was a problem hiding this comment.
Done — changed to 128KB.
| existing, err := parseExisting(fileContents) | ||
| if err == nil { | ||
| pro = append(existing, pro...) | ||
| } |
There was a problem hiding this comment.
Why weren't we surfacing error earlier? Is it still valid now to not surface it if error vectors have increased leading to loss of results?
There was a problem hiding this comment.
The pre-existing code silently swallows parse errors from the existing file (line 57-59) — this was intentional so a corrupt termination message doesn't prevent a step from completing and writing its own results.
With compression, the error vectors haven't increased: compressed messages auto-detect via the tknz: prefix, and if decompression fails, the same fallback applies (treat as corrupt, overwrite with new results).
If you think we should log a warning when dropping corrupt existing results, I can add that — though it would be a behavior change for the non-compressed path too.
There was a problem hiding this comment.
Updated — now logs a warning when existing termination message can't be parsed:
slog.Warn("Failed to parse existing termination message, previous results will be lost",
"error", parseErr, "path", path)This applies to both compressed and non-compressed paths. Results are still written (step doesn't fail), but the data loss is no longer silent.
b57cd0e to
23c80a4
Compare
|
/retest |
23c80a4 to
bf8451c
Compare
Adds optional zlib compression (gated by enable-termination-message-compression feature flag) to the entrypoint binary's termination message writer, with decompression on the controller side. Falls back to uncompressed format when compression is disabled or the controller encounters uncompressed data. This increases effective result capacity from ~33 to ~187 results in the 4KB termination message limit (5.7x improvement). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bf8451c to
9a62b83
Compare
|
/retest |
afrittoli
left a comment
There was a problem hiding this comment.
Thanks @waveywaves - it looks good, just a few minor comments.
/approve
| } | ||
|
|
||
| if featureFlags.EnableTerminationMessageCompression { | ||
| commonExtraEntrypointArgs = append(commonExtraEntrypointArgs, "-compress_termination_message=true") |
There was a problem hiding this comment.
Thanks @waveywaves - please make sure that this is also documented in the feature flag docs.
| # The task generates 20 image digest results. Without compression, this | ||
| # produces ~5KB of JSON, exceeding the 4KB Kubernetes termination message | ||
| # limit. With compression enabled (~40-50% reduction), it fits comfortably. |
There was a problem hiding this comment.
NIT: wondering why is this in the no-ci folder? Couldn't we setup the test cluster with the flag enabled?
| } | ||
|
|
||
| if featureFlags.EnableTerminationMessageCompression { | ||
| commonExtraEntrypointArgs = append(commonExtraEntrypointArgs, "-compress_termination_message=true") |
There was a problem hiding this comment.
Might be worth logging a warning for the operator to catch as this would be an inconsistent configuration that needs to be detected
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: afrittoli The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Changes
Add optional flate compression for termination messages, gated behind the
enable-termination-message-compressionalpha feature flag. When enabled, resultsare compressed with Go stdlib
compress/flateand base64-encoded before writingto the termination message path, effectively increasing the usable capacity from
~33 to ~187 results (5.7x) for typical image digest payloads.
The problem
Kubernetes limits termination messages to 4KB per container, and the total across
all containers in a pod is capped at 12KB (divided by container count). With multiple
steps, each container may get as little as 1KB for results (#4808). The sidecar logs
method (TEP-0127) bypasses this but adds a sidecar container to every TaskRun,
which is wasteful when most tasks don't need large results (#8448).
The solution
Compress the JSON payload before writing to the termination message path. The parser
auto-detects compressed messages via a
tknz:prefix, maintaining full backwardcompatibility with uncompressed messages.
E2E test results (kind cluster, k8s v1.29.2)
Max capacity: ~33 → ~187 results (5.7x improvement)
Implementation
enable-termination-message-compression(alpha, default:false)PerFeatureFlagwithAlphaAPIFieldsstabilityconfig/config-feature-flags.yamlresults-from: termination-message(default)pkg/termination/write.gocompresses withflate.BestCompression+base64.RawStdEncoding, prefixed withtknz:O_TRUNCbug in file rewritepkg/termination/parse.goauto-detectstknz:prefix, decompresses transparentlyio.LimitReaderwith 1MB cap + explicit size checkpkg/pod/pod.go→cmd/entrypoint/main.go→EntrypointerTest coverage
examples/v1/*/alpha/Related Issues
/kind feature
Submitter Checklist
Release Notes