Retry S3 chunk validation failures and preserve the underlying error#335
Draft
folbricht wants to merge 1 commit into
Draft
Retry S3 chunk validation failures and preserve the underlying error#335folbricht wants to merge 1 commit into
folbricht wants to merge 1 commit into
Conversation
A transient short read of a chunk body from an S3-compatible store (e.g. Cloudflare R2) leaves S3Store.GetChunk with truncated data that fails to decompress. Chunk.ID() discarded that error and returned the zero ChunkID, so the failure surfaced as the misleading "chunk id <id> does not match its hash 0000...0000", and GetChunk's --error-retry loop didn't cover it, making the transient failure fatal. - ChunkInvalid now carries an optional underlying cause (Err) and implements Unwrap(); its message reports the cause when present. - NewChunkFromStorage / NewChunkWithID capture the decode error instead of producing a bogus zero-hash mismatch, still returning ChunkInvalid so RepairableCache and 'verify --repair' keep handling such chunks. - S3Store.GetChunk retries NewChunkFromStorage failures under the existing --error-retry policy and closes the object reader before retrying. - Tests: new s3ErrCorruptBody fake-server mode plus corrupt_body_fail / corrupt_body_recover subtests, and a chunk_test.go covering ChunkInvalid.Err. Fixes #334
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #334.
Problem
desync untaragainst an S3-compatible store (Cloudflare R2 in the report) intermittently failed with:The object in the store is fine. The real cause is a transient short read of the chunk body somewhere between R2 and
minio-go/the HTTP transport. When that happens,io.ReadAllinS3Store.GetChunkreturns truncated bytes,NewChunkFromStorage→Chunk.ID()→Chunk.Data()fails to zstd-decode the partial stream, butChunk.ID()discards that error and returns the zeroChunkID{}— so it's reported as a hash mismatch against0000…, hiding the real error. AndGetChunk's--error-retryloop only coveredGetObjectandio.ReadAll, notNewChunkFromStorage, so the transient failure was fatal.Changes
errors.go—ChunkInvalidgets an optionalErr(underlying cause) and anUnwrap();Error()reports the cause when present instead of the bogus zero-hash message.chunk.go—NewChunkFromStorage/NewChunkWithIDcallc.Data()explicitly and returnChunkInvalid{ID: id, Err: err}on a decode failure, rather than lettingID()swallow it. StillChunkInvalid, soRepairableCacheandverify --repairkeep self-healing such chunks.s3.go—S3Store.GetChunkretriesNewChunkFromStoragevalidation/decompression failures under the existing--error-retrypolicy (same backoff as the other retry blocks), logging each retry, and closes the object reader before looping so deferred closes don't stack.s3ErrCorruptBodymode in the fake S3 server (well-formed HTTP response with a truncated body — the issue's scenario) pluscorrupt_body_fail/corrupt_body_recoversubtests, and a newchunk_test.gocovering theChunkInvalid.Errpaths.go build ./...,go vet ./...,gofmt -l ., andgo test ./...all pass.Out of scope
gcs.goandsftp.goGetChunkhave no--error-retryloop at all, andremotehttp.gohas one but doesn't coverNewChunkFromStorage. They all benefit from the improved error message here, but adding retry loops there is a separate change — happy to include it if wanted.