
Retry S3 chunk validation failures and preserve the underlying error #335

Draft

folbricht wants to merge 1 commit into master from fix-334-s3-chunk-validation-retry

Conversation

@folbricht
Owner

Fixes #334.

Problem

desync untar against an S3-compatible store (Cloudflare R2 in the report) intermittently failed with:

chunk id <ID> does not match its hash 0000000000000000000000000000000000000000000000000000000000000000: unexpected EOF

The object in the store is fine. The real cause is a transient short read of the chunk body somewhere between R2 and minio-go/the HTTP transport. When that happens, io.ReadAll in S3Store.GetChunk returns truncated bytes and, inside NewChunkFromStorage, Chunk.Data() fails to zstd-decode the partial stream. Chunk.ID() discards that error and returns the zero ChunkID{}, so the failure is reported as a hash mismatch against 0000…, hiding the real error. On top of that, GetChunk's --error-retry loop only covered GetObject and io.ReadAll, not NewChunkFromStorage, so the transient failure was fatal.
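For illustration only, a minimal Go sketch of that error-swallowing shape; the type names and the SHA-512/256 digest are simplified stand-ins, not the actual chunk.go code:

```go
package main

import (
	"crypto/sha512"
	"errors"
	"fmt"
)

// Simplified stand-ins for the desync types; names and digest choice are
// assumptions for illustration, not the real source.
type ChunkID [32]byte

type Chunk struct{}

// Data would normally zstd-decode the stored bytes; here it fails the way a
// truncated body does.
func (c *Chunk) Data() ([]byte, error) {
	return nil, errors.New("unexpected EOF")
}

// ID swallows the decode error, so callers see an all-zero hash instead of
// the real cause: the behavior this PR removes.
func (c *Chunk) ID() ChunkID {
	b, err := c.Data()
	if err != nil {
		return ChunkID{} // source of the bogus 0000... "hash"
	}
	return ChunkID(sha512.Sum512_256(b))
}

func main() {
	var c Chunk
	fmt.Printf("%x\n", c.ID()) // 64 zeros; the EOF error is gone
}
```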

Changes

  • errors.go — ChunkInvalid gets an optional Err (underlying cause) and an Unwrap(); Error() reports the cause when present instead of the bogus zero-hash message (see the first sketch after this list).
  • chunk.go — NewChunkFromStorage / NewChunkWithID call c.Data() explicitly and return ChunkInvalid{ID: id, Err: err} on a decode failure, rather than letting ID() swallow it. Still ChunkInvalid, so RepairableCache and verify --repair keep self-healing such chunks.
  • s3.go — S3Store.GetChunk retries NewChunkFromStorage validation/decompression failures under the existing --error-retry policy (same backoff as the other retry blocks), logging each retry, and closes the object reader before looping so deferred closes don't stack (see the retry sketch after this list).
  • Tests — new s3ErrCorruptBody mode in the fake S3 server (a well-formed HTTP response with a truncated body, the issue's scenario) plus corrupt_body_fail / corrupt_body_recover subtests, and a new chunk_test.go covering the ChunkInvalid.Err paths.
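
A sketch of the errors.go change described above, assuming ChunkInvalid keeps ID/Sum fields; the exact shape and message format in the real errors.go may differ:

```go
package main

import (
	"errors"
	"fmt"
)

// Placeholder ID type; the real one is desync's ChunkID.
type ChunkID [32]byte

// ChunkInvalid with an optional underlying cause, as described above. Field
// names follow the PR text but are not copied from the repository.
type ChunkInvalid struct {
	ID  ChunkID
	Sum ChunkID
	Err error // optional underlying cause, e.g. a zstd decode error
}

func (e ChunkInvalid) Error() string {
	if e.Err != nil {
		return fmt.Sprintf("chunk %x is invalid: %v", e.ID, e.Err)
	}
	return fmt.Sprintf("chunk id %x does not match its hash %x", e.ID, e.Sum)
}

// Unwrap lets errors.Is / errors.As reach the underlying cause.
func (e ChunkInvalid) Unwrap() error { return e.Err }

func main() {
	cause := errors.New("unexpected EOF")
	var err error = ChunkInvalid{ID: ChunkID{0xde, 0xad}, Err: cause}
	fmt.Println(err)                   // reports the cause, not a zero-hash mismatch
	fmt.Println(errors.Is(err, cause)) // true, thanks to Unwrap
}
```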
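And a sketch of the retry shape in s3.go: validation failures are retried under the same loop as read errors, and the reader is closed before each new attempt. The helper names (getObject, newChunkFromStorage), the fields on S3Store, and the fixed-interval backoff are placeholders standing in for the real minio calls and --error-retry policy:

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"time"
)

// Placeholder types; the real ones live in chunk.go / s3.go.
type ChunkID [32]byte
type Chunk struct{ id ChunkID }

type S3Store struct {
	retries   int
	retryWait time.Duration
}

// getObject stands in for the minio GetObject call; it fails the first
// attempt with a short body and then succeeds, to exercise the retry path.
var calls int

func (s *S3Store) getObject(id ChunkID) (io.ReadCloser, error) {
	calls++
	if calls == 1 {
		return io.NopCloser(bytes.NewReader([]byte("trunc"))), nil
	}
	return io.NopCloser(bytes.NewReader([]byte("full-chunk-body"))), nil
}

// newChunkFromStorage stands in for desync.NewChunkFromStorage: it rejects
// the truncated body the way a failed zstd decode would.
func newChunkFromStorage(id ChunkID, b []byte) (*Chunk, error) {
	if len(b) < 10 {
		return nil, errors.New("unexpected EOF")
	}
	return &Chunk{id: id}, nil
}

func (s *S3Store) GetChunk(id ChunkID) (*Chunk, error) {
	var lastErr error
	for attempt := 0; attempt <= s.retries; attempt++ {
		if attempt > 0 {
			time.Sleep(s.retryWait) // same backoff for read and validation errors
		}
		obj, err := s.getObject(id)
		if err != nil {
			lastErr = err
			continue
		}
		b, err := io.ReadAll(obj)
		obj.Close() // close before looping so deferred closes don't stack
		if err != nil {
			lastErr = err
			continue
		}
		c, err := newChunkFromStorage(id, b)
		if err != nil {
			lastErr = err // validation failure is now retried, not fatal
			continue
		}
		return c, nil
	}
	return nil, lastErr
}

func main() {
	s := &S3Store{retries: 2, retryWait: 10 * time.Millisecond}
	c, err := s.GetChunk(ChunkID{})
	fmt.Println(c != nil, err) // true <nil>: the second attempt succeeded
}
```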

go build ./..., go vet ./..., gofmt -l ., and go test ./... all pass.

Out of scope

gcs.go and sftp.go GetChunk have no --error-retry loop at all, and remotehttp.go has one but doesn't cover NewChunkFromStorage. They all benefit from the improved error message here, but adding retry loops there is a separate change — happy to include it if wanted.

A transient short read of a chunk body from an S3-compatible store (e.g.
Cloudflare R2) leaves S3Store.GetChunk with truncated data that fails to
decompress. Chunk.ID() discarded that error and returned the zero ChunkID,
so the failure surfaced as the misleading "chunk id <id> does not match its
hash 0000...0000", and GetChunk's --error-retry loop didn't cover it, making
the transient failure fatal.

- ChunkInvalid now carries an optional underlying cause (Err) and implements
  Unwrap(); its message reports the cause when present.
- NewChunkFromStorage / NewChunkWithID capture the decode error instead of
  producing a bogus zero-hash mismatch, still returning ChunkInvalid so
  RepairableCache and 'verify --repair' keep handling such chunks.
- S3Store.GetChunk retries NewChunkFromStorage failures under the existing
  --error-retry policy and closes the object reader before retrying.
- Tests: new s3ErrCorruptBody fake-server mode plus corrupt_body_fail /
  corrupt_body_recover subtests, and a chunk_test.go covering ChunkInvalid.Err.

Fixes #334