fix(test): resolve rate-limit failures when downloading datasets#9658
shiva-istari wants to merge 3 commits into main
Conversation
Force-pushed from adc1378 to 12d07b5
mlwelles left a comment
Additional findings beyond the code duplication and cache key concerns already noted:
Note: `t/t.go:1133` still has `var suffix = "?raw=true"`, which gets appended to `media.githubusercontent.com` URLs via `baseUrl + name + suffix`. The `?raw=true` parameter is a GitHub blob-URL convention and is unnecessary for direct CDN URLs. Not harmful, but misleading — it should be removed for consistency with the other URL updates in this PR.
## Summary
- Bumps Trivy from `0.65.0` to `0.69.3` in `.trunk/trunk.yaml`
- Trivy `v0.65.0` was never published — the version scheme jumped from `v0.26.0` to `v0.69.2`, causing a 404 when trunk tries to download the hermetic tool binary
- This fixes the "Trunk Code Quality / Check" CI failure seen on #9658 and all other PRs
Force-pushed from 12d07b5 to 8175ecc
Force-pushed from 1515821 to 8e86446
Force-pushed from 8e86446 to ae0d4f7
Force-pushed from ae0d4f7 to ebcd77c
Force-pushed from ebcd77c to 4ee619c
mlwelles left a comment
Re-review against 4ee619c. All eight threads from prior rounds are resolved cleanly — the new benchdata package matches what the last comment described, errors now propagate via errgroup, the --keep-data flag is wired through, the cache key is bound to benchmark-data-version, and LFS detection is dynamic via the Contents API. Nice work.
A few new findings on the introduced code; nothing blocking, but worth addressing before merge:
- 🟡 No tests on `benchdata` despite hand-rolled LFS pointer parsing, retry/backoff, gzip-magic validation, and ref resolution priority — these are exactly the helpers that silently break.
- 🟡 `EnsureFiles` ignores the context returned by `errgroup.WithContext` — first failure won't cancel sibling downloads.
- 🟡 No per-attempt timeout on `http.DefaultClient` — a stalled connection hangs CI indefinitely.
- 🔵 A few smaller consistency / correctness issues inline.
Inline comments follow.
| "time" | ||
|
|
||
| "golang.org/x/sync/errgroup" | ||
| ) |
risk: no _test.go in this package. The hand-rolled LFS pointer parser, retry/backoff, gzip-magic check, and DataRef precedence are pure functions ideal for table tests. Without them, any regression to (e.g.) parseLFSPointer for a pointer with CRLF line endings, or a DataRef precedence change, will fail silently in CI under cache-hit and surface only on the next miss — which by then nobody is watching for.
At minimum: parseLFSPointer, validateFile, verifyChecksum, DataRef, and an httptest-backed resolveAndDownload round-trip.
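A minimal sketch of what one such table test could look like, assuming a `parseLFSPointer(data []byte) (*lfsPointer, error)` shape inferred from this review; the PR's actual signature and return types may differ:

```go
package benchdata

import "testing"

// Hypothetical table test for the pure pointer-parsing helper. The expected
// outcomes below are illustrative; the real value of such a table is pinning
// down edge-case behavior (e.g. CRLF line endings) so regressions fail loudly.
func TestParseLFSPointer(t *testing.T) {
	cases := []struct {
		name    string
		input   string
		wantPtr bool
	}{
		{"valid pointer", "version https://git-lfs.github.com/spec/v1\noid sha256:abc\nsize 42\n", true},
		{"CRLF endings", "version https://git-lfs.github.com/spec/v1\r\noid sha256:abc\r\nsize 42\r\n", true},
		{"plain JSON", `{"schema": []}`, false},
		{"empty input", "", false},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			ptr, err := parseLFSPointer([]byte(tc.input))
			if tc.wantPtr && (err != nil || ptr == nil) {
				t.Fatalf("want pointer, got ptr=%v err=%v", ptr, err)
			}
			if !tc.wantPtr && ptr != nil {
				t.Fatalf("want nil for non-pointer input, got %v", ptr)
			}
		})
	}
}
```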
```go
		paths[i] = filepath.Join(destDir, string(f))
	}

	g, _ := errgroup.WithContext(context.Background())
```
risk: ctx is discarded with _. With errgroup.WithContext, that context is cancelled when any goroutine returns a non-nil error — but downloadToFile calls http.NewRequest, not http.NewRequestWithContext, so a sibling failure won't abort in-flight downloads of large LFS blobs. On a 21M.rdf.gz failure you'll still wait for the other four to finish (or time out).
Fix: capture the ctx, plumb it into downloadToFile via http.NewRequestWithContext(ctx, ...), and into fetchRawBlob similarly. resolveAndDownload should accept ctx as its first arg.
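A minimal sketch of that plumbing; only the stdlib and errgroup calls are verbatim, and the helper shapes are assumptions standing in for the PR's `EnsureFiles`/`downloadToFile`:

```go
package benchdata

import (
	"context"
	"io"
	"net/http"
	"os"

	"golang.org/x/sync/errgroup"
)

// downloadAll is a hypothetical stand-in for EnsureFiles' download loop:
// capture the errgroup context and thread it into each request so the first
// failure cancels sibling downloads.
func downloadAll(urls, dests []string) error {
	g, ctx := errgroup.WithContext(context.Background())
	for i := range urls {
		url, dest := urls[i], dests[i]
		g.Go(func() error { return downloadToFile(ctx, url, dest) })
	}
	return g.Wait()
}

func downloadToFile(ctx context.Context, url, dest string) error {
	// NewRequestWithContext aborts the in-flight body read on cancellation.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	out, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, resp.Body)
	return err
}
```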
```go
		req.Header.Set("Authorization", "Bearer "+token)
	}

	resp, err := http.DefaultClient.Do(req)
```
risk: http.DefaultClient has no timeout. A stalled TCP connection (no FIN, no RST) can wedge a download forever — the CI job hangs until the workflow-level timeout, far past where a useful error would appear.
Also — http.DefaultClient is shared global state; if anything else in the binary tweaks it, this download silently inherits that. Use a package-private client:
```go
var httpClient = &http.Client{Timeout: 10 * time.Minute}
```
Or better, derive the per-attempt timeout from a context with a deadline.
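A sketch of the context-deadline variant; the helper name, the ten-minute figure, and reading the whole body into memory are all illustrative:

```go
package benchdata

import (
	"context"
	"io"
	"net/http"
	"time"
)

// getWithDeadline gives each attempt its own deadline derived from the
// caller's context, so a stalled connection fails that attempt quickly
// instead of wedging the whole download.
func getWithDeadline(ctx context.Context, url string, perAttempt time.Duration) ([]byte, error) {
	attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
	// Safe to cancel on return because the body is fully read below; a
	// streaming variant would need to keep the context alive until the
	// copy to disk finishes.
	defer cancel()
	req, err := http.NewRequestWithContext(attemptCtx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}
```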
```go
		return nil, err
	}
	// Read exactly 1KB. If there's more data, it's not an LFS pointer.
	return buf[:n], nil
```
nit: the comment on L259 says "if there's more data, it's not an LFS pointer" but the code returns buf[:n] (1KB) regardless — parseLFSPointer will then check the prefix, find no LFS magic in the first 1KB of (e.g.) a small JSON schema, and correctly return nil. That works, but it's coincidental, not enforced. If a non-LFS file's first 1KB ever happens to start with the LFS magic prefix (extremely unlikely but not impossible for a corrupted commit), we'd misidentify it.
Consider: peek with bufio.Reader.Peek(maxPointerSize) and explicitly require the response body to be exactly the pointer length when LFS-detected. Or check Content-Length header, which the Contents API sets correctly.
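A sketch of the Peek-based approach; the constant and prefix values are assumptions standing in for whatever the PR's benchdata package defines:

```go
package benchdata

import (
	"bufio"
	"bytes"
	"io"
)

// Assumed values; the real definitions live in the PR's benchdata package.
const maxPointerSize = 1024

var lfsPointerPrefix = []byte("version https://git-lfs.github.com/spec/v1")

// looksLikeLFSPointer inspects the first bytes of the body without consuming
// them, then hands back a reader still positioned at the start either way.
func looksLikeLFSPointer(r io.Reader) (*bufio.Reader, bool) {
	br := bufio.NewReaderSize(r, maxPointerSize)
	// Peek returns whatever is buffered on a short read, which is enough
	// for a prefix check.
	head, _ := br.Peek(maxPointerSize)
	return br, bytes.HasPrefix(head, lfsPointerPrefix)
}
```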
```go
	lfsInfo, err := detectLFS(path, ref)
	if err != nil {
		// GitHub API unavailable (rate limit, network). Try media URL first
```
risk: when detectLFS fails (rate limit / network), the code falls back to trying both CDN URLs sequentially. That doubles the request count on the very condition (rate limiting) the PR is trying to fix. If the GitHub Contents API is rate-limited, media.githubusercontent.com may also be — you'll burn 2 fallback URLs × 3 retry attempts × N files before failing.
Safer: distinguish 403/429 (rate limit, abort) from 5xx/network (transient, fall back). A 403 from the API means you don't have an authoritative answer about LFS-tracking; falling back is wrong on first principles.
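A sketch of that triage, with an illustrative helper name:

```go
package benchdata

import "net/http"

// shouldFallBack decides whether a failed Contents API call justifies
// falling back to the CDN URLs: rate limiting is treated as authoritative
// (abort), while server/network failures are treated as transient.
func shouldFallBack(status int, err error) bool {
	if err != nil {
		return true // network-level error: transient, try the media URL
	}
	switch {
	case status == http.StatusForbidden, status == http.StatusTooManyRequests:
		return false // rate limited: the CDN is likely throttled too; abort
	case status >= 500:
		return true // server error: transient, safe to fall back
	default:
		return false
	}
}
```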
```go
	if bytes.HasPrefix(header, lfsPointerPrefix) {
		return fmt.Errorf("file is a Git LFS pointer, not actual content")
	}
	if strings.HasSuffix(fpath, ".gz") && (n < 2 || !bytes.HasPrefix(header, gzipMagic)) {
```
nit: gzip-magic check is strings.HasSuffix(fpath, ".gz"). The TestDataFile constants encode this perfectly — a check on the filename string is one indirection too many. Consider exposing a (f TestDataFile) IsGzip() bool method or registering content-type metadata next to repoFilePath. This also lets you validate .rdf files (they should at least be UTF-8) and .schema files separately, instead of "everything not-gz is fine".
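A sketch of the method form; TestDataFile is assumed to be string-backed, as the call sites in this PR suggest:

```go
package benchdata

import "strings"

// IsGzip reports whether this data file is expected to be gzip-compressed.
// Centralizing the check on the type keeps validateFile from re-parsing the
// destination path.
func (f TestDataFile) IsGzip() bool {
	return strings.HasSuffix(string(f), ".gz")
}
```

validateFile can then branch on `f.IsGzip()` instead of re-deriving the answer from `fpath`.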
```go
			return ref
		}
	}
	return "main"
```
nit: silent fallback to "main" defeats the reproducibility goal of having a benchmark-data-version file. If the file exists but is empty/whitespace, falling back to main will silently use whatever HEAD points to. Prefer logging a warning when the file exists but yields no usable ref, so cache invalidation surprises (cache key from the file, but downloads from main) are discoverable.
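A sketch of that warning; the function shape and the hypothetical versionFilePath parameter stand in for however the PR locates the file:

```go
package benchdata

import (
	"log"
	"os"
	"strings"
)

// resolveRef is a hypothetical stand-in for the PR's ref resolution.
func resolveRef(versionFilePath string) string {
	if data, err := os.ReadFile(versionFilePath); err == nil {
		if ref := strings.TrimSpace(string(data)); ref != "" {
			return ref
		}
		// The file exists but is empty/whitespace: say so loudly, instead of
		// silently tracking main while the cache key says otherwise.
		log.Printf("warning: %s contains no usable ref; falling back to main", versionFilePath)
	}
	return "main"
}
```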
```go
	return nil
}

func pkgDir() string {
```
nit: pkgDir() uses runtime.Caller to find benchmark-data-version next to the source file. This breaks when the binary is built with -trimpath or shipped without source — thisFile becomes "benchdata/benchdata.go" (relative), and os.ReadFile will resolve it against CWD, which is whatever shell happened to invoke ./t.
Safer: //go:embed benchmark-data-version and read from the embedded FS. That also pins the version into the binary, which is exactly what you want for reproducibility.
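A sketch of the embed-based alternative; the file name matches the PR, but the variable and helper names are illustrative:

```go
package benchdata

import (
	_ "embed"
	"strings"
)

// Embedding pins the version file into the binary, so -trimpath builds and
// binaries shipped without source stop mattering.
//
//go:embed benchmark-data-version
var benchmarkDataVersion string

func pinnedRef() string {
	return strings.TrimSpace(benchmarkDataVersion)
}
```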
```diff
 	err := run()
-	_ = os.RemoveAll(*tmp)
+	if !*keepData && *tmp != "" {
```
nit: *tmp != "" guard is correct but the contract is murky — if run() populated *tmp then errored out before downloading anything, we still skip cleanup with --keep-data but otherwise remove it. That's the right behavior. However, if pflag.Parse() panics (malformed flag) the cleanup never runs at all. Consider defer after parse:
```go
pflag.Parse()
defer func() {
	if !*keepData && *tmp != "" {
		_ = os.RemoveAll(*tmp)
	}
}()
```
Minor since the panic path is rare, but the defer form is also more idiomatic.
```diff
-	if _, err := cmd.CombinedOutput(); err != nil {
-		return fmt.Errorf("error downloading file %s: %w", fname, err)
+	paths, err := benchdata.EnsureFiles(datasetFilesPath, benchdata.DataRef(""), benchdata.TestDataFile(filename))
```
nit: passing benchdata.DataRef("") here means dgraphtest callers can never override the ref, and silently get whichever ref the env var or version file resolves to. That's fine for now — but worth a comment noting the ref source isn't pluggable from this entry point, so future readers don't assume it is.
CI workflows on forked PRs fail with HTTP 429 (Too Many Requests) when downloading benchmark datasets from GitHub, because shared-IP runners hit rate limits on unauthenticated requests. This fix replaces GitHub web URLs with direct CDN URLs, adds retry logic with exponential backoff, limits concurrent downloads to 5, and cleans up partial files on failure. It also adds GitHub Actions caching to avoid re-downloading data files on repeat runs. Together these changes make benchmark data downloads reliable for both forked and internal CI workflows.
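For readers landing here first, a minimal sketch of the retry shape described above; the attempt count and base delay are illustrative, not the PR's actual constants:

```go
package benchdata

import (
	"context"
	"fmt"
	"time"
)

// withRetry runs fn up to `attempts` times with exponential backoff between
// failures, respecting context cancellation during the wait.
func withRetry(ctx context.Context, attempts int, base time.Duration, fn func(context.Context) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(ctx); err == nil {
			return nil
		}
		delay := base << uint(i) // 1x, 2x, 4x, ...
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("all %d attempts failed: %w", attempts, err)
}
```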