fix(test): resolve rate-limit failures when downloading datasets#9658

Open
shiva-istari wants to merge 3 commits into main from shiva/ci-downloads

Conversation

@shiva-istari (Contributor)

CI workflows on forked PRs fail with HTTP 429 (Too Many Requests) when downloading benchmark datasets from GitHub, because shared-IP runners hit rate limits on unauthenticated requests. This fix replaces GitHub web URLs with direct CDN URLs, adds retry logic with exponential backoff, limits concurrent downloads to 5, and cleans up partial files on failure. It also adds GitHub Actions caching to avoid re-downloading data files on repeat runs. Together these changes make benchmark data downloads reliable for both forked and internal CI workflows.

@shiva-istari shiva-istari requested a review from a team as a code owner March 16, 2026 10:13
@github-actions github-actions Bot added area/testing Testing related issues area/integrations Related to integrations with other projects. go Pull requests that update Go code labels Mar 16, 2026
Comment thread t/t.go Outdated
Comment thread .github/workflows/ci-dgraph-integration2-tests.yml Outdated
@mlwelles (Contributor) left a comment:

Additional findings beyond the code duplication and cache key concerns already noted:

Note: t/t.go:1133 still has var suffix = "?raw=true" which gets appended to media.githubusercontent.com URLs via baseUrl + name + suffix. The ?raw=true parameter is a GitHub blob-URL convention and is unnecessary for direct CDN URLs. Not harmful, but misleading — should be removed for consistency with the other URL updates in this PR.

Comment thread dgraphtest/load.go Outdated
Comment thread t/t.go
Comment thread t/t.go
matthewmcneely pushed a commit that referenced this pull request Mar 19, 2026
## Summary

- Bumps Trivy from `0.65.0` to `0.69.3` in `.trunk/trunk.yaml`
- Trivy `v0.65.0` was never published — the version scheme jumped from `v0.26.0` to `v0.69.2`, causing a 404 when trunk tries to download the hermetic tool binary
- This fixes the "Trunk Code Quality / Check" CI failure seen on #9658 and all other PRs
@shiva-istari shiva-istari requested a review from mlwelles March 23, 2026 04:42
Comment thread t/t.go Outdated
Comment thread t/t.go Outdated
Comment thread t/t.go Outdated
@mlwelles (Contributor) left a comment:
Re-review against 4ee619c. All eight threads from prior rounds are resolved cleanly — the new benchdata package matches what the last comment described, errors now propagate via errgroup, the --keep-data flag is wired through, the cache key is bound to benchmark-data-version, and LFS detection is dynamic via the Contents API. Nice work.

A few new findings on the introduced code; nothing blocking, but worth addressing before merge:

  • 🟡 No tests on benchdata despite hand-rolled LFS pointer parsing, retry/backoff, gzip-magic validation, and ref resolution priority — these are exactly the helpers that silently break.
  • 🟡 EnsureFiles ignores the context returned by errgroup.WithContext — first failure won't cancel sibling downloads.
  • 🟡 No per-attempt timeout on http.DefaultClient — a stalled connection hangs CI indefinitely.
  • 🔵 A few smaller consistency / correctness issues inline.

Inline comments follow.

Comment thread benchdata/benchdata.go
"time"

"golang.org/x/sync/errgroup"
)
@mlwelles, May 7, 2026:

risk: no _test.go in this package. The hand-rolled LFS pointer parser, retry/backoff, gzip-magic check, and DataRef precedence are pure functions ideal for table tests. Without them, any regression to (e.g.) parseLFSPointer for a pointer with CRLF line endings, or a DataRef precedence change, will fail silently in CI under cache-hit and surface only on the next miss — which by then nobody is watching for.

At minimum: parseLFSPointer, validateFile, verifyChecksum, DataRef, and an httptest-backed resolveAndDownload round-trip.
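A table test for the pointer parser could look like the sketch below. The `parseLFSPointer` implementation here is a hypothetical stand-in (the real benchdata helper may differ); the point is the table shape, including the CRLF case called out above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// lfsPointer and parseLFSPointer are hypothetical stand-ins for the
// benchdata helpers under review.
type lfsPointer struct {
	oid  string
	size int64
}

func parseLFSPointer(data []byte) *lfsPointer {
	// tolerate CRLF line endings, the regression case the review calls out
	text := strings.ReplaceAll(string(data), "\r\n", "\n")
	if !strings.HasPrefix(text, "version https://git-lfs.github.com/spec/v1") {
		return nil
	}
	p := &lfsPointer{}
	for _, line := range strings.Split(text, "\n") {
		if v, ok := strings.CutPrefix(line, "oid sha256:"); ok {
			p.oid = v
		} else if v, ok := strings.CutPrefix(line, "size "); ok {
			p.size, _ = strconv.ParseInt(strings.TrimSpace(v), 10, 64)
		}
	}
	if p.oid == "" || p.size == 0 {
		return nil
	}
	return p
}

func main() {
	cases := []struct {
		name string
		in   string
		want bool // should a pointer be detected?
	}{
		{"unix endings", "version https://git-lfs.github.com/spec/v1\noid sha256:abc\nsize 42\n", true},
		{"crlf endings", "version https://git-lfs.github.com/spec/v1\r\noid sha256:abc\r\nsize 42\r\n", true},
		{"small json schema", `{"schema": []}`, false},
	}
	for _, c := range cases {
		got := parseLFSPointer([]byte(c.in)) != nil
		fmt.Printf("%-18s detected=%v want=%v\n", c.name, got, c.want)
	}
}
```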

Comment thread benchdata/benchdata.go
paths[i] = filepath.Join(destDir, string(f))
}

g, _ := errgroup.WithContext(context.Background())
@mlwelles, May 7, 2026:

risk: ctx is discarded with _. With errgroup.WithContext, that context is cancelled when any goroutine returns a non-nil error — but downloadToFile calls http.NewRequest, not http.NewRequestWithContext, so a sibling failure won't abort in-flight downloads of large LFS blobs. On a 21M.rdf.gz failure you'll still wait for the other four to finish (or time out).

Fix: capture the ctx, plumb it into downloadToFile via http.NewRequestWithContext(ctx, ...), and into fetchRawBlob similarly. resolveAndDownload should accept ctx as its first arg.

Comment thread benchdata/benchdata.go
req.Header.Set("Authorization", "Bearer "+token)
}

resp, err := http.DefaultClient.Do(req)
@mlwelles, May 7, 2026:

risk: http.DefaultClient has no timeout. A stalled TCP connection (no FIN, no RST) can wedge a download forever — the CI job hangs until the workflow-level timeout, far past where a useful error would appear.

Also — http.DefaultClient is shared global state; if anything else in the binary tweaks it, this download silently inherits that. Use a package-private client:

var httpClient = &http.Client{Timeout: 10 * time.Minute}

Or better, derive per-attempt timeout from a context with deadline.

Comment thread benchdata/benchdata.go
return nil, err
}
// Read exactly 1KB. If there's more data, it's not an LFS pointer.
return buf[:n], nil
@mlwelles, May 7, 2026:

nit: the comment on L259 says "if there's more data, it's not an LFS pointer" but the code returns buf[:n] (1KB) regardless — parseLFSPointer will then check the prefix, find no LFS magic in the first 1KB of (e.g.) a small JSON schema, and correctly return nil. That works, but it's coincidental, not enforced. If a non-LFS file's first 1KB ever happens to start with the LFS magic prefix (extremely unlikely but not impossible for a corrupted commit), we'd misidentify it.

Consider: peek with bufio.Reader.Peek(maxPointerSize) and explicitly require the response body to be exactly the pointer length when LFS-detected. Or check Content-Length header, which the Contents API sets correctly.

Comment thread benchdata/benchdata.go

lfsInfo, err := detectLFS(path, ref)
if err != nil {
// GitHub API unavailable (rate limit, network). Try media URL first
@mlwelles, May 7, 2026:

risk: when detectLFS fails (rate limit / network), the code falls back to trying both CDN URLs sequentially. That doubles the request count on the very condition (rate limiting) the PR is trying to fix. If the GitHub Contents API is rate-limited, media.githubusercontent.com may also be — you'll burn 2× retries × 3 attempts × N files before failing.

Safer: distinguish 403/429 (rate limit, abort) from 5xx/network (transient, fall back). A 403 from the API means you don't have an authoritative answer about LFS-tracking; falling back is wrong on first principles.

Comment thread benchdata/benchdata.go
if bytes.HasPrefix(header, lfsPointerPrefix) {
return fmt.Errorf("file is a Git LFS pointer, not actual content")
}
if strings.HasSuffix(fpath, ".gz") && (n < 2 || !bytes.HasPrefix(header, gzipMagic)) {
@mlwelles, May 7, 2026:

nit: gzip-magic check is strings.HasSuffix(fpath, ".gz"). The TestDataFile constants encode this perfectly — a check on the filename string is one indirection too many. Consider exposing a (f TestDataFile) IsGzip() bool method or registering content-type metadata next to repoFilePath. This also lets you validate .rdf files (they should at least be UTF-8) and .schema files separately, instead of "everything not-gz is fine".
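One possible shape for the suggested method; `TestDataFile` mirrors the constant type mentioned above, but the specific constants and the metadata map here are a hypothetical sketch:

```go
package main

import "fmt"

// TestDataFile mirrors the constant type the review mentions.
type TestDataFile string

const (
	File21MRdfGz TestDataFile = "21million.rdf.gz"
	FileSchema   TestDataFile = "21million.schema"
)

// gzipped records content-type metadata next to the constants, so
// validation keys off the file identity rather than re-deriving it
// from a filename suffix at the call site.
var gzipped = map[TestDataFile]bool{
	File21MRdfGz: true,
}

func (f TestDataFile) IsGzip() bool { return gzipped[f] }

func main() {
	for _, f := range []TestDataFile{File21MRdfGz, FileSchema} {
		fmt.Printf("%s gzip=%v\n", f, f.IsGzip())
	}
}
```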

Comment thread benchdata/benchdata.go
return ref
}
}
return "main"
@mlwelles, May 7, 2026:

nit: silent fallback to "main" defeats the reproducibility goal of having a benchmark-data-version file. If the file exists but is empty/whitespace, falling back to main will silently use whatever HEAD points to. Prefer logging a warning when the file exists but yields no usable ref, so cache invalidation surprises (cache key from the file, but downloads from main) are discoverable.
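A sketch of the suggested behavior; the function signature, precedence order, and the example ref string are assumptions about the package, not its actual API:

```go
package main

import (
	"fmt"
	"log"
	"strings"
)

// resolveRef sketches a precedence of env var, then version file, then
// "main" — but a file that exists yet yields no usable ref now logs a
// warning instead of silently drifting to main.
func resolveRef(envRef, fileContents string, fileExists bool) string {
	if r := strings.TrimSpace(envRef); r != "" {
		return r
	}
	if fileExists {
		if r := strings.TrimSpace(fileContents); r != "" {
			return r
		}
		log.Println("warning: benchmark-data-version exists but is empty; falling back to main")
	}
	return "main"
}

func main() {
	fmt.Println(resolveRef("", "some-pinned-ref\n", true)) // file wins
	fmt.Println(resolveRef("", "   ", true))               // warns, then main
}
```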

Comment thread benchdata/benchdata.go
return nil
}

func pkgDir() string {
@mlwelles, May 7, 2026:

nit: pkgDir() uses runtime.Caller to find benchmark-data-version next to the source file. This breaks when the binary is built with -trimpath or shipped without source — thisFile becomes "benchdata/benchdata.go" (relative), and os.ReadFile will resolve it against CWD, which is whatever shell happened to invoke ./t.

Safer: //go:embed benchmark-data-version and read from the embedded FS. That also pins the version into the binary, which is exactly what you want for reproducibility.

Comment thread t/t.go

err := run()
_ = os.RemoveAll(*tmp)
if !*keepData && *tmp != "" {
@mlwelles, May 7, 2026:

nit: *tmp != "" guard is correct but the contract is murky — if run() populated *tmp then errored out before downloading anything, we still skip cleanup with --keep-data but otherwise remove it. That's the right behavior. However, if pflag.Parse() panics (malformed flag) the cleanup never runs at all. Consider defer after parse:

pflag.Parse()
defer func() {
    if !*keepData && *tmp != "" {
        _ = os.RemoveAll(*tmp)
    }
}()

Minor since the panic path is rare, but the defer form is also more idiomatic.

Comment thread dgraphtest/load.go

if _, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("error downloading file %s: %w", fname, err)
paths, err := benchdata.EnsureFiles(datasetFilesPath, benchdata.DataRef(""), benchdata.TestDataFile(filename))
@mlwelles, May 7, 2026:

nit: passing benchdata.DataRef("") here means dgraphtest callers can never override the ref, and silently get whichever ref the env var or version file resolves to. That's fine for now — but worth a comment noting the ref source isn't pluggable from this entry point, so future readers don't assume it is.

