ci: BuildKit npm cache mount to fix ECONNRESET docker-build flake#2562
Merged
Conversation
0aade03 to
f1fcafe
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
This PR reduces flaky CI Docker builds caused by intermittent npm network errors by switching remaining Dockerfiles from npm install to npm ci, enabling BuildKit’s npm cache mounts, and ensuring CI workflows set up Buildx where required. It also fixes a concurrency-group collision that could cause one Jest workflow to cancel the other on the same ref.
Changes:
- Enable BuildKit frontend (
# syntax=docker/dockerfile:1.4) and addRUN --mount=type=cache,target=/root/.npmto npm install steps across Dockerfiles. - Replace remaining
npm installusages withnpm ciwhere lockfiles are present. - Add
docker/setup-buildx-action@v3topython-ci.ymland fix Jest workflow concurrency groups to avoid cross-workflow cancellation.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
server/Dockerfile |
Enables BuildKit frontend and uses npm cache mounts; switches dev-stage install to npm ci. |
oidc-simulator/Dockerfile |
Enables BuildKit frontend; switches npm install to cached npm ci. |
file-server/Dockerfile |
Enables BuildKit frontend; adds npm cache mounts for all multi-stage client builds and file-server install. |
client-report/Dockerfile |
Enables BuildKit frontend; adds npm cache mount to npm ci. |
client-participation/Dockerfile |
Enables BuildKit frontend; adds npm cache mount to npm ci. |
client-participation-alpha/Dockerfile |
Enables BuildKit frontend; switches npm install to cached npm ci. |
client-admin/Dockerfile |
Enables BuildKit frontend; adds npm cache mount to npm ci. |
.github/workflows/python-ci.yml |
Adds Buildx setup so BuildKit cache-mount Dockerfile syntax is supported in this workflow. |
.github/workflows/jest-client-report-test.yml |
Makes concurrency group workflow-specific to prevent collisions with other Jest workflows. |
.github/workflows/jest-client-admin-test.yml |
Makes concurrency group workflow-specific to prevent collisions with other Jest workflows. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
79b6424 to
5e34f10
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
server/Dockerfile:5
- The Dockerfile now relies on BuildKit features (
RUN --mount=type=cache) but the local build instructions still show a plaindocker build ...without indicating BuildKit must be enabled. On environments where BuildKit is disabled, that command will fail with an “unknown flag: mount” error; please update the header comment to reflect the BuildKit requirement.
# syntax=docker/dockerfile:1.4
# To build locally, for example:
# docker build -t polis-server:local .
# To run locally, for example:
# docker run --rm --name polis-server --env-file .env --network polis-dev_polis-net \
Comment on lines
+1
to
4
| # syntax=docker/dockerfile:1.4 | ||
| # Test this Dockerfile build with: | ||
| # (from the polis root directory) | ||
| # docker build -t polis-file-server:dev -f file-server/Dockerfile . |
The `Build Docker images` step in the E2E, Server, and Delphi CI workflows intermittently fails with `npm error code ECONNRESET` during `npm install`. PR #2344 already set `fetch-retries 5`, but the flake persists — observed ~7 times during the 2026-06-09/10 stack sessions. Adds BuildKit cache mounts (`--mount=type=cache,target=/root/.npm`) to every npm-based Dockerfile. The npm cache survives across builds and, more importantly, across npm's own retries within a single build, so a partial fetch interrupted by a TCP reset doesn't restart from scratch. Also: - Converts the three remaining `npm install` callers to `npm ci` (client-participation-alpha, oidc-simulator, server dev stage). Lockfiles already exist in all three. `server`'s dev stage also gets `--production=false` for symmetry with the prod stage — prevents devDeps being silently skipped if someone builds with NODE_ENV overridden to `production` while targeting `dev`. - Adds `# syntax=docker/dockerfile:1.4` so the cache-mount syntax is recognized. - Adds `docker/setup-buildx-action@v3` to `python-ci.yml`. The other two affected workflows (`cypress-tests.yml`, `jest-server-test.yml`) already had it; BuildKit is required by the new cache mount. - Fixes a pre-existing concurrency-group collision in the two jest workflows. `jest-client-admin-test.yml` and `jest-client-report-test.yml` both used `group: ${{ github.ref }}` with `cancel-in-progress: true`, which made them cancel each other on any PR touching both `client-admin/` and `client-report/`. Prefix each group with the workflow name so they don't collide. - Adds concurrency blocks (`cancel-in-progress: true`) to the four test/lint workflows that had none — `jest-server-test`, `lint`, `python-ci`, `test-clojure`. Stops stacked obsolete runs on rapid pushes from chewing through CI minutes. - Adds queue-collapse concurrency (`cancel-in-progress: false`) to `deploy-preprod` and `deploy-prod`. Both trigger on push (to `edge` and `stable`); during stack merges, the in-flight deploy finishes and the queue collapses to the newest commit — no mid-flight cancellation, no wasted intermediate-commit deploys. The workflow_dispatch-only deploys (`deploy-alpha-aws`) and the manual-only `sensemaker-cron` don't need it. Verified locally: builds of `client-participation-alpha`, `server` prod target, and the four-stage `file-server` all succeed with the new syntax.
5e34f10 to
8a04887
Compare
Delphi Coverage Report
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
CI's docker build step intermittently fails with
npm error code ECONNRESET. PR #2344 setfetch-retries 5, which helps but doesn't eliminate it (~7 hits during the 2026-06-09/10 sessions).What this PR contributes
1.
npm install→npm ciin the three Dockerfiles still usingnpm install(client-participation-alpha,oidc-simulator,serverdev stage — lockfiles already exist).npm cifinishes faster (no dependency resolution, parallel install), shrinking the wall-clock window in which a TCP reset can land. This is the biggest single contributor to flake reduction in this PR.server's dev stage also gets--production=falsefor symmetry with the prod stage.2. BuildKit npm cache mount (
RUN --mount=type=cache,target=/root/.npm) on every npm Dockerfile, plus the required# syntax=docker/dockerfile:1.4directive. The npm cache survives acrossRUNinstructions within a single build. On a fresh GH runner with no prior cache state, the immediate benefit is modest — but it leaves the door open for a future cross-run cache backend without further Dockerfile changes.server/Dockerfileandfile-server/Dockerfileheader comments also got a BuildKit-requirement note for local-build readers on older Docker.3.
docker/setup-buildx-action@v3inpython-ci.yml(the other two affected workflows already had it). BuildKit is required by the new cache-mount directives.4. Concurrency-group fix in the two jest workflows.
jest-client-admin-test.ymlandjest-client-report-test.ymlboth usedgroup: ${{ github.ref }}withcancel-in-progress: true. They share a group name, so any PR touching bothclient-admin/andclient-report/would have one cancel the other (showing as a failing check). Caught on this very PR. Prefixed each group with the workflow name so they no longer collide.5. Concurrency blocks added to four test/lint workflows that had none.
jest-server-test,lint,python-ci,test-clojure— each getscancel-in-progress: truewith a unique group. Stops obsolete runs from piling up on rapid pushes.6. Queue-collapse concurrency on the two push-triggered deploys.
deploy-preprod.ymlanddeploy-prod.ymlgetcancel-in-progress: false— the in-flight deploy finishes, but the queue collapses to the newest commit. Stack-merge scenarios (5 PRs land on edge → 5 deploys queue) now run only the final commit instead of all five.The
workflow_dispatch-only deploy (deploy-alpha-aws) and the manual-onlysensemaker-cronare left without concurrency — no realistic queue to collapse.Scope
7 Dockerfiles + 8 workflow files. The Dockerfile and Buildx-setup changes target the ECONNRESET flake directly. The concurrency changes are CI-hygiene cleanup discovered along the way and reviewed end-to-end by Copilot and an internal review agent.
A separate experiment with cross-run caching via GitHub Actions cache (#2563, closed) found that the gha cache backend I/O is too slow for our docker image sizes — cache hits don't pay for the export/import round-trip. So this PR delivers the within-build improvements without trying to ride them across runs.
Verified
client-participation-alpha,serverprod target, and the 4-stagefile-serverall succeed with the new BuildKit syntax.test (admin) pass 42s+test (report) pass 33son the same commit).--production=false+ deploy queue-collapse added per its feedback).