Skip to content

ci: BuildKit npm cache mount to fix ECONNRESET docker-build flake#2562

Merged
jucor merged 1 commit into
edgefrom
jc/ci-econnreset-fix
Jun 11, 2026
Merged

ci: BuildKit npm cache mount to fix ECONNRESET docker-build flake#2562
jucor merged 1 commit into
edgefrom
jc/ci-econnreset-fix

Conversation

@jucor

@jucor jucor commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Summary

CI's docker build step intermittently fails with npm error code ECONNRESET. PR #2344 set fetch-retries 5, which helps but doesn't eliminate it (~7 hits during the 2026-06-09/10 sessions).

What this PR contributes

1. npm installnpm ci in the three Dockerfiles still using npm install (client-participation-alpha, oidc-simulator, server dev stage — lockfiles already exist). npm ci finishes faster (no dependency resolution, parallel install), shrinking the wall-clock window in which a TCP reset can land. This is the biggest single contributor to flake reduction in this PR. server's dev stage also gets --production=false for symmetry with the prod stage.

2. BuildKit npm cache mount (RUN --mount=type=cache,target=/root/.npm) on every npm Dockerfile, plus the required # syntax=docker/dockerfile:1.4 directive. The npm cache survives across RUN instructions within a single build. On a fresh GH runner with no prior cache state, the immediate benefit is modest — but it leaves the door open for a future cross-run cache backend without further Dockerfile changes. server/Dockerfile and file-server/Dockerfile header comments also got a BuildKit-requirement note for local-build readers on older Docker.

3. docker/setup-buildx-action@v3 in python-ci.yml (the other two affected workflows already had it). BuildKit is required by the new cache-mount directives.

4. Concurrency-group fix in the two jest workflows. jest-client-admin-test.yml and jest-client-report-test.yml both used group: ${{ github.ref }} with cancel-in-progress: true. They share a group name, so any PR touching both client-admin/ and client-report/ would have one cancel the other (showing as a failing check). Caught on this very PR. Prefixed each group with the workflow name so they no longer collide.

5. Concurrency blocks added to four test/lint workflows that had none. jest-server-test, lint, python-ci, test-clojure — each gets cancel-in-progress: true with a unique group. Stops obsolete runs from piling up on rapid pushes.

6. Queue-collapse concurrency on the two push-triggered deploys. deploy-preprod.yml and deploy-prod.yml get cancel-in-progress: false — the in-flight deploy finishes, but the queue collapses to the newest commit. Stack-merge scenarios (5 PRs land on edge → 5 deploys queue) now run only the final commit instead of all five.

The workflow_dispatch-only deploy (deploy-alpha-aws) and the manual-only sensemaker-cron are left without concurrency — no realistic queue to collapse.

Scope

7 Dockerfiles + 8 workflow files. The Dockerfile and Buildx-setup changes target the ECONNRESET flake directly. The concurrency changes are CI-hygiene cleanup discovered along the way and reviewed end-to-end by Copilot and an internal review agent.

A separate experiment with cross-run caching via GitHub Actions cache (#2563, closed) found that the gha cache backend I/O is too slow for our docker image sizes — cache hits don't pay for the export/import round-trip. So this PR delivers the within-build improvements without trying to ride them across runs.

Verified

  • Local builds of client-participation-alpha, server prod target, and the 4-stage file-server all succeed with the new BuildKit syntax.
  • A CI run on this PR passed all three Docker-building workflows on first try, with test counts identical to baseline (E2E 213/212✓/1⊘, Server 295✓/1⊘ in 42 suites, Delphi 336✓/14⊘/58 xfail).
  • Both jest workflows now run in parallel after the concurrency-group fix (verified: test (admin) pass 42s + test (report) pass 33s on the same commit).
  • Reviewed by Copilot (BuildKit-requirement comment added per its feedback) and an internal review agent (server dev --production=false + deploy queue-collapse added per its feedback).
  • One green run isn't proof — the handoff's bar is 5 consecutive successful runs.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces flaky CI Docker builds caused by intermittent npm network errors by switching remaining Dockerfiles from npm install to npm ci, enabling BuildKit’s npm cache mounts, and ensuring CI workflows set up Buildx where required. It also fixes a concurrency-group collision that could cause one Jest workflow to cancel the other on the same ref.

Changes:

  • Enable BuildKit frontend (# syntax=docker/dockerfile:1.4) and add RUN --mount=type=cache,target=/root/.npm to npm install steps across Dockerfiles.
  • Replace remaining npm install usages with npm ci where lockfiles are present.
  • Add docker/setup-buildx-action@v3 to python-ci.yml and fix Jest workflow concurrency groups to avoid cross-workflow cancellation.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.

Show a summary per file
File Description
server/Dockerfile Enables BuildKit frontend and uses npm cache mounts; switches dev-stage install to npm ci.
oidc-simulator/Dockerfile Enables BuildKit frontend; switches npm install to cached npm ci.
file-server/Dockerfile Enables BuildKit frontend; adds npm cache mounts for all multi-stage client builds and file-server install.
client-report/Dockerfile Enables BuildKit frontend; adds npm cache mount to npm ci.
client-participation/Dockerfile Enables BuildKit frontend; adds npm cache mount to npm ci.
client-participation-alpha/Dockerfile Enables BuildKit frontend; switches npm install to cached npm ci.
client-admin/Dockerfile Enables BuildKit frontend; adds npm cache mount to npm ci.
.github/workflows/python-ci.yml Adds Buildx setup so BuildKit cache-mount Dockerfile syntax is supported in this workflow.
.github/workflows/jest-client-report-test.yml Makes concurrency group workflow-specific to prevent collisions with other Jest workflows.
.github/workflows/jest-client-admin-test.yml Makes concurrency group workflow-specific to prevent collisions with other Jest workflows.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jucor jucor force-pushed the jc/ci-econnreset-fix branch 2 times, most recently from 79b6424 to 5e34f10 Compare June 11, 2026 17:35
@jucor jucor requested a review from Copilot June 11, 2026 17:44

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

server/Dockerfile:5

  • The Dockerfile now relies on BuildKit features (RUN --mount=type=cache) but the local build instructions still show a plain docker build ... without indicating BuildKit must be enabled. On environments where BuildKit is disabled, that command will fail with an “unknown flag: mount” error; please update the header comment to reflect the BuildKit requirement.
# syntax=docker/dockerfile:1.4
# To build locally, for example:
# docker build -t polis-server:local .
# To run locally, for example:
# docker run --rm --name polis-server --env-file .env --network polis-dev_polis-net \

Comment thread file-server/Dockerfile
Comment on lines +1 to 4
# syntax=docker/dockerfile:1.4
# Test this Dockerfile build with:
# (from the polis root directory)
# docker build -t polis-file-server:dev -f file-server/Dockerfile .
The `Build Docker images` step in the E2E, Server, and Delphi CI
workflows intermittently fails with `npm error code ECONNRESET` during
`npm install`. PR #2344 already set `fetch-retries 5`, but the flake
persists — observed ~7 times during the 2026-06-09/10 stack sessions.

Adds BuildKit cache mounts (`--mount=type=cache,target=/root/.npm`) to
every npm-based Dockerfile. The npm cache survives across builds and,
more importantly, across npm's own retries within a single build, so a
partial fetch interrupted by a TCP reset doesn't restart from scratch.

Also:
- Converts the three remaining `npm install` callers to `npm ci`
  (client-participation-alpha, oidc-simulator, server dev stage).
  Lockfiles already exist in all three. `server`'s dev stage also gets
  `--production=false` for symmetry with the prod stage — prevents
  devDeps being silently skipped if someone builds with NODE_ENV
  overridden to `production` while targeting `dev`.
- Adds `# syntax=docker/dockerfile:1.4` so the cache-mount syntax is
  recognized.
- Adds `docker/setup-buildx-action@v3` to `python-ci.yml`. The other
  two affected workflows (`cypress-tests.yml`, `jest-server-test.yml`)
  already had it; BuildKit is required by the new cache mount.
- Fixes a pre-existing concurrency-group collision in the two jest
  workflows. `jest-client-admin-test.yml` and
  `jest-client-report-test.yml` both used `group: ${{ github.ref }}`
  with `cancel-in-progress: true`, which made them cancel each other
  on any PR touching both `client-admin/` and `client-report/`. Prefix
  each group with the workflow name so they don't collide.
- Adds concurrency blocks (`cancel-in-progress: true`) to the four
  test/lint workflows that had none — `jest-server-test`, `lint`,
  `python-ci`, `test-clojure`. Stops stacked obsolete runs on rapid
  pushes from chewing through CI minutes.
- Adds queue-collapse concurrency (`cancel-in-progress: false`) to
  `deploy-preprod` and `deploy-prod`. Both trigger on push (to `edge`
  and `stable`); during stack merges, the in-flight deploy finishes
  and the queue collapses to the newest commit — no mid-flight
  cancellation, no wasted intermediate-commit deploys. The
  workflow_dispatch-only deploys (`deploy-alpha-aws`) and the
  manual-only `sensemaker-cron` don't need it.

Verified locally: builds of `client-participation-alpha`, `server`
prod target, and the four-stage `file-server` all succeed with the
new syntax.
@jucor jucor force-pushed the jc/ci-econnreset-fix branch from 5e34f10 to 8a04887 Compare June 11, 2026 17:49
@jucor jucor merged commit 32ec6fc into edge Jun 11, 2026
8 checks passed
@jucor jucor deleted the jc/ci-econnreset-fix branch June 11, 2026 17:55
@github-actions

Copy link
Copy Markdown

Delphi Coverage Report

File Stmts Miss Cover
init.py 2 0 100%
benchmarks/bench_pca.py 76 76 0%
benchmarks/bench_repness.py 81 81 0%
benchmarks/bench_update_votes.py 38 38 0%
benchmarks/benchmark_utils.py 34 34 0%
components/init.py 1 0 100%
components/config.py 165 133 19%
conversation/init.py 2 0 100%
conversation/conversation.py 1062 296 72%
conversation/manager.py 131 42 68%
database/init.py 1 0 100%
database/dynamodb.py 387 234 40%
database/postgres.py 306 206 33%
pca_kmeans_rep/init.py 5 0 100%
pca_kmeans_rep/clusters.py 257 22 91%
pca_kmeans_rep/corr.py 98 17 83%
pca_kmeans_rep/pca.py 52 16 69%
pca_kmeans_rep/repness.py 297 35 88%
regression/init.py 4 0 100%
regression/clojure_comparer.py 188 20 89%
regression/comparer.py 887 720 19%
regression/datasets.py 135 27 80%
regression/recorder.py 36 27 25%
regression/utils.py 138 94 32%
run_math_pipeline.py 261 114 56%
umap_narrative/500_generate_embedding_umap_cluster.py 210 109 48%
umap_narrative/501_calculate_comment_extremity.py 112 53 53%
umap_narrative/502_calculate_priorities.py 135 135 0%
umap_narrative/700_datamapplot_for_layer.py 502 502 0%
umap_narrative/701_static_datamapplot_for_layer.py 310 310 0%
umap_narrative/702_consensus_divisive_datamapplot.py 432 432 0%
umap_narrative/801_narrative_report_batch.py 785 785 0%
umap_narrative/802_process_batch_results.py 265 265 0%
umap_narrative/803_check_batch_status.py 175 175 0%
umap_narrative/llm_factory_constructor/init.py 2 2 0%
umap_narrative/llm_factory_constructor/model_provider.py 157 157 0%
umap_narrative/polismath_commentgraph/init.py 1 0 100%
umap_narrative/polismath_commentgraph/cli.py 270 270 0%
umap_narrative/polismath_commentgraph/core/init.py 3 3 0%
umap_narrative/polismath_commentgraph/core/clustering.py 108 108 0%
umap_narrative/polismath_commentgraph/core/embedding.py 104 104 0%
umap_narrative/polismath_commentgraph/lambda_handler.py 219 219 0%
umap_narrative/polismath_commentgraph/schemas/init.py 2 0 100%
umap_narrative/polismath_commentgraph/schemas/dynamo_models.py 160 9 94%
umap_narrative/polismath_commentgraph/tests/conftest.py 17 17 0%
umap_narrative/polismath_commentgraph/tests/test_clustering.py 74 74 0%
umap_narrative/polismath_commentgraph/tests/test_embedding.py 55 55 0%
umap_narrative/polismath_commentgraph/tests/test_storage.py 87 87 0%
umap_narrative/polismath_commentgraph/utils/init.py 3 0 100%
umap_narrative/polismath_commentgraph/utils/converter.py 283 237 16%
umap_narrative/polismath_commentgraph/utils/group_data.py 354 336 5%
umap_narrative/polismath_commentgraph/utils/storage.py 584 518 11%
umap_narrative/reset_conversation.py 159 50 69%
umap_narrative/run_pipeline.py 453 312 31%
utils/general.py 62 41 34%
Total 10727 7597 29%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants