ci: cache docker layers in GitHub Actions cache across runs #2563

Closed
jucor wants to merge 1 commit into jc/ci-econnreset-fix from jc/ci-compose-bake-cache

jucor commented Jun 11, 2026 (Collaborator)

Summary

Stacked on #2562. Together they fix the ECONNRESET flake and cut docker-build time on cache hits.

How

1. docker-compose.test.yml — each buildable service gets an explicit image: ${COMPOSE_PROJECT_NAME:-polis-test}-<service>:latest. This matches what compose's auto-naming was already producing, but makes the tags addressable by docker buildx bake (used below).

2. Three workflows (cypress-tests.yml, jest-server-test.yml, python-ci.yml): the docker compose build step is replaced with docker/bake-action@v6. Cache config is passed via the action's set input:

*.cache-from=type=gha
*.cache-to=type=gha,mode=max

cypress-tests.yml also keeps postgres.no-cache=true (Docker layer caching has been known to retain stale migration files).

The subsequent docker compose up -d finds the bake-built images because the image: tags match what compose expects.
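
To make the tag contract above concrete, here is a minimal Python sketch of how the `${COMPOSE_PROJECT_NAME:-polis-test}-<service>:latest` template expands. The helper name `expected_tag` is hypothetical and is not part of the PR; only the template and the `polis-test` default come from the description.

```python
import os

# Fallback project name, taken from the image: template quoted in the PR.
DEFAULT_PROJECT = "polis-test"


def expected_tag(service, environ=None):
    """Mimic ${COMPOSE_PROJECT_NAME:-polis-test}-<service>:latest.

    The shell-style `:-` operator falls back when the variable is unset
    OR empty, hence `or` instead of a plain dict default.
    """
    env = os.environ if environ is None else environ
    project = env.get("COMPOSE_PROJECT_NAME") or DEFAULT_PROJECT
    return f"{project}-{service}:latest"


if __name__ == "__main__":
    # With no project name set, the default applies.
    print(expected_tag("postgres", environ={}))
```

Both bake and the later docker compose up -d must arrive at the same string for a service; if they disagree, compose would not find the bake-built image under the name it expects.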

Why not COMPOSE_BAKE=true + x-bake: (the prior attempt)

COMPOSE_BAKE=true does delegate docker compose build to bake, but the compose-to-bake serialization layer silently drops x-bake: fields entirely. Confirmed by docker compose ... build --print: the bake JSON payload has no cache-from/cache-to whatsoever. The build succeeds but no cache is read or written. A real CI run with that approach produced cold-baseline timings on both a supposed cache-populate batch and a supposed cache-hit batch — the "speedup" was zero because the cache was never engaged.

docker/bake-action bypasses that serialization and applies --set directly, so cache config arrives intact.
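
The "no cache-from/cache-to in the payload" check can be automated. The sketch below is one possible way to do it, assuming the printed bake definition is JSON with a top-level `target` mapping. The function name `targets_missing_cache`, the `server` target, and the sample payload are hypothetical and hand-written, not captured from a real run.

```python
import json


def targets_missing_cache(bake_json):
    """Return the names of targets that lack cache-from or cache-to."""
    payload = json.loads(bake_json)
    missing = []
    for name, target in sorted(payload.get("target", {}).items()):
        # Only presence is tested; the exact shape of each entry
        # (string or object) may differ between buildx versions.
        if not target.get("cache-from") or not target.get("cache-to"):
            missing.append(name)
    return missing


# Hand-written sample shaped like a printed bake definition (illustrative only).
SAMPLE = json.dumps({
    "target": {
        "postgres": {"context": ".", "tags": ["polis-test-postgres:latest"]},
        "server": {
            "context": ".",
            "tags": ["polis-test-server:latest"],
            "cache-from": ["type=gha"],
            "cache-to": ["type=gha,mode=max"],
        },
    }
})

if __name__ == "__main__":
    print(targets_missing_cache(SAMPLE))
```

A non-empty result for a target that is supposed to be cached would have flagged the silent x-bake drop before any CI minutes were spent on it.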

No-op for local devs

Adding image: to compose services is the only externally visible change for non-CI users. It doesn't change behavior — it just locks in the tag that compose was already auto-generating, so it's equivalent for everyone running docker compose ... locally.

Expected impact

| Workflow | Cold-cache baseline | Warm-cache target |
| --- | --- | --- |
| E2E Tests | ~16m | ~12-13m |
| Server Integration | ~4m 10s | ~2m |
| Delphi Python | ~7m 55s | ~5m |

(Build step compresses to ~30-60s on cache hit, per handoff doc estimate. First run after merge writes the cache; subsequent runs benefit.)

Verified locally

  • docker buildx bake -f docker-compose.test.yml --print shows correct tags (polis-test-<service>:latest) for every buildable service, including the ECR tag for delphi.
  • After a fresh docker rmi, a second docker buildx bake against a local cache backend reports #10-#16 CACHED for every Dockerfile RUN step and importing cache manifest from local:... — cache import + load both work end-to-end.
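
Counting the CACHED markers by eye gets tedious across services. The following sketch shows one way to tally them, assuming the build log marks cache hits with lines of the form `#<step> CACHED`, as in the `#10-#16 CACHED` report above. The function name `cached_steps` and the sample log are hypothetical; the sample is synthetic and does not come from a real build.

```python
import re

# Matches lines such as "#10 CACHED" (step number followed by the marker).
CACHED_LINE = re.compile(r"^#(\d+) CACHED$")


def cached_steps(log_text):
    """Return the sorted step numbers that the log reports as CACHED."""
    steps = set()
    for line in log_text.splitlines():
        match = CACHED_LINE.match(line.strip())
        if match:
            steps.add(int(match.group(1)))
    return sorted(steps)


# Synthetic log fragment, written by hand for illustration only.
SAMPLE_LOG = """\
#10 [2/4] RUN step-a
#10 CACHED
#11 [3/4] RUN step-b
#11 CACHED
#12 [4/4] RUN step-c
#12 DONE 3.1s
"""

if __name__ == "__main__":
    print(cached_steps(SAMPLE_LOG))
```

Comparing the returned steps against the RUN steps of each Dockerfile gives a quick pass/fail signal for "every RUN step was a cache hit".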

CI verification (cache-populate + cache-hit batches) is in progress on this PR. Description will be edited with the numbers once both batches complete.

Stacked PR note

Base is jc/ci-econnreset-fix (the #2562 branch). When #2562 merges, GitHub auto-retargets this PR to edge. With jj, edits to the base commit propagate via jj rebase/jj squash.

@jucor jucor force-pushed the jc/ci-compose-bake-cache branch from 8f2921d to d687084 Compare June 11, 2026 13:10
@jucor jucor closed this Jun 11, 2026
@jucor jucor deleted the jc/ci-compose-bake-cache branch June 11, 2026 13:10
@jucor jucor restored the jc/ci-compose-bake-cache branch June 11, 2026 13:11
@jucor jucor reopened this Jun 11, 2026
Stacked on the npm-cache-mount PR. This PR exports the docker layer
cache (including the npm cache mount populated by the parent commit)
to GitHub Actions cache, so a fresh runner starts with a warm cache
instead of an empty one.

## How

1. `docker-compose.test.yml` — explicit `image:` field on each
   buildable service, matching what compose's auto-naming produces
   (`<project>-<service>:latest`). This is required so that
   `docker buildx bake` (used by bake-action below) produces tags
   addressable by the subsequent `docker compose up -d`.
2. All three workflows that build docker images — `docker compose
   build` swapped for `docker/bake-action@v6` with
   `cache-from=type=gha` and `cache-to=type=gha,mode=max`. cypress
   keeps `postgres.no-cache=true` to preserve the existing
   stale-migration-file safeguard.

## Why not the simpler COMPOSE_BAKE=true + x-bake approach

`COMPOSE_BAKE=true` does delegate `docker compose build` to bake,
but the compose-to-bake serialization layer **silently drops
`x-bake:` fields entirely**. Confirmed locally with
`docker compose ... build --print`: the bake JSON payload has no
`cache-from`/`cache-to`. The build succeeds but no cache is read
or written — verified on a real CI run that produced cold-baseline
timings on both a cache-populate and a supposed-cache-hit batch.

`docker/bake-action` bypasses that serialization. It also needs
the `image:` fields above, because going straight to bake (rather
than through compose) means bake doesn't know the compose project
name and would tag the images dangling.

## Verified locally

- `docker buildx bake -f docker-compose.test.yml --print` shows
  correct tags (`polis-test-<service>:latest`) for every buildable
  service.
- Second build after a fresh `docker rmi` reports `CACHED` for
  every Dockerfile RUN step and `importing cache manifest` — cache
  import + load both work end-to-end.
@jucor jucor force-pushed the jc/ci-compose-bake-cache branch from d687084 to 8b9fc18 Compare June 11, 2026 13:52
@github-actions

Delphi Coverage Report

| File | Stmts | Miss | Cover |
| --- | --- | --- | --- |
| __init__.py | 2 | 0 | 100% |
| benchmarks/bench_pca.py | 76 | 76 | 0% |
| benchmarks/bench_repness.py | 81 | 81 | 0% |
| benchmarks/bench_update_votes.py | 38 | 38 | 0% |
| benchmarks/benchmark_utils.py | 34 | 34 | 0% |
| components/__init__.py | 1 | 0 | 100% |
| components/config.py | 165 | 133 | 19% |
| conversation/__init__.py | 2 | 0 | 100% |
| conversation/conversation.py | 1062 | 296 | 72% |
| conversation/manager.py | 131 | 42 | 68% |
| database/__init__.py | 1 | 0 | 100% |
| database/dynamodb.py | 387 | 234 | 40% |
| database/postgres.py | 306 | 206 | 33% |
| pca_kmeans_rep/__init__.py | 5 | 0 | 100% |
| pca_kmeans_rep/clusters.py | 257 | 22 | 91% |
| pca_kmeans_rep/corr.py | 98 | 17 | 83% |
| pca_kmeans_rep/pca.py | 52 | 16 | 69% |
| pca_kmeans_rep/repness.py | 297 | 35 | 88% |
| regression/__init__.py | 4 | 0 | 100% |
| regression/clojure_comparer.py | 188 | 20 | 89% |
| regression/comparer.py | 887 | 720 | 19% |
| regression/datasets.py | 135 | 27 | 80% |
| regression/recorder.py | 36 | 27 | 25% |
| regression/utils.py | 138 | 94 | 32% |
| run_math_pipeline.py | 261 | 114 | 56% |
| umap_narrative/500_generate_embedding_umap_cluster.py | 210 | 109 | 48% |
| umap_narrative/501_calculate_comment_extremity.py | 112 | 53 | 53% |
| umap_narrative/502_calculate_priorities.py | 135 | 135 | 0% |
| umap_narrative/700_datamapplot_for_layer.py | 502 | 502 | 0% |
| umap_narrative/701_static_datamapplot_for_layer.py | 310 | 310 | 0% |
| umap_narrative/702_consensus_divisive_datamapplot.py | 432 | 432 | 0% |
| umap_narrative/801_narrative_report_batch.py | 785 | 785 | 0% |
| umap_narrative/802_process_batch_results.py | 265 | 265 | 0% |
| umap_narrative/803_check_batch_status.py | 175 | 175 | 0% |
| umap_narrative/llm_factory_constructor/__init__.py | 2 | 2 | 0% |
| umap_narrative/llm_factory_constructor/model_provider.py | 157 | 157 | 0% |
| umap_narrative/polismath_commentgraph/__init__.py | 1 | 0 | 100% |
| umap_narrative/polismath_commentgraph/cli.py | 270 | 270 | 0% |
| umap_narrative/polismath_commentgraph/core/__init__.py | 3 | 3 | 0% |
| umap_narrative/polismath_commentgraph/core/clustering.py | 108 | 108 | 0% |
| umap_narrative/polismath_commentgraph/core/embedding.py | 104 | 104 | 0% |
| umap_narrative/polismath_commentgraph/lambda_handler.py | 219 | 219 | 0% |
| umap_narrative/polismath_commentgraph/schemas/__init__.py | 2 | 0 | 100% |
| umap_narrative/polismath_commentgraph/schemas/dynamo_models.py | 160 | 9 | 94% |
| umap_narrative/polismath_commentgraph/tests/conftest.py | 17 | 17 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_clustering.py | 74 | 74 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_embedding.py | 55 | 55 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_storage.py | 87 | 87 | 0% |
| umap_narrative/polismath_commentgraph/utils/__init__.py | 3 | 0 | 100% |
| umap_narrative/polismath_commentgraph/utils/converter.py | 283 | 237 | 16% |
| umap_narrative/polismath_commentgraph/utils/group_data.py | 354 | 336 | 5% |
| umap_narrative/polismath_commentgraph/utils/storage.py | 584 | 518 | 11% |
| umap_narrative/reset_conversation.py | 159 | 50 | 69% |
| umap_narrative/run_pipeline.py | 453 | 312 | 31% |
| utils/general.py | 62 | 41 | 34% |
| Total | 10727 | 7597 | 29% |

jucor commented Jun 11, 2026 (Collaborator, Author)

Closing — GitHub Actions cache backend isn't a fit for our docker image sizes.

What I learned across the iterations

  1. COMPOSE_BAKE=true + x-bake: doesn't work for cache config: compose silently strips x-bake from the bake payload it generates. The first attempt at this PR looked like it ran fine — but no cache was ever read or written.
  2. docker/bake-action works, but its default git context re-fetches per commit SHA, so layer hashes change every commit and the cache always misses. Fixed by setting source: . so the build uses the local checkout.
  3. With source: ., cache hits actually happen (10-11 CACHED markers per build). But the gha cache I/O overhead is too high for our layer sizes — delphi's image alone is ~4 GB, and the mode=max export+import round-trip costs minutes per run. Net: every build with caching enabled is slower than baseline.

Numbers (final iteration, with source: .)

| Workflow | Baseline | Cache write | Cache read |
| --- | --- | --- | --- |
| E2E | ~16m | 14m 09s* | 21m 27s |
| Server | ~4m 10s | 5m 34s | 23m 04s |
| Delphi | ~7m 55s | 12m 43s | 14m 33s |

* The E2E "win" on cache write was incidental: bake collapsed the workflow's two sequential docker compose build calls (the --no-cache postgres one + the full build) into one parallel pass. That accounts for the roughly 1m 50s saving against baseline and is unrelated to caching. Worth doing as a small follow-up PR.
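
To make the comparison explicit, the sketch below converts the reported timings to seconds and subtracts each baseline. The helper names `to_seconds` and `deltas` are hypothetical; the durations are copied from the numbers above, and since the baselines are approximate ("~"), the deltas are approximate too.

```python
import re


def to_seconds(text):
    """Parse durations such as '~16m', '14m 09s*' or '4m 10s' into seconds."""
    match = re.fullmatch(r"~?(\d+)m(?:\s*(\d+)s)?\*?", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    return int(match.group(1)) * 60 + int(match.group(2) or 0)


# (baseline, cache write, cache read), copied from the reported numbers.
TIMINGS = {
    "E2E": ("~16m", "14m 09s*", "21m 27s"),
    "Server": ("~4m 10s", "5m 34s", "23m 04s"),
    "Delphi": ("~7m 55s", "12m 43s", "14m 33s"),
}


def deltas(timings=TIMINGS):
    """Return {workflow: (write - baseline, read - baseline)} in seconds."""
    result = {}
    for workflow, (base, write, read) in timings.items():
        baseline = to_seconds(base)
        result[workflow] = (to_seconds(write) - baseline,
                            to_seconds(read) - baseline)
    return result


if __name__ == "__main__":
    for workflow, (write_delta, read_delta) in deltas().items():
        print(f"{workflow}: write {write_delta:+d}s, read {read_delta:+d}s")
```

The only negative entry is the E2E cache-write run (about 111 seconds faster), which the footnote attributes to parallelisation, not caching. Every cache-read run is slower than baseline: by roughly 5m 27s for E2E, 18m 54s for Server, and 6m 38s for Delphi.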

What would actually win

  • A registry-based cache backend (requires a registry — out of scope here).
  • Self-hosted runners with persistent local disk.
  • Aggressive image size reduction (separate project).

mode=min was considered. It would cache only the final image per target — cheaper export, but no partial-hit benefit. For our workload (many independent services; PRs typically touch one) partial hits are what we'd most want, so mode=min is also a loser.

What's preserved

The within-build npm cache mount in #2562 still provides robustness during npm fetch retries (the original ECONNRESET fix doesn't depend on this PR). Closing this PR doesn't lose that.

@jucor jucor closed this Jun 11, 2026
@jucor jucor deleted the jc/ci-compose-bake-cache branch June 11, 2026 18:05