ci(regression): build test Docker image once, share across shards #427

Merged
jrusso1020 merged 2 commits into main from ci/share-regression-docker-image on Apr 22, 2026

Conversation

@jrusso1020 (Collaborator)

What

Splits regression.yml into two jobs:

  1. build-image (new) — builds Dockerfile.test once, exports the image to a tarball via docker/build-push-action@v6 with outputs: type=docker,dest=..., and uploads it as a GHA artifact.
  2. regression-shards (existing matrix of 11) — downloads the artifact and runs docker load -i <tar> instead of rebuilding per shard.

GHA layer cache (type=gha,scope=regression-test-image) is preserved on the build job for warm-cache reuse across PRs.
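
The job split described above could be wired like this (a sketch, not the exact diff; the `changes` paths-filter job is assumed to already exist in regression.yml, and the matrix is abbreviated):

```yaml
jobs:
  build-image:
    needs: changes
    if: needs.changes.outputs.code == 'true'
    # builds Dockerfile.test once, uploads the tarball as a GHA artifact

  regression-shards:
    needs: [changes, build-image]
    if: needs.changes.outputs.code == 'true'
    strategy:
      matrix:
        shard: [1, 2, 3]  # the real workflow runs 11 shards
    # each shard downloads the artifact and `docker load`s it
```

Because `regression-shards` lists both jobs in `needs`, a skipped or failed `build-image` automatically skips the whole matrix.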

Why

Measured on PR #419's earlier regression runs:

  • Docker build step: ~234s per shard WITH cache hit
  • 11 shards × ~234s = ~43 min of runner time per PR run spent rebuilding the same image

Cold-cache cases are much worse — happening right now on PR #419 as of this writing: release commit b6f50ce bumped every packages/*/package.json, which invalidates the Docker layer that feeds bun install --frozen-lockfile. All 10 remaining shards are currently 25-30+ minutes into a parallel rebuild, thundering-herding npm from 10 runners simultaneously. The fast shard finally finished its build at the 26-minute mark; 8 shards are still going.

After this PR:

| Scenario | Before | After |
| --- | --- | --- |
| Warm cache, 11 shards | 11 × ~234s build = ~43 min | 1 × ~234s build + 11 × ~15s load = ~7 min |
| Cold cache, 11 shards | 11 × 20-30 min parallel build = 200-300 min runner-time, 25-30 min wall-clock on the slowest shard | 1 × 15-20 min build + 11 × ~15s load = ~18 min wall-clock, ~20 min runner-time |
| No code changes (paths-filter skip) | unchanged | unchanged (both jobs gated by `needs.changes.outputs.code == 'true'`) |

On cold cache, this is roughly a 10-15× runner-time reduction (200-300 min → ~20 min).

How

Build job (new)

```yaml
build-image:
  needs: changes
  if: needs.changes.outputs.code == 'true'
  runs-on: ubuntu-latest
  timeout-minutes: 20
  steps:
    - uses: actions/checkout@v4
      # no LFS — Dockerfile.test never COPYs the golden baselines,
      # only source + package manifests. LFS was always wasted here.
    - uses: docker/setup-buildx-action@v3
    - uses: docker/build-push-action@v6
      with:
        context: .
        file: Dockerfile.test
        tags: hyperframes-producer:test
        cache-from: type=gha,scope=regression-test-image
        cache-to: type=gha,mode=max,scope=regression-test-image
        outputs: type=docker,dest=/tmp/regression-test-image.tar
    - uses: actions/upload-artifact@v4
      with:
        name: regression-test-image
        path: /tmp/regression-test-image.tar
        retention-days: 1
        compression-level: 1
```

Shard job (existing, simplified)

Replaces the per-shard docker/setup-buildx-action + docker/build-push-action with:

```yaml
- uses: actions/download-artifact@v4
  with:
    name: regression-test-image
    path: /tmp
- run: docker load -i /tmp/regression-test-image.tar
```
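
As an optional hardening step (not part of this PR's diff), a shard could fail fast if the loaded tarball did not actually contain the expected tag; `docker image inspect` exits non-zero when the image is missing:

```yaml
- name: Verify image loaded (hypothetical sanity check)
  run: docker image inspect hyperframes-producer:test > /dev/null
```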

Tradeoffs

Pro: Massive savings on cold cache, meaningful savings on warm cache, shards start faster (no buildx setup + layer cache restore).

Con: Adds a sequential step — all shards now wait on build-image. Wall-clock for the fastest shard goes from "start at t=0, build is the bottleneck at ~4 min" to "start at t=build-time (~4 min), immediately load and run tests". Net wall-clock is typically faster because shards aren't fighting for buildx capacity, but the "time to first test output" moves right by ~4 min on warm-cache runs. On cold cache this cost is recovered many times over.

Con: Artifact retention consumes storage (1 day, ~500 MB per run × N PRs). retention-days: 1 caps it; artifacts older than 1 day are purged automatically.

Con: If build-image fails, all shards fail. Currently the failure mode is equivalent (every shard would have failed the same Docker build independently), so no new failure surface.

Not changed

  • Matrix shape, shard args, LFS validation, failure artifact upload — all untouched.
  • regression summary job — now transitively depends on build-image via regression-shards, no explicit wiring needed.
  • GHA layer cache (type=gha) preserved on the build job, so warm-cache rebuilds stay fast.
  • Branch protection status check name (regression) unchanged.

Test plan

  • Unit tests added/updated — N/A (workflow config)
  • Manual testing performed — verified YAML validity via oxfmt --check; validated structure by comparing to docker/build-push-action@v6 docs for outputs: type=docker,dest=... and actions/download-artifact@v4 docs
  • Documentation updated (if applicable) — N/A

Validation after merge: open any PR that touches packages/producer/** and confirm the new build-image job appears, shards download the artifact, and regression completes. The first run after merge will also rebuild the GHA layer cache under the new job, so that run won't show the full savings — runs from the second onward will.

Splits regression.yml into a `build-image` job + the existing
`regression-shards` matrix. The build job produces a Docker tarball via
`docker/build-push-action` with `outputs: type=docker,dest=...`, uploads
it as a GHA artifact (retention 1 day, gzip level 1), and each shard
downloads + `docker load`s it instead of rebuilding.

Measured on PR #419 regression runs before the change:
- Docker build step: ~234s per shard WITH GHA layer cache hit
- 11 shards × ~234s = ~43 min of runner time per PR just on redundant
  image builds

Cold-cache cases are much worse — happening right now on PR #419 after
release commit b6f50ce bumped every `packages/*/package.json`, invalidating
the COPY layer that feeds `bun install --frozen-lockfile`. All 10 shards
are currently 25-30+ min into a parallel rebuild, thundering-herding
the same npm packages from 10 runners.

After this change:
- 1× build (~4 min warm, ~15 min cold) + 11× (download + `docker load`)
- Expected ~15-20s overhead per shard for artifact download + load
- Net savings: ~30-40 min of runner time per PR run on warm cache,
  substantially more on cold cache

The build job doesn't checkout LFS — Dockerfile.test only COPYs source +
package manifests, never the golden baselines, so the image build never
needed LFS. Shards still need LFS for the tests/**/output/output.mp4
baselines they validate against.

Comment thread on `.github/workflows/regression.yml` (resolved):
Addresses CodeQL warning 'Workflow does not contain permissions'.
Defaults the workflow GITHUB_TOKEN to `contents: read` only. The
build-image job elevates to `actions: write` because
`docker/build-push-action` with `cache-from/to: type=gha` uses the
GitHub Actions cache API, which needs read+write on the actions scope.
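
The permissions layout described above would look something like this in regression.yml (a sketch based on the comment, not the exact diff):

```yaml
# Workflow-level default: GITHUB_TOKEN can only read repo contents.
permissions:
  contents: read

jobs:
  build-image:
    permissions:
      contents: read
      actions: write  # GHA cache API, used by cache-from/to: type=gha
```

Job-level `permissions` replace (rather than extend) the workflow-level block, which is why `contents: read` is repeated on the job.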
@jrusso1020 jrusso1020 merged commit ef26798 into main Apr 22, 2026
21 checks passed
@jrusso1020 jrusso1020 deleted the ci/share-regression-docker-image branch April 22, 2026 21:37