ci(regression): build test Docker image once, share across shards #427

Merged
jrusso1020 merged 2 commits into main from ci/share-regression-docker-image on Apr 22, 2026

Conversation

@jrusso1020 (Collaborator)

What

Splits regression.yml into two jobs:

  1. build-image (new) — builds Dockerfile.test once, exports the image to a tarball via docker/build-push-action@v6 with outputs: type=docker,dest=..., and uploads it as a GHA artifact.
  2. regression-shards (existing matrix of 11) — downloads the artifact and runs docker load -i <tar> instead of rebuilding per shard.

GHA layer cache (type=gha,scope=regression-test-image) is preserved on the build job for warm-cache reuse across PRs.
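
The job split described above could be wired like this (a sketch, not the exact diff; the `changes` paths-filter job is assumed to already exist in regression.yml, and the matrix is abbreviated):

```yaml
jobs:
  build-image:
    needs: changes
    if: needs.changes.outputs.code == 'true'
    # builds Dockerfile.test once, uploads the tarball as a GHA artifact

  regression-shards:
    needs: [changes, build-image]
    if: needs.changes.outputs.code == 'true'
    strategy:
      matrix:
        shard: [1, 2, 3]  # the real workflow runs 11 shards
    # each shard downloads the artifact and `docker load`s it
```

Because `regression-shards` lists both jobs in `needs`, a skipped or failed `build-image` automatically skips the whole matrix.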

Why

Measured on PR #419's earlier regression runs:

  • Docker build step: ~234s per shard WITH cache hit
  • 11 shards × ~234s = ~43 min of runner time per PR run spent rebuilding the same image

Cold-cache cases are much worse — happening right now on PR #419 as of this writing: release commit b6f50ce bumped every packages/*/package.json, which invalidates the Docker layer that feeds bun install --frozen-lockfile. All 10 remaining shards are currently 25-30+ minutes into a parallel rebuild, thundering-herding npm from 10 runners simultaneously. The fast shard finally finished its build at the 26-minute mark; 8 shards are still going.

After this PR:

| Scenario | Before | After |
| --- | --- | --- |
| Warm cache, 11 shards | 11 × ~234s build = ~43 min | 1 × ~234s build + 11 × ~15s load = ~7 min |
| Cold cache, 11 shards | 11 × 20-30 min parallel build = 200-300 min runner-time, 25-30 min wall-clock on the slowest shard | 1 × 15-20 min build + 11 × ~15s load = ~18 min wall-clock, ~20 min runner-time |
| No code changes (paths-filter skip) | unchanged | unchanged (both jobs gated by `needs.changes.outputs.code == 'true'`) |

On cold cache, this is roughly a 10-15× runner-time reduction (200-300 min → ~20 min).

How

Build job (new)

```yaml
build-image:
  needs: changes
  if: needs.changes.outputs.code == 'true'
  runs-on: ubuntu-latest
  timeout-minutes: 20
  steps:
    - uses: actions/checkout@v4
      # no LFS — Dockerfile.test never COPYs the golden baselines,
      # only source + package manifests. LFS was always wasted here.
    - uses: docker/setup-buildx-action@v3
    - uses: docker/build-push-action@v6
      with:
        context: .
        file: Dockerfile.test
        tags: hyperframes-producer:test
        cache-from: type=gha,scope=regression-test-image
        cache-to: type=gha,mode=max,scope=regression-test-image
        outputs: type=docker,dest=/tmp/regression-test-image.tar
    - uses: actions/upload-artifact@v4
      with:
        name: regression-test-image
        path: /tmp/regression-test-image.tar
        retention-days: 1
        compression-level: 1
```

Shard job (existing, simplified)

Replaces the per-shard docker/setup-buildx-action + docker/build-push-action with:

```yaml
- uses: actions/download-artifact@v4
  with:
    name: regression-test-image
    path: /tmp
- run: docker load -i /tmp/regression-test-image.tar
```
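
As an optional hardening step (not part of this PR's diff), a shard could fail fast if the loaded tarball did not actually contain the expected tag; `docker image inspect` exits non-zero when the image is missing:

```yaml
- name: Verify image loaded (hypothetical sanity check)
  run: docker image inspect hyperframes-producer:test > /dev/null
```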

Tradeoffs

Pro: Massive savings on cold cache, meaningful savings on warm cache, shards start faster (no buildx setup + layer cache restore).

Con: Adds a sequential step — all shards now wait on build-image. Wall-clock for the fastest shard goes from "start at t=0, build is the bottleneck at ~4 min" to "start at t=build-time (~4 min), immediately load and run tests". Net wall-clock is typically faster because shards aren't fighting for buildx capacity, but the "time to first test output" moves right by ~4 min on warm-cache runs. On cold cache this cost is recovered many times over.

Con: Artifact retention consumes storage (1 day, ~500 MB per run × N PRs). retention-days: 1 caps it; artifacts older than 1 day are purged automatically.

Con: If build-image fails, all shards fail. Currently the failure mode is equivalent (every shard would have failed the same Docker build independently), so no new failure surface.

Not changed

  • Matrix shape, shard args, LFS validation, failure artifact upload — all untouched.
  • regression summary job — now transitively depends on build-image via regression-shards, no explicit wiring needed.
  • GHA layer cache (type=gha) preserved on the build job, so warm-cache rebuilds stay fast.
  • Branch protection status check name (regression) unchanged.

Test plan

  • Unit tests added/updated — N/A (workflow config)
  • Manual testing performed — verified YAML validity via oxfmt --check; validated structure by comparing to docker/build-push-action@v6 docs for outputs: type=docker,dest=... and actions/download-artifact@v4 docs
  • Documentation updated (if applicable) — N/A

Validation after merge: open any PR that touches packages/producer/** and confirm the new build-image job appears, shards download the artifact, and regression completes. The first run after merge will also rebuild the GHA layer cache under the new job, so that run won't show the full savings — runs from the second onward will.

Splits regression.yml into a `build-image` job + the existing
`regression-shards` matrix. The build job produces a Docker tarball via
`docker/build-push-action` with `outputs: type=docker,dest=...`, uploads
it as a GHA artifact (retention 1 day, gzip level 1), and each shard
downloads + `docker load`s it instead of rebuilding.

Measured on PR #419 regression runs before the change:
- Docker build step: ~234s per shard WITH GHA layer cache hit
- 11 shards × ~234s = ~43 min of runner time per PR just on redundant
  image builds

Cold-cache cases are much worse — happening right now on PR #419 after
release commit b6f50ce bumped every `packages/*/package.json`, invalidating
the COPY layer that feeds `bun install --frozen-lockfile`. All 10 shards
are currently 25-30+ min into a parallel rebuild, thundering-herding
the same npm packages from 10 runners.

After this change:
- 1× build (~4 min warm, ~15 min cold) + 11× (download + `docker load`)
- Expected ~15-20s overhead per shard for artifact download + load
- Net savings: ~30-40 min of runner time per PR run on warm cache,
  substantially more on cold cache

The build job doesn't checkout LFS — Dockerfile.test only COPYs source +
package manifests, never the golden baselines, so the image build never
needed LFS. Shards still need LFS for the tests/**/output/output.mp4
baselines they validate against.

Comment thread on `.github/workflows/regression.yml` (resolved):
Addresses CodeQL warning 'Workflow does not contain permissions'.
Defaults the workflow GITHUB_TOKEN to `contents: read` only. The
build-image job elevates to `actions: write` because
`docker/build-push-action` with `cache-from/to: type=gha` uses the
GitHub Actions cache API, which needs read+write on the actions scope.
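
The permissions layout described above would look something like this in regression.yml (a sketch based on the comment, not the exact diff):

```yaml
# Workflow-level default: GITHUB_TOKEN can only read repo contents.
permissions:
  contents: read

jobs:
  build-image:
    permissions:
      contents: read
      actions: write  # GHA cache API, used by cache-from/to: type=gha
```

Job-level `permissions` replace (rather than extend) the workflow-level block, which is why `contents: read` is repeated on the job.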
@jrusso1020 jrusso1020 merged commit ef26798 into main Apr 22, 2026
21 checks passed
@jrusso1020 jrusso1020 deleted the ci/share-regression-docker-image branch April 22, 2026 21:37