ensure test jobs always run on PRs by jameslamb · Pull Request #708 · rapidsai/docker

jameslamb · 2024-08-14T19:35:22Z

Follow-up to #702 and #693.

test jobs are not currently running on pull requests here, because they require build-multiarch-manifest jobs, which have this condition that causes such jobs to be skipped on PR builds:

docker/.github/workflows/build-test-publish-images.yml

Lines 171 to 172 in 1c27d92

    
           build-multiarch-manifest: 
        
             if: ${{ !cancelled() && inputs.build_type == 'branch' }}

This PR ensures that test jobs always run on PRs, and that merging is blocked until they succeed.

jameslamb · 2024-08-14T20:50:05Z

It looks to me like the test jobs either need to depend on the build-multiarch-manifest job, or need to include architecture in their expected image tags.

See this failing build:

/usr/local/bin/docker pull rapidsai/staging:docker-notebooks-708-24.10a-cuda11.8-py3.9
  Error response from daemon: manifest for rapidsai/staging:docker-notebooks-708-24.10a-cuda11.8-py3.9 not found: manifest unknown: manifest unknown
  Warning: Docker pull failed with exit code 1, back off 4.989 seconds before retry.
  /usr/local/bin/docker pull rapidsai/staging:docker-notebooks-708-24.10a-cuda11.8-py3.9
  Error response from daemon: manifest for rapidsai/staging:docker-notebooks-708-24.10a-cuda11.8-py3.9 not found: manifest unknown: manifest unknown
  Warning: Docker pull failed with exit code 1, back off 9.9[16](https://github.com/rapidsai/docker/actions/runs/10393969779/job/28784483770?pr=708#step:3:19) seconds before retry.
  /usr/local/bin/docker pull rapidsai/staging:docker-notebooks-708-24.10a-cuda11.8-py3.9
  Error response from daemon: manifest for rapidsai/staging:docker-notebooks-708-24.10a-cuda11.8-py3.9 not found: manifest unknown: manifest unknown
  Error: Docker pull failed with exit code 1

(build link)

There are only architecture-specific tags available at https://hub.docker.com/r/rapidsai/staging/tags?page=&page_size=&ordering=&name=docker-notebooks-708-24.10a-cuda11.8-py3.9.

I just pushed 11d05f7 adding the architecture to NOTEBOOK_TAG. Since in #702 it looks like the build-multiarch-manifest job was intentionally skipped for PRs.

jameslamb · 2024-08-14T21:42:07Z

Here's an example where one build failed with a network error, and where I was able to re-run only the failed job (the main design goal of #702):

https://github.com/rapidsai/docker/actions/runs/10394821734/job/28785524564?pr=708

And I can see that the delete-* jobs have not run yet, so all the images needed by the test jobs should exist.

jameslamb · 2024-08-15T15:09:36Z

Ok I think this is working!

✅ was able to run "re-run failed jobs" on build jobs (no need to re-run all of them)
✅ all test jobs ran
✅ 0 test jobs ran until all build jobs passed
✅ 0 delete-temp-images job ran until all test jobs passed
✅ all delete-temp-images ran successfully, and there are 0 images left behind from this PR (DockerHub search for '-708' tags for 'rapidsai/staging')

jameslamb · 2024-08-15T15:11:17Z

@raydouglass @AyodeAwe if you agree with the proposal in this PR, could you please change which pr-builder is required in the branch protection checks? It should now be ci / docker / pr-builder / run.

jakirkham

Thanks James! 🙏

Had a question below

jakirkham · 2024-08-15T16:43:55Z

  delete-temp-images:
    if: ${{ !cancelled() && needs.test.result == 'success' }}
-    needs: [compute-matrix, build-multiarch-manifest, test]
+    needs: [compute-matrix, build, test]


Thought we wanted to finish the multiarch builds before doing cleanup. Or am I misunderstanding something?

I can help explain.

build publishes images with tags like this:

rapidsai/staging:docker-notebooks-706-24.10a-cuda12.0-py3.11-arm64 rapidsai/staging:docker-notebooks-706-24.10a-cuda12.0-py3.11-amd64

The build-multiarch-manifest job pushes a new manifest to DockerHub which combines those 2 images together, so that something like this will work:

docker pull \ rapidsai/staging:docker-notebooks-706-24.10a-cuda12.0-py3.11

(notice .... no -{arch} suffix)

As of #702, that build-multiarch-manifest job is not run on pull requests at all.

docker/.github/workflows/build-test-publish-images.yml

Lines 171 to 172 in 1c27d92

build-multiarch-manifest:

if: ${{ !cancelled() && inputs.build_type == 'branch' }}

I believe that was intentional, as a way to cut down on CI time and network calls. The combined manifests are a convenience for downstream consumers of these images, and not necessary for one CI job here to read the outputs from a prior job.

Thanks James! 🙏

That's a helpful explanation. Agree we don't want this constraint in PRs

Though I think with builds on branches, this was a change that Jake introduced in PR ( #702 ) to make sure we were not deleting images before the manifest was generated

docker/.github/workflows/build-test-publish-images.yml

Lines 232 to 234 in 1c27d92

delete-temp-images:

if: ${{ !cancelled() && needs.test.result == 'success' }}

needs: [compute-matrix, build-multiarch-manifest, test]

Otherwise if a build failed, we would delete the images needed for the manifest and restarting the failed jobs would just fail immediately (as the manifest cannot be created)

gah you're 100% right. We need to account for that here. I'll push some changes.

That condition never could have worked on PRs, because build-multiarch-manifest doesn't run on PRs (see related conversation in this other thread: #708 (comment)). Sorry I didn't catch it in previous reviews.

bdice · 2024-08-15T16:47:04Z

+    needs:
+      - checks
+      - compute-matrix
+      - build


Are we missing build-multiarch-manifest here?

No, it (I think intentionally) does not run on PRs: #708 (comment)

@AyodeAwe what was the reason for not running build-multiarch-manifest on PRs any more as of #702?

I made some assumptions here but maybe those assumptions are wrong.

build-multiarch-manifest was never usually run in PRs:

docker/.github/workflows/build-test-publish-images.yml

Line 194 in 5cea518

if: inputs.build_type == 'branch'

Ah! Ok great, thanks for checking my understanding there.

I just pushed a commit making build-multiarch-manifest run on PRs (and adding it to pr-builder here).

I think it just makes everything simpler...:

it can now be required by the pr-builder job (to be sure it isn't accidentally skipped)

issues with it can be detected on PRs, instead of only on branch builds

it can be unconditionally included in the needs: block for delete-temp-images

Those jobs tend to take like 5-30 seconds each and don't use GPUs. I think that's a pretty small price to pay in exchange for the benefits I listed above.

raydouglass · 2024-08-15T17:08:56Z

  delete-temp-images:
    if: ${{ !cancelled() && needs.test.result == 'success' }}
-    needs: [compute-matrix, build-multiarch-manifest, test]
+    needs: [compute-matrix, build, test]


Won't this change mean that for branch builds delete-temp-images will not run at all?

I don't think so? Because this workflow is invoked on branch builds.

docker/.github/workflows/publish.yml

Lines 3 to 6 in 1c27d92

on:

push:

branches:

- "branch-*"

docker/.github/workflows/publish.yml

Lines 21 to 26 in 1c27d92

docker:

uses: ./.github/workflows/build-test-publish-images.yml

with:

build_type: branch

run_tests: ${{ inputs.run_tests || false }}

secrets: inherit

.... BUT it looks like delete-temp-images is ALREADY not running on branch builds 🙃

Look at the most recent one:

https://github.com/rapidsai/docker/actions/runs/10405157692

And that's because it has [test] in needs: but tests are being skipped, because on branch builds this is run like this:

build_type: branch run_tests: false

These requirements can't all be satisfied:

"do not run test on branch builds"

"only run delete-temp-images after the test succeeds"

"always run delete-temp-images on branch builds"

@raydouglass what do you think about just completely eliminating delete-temp-images and instead relying on the other, higher-level cleanup of the rapidsai/staging DockerHub repo: https://github.com/rapidsai/workflows/blob/main/.github/workflows/cleanup_staging.yaml

That'd simplify things here a lot, I think.

Not necessarily opposed to letting the cleanup happen there, but what if the condition was updated to be "only run delete-temp-images after the test succeeds or is skipped"?

if: ${{ !cancelled() && (needs.test.result == 'success' || needs.test.result == 'skipped') }} maybe?

Ok sure, I just pushed a commit with this modified condition.

I do strongly support just dropping delete-temp-images entirely. If you'd prefer to discuss that separately from this PR let me know, I can write up an issue.

It'd make CI here work more like it does in other RAPIDS repos, which might reduce the incidence of bugs like those being fixed in this PR.

It'd also remove some complexity... no need to keep https://github.com/rapidsai/docker/blob/branch-24.10/ci/delete-temp-images.sh up to date with changes to the set of image tags or changes in DockerHub's API.

It'd also remove a source of network calls in CI here, which reduces the risk of networking-based job failures.

… delete-temp-images runs on branch builds

jameslamb · 2024-08-16T19:03:26Z

This is ready to merge, as soon as the branch protections are changed as described in #708 (comment)

raydouglass · 2024-08-20T15:30:41Z

This is ready to merge, as soon as the branch protections are changed as described in #708 (comment)

Done! Checks look correct!

raydouglass · 2024-08-20T15:30:53Z

/merge

jameslamb · 2024-08-20T15:55:45Z

Awesome, thanks so much for the help!

Follow-up to #708. Proposes completely removing the `delete-temp-images` job, in favor of relying on the scheduled nightly cleanup at https://github.com/rapidsai/workflows/blob/main/.github/workflows/cleanup_staging.yaml. ## Notes for Reviewers ### Details CI here writes images to the `rapidsai/staging` repo on DockerHub, then later copies them to individual user-facing repos. To avoid those temporary CI artifacts piling up in the `rapidsai/staging` repo, pull requests and branch builds run a job called `delete-temp-images` which does what it sounds like. In exchange for more aggressive cleaning, this job introduces significant complexity for development here. Most notably, we've observed several instances where that job deletes images before all CI jobs needing them have completed successfully, leading to all of CI needing to be re-run. Significant effort has been put into trying to avoid that, and we've found it's been difficult to get it right: some attempts: * #702 * #708 a recent example: * #696 (comment) ### Ok so how will we clean up? The workflow at https://github.com/rapidsai/workflows/blob/main/.github/workflows/cleanup_staging.yaml. It runs once a day and deletes anything from `rapidsai/staging` that's more than 30 days old. ### Benefits of these changes As described in #708 (comment) ... CI here will work as it does in other RAPIDS repos.... if any jobs fail for retryable reasons (like network issues), you can safely click "re-run failed jobs" and make incremental progress towards all builds passing. Also reduces the need to maintain code that has to keep up with the DockerHub API in two places (by deleting `ci/delete-temp-images.sh` here). Authors: - James Lamb (https://github.com/jameslamb) - Bradley Dice (https://github.com/bdice) Approvers: - Bradley Dice (https://github.com/bdice) - Ray Douglass (https://github.com/raydouglass) - https://github.com/jakirkham URL: #709

ensure test jobs always run on PRs

229c43a

jameslamb added bug Something isn't working testing labels Aug 14, 2024

jameslamb added 2 commits August 14, 2024 14:43

just put pr-builder alongside other checks

b6c21ac

pass architecture-specific tag to test job

11d05f7

quotes

5566721

jameslamb marked this pull request as ready for review August 15, 2024 15:11

jameslamb requested a review from a team as a code owner August 15, 2024 15:11

jameslamb requested a review from raydouglass August 15, 2024 15:11

jameslamb changed the title ~~WIP: ensure test jobs always run on PRs~~ ensure test jobs always run on PRs Aug 15, 2024

jameslamb requested review from AyodeAwe, bdice and jakirkham August 15, 2024 15:11

jameslamb commented Aug 15, 2024

View reviewed changes

Comment thread .github/workflows/pr.yml

AyodeAwe approved these changes Aug 15, 2024

View reviewed changes

jakirkham reviewed Aug 15, 2024

View reviewed changes

bdice reviewed Aug 15, 2024

View reviewed changes

raydouglass reviewed Aug 15, 2024

View reviewed changes

remove setup-python, just always run build-multiarch-manifest, ensure…

ab85538

… delete-temp-images runs on branch builds

raydouglass approved these changes Aug 15, 2024

View reviewed changes

jameslamb mentioned this pull request Aug 15, 2024

Include common debug information in logs for raising issues rapidsai/build-planning#34

Open

rapids-bot bot merged commit afd8c20 into rapidsai:branch-24.10 Aug 20, 2024

jameslamb deleted the run-tests branch August 20, 2024 15:55

This was referenced Aug 20, 2024

Drop CUDA 12.2 Docker images #696

Merged

remove delete-temp-images job #709

Merged

	build-multiarch-manifest:
	if: ${{ !cancelled() && inputs.build_type == 'branch' }}

	delete-temp-images:
	if: ${{ !cancelled() && needs.test.result == 'success' }}
	needs: [compute-matrix, build-multiarch-manifest, test]

	docker:
	uses: ./.github/workflows/build-test-publish-images.yml
	with:
	build_type: branch
	run_tests: ${{ inputs.run_tests \|\| false }}
	secrets: inherit

Conversation

jameslamb commented Aug 14, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jameslamb commented Aug 14, 2024

Uh oh!

jameslamb commented Aug 14, 2024

Uh oh!

jameslamb commented Aug 15, 2024

Uh oh!

jameslamb commented Aug 15, 2024

Uh oh!

Uh oh!

jakirkham left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jameslamb Aug 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jameslamb Aug 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

raydouglass Aug 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jameslamb Aug 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jameslamb commented Aug 16, 2024

Uh oh!

raydouglass commented Aug 20, 2024

Uh oh!

raydouglass commented Aug 20, 2024

Uh oh!

jameslamb commented Aug 20, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

jameslamb commented Aug 14, 2024 •

edited

Loading

jameslamb Aug 15, 2024 •

edited

Loading

jameslamb Aug 15, 2024 •

edited

Loading

raydouglass Aug 15, 2024 •

edited

Loading

jameslamb Aug 15, 2024 •

edited

Loading