ensure test jobs always run on PRs #708
rapids-bot[bot] merged 5 commits into rapidsai:branch-24.10 from jameslamb:run-tests
Conversation
It looks to me like the `test` jobs are failing. See this failing build: there are only architecture-specific tags available at https://hub.docker.com/r/rapidsai/staging/tags?page=&page_size=&ordering=&name=docker-notebooks-708-24.10a-cuda11.8-py3.9.
I just pushed 11d05f7 adding the architecture to the image tag used by the `test` jobs.
Here's an example where one build failed with a network error, and where I was able to re-run only the failed job (the main design goal of #702):
https://github.com/rapidsai/docker/actions/runs/10394821734/job/28785524564?pr=708
Ok, I think this is working! ✅ I was able to run "re-run failed jobs" successfully.
@raydouglass @AyodeAwe if you agree with the proposal in this PR, could you please change which checks are required by branch protection on this repo?
jakirkham left a comment
Thanks James! 🙏
Had a question below
  delete-temp-images:
    if: ${{ !cancelled() && needs.test.result == 'success' }}
-   needs: [compute-matrix, build-multiarch-manifest, test]
+   needs: [compute-matrix, build, test]
Thought we wanted to finish the multiarch builds before doing cleanup. Or am I misunderstanding something?
I can help explain.
The `build` job publishes images with tags like these:
rapidsai/staging:docker-notebooks-706-24.10a-cuda12.0-py3.11-arm64
rapidsai/staging:docker-notebooks-706-24.10a-cuda12.0-py3.11-amd64
The build-multiarch-manifest job pushes a new manifest to DockerHub which combines those 2 images together, so that something like this will work:
docker pull \
  rapidsai/staging:docker-notebooks-706-24.10a-cuda12.0-py3.11
(notice... no -{arch} suffix)
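For illustration only, the step that combines those two architecture-specific images into one multi-arch tag could be sketched roughly like this (hypothetical job and step names; the actual workflow here may use different tooling):

```yaml
# Hypothetical sketch -- tag names taken from the example above.
build-multiarch-manifest:
  runs-on: ubuntu-latest
  steps:
    - name: create and push combined manifest
      run: |
        # docker buildx imagetools can stitch per-arch images into a single manifest list
        docker buildx imagetools create \
          --tag rapidsai/staging:docker-notebooks-706-24.10a-cuda12.0-py3.11 \
          rapidsai/staging:docker-notebooks-706-24.10a-cuda12.0-py3.11-amd64 \
          rapidsai/staging:docker-notebooks-706-24.10a-cuda12.0-py3.11-arm64
```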
As of #702, that build-multiarch-manifest job is not run on pull requests at all.
docker/.github/workflows/build-test-publish-images.yml
Lines 171 to 172 in 1c27d92
I believe that was intentional, as a way to cut down on CI time and network calls. The combined manifests are a convenience for downstream consumers of these images, and not necessary for one CI job here to read the outputs from a prior job.
Thanks James! 🙏
That's a helpful explanation. I agree we don't want this constraint on PRs.
Though with builds on branches, I think this was a change that Jake introduced in PR (#702) to make sure we were not deleting images before the manifest was generated:
docker/.github/workflows/build-test-publish-images.yml
Lines 232 to 234 in 1c27d92
Otherwise, if a build failed, we would delete the images needed for the manifest, and restarting the failed jobs would just fail immediately (as the manifest cannot be created).
gah you're 100% right. We need to account for that here. I'll push some changes.
That condition could never have worked on PRs, because build-multiarch-manifest doesn't run on PRs (see the related conversation in this other thread: #708 (comment)). Sorry I didn't catch it in previous reviews.
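One way to account for that here (just a sketch, not necessarily the final change) is to put `build-multiarch-manifest` back in `needs:` but treat a skipped run the same as a successful one, since the job is skipped on PRs:

```yaml
delete-temp-images:
  needs: [compute-matrix, build, build-multiarch-manifest, test]
  # allow the manifest job to be skipped (as on PRs) or successful (as on branches),
  # so images are never deleted while a restartable manifest job still needs them
  if: >-
    ${{
      !cancelled() &&
      needs.test.result == 'success' &&
      (needs.build-multiarch-manifest.result == 'success' || needs.build-multiarch-manifest.result == 'skipped')
    }}
```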
  needs:
    - checks
    - compute-matrix
    - build
Are we missing build-multiarch-manifest here?
No, it (I think intentionally) does not run on PRs: #708 (comment)
build-multiarch-manifest was never normally run on PRs:
Ah! Ok great, thanks for checking my understanding there.
I just pushed a commit making build-multiarch-manifest run on PRs (and adding it to pr-builder here).
I think it just makes everything simpler:
- it can now be required by the `pr-builder` job (to be sure it isn't accidentally skipped)
- issues with it can be detected on PRs, instead of only on branch builds
- it can be unconditionally included in the `needs:` block for `delete-temp-images`
Those jobs tend to take like 5-30 seconds each and don't use GPUs. I think that's a pretty small price to pay in exchange for the benefits I listed above.
  delete-temp-images:
    if: ${{ !cancelled() && needs.test.result == 'success' }}
-   needs: [compute-matrix, build-multiarch-manifest, test]
+   needs: [compute-matrix, build, test]
Won't this change mean that for branch builds delete-temp-images will not run at all?
I don't think so? Because this workflow is invoked on branch builds.
docker/.github/workflows/publish.yml
Lines 3 to 6 in 1c27d92
docker/.github/workflows/publish.yml
Lines 21 to 26 in 1c27d92
.... BUT it looks like delete-temp-images is ALREADY not running on branch builds 🙃
Look at the most recent one:
https://github.com/rapidsai/docker/actions/runs/10405157692
And that's because it has `test` in `needs:`, but tests are being skipped, because on branch builds this workflow is invoked like this:
build_type: branch
run_tests: false
These requirements can't all be satisfied:
- "do not run `test` on branch builds"
- "only run `delete-temp-images` after `test` succeeds"
- "always run `delete-temp-images` on branch builds"
@raydouglass what do you think about just completely eliminating delete-temp-images and instead relying on the other, higher-level cleanup of the rapidsai/staging DockerHub repo: https://github.com/rapidsai/workflows/blob/main/.github/workflows/cleanup_staging.yaml
That'd simplify things here a lot, I think.
Not necessarily opposed to letting the cleanup happen there, but what if the condition was updated to be "only run delete-temp-images after the test succeeds or is skipped"?
if: ${{ !cancelled() && (needs.test.result == 'success' || needs.test.result == 'skipped') }} maybe?
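For what it's worth, spelled out as a full job stanza (using the `needs:` list from this PR), that suggested condition would look something like:

```yaml
delete-temp-images:
  needs: [compute-matrix, build, test]
  # !cancelled() keeps this job eligible to run even when `test` is skipped
  # (by default, a job is skipped whenever any of its dependencies is skipped);
  # the result check then covers branch builds (test skipped) and PRs (test succeeded)
  if: ${{ !cancelled() && (needs.test.result == 'success' || needs.test.result == 'skipped') }}
```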
Ok sure, I just pushed a commit with this modified condition.
I do strongly support just dropping delete-temp-images entirely. If you'd prefer to discuss that separately from this PR, let me know and I can write up an issue.
It'd make CI here work more like it does in other RAPIDS repos, which might reduce the incidence of bugs like those being fixed in this PR.
It'd also remove some complexity... no need to keep https://github.com/rapidsai/docker/blob/branch-24.10/ci/delete-temp-images.sh up to date with changes to the set of image tags or changes in DockerHub's API.
It'd also remove a source of network calls in CI here, which reduces the risk of networking-based job failures.
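For reference, a scheduled cleanup of the kind linked above could be sketched like this (hypothetical file and script names; the actual `cleanup_staging.yaml` in `rapidsai/workflows` may be implemented differently):

```yaml
# Hypothetical sketch of a nightly staging-repo cleanup.
name: cleanup-staging
on:
  schedule:
    - cron: "0 6 * * *"  # once a day
jobs:
  delete-old-tags:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: delete rapidsai/staging tags older than 30 days
        # delete_old_tags.py is a placeholder for whatever script talks to the DockerHub API
        run: python ./delete_old_tags.py --repo rapidsai/staging --max-age-days 30
```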
… delete-temp-images runs on branch builds
This is ready to merge, as soon as the branch protections are changed as described in #708 (comment).
Done! Checks look correct!
/merge
Awesome, thanks so much for the help!
Follow-up to #708. Proposes completely removing the `delete-temp-images` job, in favor of relying on the scheduled nightly cleanup at https://github.com/rapidsai/workflows/blob/main/.github/workflows/cleanup_staging.yaml.

## Notes for Reviewers

### Details

CI here writes images to the `rapidsai/staging` repo on DockerHub, then later copies them to individual user-facing repos. To avoid those temporary CI artifacts piling up in the `rapidsai/staging` repo, pull requests and branch builds run a job called `delete-temp-images` which does what it sounds like.

In exchange for more aggressive cleaning, this job introduces significant complexity for development here. Most notably, we've observed several instances where that job deletes images before all CI jobs needing them have completed successfully, leading to all of CI needing to be re-run. Significant effort has been put into trying to avoid that, and we've found it's been difficult to get right.

Some attempts:

* #702
* #708

A recent example:

* #696 (comment)

### Ok, so how will we clean up?

With the workflow at https://github.com/rapidsai/workflows/blob/main/.github/workflows/cleanup_staging.yaml. It runs once a day and deletes anything from `rapidsai/staging` that's more than 30 days old.

### Benefits of these changes

As described in #708 (comment), CI here will work as it does in other RAPIDS repos: if any jobs fail for retryable reasons (like network issues), you can safely click "re-run failed jobs" and make incremental progress towards all builds passing.

It also reduces the need to maintain code that has to keep up with the DockerHub API in two places (by deleting `ci/delete-temp-images.sh` here).

Authors:
- James Lamb (https://github.com/jameslamb)
- Bradley Dice (https://github.com/bdice)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Ray Douglass (https://github.com/raydouglass)
- https://github.com/jakirkham

URL: #709
Follow-up to #702 and #693.
Created based on #696 (comment)
`test` jobs are not currently running on pull requests here, because they require `build-multiarch-manifest` jobs, which have this condition that causes such jobs to be skipped on PR builds:
docker/.github/workflows/build-test-publish-images.yml
Lines 171 to 172 in 1c27d92
This PR ensures that `test` jobs always run on PRs, and that merging is blocked until they succeed.