Skip to content

ensure test jobs always run on PRs#708

Merged
rapids-bot[bot] merged 5 commits intorapidsai:branch-24.10from
jameslamb:run-tests
Aug 20, 2024
Merged

ensure test jobs always run on PRs#708
rapids-bot[bot] merged 5 commits intorapidsai:branch-24.10from
jameslamb:run-tests

Conversation

@jameslamb
Copy link
Copy Markdown
Member

@jameslamb jameslamb commented Aug 14, 2024

Follow-up to #702 and #693.

Created based on #696 (comment)

test jobs are not currently running on pull requests here, because they require build-multiarch-manifest jobs, which have this condition that causes such jobs to be skipped on PR builds:

build-multiarch-manifest:
if: ${{ !cancelled() && inputs.build_type == 'branch' }}

This PR ensures that test jobs always run on PRs, and that merging is blocked until they succeed.

@jameslamb jameslamb added bug Something isn't working testing labels Aug 14, 2024
@jameslamb
Copy link
Copy Markdown
Member Author

It looks to me like the test jobs either need to depend on the build-multiarch-manifest job, or need to include architecture in their expected image tags.

See this failing build:

/usr/local/bin/docker pull rapidsai/staging:docker-notebooks-708-24.10a-cuda11.8-py3.9
  Error response from daemon: manifest for rapidsai/staging:docker-notebooks-708-24.10a-cuda11.8-py3.9 not found: manifest unknown: manifest unknown
  Warning: Docker pull failed with exit code 1, back off 4.989 seconds before retry.
  /usr/local/bin/docker pull rapidsai/staging:docker-notebooks-708-24.10a-cuda11.8-py3.9
  Error response from daemon: manifest for rapidsai/staging:docker-notebooks-708-24.10a-cuda11.8-py3.9 not found: manifest unknown: manifest unknown
  Warning: Docker pull failed with exit code 1, back off 9.9[16](https://github.com/rapidsai/docker/actions/runs/10393969779/job/28784483770?pr=708#step:3:19) seconds before retry.
  /usr/local/bin/docker pull rapidsai/staging:docker-notebooks-708-24.10a-cuda11.8-py3.9
  Error response from daemon: manifest for rapidsai/staging:docker-notebooks-708-24.10a-cuda11.8-py3.9 not found: manifest unknown: manifest unknown
  Error: Docker pull failed with exit code 1

(build link)

There are only architecture-specific tags available at https://hub.docker.com/r/rapidsai/staging/tags?page=&page_size=&ordering=&name=docker-notebooks-708-24.10a-cuda11.8-py3.9.

Screenshot 2024-08-14 at 3 43 53 PM

I just pushed 11d05f7 adding the architecture to NOTEBOOK_TAG. Since in #702 it looks like the build-multiarch-manifest job was intentionally skipped for PRs.

@jameslamb
Copy link
Copy Markdown
Member Author

Here's an example where one build failed with a network error, and where I was able to re-run only the failed job (the main design goal of #702):

Screenshot 2024-08-14 at 4 33 31 PM

https://github.com/rapidsai/docker/actions/runs/10394821734/job/28785524564?pr=708

And I can see that the delete-* jobs have not run yet, so all the images needed by the test jobs should exist.

@jameslamb
Copy link
Copy Markdown
Member Author

Ok I think this is working!

✅ was able to run "re-run failed jobs" on build jobs (no need to re-run all of them)
✅ all test jobs ran
✅ 0 test jobs ran until all build jobs passed
✅ 0 delete-temp-images job ran until all test jobs passed
✅ all delete-temp-images ran successfully, and there are 0 images left behind from this PR (DockerHub search for '-708' tags for 'rapidsai/staging')

@jameslamb
Copy link
Copy Markdown
Member Author

@raydouglass @AyodeAwe if you agree with the proposal in this PR, could you please change which pr-builder is required in the branch protection checks? It should now be ci / docker / pr-builder / run.

Screenshot 2024-08-15 at 10 10 13 AM

@jameslamb jameslamb marked this pull request as ready for review August 15, 2024 15:11
@jameslamb jameslamb requested a review from a team as a code owner August 15, 2024 15:11
@jameslamb jameslamb requested a review from raydouglass August 15, 2024 15:11
@jameslamb jameslamb changed the title WIP: ensure test jobs always run on PRs ensure test jobs always run on PRs Aug 15, 2024
Comment thread .github/workflows/pr.yml
Copy link
Copy Markdown
Member

@jakirkham jakirkham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks James! 🙏

Had a question below

delete-temp-images:
if: ${{ !cancelled() && needs.test.result == 'success' }}
needs: [compute-matrix, build-multiarch-manifest, test]
needs: [compute-matrix, build, test]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought we wanted to finish the multiarch builds before doing cleanup. Or am I misunderstanding something?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can help explain.

build publishes images with tags like this:

rapidsai/staging:docker-notebooks-706-24.10a-cuda12.0-py3.11-arm64
rapidsai/staging:docker-notebooks-706-24.10a-cuda12.0-py3.11-amd64

The build-multiarch-manifest job pushes a new manifest to DockerHub which combines those 2 images together, so that something like this will work:

docker pull \
    rapidsai/staging:docker-notebooks-706-24.10a-cuda12.0-py3.11

(notice .... no -{arch} suffix)

As of #702, that build-multiarch-manifest job is not run on pull requests at all.

build-multiarch-manifest:
if: ${{ !cancelled() && inputs.build_type == 'branch' }}

I believe that was intentional, as a way to cut down on CI time and network calls. The combined manifests are a convenience for downstream consumers of these images, and not necessary for one CI job here to read the outputs from a prior job.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks James! 🙏

That's a helpful explanation. Agree we don't want this constraint in PRs

Though I think with builds on branches, this was a change that Jake introduced in PR ( #702 ) to make sure we were not deleting images before the manifest was generated

delete-temp-images:
if: ${{ !cancelled() && needs.test.result == 'success' }}
needs: [compute-matrix, build-multiarch-manifest, test]

Otherwise if a build failed, we would delete the images needed for the manifest and restarting the failed jobs would just fail immediately (as the manifest cannot be created)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gah you're 100% right. We need to account for that here. I'll push some changes.

That condition never could have worked on PRs, because build-multiarch-manifest doesn't run on PRs (see related conversation in this other thread: #708 (comment)). Sorry I didn't catch it in previous reviews.

Comment thread .github/workflows/build-test-publish-images.yml Outdated
needs:
- checks
- compute-matrix
- build
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we missing build-multiarch-manifest here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it (I think intentionally) does not run on PRs: #708 (comment)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AyodeAwe what was the reason for not running build-multiarch-manifest on PRs any more as of #702?

I made some assumptions here but maybe those assumptions are wrong.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build-multiarch-manifest was never usually run in PRs:

if: inputs.build_type == 'branch'

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! Ok great, thanks for checking my understanding there.

Copy link
Copy Markdown
Member Author

@jameslamb jameslamb Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pushed a commit making build-multiarch-manifest run on PRs (and adding it to pr-builder here).

I think it just makes everything simpler...:

  • it can now be required by the pr-builder job (to be sure it isn't accidentally skipped)
  • issues with it can be detected on PRs, instead of only on branch builds
  • it can be unconditionally included in the needs: block for delete-temp-images

Those jobs tend to take like 5-30 seconds each and don't use GPUs. I think that's a pretty small price to pay in exchange for the benefits I listed above.

delete-temp-images:
if: ${{ !cancelled() && needs.test.result == 'success' }}
needs: [compute-matrix, build-multiarch-manifest, test]
needs: [compute-matrix, build, test]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this change mean that for branch builds delete-temp-images will not run at all?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so? Because this workflow is invoked on branch builds.

on:
push:
branches:
- "branch-*"

docker:
uses: ./.github/workflows/build-test-publish-images.yml
with:
build_type: branch
run_tests: ${{ inputs.run_tests || false }}
secrets: inherit

.... BUT it looks like delete-temp-images is ALREADY not running on branch builds 🙃

Look at the most recent one:

https://github.com/rapidsai/docker/actions/runs/10405157692

Screenshot 2024-08-15 at 12 19 51 PM

And that's because it has [test] in needs: but tests are being skipped, because on branch builds this is run like this:

build_type: branch
run_tests: false

Copy link
Copy Markdown
Member Author

@jameslamb jameslamb Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These requirements can't all be satisfied:

  • "do not run test on branch builds"
  • "only run delete-temp-images after the test succeeds"
  • "always run delete-temp-images on branch builds"

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raydouglass what do you think about just completely eliminating delete-temp-images and instead relying on the other, higher-level cleanup of the rapidsai/staging DockerHub repo: https://github.com/rapidsai/workflows/blob/main/.github/workflows/cleanup_staging.yaml

That'd simplify things here a lot, I think.

Copy link
Copy Markdown
Contributor

@raydouglass raydouglass Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily opposed to letting the cleanup happen there, but what if the condition was updated to be "only run delete-temp-images after the test succeeds or is skipped"?

if: ${{ !cancelled() && (needs.test.result == 'success' || needs.test.result == 'skipped') }} maybe?

Copy link
Copy Markdown
Member Author

@jameslamb jameslamb Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok sure, I just pushed a commit with this modified condition.

I do strongly support just dropping delete-temp-images entirely. If you'd prefer to discuss that separately from this PR let me know, I can write up an issue.

It'd make CI here work more like it does in other RAPIDS repos, which might reduce the incidence of bugs like those being fixed in this PR.

It'd also remove some complexity... no need to keep https://github.com/rapidsai/docker/blob/branch-24.10/ci/delete-temp-images.sh up to date with changes to the set of image tags or changes in DockerHub's API.

It'd also remove a source of network calls in CI here, which reduces the risk of networking-based job failures.

@jameslamb
Copy link
Copy Markdown
Member Author

This is ready to merge, as soon as the branch protections are changed as described in #708 (comment)

@raydouglass
Copy link
Copy Markdown
Contributor

This is ready to merge, as soon as the branch protections are changed as described in #708 (comment)

Done! Checks look correct!

@raydouglass
Copy link
Copy Markdown
Contributor

/merge

@rapids-bot rapids-bot bot merged commit afd8c20 into rapidsai:branch-24.10 Aug 20, 2024
@jameslamb jameslamb deleted the run-tests branch August 20, 2024 15:55
@jameslamb
Copy link
Copy Markdown
Member Author

Awesome, thanks so much for the help!

rapids-bot bot pushed a commit that referenced this pull request Aug 22, 2024
Follow-up to #708.

Proposes completely removing the `delete-temp-images` job, in favor of relying on the scheduled nightly cleanup at https://github.com/rapidsai/workflows/blob/main/.github/workflows/cleanup_staging.yaml.

## Notes for Reviewers

### Details

CI here writes images to the `rapidsai/staging` repo on DockerHub, then later copies them to individual user-facing repos.
To avoid those temporary CI artifacts piling up in the `rapidsai/staging` repo, pull requests and branch builds run a job called `delete-temp-images` which does what it sounds like.

In exchange for more aggressive cleaning, this job introduces significant complexity for development here. Most notably, we've observed several instances where that job deletes images before all CI jobs needing them have completed successfully, leading to all of CI needing to be re-run.

Significant effort has been put into trying to avoid that, and we've found it's been difficult to get it right:

some attempts:

* #702
* #708

a recent example:

* #696 (comment)

### Ok so how will we clean up?

The workflow at https://github.com/rapidsai/workflows/blob/main/.github/workflows/cleanup_staging.yaml.

It runs once a day and deletes anything from `rapidsai/staging` that's more than 30 days old.

### Benefits of these changes

As described in #708 (comment) ...

CI here will work as it does in other RAPIDS repos.... if any jobs fail for retryable reasons (like network issues), you can safely click "re-run failed jobs" and make incremental progress towards all builds passing.

Also reduces the need to maintain code that has to keep up with the DockerHub API in two places (by deleting `ci/delete-temp-images.sh` here).

Authors:
  - James Lamb (https://github.com/jameslamb)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Ray Douglass (https://github.com/raydouglass)
  - https://github.com/jakirkham

URL: #709
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants