
Drop CUDA 12.2 Docker images #696

Merged
rapids-bot[bot] merged 6 commits into rapidsai:branch-24.10 from
bdice:drop-cuda-12.2
Aug 29, 2024

Conversation

@bdice
Contributor

@bdice bdice commented Jul 26, 2024

Following rapidsai/docs#526, we can remove CUDA 12.2 from the RAPIDS 24.10 Docker images.

@bdice bdice changed the base branch from branch-24.08 to branch-24.10 July 26, 2024 18:06
@bdice bdice marked this pull request as ready for review July 26, 2024 18:06
@bdice bdice requested a review from a team as a code owner July 26, 2024 18:06
@bdice bdice requested a review from AyodeAwe July 26, 2024 18:06
@bdice bdice self-assigned this Jul 26, 2024
Member

@jameslamb jameslamb left a comment

Awesome, thanks for getting that deprecation notice into the 24.08 release.

@jakirkham
Member

Thanks Bradley and James! 🙏

@jameslamb
Member

I just noticed that tests are not running here on PRs any more.

https://github.com/rapidsai/docker/actions/runs/10392124806?pr=696

I missed that on #702, and `test` should have been added to the checks in #693 😬

I can put up a PR right now to fix that. It shouldn't block this particular PR though, as this is just deleting things.

rapids-bot bot pushed a commit that referenced this pull request Aug 20, 2024
Follow-up to #702 and #693.

Created based on #696 (comment)

`test` jobs are not currently running on pull requests here, because they require `build-multiarch-manifest` jobs, which have this condition that causes such jobs to be skipped on PR builds:

https://github.com/rapidsai/docker/blob/1c27d9245fd9d99ee35981b970acaf10961ca45b/.github/workflows/build-test-publish-images.yml#L171-L172

This PR ensures that `test` jobs always run on PRs, and that merging is blocked until they succeed.
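The skip behavior described above can be sketched as a small model. This is only an illustration: it assumes the GitHub Actions default that a job is skipped when a job it needs did not succeed, and the `skip_on_pr` flag is a hypothetical stand-in for the linked `if:` condition, whose exact text is not reproduced here. It models the problem only, not the fix applied in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    needs: list = field(default_factory=list)
    # Hypothetical stand-in for an `if:` condition that is false on pull requests.
    skip_on_pr: bool = False


def resolve_results(jobs, is_pull_request):
    """Return {job name: "success" | "skipped"} for jobs listed in dependency order."""
    results = {}
    for job in jobs:
        own_condition_false = is_pull_request and job.skip_on_pr
        # Assumed default: a job is skipped if any job it needs did not succeed.
        upstream_not_successful = any(results[n] != "success" for n in job.needs)
        if own_condition_false or upstream_not_successful:
            results[job.name] = "skipped"
        else:
            results[job.name] = "success"
    return results


pipeline = [
    Job("build"),
    Job("build-multiarch-manifest", needs=["build"], skip_on_pr=True),
    Job("test", needs=["build-multiarch-manifest"]),
]

pr_results = resolve_results(pipeline, is_pull_request=True)
branch_results = resolve_results(pipeline, is_pull_request=False)
```

In this model `test` ends up skipped on pull requests even though it has no PR-specific condition of its own, because the skip propagates through `needs`.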

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Jake Awe (https://github.com/AyodeAwe)
  - Ray Douglass (https://github.com/raydouglass)

URL: #708
@bdice
Contributor Author

bdice commented Aug 20, 2024

@jameslamb There is an error in the CI that I am not expecting. Can you take a look? Seems related to the recent CI changes.

@jameslamb
Member

Ok I was confused by the mix of results at first because some jobs were re-run and some weren't. Looking at just the first run, I see what happened:

https://github.com/rapidsai/docker/actions/runs/10477159554/attempts/1?pr=696

Some `build` jobs failed, which led to `build-multiarch-manifest` and `test` having the status "skipped" or something similar, which allowed `delete-temp-images` to run. That deleted some of the images created by `build`, which is why your re-runs of `build-multiarch-manifest` failed.
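A rough sketch of why "skipped" and "failed" behave differently here. The actual condition on `delete-temp-images` is not shown in this thread, so both functions below are hypothetical conditions, and which upstream jobs the cleanup job checks is also an assumption.

```python
def cleanup_runs_naive(upstream_results):
    """Hypothetical condition: run cleanup unless an upstream job reported "failure"."""
    return all(result != "failure" for result in upstream_results.values())


def cleanup_runs_strict(upstream_results):
    """Hypothetical condition: run cleanup only when every upstream job succeeded."""
    return all(result == "success" for result in upstream_results.values())


# First attempt as described: a failed build left the downstream jobs skipped.
# Assumption: the cleanup job only looks at these two downstream jobs.
first_attempt = {"build-multiarch-manifest": "skipped", "test": "skipped"}

naive_decision = cleanup_runs_naive(first_attempt)
strict_decision = cleanup_runs_strict(first_attempt)
```

Under the naive condition the cleanup runs, because nothing it checks is marked as failed; under the strict one it does not. Getting every job's condition to agree on this distinction is the kind of complexity being discussed.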

This is why I want to just remove `delete-temp-images`: #708 (comment)

This complexity of handling the difference between "failed" and "skipped" is just not worth it here, in my opinion, when we already have another mechanism for cleaning up old stuff.

I'm going to put up a PR proposing that.

For now, for you here, you'll need to do a full re-run (or just wait for that other PR first).

rapids-bot bot pushed a commit that referenced this pull request Aug 22, 2024
Follow-up to #708.

Proposes completely removing the `delete-temp-images` job, in favor of relying on the scheduled nightly cleanup at https://github.com/rapidsai/workflows/blob/main/.github/workflows/cleanup_staging.yaml.

## Notes for Reviewers

### Details

CI here writes images to the `rapidsai/staging` repo on DockerHub, then later copies them to individual user-facing repos.
To avoid those temporary CI artifacts piling up in the `rapidsai/staging` repo, pull requests and branch builds run a job called `delete-temp-images` which does what it sounds like.

In exchange for more aggressive cleaning, this job introduces significant complexity for development here. Most notably, we've observed several instances where that job deletes images before all CI jobs needing them have completed successfully, leading to all of CI needing to be re-run.

Significant effort has been put into trying to avoid that, and we've found it's been difficult to get it right:

some attempts:

* #702
* #708

a recent example:

* #696 (comment)

### Ok so how will we clean up?

The workflow at https://github.com/rapidsai/workflows/blob/main/.github/workflows/cleanup_staging.yaml.

It runs once a day and deletes anything from `rapidsai/staging` that's more than 30 days old.
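As an illustration of that retention rule only (this is not the code in `cleanup_staging.yaml`), an age-based selection might look like the following sketch. The tag names, timestamps, and the `select_expired` helper are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Retention window taken from the description above.
MAX_AGE = timedelta(days=30)


def select_expired(images, now, max_age=MAX_AGE):
    """Return the tags whose last-pushed time is more than max_age before now."""
    return [tag for tag, pushed_at in images.items() if now - pushed_at > max_age]


now = datetime(2024, 8, 22, tzinfo=timezone.utc)

# Hypothetical staging tags mapped to hypothetical push times.
images = {
    "tag-recent": now - timedelta(days=2),
    "tag-old": now - timedelta(days=31),
    "tag-boundary": now - timedelta(days=30),
}

expired = select_expired(images, now)
```

With this rule a tag pushed exactly 30 days ago is kept, and anything a CI run pushed recently stays available, which is what makes re-running individual failed jobs safe.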

### Benefits of these changes

As described in #708 (comment) ...

CI here will work as it does in other RAPIDS repos.... if any jobs fail for retryable reasons (like network issues), you can safely click "re-run failed jobs" and make incremental progress towards all builds passing.

Also reduces the need to maintain code that has to keep up with the DockerHub API in two places (by deleting `ci/delete-temp-images.sh` here).

Authors:
  - James Lamb (https://github.com/jameslamb)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Ray Douglass (https://github.com/raydouglass)
  - https://github.com/jakirkham

URL: #709
@jameslamb
Member

I've merged latest branch-24.10 into this, to re-trigger CI and to pull in the changes from #709.

If any builds fail with temporary issues like network errors, just re-running those failed jobs should be safe and successful.

@jameslamb
Member

Most build jobs here were failing with what look like temporary errors. I've restarted them.

Comment thread matrix-test.yaml Outdated
@raydouglass
Contributor

/merge

@rapids-bot rapids-bot bot merged commit fd9100e into rapidsai:branch-24.10 Aug 29, 2024
@jakirkham
Member

Thanks all! 🙏
