restructure AMD scheduled CI #27743

Merged
ydshieh merged 4 commits into main from fix_amd_scheduled_ci on Dec 4, 2023

@ydshieh ydshieh commented Nov 28, 2023

What does this PR do?

So far, the AMD scheduled CI has run as a single workflow that contains both mi210 and mi250 (each with ~500 jobs).


This causes 2 issues:

  • the workflow run page is too large to display (GitHub shows a unicorn image with "This page is taking too long to load.")
  • the artifacts produced by the mi210 and mi250 runs are mixed (they overwrite each other), so the report might be inaccurate.

This PR restructures the AMD scheduled CI so that mi210 and mi250 run in 2 separate workflow runs, which avoids the above 2 issues.

with:
gpu_flavor: mi250
secrets: inherit
run_scheduled_amd_ci:
ydshieh (Collaborator, Author) commented:

This triggers the 2 new workflow files, mi210-caller and mi250-caller, via the workflow_run event. They will run as 2 independent workflow runs.

CI_SLACK_REPORT_CHANNEL_ID: ${{ secrets.CI_SLACK_CHANNEL_ID_DAILY_AMD }}
ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }}
- CI_EVENT: Scheduled CI (AMD)
+ CI_EVENT: Scheduled CI (AMD) - ${{ inputs.gpu_flavor }}
ydshieh (Collaborator, Author) commented:

Just a tiny fix so we can see

Results of the Scheduled CI (AMD) - mi250 tests.

instead of

Results of the Scheduled CI (AMD) tests.

on Slack.

Comment on lines +129 to +131
self.n_additional_single_gpu_failures = 0
self.n_additional_multi_gpu_failures = 0
self.n_additional_unknown_gpu_failures = 0
ydshieh (Collaborator, Author) commented:

For the AMD push CI, additional_results is empty ({}), and dicts_to_sum does not work in this case (it fails in functools.reduce).
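The failure mode mentioned in this comment can be reproduced in isolation. The sketch below is not the repository's `dicts_to_sum`; `sum_counts_unguarded` and `sum_counts_guarded` are hypothetical stand-ins that assume a helper built on `functools.reduce` over `Counter` objects. It shows that `functools.reduce` raises `TypeError` on an empty input when no initial value is given, and that supplying an initial value is one possible guard. The change under review appears to take a different route, defaulting the three `n_additional_*` counters to 0.

```python
import functools
import operator
from collections import Counter


def sum_counts_unguarded(dicts):
    # Hypothetical stand-in for a reduce-based summing helper:
    # no initial value, so an empty input raises TypeError.
    return functools.reduce(operator.add, map(Counter, dicts))


def sum_counts_guarded(dicts):
    # Passing Counter() as the initial value makes reduce safe on empty input.
    # Note: Counter addition drops keys whose total is zero or negative.
    return functools.reduce(operator.add, map(Counter, dicts), Counter())


empty_results = {}  # mirrors the empty additional_results case

try:
    sum_counts_unguarded(empty_results.values())
    unguarded_error = None
except TypeError as exc:
    unguarded_error = exc  # reduce() refuses an empty iterable without an initial value

guarded_total = sum_counts_guarded(empty_results.values())
nonempty_total = sum_counts_guarded([{"single": 1, "multi": 2}, {"single": 3}])
```

With the guard in place, an empty input yields an empty `Counter`, so any failure count derived from it would naturally be zero.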

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@ydshieh ydshieh requested review from LysandreJik and removed request for LysandreJik November 28, 2023 13:26
@LysandreJik LysandreJik (Member) left a comment

Great, thanks @ydshieh!

@ydshieh ydshieh merged commit e0d2e69 into main Dec 4, 2023
@ydshieh ydshieh deleted the fix_amd_scheduled_ci branch December 4, 2023 14:32