Skip to content

Conversation

@mumoshu
Copy link
Collaborator

@mumoshu mumoshu commented Mar 11, 2022

We had been retrying RemoveRunner API call on each graceful runner stop attempt when the runner was still busy.
And there was no reliable way to throttle the retry attempts. That resulted in ARC spamming RemoveRunner calls(one call per a reconciliation loop but the loop runs quite often due to how the controller works) when it failed once due to that the runner is in the middle of running a workflow job.

This fixes that, by adding a few short-circuit conditions that would work for ephemeral runners. An ephemeral runner can unregister itself on completion so in most of cases ARC can just wait for the runner to stop if it's already running a job. As a RemoveRunner response of status 422 implies that the runner is running a job, we can use that as a trigger to start the runner stop waiter.

The end result is that 422 errors will be observed at most once per the whole graceful termination process of an ephemeral runner pod. RemoveRunner API calls are never retried for ephemeral runners. ARC consumes less GitHub API rate limit budget and logs are much cleaner than before.

Ref #1167 (comment)

@mumoshu mumoshu added this to the v0.22.0 milestone Mar 11, 2022
@mumoshu mumoshu changed the title Prevent RemoveRunner spam on ephemeral runner scale down Prevent RemoveRunner spam on busy ephemeral runner scale down Mar 11, 2022
@mumoshu mumoshu force-pushed the fix-delete-runner-log-spam branch from 7e52798 to 03f5cca Compare March 11, 2022 09:58
@mumoshu mumoshu merged commit 9628bb2 into master Mar 11, 2022
@mumoshu mumoshu deleted the fix-delete-runner-log-spam branch March 11, 2022 10:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants