Describe the bug
In the current deployment of actions-runner-controller, whenever a runner has a job cancelled (via the GitHub UI) or a job that fails (e.g. a failing unit test), it seems to set off a cascading effect that triggers random cancellations of jobs running on other runners in the system. Furthermore, there is a high chance that, at the end of a GitHub workflow, one of the last remaining runners (i.e. a runner that survives until a scale-down is triggered via reconciliation) is cancelled while 'busy' with a workflow job/step.
Checks
- we are not using ephemeral runners at this time, though ephemeral runners appear to have the same issue
- we are on summerwind/actions-runner-controller:v0.20.1 and use the provided GitHub runner image with our own custom OpenJDK installation - the controller is installed via Helm with the following values (a fuller values.yaml sketch follows this list):
  syncPeriod: 5m
  githubAPICacheDuration: 5m
  nodeSelector:
    environment: ops
    purpose: infrastructure
- AWS EKS w/ Cluster Autoscaling and Spot Instances (we have confirmed the spot instances aren't causing a shutdown issue)
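For completeness, the controller values above live in a single values.yaml passed to the chart; a minimal sketch of the full file (the authSecret block is only an illustration of how we supply credentials - the token is a placeholder, not our real value):

  # values.yaml for the actions-runner-controller Helm chart (v0.13.1)
  syncPeriod: 5m
  githubAPICacheDuration: 5m
  nodeSelector:
    environment: ops
    purpose: infrastructure
  # credentials are supplied through the chart's authSecret values (placeholder token)
  authSecret:
    create: true
    github_token: "<GitHub PAT here>"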
To Reproduce
Steps to reproduce the behavior (though the behavior is more random than deterministic):
- Set up actions-runner-controller with the above Helm deployment and the following spec:
  apiVersion: actions.summerwind.dev/v1alpha1
  kind: RunnerDeployment
  metadata:
    namespace: default
    name: org-runners
  spec:
    template:
      metadata:
        annotations:
          cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
      spec:
        dockerdWithinRunnerContainer: true
        organization: <GitHub Org>
        nodeSelector:
          capacity-type: SPOT
          purpose: github-runners
          environment: ops
        labels:
          - java
          - linux
          - eks
          - self-hosted
        image: <GitHub runner image here>
        resources:
          requests:
            cpu: "1.0"
            memory: "10Gi"
  ---
  apiVersion: actions.summerwind.dev/v1alpha1
  kind: HorizontalRunnerAutoscaler
  metadata:
    namespace: default
    name: runners-autoscaler
  spec:
    scaleTargetRef:
      name: org-runners
    scaleDownDelaySecondsAfterScaleOut: 3600
    minReplicas: 1
    maxReplicas: 150
    metrics:
      - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
        repositoryNames:
          - <your repository here>
- Run a large set (32-64 jobs) workflow and cancel one or two of the jobs (a sketch of such a workflow is shown below).
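For reference, a minimal sketch of the kind of workflow that reproduces this for us - the workflow name, shard count, and test script are illustrative placeholders; our real workflows fan out into 32-64 jobs that each run 6-20 minutes:

  name: repro-cascading-cancellations
  on: workflow_dispatch
  jobs:
    tests:
      strategy:
        # a single failure/cancellation should not abort sibling matrix jobs
        fail-fast: false
        matrix:
          # extend toward 32-64 shards to match the scale described in this issue
          shard: [1, 2, 3, 4, 5, 6, 7, 8]
      # labels match the RunnerDeployment above
      runs-on: [self-hosted, linux, eks, java]
      steps:
        - uses: actions/checkout@v2
        # placeholder for the 6-20 minute test jobs we actually run
        - run: ./run-tests.sh ${{ matrix.shard }}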
Expected behavior
The expectation is that the remaining jobs continue, regardless of the success or failure of any individual job, and that the workflow finishes the appointed jobs/tasks.
Screenshots
No screenshots, but the affected job logs simply end with: Error: The Operation was canceled.
Environment (please complete the following information):
- Controller Version: v0.20.1
- Deployment Method: Helm
- Helm Chart Version: v0.13.1
Additional context
The workflow is extremely active and can have anywhere between 32 and 64 running jobs. Each job takes between 6 and 20 minutes, depending on the test it is deploying. From what I have seen, the logs for 'unregistered' runners appear to have an impact once those runners come up for reconciliation. I see a lot of:
2021-10-26T05:41:39.576Z DEBUG actions-runner-controller.runnerreplicaset Failed to check if runner is busy. Either this runner has never been successfully registered to GitHub or it still needs more time. {"runnerreplicaset": "default/org-runners-749hr", "runnerName": "org-runners-749hr-gr85x"}
2021-10-26T05:41:40.172Z INFO actions-runner-controller.runner Skipped registration check because it's deferred until 2021-10-26 05:42:30 +0000 UTC. Retrying in 48.827500109s at latest {"runner": "default/org-runners-749hr-w9ll4", "lastRegistrationCheckTime": "2021-10-26 05:41:30 +0000 UTC", "registrationCheckInterval": "1m0s"}
Based on this behavior, I wonder whether the reconciler is taking the registration timeout of a previous pod and applying it to these runners as they are trying to catch up, resulting in a runner being marked for deletion even though it is only a minute or two into its registration check interval.
With that being said, I was hoping to increase these intervals and add more of a grace period before the controller shuts down a runner that it doesn't 'know' to be busy or not, but honestly I can't tell if that is what is happening.
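As an experiment we are considering relaxing the controller's reconcile cadence via the Helm values - purely on the assumption that syncPeriod/githubAPICacheDuration influence how quickly the controller acts on runners it cannot confirm as busy; we are not certain these knobs govern the registration check interval seen in the logs above:

  # experimental values - assumption: a longer sync period gives runners more
  # time to register before the controller decides they are safe to delete
  syncPeriod: 10m
  githubAPICacheDuration: 10m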
Any insight would be helpful - and let me know if there is anything else I can provide!