Random Operation Cancelled Runner Decommissions  #911

@jbkc85

Description

Describe the bug

In our current deployment of actions-runner-controller, whenever a job on a GitHub runner is cancelled (via the GitHub UI) or fails (e.g. a unit test failure), it appears to trigger a cascade of seemingly random cancellations of jobs running on other runners in the system. In addition, there is a high chance that near the end of a GitHub workflow, one of the last remaining runners (i.e. a runner that survives until a scale-down is triggered during reconciliation) is cancelled while it is still 'busy' with a workflow job/step.

Checks

  • we are not using ephemeral runners at this time, though ephemeral runners appear to exhibit the same issue
  • we are on summerwind/actions-runner-controller:v0.20.1 and use the provided GitHub runner image with our own custom OpenJDK installation
  • controller is installed via Helm with the following values (a sketch of how they are applied follows this list):
syncPeriod: 5m
githubAPICacheDuration: 5m
nodeSelector:
  environment: ops
  purpose: infrastructure
  • AWS EKS w/ Cluster Autoscaling and Spot Instances (confirmed spot instances aren't causing a shutdown issue)
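
For reference, a minimal sketch of how those values are applied, assuming they live in a values.yaml and the chart's usual repo alias and namespace (adjust to whatever you use locally):

# values.yaml, applied with something like:
#   helm upgrade --install actions-runner-controller \
#     actions-runner-controller/actions-runner-controller \
#     --namespace actions-runner-system -f values.yaml
syncPeriod: 5m
githubAPICacheDuration: 5m
nodeSelector:
  environment: ops
  purpose: infrastructure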

To Reproduce
Steps to reproduce the behavior (though it is more random than deterministic):

  1. Set up actions-runner-controller with the above Helm deployment and the following spec:
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  namespace: default
  name: org-runners
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      dockerdWithinRunnerContainer: true
      organization: <Github Org>
      nodeSelector:
        capacity-type: SPOT
        purpose: github-runners
        environment: ops
      labels:
        - java
        - linux
        - eks
        - self-hosted
      image: <GitHub-runner image here>
      resources:
        requests:
          cpu: "1.0"
          memory: "10Gi"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  namespace: default
  name: runners-autoscaler
spec:
  scaleTargetRef:
    name: org-runners
  scaleDownDelaySecondsAfterScaleOut: 3600
  minReplicas: 1
  maxReplicas: 150
  metrics:
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
    repositoryNames:
    - <your repository here>
  2. Run a large workflow (32-64 parallel jobs) and cancel one or two of the jobs (a sketch of such a workflow follows below).
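
The workflows that hit this are large test matrices; a minimal sketch of that shape, using the runner labels from the spec above (the trigger, shard count, and test script are placeholders rather than our real workflow):

name: large-test-matrix
on: workflow_dispatch
jobs:
  test:
    strategy:
      # a single failed or cancelled job should not cancel its siblings
      fail-fast: false
      matrix:
        # expands to 32-64 parallel jobs in the real workflow
        shard: [1, 2, 3, 4, 5, 6, 7, 8]
    runs-on: [self-hosted, linux, eks, java]
    steps:
      - uses: actions/checkout@v2
      - name: Run test shard
        # placeholder for the real 6-20 minute test step
        run: ./run-tests.sh ${{ matrix.shard }}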

Expected behavior
The expectation is that the remaining jobs continue, regardless of any individual job's success or failure, and that the workflow finishes the appointed jobs/tasks.

Screenshots
No screenshots, but the affected job logs simply show Error: The operation was canceled.

Environment (please complete the following information):

  • Controller Version: v0.20.1
  • Deployment Method: Helm
  • Helm Chart Version: v0.13.1

Additional context
The workflow is extremely active and can have anywhere between 32 and 64 jobs running at once. Each job takes between 6 and 20 minutes, depending on the tests it runs. From the controller logs, it appears that runners the controller considers 'unregistered' are affected once they come up for reconciliation. I see a lot of:

2021-10-26T05:41:39.576Z	DEBUG	actions-runner-controller.runnerreplicaset	Failed to check if runner is busy. Either this runner has never been successfully registered to GitHub or it still needs more time.	{"runnerreplicaset": "default/org-runners-749hr", "runnerName": "org-runners-749hr-gr85x"}
2021-10-26T05:41:40.172Z	INFO	actions-runner-controller.runner	Skipped registration check because it's deferred until 2021-10-26 05:42:30 +0000 UTC. Retrying in 48.827500109s at latest	{"runner": "default/org-runners-749hr-w9ll4", "lastRegistrationCheckTime": "2021-10-26 05:41:30 +0000 UTC", "registrationCheckInterval": "1m0s"}

Based on this behavior, I wonder whether the reconciler takes the registration timeout of a previous pod and applies it to these runners as they are trying to catch up, resulting in a runner being marked for deletion even though it is only a minute or two into its registration check interval.

With that being said, I was hoping to increase these intervals and add more of a grace period before the controller shuts down a runner when it doesn't 'know' whether the runner is busy, but honestly I can't tell if that is what is happening.
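
Concretely, the change I have in mind is just raising the intervals that already appear in the Helm values above; a sketch of what I would try (the numbers are guesses, not tested recommendations):

# controller Helm values: reconcile and hit the GitHub API less often,
# so runners that are still registering get more breathing room
syncPeriod: 10m
githubAPICacheDuration: 10m

That said, I don't see an obvious knob for the 1m0s registrationCheckInterval that shows up in the log above, which is part of what I am asking about.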

Any insight would be helpful - and let me know if there is anything else I can provide!

Labels

enhancement (New feature or request), question (Further information is requested)
