Random Operation Cancelled Runner Decommissions  #911

@jbkc85

Description

Describe the bug

In our current deployment of actions-runner-controller, whenever a job on a GitHub runner is cancelled (via the GitHub UI) or fails (e.g. a unit test failure), it appears to trigger a cascade of seemingly random cancellations of jobs running on other runners in the system. In addition, there is a high chance that near the end of a GitHub workflow, one of the last remaining runners (i.e. a runner that survives until a scale-down is triggered during reconciliation) is cancelled while it is still 'busy' with a workflow job/step.

Checks

  • we are not using ephemeral runners at this time, though ephemeral runners appear to exhibit the same issue
  • we are on summerwind/actions-runner-controller:v0.20.1 and use the provided GitHub runner image with our own custom OpenJDK installation
  • controller is installed via Helm with the following values (a sketch of how they are applied follows this list):
syncPeriod: 5m
githubAPICacheDuration: 5m
nodeSelector:
  environment: ops
  purpose: infrastructure
  • AWS EKS w/ Cluster Autoscaling and Spot Instances (confirmed spot instances aren't causing a shutdown issue)
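
For reference, a minimal sketch of how those values are applied, assuming they live in a values.yaml and the chart's usual repo alias and namespace (adjust to whatever you use locally):

# values.yaml, applied with something like:
#   helm upgrade --install actions-runner-controller \
#     actions-runner-controller/actions-runner-controller \
#     --namespace actions-runner-system -f values.yaml
syncPeriod: 5m
githubAPICacheDuration: 5m
nodeSelector:
  environment: ops
  purpose: infrastructure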

To Reproduce
Steps to reproduce the behavior (though it is more random than deterministic):

  1. Set up actions-runner-controller with the above Helm deployment and the following spec:
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  namespace: default
  name: org-runners
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      dockerdWithinRunnerContainer: true
      organization: <Github Org>
      nodeSelector:
        capacity-type: SPOT
        purpose: github-runners
        environment: ops
      labels:
        - java
        - linux
        - eks
        - self-hosted
      image: <GitHub-runner image here>
      resources:
        requests:
          cpu: "1.0"
          memory: "10Gi"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  namespace: default
  name: runners-autoscaler
spec:
  scaleTargetRef:
    name: org-runners
  scaleDownDelaySecondsAfterScaleOut: 3600
  minReplicas: 1
  maxReplicas: 150
  metrics:
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
    repositoryNames:
    - <your repository here>
  2. Run a large workflow (32-64 parallel jobs) and cancel one or two of the jobs (a sketch of such a workflow follows below).
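
The workflows that hit this are large test matrices; a minimal sketch of that shape, using the runner labels from the spec above (the trigger, shard count, and test script are placeholders rather than our real workflow):

name: large-test-matrix
on: workflow_dispatch
jobs:
  test:
    strategy:
      # a single failed or cancelled job should not cancel its siblings
      fail-fast: false
      matrix:
        # expands to 32-64 parallel jobs in the real workflow
        shard: [1, 2, 3, 4, 5, 6, 7, 8]
    runs-on: [self-hosted, linux, eks, java]
    steps:
      - uses: actions/checkout@v2
      - name: Run test shard
        # placeholder for the real 6-20 minute test step
        run: ./run-tests.sh ${{ matrix.shard }}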

Expected behavior
The expectation is that the remaining jobs continue, regardless of any individual job's success or failure, and that the workflow finishes the appointed jobs/tasks.

Screenshots
No screenshots, but the affected job logs simply show Error: The operation was canceled.

Environment (please complete the following information):

  • Controller Version: v0.20.1
  • Deployment Method: Helm
  • Helm Chart Version: v0.13.1

Additional context
The workflow is extremely active and can have anywhere between 32 and 64 jobs running at once. Each job takes between 6 and 20 minutes, depending on the tests it runs. From the controller logs, it appears that runners the controller considers 'unregistered' are affected once they come up for reconciliation. I see a lot of:

2021-10-26T05:41:39.576Z	DEBUG	actions-runner-controller.runnerreplicaset	Failed to check if runner is busy. Either this runner has never been successfully registered to GitHub or it still needs more time.	{"runnerreplicaset": "default/org-runners-749hr", "runnerName": "org-runners-749hr-gr85x"}
2021-10-26T05:41:40.172Z	INFO	actions-runner-controller.runner	Skipped registration check because it's deferred until 2021-10-26 05:42:30 +0000 UTC. Retrying in 48.827500109s at latest	{"runner": "default/org-runners-749hr-w9ll4", "lastRegistrationCheckTime": "2021-10-26 05:41:30 +0000 UTC", "registrationCheckInterval": "1m0s"}

Based on this behavior, I wonder whether the reconciler takes the registration timeout of a previous pod and applies it to these runners as they are trying to catch up, resulting in a runner being marked for deletion even though it is only a minute or two into its registration check interval.

With that being said, I was hoping to increase these intervals and add more of a grace period before the controller shuts down a runner when it doesn't 'know' whether the runner is busy, but honestly I can't tell if that is what is happening.
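
Concretely, the change I have in mind is just raising the intervals that already appear in the Helm values above; a sketch of what I would try (the numbers are guesses, not tested recommendations):

# controller Helm values: reconcile and hit the GitHub API less often,
# so runners that are still registering get more breathing room
syncPeriod: 10m
githubAPICacheDuration: 10m

That said, I don't see an obvious knob for the 1m0s registrationCheckInterval that shows up in the log above, which is part of what I am asking about.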

Any insight would be helpful - and let me know if there is anything else I can provide!

Labels

enhancement (New feature or request), question (Further information is requested)
