Failing pod never times out #6504

@duglin

Description

In what area(s)?

/area autoscale

What version of Knative?

HEAD

Expected Behavior

When a pod fails to start properly it should eventually be terminated.

Actual Behavior

If an instance/pod fails to start and it is the first pod the revision has ever started, the pod will eventually be terminated. But if the first pod of the revision starts OK and the revision then scales down to zero, and the next pod that is created fails to start, that pod will continually crash-loop (which is expected) but it will never be terminated and never goes away.

It seems like there should be consistency between a "first time" pod and a "2+ time" pod with respect to what happens when it crashes.
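For context, the "first time" termination appears to come from the revision's deployment progress deadline: if the first pod never becomes ready within that window, the revision is marked failed and its pod is torn down. In current Knative this is exposed through the config-deployment ConfigMap; the sketch below assumes the `progress-deadline` key and the 600s default, and at the time of this issue the deadline may instead have been a fixed value (the ~2 minutes observed below):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-deployment
  namespace: knative-serving
data:
  # Assumed key/default: if a new revision's first pod does not become
  # ready within this window, the revision is marked failed and the
  # crash-looping pod is removed.
  progress-deadline: "600s"
```

The bug is that no equivalent deadline seems to apply on a later scale-from-zero.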

Steps to Reproduce the Problem

You can reproduce this by running this bash script:

#!/bin/bash

set -e
kubectl delete ksvc/bugsvc > /dev/null 2>&1 || true
kubectl delete ksvc/bugsvc2 > /dev/null 2>&1 || true

export CRASH=$(( $(date -u '+%s') + 120))

echo "Time now: $(date -u '+%s')"
echo "Will die: ${CRASH}"

kubectl apply -f - <<EOF
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: bugsvc
spec:
  template:
    spec:
      containers:
      - image: duglin/echo
        env:
          - name: CRASH
            value: ${CRASH}
EOF
sleep 10
URL=$(kubectl get ksvc/bugsvc -o custom-columns=URL:.status.url --no-headers)

echo "Send curl just to make sure it works"
curl $URL

echo "Wait for it to scale to zero"
while kubectl get pods | grep bugsvc ; do
  sleep 10
done

echo "Sleep for 2 minutes just to make sure we're past the crash time"
sleep 120

echo "Create bugsvc2 so it fails immediately"
kubectl apply -f - <<EOF
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: bugsvc2
spec:
  template:
    spec:
      containers:
      - image: duglin/echo
        env:
          - name: CRASH
            value: "true"
EOF
echo "Now curl bugsvc again to force it to scale up to 1"
curl $URL &

echo "Pods should be failing, but bugsvc2 will eventually vanish"
kubectl get pods -w

The image used will crash if it is started after the time (Unix epoch seconds) in the CRASH env var. So, in the case of bugsvc we create the ksvc before the CRASH time, let it scale down to zero, then hit it after the CRASH time so that the pod fails. KnService bugsvc2 crashes immediately to show how its pod will be removed (for me, after about 2 minutes) while bugsvc's pod seems to live forever (or at least a LOT longer).
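For reference, the crash behavior described above can be approximated with an entrypoint like the following. This is a sketch, not the actual duglin/echo image; the should_crash helper and the handling of CRASH=true are assumptions based on how the script uses the variable:

```shell
#!/bin/bash
# Sketch of the entrypoint logic assumed for the duglin/echo image:
# exit nonzero (crash) when started after the CRASH time.

should_crash() {
  # CRASH=true is taken to mean "crash immediately" (assumption)
  [ "${CRASH}" = "true" ] && return 0
  # Otherwise CRASH is a Unix epoch timestamp; crash if we're past it
  [ -n "${CRASH}" ] && [ "$(date -u '+%s')" -ge "${CRASH}" ]
}

if should_crash; then
  echo "CRASH time reached - exiting" >&2
  exit 1
fi

echo "Serving normally"
```

With this logic, bugsvc's pod only starts failing once the wall clock passes the CRASH timestamp, while bugsvc2 (CRASH=true) fails on its very first start.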

Metadata

Labels

area/API: API objects and controllers
area/autoscale
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
triage/accepted: Issues which should be fixed (post-triage)
