Emit corev1 Events for debugging purposes #724

@grantr

Description

Debugging event configuration is really hard. How can we make it easier? I think emitting more corev1.Events when things happen would be a good start.
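For reference, emitting such an event from a controller is a one-liner once the reconciler holds a record.EventRecorder from client-go. A minimal sketch, assuming that wiring is in place (the reconciler type and helper name here are illustrative, not existing eventing code):

package controller

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// reconciler is an illustrative stand-in for an eventing controller.
type reconciler struct {
	// recorder is wired up by the controller framework at startup.
	recorder record.EventRecorder
}

// markFailure attaches a Warning event to obj, so it shows up in
// `kubectl describe` for that object alongside events from other controllers.
func (r *reconciler) markFailure(obj runtime.Object, reason, messageFmt string, args ...interface{}) {
	r.recorder.Eventf(obj, corev1.EventTypeWarning, reason, messageFmt, args...)
}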

Problem

The eventing objects have many failure modes, and we don't want users to have to dig through controller logs to figure out what went wrong. Even if logs were easy to get from Kibana, there are too many controllers, and logs aren't a well-defined API: they'll move around to different pods and be reformatted over time, and we don't want docs and the user experience to become obsolete when that happens.

We do have statuses, but they only tell the user the state right now, not how that state came to be. They're also (currently) optimized for consumption by computers, not humans: if the reason for Ready is NotFound, how do I debug that? What does Severity: Error mean? We can and should improve statuses so they're more obvious to users, but they're not the whole story.
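For concreteness, here's a hand-written illustration of what such a status looks like (field names follow the duck-typed conditions; the values are made up, not captured output):

status:
  conditions:
  - type: Ready
    status: "False"
    reason: NotFound
    severity: Error

Nothing in there tells the user what was not found, or what to do about it.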

(Imperfect) Solution

Other complex objects like Pods use corev1.Events to communicate to the user what is happening when, and more importantly, what is failing when. Here's how I'd debug a Pod with a typo in the image URL:

$ kubectl get pods -n e2etest-knative-eventing
NAME                     READY   STATUS             RESTARTS   AGE
e2e-singleevent-sender   1/2     ImagePullBackOff   0          30s

I see immediately that the sender pod is in an error state and that the error is related to the image.

$ kubectl describe pod -n e2etest-knative-eventing e2e-singleevent-sender
...
Events:
  Type     Reason     Age                From                                             Message
  ----     ------     ----               ----                                             -------
  Normal   Scheduled  78s                default-scheduler                                Successfully assigned e2etest-knative-eventing/e2e-singleevent-sender to gke-knative-default-pool-910b35a0-ftph
  Normal   Pulled     76s                kubelet, gke-knative-default-pool-910b35a0-ftph  Container image "docker.io/istio/proxy_init:1.0.2" already present on machine
  Normal   Created    76s                kubelet, gke-knative-default-pool-910b35a0-ftph  Created container
  Normal   Started    76s                kubelet, gke-knative-default-pool-910b35a0-ftph  Started container
  Normal   Pulled     74s                kubelet, gke-knative-default-pool-910b35a0-ftph  Container image "docker.io/istio/proxyv2:1.0.2" already present on machine
  Normal   Created    74s                kubelet, gke-knative-default-pool-910b35a0-ftph  Created container
  Normal   Started    74s                kubelet, gke-knative-default-pool-910b35a0-ftph  Started container
  Normal   Pulling    33s (x3 over 75s)  kubelet, gke-knative-default-pool-910b35a0-ftph  pulling image "github.com/knative/eventing/test/test_images/sendevent"
  Warning  Failed     32s (x3 over 74s)  kubelet, gke-knative-default-pool-910b35a0-ftph  Failed to pull image "github.com/knative/eventing/test/test_images/sendevent": rpc error: code = Unknown desc = Error response from daemon: error parsing HTTP 404 response body: invalid character 'N' looking for beginning of value: "Not Found"
  Warning  Failed     32s (x3 over 74s)  kubelet, gke-knative-default-pool-910b35a0-ftph  Error: ErrImagePull
  Normal   BackOff    9s (x5 over 73s)   kubelet, gke-knative-default-pool-910b35a0-ftph  Back-off pulling image "github.com/knative/eventing/test/test_images/sendevent"
  Warning  Failed     9s (x5 over 73s)   kubelet, gke-knative-default-pool-910b35a0-ftph  Error: ImagePullBackOff

I can read the events in reverse-time order to find the most recent useful event, which in this case is the "Failed to pull image" error. That tells me the image is missing (in this case, because I didn't run the YAML through ko first).

Is this perfect? No, but at least I didn't have to figure out which pod is running the controller and try to interpret its log line format.

Will it help?

Let's try to make the eventing objects' corev1.Events at least as useful as the Pod events for real debugging scenarios. These two issues came up today:

  • Missing provisioner controller for a Channel
  • Provisioner controller missing a secret value

Would emitting corev1.Events when interesting things happen have made debugging those problems easier?
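I think so. Something like the following mock-up (entirely hypothetical output; the reason and controller names are made up) would have answered the first question with a single kubectl describe:

$ kubectl describe channel my-channel
...
Events:
  Type     Reason               Age   From                Message
  ----     ------               ----  ----                -------
  Warning  ProvisionerNotFound  2m    channel-controller  no provisioner controller is running for "in-memory-channel"

The second case could similarly surface as a Warning from the provisioner controller naming the secret (and key) it failed to read.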

Scope

This is only about emitting corev1.Events from the control plane. The data plane will probably need a different debugging solution (tracing).

Caveat

The k8s-events source may need to take special care to avoid a loop: it watches corev1.Events, so any corev1.Events emitted while it processes one become new input for it. It could avoid emitting corev1.Events while processing a corev1.Event, or annotate its own events so the data plane can recognize and drop them later; a sketch of the annotation option follows.
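A minimal sketch of the annotation option, assuming the source filters on the watch side (the package, annotation key, and function name are all illustrative):

package source

import corev1 "k8s.io/api/core/v1"

// Illustrative annotation key; a real implementation would pick this
// deliberately and document it.
const emittedBySource = "eventing.knative.dev/from-k8s-events-source"

// shouldForward reports whether an observed corev1.Event should be sent
// into the data plane. Events this source emitted itself (and annotated
// accordingly) are dropped, which breaks the feedback loop.
func shouldForward(ev *corev1.Event) bool {
	_, ours := ev.Annotations[emittedBySource]
	return !ours
}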
