Debugging event configuration is really hard. How can we make it easier? I think emitting more corev1.Events when things happen would be a good start.
Problem
The eventing objects have many failure modes. We don't want users to have to dig through controller logs to figure out what went wrong. Even if logs were easy to get from Kibana, there are too many controllers, and logs aren't a well-defined API: they'll move to different pods and be reformatted over time, and we don't want docs and the user experience to become obsolete when that happens.
We do have statuses, but they only tell the user the state right now, not how that state came to be. They're also (currently) optimized for consumption by computers, not humans. If the reason for Ready is NotFound, how do I debug that? What does Severity: Error mean? We can and should improve statuses so they're more obvious to users, but it's not the whole story.
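For concreteness, here is roughly what such a condition gives the user today. This is a minimal sketch using simplified, hypothetical types (the real condition types live in knative.dev/pkg and carry more fields); the point is that Reason and Severity are all the user has to interpret, with no trail of how the object got into that state.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for the status conditions
// the eventing objects expose; the fields mirror what a user sees in
// `kubectl get <object> -o yaml`, not an exact API definition.
type Condition struct {
	Type     string // e.g. "Ready"
	Status   string // "True", "False", or "Unknown"
	Reason   string // machine-readable cause, e.g. "NotFound"
	Severity string // "Error", "Warning", or "Info"
	Message  string // free-form detail, often terse or empty
}

func main() {
	// This is all the information the user gets about *how* the object became
	// not-ready: no record of what the controller tried or when it failed.
	c := Condition{Type: "Ready", Status: "False", Reason: "NotFound", Severity: "Error"}
	fmt.Printf("%s=%s (reason: %s, severity: %s)\n", c.Type, c.Status, c.Reason, c.Severity)
}
```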
(Imperfect) Solution
Other complex objects like Pods use corev1.Events to communicate to the user what is happening when, and more importantly, what is failing when. Here's how I'd debug a Pod with a typo in the image URL:
```
$ kubectl get pods -n e2etest-knative-eventing
NAME                     READY   STATUS             RESTARTS   AGE
e2e-singleevent-sender   1/2     ImagePullBackOff   0          30s
```
I see immediately that the e2e-singleevent-sender pod is in an error state and that the error is related to the image.
```
$ kubectl describe pod -n e2etest-knative-eventing e2e-singleevent-sender
...
Events:
  Type     Reason     Age                From                                              Message
  ----     ------     ----               ----                                              -------
  Normal   Scheduled  78s                default-scheduler                                 Successfully assigned e2etest-knative-eventing/e2e-singleevent-sender to gke-knative-default-pool-910b35a0-ftph
  Normal   Pulled     76s                kubelet, gke-knative-default-pool-910b35a0-ftph   Container image "docker.io/istio/proxy_init:1.0.2" already present on machine
  Normal   Created    76s                kubelet, gke-knative-default-pool-910b35a0-ftph   Created container
  Normal   Started    76s                kubelet, gke-knative-default-pool-910b35a0-ftph   Started container
  Normal   Pulled     74s                kubelet, gke-knative-default-pool-910b35a0-ftph   Container image "docker.io/istio/proxyv2:1.0.2" already present on machine
  Normal   Created    74s                kubelet, gke-knative-default-pool-910b35a0-ftph   Created container
  Normal   Started    74s                kubelet, gke-knative-default-pool-910b35a0-ftph   Started container
  Normal   Pulling    33s (x3 over 75s)  kubelet, gke-knative-default-pool-910b35a0-ftph   pulling image "github.com/knative/eventing/test/test_images/sendevent"
  Warning  Failed     32s (x3 over 74s)  kubelet, gke-knative-default-pool-910b35a0-ftph   Failed to pull image "github.com/knative/eventing/test/test_images/sendevent": rpc error: code = Unknown desc = Error response from daemon: error parsing HTTP 404 response body: invalid character 'N' looking for beginning of value: "Not Found"
  Warning  Failed     32s (x3 over 74s)  kubelet, gke-knative-default-pool-910b35a0-ftph   Error: ErrImagePull
  Normal   BackOff    9s (x5 over 73s)   kubelet, gke-knative-default-pool-910b35a0-ftph   Back-off pulling image "github.com/knative/eventing/test/test_images/sendevent"
  Warning  Failed     9s (x5 over 73s)   kubelet, gke-knative-default-pool-910b35a0-ftph   Error: ImagePullBackOff
```
I can read the events in reverse chronological order to find the most recent useful event, which in this case is the Failed to pull image error. That tells me the image is missing (in this case, I didn't run the YAML through ko first).
Is this perfect? No, but at least I didn't have to figure out which pod is running the controller and try to interpret its log line format.
Will it help?
Let's try to make the eventing objects' corev1.Events at least as useful as the Pod events for real debugging scenarios. These two issues came up today:
- Missing provisioner controller for a Channel
- Provisioner controller missing a secret value
Would emitting corev1.Events when interesting things happen have made debugging those problems easier?
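As a sketch of what that could look like, here is a hypothetical, stripped-down Channel reconciler that records a Warning event for each of the two issues above. The recorder API (k8s.io/client-go/tools/record) and the corev1 event types are real; the reconciler, its lookup hooks, and the reason strings are illustrative assumptions, not existing knative/eventing code.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// channelReconciler is a hypothetical stand-in for an eventing controller.
// Only the recorder calls use the real client-go API; the lookup hooks and
// reason strings ("ProvisionerNotFound", "SecretMissing") are illustrative.
type channelReconciler struct {
	recorder record.EventRecorder

	// Stubbed-out lookups so the sketch stays self-contained.
	provisionerExists    func(name string) bool
	provisionerSecretErr func(name string) error
}

func (r *channelReconciler) reconcileChannel(ch runtime.Object, provisioner string) error {
	// Issue 1 above: no controller is running for the referenced provisioner.
	// Recording a Warning on the Channel makes it show up in kubectl describe,
	// exactly like the Pod events above.
	if !r.provisionerExists(provisioner) {
		r.recorder.Eventf(ch, corev1.EventTypeWarning, "ProvisionerNotFound",
			"no controller found for provisioner %q; the Channel will not become ready", provisioner)
		return fmt.Errorf("provisioner %q not found", provisioner)
	}

	// Issue 2 above: the provisioner controller is missing a secret value.
	if err := r.provisionerSecretErr(provisioner); err != nil {
		r.recorder.Eventf(ch, corev1.EventTypeWarning, "SecretMissing",
			"provisioner secret is missing or incomplete: %v", err)
		return err
	}

	r.recorder.Event(ch, corev1.EventTypeNormal, "ChannelReconciled", "Channel reconciled")
	return nil
}

func main() {
	// FakeRecorder captures the formatted events so the sketch runs without a cluster.
	rec := record.NewFakeRecorder(10)
	r := &channelReconciler{
		recorder:             rec,
		provisionerExists:    func(string) bool { return false },
		provisionerSecretErr: func(string) error { return nil },
	}
	// A ConfigMap stands in for a real Channel object here, purely so the
	// snippet compiles without the eventing API types.
	_ = r.reconcileChannel(&corev1.ConfigMap{}, "in-memory-channel")
	fmt.Println(<-rec.Events) // "Warning ProvisionerNotFound no controller found ..."
}
```

With events like these, kubectl describe on the Channel would have surfaced both failures the same way kubectl describe surfaced the image-pull failure above.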
Scope
This is only about emitting corev1.Events from the control plane. The data plane will probably need a different debugging solution (tracing).
Caveat
The k8s-events source may need to take special care to avoid a loop. It could avoid emitting corev1.Events while processing a corev1.Event, or annotate its own events to recognize them in the data plane later.
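Here's a minimal sketch of the annotation idea, assuming the source records events through client-go. The annotation key and both helpers are hypothetical names for illustration; AnnotatedEventf is the only real API used.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// selfEmittedAnnotation is a hypothetical marker, not an existing
// knative/eventing convention.
const selfEmittedAnnotation = "eventing.knative.dev/emitted-by-k8s-events-source"

// recordSourceEvent emits an Event about the source object, tagged so the
// source can recognize its own output if it later shows up in its watch stream.
func recordSourceEvent(rec record.EventRecorder, obj runtime.Object, reason, msg string) {
	rec.AnnotatedEventf(obj,
		map[string]string{selfEmittedAnnotation: "true"},
		corev1.EventTypeNormal, reason, msg)
}

// shouldForward reports whether the k8s-events source should send this
// corev1.Event downstream; self-emitted events are dropped to break the loop.
func shouldForward(ev *corev1.Event) bool {
	_, selfEmitted := ev.Annotations[selfEmittedAnnotation]
	return !selfEmitted
}
```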