Describe the bug
When event sending fails at the end of an upgrade test and the test suite is re-sending the event while
the sender is being shut down, it can happen that the receiver gets the event but sender thinks it was
not delivered (because the sender is interrupted). The test than fails with an error similar to this one: "expecting to have 16690 unique events received, but received 16691 unique events".
The code is here.
The upgrade test framework should check if the last step was interrupted and in that case allow the number
of received event to be "number of sent events plus 1".
Test log:
logger.go:130: 2023-11-13T09:02:40.170-0500 INFO WARNING: Duplicate:event #16691 received 2 times, but should be received only once
logger.go:130: 2023-11-13T09:02:40.169-0500 INFO WARNING: Duplicate:event #16690 received 2 times, but should be received only once
prober.go:78: There were 16691 events propagated, but 1 errors occurred. Listing them below.
prober.go:106: expecting to have 16690 unique events received, but received 16691 unique events
logger.go:130: 2023-11-13T09:02:40.172-0500 DEBUG Finished "ChannelContinualTest"
Sender:
{"level":"warn","ts":"2023-11-13T14:00:03.816696676Z","caller":"sender/services.go:102","msg":"Could not send step event 16690, retrying (0): Post \"http://sut-kn-channel.eventing-e2e0.svc.cluster.local\": EOF"}
{"level":"info","ts":"2023-11-13T14:00:03.816770975Z","caller":"sender/services.go:207","msg":"Sending step event #16691 to \"http://sut-kn-channel.eventing-e2e0.svc.cluster.local\""}
{"level":"warn","ts":"2023-11-13T14:00:03.821102525Z","caller":"sender/services.go:102","msg":"Could not send step event 16690, retrying (1): Post \"http://sut-kn-channel.eventing-e2e0.svc.cluster.local\": dial tcp 172.30.99.91:80: connect: connection refused"}
{"level":"info","ts":"2023-11-13T14:00:03.821158301Z","caller":"sender/services.go:207","msg":"Sending step event #16691 to \"http://sut-kn-channel.eventing-e2e0.svc.cluster.local\""}
{"level":"warn","ts":"2023-11-13T14:00:03.825665798Z","caller":"sender/services.go:102","msg":"Could not send step event 16690, retrying (2): Post \"http://sut-kn-channel.eventing-e2e0.svc.cluster.local\": dial tcp 172.30.99.91:80: connect: connection refused"}
Note: There's a bug in logging the exact step that failed. The message "Could not send step event 16690" should actually be "Could not send step event 16691"
because the message include a number of already sent events, not the index of the current message (which is +1).
Receiver:
{"level":"error","ts":"2023-11-13T14:00:04.955956723Z","caller":"event/services.go:210","msg":"event #16690 received 2 times, but should be received only once","stacktrace":"knative.dev/eventing/test/upgrade/prober/wathola/event.(*ErrorStore).appendThrown\n\tknative.dev/eventing/test/upgrade/prober/wathola/event/services.go:210\nknative.dev/eventing/test/upgrade/prober/wathola/event.(*ErrorStore).throwDuplicated\n\tknative.dev/eventing/test/upgrade/prober/wathola/event/services.go:190\nknative.dev/eventing/test/upgrade/prober/wathola/event.(*stepStore).RegisterStep\n\tknative.dev/eventing/test/upgrade/prober/wathola/event/services.go:79\nknative.dev/eventing/test/upgrade/prober/wathola/receiver.receiver.receiveEvent\n\tknative.dev/eventing/test/upgrade/prober/wathola/receiver/services.go:73\nreflect.Value.call\n\treflect/value.go:556\nreflect.Value.Call\n\treflect/value.go:339\ngithub.com/cloudevents/sdk-go/v2/client.(*receiverFn).invoke\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/receiver.go:91\ngithub.com/cloudevents/sdk-go/v2/client.(*receiveInvoker).Invoke.func2\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/invoker.go:85\ngithub.com/cloudevents/sdk-go/v2/client.(*receiveInvoker).Invoke\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/invoker.go:88\ngithub.com/cloudevents/sdk-go/v2/client.(*ceClient).StartReceiver.func2.1\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/client.go:253\ngithub.com/cloudevents/sdk-go/v2/client.(*ceClient).StartReceiver.func2.2\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/client.go:265"}
{"level":"error","ts":"2023-11-13T14:00:05.77001412Z","caller":"event/services.go:210","msg":"event #16691 received 2 times, but should be received only once","stacktrace":"knative.dev/eventing/test/upgrade/prober/wathola/event.(*ErrorStore).appendThrown\n\tknative.dev/eventing/test/upgrade/prober/wathola/event/services.go:210\nknative.dev/eventing/test/upgrade/prober/wathola/event.(*ErrorStore).throwDuplicated\n\tknative.dev/eventing/test/upgrade/prober/wathola/event/services.go:190\nknative.dev/eventing/test/upgrade/prober/wathola/event.(*stepStore).RegisterStep\n\tknative.dev/eventing/test/upgrade/prober/wathola/event/services.go:79\nknative.dev/eventing/test/upgrade/prober/wathola/receiver.receiver.receiveEvent\n\tknative.dev/eventing/test/upgrade/prober/wathola/receiver/services.go:73\nreflect.Value.call\n\treflect/value.go:556\nreflect.Value.Call\n\treflect/value.go:339\ngithub.com/cloudevents/sdk-go/v2/client.(*receiverFn).invoke\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/receiver.go:91\ngithub.com/cloudevents/sdk-go/v2/client.(*receiveInvoker).Invoke.func2\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/invoker.go:85\ngithub.com/cloudevents/sdk-go/v2/client.(*receiveInvoker).Invoke\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/invoker.go:88\ngithub.com/cloudevents/sdk-go/v2/client.(*ceClient).StartReceiver.func2.1\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/client.go:253\ngithub.com/cloudevents/sdk-go/v2/client.(*ceClient).StartReceiver.func2.2\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/client.go:265"}
Expected behavior
When the last step on sender side failed and the test was shutdown at that moment, the receiver should tolerate one more event to be received because there's no way to verify if the event was actually delivered or not.
To Reproduce
This happens only rarely and can be reproduced by running the upgrade tests.
Knative release version
Additional context
Add any other context about the problem here such as proposed priority
Describe the bug
When event sending fails at the end of an upgrade test and the test suite is re-sending the event while
the sender is being shut down, it can happen that the receiver gets the event but sender thinks it was
not delivered (because the sender is interrupted). The test than fails with an error similar to this one: "expecting to have 16690 unique events received, but received 16691 unique events".
The code is here.
The upgrade test framework should check if the last step was interrupted and in that case allow the number
of received event to be "number of sent events plus 1".
Test log:
Sender:
Note: There's a bug in logging the exact step that failed. The message "Could not send step event 16690" should actually be "Could not send step event 16691"
because the message include a number of already sent events, not the index of the current message (which is +1).
Receiver:
Expected behavior
When the last step on sender side failed and the test was shutdown at that moment, the receiver should tolerate one more event to be received because there's no way to verify if the event was actually delivered or not.
To Reproduce
This happens only rarely and can be reproduced by running the upgrade tests.
Knative release version
Additional context
Add any other context about the problem here such as proposed priority