Skip to content

An event sent during upgrade test shutdown might be actually delivered #7446

@mgencur

Description

@mgencur

Describe the bug
When event sending fails at the end of an upgrade test and the test suite is re-sending the event while
the sender is being shut down, it can happen that the receiver gets the event but sender thinks it was
not delivered (because the sender is interrupted). The test than fails with an error similar to this one: "expecting to have 16690 unique events received, but received 16691 unique events".
The code is here.
The upgrade test framework should check if the last step was interrupted and in that case allow the number
of received event to be "number of sent events plus 1".

Test log:

    logger.go:130: 2023-11-13T09:02:40.170-0500	INFO	WARNING: Duplicate:event #16691 received 2 times, but should be received only once
    logger.go:130: 2023-11-13T09:02:40.169-0500	INFO	WARNING: Duplicate:event #16690 received 2 times, but should be received only once
    prober.go:78: There were 16691 events propagated, but 1 errors occurred. Listing them below.
    prober.go:106: expecting to have 16690 unique events received, but received 16691 unique events
    logger.go:130: 2023-11-13T09:02:40.172-0500	DEBUG	Finished "ChannelContinualTest"

Sender:

{"level":"warn","ts":"2023-11-13T14:00:03.816696676Z","caller":"sender/services.go:102","msg":"Could not send step event 16690, retrying (0): Post \"http://sut-kn-channel.eventing-e2e0.svc.cluster.local\": EOF"}
{"level":"info","ts":"2023-11-13T14:00:03.816770975Z","caller":"sender/services.go:207","msg":"Sending step event #16691 to \"http://sut-kn-channel.eventing-e2e0.svc.cluster.local\""}
{"level":"warn","ts":"2023-11-13T14:00:03.821102525Z","caller":"sender/services.go:102","msg":"Could not send step event 16690, retrying (1): Post \"http://sut-kn-channel.eventing-e2e0.svc.cluster.local\": dial tcp 172.30.99.91:80: connect: connection refused"}
{"level":"info","ts":"2023-11-13T14:00:03.821158301Z","caller":"sender/services.go:207","msg":"Sending step event #16691 to \"http://sut-kn-channel.eventing-e2e0.svc.cluster.local\""}
{"level":"warn","ts":"2023-11-13T14:00:03.825665798Z","caller":"sender/services.go:102","msg":"Could not send step event 16690, retrying (2): Post \"http://sut-kn-channel.eventing-e2e0.svc.cluster.local\": dial tcp 172.30.99.91:80: connect: connection refused"}

Note: There's a bug in logging the exact step that failed. The message "Could not send step event 16690" should actually be "Could not send step event 16691"
because the message include a number of already sent events, not the index of the current message (which is +1).

Receiver:

{"level":"error","ts":"2023-11-13T14:00:04.955956723Z","caller":"event/services.go:210","msg":"event #16690 received 2 times, but should be received only once","stacktrace":"knative.dev/eventing/test/upgrade/prober/wathola/event.(*ErrorStore).appendThrown\n\tknative.dev/eventing/test/upgrade/prober/wathola/event/services.go:210\nknative.dev/eventing/test/upgrade/prober/wathola/event.(*ErrorStore).throwDuplicated\n\tknative.dev/eventing/test/upgrade/prober/wathola/event/services.go:190\nknative.dev/eventing/test/upgrade/prober/wathola/event.(*stepStore).RegisterStep\n\tknative.dev/eventing/test/upgrade/prober/wathola/event/services.go:79\nknative.dev/eventing/test/upgrade/prober/wathola/receiver.receiver.receiveEvent\n\tknative.dev/eventing/test/upgrade/prober/wathola/receiver/services.go:73\nreflect.Value.call\n\treflect/value.go:556\nreflect.Value.Call\n\treflect/value.go:339\ngithub.com/cloudevents/sdk-go/v2/client.(*receiverFn).invoke\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/receiver.go:91\ngithub.com/cloudevents/sdk-go/v2/client.(*receiveInvoker).Invoke.func2\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/invoker.go:85\ngithub.com/cloudevents/sdk-go/v2/client.(*receiveInvoker).Invoke\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/invoker.go:88\ngithub.com/cloudevents/sdk-go/v2/client.(*ceClient).StartReceiver.func2.1\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/client.go:253\ngithub.com/cloudevents/sdk-go/v2/client.(*ceClient).StartReceiver.func2.2\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/client.go:265"}
{"level":"error","ts":"2023-11-13T14:00:05.77001412Z","caller":"event/services.go:210","msg":"event #16691 received 2 times, but should be received only once","stacktrace":"knative.dev/eventing/test/upgrade/prober/wathola/event.(*ErrorStore).appendThrown\n\tknative.dev/eventing/test/upgrade/prober/wathola/event/services.go:210\nknative.dev/eventing/test/upgrade/prober/wathola/event.(*ErrorStore).throwDuplicated\n\tknative.dev/eventing/test/upgrade/prober/wathola/event/services.go:190\nknative.dev/eventing/test/upgrade/prober/wathola/event.(*stepStore).RegisterStep\n\tknative.dev/eventing/test/upgrade/prober/wathola/event/services.go:79\nknative.dev/eventing/test/upgrade/prober/wathola/receiver.receiver.receiveEvent\n\tknative.dev/eventing/test/upgrade/prober/wathola/receiver/services.go:73\nreflect.Value.call\n\treflect/value.go:556\nreflect.Value.Call\n\treflect/value.go:339\ngithub.com/cloudevents/sdk-go/v2/client.(*receiverFn).invoke\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/receiver.go:91\ngithub.com/cloudevents/sdk-go/v2/client.(*receiveInvoker).Invoke.func2\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/invoker.go:85\ngithub.com/cloudevents/sdk-go/v2/client.(*receiveInvoker).Invoke\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/invoker.go:88\ngithub.com/cloudevents/sdk-go/v2/client.(*ceClient).StartReceiver.func2.1\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/client.go:253\ngithub.com/cloudevents/sdk-go/v2/client.(*ceClient).StartReceiver.func2.2\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/client.go:265"}

Expected behavior
When the last step on sender side failed and the test was shutdown at that moment, the receiver should tolerate one more event to be received because there's no way to verify if the event was actually delivered or not.

To Reproduce
This happens only rarely and can be reproduced by running the upgrade tests.

Knative release version

Additional context
Add any other context about the problem here such as proposed priority

Metadata

Metadata

Assignees

Labels

kind/bugCategorizes issue or PR as related to a bug.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions