KAFKA-14598: Fix flaky ConnectRestApiTest#13084
ashwinpankaj wants to merge 7 commits into apache:trunk from ashwinpankaj:trunk
Is this a change in behavior? It appears that the default mode is LISTEN:
kafka/tests/kafkatest/services/connect.py Line 78 in 95910af
kafka/tests/kafkatest/services/connect.py Lines 124 to 125 in 95910af
It seems like the problem is not that the startup_mode is wrong, but that STARTUP_MODE_LISTEN does not wait for all resources to be registered.
Thanks for taking a look, @gharris1727!
Ah, I recall working on this before. I was the one who added the STARTUP_MODE_JOIN override you linked: #9040
In the description, I mentioned how JOIN was a superset of LISTEN, and I think that's still the case. The Jetty server starts:
However, as you've noticed, the registration of the resources occurs after the server begins listening, and after the herder joins the group. So neither LISTEN nor JOIN is sufficient to ensure that the resources are registered. But changing from JOIN to LISTEN is going to have the opposite effect of what you intend, as the LISTEN condition is true even earlier than JOIN is.
Thanks @gharris1727 for clarifying why JOIN was used as the mode initially and why the previous change was not enough to guarantee that the Connect REST service was up. I re-checked the code for start_and_wait_to_start_listening. I think if we set the retry_on_exc flag to True in wait_until, we will achieve the desired effect: try to list connectors, back off if that attempt fails, and retry until the 60s timeout expires. To test this theory, in my latest revision I have set retry_on_exc to True in start_and_wait_to_start_listening(). I hope this gives us confidence that the fix works.
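The retry behaviour described above can be sketched roughly as follows. This is a simplified stand-in written for illustration, not ducktape's actual wait_until implementation. The names wait_until_sketch, FlakyRest and is_listening are hypothetical, and raising ConnectionError while resources are unregistered is an assumption made only for the example.

```python
import time


def wait_until_sketch(condition, timeout_sec, backoff_sec=0.1, retry_on_exc=False):
    """Poll condition() until it returns True or timeout_sec elapses.

    Hypothetical stand-in for illustration only. With retry_on_exc=True an
    exception raised by condition() is treated as "not ready yet" instead of
    aborting the wait.
    """
    deadline = time.time() + timeout_sec
    last_exc = None
    while time.time() < deadline:
        try:
            if condition():
                return
            last_exc = None
        except Exception as exc:
            if not retry_on_exc:
                raise  # without the flag, the first failure ends the wait
            last_exc = exc
        time.sleep(backoff_sec)
    raise TimeoutError("condition not met within %s seconds" % timeout_sec) from last_exc


class FlakyRest:
    """Pretends to be a REST endpoint whose resources are registered late."""

    def __init__(self, failures_before_ready):
        self.remaining = failures_before_ready
        self.calls = 0

    def list_connectors(self):
        self.calls += 1
        if self.remaining > 0:
            self.remaining -= 1
            # Assumption: an unregistered resource surfaces as an exception.
            raise ConnectionError("resources not registered yet")
        return []


def is_listening(rest):
    """Condition: listing connectors succeeds without raising."""
    rest.list_connectors()
    return True
```

With retry_on_exc=False the first failed listing propagates immediately, which matches the flaky behaviour discussed in this thread; with retry_on_exc=True the same failure only delays the wait by one backoff interval.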
self.cc.start()
self.logger.info("Waiting till Connect REST server is listening")
self.cc.start(mode=ConnectServiceBase.STARTUP_MODE_LISTEN)
I think this is unnecessary for the stabilization fix, and actually weakens the test. Because this is actually creating the connectors in distributed mode, I think it would be smart to wait for the workers to actually join the cluster.
So we can revert these two lines and leave it as it was.
Thanks @ashwinpankaj for following up, I think that this is good after one last nit comment.
I was going to suggest something like this, thanks for verifying that the fix actually stabilizes the test!
This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or appropriate release branch). If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.
https://issues.apache.org/jira/browse/KAFKA-14598
ConnectRestApiTest sometimes fails with the message
This happens because ConnectDistributedService.start() by default waits till the line
Joined group at generation ..
is visible in the logs. In most cases this is sufficient. But in the cases where the test fails, we see that this message appears even before the Connect RestServer has finished initialization.
GET /connectors is called multiple times till it returns 200 OK. This shows that the test waits till the Connect REST service is up.
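The polling pattern described above can be reproduced against a local stub using only the Python standard library. This is a sketch under stated assumptions, not the test's actual code: StubConnectHandler, poll_connectors and run_demo are hypothetical names, and the stub answering 404 for its first two requests is an assumed stand-in for a server that is listening before its resources are registered.

```python
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubConnectHandler(BaseHTTPRequestHandler):
    """Stub that answers 404 a few times before serving GET /connectors."""

    not_ready_responses = 2  # assumption: how many early requests fail

    def do_GET(self):
        cls = type(self)
        if self.path == "/connectors" and cls.not_ready_responses == 0:
            body = b"[]"
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            if cls.not_ready_responses > 0:
                cls.not_ready_responses -= 1
            self.send_response(404)
            self.send_header("Content-Length", "0")
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


def poll_connectors(base_url, timeout_sec=60, backoff_sec=0.1):
    """Return how many GET /connectors attempts were made until the first 200 OK."""
    # Bypass any proxy configured in the environment for this local request.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.time() + timeout_sec
    attempts = 0
    while time.time() < deadline:
        attempts += 1
        try:
            with opener.open(base_url + "/connectors", timeout=5) as resp:
                if resp.status == 200:
                    return attempts
        except urllib.error.URLError:
            pass  # HTTP errors and refused connections: retry after backoff
        time.sleep(backoff_sec)
    raise TimeoutError("GET /connectors did not return 200 OK in time")


def run_demo():
    """Start the stub on a free local port and poll it until it is ready."""
    StubConnectHandler.not_ready_responses = 2
    server = HTTPServer(("127.0.0.1", 0), StubConnectHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d" % server.server_address[1]
        return poll_connectors(url, backoff_sec=0.01)
    finally:
        server.shutdown()
        server.server_close()
```

Calling run_demo() returns the number of requests that were needed, so more than one attempt indicates that the stub was reachable before it was ready to serve the endpoint.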