Skip to content

KAFKA-14598: Fix flaky ConnectRestApiTest#13084

Closed
ashwinpankaj wants to merge 7 commits intoapache:trunkfrom
ashwinpankaj:trunk
Closed

KAFKA-14598: Fix flaky ConnectRestApiTest#13084
ashwinpankaj wants to merge 7 commits intoapache:trunkfrom
ashwinpankaj:trunk

Conversation

@ashwinpankaj
Copy link
Copy Markdown
Contributor

https://issues.apache.org/jira/browse/KAFKA-14598

ConnectRestApiTest sometimes fails with the message

ConnectRestError(404, '<html>\n<head>\n<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>\n<title>Error 404 Not Found</title>\n</head>\n<body><h2>HTTP ERROR 404 Not Found</h2>\n<table>\n<tr><th>URI:</th><td>/connector-plugins/</td></tr>\n<tr><th>STATUS:</th><td>404</td></tr>\n<tr><th>MESSAGE:</th><td>Not Found</td></tr>\n<tr><th>SERVLET:</th><td>-</td></tr>\n</table>\n\n</body>\n</html>\n', 'http://172.31.1.75:8083/connector-plugins/')

This happens because ConnectDistributedService.start() by default waits till the the lineJoined group at generation .. is visible in the logs.

In most cases this is sufficient. But in the cases where the test fails, we see that this message appears even before Connect RestServer has finished initialization.

[2022-12-15 15:40:29,064] INFO [Worker clientId=connect-1, groupId=connect-cluster] Joined group at generation 2 with protocol version 1 and got assignment: Assignment
Unknown macro: {error=0, leader='connect-1-07d9da63-9acb-4633-aee4-1ab79f4ab1ae', leaderUrl='http}
with rebalance delay: 0 (org.apache.kafka.connect.runtime.distributed.DistributedHerder)

[2022-12-15 15:40:29,560] INFO 172.31.5.66 - - [15/Dec/2022:15:40:29 +0000] "GET /connector-plugins/ HTTP/1.1" 404 375 "-" "python-requests/2.24.0" 71 (org.apache.kafka.connect.runtime.rest.RestServer)
[2022-12-15 15:40:29,579] INFO REST resources initialized; server is started and ready to handle requests (org.apache.kafka.connect.runtime.rest.RestServer)
  • Ran the fixed test. In connect logs we can see that GET /connectors is called multiple times till it returns 200 OK. This shows that the test waits till Connect REST service is up.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@mimaison mimaison added connect tests Test fixes (including flaky tests) labels Jan 6, 2023
@gharris1727
Copy link
Copy Markdown
Contributor

Is this a change in behavior? It appears that the default mode is LISTEN:

self.startup_mode = self.STARTUP_MODE_LISTEN
and it is only overwritten if the incoming mode is truey:
if mode:
self.startup_mode = mode

It seems like the problem is not that the startup_mode is wrong, but that the STARTUP_MODE_LISTEN is not waiting for all resources to be registered.

@ashwinpankaj
Copy link
Copy Markdown
Contributor Author

Is this a change in behavior? It appears that the default mode is LISTEN:

self.startup_mode = self.STARTUP_MODE_LISTEN

and it is only overwritten if the incoming mode is truey:

if mode:
self.startup_mode = mode

It seems like the problem is not that the startup_mode is wrong, but that the STARTUP_MODE_LISTEN is not waiting for all resources to be registered.

Thanks for taking a look @gharris1727 !
The code which you linked is for ConnectServiceBase.
ConnectRestApiTest uses ConnectDistributedService, which overrides startup_mode to STARTUP_MODE_JOIN

@gharris1727
Copy link
Copy Markdown
Contributor

Ah, I recall working on this before. I was the one that added that STARTUP_MODE_JOIN override you linked: #9040

In the description, I mentioned how JOIN was a superset of LISTEN, and I think that's still the case. The jetty server starts:

before the herder does:
herder.start();
rest.initializeResources(herder);

However, as you've noticed, the registration of the resources occurs after the server begins listening, and after the herder joins the group. So neither LISTEN or JOIN is sufficient to ensure that the resources are registered. But changing from JOIN to LISTEN is going to have the opposite effect that you're intending, as the LISTEN condition is true even earlier than JOIN is.

@ashwinpankaj
Copy link
Copy Markdown
Contributor Author

ashwinpankaj commented Jan 10, 2023

Thanks @gharris1727 for clarifying the reason why JOIN was used as the mode initially and why the previous change was not enough to guarantee that Connect Rest service was up.

I re-checked the code for start_and_wait_to_start_listening.
Here the test should ideally wait till it is able to succesfully list connectors.
Once the test code is able to list connectors, we can assume that the rest server is up.

I think if we set the retry_on_exc flag to true in wait_until, we will achieve the desired effect of trying to list connectors, backing off if that attempt fails and retrying again till a timeout of 60s expires.

To test this theory, in my latest revision I have set retry_on_exc to True in start_and_wait_to_start_listening().
I also added a 40 second sleep in RestServer.initializeResources().
ConnectRestApiTests passed inspite of the delay.

Hope this gives us the confidence that the fix works.

@ijuma ijuma requested a review from C0urante January 15, 2023 07:16

self.cc.start()
self.logger.info("Waiting till Connect REST server is listening")
self.cc.start(mode=ConnectServiceBase.STARTUP_MODE_LISTEN)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is unnecessary for the stabilization fix, and actually weakens the test. Because this is actually creating the connectors in distributed mode, I think it would be smart to wait for the cluster to actually join the cluster.

So we can revert these two lines and leave it as it was.

@gharris1727
Copy link
Copy Markdown
Contributor

Thanks @ashwinpankaj for following up, I think that this is good after one last nit comment.

To test this theory, in my latest revision I have set retry_on_exc to True in start_and_wait_to_start_listening().
I also added a 40 second sleep in RestServer.initializeResources().
ConnectRestApiTests passed inspite of the delay.

I was going to suggest something like this, thanks for verifying that the fix actually stabilizes the test!

@github-actions
Copy link
Copy Markdown

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or appropriate release branch)

If this PR is no longer valid or desired, please feel free to close it. If no activity occurrs in the next 30 days, it will be automatically closed.

@github-actions github-actions Bot added the stale Stale PRs label Jun 10, 2023
@ashwinpankaj ashwinpankaj closed this by deleting the head repository Dec 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

connect stale Stale PRs tests Test fixes (including flaky tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants