KAFKA-14732: Use an exponential backoff retry mechanism while reconfiguring connector tasks#13276

Merged
C0urante merged 3 commits into apache:trunk from yashmayya:KAFKA-14732
Feb 28, 2023
Conversation

@yashmayya
Contributor

  • Kafka Connect in distributed mode currently retries infinitely with a fixed retry backoff (250 ms) in case of errors arising during connector task reconfiguration.
  • Tasks can be "reconfigured" during connector startup (to get the initial task configs from the connector), on connector resume, or when a connector explicitly requests it via its context.
  • Task reconfiguration essentially entails requesting a connector instance's task configs and writing them to the Connect cluster's config storage (if a change in task configs is detected).
  • A fixed retry backoff of 250 ms leads to very aggressive retries. Consider a Debezium connector that attempts to initiate a database connection in its taskConfigs method: if the connection fails due to something like an invalid login, the Connect worker will essentially spam connection attempts frequently and indefinitely (until the connector config or database-side configs are fixed).
  • An exponential backoff retry mechanism seems better suited for the DistributedHerder::reconfigureConnectorTasksWithRetry method. The initial retry backoff is retained at 250 ms, with a chosen maximum backoff of 60000 ms.
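The capped exponential backoff schedule described above can be sketched as follows. This is a minimal illustration of the retry-delay progression only, not the actual Kafka `ExponentialBackoff` utility (which also applies jitter to avoid thundering-herd retries); the class and method names here are hypothetical:

```java
// BackoffSketch.java: illustrative capped exponential backoff schedule
// (initial 250 ms, doubling per attempt, capped at 60000 ms).
public class BackoffSketch {
    static final long INITIAL_MS = 250L;   // initial retry backoff
    static final long MAX_MS = 60_000L;    // chosen maximum backoff

    // Delay before retry number `attempt` (0-based), doubling each
    // time and capped at MAX_MS.
    static long backoffMs(int attempt) {
        double candidate = INITIAL_MS * Math.pow(2, attempt);
        return (long) Math.min(candidate, MAX_MS);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt <= 9; attempt++) {
            System.out.println("attempt " + attempt + " -> wait " + backoffMs(attempt) + " ms");
        }
    }
}
```

With these parameters the delay sequence is 250, 500, 1000, ... ms, reaching the 60-second cap on the ninth attempt, instead of hammering the connector's taskConfigs method every 250 ms indefinitely.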

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Contributor

@mukkachaitanya mukkachaitanya left a comment


Thanks, @yashmayya! The intent seems right. I had a couple of comments/suggestions.

Contributor


I am curious if there is a way to not do infinite retries. If we are actually retrying infinitely, especially in the startConnector phase, then the connector just doesn't have tasks. Is it possible to somehow bubble up errors as part of the connector (not task) status?

Contributor Author


Hm, that's an interesting idea and I don't see the harm in limiting the number of retries to some reasonable value and then marking the connector as failed after that (we could include the last exception's trace in the connector's status).

Is it possible to somehow bubble up errors as part of connector (not task) status?

The AbstractHerder (DistributedHerder's parent class) implements the ConnectorStatus.Listener interface and so we should be able to update the connector's status to failed via

@Override
public void onFailure(String connector, Throwable cause) {
    statusBackingStore.putSafe(new ConnectorStatus(connector, ConnectorStatus.State.FAILED,
            trace(cause), workerId, generation()));
}

For instance, here we use the onFailure hook to update a connector's status as failed if there is an exception thrown during startup.

@yashmayya
Contributor Author

Hi @C0urante, could you please take a look?

PS - I'd be happy to move the minor Javadoc improvements to a separate PR if you'd like (we never got to a conclusion here 🙂 )

Contributor

@C0urante C0urante left a comment


Thanks Yash, this looks great.

RE non-infinite retries: probably a good idea, needs a KIP though IMO.

RE Javadoc comment updates: a separate PR would be great. Unless they're required by a functional change, they should be kept separate, which makes them easier to review and merge.

Contributor


This test is great. I think it'd be worth it to perform a third and fourth tick. The third can be used to simulate successfully generating task configs after the two failed attempts, and the fourth can be used to ensure that we don't retry any further.

It's also worth noting that we're only testing the case where Connector::taskConfigs (or really, Worker::connectorTaskConfigs) fails, but the logic that's being added here applies if intra-cluster communication fails as well (which may happen if the leader of the cluster is temporarily unavailable, for example). It'd be nice if we could have test coverage for that too, but I won't block this PR on that.

Contributor Author

@yashmayya yashmayya Feb 28, 2023


Thanks, I've updated the test to add another herder tick which runs a successful task reconfiguration request (I skipped adding a fourth tick because the "no further retries" behavior is already verified by the poll timeout at the end of the previous tick).

Regarding the test case for the task reconfiguration REST request to the leader: I did consider that initially, but while trying to add one I ran into some timing-related complications arising from the use of the forwardRequestExecutor, at which point it felt like more trouble than it was worth. However, your comment made me revisit it, and I've made some changes to drop in a simple mock executor service which runs requests synchronously (on the same thread as the caller). Let me know what you think?
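A synchronous executor of the kind described can be sketched like this. This is a hypothetical stand-in, not the actual test code from this PR: it satisfies the ExecutorService contract but runs every submitted task on the caller's thread, so forwarded requests complete deterministically before the test proceeds.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.AbstractExecutorService;
import java.util.concurrent.TimeUnit;

// Runs every task synchronously on the submitting thread; useful for
// replacing an asynchronous request-forwarding executor in tests.
public class DirectExecutorService extends AbstractExecutorService {
    private volatile boolean shutdown = false;

    @Override
    public void execute(Runnable command) {
        command.run(); // same thread as the caller, no queuing
    }

    @Override
    public void shutdown() {
        shutdown = true;
    }

    @Override
    public List<Runnable> shutdownNow() {
        shutdown = true;
        return Collections.emptyList(); // nothing is ever queued
    }

    @Override
    public boolean isShutdown() {
        return shutdown;
    }

    @Override
    public boolean isTerminated() {
        return shutdown;
    }

    @Override
    public boolean awaitTermination(long timeout, TimeUnit unit) {
        return shutdown;
    }
}
```

Because AbstractExecutorService implements submit(...) on top of execute(...), any Future returned by this executor is already completed by the time submit returns, which sidesteps the timing issues mentioned above.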

Contributor


Ah, fair point about the fourth tick!

I don't love using a synchronous executor here since it diverges significantly from the non-testing behavior of the herder. But, I can't think of a better way to test this without going overboard in complexity, and it does give us decent coverage.

So, good enough 👍

…check extra herder tick after task configs generated successfully; Add testTaskReconfigurationRetriesWithLeaderRequestForwardingException; Revert Javadoc changes to ExponentialBackoff
Contributor

@C0urante C0urante left a comment


LGTM, thanks Yash!
