KAFKA-14670: (part 1) Wrap Connectors in IsolatedConnector objects #13185
gharris1727 wants to merge 20 commits into apache:trunk
Conversation
Signed-off-by: Greg Harris <greg.harris@aiven.io>
Thanks @gharris1727. This is a great improvement that makes several common bugs much harder to write and I'm excited to see it land so that we can stop worrying about what's running on the herder thread, doing our due diligence around the context classloader, etc.
I've only taken a look at the functional changes and haven't reviewed changes to tests yet. I hope to do a full pass sometime this week.
```java
try {
    updateConnectorTasks(connName);
} catch (Exception e) {
    log.error("Unable to generate task configs for {}", connName, e);
}
```
This is a change in behavior too, right? We no longer throw in ConnectorContext::requestTaskReconfiguration if we encounter any errors.
This also seems reasonable (it aligns the behavior across standalone and distributed modes), but it does have consequences for the REST API, where restarting a connector no longer fails if we're unable to generate task configs for it (which is currently the case for both distributed and standalone modes).
Yes this is a change in behavior.
There is precedent for throwing ConnectException from ConnectorContext::requestTaskReconfiguration, so perhaps wrapping this in a ConnectException and propagating it would be a better behavior. I can move this to HerderConnectorContext, except it would only be effective for the standalone herder.
We can also see this as an opportunity to improve the StandaloneHerder by handling reconfigurations asynchronously and retrying them in the background, rather than 500'ing the REST API or dropping the failure silently.
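A minimal sketch of the exception-wrapping alternative mentioned above. All names here are stand-ins: `ConnectException` is a local placeholder for `org.apache.kafka.connect.errors.ConnectException`, and `updateConnectorTasks` is a hypothetical method simulating a failing task config generation.

```java
// Sketch of the exception-wrapping alternative discussed above (hypothetical names).
public class ReconfigurationSketch {

    // Stand-in for org.apache.kafka.connect.errors.ConnectException.
    static class ConnectException extends RuntimeException {
        ConnectException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Simulates a failing call to Connector::taskConfigs.
    static void updateConnectorTasks(String connName) throws Exception {
        throw new IllegalStateException("taskConfigs failed");
    }

    // Instead of logging and dropping the failure, wrap it and propagate it so the
    // caller (e.g. a REST handler) can surface it, for example as a 500 response.
    public static void requestTaskReconfiguration(String connName) {
        try {
            updateConnectorTasks(connName);
        } catch (Exception e) {
            throw new ConnectException("Unable to generate task configs for " + connName, e);
        }
    }

    public static void main(String[] args) {
        try {
            requestTaskReconfiguration("file-source");
            System.out.println("no exception");
        } catch (ConnectException e) {
            System.out.println(e.getMessage());
        }
    }
}
```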
Perhaps we can also consider this a failure of the signature of Herder::requestTaskReconfiguration. The DistributedHerder makes this asynchronous, but provides no future or callback to confirm the progress of the request.
Arguably StandaloneHerder is implementing the function signature correctly as a request that either succeeds or fails.
It also makes me think that a connector which repeatedly calls requestTaskReconfiguration (and then always fails in generateTaskConfigs) could spam the herder with retried restart requests. This is such a messy situation that the old function signatures hid from us :)
Okay, a lot to unpack here!
The more I think about it, the more I like the existing behavior for handling failures in task config generation. We automatically retry in distributed mode in order to absorb the risk of writing to the config topic or issuing a REST request to the leader, but since neither of those take place in standalone mode, it's fine to just throw the exception back to the caller (either a connector invoking ConnectorContext::requestTaskReconfiguration, or a REST API call to restart the connector) since the likeliest cause is a failed call to Connector::taskConfigs and automatic retries are less likely to be useful.
I think we should basically just preserve existing behavior here, with the one exception of fixing how we handle failed calls to requestTaskReconfiguration that occur during a call to restartConnector. Right now we don't handle any of those and, IIUC, just cause the REST request to time out after 90 seconds. Instead of timing out, we should return a 500 response in that case.
I don't think it's especially likely for connectors to continually invoke requestTaskReconfiguration given the automatic retry logic in distributed mode, and as of #13276, the impact of ongoing retries for that operation is drastically reduced.
This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or appropriate release branch). If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.
This PR is being marked as stale since it has not had any activity in 90 days. If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact). If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.
This PR has been closed since it has not had any activity in 120 days.
Jira: KAFKA-14670
This is the first part of the above ticket, applied only to SinkConnector and SourceConnector plugins.
Additional PRs will cover the other plugins, as the refactor was too large to reasonably review at once.
Design decisions:
- The `IsolatedPlugin<P>` class will be a common superclass for all plugin wrappers.
- The `IsolatedPlugin` superclass provides utility methods for subclasses to manage swapping the thread context classloader for each call in a way that has minimal boilerplate.
- `Isolated*` classes are intended to only be constructed within the plugin isolation infrastructure, and will all have package-local constructors.
- Wrapper methods declare `throws Exception` to remind callers that they may throw arbitrary exceptions.

Open questions/issues:
- The `hashCode`, `equals`, and `toString` methods do not have `throws Exception`, as the `Object` class does not have these throws clauses. That means that calling code cannot be forced to handle exceptions from these methods. For `toString`, the exception message is provided in place of the `toString` result, and `hashCode` and `equals` are wholly decoupled from the underlying `hashCode` and `equals` implementations.
- Wrapper methods declare `throws Exception` and not `Throwable`, the distinction being that `Exception`s are considered by the Java Language to be reasonable to catch in an application, and `Throwable`s were not. I wasn't sure whether the Connect runtime should be forced to handle errors like `OutOfMemoryError`, `LinkageError`, etc., or just let them propagate and kill the calling thread.
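The classloader-swapping utility described in the design decisions above could be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual implementation: `IsolatedPluginSketch` and `isolate` are made-up names, and the real `IsolatedPlugin` differs in scope and detail.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of the classloader-swapping superclass pattern; names and
// structure are illustrative only.
public abstract class IsolatedPluginSketch<P> {

    protected final P delegate;
    private final ClassLoader pluginLoader;

    // Package-local construction mirrors the intent that wrappers are only built
    // inside the plugin isolation infrastructure.
    IsolatedPluginSketch(P delegate, ClassLoader pluginLoader) {
        this.delegate = delegate;
        this.pluginLoader = pluginLoader;
    }

    // Swap the thread context classloader around every delegated call, restoring
    // the previous loader even if the plugin throws. Declaring throws Exception
    // reminds callers that plugins may throw arbitrary exceptions.
    protected <V> V isolate(Callable<V> call) throws Exception {
        Thread current = Thread.currentThread();
        ClassLoader saved = current.getContextClassLoader();
        current.setContextClassLoader(pluginLoader);
        try {
            return call.call();
        } finally {
            current.setContextClassLoader(saved);
        }
    }
}
```

A subclass wraps each plugin method in `isolate(...)`, so the swap/restore boilerplate lives in one place rather than at every call site.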