KAFKA-13348: Allow Source Tasks to Handle Producer Exceptions#11382
mimaison merged 5 commits into apache:trunk
Conversation
Unrelated tests, both locally and in Jenkins, appear flaky. All tests related to this change pass deterministically.
… source connectors to ignore producer exceptions. The connector will receive null RecordMetadata in the commitRecord callback in lieu of the task failing unconditionally.
Force-pushed eb5a48e to 34d93a8
Rebased, squashed, and force-pushed for merging after the KIP vote. Tests related to this change pass locally; there are a handful of unrelated nondeterministic test failures. Edit: fixed a related test that was missed until CI picked it up, and removed the lastSendFailed state from WorkerSourceTask.
@@ -366,7 +367,11 @@ private boolean sendRecords() {
    if (e != null) {
        log.error("{} failed to send record to {}: ", WorkerSourceTask.this, topic, e);
Should we modify this line to respect the errors.log.enable (and possibly errors.log.include.messages) properties?
I wonder if it might be useful to still unconditionally set producerSendException (or perhaps even convert that field from an AtomicReference<Throwable> to some kind of list, and append to it here) and then modify the contents (and possibly also name) of maybeThrowProducerSendException to have our error-handling logic. Thoughts?
Actually, that may complicate things by causing records to be given to SourceTask::commitRecord out of order (a record that caused a producer failure may be committed after a record that was dispatched to the producer after it). So probably best to keep the error-handling logic here, but I do still wonder if we can respect the logging-related configuration properties.
Now that this could be a tolerated error, it makes sense to have it respect the errors.log.enable configuration. However, the log line would then be duplicated: written unconditionally when we do not tolerate the error, and behind a config check when we do.
Are you envisioning something like this?
if (retryWithToleranceOperator.getErrorToleranceType().equals(ToleranceType.ALL)) {
    if (errorLogEnabled) { // get this value from the config in some manner
        log.error("{} failed to send record to {}: ", WorkerSourceTask.this, topic, e);
        log.trace("{} Failed record: {}", WorkerSourceTask.this, preTransformRecord);
    }
    commitTaskRecord(preTransformRecord, null);
} else {
    log.error("{} failed to send record to {}: ", WorkerSourceTask.this, topic, e);
    log.trace("{} Failed record: {}", WorkerSourceTask.this, preTransformRecord);
    producerSendException.compareAndSet(null, e);
}
I would need to look more closely at the other layers of objects on top of the SourceTask. enableErrorLog() is available in ConnectorConfig, but only SinkConnectorConfig makes use of it, so I would need to spin up some additional infrastructure. I'm not sure whether I would want to add WorkerErrantRecordReporter to WorkerSourceTask or pass the configuration down in some other manner.
Yes, I was thinking the behavior could be something like that code snippet, although we'd also want to respect the errors.log.include.messages property and would probably want the format of the error messages to be similar to the error messages we emit in other places where messages are tolerated (such as when conversion or transformation fails).
The error retry handling infrastructure predominantly concerns itself with the sink side of the house, to the point that any refactoring I would want to do would probably necessitate a KIP of its own. To that end, I have added an executeFailed() method to RetryWithToleranceOperator so the source worker can handle error logging with all of the existing infrastructure/configuration that exists for sink tasks.
I toyed with the idea of having the new executeFailed() fire without a tolerance type check. This would work for failing/ignoring as expected, but there would then be no mechanism to decide whether we should call commitRecord(). We could block on the future from executeFailed() and then check withinToleranceLimits(), but that introduces nondeterminism with interrupt/execution exceptions.
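The control flow being discussed can be sketched with a minimal, self-contained model. Note this is a hedged illustration, not the real Connect code: ToleranceType, onSendFailure, lastCommitted, and producerSendException below are simplified stand-ins for the actual WorkerSourceTask machinery.

```java
import java.util.concurrent.atomic.AtomicReference;

public class ProducerErrorSketch {
    enum ToleranceType { NONE, ALL }

    // First producer exception seen; the real task loop rethrows it to fail the task.
    static final AtomicReference<Throwable> producerSendException = new AtomicReference<>();
    // Stand-in for commitTaskRecord(record, null) bookkeeping.
    static Object lastCommitted;

    // Simplified producer-callback error path: tolerated errors commit the
    // record with null metadata; otherwise the exception is stashed so the
    // task fails on its next iteration.
    static void onSendFailure(ToleranceType tolerance, Object preTransformRecord, Exception e) {
        if (tolerance == ToleranceType.ALL) {
            lastCommitted = preTransformRecord;
        } else {
            producerSendException.compareAndSet(null, e);
        }
    }

    public static void main(String[] args) {
        onSendFailure(ToleranceType.ALL, "record-1", new RuntimeException("broker unavailable"));
        System.out.println("committed: " + lastCommitted);
        onSendFailure(ToleranceType.NONE, "record-2", new RuntimeException("broker unavailable"));
        System.out.println("task will fail: " + (producerSendException.get() != null));
    }
}
```

This also shows why the callback path keeps error handling local: the decision to commit with null metadata has to happen per record, in order, rather than after the fact via a shared exception field.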
…ng logging infrastructure/configuration.
mimaison left a comment:
Thanks @TheKnowles for the PR. I've made a first pass and left a few comments.
log.error("{} failed to send record to {}: ", WorkerSourceTask.this, topic, e);
log.trace("{} Failed record: {}", WorkerSourceTask.this, preTransformRecord);
producerSendException.compareAndSet(null, e);
if (retryWithToleranceOperator.getErrorToleranceType().equals(ToleranceType.ALL)) {
We can use == to compare enums.
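A small standalone illustration of this point: enum constants are singletons within a JVM, so identity comparison is correct, and == is additionally safe when the left-hand reference is null, where .equals would throw. The EnumCompare class here is an illustrative stand-in, not Connect code.

```java
public class EnumCompare {
    enum ToleranceType { NONE, ALL }

    public static void main(String[] args) {
        ToleranceType t = ToleranceType.ALL;
        // Enum constants are singletons, so identity comparison works.
        System.out.println(t == ToleranceType.ALL);      // true
        System.out.println(t.equals(ToleranceType.ALL)); // true, but...
        ToleranceType unset = null;
        // == is null-safe; unset.equals(ToleranceType.ALL) would throw
        // a NullPointerException here.
        System.out.println(unset == ToleranceType.ALL);  // false
    }
}
```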
// executeFailed here allows the use of existing logging infrastructure/configuration
retryWithToleranceOperator.executeFailed(Stage.KAFKA_PRODUCE, WorkerSourceTask.class,
        preTransformRecord, e);
commitTaskRecord(preTransformRecord, null);
Should we have a debug/trace log in this path?
Previously it was suggested to have the tolerance operator handle this via the logging report. I would personally find it useful to have it in the Connect log regardless of the tolerance error logging configuration. I've moved the error/debug log lines above the tolerance check so they log in all cases.
We should not be logging at ERROR level for every single record if we aren't failing the task unless the user has explicitly enabled this by setting errors.log.enable to true in their connector config.
Let's keep the existing trace and error log lines in the else block.
My suggestion is to add a line at the debug or trace level in the if block so users can know if an error is ignored.
That was my misunderstanding; thank you both for the feedback. Update made.
// For source connectors that want to skip kafka producer errors.
// They cannot use withinToleranceLimits() as no failure may have actually occurred prior to the producer failing
// to write to kafka.
public synchronized ToleranceType getErrorToleranceType() {
Does this need to be synchronized?
It does not. The type is immutable and thread-safe. I had dug through the ticket that retroactively made this class thread-safe, and at the time it seemed like a good idea to add synchronized to match the rest of the class, but it is not necessary at all. Removed.
Future<Void> errantRecordFuture = context.report();
if (!withinToleranceLimits()) {
    errorHandlingMetrics.recordError();
    throw new ConnectException("Tolerance exceeded in error handler", error);
Now that this message can come from 2 different paths, should we add some context to the message to disambiguate them?
I added some context to the string error message denoting it was a Source Worker. I am open to suggestions on how verbose this message should be.
createWorkerTask(TargetState.STARTED);
}

private void createWorkerTaskWithErrorToleration() {
Can we reuse the createWorkerTask() method just below by passing a RetryWithToleranceOperator argument instead of creating the WorkerSourceTask object here?
+1. I have refactored the constructors to be cleaner with the various parameter lists.
expectSendRecordOnce();
expectSendRecordProducerCallbackFail();
sourceTask.commitRecord(EasyMock.anyObject(SourceRecord.class), EasyMock.anyObject(RecordMetadata.class));
Instead of EasyMock.anyObject(RecordMetadata.class) should we use EasyMock.isNull() to assert we indeed pass null to the task in case there was a failure?
…r task creation in test. Misc. code cleanup.
@mimaison Thank you for reviewing. I've replied to each comment above and pushed changes.
Happy to squash and force push once everyone is pleased with the changes.
@TheKnowles Don't worry about squashing everything; it's done automatically when we merge PRs. Thanks for the quick update, I'll take another look.
@C0urante Do you have further comments, or can I merge?
Thanks @TheKnowles for this contribution! Sorry it took so long between getting votes on the KIP and reviews on your PR. This feature will be in the next minor release, Kafka 3.2.0.
This change gives source connectors the option to set "errors.tolerance" to "all" to handle/ignore producer exceptions. In the event the producer cannot write to Kafka, the connector's commitRecord() callback is invoked with null RecordMetadata. This is new behavior for the errors.tolerance setting; the default behavior is still to kill the task unconditionally when errors.tolerance is "none".
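For reference, a source connector opting into this behavior sets errors.tolerance in its connector config. A sketch of such a config follows; the connector name and class here are hypothetical placeholders, and the errors.log.* lines show the logging properties discussed above:

```json
{
  "name": "my-source-connector",
  "config": {
    "connector.class": "com.example.MySourceConnector",
    "errors.tolerance": "all",
    "errors.log.enable": "true",
    "errors.log.include.messages": "true"
  }
}
```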
A unit test has been added to validate that the producer callback is invoked on failure. The source task will ignore the exception and the task will not be killed.