Skip to content

KAFKA-9971: Error Reporting in Sink Connectors (KIP-610)#8720

Merged
kkonstantine merged 8 commits intoapache:trunkfrom
aakashnshah:dlq
May 28, 2020
Merged

KAFKA-9971: Error Reporting in Sink Connectors (KIP-610)#8720
kkonstantine merged 8 commits intoapache:trunkfrom
aakashnshah:dlq

Conversation

@aakashnshah
Copy link
Copy Markdown
Contributor

@aakashnshah aakashnshah commented May 23, 2020

Implementation for KIP-610: https://cwiki.apache.org/confluence/display/KAFKA/KIP-610%3A+Error+Reporting+in+Sink+Connectors

This PR adds the ErrantRecordReporter interface as well as its implementation - WorkerErrantRecordReporter. The WorkerErrantRecordReporter is created in Worker and brought up through WorkerSinkTask to WorkerSinkTaskContext. An integration test and unit test has been added.

Copy link
Copy Markdown
Contributor

@levzem levzem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aakashnshah looks great! excited for this feature

just got mostly nits for you

Comment thread connect/api/src/main/java/org/apache/kafka/connect/sink/ErrantRecordReporter.java Outdated
Comment thread connect/api/src/main/java/org/apache/kafka/connect/sink/ErrantRecordReporter.java Outdated
Comment thread connect/api/src/main/java/org/apache/kafka/connect/sink/ErrantRecordReporter.java Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract to variable

Comment on lines 242 to 256
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this invalid config step here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it hurts to add this step.

Copy link
Copy Markdown
Contributor

@rhauch rhauch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @aakashnshah! I have a few comments below, including a kind of big question: why are we not reusing the existing RetryWithToleranceOperator that already tracks the tolerance and 1 or more reporters?

Comment thread connect/api/src/main/java/org/apache/kafka/connect/sink/ErrantRecordReporter.java Outdated
Comment thread connect/api/src/main/java/org/apache/kafka/connect/sink/ErrantRecordReporter.java Outdated
Comment thread connect/api/src/main/java/org/apache/kafka/connect/sink/ErrantRecordReporter.java Outdated
Comment thread connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java Outdated
Comment thread connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java Outdated
Comment on lines 50 to 57
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All fields that can be final should be marked as such. This provides semantic intent to future developers and helps prevent unintentionally changing the fields in the future.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want to pass the ExecutionException to the ConnectException, or would it be better to pass that exception's cause to the ConnectException?

How about log messages here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested earlier about reusing the RetryWithToleranceOperator, and that doing so might require adding a produce-like method to that class that simply reports a new error. If that method took a Callback here and passed it to its producer.send(...) call, then we could provide a callback that removed the (completed) future from our list, helping to keep that list as small as possible with only the incomplete futures.

If we did that, we'd want to use a LinkedList rather than an ArrayList, since we're no longer removing futures only from the ends of the list.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we ensure that the producer future is never null, then we can remove the if (future == null) kind of checks in this class' methods.

Copy link
Copy Markdown
Contributor

@gharris1727 gharris1727 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this test ever encounter this exception? I don't think we will be able to backport this test to < 2.6 because the method won't exist at all, much less generate the exception that is being caught here.
If anything, this generates a less informative NPE later in put, and hides the actual root cause.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I should remove this comment. This test won't encounter this exception since it's always going to have the class and method if the test itself exists.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aakashnshah let's remove this comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have an unbounded waiting time? How does this interact with task.shutdown.graceful.timeout.ms? What is the delivery guarantee of these error reports?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the waiting time is unbounded. The delivery guarantee is that all errant records up to the latest offset in preCommit() will be sent to Kafka before preCommit() is invoked.

@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented May 27, 2020

ok to test

Copy link
Copy Markdown
Contributor

@rhauch rhauch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @aakashnshah. I think this is much closer to a more straightforward implementation of the KIP.

I took another pass, and there are quite a few things to change/correct. I'll keep reviewing the test code in a subsequent review.

Comment thread connect/api/src/main/java/org/apache/kafka/connect/sink/ErrantRecordReporter.java Outdated
Comment thread connect/api/src/main/java/org/apache/kafka/connect/sink/ErrantRecordReporter.java Outdated
Comment thread connect/api/src/main/java/org/apache/kafka/connect/sink/ErrantRecordReporter.java Outdated
Comment on lines 73 to 80
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't change this public API. InternalSinkRecord needs to be in the runtime module.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you move InternalSinkRecord to the runtime module, be sure to make this private final.

Copy link
Copy Markdown
Contributor

@rhauch rhauch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made another pass, with comments inline below. Thanks, @aakashnshah.

Comment thread connect/api/src/main/java/org/apache/kafka/connect/sink/SinkRecord.java Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a trace log message before and after this call.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aakashnshah let's remove this comment

Signed-off-by: Aakash Shah <ashah@confluent.io>
Signed-off-by: Aakash Shah <ashah@confluent.io>
Signed-off-by: Aakash Shah <ashah@confluent.io>
Copy link
Copy Markdown
Contributor

@rhauch rhauch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more suggestions

@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented May 27, 2020

retest this please

@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented May 27, 2020

ok to test

@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented May 27, 2020

retest this please

@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented May 27, 2020

ok to test

@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented May 27, 2020

retest this please

1 similar comment
@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented May 27, 2020

retest this please

@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented May 27, 2020

Retest this please.

@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented May 27, 2020

ok to test

@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented May 27, 2020

Closing to try to fix builds; will reopen shortly.

@rhauch rhauch closed this May 27, 2020
@rhauch rhauch reopened this May 27, 2020
@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented May 27, 2020

ok to test

@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented May 27, 2020

retest this please

@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented May 27, 2020

retest this please.

@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented May 27, 2020

retest this please

Copy link
Copy Markdown
Contributor

@rhauch rhauch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more fixes. It's getting close.

@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented May 28, 2020

Copy link
Copy Markdown
Contributor

@kkonstantine kkonstantine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @aakashnshah
Looks good overall.
I left a few mostly minor comments after a quick pass, if there's time to address.

Comment on lines +53 to +66
@Override
public boolean equals(Object o) {
return super.equals(o);
}

@Override
public int hashCode() {
return super.hashCode();
}

@Override
public String toString() {
return super.toString();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these overrides don't seem to add much.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, spotbugs complained if these were not here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

never mind then. I'll leave this to AI.

Comment thread connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, topic can never be null if it's coming from a parsed config value that doesn't have null as its default value. (another way to think of that is that you can't pass a null value from properties)


// delete connector
connect.deleteConnector(CONNECTOR_NAME);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented May 28, 2020

@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented May 28, 2020

ok to test

@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented May 28, 2020

retest this please

1 similar comment
@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented May 28, 2020

retest this please

Copy link
Copy Markdown
Contributor

@rhauch rhauch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @aakashnshah! LGTM, pending green builds (that are finally running!)

Copy link
Copy Markdown
Contributor

@kkonstantine kkonstantine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Nicely done @aakashnshah

@kkonstantine
Copy link
Copy Markdown
Contributor

2/3 builds are green. Merging now to allow for other PRs to resolve conflicts sooner rather than later, if needed.

@kkonstantine kkonstantine merged commit 38c1e96 into apache:trunk May 28, 2020
@kkonstantine kkonstantine changed the title KAFKA-9971: Error Reporting in Sink Connectors KAFKA-9971: Error Reporting in Sink Connectors (KIP-610) May 28, 2020
Kvicii pushed a commit to Kvicii/kafka that referenced this pull request May 30, 2020
* 'trunk' of github.com:apache/kafka: (36 commits)
  Remove redundant `containsKey` call in KafkaProducer (apache#8761)
  KAFKA-9494; Include additional metadata information in DescribeConfig response (KIP-569) (apache#8723)
  KAFKA-10061; Fix flaky `ReassignPartitionsIntegrationTest.testCancellation` (apache#8749)
  KAFKA-9130; KIP-518 Allow listing consumer groups per state (apache#8238)
  KAFKA-9501: convert between active and standby without closing stores (apache#8248)
  KAFKA-10056; Ensure consumer metadata contains new topics on subscription change (apache#8739)
  MINOR: Log the reason for coordinator discovery failure (apache#8747)
  KAFKA-10029; Don't update completedReceives when channels are closed to avoid ConcurrentModificationException (apache#8705)
  MINOR: remove unnecessary timeout for admin request (apache#8738)
  MINOR: Relax Percentiles test (apache#8748)
  MINOR: regression test for task assignor config (apache#8743)
  MINOR: Update documentation.html to refer to 2.6 (apache#8745)
  MINOR: Update documentation.html to refer to 2.5 (apache#8744)
  KAFKA-9673: Filter and Conditional SMTs (apache#8699)
  KAFKA-9971: Error Reporting in Sink Connectors (KIP-610) (apache#8720)
  KAFKA-10052: Harden assertion of topic settings in Connect integration tests (apache#8735)
  MINOR: Slight MetadataCache tweaks to avoid unnecessary work (apache#8728)
  KAFKA-9802; Increase transaction timeout in system tests to reduce flakiness (apache#8736)
  KAFKA-10050: kafka_log4j_appender.py fixed for JDK11 (apache#8731)
  KAFKA-9146: Add option to force delete active members in StreamsResetter (apache#8589)
  ...

# Conflicts:
#	core/src/main/scala/kafka/log/Log.scala
@aakashnshah aakashnshah deleted the dlq branch June 7, 2020 22:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants