
KAFKA-9051: Prematurely complete source offset read requests for stopped tasks#7532

Merged
rhauch merged 7 commits into apache:trunk from C0urante:kafka-9051 on Nov 20, 2019

Conversation

@C0urante (Contributor) commented Oct 16, 2019

Jira: KAFKA-9051

The changes here cause source offset readers to forcefully close when tasks fail to shut down within the graceful shutdown timeout period. When this happens, all pending and future offset read requests will throw an exception.

This is in line with the API for the OffsetStorageReader class, which states that "the only case when an exception will be thrown is if the entire request failed, e.g. because the underlying storage was unavailable." If a task is blocked on reading offsets from Kafka to the point where it has failed to shut down within the graceful shutdown timeout period, it's safe to say that the offset read request has failed and, as a result, to throw an exception.

Initially, I considered simply returning null values from the offset reader once it was closed; however, that could cause source tasks to mutate external state as if there were no offset for the requested source partitions, which might negatively impact other tasks that the connector has since started. Throwing an exception is safer, and since one is only thrown when a task has exceeded its graceful shutdown period (rather than merely being scheduled to stop), it should not affect tasks running in a healthy environment.
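The behavior described above can be sketched as follows. This is an illustrative sketch only, not the actual Connect code; the class and method names here are hypothetical stand-ins for the real offset reader.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch (not the actual Connect classes): a reader that fails
// all future offset reads once it has been forcefully closed, instead of
// returning null/empty values that could be mistaken for "no offset".
public class SketchOffsetReader {
    private final AtomicBoolean closed = new AtomicBoolean(false);

    public Map<String, Long> offsets() {
        if (closed.get()) {
            // In line with the OffsetStorageReader contract: throw when the
            // entire request has failed (here, the task is a zombie that
            // exceeded its graceful shutdown timeout).
            throw new IllegalStateException(
                "Offset reader closed; task failed to shut down within the graceful timeout");
        }
        // Stand-in for a real read from the backing offset store.
        return Collections.emptyMap();
    }

    public void close() {
        closed.set(true);
    }
}
```

A zombie task blocked on offsets() therefore gets a hard failure rather than a plausible-looking empty result it might act on.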

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@wicknicks (Contributor) left a comment

Thanks for the fix, @C0urante. Some comments.

Contributor:

I think we should keep using the interface everywhere; extend the interface if you have to.

Contributor:

I think this could return stale values for a key: if this future was prematurely closed, we might still return a stale value for a key. We should be careful that this doesn't break any of the existing contracts of OffsetReader.

Contributor Author:

Yes, this is noted in the description. However, if we want to stick strictly to the API for the OffsetStorageReader, one option we have is to simply return an empty map to all callers blocked on the future.

Contributor:

Can we call this method stop() or forceStop()?

Contributor Author:

How about forceComplete? stop is a little ambiguous w/r/t whether threads blocked on get encounter a CancellationException or continue normally.

Contributor:

We should cache the fact that this future has been force-stopped, and if so, any subsequent calls to get() should immediately return with an empty map. Right now, it looks like it will retry to catch up with the remote log.

Contributor Author:

I agree with the gist of this comment, but I think the right place for that caching behavior is in the offset reader class, not in the backing store (since each task is given its own offset reader, but the backing store is shared among all of them).

@C0urante (Contributor Author):

Thanks for the review, @liukrimhrim and @wicknicks! I've incorporated most of your comments and responded to the rest; ready for another round when you have time

@liukrimhrim left a comment

LGTM

@wicknicks (Contributor) left a comment

LGTM. Thanks for the fix, @C0urante.

@C0urante (Contributor Author):

@wicknicks I've altered the functionality here to be a little more permissive for tasks in a healthy Kafka cluster and (hopefully) a little less likely to provide bad data to zombie tasks that may negatively impact running (or to-be-run) tasks. Would you mind doing another round when you have time?

@ncliang (Contributor) left a comment

Looks good in general. Thanks for implementing the changes we discussed yesterday! Just one comment about testing the change.

}
this.cancelled = true;
finishedLatch.countDown();
return true;
Contributor:

Do we have tests covering the cancellation logic?

Contributor Author:

Added those now.

@ncliang (Contributor) left a comment

Looks great! Just a few more comments.

}

protected static void runSeparateTestThread(Runnable task) {
Thread t = new Thread(task);
Contributor:

You can use the ExecutorService API and replace calls to this method with just executorService.submit().
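A minimal sketch of this suggestion, under the assumption that the test only needs the work to run off the main thread and to propagate failures; the class name is hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Sketch of the reviewer's suggestion: rather than a hand-rolled
// runSeparateTestThread(Runnable) helper, submit the work to an
// ExecutorService. The returned Future rethrows any exception from the
// task as an ExecutionException, so the test thread cannot miss failures.
public class ExecutorSketch {
    public static int runOnWorker() throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Callable runs on the worker thread, just like the old helper.
            Future<Integer> result = executor.submit(() -> 41 + 1);
            // get() blocks with a timeout and surfaces worker-side exceptions.
            return result.get(5, TimeUnit.SECONDS);
        } finally {
            executor.shutdown();
        }
    }
}
```

Compared with a bare Thread, this also removes the need to capture worker-thread exceptions in a shared AtomicReference.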

});
assertFalse(testCallback.isDone());
testCallback.get();
if (testThreadException.get() != null) {
Contributor:

Shouldn't the get() be canceled and throw CancellationException above? When will the code below be executed?

Contributor Author:

It shouldn't be executed, but just in case something goes wrong, it seems more appropriate to throw that exception than just to fail the test because the CancellationException wasn't thrown from get.

@Override
public boolean cancel(boolean b) {
return false;
if (!b) {
Contributor:

This implementation doesn't strictly adhere to the Future#cancel() contract (https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/Future.html#cancel-boolean-): "After this method returns, subsequent calls to isDone() will always return true." Before, we were ignoring the mayInterruptIfRunning flag and always cancelling when cancel() is called. I think I'd prefer that behavior to not actually cancelling when cancel(false) is called.

Contributor Author:

> Before, we were ignoring the mayInterruptIfRunning flag and always cancelling when cancel() is called

I don't believe this is correct. Before, cancel was just a no-op and no cancellation occurred, regardless of the value of mayInterruptIfRunning. That also violated the contract of the Future interface; I don't see this as a huge issue, since we know that no other parts of the code base rely on the cancel method anyway.

Still, after reading the javadocs more carefully, I think it should be possible to implement cancel correctly even if mayInterruptIfRunning is false.
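The contract under discussion can be sketched like this. This is a hypothetical class, not ConvertingFutureCallback itself: cancellation succeeds regardless of the flag, mayInterruptIfRunning only controls whether the worker thread is interrupted, and isDone() returns true after a successful cancel.

```java
import java.util.concurrent.CountDownLatch;

// Hedged sketch of a cancel(boolean) that follows the Future#cancel()
// contract: returns false only if already done, honors mayInterruptIfRunning
// purely as an interruption hint, and leaves isDone() true afterwards.
public class CancelSketch {
    private final CountDownLatch finishedLatch = new CountDownLatch(1);
    private volatile boolean cancelled = false;
    private volatile Thread worker; // thread performing the blocking read, if any

    public boolean cancel(boolean mayInterruptIfRunning) {
        if (isDone()) {
            return false; // already completed or cancelled
        }
        cancelled = true;
        if (mayInterruptIfRunning && worker != null) {
            worker.interrupt(); // only interrupt when the caller asked for it
        }
        finishedLatch.countDown(); // unblock callers waiting in get()
        return true;
    }

    public boolean isCancelled() {
        return cancelled;
    }

    public boolean isDone() {
        return finishedLatch.getCount() == 0;
    }
}
```

With this shape, cancel(false) still cancels the future; it merely declines to interrupt an in-progress read.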

Contributor:

I was referring to a previous snapshot where you didn't check the flag.

testCallback.onCompletion(expectedError, null);
assertEquals(0, testCallback.numberOfConversions());
try {
testCallback.get();
Contributor:

fail() if no exception?

Contributor Author:

Ack, addressed
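The pattern the reviewer is asking for can be sketched as below, using a plain CompletableFuture in place of the test's callback (the class and method names are illustrative): if get() does not throw, the test must fail explicitly instead of passing silently.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

// Sketch of the fail()-after-get() pattern: returns true only when get()
// threw the expected ExecutionException, false when it completed normally.
public class FailPatternSketch {
    public static boolean failsAsExpected(CompletableFuture<String> future) {
        try {
            future.get();
            return false; // in a JUnit test, this is where fail() belongs
        } catch (ExecutionException e) {
            return true;  // expected path: the callback completed with an error
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Without the fail() (here, the `return false` on the success path), a regression that stops the exception from being thrown would go unnoticed.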

testCallback.onCompletion(null, "420");
assertEquals(0, testCallback.numberOfConversions());
try {
testCallback.get();
Contributor:

ditto, missing fail() call

Contributor Author:

Ack, addressed

});
assertFalse(testCallback.isDone());
try {
testCallback.get();
Contributor:

ditto, fail()

Contributor Author:

Ack, addressed

@C0urante (Contributor Author):

@ncliang thanks, ready for the next round

@ncliang (Contributor) left a comment

LGTM! Looks great!

@rhauch (Contributor) left a comment

Thanks, @C0urante. I have a few suggestions, and one question about removing the callback parameter.

}

public void close() {
synchronized (offsetReadFutures) {
Contributor:

We should first check to see if this is closed, and if so simply return.

Contributor Author:

Ack, will address.

synchronized (offsetReadFutures) {
closed.set(true);
for (Future<Map<ByteBuffer, ByteBuffer>> offsetReadFuture : offsetReadFutures) {
offsetReadFuture.cancel(true);
Contributor:

It is now possible with your changes for ConvertingFutureCallback to throw an exception during cancel(boolean). If that happens, then this reader instance will not be properly/completely closed, so that needs to be handled.

Contributor Author:

Ack, will address.
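Both of the reviewer's suggestions can be combined into one close() shape. The sketch below is illustrative rather than the exact Connect code (the track() helper and class name are hypothetical): close() returns immediately on a second call, and an exception thrown while cancelling one future does not stop the remaining futures from being cancelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of an idempotent, exception-tolerant close() for an offset reader
// that tracks its outstanding offset read futures.
public class CloseSketch {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final List<Future<?>> offsetReadFutures = new ArrayList<>();

    public void track(Future<?> future) {
        synchronized (offsetReadFutures) {
            offsetReadFutures.add(future);
        }
    }

    public void close() {
        if (!closed.compareAndSet(false, true)) {
            return; // already closed; nothing to do
        }
        synchronized (offsetReadFutures) {
            for (Future<?> future : offsetReadFutures) {
                try {
                    future.cancel(true);
                } catch (RuntimeException e) {
                    // Swallow (a real implementation would log) so the
                    // remaining futures are still cancelled and the reader
                    // finishes closing completely.
                }
            }
            offsetReadFutures.clear();
        }
    }

    public boolean isClosed() {
        return closed.get();
    }
}
```

The compareAndSet makes the early-return check and the flag flip a single atomic step, so concurrent close() calls cannot both enter the cancellation loop.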

@rhauch (Contributor) left a comment

LGTM. Thanks, @C0urante!

Waiting for a green build before merging.

@rhauch rhauch merged commit da43372 into apache:trunk Nov 20, 2019
@C0urante C0urante deleted the kafka-9051 branch November 20, 2019 05:20
rhauch pushed a commit that referenced this pull request Nov 20, 2019
KAFKA-9051: Prematurely complete source offset read requests for stopped tasks (#7532)

Prematurely complete source offset read requests for stopped tasks, and added unit tests.

Author: Chris Egerton <chrise@confluent.io>
Reviewers: Arjun Satish <arjun@confluent.io>, Nigel Liang <nigel@nigelliang.com>, Jinxin Liu <liukrimhim@gmail.com>, Randall Hauch <rhauch@gmail.com>