KAFKA-19479: Add test to verify data-loss bug was fixed inside the producer #20254
mjsax merged 6 commits into apache:trunk
Conversation
|
Thanks for the PR. -- I did check it out and ran it locally, and believe it's actually a bug inside the producer. Before Kafka Streams commits offsets, it does call `producer.flush()` to ensure all pending writes have completed. The problem inside the producer seems to be that it initializes a flush by bumping an internal counter, but `flush()` can return before all records are actually delivered. There seems to be some other issue in splitting batches though, as there are repeated logs: it seems the producer is actually not able to split the batch (guess we hit some broker-side limit) and keeps retrying. I am not a producer expert, and thus I am not 100% sure right now what the right fix should be -- it seems to be a more difficult fix. \cc @lianetm @kirktrue |
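A minimal sketch of the sequence Kafka Streams relies on (illustrative names, not the actual Streams internals): if `flush()` returns before all in-flight records are durable, the offset commit that follows can record progress for data that was never written.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of the at-least-once contract Streams relies on:
// flush() must block until every sent record is durable, because the
// offset commit that follows assumes all prior sends have completed.
public class FlushBeforeCommit {
    static List<String> broker = new ArrayList<>();   // records that reached the broker
    static List<String> pending = new ArrayList<>();  // records still in flight

    static void send(String record) { pending.add(record); }

    // Correct flush: drain everything before returning.
    static void flush() {
        broker.addAll(pending);
        pending.clear();
    }

    public static void main(String[] args) {
        send("r1");
        send("r2");
        flush();                  // must not return while "pending" is non-empty
        long committedOffset = 2; // safe only because flush() really waited
        if (broker.size() < committedOffset) {
            throw new IllegalStateException("data loss: committed past undelivered records");
        }
        System.out.println("delivered=" + broker.size() + " committed=" + committedOffset);
    }
}
```

The bug discussed below is exactly the broken version of this contract: `flush()` returning while the equivalent of `pending` is still non-empty.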
|
Thank you so much for the detailed explanation @mjsax. I have not yet started my investigation, but I think this really helps and gives me a solid starting point to trace the bug. Since you've confirmed that this is indeed a bug, I would like to start by writing more targeted failing unit tests and committing them within this same PR. I think this would make it easier for everyone to review and propose a fix/approach, since your initial impression was that it seems to be a more difficult fix. |
|
Hey all, interesting. I don't have too much context, but it does look fishy when we unblock `flush()` while batches are still being retried. Definitely needs more investigation; maybe start by adding a test to check the retry behaviour around batch splitting. Hope it helps, I will stay in the loop. |
Given that the issue is in the producer, it might be simpler/better to just try writing a more tailored unit test in a new PR? Ie, do a couple of `send()` calls followed by a `flush()` and verify when it unblocks. Or you try to go even one level deeper, and try to unit test the batch-splitting logic directly. |
|
@mjsax : Hmm, the splitting code has some logic to chain the response futures. When we append a record to a new batch after a split, we create a new future for the record and chain it to the old future.
So, a pending `send()` future shouldn't be unblocked prematurely by the split. @lianetm : Currently, |
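The chaining Jun describes can be modeled with `CompletableFuture` as a simplified stand-in for `FutureRecordMetadata` (this is a sketch, not the actual producer code): the future originally handed back from `send()` must not complete until the split batch the record landed in completes.

```java
import java.util.concurrent.CompletableFuture;

// Simplified model of chaining a record's future across a batch split.
// In the real producer this is FutureRecordMetadata chaining; here a
// CompletableFuture illustrates the invariant: the future returned from
// send() completes only when the batch the record ended up in completes.
public class FutureChaining {
    // When "source" (the split batch's future) completes, complete
    // "dependent" (the record's original future) with the same outcome.
    static <T> void chain(CompletableFuture<T> source, CompletableFuture<T> dependent) {
        source.whenComplete((value, err) -> {
            if (err != null) dependent.completeExceptionally(err);
            else dependent.complete(value);
        });
    }

    public static void main(String[] args) {
        CompletableFuture<Long> originalFuture = new CompletableFuture<>();   // returned by send()
        CompletableFuture<Long> splitBatchFuture = new CompletableFuture<>(); // new batch after split
        chain(splitBatchFuture, originalFuture);

        System.out.println("before split completes: " + originalFuture.isDone()); // false
        splitBatchFuture.complete(42L);
        System.out.println("after split completes:  " + originalFuture.isDone()); // true
    }
}
```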
|
A label of 'needs-attention' was automatically added to this PR in order to raise the attention of the committers. |
|
@lianetm @junrao @mjsax tagging as a reminder for follow-up. Since this is confirmed to be a bug (potentially critical), I believe it may require more discussion among the community experts. To make it easier for others to reproduce and debug, one option could be that I temporarily disable the failing test and we merge this PR, so that community members can pick it up more easily. If there's any way I can help and contribute to this issue, I would really love to. |
@junrao Yes (as described above: #20254 (comment)), I did set a breakpoint somewhere in the producer to see what it's doing, and while it was retrying the send internally, it did unblock `flush()`. |
If a record fails to be delivered after delivery.timeout.ms, Sender calls failBatch(), which eventually calls ProduceRequestResult.done(). This unblocks Producer.flush(), but it won't fail with a TimeoutException since it doesn't check the error in ProduceRequestResult. Does KStreams depend on Producer.flush() failing? It seems that KStreams needs to check the future returned in the send() call to know if the record is sent successfully.
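As Jun notes, `flush()` only waits for completion; it does not surface per-record delivery errors, which are only visible through the future returned by `send()`. A self-contained model of that distinction (a sketch using `CompletableFuture`, not the real producer classes):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeoutException;

// Simplified model of why flush() alone is not enough: flush() only waits
// for the result to be "done" (success OR failure), while the per-record
// future carries the actual delivery error.
public class FlushVsFuture {
    public static void main(String[] args) {
        CompletableFuture<Long> recordFuture = new CompletableFuture<>();

        // The batch timed out: it is "done" (so flush() unblocks) but failed.
        recordFuture.completeExceptionally(new TimeoutException("delivery.timeout.ms expired"));

        boolean flushWouldReturn = recordFuture.isDone(); // true: flush() sees completion
        System.out.println("flush() unblocks: " + flushWouldReturn);

        try {
            recordFuture.get(); // only here does the caller learn about the failure
        } catch (ExecutionException e) {
            System.out.println("send() future reports: " + e.getCause().getMessage());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

This mirrors the question to Kafka Streams: it must check the `send()` futures (or the send callback) rather than rely on `flush()` failing.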
The relevant code in Sender only tries to split a batch if it has more than one record, so the splitting shouldn't recurse forever. It does have the issue that when a split batch is created, it gets a new createMs, which inadvertently extends the timeout.
Do you have a test that can reproduce this? |
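The split condition Jun refers to can be sketched as follows (a simplified model with illustrative names, not the actual Sender source): a batch is only split on MESSAGE_TOO_LARGE when it holds more than one record, since a single oversized record can never fit and must fail instead.

```java
// Simplified model of the Sender's split decision on MESSAGE_TOO_LARGE.
// The real logic lives in Sender/ProducerBatch; names here are illustrative.
public class SplitDecision {
    static boolean shouldSplit(int recordCount, String error) {
        // Only multi-record batches can be split; a lone oversized record
        // can never fit under the limit, so splitting would loop forever.
        return "MESSAGE_TOO_LARGE".equals(error) && recordCount > 1;
    }

    public static void main(String[] args) {
        System.out.println(shouldSplit(10, "MESSAGE_TOO_LARGE")); // true: split the batch
        System.out.println(shouldSplit(1, "MESSAGE_TOO_LARGE"));  // false: fail the batch
    }
}
```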
|
@mjsax : Took a closer look. This is indeed a bug in the producer. ProducerBatch has a produceFuture of type ProduceRequestResult and a list of Thunk, one for each produced record. Each Thunk has a future of type FutureRecordMetadata. When we split a ProducerBatch, we have the logic to chain Thunk.future to the newly split batches. This makes sure that we don't unblock producer.send() prematurely. But the problem is that flush() doesn't wait on Thunk.future. Instead, it waits on ProducerBatch.produceFuture. After splitting, we immediately call ProducerBatch.produceFuture.done(). This unblocks the flush() call prematurely, since the split batches haven't been completed yet.

As for the fix, one way is to change the logic in RecordAccumulator.awaitFlushCompletion(): instead of waiting on each incomplete batch's produceFuture, we collect all thunks in each incomplete batch and wait on each thunk's FutureRecordMetadata. This way, the chaining logic for FutureRecordMetadata will kick in.

As for why the splitting loops forever, this is probably because of a recently fixed bug in https://github.com/apache/kafka/pull/20358/files.

@shashankhs11 : Would you be willing to submit a patch? |
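The proposed fix can be sketched with a simplified model (illustrative types, not the actual RecordAccumulator code): instead of waiting on the batch-level produceFuture, which completes as soon as the batch is split, flush waits on every record-level future, which is chained into the split batches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Simplified model of the fix to RecordAccumulator.awaitFlushCompletion():
// wait on the per-record futures (the real FutureRecordMetadata objects are
// chained across splits) rather than on the batch-level produceFuture,
// which completes as soon as the batch is split.
public class AwaitFlushFix {
    static class Batch {
        CompletableFuture<Void> produceFuture = new CompletableFuture<>();
        List<CompletableFuture<Void>> recordFutures = new ArrayList<>();
    }

    // Buggy behavior: done once produceFuture is done, even though the
    // records were re-queued into split batches that are still in flight.
    static boolean flushDoneBuggy(Batch b) {
        return b.produceFuture.isDone();
    }

    // Fixed behavior: done only when every record-level future is done.
    static boolean flushDoneFixed(Batch b) {
        return b.recordFutures.stream().allMatch(CompletableFuture::isDone);
    }

    public static void main(String[] args) {
        Batch batch = new Batch();
        batch.recordFutures.add(new CompletableFuture<>());
        batch.recordFutures.add(new CompletableFuture<>());

        batch.produceFuture.complete(null); // batch was split: produceFuture done immediately

        System.out.println("buggy flush returns: " + flushDoneBuggy(batch)); // true (premature!)
        System.out.println("fixed flush returns: " + flushDoneFixed(batch)); // false (still waiting)

        batch.recordFutures.forEach(f -> f.complete(null)); // split batches complete
        System.out.println("fixed flush now:     " + flushDoneFixed(batch)); // true
    }
}
```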
@junrao Yes, I would love to. I will look into this and try to get a better understanding of what is required. |
|
@shashankhs11 : Thanks. Since the fix doesn't change any public API, no KIP is needed. |
force-pushed from 054e6e2 to 1537aea
|
@junrao patch is available for review. |
…ges when the producer fails with MESSAGE_TOO_LARGE, violating delivery guarantees (#20285) Bug Fix in Producer where flush() does not wait for a batch to complete after splitting. Cf - #20254 (comment) and [KAFKA-19479](https://issues.apache.org/jira/browse/KAFKA-19479) for more details Reviewers: Jun Rao <junrao@gmail.com>
force-pushed from 1537aea to 9d895b5
```java
@Test
public void shouldCommitOffsetsAndProduceMessagesNormallyForSmallerRecordCount() throws Exception {
```
I am wondering if we need the 2nd test? Each test takes about 30-31s to run.
|
I agree. Having the negative test should be sufficient, as it's the interesting case which did not work. The positive case is implicitly covered by many other integration tests, so it seems unnecessary.
|
@mjsax I updated the test. Could you take a look? |
force-pushed from db61357 to c30fdc6
|
CI check failed due to a failing test. Rebased with latest trunk. |
|
Thanks for the test! Merged to `trunk`. |
…ges when the producer fails with MESSAGE_TOO_LARGE, violating delivery guarantees (apache#20285) Bug Fix in Producer where flush() does not wait for a batch to complete after splitting. Cf - apache#20254 (comment) and [KAFKA-19479](https://issues.apache.org/jira/browse/KAFKA-19479) for more details Reviewers: Jun Rao <junrao@gmail.com>
…oducer (apache#20254) Adds a new integration test to verify that we don't lose any data after a MESSAGE_TOO_LARGE error on write. Reviewers: Matthias J. Sax <matthias@confluent.io>
Adds a new integration test to verify that we don't lose any data after
a MESSAGE_TOO_LARGE error on write.
Reviewers: Matthias J. Sax <matthias@confluent.io>