KAFKA-12980: Return empty record batch from Consumer::poll when position advances due to aborted transactions (#11046)
Conversation
@hachikuji could you take a look at this when you have a chance?
Force-pushed 8aff618 to d29d5f4
Force-pushed d29d5f4 to 4175df1
Thanks for the review @hachikuji; I agree that it's simpler and more future-proof to make the logic more general and not worry about the specific case of aborted transactions. I've updated the PR (and title) accordingly.
hachikuji
left a comment
Looks pretty good overall. Left some small comments.
final Fetch<K, V> fetch = pollForFetches(timer);
if (!fetch.isEmpty()) {
    if (fetch.records().isEmpty()) {
        log.debug(
I'm not so sure about the value of this log line. Maybe it would be more useful to ensure that we have enough logging in Fetcher that we can see when the position gets advanced without any record data? That would tell us not only why a poll() returned with empty data, but which partition caused it to do so. What do you think?
I think that works. My goal with this line was to help people understand the cause for this new behavior, as it may seem like a bug at first if poll returns empty batches before the poll timeout has elapsed. I've taken a stab at a message that has functional value (i.e., naming the specific topic partition whose position has advanced without any user-visible records) and still tries to let the user know about how/why this may affect poll behavior; LMKWYT.
Also, this section was pretty buggy as-was since it skipped the call to client.transmitSends and didn't pass the empty batch through the consumer's interceptors. I've fixed both of these; LMK if you think we should skip passing an empty batch to interceptors, though.
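The interceptor point above can be illustrated with a small stand-in. This is a hedged sketch, not Kafka's real `ConsumerInterceptors` API: `onConsume` here is a hypothetical stand-in that simply observes the batch, but it shows the invariant being discussed, that every batch returned from `poll()`, including an empty one, passes through the interceptor chain.

```java
import java.util.Collections;
import java.util.List;

// Illustrative sketch (not Kafka's actual interceptor API): the consumer
// should run every batch returned from poll() -- including an empty one --
// through its interceptor chain, so interceptor behavior stays uniform
// regardless of batch size.
public class InterceptorSketch {
    // Hypothetical stand-in for an interceptor chain's onConsume step:
    // observe the batch, then hand it back unchanged.
    static List<String> onConsume(List<String> records) {
        System.out.println("interceptor saw " + records.size() + " records");
        return records;
    }

    public static void main(String[] args) {
        // Even an empty batch is passed through before poll() returns it.
        List<String> result = onConsume(Collections.emptyList());
        System.out.println(result.isEmpty()); // true
    }
}
```

Skipping the interceptor call for empty batches would mean interceptors never observe polls that return due to an advanced position, which is why passing the empty batch through is the more uniform choice.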
Thanks @hachikuji. I've addressed the nits and tried to address the more substantial comment regarding logging; interested in your thoughts when you have a moment.
 * This method returns immediately if there are records available. Otherwise, it will await the passed timeout.
 * If the timeout expires, an empty record set will be returned. Note that this method may block beyond the
 * timeout in order to execute custom {@link ConsumerRebalanceListener} callbacks.
 * This method returns immediately if there are records available or if the position advances past control records.
nit: past control records or aborted transactions when isolation.level=READ_COMMITTED
if (partRecords.isEmpty()) {
    log.debug(
        "Advanced position for partition {} without receiving any user-visible records. "
        + "This is likely due to skipping over control records in the current fetch, "
How about this. First, we can augment the message above to something like this:
log.trace("Updating fetch position from {} to {} for partition {} and returning {} records from `poll()`",
    position, nextPosition, completedFetch.partition, partRecords.size());

This gives us enough information in the logs to see which partitions caused the poll() to return, and it tells us exactly where in the log the aborted transaction/control records exist. Second, maybe we can add back your previous log line, but make it a little more terse and put it at trace level:
log.trace(
    "Returning empty records from `poll()` since the consumer's position has advanced "
    + "for at least one topic partition")
Sounds good to me 👍
hachikuji
left a comment
LGTM overall. Just two minor comments.
        completedFetch.partition
    );
    log.trace("Returning empty records from `poll()` "
        + "since the consumer's position has advanced for at least one topic partition");
nit: I think this comment made more sense in its original location in KafkaConsumer. At this level, it seems redundant after the changes in the log message above. We would already say "... and returning 0 records from poll()"
This is true. I was hoping we could have something in here that explicitly states that this can happen because of the change in behavior implemented in this PR (i.e., skipping control records or aborted transactions). If you think it's worth it to call that out in a log message we can do that here, otherwise the entire `if (partRecords.isEmpty())` branch is unnecessary.
I'm inclined to either remove the log line entirely or move it back to its former location in KafkaConsumer. Will leave it up to you.
Moved it to KafkaConsumer.
…onsumer.java

Co-authored-by: Jason Gustafson <jason@confluent.io>
hachikuji
left a comment
LGTM. Thanks for the updates.
…ion advances due to aborted transactions (apache#11046)

Return empty record batch from Consumer::poll when position advances due to aborted transactions or control records. This is useful for reads to the end of a topic that contains aborted or empty transactions. If an aborted transaction is at the end of the topic, the consumer can now be expected to return from `poll` if it advances past that aborted transaction, and users can query the consumer's latest `position` for the relevant topic partitions to see if it has managed to make it past the end of the topic (or rather, what was the end of the topic when the attempt to read to the end of the topic began).

Reviewers: Jason Gustafson <jason@confluent.io>
@C0urante FYI this has a functionality loss. If you
@nbali this PR was merged over two years ago and I've lost almost all of the context around it. If you are seeing problems because of the changes made here, please file a Jira ticket describing what's going wrong (including examples and/or a practical use case if possible) and link it to the ticket for this change.
Jira
This is useful for reads to the end of a topic that contains aborted or empty transactions. If an aborted transaction is at the end of the topic, the consumer can now be expected to return from `poll` if it advances past that aborted transaction, and users can query the consumer's latest `position` for the relevant topic partitions to see if it has managed to make it past the end of the topic (or rather, what was the end of the topic when the attempt to read to the end of the topic began).

For a concrete example of this logic, see the `KafkaBasedLog::readToLogEnd` method that Connect employs to refresh its view of internal topics.

No new unit tests are added, but many existing ones are modified to ensure that aborted and empty transactions are detected and reported correctly by `Fetcher::collectFetch`.

Committer Checklist (excluded from commit message)
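The read-to-end pattern described above can be sketched with a minimal stand-in. This is a hedged illustration, not Connect's actual `KafkaBasedLog` code: `MiniConsumer` and `readToLogEnd` are hypothetical names, and real Kafka code would use end offsets and `consumer.position(tp)` per partition. The key behavior this PR enables is that `poll()` returns (with an empty batch) when the position advances past an aborted transaction, so the loop below terminates even when the log ends with one.

```java
// Simplified model of reading to the end of a log whose last "batch" is an
// aborted transaction: polling yields no user-visible records for it, but
// the position still advances, letting the read-to-end loop finish.
class MiniConsumer {
    private long position = 0;
    private final long logEnd;

    MiniConsumer(long logEnd) { this.logEnd = logEnd; }

    long position() { return position; }
    long endOffset() { return logEnd; }

    // Each poll advances the position by one batch; the final batch models
    // an aborted transaction and contributes zero user-visible records.
    int poll() {
        if (position >= logEnd) return 0;
        position++;
        boolean aborted = (position == logEnd);
        return aborted ? 0 : 5; // pretend committed batches hold 5 records
    }
}

public class ReadToEndSketch {
    static long readToLogEnd(MiniConsumer consumer) {
        long consumed = 0;
        long end = consumer.endOffset(); // snapshot the end before reading
        // Terminates because poll() advances position even for the
        // aborted transaction at the end of the log.
        while (consumer.position() < end) {
            consumed += consumer.poll();
        }
        return consumed;
    }

    public static void main(String[] args) {
        MiniConsumer consumer = new MiniConsumer(3);
        System.out.println(readToLogEnd(consumer)); // 10: two committed batches
        System.out.println(consumer.position());    // 3: past the aborted txn
    }
}
```

Before this change, a real consumer in the same situation would keep blocking inside `poll()` until the poll timeout elapsed, since no user-visible records ever arrived for the final aborted transaction.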