KAFKA-5097: guard against unassigned partitions#2876
KAFKA-5097: guard against unassigned partitions#2876enothereska wants to merge 3 commits intoapache:0.10.2from enothereska:KAFKA-4755-fetcher-only
Conversation
|
@ijuma @hachikuji in trunk this has been fixed, but not in 0.10.2. Results in occasional stream test failure. A separate PR updates the stream tests and triggers the failure all the time #2867 (but can be ignored for now since problem manifests itself even without updated tests, although only occassionally). Note that Fetcher.java has diverged quite a bit in trunk from 0.10.2, so a full cherrypicking is hard. This just fixes the immediate problem. |
| TopicPartition partition = nextInLineRecords.partition; | ||
| fetchedPartition = partition; | ||
| fetchedOffsets = subscriptions.position(partition); | ||
| if (subscriptions.isAssigned(partition)) { |
There was a problem hiding this comment.
I think we need to skip all the code in this else block if the partition is no longer assigned.
There was a problem hiding this comment.
I'm not 100% sure. In the trunk code https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L523 we seem to drain the records (but the logic is slightly different, in that we drain them, but don't process them). Any suggestions welcome since I'm not super familiar with this code.
There was a problem hiding this comment.
I took a closer look and it seems like this is a regression from a267316#diff-b45245913eaae46aa847d2615d62cde0R490 so we may need another RC.
Maybe we need to move the position call to inside !records.isEmpty since drainRecords does the check. But not sure, better to wait for @hachikuji to take a look.
There was a problem hiding this comment.
Maybe we can go with this solution since it seems pretty safe. Shall we just remove the log.debug since we'll get that from the drainRecords call anyway? And let's run the system tests on this branch after that.
There was a problem hiding this comment.
@enothereska Looking at the previous patch, it seems like the scope of the try/catch is unnecessarily large. I think it only needs to cover the call to parseCompletedFetch. If we do that, then there should be no need to access subscriptions.position down this path.
There was a problem hiding this comment.
Mentioned offline, but a better solution (if Becket is right that the try/catch needs to cover this branch) would be the following:
fetchedOffsets = nextInLineRecords.fetchOffset;Also, seems this variable shouldn't be plural.
There was a problem hiding this comment.
@enothereska The trunk code does not need to access subscription.position, instead it uses PartitionRecords.nextInlineOffsets which should be the same as position because the position is updated to this value every time after a successful fetchRecords().
The big try/catch is to make sure the the exception from fetchRecords will also be caught and not result in loss of non-empty fetched.
There was a problem hiding this comment.
so @becketqin to confirm, you are saying @hachikuji's suggestion on fetchedOffsets = nextInLineRecords.fetchOffset; is the way to go?
I ask because it takes 7 hours to run system tests :)
There was a problem hiding this comment.
@enothereska Yes, I think that is a better solution. But I think @hachikuji was right that we don't need to cover both if/else branches. We just need to cover the parseCompletedFetch
There was a problem hiding this comment.
Ok, stay tuned, making that change with next commit.
|
Refer to this link for build results (access rights to CI server needed): |
|
Refer to this link for build results (access rights to CI server needed): |
|
Refer to this link for build results (access rights to CI server needed): |
|
Refer to this link for build results (access rights to CI server needed): |
|
Refer to this link for build results (access rights to CI server needed): |
|
Refer to this link for build results (access rights to CI server needed): |
|
Retest this please |
ijuma
left a comment
There was a problem hiding this comment.
LGTM, thanks. Will wait for @hachikuji's review.
|
Refer to this link for build results (access rights to CI server needed): |
|
Refer to this link for build results (access rights to CI server needed): |
|
Refer to this link for build results (access rights to CI server needed): |
|
Sorry for the regression. It was my bad. Will submit a PR next when backporting a patch does not merge cleanly. @hachikuji The reason to have the big try catch is that some exceptions such as InvalidRecordException can be thrown from The fix LGTM. |
|
@becketqin Are you sure about that? I must be missing it, can you point me to the line? |
|
@hachikuji You are right. This part of the code is different from trunk. We already parsed all the records in |
|
@hachikuji @becketqin @ijuma done. Restarting system tests. Btw, with previous change, all tests passed. |
|
Refer to this link for build results (access rights to CI server needed): |
hachikuji
left a comment
There was a problem hiding this comment.
LGTM. Good catch and thanks for the fix!
|
Full test running here: https://jenkins.confluent.io/job/system-test-kafka-branch-builder-2/267/ |
…entially unassigned partitions Author: Eno Thereska <eno.thereska@gmail.com> Reviewers: Ismael Juma <ismael@juma.me.uk>, Jiangjie Qin <becket.qin@gmail.com>, Jason Gustafson <jason@confluent.io> Closes #2876 from enothereska/KAFKA-4755-fetcher-only
|
@enothereska I merged this to 0.10.2. Would you mind closing the PR since it looks like the merge doesn't take care of this automatically. |
|
Refer to this link for build results (access rights to CI server needed): |
|
Refer to this link for build results (access rights to CI server needed): |
No description provided.