KAFKA-7285: Create new producer on each rebalance if EOS enabled#5501
KAFKA-7285: Create new producer on each rebalance if EOS enabled#5501mjsax merged 6 commits intoapache:trunkfrom
Conversation
|
Call for review @guozhangwang @bbejeck @vvcephei @hachikuji |
|
Triggered system tests: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/1908/ |
guozhangwang
left a comment
There was a problem hiding this comment.
Overall looks fine to me.
I'm a bit concerned about our current inter-twined logic on the close(clean, isZombie) calls. But that is sort of a tech debt and out of this scope.
| public void resume() { | ||
| // nothing to do; new transaction will be started only after topology is initialized | ||
| log.debug("Resuming"); | ||
| if (eosEnabled) { |
There was a problem hiding this comment.
nit: above comment can be removed.
|
@guozhangwang Can you create a JIRA about the |
|
The system tests failed: http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2018-08-13--001.1534210196--mjsax--kafka-7285-producer-fence-fix--285a344/report.html looking at: Worker got: Picking another random test: Looks like the same failure (worker6): |
| if (!isZombie && transactionInFlight) { | ||
| producer.abortTransaction(); | ||
| } | ||
| transactionInFlight = false; |
There was a problem hiding this comment.
nit: do we want to consider setting producer to null here as well if eosEnabled? I realize this branch of the code should only get exercised when closing, but just in case we make changes I don't think it will hurt.
|
Updated this. Re-triggered system tests: |
| } | ||
| try { | ||
| if (!isZombie) { | ||
| recordCollector.close(); |
There was a problem hiding this comment.
Should we avoid calling recordCollector.close() unless eosEnabled?
It seems like this block used to be guarded by if(eosEnabled), but it's not anymore.
- fixed close logic - added more tests
|
Updated again. Added more test and discovered some more issues with close/suspend and fixed it. |
|
Previous system test passed. Re-triggered again: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/1912/ |
|
Jenkins failures are relevant? |
|
Seems like... will have a look. |
|
Updated. |
| if (eosEnabled && !isZombie) { | ||
| if (transactionInFlight) { | ||
| try { | ||
| producer.abortTransaction(); |
There was a problem hiding this comment.
If suspend fails, we will always call task.close(..) again inside AssignedTasks, in which we will abort transactions and close record collectors. So do we have to do it here?
There was a problem hiding this comment.
That's a good point. If we do the cleanup twice, we might call producer.abortTransaction() and producer.close() again, after a successful producer.close() and thus log unnecessary error message (the exceptions itself would get swallowed).
However, we need to make sure that we don't fail a second time with closeTopology() when task.close(...) calls task.suspend(false,...) a second time.
Will update the PR.
|
Previous system tests passed. Updated and retriggered: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/1913/ |
|
failure unrelated retest this please |
|
@mjsax no futher comments, updated code looks good to me. |
|
Retest this please |
|
LGTM. Please feel free to merge after tests pass. |
|
Again Retest this please |
Retest this please |
|
Java8: |
Reviewers: Guozhang Wang <guozhang@confluent.io>, John Roesler <john@confluent.io>, Bill Bejeck <bill@confluent.io>
|
Merged to |
Reviewers: Guozhang Wang <guozhang@confluent.io>, John Roesler <john@confluent.io>, Bill Bejeck <bill@confluent.io>
Back porting #5501 broke some tests. Reviewers: John Roesler <vvcephei@users.noreply.github.com>, Guozhang Wang <wangguoz@gmail.com>
…che#5501) Reviewers: Guozhang Wang <guozhang@confluent.io>, John Roesler <john@confluent.io>, Bill Bejeck <bill@confluent.io>
Committer Checklist (excluded from commit message)