Skip to content

KAFKA-9750; Fix race condition with log dir reassign completion#8412

Merged
hachikuji merged 4 commits intoapache:trunkfrom
hachikuji:KAFKA-9750
Apr 3, 2020
Merged

KAFKA-9750; Fix race condition with log dir reassign completion#8412
hachikuji merged 4 commits intoapache:trunkfrom
hachikuji:KAFKA-9750

Conversation

@hachikuji
Copy link
Copy Markdown
Contributor

@hachikuji hachikuji commented Apr 3, 2020

There is a race on receiving a LeaderAndIsr request for a replica with an active log dir reassignment. If the reassignment completes just before the LeaderAndIsr handler updates epoch information, it can lead to an illegal state error since no future log dir exists. This patch fixes the problem by ensuring that the future log dir exists when the fetcher is started. Removal cannot happen concurrently because it requires access the same partition state lock.

Co-authored-by: Chia-Ping Tsai chia7712@gmail.com

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

request.minBytes,
request.maxBytes,
request.version <= 2,
false,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We always build with the latest fetch version, so this check was unneeded.

try {
// It is possible that the log dir fetcher completed just before this call, so we
// filter only the partitions which still have a future log dir.
val filteredFetchStates = initialFetchStates.filterKeys(replicaMgr.futureLogExists)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there is a extra check of future folder here and the check is in the lock so the race condition is resolved.

@chia7712
Copy link
Copy Markdown
Member

chia7712 commented Apr 3, 2020

@hachikuji Your approach is better than me :)

Could you include the test of my PR and please let me test your PR on my local later.

// filter only the partitions which still have a future log dir.
val filteredFetchStates = initialFetchStates.filterKeys(replicaMgr.futureLogExists)
super.addPartitions(filteredFetchStates)
filteredFetchStates.keySet
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems this line is useless.

// It is possible that the log dir fetcher completed just before this call, so we
// filter only the partitions which still have a future log dir.
val filteredFetchStates = initialFetchStates.filterKeys(replicaMgr.futureLogExists)
super.addPartitions(filteredFetchStates)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we skip to create thread if the collection of filtered states is empty?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will be a no-op in addPartitions, so it's probably harmless

@hachikuji
Copy link
Copy Markdown
Contributor Author

@chia7712 Thanks for reviewing. I added your test case improvements to the patch. Will consider additional improvements in the morning.

Copy link
Copy Markdown
Member

@mumrah mumrah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch @hachikuji and @chia7712! This looks good to me 👍

Comment thread core/src/main/scala/kafka/server/AbstractFetcherManager.scala
// It is possible that the log dir fetcher completed just before this call, so we
// filter only the partitions which still have a future log dir.
val filteredFetchStates = initialFetchStates.filterKeys(replicaMgr.futureLogExists)
super.addPartitions(filteredFetchStates)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will be a no-op in addPartitions, so it's probably harmless

Comment thread core/src/main/scala/kafka/server/ReplicaAlterLogDirsManager.scala Outdated
* leader epoch. This is the offset the follower should truncate to ensure
* accurate log replication.
* - Finally truncate the logs for partitions in the truncating phase and mark them
* - Finally truncate the logs for partitions in the truncating phase and mark the
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was tempted to fix this with the plural truncations.

@hachikuji
Copy link
Copy Markdown
Contributor Author

The failure looks like a known flaky test: https://issues.apache.org/jira/browse/KAFKA-8460. I will merge to trunk, 2.5, and 2.4.

@hachikuji hachikuji merged commit 62dcfa1 into apache:trunk Apr 3, 2020
hachikuji added a commit that referenced this pull request Apr 3, 2020
There is a race on receiving a LeaderAndIsr request for a replica with an active log dir reassignment. If the reassignment completes just before the LeaderAndIsr handler updates epoch information, it can lead to an illegal state error since no future log dir exists. This patch fixes the problem by ensuring that the future log dir exists when the fetcher is started. Removal cannot happen concurrently because it requires access the same partition state lock.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, David Arthur <mumrah@gmail.com>

Co-authored-by: Chia-Ping Tsai <chia7712@gmail.com>
hachikuji added a commit that referenced this pull request Apr 3, 2020
There is a race on receiving a LeaderAndIsr request for a replica with an active log dir reassignment. If the reassignment completes just before the LeaderAndIsr handler updates epoch information, it can lead to an illegal state error since no future log dir exists. This patch fixes the problem by ensuring that the future log dir exists when the fetcher is started. Removal cannot happen concurrently because it requires access the same partition state lock.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, David Arthur <mumrah@gmail.com>

Co-authored-by: Chia-Ping Tsai <chia7712@gmail.com>
hachikuji added a commit that referenced this pull request Apr 10, 2020
There is a race on receiving a LeaderAndIsr request for a replica with an active log dir reassignment. If the reassignment completes just before the LeaderAndIsr handler updates epoch information, it can lead to an illegal state error since no future log dir exists. This patch fixes the problem by ensuring that the future log dir exists when the fetcher is started. Removal cannot happen concurrently because it requires access the same partition state lock.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, David Arthur <mumrah@gmail.com>

Co-authored-by: Chia-Ping Tsai <chia7712@gmail.com>
hachikuji added a commit that referenced this pull request Apr 10, 2020
There is a race on receiving a LeaderAndIsr request for a replica with an active log dir reassignment. If the reassignment completes just before the LeaderAndIsr handler updates epoch information, it can lead to an illegal state error since no future log dir exists. This patch fixes the problem by ensuring that the future log dir exists when the fetcher is started. Removal cannot happen concurrently because it requires access the same partition state lock.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, David Arthur <mumrah@gmail.com>

Co-authored-by: Chia-Ping Tsai <chia7712@gmail.com>
hachikuji added a commit that referenced this pull request Apr 10, 2020
There is a race on receiving a LeaderAndIsr request for a replica with an active log dir reassignment. If the reassignment completes just before the LeaderAndIsr handler updates epoch information, it can lead to an illegal state error since no future log dir exists. This patch fixes the problem by ensuring that the future log dir exists when the fetcher is started. Removal cannot happen concurrently because it requires access the same partition state lock.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, David Arthur <mumrah@gmail.com>

Co-authored-by: Chia-Ping Tsai <chia7712@gmail.com>
qq619618919 pushed a commit to qq619618919/kafka that referenced this pull request May 12, 2020
…he#8412)

There is a race on receiving a LeaderAndIsr request for a replica with an active log dir reassignment. If the reassignment completes just before the LeaderAndIsr handler updates epoch information, it can lead to an illegal state error since no future log dir exists. This patch fixes the problem by ensuring that the future log dir exists when the fetcher is started. Removal cannot happen concurrently because it requires access the same partition state lock.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, David Arthur <mumrah@gmail.com>

Co-authored-by: Chia-Ping Tsai <chia7712@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants