KAFKA-8012: Ensure partitionStates have not been removed before truncating. by colinhicks · Pull Request #6333 · apache/kafka

colinhicks · 2019-02-27T01:16:33Z

When reproducing the flakiness of DynamicBrokerReconfigurationTest#testThreadPoolResize (KAFKA-7988) on my machine, I saw that failures correspond to an NPE in one or more replica fetcher threads. This happens when the replica fetcher manager simultaneously calls removeFetcherForPartitions, removing the corresponding partitionStates, while a replica fetcher thread attempts to truncate the same partition(s) in truncateToHighWatermark.

This change simply checks that the partitionState is not null first. Note that a similar guard exists in truncateToEpochEndOffsets.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

stanislavkozlovski

Seems to be a regression from d152989

Nice find, LGTM!

stanislavkozlovski · 2019-02-27T15:58:40Z

-          info(s"Failed to truncate $tp", e)
-          partitionsWithError += tp
+      val partitionState = partitionStates.stateValue(tp)
+      if (partitionState != null) {


nit: Looking at the usage in the file, I think we would prefer:

Option(partitionStates.stateValue(topicPartition)).foreach { currentFetchState = ...

instead of the null check

hachikuji

Thanks, the fix looks good. Could we have a test case?

hachikuji · 2019-02-27T22:54:41Z

  }

+  @Test
+  def testTruncateToHighWatermarkDuringRemovePartitions(): Unit = {


Thanks for the test case. Maybe we can add a similar test case for the truncateToEpochEndOffsets path? I think testLeaderEpochChangeDuringFetchEpochsFromLeader has most of what we want, but we don't need to put the partition back after removing it.

colinhicks · 2019-02-28T16:21:41Z

@hachikuji @stanislavkozlovski thanks again for the reviews.

I added a test to exercise the truncateToEpochEndOffsets path, where there is an existing null check relevant to the NPE. Open to suggestions on how to minimize this test case further if advisable.

hachikuji

LGTM. Thanks for the patch!

hachikuji · 2019-02-28T19:48:58Z

retest this please

…ating. (#6333) This patch fixes a regression in the replica fetcher which occurs when the replica fetcher manager simultaneously calls `removeFetcherForPartitions`, removing the corresponding partitionStates, while a replica fetcher thread attempts to truncate the same partition(s) in `truncateToHighWatermark`. This causes an NPE which causes the fetcher to crash. This change simply checks that the `partitionState` is not null first. Note that a similar guard exists in `truncateToEpochEndOffsets`. Reviewers: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Jason Gustafson <jason@confluent.io>

* apache/trunk: KAFKA-7880:Naming worker thread by task id (apache#6275) improve some logging statements (apache#6078) KAFKA-7312: Change broker port used in testMinimumRequestTimeouts and testForceClose KAFKA-7997: Use automatic RPC generation in SaslAuthenticate KAFKA-8002; Log dir reassignment stalls if future replica has different segment base offset (apache#6346) KAFKA-3522: Add TimestampedKeyValueStore builder/runtime classes (apache#6152) HOTFIX: add igore import to streams_upgrade_test MINOR: ConsumerNetworkClient does not need to send the remaining requests when the node is not ready (apache#6264) KAFKA-7922: Return authorized operations in describe consumer group responses (KIP-430 Part-1) KAFKA-7918: Inline generic parameters Pt. III: in-memory window store (apache#6328) KAFKA-8012; Ensure partitionStates have not been removed before truncating. (apache#6333) KAFKA-8011: Fix for race condition causing concurrent modification exception (apache#6338) KAFKA-7912: Support concurrent access in InMemoryKeyValueStore (apache#6336) MINOR: Skip quota check when replica is in sync (apache#6344) HOTFIX: Change header back to http instead of https to path license header test (apache#6347) MINOR: fix release.py script (apache#6317) MINOR: Remove types from caching stores (apache#6331) MINOR: Improve logging for alter log dirs (apache#6302) MINOR: state.cleanup.delay.ms default is 600,000 ms (10 minutes). (apache#6345) MINOR: disable Streams system test for broker upgrade/downgrade (apache#6341)

…ating. (apache#6333) This patch fixes a regression in the replica fetcher which occurs when the replica fetcher manager simultaneously calls `removeFetcherForPartitions`, removing the corresponding partitionStates, while a replica fetcher thread attempts to truncate the same partition(s) in `truncateToHighWatermark`. This causes an NPE which causes the fetcher to crash. This change simply checks that the `partitionState` is not null first. Note that a similar guard exists in `truncateToEpochEndOffsets`. Reviewers: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Jason Gustafson <jason@confluent.io>

MINOR: Ensure partitionStates have not been removed before truncating.

fa03443

stanislavkozlovski approved these changes Feb 27, 2019

View reviewed changes

hachikuji self-assigned this Feb 27, 2019

hachikuji reviewed Feb 27, 2019

View reviewed changes

colinhicks changed the title ~~MINOR: Ensure partitionStates have not been removed before truncating.~~ KAFKA-8012: Ensure partitionStates have not been removed before truncating. Feb 27, 2019

Adds test to cover failure case.

88185d3

hachikuji reviewed Feb 27, 2019

View reviewed changes

Tests epochEndOffset truncation during partition removal.

6b395b6

hachikuji approved these changes Feb 28, 2019

View reviewed changes

hachikuji merged commit 70828ce into apache:trunk Mar 1, 2019

colinhicks deleted the KAFKA-7988-flaky-dyn-reconfig-thread-pool branch March 1, 2019 10:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KAFKA-8012: Ensure partitionStates have not been removed before truncating.#6333

KAFKA-8012: Ensure partitionStates have not been removed before truncating.#6333
hachikuji merged 3 commits intoapache:trunkfrom
colinhicks:KAFKA-7988-flaky-dyn-reconfig-thread-pool

colinhicks commented Feb 27, 2019

Uh oh!

stanislavkozlovski left a comment

Uh oh!

stanislavkozlovski Feb 27, 2019

Uh oh!

hachikuji left a comment •

edited

Loading

Uh oh!

hachikuji Feb 27, 2019 •

edited

Loading

Uh oh!

colinhicks commented Feb 28, 2019 •

edited

Loading

Uh oh!

hachikuji left a comment

Uh oh!

hachikuji commented Feb 28, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

colinhicks commented Feb 27, 2019

Committer Checklist (excluded from commit message)

Uh oh!

stanislavkozlovski left a comment

Choose a reason for hiding this comment

Uh oh!

stanislavkozlovski Feb 27, 2019

Choose a reason for hiding this comment

Uh oh!

hachikuji left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hachikuji Feb 27, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

colinhicks commented Feb 28, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hachikuji left a comment

Choose a reason for hiding this comment

Uh oh!

hachikuji commented Feb 28, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hachikuji left a comment •

edited

Loading

hachikuji Feb 27, 2019 •

edited

Loading

colinhicks commented Feb 28, 2019 •

edited

Loading