Skip to content

KAFKA-8012: Ensure partitionStates have not been removed before truncating.#6333

Merged
hachikuji merged 3 commits intoapache:trunkfrom
colinhicks:KAFKA-7988-flaky-dyn-reconfig-thread-pool
Mar 1, 2019
Merged

KAFKA-8012: Ensure partitionStates have not been removed before truncating.#6333
hachikuji merged 3 commits intoapache:trunkfrom
colinhicks:KAFKA-7988-flaky-dyn-reconfig-thread-pool

Conversation

@colinhicks
Copy link
Copy Markdown
Contributor

When reproducing the flakiness of DynamicBrokerReconfigurationTest#testThreadPoolResize (KAFKA-7988) on my machine, I saw that failures correspond to an NPE in one or more replica fetcher threads. This happens when the replica fetcher manager simultaneously calls removeFetcherForPartitions, removing the corresponding partitionStates, while a replica fetcher thread attempts to truncate the same partition(s) in truncateToHighWatermark.

This change simply checks that the partitionState is not null first. Note that a similar guard exists in truncateToEpochEndOffsets.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Copy link
Copy Markdown
Contributor

@stanislavkozlovski stanislavkozlovski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be a regression from d152989

Nice find, LGTM!

info(s"Failed to truncate $tp", e)
partitionsWithError += tp
val partitionState = partitionStates.stateValue(tp)
if (partitionState != null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Looking at the usage in the file, I think we would prefer:

Option(partitionStates.stateValue(topicPartition)).foreach { currentFetchState =
...

instead of the null check

@hachikuji hachikuji self-assigned this Feb 27, 2019
Copy link
Copy Markdown
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, the fix looks good. Could we have a test case?

@colinhicks colinhicks changed the title MINOR: Ensure partitionStates have not been removed before truncating. KAFKA-8012: Ensure partitionStates have not been removed before truncating. Feb 27, 2019
}

@Test
def testTruncateToHighWatermarkDuringRemovePartitions(): Unit = {
Copy link
Copy Markdown
Contributor

@hachikuji hachikuji Feb 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the test case. Maybe we can add a similar test case for the truncateToEpochEndOffsets path? I think testLeaderEpochChangeDuringFetchEpochsFromLeader has most of what we want, but we don't need to put the partition back after removing it.

@colinhicks
Copy link
Copy Markdown
Contributor Author

colinhicks commented Feb 28, 2019

@hachikuji @stanislavkozlovski thanks again for the reviews.

I added a test to exercise the truncateToEpochEndOffsets path, where there is an existing null check relevant to the NPE. Open to suggestions on how to minimize this test case further if advisable.

Copy link
Copy Markdown
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the patch!

@hachikuji
Copy link
Copy Markdown
Contributor

retest this please

@hachikuji hachikuji merged commit 70828ce into apache:trunk Mar 1, 2019
hachikuji pushed a commit that referenced this pull request Mar 1, 2019
…ating. (#6333)

This patch fixes a regression in the replica fetcher which occurs when the replica fetcher manager simultaneously calls `removeFetcherForPartitions`, removing the corresponding partitionStates, while a replica fetcher thread attempts to truncate the same partition(s) in `truncateToHighWatermark`. This causes an NPE which causes the fetcher to crash.

This change simply checks that the `partitionState` is not null first. Note that a similar guard exists in `truncateToEpochEndOffsets`.

Reviewers: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Jason Gustafson <jason@confluent.io>
hachikuji pushed a commit that referenced this pull request Mar 1, 2019
…ating. (#6333)

This patch fixes a regression in the replica fetcher which occurs when the replica fetcher manager simultaneously calls `removeFetcherForPartitions`, removing the corresponding partitionStates, while a replica fetcher thread attempts to truncate the same partition(s) in `truncateToHighWatermark`. This causes an NPE which causes the fetcher to crash.

This change simply checks that the `partitionState` is not null first. Note that a similar guard exists in `truncateToEpochEndOffsets`.

Reviewers: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Jason Gustafson <jason@confluent.io>
@colinhicks colinhicks deleted the KAFKA-7988-flaky-dyn-reconfig-thread-pool branch March 1, 2019 10:11
jarekr pushed a commit to confluentinc/kafka that referenced this pull request Apr 18, 2019
* apache/trunk:
  KAFKA-7880:Naming worker thread by task id (apache#6275)
  improve some logging statements (apache#6078)
  KAFKA-7312: Change broker port used in testMinimumRequestTimeouts and testForceClose
  KAFKA-7997: Use automatic RPC generation in SaslAuthenticate
  KAFKA-8002; Log dir reassignment stalls if future replica has different segment base offset (apache#6346)
  KAFKA-3522: Add TimestampedKeyValueStore builder/runtime classes (apache#6152)
  HOTFIX: add igore import to streams_upgrade_test
  MINOR: ConsumerNetworkClient does not need to send the remaining requests when the node is not ready (apache#6264)
  KAFKA-7922: Return authorized operations in describe consumer group responses (KIP-430 Part-1)
  KAFKA-7918: Inline generic parameters Pt. III: in-memory window store (apache#6328)
  KAFKA-8012; Ensure partitionStates have not been removed before truncating. (apache#6333)
  KAFKA-8011: Fix for race condition causing concurrent modification exception (apache#6338)
  KAFKA-7912: Support concurrent access in InMemoryKeyValueStore (apache#6336)
  MINOR: Skip quota check when replica is in sync (apache#6344)
  HOTFIX: Change header back to http instead of https to path license header test (apache#6347)
  MINOR: fix release.py script (apache#6317)
  MINOR: Remove types from caching stores (apache#6331)
  MINOR: Improve logging for alter log dirs (apache#6302)
  MINOR: state.cleanup.delay.ms default is 600,000 ms (10 minutes). (apache#6345)
  MINOR: disable Streams system test for broker upgrade/downgrade (apache#6341)
pengxiaolong pushed a commit to pengxiaolong/kafka that referenced this pull request Jun 14, 2019
…ating. (apache#6333)

This patch fixes a regression in the replica fetcher which occurs when the replica fetcher manager simultaneously calls `removeFetcherForPartitions`, removing the corresponding partitionStates, while a replica fetcher thread attempts to truncate the same partition(s) in `truncateToHighWatermark`. This causes an NPE which causes the fetcher to crash.

This change simply checks that the `partitionState` is not null first. Note that a similar guard exists in `truncateToEpochEndOffsets`.

Reviewers: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Jason Gustafson <jason@confluent.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants