Skip to content

KAFKA-12648: fix NPE due to race condtion between resetting offsets and removing a topology#11847

Merged
ableegoldman merged 3 commits intoapache:trunkfrom
ableegoldman:HOTFIX-prevent-NPE-in-removeNamedTopology-with-resetOffsets
Mar 7, 2022
Merged

KAFKA-12648: fix NPE due to race condtion between resetting offsets and removing a topology#11847
ableegoldman merged 3 commits intoapache:trunkfrom
ableegoldman:HOTFIX-prevent-NPE-in-removeNamedTopology-with-resetOffsets

Conversation

@ableegoldman
Copy link
Copy Markdown
Member

@ableegoldman ableegoldman commented Mar 4, 2022

While debugging the flaky NamedTopologyIntegrationTest. shouldRemoveOneNamedTopologyWhileAnotherContinuesProcessing test, I did discover one real bug. The problem was that we update the TopologyMetadata's builders map (with the known topologies) inside the #removeNamedTopology call directly, whereas the StreamThread may not yet have reached the poll() in the loop and in case of an offset reset, we get an NP.e
I changed the NPE to just log a warning for now, going forward I think we should try to tackle some tech debt by keeping the processing tasks and the TopologyMetadata in sync

Copy link
Copy Markdown
Contributor

@wcarlson5 wcarlson5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just once thing that can be a follow up

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this null check still necessary as we don't return null anymore? (and maybe we can update the offsetResetStrategy) as well

This can be a follow up as it is not going to change the behavior and it would be good to get this fix for a flaky test it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we no longer return null within an individual topology's InternalTopologyBuilder, however the topologyMetadata.offsetResetStrategy may still return null due to the race condition (see comment above TODO) -- this is unavoidable until we can address the tech debt mentioned in the TODO (however we end up doing that)

@wcarlson5
Copy link
Copy Markdown
Contributor

That is a lot of test failures and some of those seem relevant

org.apache.kafka.streams.integration.NamedTopologyIntegrationTest.shouldRemoveOneNamedTopologyWhileAnotherContinuesProcessing

@ableegoldman ableegoldman force-pushed the HOTFIX-prevent-NPE-in-removeNamedTopology-with-resetOffsets branch from be4a28b to beed6b2 Compare March 7, 2022 07:22
Copy link
Copy Markdown
Contributor

@guozhangwang guozhangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read through the latest two commits, lgtm.

@ableegoldman ableegoldman merged commit 539f006 into apache:trunk Mar 7, 2022
@ableegoldman
Copy link
Copy Markdown
Member Author

No test failures in NamedTopologyIntegrationTest! Merged to trunk 🥳

The integration test should be completely stable now as all known issues and sources of flakiness have been resolved -- _any new test failure sightings should be reported and looked into as possible new bugs _

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants