KAFKA-12648: fix NPE due to race condition between resetting offsets and removing a topology #11847
wcarlson5
left a comment
Just one thing that can be a follow-up.
Is this null check still necessary now that we don't return null anymore? (Maybe we can update the offsetResetStrategy as well.)
This can be a follow-up, as it is not going to change the behavior and it would be good to get this fix for a flaky test in.
Well, we no longer return null within an individual topology's InternalTopologyBuilder. However, topologyMetadata.offsetResetStrategy may still return null due to the race condition (see the comment above the TODO). This is unavoidable until we can address the tech debt mentioned in the TODO (however we end up doing that).
That is a lot of test failures, and some of those seem relevant.
be4a28b to
beed6b2
Compare
guozhangwang
left a comment
Read through the latest two commits, lgtm.
No test failures in NamedTopologyIntegrationTest! Merged to trunk 🥳 The integration test should be completely stable now as all known issues and sources of flakiness have been resolved -- _any new test failure sightings should be reported and looked into as possible new bugs_
While debugging the flaky NamedTopologyIntegrationTest.shouldRemoveOneNamedTopologyWhileAnotherContinuesProcessing test, I did discover one real bug. The problem was that we update the TopologyMetadata's builders map (with the known topologies) inside the #removeNamedTopology call directly, whereas the StreamThread may not yet have reached the poll() in the loop, and in case of an offset reset we get an NPE. I changed the NPE to just log a warning for now; going forward I think we should try to tackle some tech debt by keeping the processing tasks and the TopologyMetadata in sync.
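
For readers unfamiliar with the code, here is a minimal, self-contained sketch of the kind of race described above and of the "warn instead of NPE" handling. It is not the actual Kafka Streams code: the `Metadata` class, the `ResetStrategy` enum, and the `resetOffsets` helper are hypothetical stand-ins for TopologyMetadata and the StreamThread's reset path, and the interleaving is replayed sequentially rather than with real threads.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class TopologyRaceSketch {

    // Hypothetical stand-in for a per-topic reset policy; not Kafka's real type.
    enum ResetStrategy { EARLIEST, LATEST }

    // Hypothetical stand-in for TopologyMetadata: topology name -> (topic -> strategy).
    static final class Metadata {
        private final Map<String, Map<String, ResetStrategy>> builders = new ConcurrentHashMap<>();

        void addTopology(String name, Map<String, ResetStrategy> topics) {
            builders.put(name, topics);
        }

        // Mutates the shared map immediately, regardless of what other threads are doing.
        void removeTopology(String name) {
            builders.remove(name);
        }

        // Returns null when no currently known topology owns the topic,
        // which is what happens if the owning topology was just removed.
        ResetStrategy offsetResetStrategy(String topic) {
            for (Map<String, ResetStrategy> topics : builders.values()) {
                ResetStrategy strategy = topics.get(topic);
                if (strategy != null) {
                    return strategy;
                }
            }
            return null;
        }
    }

    // Stand-in for the processing thread's reset path. Returns true if a reset
    // was applied, false if it was skipped because the strategy lookup came back null.
    static boolean resetOffsets(Metadata metadata, String topic) {
        ResetStrategy strategy = metadata.offsetResetStrategy(topic);
        if (strategy == null) {
            // Without this guard, using `strategy` below would throw an NPE.
            System.out.println("WARN: no reset strategy for topic " + topic
                + "; its topology may have been removed, skipping reset");
            return false;
        }
        System.out.println("Resetting " + topic + " using " + strategy.name());
        return true;
    }

    public static void main(String[] args) {
        Metadata metadata = new Metadata();
        metadata.addTopology("topology-1", Map.of("input-1", ResetStrategy.EARLIEST));
        metadata.addTopology("topology-2", Map.of("input-2", ResetStrategy.LATEST));

        // The caller removes topology-1 before the processing thread has
        // noticed the change, i.e. it still owns partitions of input-1.
        metadata.removeTopology("topology-1");

        resetOffsets(metadata, "input-1"); // skipped with a warning
        resetOffsets(metadata, "input-2"); // reset proceeds normally
    }
}
```

The guard only hides the symptom: the processing side can still hold tasks for a topology that the metadata no longer knows about, which is the tech debt mentioned above.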