MINOR: Fix race conditions in KafkaStreams when topic is missing. #20362

Closed
chickenchickenlove wants to merge 1 commit into apache:trunk from chickenchickenlove:250816-no-issue

Conversation

@chickenchickenlove
Contributor

@chickenchickenlove chickenchickenlove commented Aug 16, 2025

Background

There is a race condition in ConsumerNetworkThread for Kafka Streams with the new protocol.
It makes HandlingSourceTopicDeletionIntegrationTest#shouldThrowErrorAfterSourceTopicDeleted flaky.
Please refer to the attached sequence diagram for details.
[Image: sequence diagram]

CC. @mjsax

Result

  • Fixes flaky test HandlingSourceTopicDeletionIntegrationTest#shouldThrowErrorAfterSourceTopicDeleted.

Related

@github-actions github-actions Bot added the triage, consumer, clients, and small labels on Aug 16, 2025
@chickenchickenlove
Contributor Author

@mjsax sorry for the mention.
Could you take a look at this PR?
It is related to Kafka Streams!

@github-actions

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

@chickenchickenlove
Contributor Author

I’d greatly appreciate it if someone could kindly review this PR when they have a chance.

@github-actions

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

@chickenchickenlove
Contributor Author

Gentle ping: could you take a look at this PR?

@github-actions

github-actions Bot commented Sep 8, 2025

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

@mjsax
Member

mjsax commented Sep 28, 2025

We are somewhat swamped recently. Sorry for the delay... Don't know when I will get to it. But please keep nagging us about it :)

@chickenchickenlove
Contributor Author

@mjsax Thanks for your comment even though you are busy!
If you know anyone who can review this PR, could you mention them?
Thanks as always for your help and effort! 🙇‍♂️

@github-actions

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

@lianetm
Member

lianetm commented Oct 21, 2025

Hey @chickenchickenlove, thanks for looking into this, interesting. Help me understand this issue better (the proposed change is harmless as I see it and could fix the flakiness, but I want to make sure it is not hiding something else).

High level, from the consumer POV, the 2 sides involved happen in the background (so safe there):

  • updating assignment when a reconciliation completes (topic delete)
  • process unsubscribe

So how exactly do we end up in step 8??? Is it that the partitions passed to the unsubscribe here (taken from the currentAssignment) are not in the subscriptionState anymore?

return revokeActiveTasks(toTaskIdSet(currentAssignment.activeTasks));

This bit is actually different between the streams and consumer managers (in the consumer, the partitions to revoke are taken directly from the subscriptionState, not from the currentAssignment as the streams manager does).
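
The hypothesis in the question above can be pictured with a small toy model. This is only a sketch: `StaleAssignmentSketch`, `notInSubscription`, and the partition names are made up for illustration and are not Kafka classes or methods. It assumes a reconciliation clears the live subscription state before the unsubscribe is processed, and compares a revocation set derived from a cached assignment snapshot with one derived from the live state.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model only: these names are illustrative and are not Kafka classes or methods.
final class StaleAssignmentSketch {

    // Returns the partitions in toRevoke that the live subscription state no longer contains.
    static Set<String> notInSubscription(Set<String> toRevoke, Set<String> subscriptionState) {
        Set<String> missing = new LinkedHashSet<>(toRevoke);
        missing.removeAll(subscriptionState);
        return missing;
    }

    public static void main(String[] args) {
        // Snapshot cached by a hypothetical membership manager.
        Set<String> currentAssignment = new LinkedHashSet<>();
        currentAssignment.add("input-0");
        currentAssignment.add("input-1");

        // Live state, initially in sync with the snapshot.
        Set<String> subscriptionState = new LinkedHashSet<>(currentAssignment);

        // Assumption: a reconciliation triggered by the topic deletion clears the live state first.
        subscriptionState.clear();

        // Deriving the revocation set from the stale snapshot yields partitions the live state lacks...
        System.out.println("from snapshot:   " + notInSubscription(currentAssignment, subscriptionState));
        // ...while deriving it from the live state itself cannot.
        System.out.println("from live state: " + notInSubscription(subscriptionState, subscriptionState));
    }
}
```

If the real code paths behave like this model, the mismatch would only show up when the reconciliation lands between taking the snapshot and processing the unsubscribe, which would be consistent with a flaky rather than a permanent failure.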

@chickenchickenlove
Contributor Author

chickenchickenlove commented Oct 22, 2025

@lianetm, thanks for your feedback!
I’ll dig into it and let you know if I find something suspicious. 🙇‍♂️

I have a question.
Do the changes and the scenario I described seem reasonable to you?

@lianetm
Member

lianetm commented Oct 22, 2025

Do the changes and the scenario I described seem reasonable to you?

yes, I have no concern with the fix really, I just don't fully understand the issue (want us to make sure we're not covering an issue with the assignment passed to the task revocation)

Thanks for looking into it!

@lianetm lianetm removed the triage label on Oct 22, 2025
@chickenchickenlove
Contributor Author

I see!
Let me check!

@chickenchickenlove
Contributor Author

@lianetm

It looks like a commit masked the issue I was trying to fix.

  1. I tested and debugged this problem on top of commit 986322d.
  2. This PR is currently based on commit 8861629.

In this branch, I reverted this PR (38e3359) to check whether HandlingSourceTopicDeletionIntegrationTest can succeed with the new consumer protocol.
But it never succeeds.

However, when I test on top of 986322d, HandlingSourceTopicDeletionIntegrationTest always succeeds.

In other words, one of the PRs between those two commits made the HandlingSourceTopicDeletionIntegrationTest#shouldThrowErrorAfterSourceTopicDeleted test permanently fail.

That is, it succeeds under the legacy protocol but fails under the new protocol.
To put it differently, in the legacy protocol, when a topic is deleted, all Stream Threads are expected to transition to the Error state, whereas in the new protocol, when a topic is deleted, all Stream Threads remain in the Running state.

Is this the expected behavior for the new protocol in Kafka Streams?

@chickenchickenlove
Contributor Author

chickenchickenlove commented Oct 23, 2025

@lianetm
I think this commit is what affects it.
handleMissingSourceTopicsWithTimeout(...)
03190e4

[image]
In my local environment, the timer is set to `maxPollTimeMs` (300000 ms). If I set the timer to `1000` ms, the same problem occurs and this patch fixes it.

So I believe this change is masking the issue.
If you’d like me to temporarily shorten the timer interval to investigate further, please let me know.
Currently, the value in the trunk branch is set to twice the heartbeat interval, which is already quite short, so I think it’s meaningful to dig into it.
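
To illustrate why a long timer can hide the problem, here is a minimal sketch of a deadline check. It is not the actual `handleMissingSourceTopicsWithTimeout(...)` logic; `MissingTopicTimeoutSketch`, `shouldFail`, and the 5000 ms elapsed time are assumptions made up for the example. Only the 300000 ms and 1000 ms values come from the observation above.

```java
// Toy model only: not the real Kafka Streams implementation.
final class MissingTopicTimeoutSketch {

    // Returns true once the source topic has been missing for at least timeoutMs.
    static boolean shouldFail(long missingSinceMs, long nowMs, long timeoutMs) {
        return nowMs - missingSinceMs >= timeoutMs;
    }

    public static void main(String[] args) {
        long missingSinceMs = 0L;
        long nowMs = 5_000L; // assumed: the test observes the state 5 s after the topic is deleted

        // With a 1000 ms timer the error path has already been taken, so the race window is reachable.
        System.out.println("1000 ms timer:   " + shouldFail(missingSinceMs, nowMs, 1_000L));   // true
        // With a 300000 ms timer the error path has not been taken yet, so the race cannot be observed.
        System.out.println("300000 ms timer: " + shouldFail(missingSinceMs, nowMs, 300_000L)); // false
    }
}
```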

So, please let me know your opinion!
Thanks a lot.

FYI,
when I rebased onto the current trunk branch and tested again, it succeeded!!!

@chickenchickenlove
Contributor Author

Something happened 😅

In summary,

  1. The issue could not be reproduced on the current PR’s base branch due to the previous commit 03190e4.
  2. However, after rebasing the base branch onto the latest trunk branch, the issue reappeared. When I reverted commit 38e3359 and tested again with the new protocol, all tests passed successfully.

So, let me dig into it more.

@chickenchickenlove
Contributor Author

@lianetm
I dug into it more.

I think this problem has been solved by this PR (#19400),
because StreamThread doesn't send UNSUBSCRIBE events anymore.

[image]

So Step 5 disappears, and ConsumerNetworkThread will not handle an UNSUBSCRIBE event.
Therefore, there is no race condition anymore.

What do you think?

@github-actions

This PR is being marked as stale since it has not had any activity in 90 days. If you
would like to keep this PR alive, please leave a comment asking for a review. If the PR has
merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

@github-actions github-actions Bot added the stale label on Jan 22, 2026
@github-actions

This PR has been closed since it has not had any activity in 120 days. If you feel like this
was a mistake, or you would like to continue working on it, please feel free to re-open the
PR and ask for a review.

@github-actions github-actions Bot added the closed-stale label on Feb 21, 2026
@github-actions github-actions Bot closed this Feb 21, 2026

Labels

clients, closed-stale, consumer, small, stale
