PROPOSAL: support async event execution in group coordinator #14705
---
Hmm, while this guarantees ordering, it disables pipelining and thus potentially reduces the throughput, since we have to wait for each event's records to be fully replicated before processing the next event.
We probably could introduce a different callback in `ReplicaManager.appendRecords` that's invoked when the records are appended to the local log.
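A minimal sketch of what such a hook could look like; the `LocalAppendListener` type and `onLocalAppend` method are hypothetical, not part of the existing `ReplicaManager` API:

```java
import org.apache.kafka.common.TopicPartition;

// Hypothetical sketch: a listener that fires once records hit the local log,
// before replication completes, so the coordinator could start processing the
// next event while still completing the client response only after the
// high-watermark passes baseOffset.
public interface LocalAppendListener {
    // Called when the records are appended to the local log; the
    // high-watermark may not have advanced past baseOffset yet.
    void onLocalAppend(TopicPartition partition, long baseOffset);
}
```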
---
We actually don't need to wait for replication, so the current pipelining works without changes -- the current logic uses acks=1, captures the offset, and then waits for the HWM to advance before completing the write request. It may prevent potential pipelining opportunities if new async stages are added for acks=1 (e.g. transaction verification). But the most important thing is that with this proposal, innovating under the `appendRecords` interface would just work out of the box, which is the purpose of having interfaces -- innovating under the interface doesn't break callers that use the interface correctly (which is what makes the system modular).
If we find out that we want pipelining for transaction verification, we can make this optimization later (if we find it to be a problem). We will have a choice between complexity and potentially better pipelining; with the current model, we don't have that choice -- the workflow will break if we add an async stage to acks=1 processing, and we will have to fix it before shipping.
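A rough sketch of the acks=1 flow described above; `PartitionWriter` and `Records` are hypothetical stand-ins for the real `ReplicaManager` plumbing, not actual Kafka interfaces:

```java
import java.util.concurrent.CompletableFuture;

// Sketch of "append with acks=1, capture the offset, then wait for the HWM
// to advance before completing the write request".
public final class CoordinatorAppend {
    // Hypothetical stand-ins for the real storage interfaces.
    interface Records {}
    interface PartitionWriter {
        CompletableFuture<Long> append(Records records);          // completes on local append
        CompletableFuture<Void> awaitHighWatermark(long offset);  // completes once HWM >= offset
    }

    public static CompletableFuture<Void> appendAndAwaitHwm(PartitionWriter writer,
                                                            Records records) {
        return writer.append(records)                          // acks=1: returns the base offset
                     .thenCompose(writer::awaitHighWatermark); // finish when HWM catches up
    }
}
```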
---
Thanks for the explanation, Artem. Yes, it's true that the new group coordinator only depends on acks=1.
Each new group coordinator thread handles requests from multiple groups and multiple clients within the same group. In the proposed approach, if one client's log append is blocked on an additional async check, it blocks the processing of other clients and other groups. So it still seems to reduce the overall throughput somewhat.
---
That is correct; it may become a perf problem. We can measure and see if it's worth fixing in practice -- we'll have that choice (as well as the choice to postpone the fix if we're under time pressure to release). But it won't be a functional problem. Right now it is a functional problem, which is suboptimal in many ways:
IMO, the fact that the transaction verification implementation just doesn't work out of the box with the new group coordinator (and in fact requires quite non-trivial follow-up work that will block the release) is an architectural issue. We should strive to make the system more decoupled, so that an engineer needs to understand less context in order to make local changes to one part of the system.
I don't think it's bound to a thread, but indeed the concurrency is limited to the partition -- we don't let operations on the same partition run concurrently, so all the groups that are mapped to the same partition are contending. This is, however, a specific implementation choice: it should be possible to make a group the unit of concurrency, and if that's not enough, we can let offset commits for different partitions go concurrently as well (they just need to make sure that the group doesn't change, which is sort of a "read lock" on the group), at which point there probably wouldn't be any contention in the common path.
Now, one might ask: implementing per-group synchronization adds complexity, and handling transaction verification as an explicit state transition in the group coordinator adds complexity, so what's the difference? I'd say the difference is fundamental -- per-group synchronization complexity is encapsulated in one component and keeps the system decoupled: an engineer tasked with improving the transaction protocol doesn't need to understand the implementation details of the group coordinator, and vice versa. Changes are smaller, can be made faster, and are less bug-prone. Win-win-win.
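A minimal sketch of the per-group synchronization idea; the names (`GroupLocks`, `withGroupChange`, `withOffsetCommit`) are hypothetical, not actual coordinator code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Group changes (rebalance, member join/leave) take the write lock; offset
// commits take the read lock, so commits can proceed concurrently as long as
// the group itself isn't changing.
public final class GroupLocks {
    private final Map<String, ReadWriteLock> locks = new ConcurrentHashMap<>();

    private ReadWriteLock lockFor(String groupId) {
        return locks.computeIfAbsent(groupId, id -> new ReentrantReadWriteLock());
    }

    public void withGroupChange(String groupId, Runnable change) {
        ReadWriteLock lock = lockFor(groupId);
        lock.writeLock().lock();   // exclusive: the group is being mutated
        try { change.run(); } finally { lock.writeLock().unlock(); }
    }

    public void withOffsetCommit(String groupId, Runnable commit) {
        ReadWriteLock lock = lockFor(groupId);
        lock.readLock().lock();    // shared: only requires the group not to change
        try { commit.run(); } finally { lock.readLock().unlock(); }
    }
}
```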
---
Thanks for looking into this. Here is my take:
I strongly disagree on blocking the event loop. It will not become a perf problem. It is one. It is also an anti-pattern.
It is technically not a functional problem, at least not yet, because I haven't implemented the transactional offset commit in the new coordinator. ;)
I will change this to not use `appendRecords`; this will make the contract clear.
This is incorrect. We knew about this and we always had an implementation in mind which works. I will basically decouple the write into two stages: 1) validate/prepare the transaction; and 2) update state and write. As we discussed in the other PR, this is also required for the old coordinator to work correctly.
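A rough sketch of the two-stage write described here; the `TxnVerifier` and `Coordinator` interfaces are hypothetical stand-ins, not the actual implementation:

```java
import java.util.concurrent.CompletableFuture;

// Stage 1 (validate/prepare the transaction) runs asynchronously, outside the
// coordinator's event loop; stage 2 (update state and write) runs as a normal
// coordinator event once verification completes.
public final class TwoStageWrite {
    interface TxnVerifier { CompletableFuture<Void> verify(Object request); }
    interface Coordinator { CompletableFuture<Void> updateStateAndWrite(Object request); }

    public static CompletableFuture<Void> commitTxnOffsets(TxnVerifier verifier,
                                                           Coordinator coordinator,
                                                           Object request) {
        return verifier.verify(request)                                             // stage 1
                       .thenCompose(v -> coordinator.updateStateAndWrite(request)); // stage 2
    }
}
```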
I don't agree with this. As we just saw, we already failed to make it work correctly for the existing coordinator, so the dependency was already there. Again, we can do better, I agree.
This is completely unrelated in my opinion, as this is true for both the old and the new coordinator.
Overall, I agree that we could do better, but I think that this is not the right time to change it. We are already under high time pressure, and actually changing this in the right way puts even more pressure on us. We should look for a proper solution afterwards.
---
It's true that it's a problem with the old coordinator, and we should make whatever minimal fixes are required for the old coordinator to work (and if it happens to work end-to-end, which I think it might, we won't need to fix it), but that code is going away and shouldn't define the forward-looking architecture.
As we build the new coordinator, we should build it in a way that improves the forward-looking architecture. Keeping the right abstraction is good, and coincidentally it helps with the timelines -- we can use this proposal and the work that has already been done instead of doing the new work of bringing implementation details into the group coordinator.
Moreover, I wonder if we need yet another thread pool to handle group coordinator logic; I think it would be good to just re-use the request handler threads to run this functionality. This would avoid thread-pool proliferation and also reuse various useful improvements that work only on request pool threads, e.g. `RequestLocal` (hopefully we'll make it into a real thread local used at the point of use instead of being passed as an argument), various observability things, etc. Here is a PoC that does that using `NonBlockingSynchronizer` and `KafkaRequestHandler.wrap`:
46acf02
The `NonBlockingSynchronizer` replaces `EventAccumulator` and `MultiThreadedEventProcessor` (I didn't remove them to keep the change small). It has some perf benefits: e.g. in uncontended cases, the processing continues running on the request thread instead of being rescheduled on the gc thread pool. I can also easily implement read-write synchronization for the `NonBlockingSynchronizer` (so that readers won't block each other out), e.g. to implement a non-blocking read "lock" on the group when committing offsets.
It's not to say I don't like the current code, but it feels like we are rebuilding functionality that we already have elsewhere in Kafka, and we could re-use the existing building blocks so that the gc focuses on group coordination rather than managing thread pools, getting into the details of the transactional protocol, etc.
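The `NonBlockingSynchronizer` itself lives in the linked PoC commit; as a minimal sketch of the uncontended fast path described above (run inline on the submitting request thread, queue only when busy), something like this would do:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// If the synchronizer is uncontended, the task runs inline on the calling
// (request) thread; otherwise it is queued and drained by whichever thread
// currently owns the synchronizer. Submitters never block on other tasks.
public final class NonBlockingSynchronizerSketch {
    private final Queue<Runnable> pending = new ArrayDeque<>();
    private boolean busy = false;

    public void submit(Runnable task) {
        synchronized (this) {
            if (busy) {          // contended: hand the task off and return
                pending.add(task);
                return;
            }
            busy = true;         // uncontended: claim ownership, run inline
        }
        Runnable next = task;
        while (next != null) {
            next.run();          // executes on the submitting thread
            synchronized (this) {
                next = pending.poll();
                if (next == null) busy = false;  // release ownership
            }
        }
    }
}
```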
---
To be honest, if we implement proper concurrency granularity for groups (serialize group updates [not the whole partition], keep a read "lock" on groups during commit updates), I'm not sure we'd get much extra perf gain from piercing the `appendRecords` abstraction to implement pipelining. Then we could get rid of the timeline snapshot structure and of hooking into the replication pipeline to listen for HWM updates; we could just call `appendRecords` and wait for completion. Then we could completely decouple the group coordinator logic from the storage stack and make it simpler.
---
Another potential issue is the ordering.
`withActiveContextOrThrow` holds a partition-level lock to make sure the record is replayed in the state machine in the same order as it's appended to the log. With `withActiveContextOrThrowAsync`, we hold the lock to replay the record, but append to the log without the lock. This could create a situation where the state machine may not be exactly recreated by replaying records from the log.
---
The locking model is not changed -- it holds the lock around the whole call; see line 1241:
`result = asyncFunc.apply(context).whenComplete((none, t) -> context.lock.unlock());`
The `.whenComplete` callback will execute after the function is complete, so the lock is held around the whole thing.
The unlock in the `finally` clause is so that if `asyncFunc.apply` throws an exception (which would happen if the function is in fact executed synchronously) and we didn't get the future, then we unlock inline.
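A self-contained sketch of this pattern; it uses a `Semaphore` as a stand-in for `context.lock` because, unlike `ReentrantLock`, a semaphore permit can be released from whatever thread completes the future:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

final class AsyncLockingSketch {
    static <C> CompletableFuture<Void> withLock(Semaphore lock, C context,
            Function<C, CompletableFuture<Void>> asyncFunc) throws InterruptedException {
        lock.acquire();
        CompletableFuture<Void> result = null;
        try {
            // Held for the whole async chain: whenComplete releases the lock
            // only once the function's future completes (or fails).
            result = asyncFunc.apply(context).whenComplete((none, t) -> lock.release());
        } finally {
            // asyncFunc.apply threw before returning a future (i.e. it ran
            // synchronously and failed): release inline.
            if (result == null) lock.release();
        }
        return result;
    }
}
```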
---
Thanks, Artem. Got it.