KIP-396: Add AlterConsumerGroup/List Offsets to AdminClient #7296
hachikuji merged 12 commits into apache:trunk
Conversation
@guozhangwang @vahidhashemian @bbejeck @harshach @cmccabe @hachikuji As you all voted on this KIP, can a few of you review the PR? Thanks
ryannedolan
left a comment
Looking forward to this!
guozhangwang
left a comment
I took a quick look at the PR and it looks good; I'll try to squeeze out some time for a thorough review.
@guozhangwang @vahidhashemian @bbejeck @harshach @cmccabe @hachikuji I'd love to get this in 2.4, can you take a look? It's a relatively straightforward KIP/PR.
I couldn't help reviewing; this would be really helpful for the Kafka integration in Spark. Looking forward to the feature!
hachikuji
left a comment
Thanks, left a few initial comments.
force-pushed from e0c6be4 to 412608e
Thanks @hachikuji and @guozhangwang for the feedback! I've pushed an update:
I'm flying back to the UK tomorrow evening, but I should be able to make more changes tomorrow if needed. If we're not happy with the metadata retry logic, a temporary solution would be to remove it and keep using the Consumer in the consumer group tool for now; then I can revisit it next week. What do you think?
hachikuji
left a comment
Thanks for the updates, left some more comments.
@hachikuji Thanks again for the review. I've pushed an update. I'll start adding coverage in
Not something we have to do here, but one way we could improve this in the future is by taking into account leader epoch information from individual partitions. We can ensure that epochs increase monotonically in order to prevent using stale information during retry.
Another thing we could do is reduce the topics we are fetching metadata for as the ListOffsets requests complete. Ideally we'd only be refetching metadata for topics with metadata errors.
Yes, these improvements would be nice. At the moment I've kept it very simple and just made it retry the full metadata request every time.
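The leader-epoch idea discussed above can be sketched as follows. This is a hypothetical standalone helper, not the actual AdminClient code: it keeps the highest leader epoch seen per partition and flags metadata whose epoch has gone backwards, which is exactly the stale-metadata case a retry should ignore.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: track the highest leader epoch observed per
// partition and reject stale metadata encountered during a retry.
public class EpochTracker {
    private final Map<String, Integer> lastSeenEpoch = new HashMap<>();

    // Returns true if the metadata is fresh (epoch did not go backwards)
    // and records the new epoch; returns false for stale metadata.
    public boolean maybeUpdate(String topicPartition, int leaderEpoch) {
        Integer last = lastSeenEpoch.get(topicPartition);
        if (last != null && leaderEpoch < last) {
            return false; // metadata from an earlier leader epoch: stale
        }
        lastSeenEpoch.put(topicPartition, leaderEpoch);
        return true;
    }
}
```

The class and method names here are invented for illustration; the real improvement would live inside the AdminClient's metadata retry path.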
I'm a bit confused about what's going on with this API. We first wait on the aggregate future and then we wrap it in another future. That seems wrong, right? The call to all() shouldn't itself block.
I don't think we need the nested futures here. The new API just works with a single group, so it seems like the type should just be KafkaFuture&lt;Map&lt;TopicPartition, Void&gt;&gt;. Also, note that we don't want to return KafkaFutureImpl directly.
Yes. For consistency, it's actually best to have KafkaFuture&lt;Map&lt;TopicPartition, Errors&gt;&gt;, so it's the same as deleteConsumerGroupOffsets().
Hmm... actually I think that's a mistake in deleteConsumerGroupOffsets. We don't want to expose Errors directly. I will submit a separate PR.
Let's just use the same two APIs from deleteConsumerGroupOffsets:

```java
public KafkaFuture<Void> partitionResult(final TopicPartition partition);
public KafkaFuture<Void> all();
```
This is a good catch, thanks @hachikuji. We can address 8992 within the 2.4 deadline.
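The agreed result shape can be sketched as a small standalone class. This is an illustration only, using CompletableFuture as a stand-in for Kafka's KafkaFuture, and the class and partition-key types are hypothetical: each partition gets its own future, partitionResult() rejects partitions that were not in the request, and all() succeeds only if every per-partition future succeeds.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Sketch of the result shape discussed above. CompletableFuture stands in
// for KafkaFuture; String stands in for TopicPartition.
public class AlterOffsetsResultSketch {
    private final Map<String, CompletableFuture<Void>> futures;

    public AlterOffsetsResultSketch(Map<String, CompletableFuture<Void>> futures) {
        this.futures = futures;
    }

    // Future for a single partition; fails fast if the partition was
    // not part of the original request.
    public CompletableFuture<Void> partitionResult(String partition) {
        CompletableFuture<Void> future = futures.get(partition);
        if (future == null)
            throw new IllegalArgumentException("Partition " + partition + " was not included in the request");
        return future;
    }

    // Completes only when every per-partition future has succeeded.
    public CompletableFuture<Void> all() {
        return CompletableFuture.allOf(futures.values().toArray(new CompletableFuture[0]));
    }
}
```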
@mimaison Note the compilation failure:
Thanks @hachikuji, fixed
hachikuji
left a comment
Thanks, I think we're almost there, but there are still a couple of problems to fix.
Hmm... this is a little different from what we have in DeleteConsumerGroupOffsetsResult. I think it makes sense to check all the partition-level errors. cc @dajac
That's a fair point, but I am not sure what the best option is. The rationale behind not looking at individual topic/partitions was that it allows using all() to wait for the completion of the request and then checking the individual results. In this case, all() fails only if the whole group has failed.
To be more concrete, it allows doing the following:

```java
DeleteConsumerGroupOffsetsResult result = ...;
try {
    // wait for the whole group; only raises when a group-level or
    // transport-level exception affecting the whole request occurs
    result.all().get();
    // inspect an individual topic/partition
    try {
        result.partitionResult(...).get();
    } catch (Exception e) {
        // handle partition exception
    }
} catch (Exception e) {
    // handle group-level exception
}
```

I think that this facilitates the error handling. What do you think?
That's an interesting point. I think the usual semantics of all() is to only succeed if all individual operations have succeeded. It's sort of designed for lazy error handling, I guess. If users care about the individual operations, they can check them individually; otherwise they have a convenient way to check for any errors. Based on what I've seen, this tends to be the most frequent use. I think part of the idea is also to abstract away from the underlying requests: some of the admin APIs result in multiple broker requests, which makes exposing the full granularity of errors quite cumbersome.
I just made a pass on all XXXResult classes and I think the API semantics are a bit inconsistent in general. Originally I thought we only need the all() function if the result contains futures in the form of Map&lt;..., KafkaFuture&lt;...&gt;&gt;, which potentially requires one trip for each nested future, with all() used as a lazy way to check that all entries have completed successfully. But some (e.g. RemoveMemberFromGroupResult, in the form of Map&lt;MemberIdentity, KafkaFuture&lt;Void&gt;&gt;) actually only require one request too, so all the futures would always complete at the same time. For those cases we do not need an all() function either.
But it seems that for results that only contain a KafkaFuture&lt;Object&gt; we also have a dummy all() function, and many of their all() semantics differ too.
Honestly, I think not all results need an all() function, but it seems we are already a bit messy here...
Yeah, unfortunately the admin APIs have such a big surface area that it's hard to maintain consistency. I think the original intent is what I described, though.
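The two all() semantics debated above can be contrasted in a small sketch. This is an illustration, not Kafka code, with CompletableFuture standing in for KafkaFuture: the "strict" variant fails if any individual operation failed, while the "lenient" variant only waits for completion and leaves failures to per-partition inspection.

```java
import java.util.Collection;
import java.util.concurrent.CompletableFuture;

// Sketch contrasting the two all() semantics discussed in this thread.
public class AllSemantics {
    // Usual semantics: all() fails if any individual operation failed.
    public static CompletableFuture<Void> allStrict(Collection<CompletableFuture<Void>> futures) {
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]));
    }

    // Alternative discussed above: all() only waits for completion and
    // swallows individual failures, so callers inspect each result themselves.
    public static CompletableFuture<Void> allLenient(Collection<CompletableFuture<Void>> futures) {
        CompletableFuture<?>[] settled = futures.stream()
            .map(f -> f.handle((v, e) -> null)) // complete normally either way
            .toArray(CompletableFuture[]::new);
        return CompletableFuture.allOf(settled);
    }
}
```

With a mix of one succeeded and one failed future, allStrict() completes exceptionally while allLenient() completes normally, which is the distinction between failing fast on any partition error and only surfacing group-level failures.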
@mimaison I see the recent comments were marked resolved, but I don't see the changes. Are you still working on an update?
Yes, I started making the changes but I haven't had the time to finish them yet. I'll push an update tomorrow or Friday. Sorry for the delay.
@hachikuji I've pushed an update
Thanks @hachikuji for the feedback, I've pushed another update
The user is trying to access a partition that was not requested. I think we could raise IllegalArgumentException directly to the user.
This is a bit subtle, but I think we want to raise the InvalidMetadataException rather than constructing a new Call. The problem is that we lose the retry bookkeeping, which means these retries will not respect the backoff. By throwing the exception, we let the retry logic in Call.fail kick in. This would be consistent with the logic in getFindCoordinatorCall.
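The retry-bookkeeping point above can be illustrated with a toy model. This is a hypothetical sketch, not the actual AdminClient internals: a call object counts its own attempts, so rethrowing into the existing failure path preserves the attempt count and backoff, whereas constructing a brand-new call would silently reset both.

```java
// Hypothetical sketch of per-call retry bookkeeping. A fresh instance
// starts with tries == 0, which is exactly the state that is lost when
// a new call object is constructed instead of failing the existing one.
public class RetriableCall {
    private final int maxTries;
    private final long retryBackoffMs;
    int tries = 0;

    public RetriableCall(int maxTries, long retryBackoffMs) {
        this.maxTries = maxTries;
        this.retryBackoffMs = retryBackoffMs;
    }

    // Invoked when the call fails with a retriable error, mirroring the
    // role of Call.fail. Returns the earliest time the retry may be sent,
    // or -1 if the retry budget is exhausted.
    public long fail(long nowMs) {
        tries++;
        if (tries >= maxTries)
            return -1; // give up: too many attempts
        return nowMs + retryBackoffMs; // honor the configured backoff
    }
}
```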
Thanks @hachikuji, I've updated the PR and rebased on trunk
hachikuji
left a comment
LGTM. Thanks for the patch!
retest this please
LGTM!
Amazing! Thanks all for the efforts on this!
Thanks guys, starting the Spark integration part.