Skip to content

KAFKA-15859: Add timeout field to the ListOffsets request#17112

Merged
satishd merged 3 commits intoapache:trunkfrom
kamalcph:KAFKA-15859b
Sep 26, 2024
Merged

KAFKA-15859: Add timeout field to the ListOffsets request#17112
satishd merged 3 commits intoapache:trunkfrom
kamalcph:KAFKA-15859b

Conversation

@kamalcph
Copy link
Copy Markdown
Contributor

@kamalcph kamalcph commented Sep 6, 2024

This is the part-3 of the KIP-1075

Added a timeoutMs field to the ListOffsets request. This timeout is applicable only for the topic/partitions that are enabled with remote storage.

When the timeout is defined in the request, then we use it to define the delay timeout for DelayedRemoteListOffsets request. When the timeout is not defined (requests from older client), then we take the dynamic remote.list.offsets.request.timeout.ms server config as the timeout.

Consumer and Admin client behavior are different. Consumer retries the LIST_OFFSETS request in-case of an error but not the AdminClient. And, consumer timeouts the request, if the response exceeds request.timeout.ms, whereas, AdminClient timeouts the request when it exceeds the default.api.timeout.ms.

To retain the same behavior, we are passing the requestTimeoutMs as timeout from the consumer and defaultApiTimeout / overwritten ListOffsetsOption timeout from the admin.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@kamalcph kamalcph added the tiered-storage Related to the Tiered Storage feature label Sep 6, 2024
@kamalcph
Copy link
Copy Markdown
Contributor Author

kamalcph commented Sep 6, 2024

This PR is not dependent on #16602 and can be reviewed separately. PTAL.

"validVersions": "0-9",
//
// Version 10 enables async remote list offsets support (KIP-1075)
"validVersions": "0-10",
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we have flexible versions from v6, is it required to bump the version to 10?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this is required.

@kamalcph
Copy link
Copy Markdown
Contributor Author

This is the final part-3 of KIP-1075. The PR is ready for review. PTAL. Thanks!

Copy link
Copy Markdown
Member

@satishd satishd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kamalcph for the PR, left a minor comment.

Comment thread clients/src/main/resources/common/message/ListOffsetsRequest.json Outdated
@kamalcph kamalcph force-pushed the KAFKA-15859b branch 2 times, most recently from 61ded02 to 4c274e8 Compare September 19, 2024 13:56
Copy link
Copy Markdown
Member

@showuon showuon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should do some guard like this one: KAFKA-17331?

"validVersions": "0-9",
//
// Version 10 enables async remote list offsets support (KIP-1075)
"validVersions": "0-10",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this is required.

Comment thread server-common/src/main/java/org/apache/kafka/server/common/MetadataVersion.java Outdated
@kamalcph
Copy link
Copy Markdown
Contributor Author

Do you think we should do some guard like this one: KAFKA-17331?

If the client doesn't provides the timeout, then we will use the remote.list.offsets.request.timeout.ms timeout configured on the server. So, I think it is not required.

@kamalcph
Copy link
Copy Markdown
Contributor Author

Verified that the timeout propagates as expected by running the server, admin, and consumer locally:

To retain the same behavior, we are passing the requestTimeoutMs as timeout from the consumer and defaultApiTimeout / overwritten ListOffsetsOption timeout from the admin.

Copy link
Copy Markdown
Member

@showuon showuon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Copy Markdown
Member

@satishd satishd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kamalcph for addressing the review comments. LGTM.

@satishd satishd merged commit 560076b into apache:trunk Sep 26, 2024
@kamalcph kamalcph deleted the KAFKA-15859b branch September 26, 2024 10:42
@mumrah
Copy link
Copy Markdown
Member

mumrah commented Sep 26, 2024

@satishd @showuon this PR broke some tests and has caused trunk builds to fail.

The latest CI run for this PR was https://github.com/apache/kafka/actions/runs/11027485627 which shows both of the JUnit steps failing.

Unlike with Jenkins, we need to look at every failed test build -- especially before merging. Looking at these two failed jobs, we can see we have actual failed tests and not just flaky ones

image

https://github.com/apache/kafka/actions/runs/11027485627/job/30626323008

We want to see all the status checks in the PR to be green before merging. I would like to have branch protections in place to prevent this sort of regression, but until we sort out the flaky tests we can't really.

Anyways, I'm not meaning to pick on this PR -- just trying to raise awareness of our "new normal" for build expectations :)

@kamalcph
Copy link
Copy Markdown
Contributor Author

The failed test are fixed in #17287

@satishd
Copy link
Copy Markdown
Member

satishd commented Sep 27, 2024

Sorry for missing the tests(thought those were the flaky ones). Thanks @mumrah for following up and getting the fix merged.

@lucasbru
Copy link
Copy Markdown
Member

lucasbru commented Oct 2, 2024

@kamalcph We were using current trunk with brokers that do not have your change yet, and AdminClient's ListOffset call is running into

org.apache.kafka.common.errors.UnsupportedVersionException: Attempted to write a non-default timeoutMs at version 8

It seems this could be related to this PR. Could you please take a look?

@kamalcph
Copy link
Copy Markdown
Contributor Author

kamalcph commented Oct 3, 2024

org.apache.kafka.common.errors.UnsupportedVersionException: Attempted to write a non-default timeoutMs at version 8

Opened #17358 to address this issue. PTAL.

tedyu pushed a commit to tedyu/kafka that referenced this pull request Jan 6, 2025
This is part-3 of the KIP-1075.

Added a `timeoutMs` field to the ListOffsets request. This timeout is applicable only for the topic/partitions that are enabled with remote storage. 

When the timeout is defined in the request, then we use it to define the delay timeout for `DelayedRemoteListOffsets` request. When the timeout is not defined (requests from older client), then we take the dynamic `remote.list.offsets.request.timeout.ms` server config as the timeout.

Consumer and Admin client behavior are different. Consumer retries the LIST_OFFSETS request in-case of an error but not the AdminClient. And, consumer timeouts the request, if the response exceeds `request.timeout.ms`, whereas, AdminClient timeouts the request when it exceeds the `default.api.timeout.ms`. 

To retain the same behavior, we are passing the `requestTimeoutMs` as timeout from the consumer and defaultApiTimeout / overwritten ListOffsetsOption timeout from the admin.

Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

clients consumer core Kafka Broker tiered-storage Related to the Tiered Storage feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants