KAFKA-13126: guard against overflow when computing `joinGroupTimeoutMs` by ableegoldman · Pull Request #11111 · apache/kafka

ableegoldman · 2021-07-22T21:58:52Z

In older versions of Kafka Streams, the max.poll.interval.ms config was overridden by default to Integer.MAX_VALUE. Even after we removed this override, users of both the plain consumer client and Kafka Streams still set the poll interval to MAX_VALUE fairly often. Unfortunately, this causes an overflow when computing the joinGroupTimeoutMs, which then ends up set to the request.timeout.ms instead, a much lower value.

This can easily make consumers drop out of the group, since they must now rejoin within 30s (by default) yet are under almost no obligation to call poll() given the high max.poll.interval.ms. We just need to check for overflow and clamp the value to Integer.MAX_VALUE when it occurs.
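To illustrate the failure mode described above, here is a minimal sketch. It is not the code from this PR: the class name, method names, and the 5-second padding constant are assumptions made for the example. It only shows how adding a pad to an `int` timeout of Integer.MAX_VALUE wraps negative, so a `Math.max` against the request timeout picks the request timeout, and how widening to `long` and clamping avoids that.

```java
// Sketch only: all names and the padding value are hypothetical, not Kafka's actual code.
final class JoinGroupTimeoutSketch {
    // Assumed padding added on top of the poll interval.
    static final int LAPSE_MS = 5_000;

    // Unguarded: the int addition wraps around when maxPollIntervalMs is near Integer.MAX_VALUE,
    // so the sum turns negative and Math.max falls back to requestTimeoutMs.
    static int naiveTimeoutMs(int requestTimeoutMs, int maxPollIntervalMs) {
        return Math.max(requestTimeoutMs, maxPollIntervalMs + LAPSE_MS);
    }

    // Guarded: do the addition in long, then clamp the result to Integer.MAX_VALUE.
    static int guardedTimeoutMs(int requestTimeoutMs, int maxPollIntervalMs) {
        long padded = (long) maxPollIntervalMs + LAPSE_MS;
        int clamped = (int) Math.min(padded, (long) Integer.MAX_VALUE);
        return Math.max(requestTimeoutMs, clamped);
    }

    public static void main(String[] args) {
        int requestTimeoutMs = 30_000;
        int maxPollIntervalMs = Integer.MAX_VALUE;
        System.out.println(naiveTimeoutMs(requestTimeoutMs, maxPollIntervalMs));   // 30000
        System.out.println(guardedTimeoutMs(requestTimeoutMs, maxPollIntervalMs)); // 2147483647
    }
}
```

For ordinary poll intervals the two methods agree; they differ only when the padded sum no longer fits in an `int`.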

Also fixes a few other miscellaneous possible overflows on the side (from a ticket I came across while searching for existing tickets on the joinGroupTimeout bug: KAFKA-6948).

vvcephei

Thanks, @ableegoldman ! These all look good except the one I marked.

vvcephei · 2021-07-22T23:17:22Z

                throw new IOException("Connection to " + node + " failed.");
            }
-            long pollTimeout = expiryTime - attemptStartTime;
+            long pollTimeout = (startTime - attemptStartTime) + timeoutMs;


This would still overflow if timeoutMs is MAX_VALUE, right?

ableegoldman

The startTime is set once at the beginning of the method, while the attemptStartTime is initialized just before the first attempt and then updated again after every iteration. So the attemptStartTime is never earlier than the startTime, and therefore the quantity being added to the timeoutMs here is zero or negative.

But I see how that's confusing. I'll refactor the expression to make this clearer.

vvcephei

Ah, thanks!
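The reasoning in this thread can be sketched as follows. This is an illustration under assumptions, not the code that landed: the class and method names are hypothetical, and the deadline variable `expiryTime` is assumed to be computed as `startTime + timeoutMs`, which the excerpt does not show. The sketch contrasts a precomputed deadline, which wraps negative for a very large timeout and so breaks any `now < expiryTime` comparison, with an elapsed-time form that only ever subtracts a non-negative quantity from timeoutMs.

```java
// Sketch only: hypothetical names; the real method and its loop are not shown in this excerpt.
final class PollTimeoutSketch {
    // Deadline form: startTime + timeoutMs wraps to a negative value for very large timeouts.
    static long expiryTime(long startTime, long timeoutMs) {
        return startTime + timeoutMs;
    }

    // Elapsed form: attemptStartTime >= startTime, so elapsedMs is non-negative
    // and the subtraction cannot overflow for non-negative timeouts.
    static long remainingMs(long startTime, long attemptStartTime, long timeoutMs) {
        long elapsedMs = attemptStartTime - startTime;
        return timeoutMs - elapsedMs;
    }

    public static void main(String[] args) {
        long startTime = 1_000L;
        long attemptStartTime = 1_500L;
        long timeoutMs = Long.MAX_VALUE;

        // The overflowed deadline is negative, so a deadline check fails immediately.
        System.out.println(attemptStartTime < expiryTime(startTime, timeoutMs)); // false

        // The elapsed form still reports (almost) the full timeout as remaining.
        System.out.println(remainingMs(startTime, attemptStartTime, timeoutMs) == Long.MAX_VALUE - 500); // true
    }
}
```

Writing the remaining time as `timeoutMs - elapsedMs` is one way to make the sign of the adjustment obvious to a reader.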

vvcephei

Thanks!

showuon

LGTM!

showuon · 2021-07-23T06:43:26Z

Failed tests are unrelated.

    Build / JDK 11 and Scala 2.13 / kafka.api.TransactionsExpirationTest.testBumpTransactionalEpochAfterInvalidProducerIdMapping()
    Build / JDK 8 and Scala 2.12 / kafka.api.ConsumerBounceTest.testCloseDuringRebalance()
    Build / JDK 8 and Scala 2.12 / kafka.api.ConsumerBounceTest.testCloseDuringRebalance()
    Build / JDK 16 and Scala 2.13 / kafka.api.ConsumerBounceTest.testCloseDuringRebalance()
    Build / JDK 16 and Scala 2.13 / kafka.api.ConsumerBounceTest.testCloseDuringRebalance()

ableegoldman · 2021-07-23T23:22:47Z

Merged to trunk

…s` (apache#11111)

Setting the max.poll.interval.ms to MAX_VALUE causes overflow when computing the joinGroupTimeoutMs and results in the JoinGroup timeout being set to the request.timeout.ms instead, which is much lower.

This can easily make consumers drop out of the group, since they must rejoin now within 30s (by default) yet have no obligation to almost ever call poll() given the high max.poll.interval.ms, especially when each record takes a long time to process or the `max.poll.records` is also very large. We just need to check for overflow and fix it to Integer.MAX_VALUE when it occurs.

Reviewers: Luke Chen <showuon@gmail.com>, John Roesler <vvcephei@apache.org>

guard against overflow

3fe7ed8

ableegoldman requested review from guozhangwang and hachikuji July 22, 2021 21:58

fix some other possible overflows from KAFKA-6948 on the side

ceb14f6

ableegoldman requested a review from mjsax July 22, 2021 22:17

ableegoldman changed the title ~~HOTFIX: guard against overflow when computing joinGroupTimeoutMs~~ KAFKA-13126: guard against overflow when computing joinGroupTimeoutMs Jul 22, 2021

vvcephei requested changes Jul 22, 2021

clarify pollTimeout

ca9707a

ableegoldman mentioned this pull request Jul 23, 2021

KAFKA-6948 - Change comparison to avoid overflow inconsistencies #5183

Closed

3 tasks

vvcephei approved these changes Jul 23, 2021

checkstyle

ac5c8cd

showuon approved these changes Jul 23, 2021

ableegoldman merged commit 8b1eca1 into apache:trunk Jul 23, 2021

MaximGonnissen added a commit to MaximGonnissen/kafka that referenced this pull request May 29, 2022

Integrate "KAFKA-13126: guard against overflow when computing `joinGr…

0932f99

…oupTimeoutMs`" Integrated PR from apache/kafka: apache#11111

ableegoldman merged 4 commits into apache:trunk from ableegoldman:HOTFIX-guard-against-joinGroupTimeoutMs-overflow

3 participants