
KAFKA-9677: Fix consumer fetch with small consume bandwidth quotas#8290

Merged
rajinisivaram merged 3 commits into apache:trunk from apovzner:kafka-9677
Mar 14, 2020

Conversation

@apovzner
Contributor

@apovzner apovzner commented Mar 12, 2020

When we changed quota communication with KIP-219, fetch requests are throttled by returning an empty response with the delay in throttle_time_ms, and the Kafka consumer retries after the delay. With default configs, the maximum fetch size could be as large as 50MB (or 10MB per partition). With the default broker config (1-second window, 10 full windows of tracked bandwidth/thread utilization usage), a consumer quota below 5MB/s (per broker) may block consumers from being able to fetch any data.

This PR ensures that consumers cannot be blocked by quotas by capping fetchMaxBytes in KafkaApis.handleFetchRequest() to quota window * consume bandwidth quota. With the default configs (10-second quota window) and a 1MB/s consumer bandwidth quota, fetchMaxBytes would be capped at 10MB.
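The capping arithmetic described above can be sketched as follows. This is a minimal illustration, not the actual KafkaApis code; the names capFetchMaxBytes, quotaBytesPerSecond, and quotaWindowSizeSeconds are hypothetical stand-ins for the broker's quota configuration:

```java
public class FetchQuotaCap {
    /**
     * Cap a fetch request's max bytes so a throttled consumer can still
     * make progress: never request more bytes than the quota allows
     * over the full sample window (window size * windows tracked).
     */
    static long capFetchMaxBytes(long fetchMaxBytes,
                                 long quotaBytesPerSecond,
                                 long quotaWindowSizeSeconds) {
        long maxAllowedBytes = quotaBytesPerSecond * quotaWindowSizeSeconds;
        return Math.min(fetchMaxBytes, maxAllowedBytes);
    }

    public static void main(String[] args) {
        // Default configs from the PR description: 10-second total quota
        // window, 1MB/s consumer quota, 50MB default maximum fetch size.
        long capped = capFetchMaxBytes(50L * 1024 * 1024,  // 50MB fetch.max.bytes
                                       1024 * 1024,        // 1MB/s quota
                                       10);                // 10-second window
        System.out.println(capped);  // 10485760 (10MB)
    }
}
```

With these numbers the cap is 10 * 1MB = 10MB, matching the example in the description; a fetch smaller than the cap is left unchanged.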

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@hachikuji
Contributor

ok to test

Contributor

@rajinisivaram left a comment


@apovzner Thanks for the PR, looks good. Left one question. CustomQuotaTest failure in the PR builds may be related, so worth investigating that.

```diff
  val metric = throttleMetric(QuotaType.Fetch, consumerClientId)
  throttled = metric != null && metricValue(metric) > 0
-} while (numConsumed < maxRecords && !throttled)
+} while (numConsumed < maxRecords && !throttled && System.currentTimeMillis < startMs + longTimeoutMs)
```
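The change above bounds the consume loop with a wall-clock timeout in addition to the record-count and throttle conditions. The same pattern in plain Java, with consumeUntil, RecordSource, and ThrottleCheck as hypothetical stand-ins for the test helpers in BaseQuotaTest:

```java
public class BoundedConsumeLoop {
    // Hypothetical stand-ins for the quota test's helpers.
    interface RecordSource { int poll(); }           // records consumed per poll
    interface ThrottleCheck { boolean throttled(); } // throttle metric observed?

    /**
     * Consume until maxRecords have been read, throttling is observed,
     * or timeoutMs of wall-clock time elapses -- whichever comes first.
     * The timeout guards against the loop spinning forever when neither
     * of the other two conditions is ever met.
     */
    static int consumeUntil(RecordSource source, ThrottleCheck check,
                            int maxRecords, long timeoutMs) {
        long startMs = System.currentTimeMillis();
        int numConsumed = 0;
        boolean throttled = false;
        do {
            numConsumed += source.poll();
            throttled = check.throttled();
        } while (numConsumed < maxRecords && !throttled
                 && System.currentTimeMillis() < startMs + timeoutMs);
        return numConsumed;
    }

    public static void main(String[] args) {
        // Simulated source: 10 records per poll, never throttled.
        int consumed = consumeUntil(() -> 10, () -> false, 100, 60_000L);
        System.out.println(consumed);  // 100
    }
}
```

As discussed below, the timeout was later tightened from 10 minutes to 1 minute since each test run was observed to finish in 20-30 seconds.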
Contributor


Does this really need 10 minutes? One minute itself seems like a long time for the tests. Should we send larger messages to get the test to complete faster?

Contributor Author


I used such a long timeout because there was previously no timeout at all, and I wanted to make sure it was sufficient. I verified that each test runs at most 20-30 seconds, and updated the code to use a 1-minute timeout.

@apovzner
Contributor Author

Hi @rajinisivaram, thanks for the review! I fixed CustomQuotaCallbackTest and reduced the timeout in BaseQuotaTest.

@rajinisivaram
Contributor

retest this please

@rajinisivaram
Contributor

ok to test

Contributor

@rajinisivaram left a comment


@apovzner Thanks for the PR, LGTM. Merging to trunk.

@rajinisivaram rajinisivaram merged commit 78554e7 into apache:trunk Mar 14, 2020
ijuma added a commit to confluentinc/kafka that referenced this pull request Mar 17, 2020
* apache-github/trunk: (39 commits)
  MINOR: cleanup and add tests to StateDirectoryTest (apache#8304)
  HOTFIX: StateDirectoryTest should use Set instead of List (apache#8305)
  MINOR: Fix build and JavaDoc warnings (apache#8291)
  MINOR: Fix kafka.server.RequestQuotaTest missing new ApiKeys. (apache#8302)
  KAFKA-9712: Catch and handle exception thrown by reflections scanner (apache#8289)
  KAFKA-9670; Reduce allocations in Metadata Response preparation (apache#8236)
  MINOR: fix Scala 2.13 build error introduced in apache#8083 (apache#8301)
  MINOR: enforce non-negative invariant for checkpointed offsets (apache#8297)
  MINOR: comment apikey types in generated switch (apache#8201)
  MINOR: Fix typo in CreateTopicsResponse.json (apache#8300)
  KIP-546: Implement describeClientQuotas and alterClientQuotas. (apache#8083)
  KAFKA-6647: Do note delete the lock file while holding the lock (apache#8267)
  KAFKA-9677: Fix consumer fetch with small consume bandwidth quotas (apache#8290)
  KAFKA-9533: Fix JavaDocs of KStream.transformValues (apache#8298)
  MINOR: reuse pseudo-topic in FKJoin (apache#8296)
  KAFKA-6145: Pt 2. Include offset sums in subscription (apache#8246)
  KAFKA-9714; Eliminate unused reference to IBP in `TransactionStateManager` (apache#8293)
  KAFKA-9718; Don't log passwords for AlterConfigs in request logs (apache#8294)
  KAFKA-8768: DeleteRecords request/response automated protocol (apache#7957)
  KAFKA-9685: Solve Set concatenation perf issue in AclAuthorizer
  ...
@ijuma
Member

ijuma commented Mar 19, 2020

@rajinisivaram shall we cherry-pick to 2.5 and 2.4 branches?

@rajinisivaram
Contributor

@ijuma Yes, cherry-picking to 2.5 and 2.4.

rajinisivaram pushed a commit that referenced this pull request Mar 25, 2020
…8290)

(Commit message identical to the PR description above.)

Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
rajinisivaram pushed a commit that referenced this pull request Mar 25, 2020
…8290)

(Commit message identical to the PR description above.)

Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
qq619618919 pushed a commit to qq619618919/kafka that referenced this pull request May 12, 2020
…pache#8290)

(Commit message identical to the PR description above.)

Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
4 participants