
KAFKA-13152: Replace "buffered.records.per.partition" with "input.buffer.max.bytes" #11424

Merged
guozhangwang merged 18 commits into apache:trunk from vamossagar12:KAFKA-13152
Jan 28, 2022
Conversation

@vamossagar12
Contributor

This PR is an implementation of: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=186878390. The following changes have been made:

  • Added a new config, input.buffer.max.bytes, applicable at the topology level.
  • Added a new config, statestore.cache.max.bytes.
  • Added a new metric called total-bytes.
  • Deprecated the per-partition config buffered.records.per.partition.
  • Deprecated the config cache.max.bytes.buffering.
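
The old and new config names can be contrasted with a minimal sketch (the byte values below are illustrative examples, not the defaults):

```java
import java.util.Properties;

public class StreamsBufferConfigExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // New topology-level bound on total bytes buffered across input partitions.
        props.put("input.buffer.max.bytes", "536870912");
        // New name for the state store cache size (replaces cache.max.bytes.buffering).
        props.put("statestore.cache.max.bytes", "10485760");
        // The deprecated keys would still be honored for backward compatibility.
        System.out.println(props.getProperty("input.buffer.max.bytes"));
    }
}
```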

@vamossagar12
Contributor Author

@guozhangwang Implementation of the KIP. Note that the number of changed files is large because this renames the cache size config.

@vamossagar12
Contributor Author

@ableegoldman, could you also review this whenever you get a chance?

@ableegoldman ableegoldman changed the title Kafka 13152: Replace "buffered.records.per.partition" with "input.buffer.max.bytes" KAFKA-13152: Replace "buffered.records.per.partition" with "input.buffer.max.bytes" Oct 27, 2021
Member

@ableegoldman ableegoldman left a comment

Thanks for the PR! Lmk if you have any questions about any of my comments/questions -- I think the main thing is just that we, unfortunately, can't get rid of any of the code handling the old configs just yet, since not everyone will upgrade their code off of deprecated APIs right away (though we wish they would!)

Comment thread streams/src/main/java/org/apache/kafka/streams/KafkaStreams.java Outdated
Member

Seems like this is exactly the same logic as getCacheSizePerThread, can we instead just change the name of the existing method to match both cases rather than writing duplicate code? Maybe something like getMemorySharePerThread or getMemorySizePerThread?

Comment thread streams/src/main/java/org/apache/kafka/streams/KafkaStreams.java Outdated
Member

Same here, if it's going to be exactly the same then we only need one method.

Although, I don't think the GlobalThread actually even has an input buffer the way StreamThreads do (doesn't seem to make sense for it to need one because it can just process all of its polled records right away, whereas the StreamThread may need to buffer them for various reasons)

You could still probably combine into a single method and just include a flag for whether or not to call resize on the global thread (with it being true for the cache case, and false for the input buffer resizing)
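
A rough sketch of the suggested consolidation (method and parameter names are hypothetical, not the PR's actual API): one method computes the per-thread share of a memory budget, with a flag controlling whether the global thread counts toward (and receives) a share:

```java
public class MemoryShareExample {
    // Hypothetical stand-in for getMemorySizePerThread: divide the total budget
    // among the stream threads, optionally including the global thread.
    static long getMemorySizePerThread(long totalBytes, int numStreamThreads, boolean includeGlobalThread) {
        int shares = numStreamThreads + (includeGlobalThread ? 1 : 0);
        return shares == 0 ? totalBytes : totalBytes / shares;
    }

    public static void main(String[] args) {
        System.out.println(getMemorySizePerThread(100, 4, true));  // cache: global thread included
        System.out.println(getMemorySizePerThread(100, 4, false)); // input buffer: stream threads only
    }
}
```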

Contributor

+1 here as well. I think we would always resize both buffer and state cache at the same time moving forward.

Member

I think you still need (or at least want) to keep these deprecated configs defined

Member

nit: call this one resizeCache to differentiate between this and the buffer size? Also, is there a method somewhere to update the input buffer size? Seems like we're always resizing the cache when a thread is added/removed

Contributor Author

Haha yeah that's a good catch. Missed that part. This is the method which is supposed to do the same

Member

Do we need this? Seems like we can get rid of the extra maxBufferSizeBytes and instead just directly read out the value of maxBufferResizeSize (in which case we should probably rename maxBufferResizeSize to maxBufferSizeBytes -- my point is, I think we can get away with just storing the current size of the buffer as a long and the max size as an AtomicLong that can be updated from outside the thread)
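
The suggestion above can be sketched as follows (field and method names are illustrative): the owning thread reads and updates the current fill level as a plain long, while the maximum lives in an AtomicLong so other threads can resize it safely:

```java
import java.util.concurrent.atomic.AtomicLong;

public class BufferBoundsExample {
    private long bufferSize = 0; // only touched by the owning thread
    private final AtomicLong maxBufferSizeBytes = new AtomicLong(512L * 1024 * 1024);

    void resize(long newMaxBytes) {  // safe to call from another thread
        maxBufferSizeBytes.set(newMaxBytes);
    }

    boolean shouldPause() {          // called from the owning thread
        return bufferSize > maxBufferSizeBytes.get();
    }

    public static void main(String[] args) {
        BufferBoundsExample b = new BufferBoundsExample();
        b.bufferSize = 100;
        b.resize(50);
        System.out.println(b.shouldPause());
    }
}
```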

Contributor Author

Thanks for the PR! Lmk if you have any questions about any of my comments/questions -- I think the main thing is just that we, unfortunately, can't get rid of any of the code handling the old configs just yet, since not everyone will upgrade their code off of deprecated APIs right away (though we wish they would!)

Thanks for the review. I have added some comments/questions. The other suggestions make total sense and I will incorporate them.

Member

This logic is a little difficult to read 😅 -- would it help to just move the bufferSize -= processedData.totalBytesConsumed; to before this check?

Although on that note, it might be cleaner to just move this check regarding whether to resume consuming to right before we call poll, that way there's a nice symmetry between the pause and resume checks, and all the logic is consolidated to one place

Contributor Author

Haha :D
The reason I have the bufferSize > maxBufferSizeBytes check is that the resume should happen only if the buffer size, which had breached the threshold, went back below it after the current round of consumption. Without that check, we would always enter the if block, even when the buffer size is already below the threshold (and we would subtract more and reduce it further). Did that make sense? :D
The idea of placing it here is to check, right after removing some records from the buffer, whether the buffer size came down. It is similar to how the StreamThread resume used to work (the one I removed). This logic could very well go before poll, but I thought adding it here was less invasive, as there's already some metrics-related work and other things happening in this block.
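
The condition being defended can be sketched in isolation (names illustrative): resume only when the buffer had breached the maximum and this round of consumption brought it back below:

```java
public class ResumeCheckExample {
    // Resume only on a threshold *crossing*: previously above the max,
    // now at or below it after subtracting the consumed bytes.
    static boolean shouldResume(long bufferSizeBefore, long bytesConsumed, long maxBufferSizeBytes) {
        long after = bufferSizeBefore - bytesConsumed;
        return bufferSizeBefore > maxBufferSizeBytes && after <= maxBufferSizeBytes;
    }

    public static void main(String[] args) {
        System.out.println(shouldResume(1200, 300, 1000)); // crossed back below -> true
        System.out.println(shouldResume(800, 100, 1000));  // never breached -> false
    }
}
```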

Contributor

I also feel this logic is a bit awkward, starting from the fact that we need to report how many bytes we've consumed from the process :) I think we can simply do the following:

At the end of the polling phase, and at the end of the process loop (a.k.a. here), we loop over all the active tasks and get their "input buffer size", which would delegate to each task's corresponding PartitionGroup and then RecordQueue. Based on that we can decide whether to resume / pause accordingly. Then:

  1. we do not need to maintain a local bufferSize at the task here, i.e. we always re-compute from the task's record queue, which is the source of truth.
  2. we do not need to maintain and propagate up the consumed bytes within each iteration here.
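
The suggested alternative can be sketched like this (the Task interface here is simplified to a single method; the real Kafka Streams types differ): re-compute the total buffered bytes from the tasks' record queues and decide pause/resume from that one number:

```java
import java.util.List;

public class PauseResumeExample {
    interface Task { long totalBytesBuffered(); } // simplified stand-in

    // Re-derive buffer usage from the source of truth instead of
    // propagating consumed-bytes deltas up through each iteration.
    static String decide(List<Task> activeTasks, long maxBufferBytes) {
        long total = activeTasks.stream().mapToLong(Task::totalBytesBuffered).sum();
        return total > maxBufferBytes ? "pause" : "resume";
    }

    public static void main(String[] args) {
        List<Task> tasks = List.of(() -> 300L, () -> 900L);
        System.out.println(decide(tasks, 1000L)); // 1200 buffered vs. max of 1000
    }
}
```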

Member

Why do we only pause the non-empty partitions? If the buffer is full, we have to pause all of them, no?

Contributor Author

Yeah.. this was something that was discussed/decided in the JIRA conversation. You can find the explanation here: https://issues.apache.org/jira/browse/KAFKA-13152?focusedCommentId=17400647&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17400647

Member

Ah now I remember, thanks 😄

Member

Can you leave a comment above this method explaining this? Don't want to forget why we did this a year from now and then accidentally break things

Comment on lines 722 to 739
Member

Unfortunately we can't remove this logic yet, since some users may still be setting only the buffered.records.per.partition config, and they shouldn't see a sudden explosion of memory just because they didn't switch over to the new config right away

Contributor Author

@vamossagar12 vamossagar12 Oct 29, 2021

I see... but if both configs are in play, won't that also lead to pause/resume of partitions based on whichever threshold is breached first? Maybe we can add a check so that we only do this if maxBufferedSize is set to some value? We might also want to consider whether it has a default value and use the condition accordingly.

Member

Maybe we can add some check that we do this only if maxBufferedSize is set to some value?

Yeah, sorry, I should have been more clear here -- we only need to continue doing this if the user is still setting the buffered.records.per.partition config. I mentioned this in another comment, but just in case you weren't aware, you can call originals() on a StreamsConfig object to get a map of the actual configs passed in by the user -- that way you know what they actually set vs what's just a default.

Then you can just set maxBufferedSize to null or define a static constant NOT_SET = -1 and then only continue doing this partition-level pause/resume if the user is still using buffered.records.per.partition. Does that make sense?
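
The sentinel pattern described here can be sketched as follows (the surrounding class and helper method are illustrative; only the config key and the originals() idea come from the discussion):

```java
import java.util.Map;

public class DeprecatedConfigFallbackExample {
    static final int NOT_SET = -1;
    static final String BUFFERED_RECORDS_PER_PARTITION = "buffered.records.per.partition";

    // originals() on StreamsConfig returns only what the user actually passed in,
    // so a default value never looks like an explicit setting.
    static int maxBufferedSize(Map<String, Object> originals) {
        return originals.containsKey(BUFFERED_RECORDS_PER_PARTITION)
            ? Integer.parseInt(originals.get(BUFFERED_RECORDS_PER_PARTITION).toString())
            : NOT_SET;
    }

    public static void main(String[] args) {
        System.out.println(maxBufferedSize(Map.of(BUFFERED_RECORDS_PER_PARTITION, "1000")));
        System.out.println(maxBufferedSize(Map.of()));
    }
}
```

The partition-level pause/resume path would then run only when maxBufferedSize != NOT_SET.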

Contributor

@guozhangwang guozhangwang left a comment

Thanks for the PR! I made a first pass on it.

Contributor

nit: Could we merge these two info lines into a single one? It seems a bit redundant to log twice here. Also, this new log line seems wrong, since it has four parameters but only three values provided.

E.g.

Adding StreamThread-{}, the current total number of threads is {}, each thread now has a buffer size of {} and a cache size of {}

And

Terminating StreamThread-{}, the current total number of threads is {}, each thread now has a buffer size of {} and a cache size of {}

Contributor

nit: ditto here. See above for the consolidated log line. Here we can emphasize it is "Terminating newly added threads".

Contributor

Ditto here as well.

Contributor

Not sure I understand this logic here: it seems we only call setBytesConsumed once with 0 here and there's no other callers?

Contributor

Also, even with the correct logic, I'm wondering if we can just define it as a local variable within the process here instead of augmenting the Task interface?

Contributor

This does not look right to me: why do we use the size value read from cacheResizeSize to assign to maxBufferSizeBytes? They should be totally orthogonal.

Contributor

I think we can simplify the logic and avoid keeping track of "consumed bytes" within a task here; see my other comment.

Contributor

This function seems not used.

BTW if we do not maintain the local bufferSize then we would not need it anyways :)

Contributor

See my other comment: I think we can avoid propagating both processed-records and processed-bytes from the process call.

@vamossagar12 vamossagar12 force-pushed the KAFKA-13152 branch 2 times, most recently from 6f6316d to 656c01b on November 6, 2021 19:04
@guozhangwang
Contributor

@vamossagar12 the Jenkins failures are due to compilation warnings:


[2021-11-06T19:08:04.459Z] > Task :core:compileTestScala

[2021-11-06T19:08:04.459Z] [Warn] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/core/src/test/scala/integration/kafka/server/DynamicBrokerReconfigurationTest.scala:38: imported `QuorumTestHarness` is permanently hidden by definition of object QuorumTestHarness in package server

[2021-11-06T19:08:06.285Z] [Warn] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/core/src/test/scala/integration/kafka/server/MultipleListenersWithSameSecurityProtocolBaseTest.scala:30: imported `QuorumTestHarness` is permanently hidden by definition of object QuorumTestHarness in package server

[2021-11-06T19:08:16.686Z] [Warn] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/core/src/test/scala/unit/kafka/server/AdvertiseBrokerTest.scala:22: imported `QuorumTestHarness` is permanently hidden by definition of object QuorumTestHarness in package server

[2021-11-06T19:08:16.686Z] [Warn] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/core/src/test/scala/unit/kafka/server/BrokerEpochIntegrationTest.scala:27: imported `QuorumTestHarness` is permanently hidden by definition of object QuorumTestHarness in package server

[2021-11-06T19:08:16.686Z] [Warn] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/core/src/test/scala/unit/kafka/server/DynamicConfigTest.scala:21: imported `QuorumTestHarness` is permanently hidden by definition of object QuorumTestHarness in package server

[2021-11-06T19:08:18.513Z] [Warn] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/core/src/test/scala/unit/kafka/server/FinalizedFeatureChangeListenerTest.scala:22: imported `QuorumTestHarness` is permanently hidden by definition of object QuorumTestHarness in package server

[2021-11-06T19:08:18.513Z] [Warn] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/core/src/test/scala/unit/kafka/server/KafkaMetricReporterClusterIdTest.scala:23: imported `QuorumTestHarness` is permanently hidden by definition of object QuorumTestHarness in package server

[2021-11-06T19:08:18.513Z] [Warn] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/core/src/test/scala/unit/kafka/server/KafkaMetricsReporterTest.scala:24: imported `QuorumTestHarness` is permanently hidden by definition of object QuorumTestHarness in package server

[2021-11-06T19:08:18.513Z] [Warn] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/core/src/test/scala/unit/kafka/server/KafkaServerTest.scala:22: imported `QuorumTestHarness` is permanently hidden by definition of object QuorumTestHarness in package server

[2021-11-06T19:08:18.513Z] [Warn] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/core/src/test/scala/unit/kafka/server/LeaderElectionTest.scala:32: imported `QuorumTestHarness` is permanently hidden by definition of object QuorumTestHarness in package server

[2021-11-06T19:08:18.513Z] [Warn] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/core/src/test/scala/unit/kafka/server/LogRecoveryTest.scala:25: imported `QuorumTestHarness` is permanently hidden by definition of object QuorumTestHarness in package server

[2021-11-06T19:08:20.343Z] [Warn] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/core/src/test/scala/unit/kafka/server/ReplicaFetchTest.scala:23: imported `QuorumTestHarness` is permanently hidden by definition of object QuorumTestHarness in package server

[2021-11-06T19:08:20.343Z] [Warn] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/core/src/test/scala/unit/kafka/server/ReplicationQuotasTest.scala:28: imported `QuorumTestHarness` is permanently hidden by definition of object QuorumTestHarness in package server

[2021-11-06T19:08:20.343Z] [Warn] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/core/src/test/scala/unit/kafka/server/ServerGenerateBrokerIdTest.scala:23: imported `QuorumTestHarness` is permanently hidden by definition of object QuorumTestHarness in package server

[2021-11-06T19:08:20.343Z] [Warn] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/core/src/test/scala/unit/kafka/server/ServerGenerateClusterIdTest.scala:29: imported `QuorumTestHarness` is permanently hidden by definition of object QuorumTestHarness in package server

[2021-11-06T19:08:20.343Z] [Warn] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/core/src/test/scala/unit/kafka/server/ServerShutdownTest.scala:19: imported `QuorumTestHarness` is permanently hidden by definition of object QuorumTestHarness in package server

[2021-11-06T19:08:20.343Z] [Warn] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/core/src/test/scala/unit/kafka/server/ServerStartupTest.scala:21: imported `QuorumTestHarness` is permanently hidden by definition of object QuorumTestHarness in package server

[2021-11-06T19:09:02.900Z] [Warn] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/core/src/test/scala/unit/kafka/log/OffsetIndexTest.scala:52: @nowarn annotation does not suppress any warnings

[2021-11-06T19:09:02.900Z] 18 warnings found

This seems not relevant to the PR; I will re-trigger it.

Contributor

@guozhangwang guozhangwang left a comment

Thanks @vamossagar12 , I've made a second pass and left some more comments.

Regarding the metric name, I will ping @ableegoldman and @cadonna again and get back to you.

Comment thread streams/src/main/java/org/apache/kafka/streams/KafkaStreams.java Outdated
Comment thread streams/src/main/java/org/apache/kafka/streams/KafkaStreams.java Outdated
Comment thread streams/src/main/java/org/apache/kafka/streams/KafkaStreams.java Outdated
Contributor

nit: could we add a TODO here that this logic should be removed once we remove the deprecated old config as well?

Contributor

nit: we do not need the ?focusedCommentId=17400647&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17400647 suffix in the javadoc :)

Contributor Author

Oh, that's because the ticket has a lot of comments and I intended to point to the comment that discusses the design decision made here. If it doesn't make sense, I will remove it :)

Contributor

This function seems not to have been addressed? Or am I missing something?

Contributor

These functions seem not to be used any more?

Contributor

This comment seems not addressed.

@vamossagar12
Contributor Author

#11424 (comment)

This is being used now. StreamThread delegates to this function to get the totalBytesBuffered.

@vamossagar12
Contributor Author

Thanks @vamossagar12 , I've made a second pass and left some more comments.

Regarding the metric name, I will ping @ableegoldman and @cadonna again and get back to you.

Thanks @guozhangwang . I had one question on the changes suggested by you.

@guozhangwang
Contributor

Thanks @guozhangwang . I had one question on the changes suggested by you.

Just replied on your question. LMK if that works or not.

@vamossagar12
Contributor Author

Thanks @guozhangwang . I had one question on the changes suggested by you.

Just replied on your question. LMK if that works or not.

Yeah it does. I made the changes. Please let me know when we have a decision on the metric name. Also, please review whether any other changes are needed.

@guozhangwang
Contributor

Thanks @vamossagar12 , I do not have further comments now, re-triggering the jenkins tests.

@guozhangwang
Contributor

@vamossagar12 there are some compilation errors in jenkins, could you rebase from latest trunk and check what's the issue? Please ping me once that's resolved.

@vamossagar12
Contributor Author

@guozhangwang, it's due to the usage of deprecated configs, and Gradle has a check against them:

[2022-01-04T23:35:30.575Z] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/streams/src/main/java/org/apache/kafka/streams/processor/internals/namedtopology/TopologyConfig.java:61: warning: [deprecation] CACHE_MAX_BYTES_BUFFERING_CONFIG in org.apache.kafka.streams.StreamsConfig has been deprecated

[2022-01-04T23:35:30.575Z]             .define(CACHE_MAX_BYTES_BUFFERING_CONFIG,

[2022-01-04T23:35:30.575Z]                     ^

[2022-01-04T23:35:30.575Z] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/streams/src/main/java/org/apache/kafka/streams/processor/internals/namedtopology/TopologyConfig.java:56: warning: [deprecation] BUFFERED_RECORDS_PER_PARTITION_CONFIG in org.apache.kafka.streams.StreamsConfig has been deprecated

[2022-01-04T23:35:30.575Z]              .define(BUFFERED_RECORDS_PER_PARTITION_CONFIG,

[2022-01-04T23:35:30.575Z]                      ^

[2022-01-04T23:35:30.575Z] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/streams/src/main/java/org/apache/kafka/streams/processor/internals/namedtopology/TopologyConfig.java:115: warning: [deprecation] BUFFERED_RECORDS_PER_PARTITION_CONFIG in org.apache.kafka.streams.StreamsConfig has been deprecated

[2022-01-04T23:35:30.575Z]         if (isTopologyOverride(BUFFERED_RECORDS_PER_PARTITION_CONFIG, topologyOverrides)) {

[2022-01-04T23:35:30.575Z]                                ^

[2022-01-04T23:35:30.575Z] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/streams/src/main/java/org/apache/kafka/streams/processor/internals/namedtopology/TopologyConfig.java:116: warning: [deprecation] BUFFERED_RECORDS_PER_PARTITION_CONFIG in org.apache.kafka.streams.StreamsConfig has been deprecated

[2022-01-04T23:35:30.575Z]             maxBufferedSize = getInt(BUFFERED_RECORDS_PER_PARTITION_CONFIG);

[2022-01-04T23:35:30.575Z]                                      ^

[2022-01-04T23:35:30.575Z] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/streams/src/main/java/org/apache/kafka/streams/processor/internals/namedtopology/TopologyConfig.java:117: warning: [deprecation] BUFFERED_RECORDS_PER_PARTITION_CONFIG in org.apache.kafka.streams.StreamsConfig has been deprecated

[2022-01-04T23:35:30.575Z]             log.info("Topology {} is overriding {} to {}", topologyName, BUFFERED_RECORDS_PER_PARTITION_CONFIG, maxBufferedSize);

[2022-01-04T23:35:30.575Z]                                                                          ^

[2022-01-04T23:35:30.575Z] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/streams/src/main/java/org/apache/kafka/streams/processor/internals/namedtopology/TopologyConfig.java:119: warning: [deprecation] BUFFERED_RECORDS_PER_PARTITION_CONFIG in org.apache.kafka.streams.StreamsConfig has been deprecated

[2022-01-04T23:35:30.575Z]             maxBufferedSize = globalAppConfigs.originals().containsKey(StreamsConfig.BUFFERED_RECORDS_PER_PARTITION_CONFIG)

[2022-01-04T23:35:30.575Z]                                                                                     ^

[2022-01-04T23:35:30.575Z] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/streams/src/main/java/org/apache/kafka/streams/processor/internals/namedtopology/TopologyConfig.java:120: warning: [deprecation] BUFFERED_RECORDS_PER_PARTITION_CONFIG in org.apache.kafka.streams.StreamsConfig has been deprecated

[2022-01-04T23:35:30.575Z]                     ? globalAppConfigs.getInt(StreamsConfig.BUFFERED_RECORDS_PER_PARTITION_CONFIG) : -1;

[2022-01-04T23:35:30.575Z]                                                            ^

[2022-01-04T23:35:30.575Z] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/streams/src/main/java/org/apache/kafka/streams/processor/internals/namedtopology/TopologyConfig.java:123: warning: [deprecation] CACHE_MAX_BYTES_BUFFERING_CONFIG in org.apache.kafka.streams.StreamsConfig has been deprecated

[2022-01-04T23:35:30.575Z]         if (isTopologyOverride(CACHE_MAX_BYTES_BUFFERING_CONFIG, topologyOverrides)) {

[2022-01-04T23:35:30.575Z]                                ^

[2022-01-04T23:35:30.575Z] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/streams/src/main/java/org/apache/kafka/streams/processor/internals/namedtopology/TopologyConfig.java:124: warning: [deprecation] CACHE_MAX_BYTES_BUFFERING_CONFIG in org.apache.kafka.streams.StreamsConfig has been deprecated

[2022-01-04T23:35:30.575Z]             cacheSize = getLong(CACHE_MAX_BYTES_BUFFERING_CONFIG);

[2022-01-04T23:35:30.575Z]                                 ^

[2022-01-04T23:35:30.575Z] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/streams/src/main/java/org/apache/kafka/streams/processor/internals/namedtopology/TopologyConfig.java:125: warning: [deprecation] CACHE_MAX_BYTES_BUFFERING_CONFIG in org.apache.kafka.streams.StreamsConfig has been deprecated

[2022-01-04T23:35:30.575Z]             log.info("Topology {} is overriding {} to {}", topologyName, CACHE_MAX_BYTES_BUFFERING_CONFIG, cacheSize);

[2022-01-04T23:35:30.575Z]                                                                          ^

[2022-01-04T23:35:30.575Z] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-11424/streams/src/main/java/org/apache/kafka/streams/processor/internals/namedtopology/TopologyConfig.java:127: warning: [deprecation] CACHE_MAX_BYTES_BUFFERING_CONFIG in org.apache.kafka.streams.StreamsConfig has been deprecated

[2022-01-04T23:35:30.575Z]             cacheSize = globalAppConfigs.getLong(CACHE_MAX_BYTES_BUFFERING_CONFIG);

[2022-01-04T23:35:30.575Z]                                                  ^

[2022-01-04T23:35:30.575Z] error: warnings found and -Werror specified

Locally I commented out that check for compile/test/checkstyle etc. I think we don't want to get rid of these configs immediately, so these usages are needed for now. What would you suggest here?

@guozhangwang
Contributor

@vamossagar12 Note that TopologyConfig is an internal class for now, and hence should not use deprecated values. I think we need to update this class to use the new config directly and remove the old ones --- since it is internal, this would not break any compatibility.

@vamossagar12
Contributor Author

hey @guozhangwang , for CACHE_MAX_BYTES_BUFFERING_CONFIG I can replace it, but as per one of the comments here => #11424 (comment), we wanted to check whether the deprecated config (BUFFERED_RECORDS_PER_PARTITION_CONFIG) has been set, and otherwise set it to -1. For setting it to -1, I am also checking whether it's a user-supplied value or a default one, and that's where the need to use this config in the TopologyConfig class comes from.

@vamossagar12
Contributor Author

vamossagar12 commented Jan 6, 2022

hey @guozhangwang , for CACHE_MAX_BYTES_BUFFERING_CONFIG I can replace it, but as per one of the comments here => #11424 (comment), we wanted to check whether the deprecated config (BUFFERED_RECORDS_PER_PARTITION_CONFIG) has been set, and otherwise set it to -1. For setting it to -1, I am also checking whether it's a user-supplied value or a default one, and that's where the need to use this config in the TopologyConfig class comes from.

@guozhangwang, I tried to address this again, but what I see is that since the two deprecated configs are topology-level configs, they are being set/checked in TopologyConfig. As I said, for backward-compatibility reasons, we check whether the user has set them. The values set here are then further used for getting/setting task-level configs.

@vamossagar12
Contributor Author

hey @guozhangwang , did you get a chance to look at my above comment?

@guozhangwang
Contributor

> hey @guozhangwang , did you get a chance to look at my above comment?

Ah yes, for StreamsConfig, in order to preserve backward compatibility we still need to reference some deprecated configs. So:

  1. For all references of CACHE_MAX_BYTES_BUFFERING_CONFIG, we should be replacing it with the new config. We just add @SuppressWarnings in the places where we reference it only for compatibility.

  2. For BUFFERED_RECORDS_PER_PARTITION_CONFIG, similarly we use @SuppressWarnings in the code where we still need to reference it for backward compatibility, like [here](https://github.com/apache/kafka/pull/11424/files#diff-0e5e608831150c058e2ad1b45d38ad941739562588ec0fdb97cc9f742919fb1fR119) in TopologyConfig.

Does that sound good to you? If yes we can go ahead and make it done.
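The backward-compatibility pattern described above might look like the following sketch. This is not the actual Kafka Streams code: the config name constants mirror the real Streams configs, but the class and helper method are hypothetical, and the default value is illustrative.

```java
import java.util.Map;

public class ConfigCompat {
    // Deprecated config, kept only for backward compatibility.
    @Deprecated
    public static final String CACHE_MAX_BYTES_BUFFERING_CONFIG = "cache.max.bytes.buffering";
    public static final String STATESTORE_CACHE_MAX_BYTES_CONFIG = "statestore.cache.max.bytes";

    // Reference the deprecated config only here, and suppress the warning locally.
    @SuppressWarnings("deprecation")
    public static long cacheSize(Map<String, Long> props) {
        Long newValue = props.get(STATESTORE_CACHE_MAX_BYTES_CONFIG);
        Long oldValue = props.get(CACHE_MAX_BYTES_BUFFERING_CONFIG);
        if (newValue != null) return newValue;   // new config wins if set
        if (oldValue != null) return oldValue;   // fall back to the deprecated config
        return 10 * 1024 * 1024L;                // illustrative default
    }
}
```

The point of the narrow `@SuppressWarnings` scope is that new code paths never see the deprecated name; only the compatibility shim does.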

@vamossagar12
Contributor Author

> > hey @guozhangwang , did you get a chance to look at my above comment?
>
> Ah yes, for StreamsConfig, in order to preserve backward compatibility we still need to reference some deprecated configs. So:
>
>   1. For all references of CACHE_MAX_BYTES_BUFFERING_CONFIG, we should be replacing it with the new config. We just add @SuppressWarnings in the places where we reference it only for compatibility.
>   2. For BUFFERED_RECORDS_PER_PARTITION_CONFIG, similarly we use @SuppressWarnings in the code where we still need to reference it for backward compatibility, like [here](https://github.com/apache/kafka/pull/11424/files#diff-0e5e608831150c058e2ad1b45d38ad941739562588ec0fdb97cc9f742919fb1fR119) in TopologyConfig.
>
> Does that sound good to you? If yes we can go ahead and make it done.

Thanks @guozhangwang. I added that annotation wherever applicable. Now the only thing pending is the renaming of the total-bytes metric :)

@guozhangwang
Contributor

@vamossagar12 seems Jenkins still fails, but this time due to 2 SpotBugs violations.

@ableegoldman
Member

@cadonna suggested renaming the metric to input-buffer-bytes-total, which makes sense to me as well

@vamossagar12
Contributor Author

@guozhangwang, I fixed that. Hopefully this time we won't see errors related to this PR; I will also check it. Also, @ableegoldman, thanks for confirming the metric name. I have updated the PR and also sent out an update on the KIP.

@guozhangwang
Contributor

I'm fine with input-buffer-bytes-total too. Thanks!

BTW I thought we had a similar metric for the store cache, but after checking I realize we actually do not have one yet :P Maybe we can add one later.

@guozhangwang
Contributor

There are still some failures in Jenkins, I'm gonna retrigger them.

If they still fail, @vamossagar12, please take a look and see if they are relevant.

@vamossagar12
Contributor Author

> > and these were the ones that had failed on my local as well. I am assuming these aren't relevant to this change. WDYT?
>
> Hmm.. do you mean these tests even fail on trunk? If they fail on your branch, and also here, they are likely to be relevant, right?

@guozhangwang, sorry, it was an oversight on my part. Looking at the names, it wasn't evident that these were related to this PR, but apparently they were :D

I ran the tests twice locally and there are 3-4 which fail intermittently due to timeout errors. When I ran them again afterwards, they passed. Let's see what happens in this run.

@guozhangwang
Contributor

> I ran the tests twice locally and there are 3-4 which fail intermittently due to timeout errors. When I ran them afterwards, even they seemed to have passed. Let's see what happens in this run.

Cool. Thanks @vamossagar12

@vamossagar12
Contributor Author

> > I ran the tests twice locally and there are 3-4 which fail intermittently due to timeout errors. When I ran them afterwards, even they seemed to have passed. Let's see what happens in this run.
>
> Cool. Thanks @vamossagar12

Looks like the tests passed this time. The ARM build has some issue with
[2022-01-27T18:31:03.652Z] > Task :storage:unitTest FAILED
while other JDK-based builds show the tests as green. A couple of examples of failed tests:

[2022-01-27T18:41:55.727Z] ConsumerBounceTest > testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup() FAILED
[2022-01-27T18:41:55.727Z]     org.opentest4j.AssertionFailedError: The remaining consumers in the group could not fetch the expected records
[2022-01-27T18:41:55.727Z]         at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:39)
[2022-01-27T18:41:55.727Z]         at org.junit.jupiter.api.Assertions.fail(Assertions.java:117)
[2022-01-27T18:41:55.727Z]         at kafka.api.ConsumerBounceTest.testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup(ConsumerBounceTest.scala:331)

in JDK 8, and

[2022-01-27T19:49:17.737Z] > Task :core:integrationTest
[2022-01-27T19:49:17.737Z] kafka.api.PlaintextAdminIntegrationTest.testReplicaCanFetchFromLogStartOffsetAfterDeleteRecords() failed, log available in /home/jenkins/workspace/Kafka_kafka-pr_PR-11424/core/build/reports/testOutput/kafka.api.PlaintextAdminIntegrationTest.testReplicaCanFetchFromLogStartOffsetAfterDeleteRecords().test.stdout
[2022-01-27T19:49:17.737Z] 
[2022-01-27T19:49:17.737Z] PlaintextAdminIntegrationTest > testReplicaCanFetchFromLogStartOffsetAfterDeleteRecords() FAILED
[2022-01-27T19:49:17.737Z]     org.opentest4j.AssertionFailedError: Expected follower to catch up to log end offset 200
[2022-01-27T19:49:17.737Z]         at app//org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:39)
[2022-01-27T19:49:17.737Z]         at app//org.junit.jupiter.api.Assertions.fail(Assertions.java:117)
[2022-01-27T19:49:17.737Z]         at app//kafka.api.PlaintextAdminIntegrationTest.waitForFollowerLog$1(PlaintextAdminIntegrationTest.scala:730)
[2022-01-27T19:49:17.737Z]         at app//kafka.api.PlaintextAdminIntegrationTest.testReplicaCanFetchFromLogStartOffsetAfterDeleteRecords(PlaintextAdminIntegrationTest.scala:760)

in JDK 17. Do you think it's better now?

@vamossagar12
Contributor Author

Also, I created a new JIRA https://issues.apache.org/jira/browse/KAFKA-13624 for adding a new metric for the cache size. I am assuming it needs a KIP and hence added the needs-kip tag to it.

@guozhangwang
Contributor

I checked the jenkins failures and they are irrelevant indeed this time. Merging to trunk now.

Thank you so much @vamossagar12 for the great contribution! Please feel free to update the JIRA ticket and the KIP wiki as well.

@guozhangwang guozhangwang merged commit 14c6030 into apache:trunk Jan 28, 2022
@vamossagar12
Contributor Author

> I checked the jenkins failures and they are irrelevant indeed this time. Merging to trunk now.
>
> Thank you so much @vamossagar12 for the great contribution! Please feel free to update the JIRA ticket and the KIP wiki as well.

Thanks @guozhangwang and @ableegoldman for the support on this one! Also, this ticket => https://issues.apache.org/jira/browse/KAFKA-13624, do you think it makes sense? Do we need a KIP for this?

    }
} else {
-    cacheSize = globalAppConfigs.getLong(CACHE_MAX_BYTES_BUFFERING_CONFIG);
+    cacheSize = globalAppConfigs.getLong(STATESTORE_CACHE_MAX_BYTES_CONFIG);
Member

Hey guys, sorry I didn't get around to doing another full review of this earlier. This is a bug, we should always check whether the old deprecated config has been set and use that value if it was set and the new config was not.

I noticed something similar in the TopologyTestDriver, although I'm not sure how much of an effect that would have since AFAICT there's not really "caching" in the TTD -- at least, there's not supposed to be according to the javadocs

Contributor Author

@ableegoldman , I have added that check. The line above is in the else block when neither of the 2 configs are set. Here's the complete block of code =>

if (isTopologyOverride(STATESTORE_CACHE_MAX_BYTES_CONFIG, topologyOverrides) ||
                isTopologyOverride(CACHE_MAX_BYTES_BUFFERING_CONFIG, topologyOverrides)) {

            if (isTopologyOverride(STATESTORE_CACHE_MAX_BYTES_CONFIG, topologyOverrides) && isTopologyOverride(CACHE_MAX_BYTES_BUFFERING_CONFIG, topologyOverrides)) {
                cacheSize = getLong(STATESTORE_CACHE_MAX_BYTES_CONFIG);
                log.info("Topology {} is using both {} and deprecated config {}. overriding {} to {}",
                        topologyName,
                        STATESTORE_CACHE_MAX_BYTES_CONFIG,
                        CACHE_MAX_BYTES_BUFFERING_CONFIG,
                        STATESTORE_CACHE_MAX_BYTES_CONFIG,
                        cacheSize);
            } else if (isTopologyOverride(CACHE_MAX_BYTES_BUFFERING_CONFIG, topologyOverrides)) {
                cacheSize = getLong(CACHE_MAX_BYTES_BUFFERING_CONFIG);
                log.info("Topology {} is using deprecated config {}. overriding {} to {}", topologyName, CACHE_MAX_BYTES_BUFFERING_CONFIG, CACHE_MAX_BYTES_BUFFERING_CONFIG, cacheSize);
            } else {
                cacheSize = getLong(STATESTORE_CACHE_MAX_BYTES_CONFIG);
                log.info("Topology {} is overriding {} to {}", topologyName, STATESTORE_CACHE_MAX_BYTES_CONFIG, cacheSize);
            }
        } else {
            cacheSize = globalAppConfigs.getLong(STATESTORE_CACHE_MAX_BYTES_CONFIG);
        }

Am I missing something here?

Member

Ah, I see the confusion. The #isTopologyOverride method checks whether the config has been overridden for the specific topology, ie has been set in the Properties passed in to StreamsBuilder#build -- it's not looking at what we call the globalAppConfigs which are the actual application configs: ie those passed in to the KafkaStreams constructor.

So basically there are two sets of configs. The value should be taken as the first of these to be set by the user, in the following order:

  1. statestore.cache.max.bytes in topologyOverrides
  2. cache.max.bytes.buffering in topologyOverrides
  3. statestore.cache.max.bytes in globalAppConfigs
  4. cache.max.bytes.buffering in globalAppConfigs

Essentially, use #getTotalCacheSize on the topologyOverrides if either of them is set (which this PR is doing) and on the globalAppConfigs if they are not (which is the regression here).
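The four-level precedence described above could be sketched as follows. This is a toy helper, not the actual Kafka Streams implementation; only the config name strings are the real ones, and the maps stand in for the two config objects.

```java
import java.util.Map;

public class CacheSizeResolver {
    static final String NEW = "statestore.cache.max.bytes";
    static final String OLD = "cache.max.bytes.buffering";

    // Returns the first value set by the user, in the order:
    // topology NEW > topology OLD > global NEW > global OLD > default.
    static long resolve(Map<String, Long> topologyOverrides,
                        Map<String, Long> globalAppConfigs,
                        long defaultSize) {
        if (topologyOverrides.containsKey(NEW)) return topologyOverrides.get(NEW);
        if (topologyOverrides.containsKey(OLD)) return topologyOverrides.get(OLD);
        if (globalAppConfigs.containsKey(NEW)) return globalAppConfigs.get(NEW);
        if (globalAppConfigs.containsKey(OLD)) return globalAppConfigs.get(OLD);
        return defaultSize;
    }
}
```

Note that a topology-level override of the deprecated config still beats an app-level setting of the new config; the regression was falling through to only the new config at the global level.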

On that note -- we also need to move #getTotalCacheSize out of StreamsConfig, because it's a public class and the method wasn't listed as a public API in the KIP (nor should it be, imo). I recommend creating a new static utility class for things like this, e.g. StreamsConfigUtils in the org.apache.kafka.streams.internals package. There are some other methods that would belong there; for example, the StreamThread methods #processingMode and #eosEnabled should be moved as well.

Hope that all makes sense -- and lmk if you don't think you'll have the time to put out a full patch, and I or another Streams dev can help out 🙂

Contributor Author

thanks @ableegoldman! i am running slightly occupied currently. But, I will make the changes in the next few days. As you said, i will introduce a new utility class and move these methods out.

Member

Sounds good! There's no rush, but I'll make sure we have your new PRs reviewed and merged quickly whenever they are ready, since you've worked so hard on this already. I'm sorry I wasn't able to make another pass on your original PR, but hopefully this won't be too much of a bother.

Contributor Author

Sorry just catching up on this comment @ableegoldman . That's perfectly fine!

final ThreadCache cache = new ThreadCache(
    logContext,
-    Math.max(0, streamsConfig.getLong(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG)),
+    Math.max(0, streamsConfig.getLong(StreamsConfig.STATESTORE_CACHE_MAX_BYTES_CONFIG)),
Member

Ditto to this comment, we can't just read off the new statestore.cache.max.bytes without checking whether the old cache.max.bytes.buffering was set

Contributor Author

Oh, here I thought that since it's a test class it shouldn't really matter. Isn't that the case?

Member

> since it's a test case so it shouldn't really matter. Isn't that the case?

Well, if someone is using the TTD to write unit tests and those tests start to fail after they upgrade because the caching is different, I would say that's a compatibility change.

Although I read the TTD's javadocs earlier and remembered that it actually processes records synchronously, which effectively means that the only thing that matters/affects the TTD results is whether the cache size is non-zero or has been set to 0 -- and setting it to 0 only matters if it's correctly set to 0 in the TopologyConfig, not the value here. Which is a long way of saying that, in hindsight, this config/bug doesn't really impact anything after all. In fact, imho we should probably just hard-code the TTD's ThreadCache size to 0 -- but let's not wrap that change into an already rather large PR in case there's something I'm not taking into account here.

So tl;dr, for future reference we do still need to maintain backwards compatibility in the TTD since it's part of the public interface. But it just so happens that this particular bug doesn't actually break anything "real" or have any visible impact (at least AFAICT).

Contributor Author

Got it. If I understood correctly, we need a new PR with all the changes in this PR plus the new fixes, along with documentation changes, right?

Contributor

> got it.. If i understood correctly, we need a new PR with all the changes in this PR and the new ones along with document changes, right?

That sounds right to me.

@mjsax
Member

mjsax commented Feb 1, 2022

Seems this PR breaks backward compatibility. Will need to revert it for now. Sorry.

@vamossagar12 -- Can you do a new PR containing the commit plus the required fixes? Happy to help reviewing if necessary.

mjsax added a commit that referenced this pull request Feb 1, 2022
…nput.buffer.max.bytes" (#11424)"

This reverts commit 14c6030.
Reason: Implementation breaks backward compatibility
@mjsax
Member

mjsax commented Feb 1, 2022

Btw: this PR does not update the docs -- we should do the docs updates, too. (Also ok to do in a follow-up PR.)

To ensure backward compatibility, it might also be good to split the actual PR into two: first do the internal config change without rewriting any tests -- this should ensure that no existing test breaks. In a follow-up PR we update the tests to use the new config.

And we should add a test that explicitly exercises the old config, to verify backward compatibility.

@guozhangwang
Contributor

@ableegoldman @mjsax My read on the code is that we only need to change the TopologyTestDriver, while the first place seems fine to me. Did I miss anything?

@mjsax
Member

mjsax commented Feb 3, 2022

I did not dig into the details myself. Anyway, might be better to discuss on the new PR?

@ableegoldman
Member

> My read on the code is that we only need to change the TopologyTestDriver, while the first place seems fine to me. Did I miss anything?

If by "first place" you mean the bug in the TopologyConfig class, then no, other way around actually. The TopologyConfig bug is the one that actually breaks compatibility; the TTD one doesn't really do anything -- if I understand the TTD correctly. See this

@guozhangwang
Contributor

> If by "first place" you mean #11424 (comment), then no, other way around actually. The TopologyConfig bug is the one that actually breaks compatibility, the TTD one actually doesn't really do anything -- if I understand the TTD correctly. See #11424 (comment)

Crystal! Thanks for the clarification @ableegoldman

// and then resize them later
streamThread = createAndAddStreamThread(0L, 0L, threadIdx);
final int numLiveThreads = getNumLiveStreamThreads();
resizeThreadCacheAndBufferMemory(numLiveThreads + 1);
Member

One more thing -- @wcarlson5 mentioned there was an off-by-one error here due to the re-ordering of these calls. Specifically, the + 1 in this line was necessary before now because we called resizeThreadCache before actually adding the new thread, so we had to account for the new thread by adding one. But since we now create/add the new thread first, the getNumLiveStreamThreads method will actually return the correct number of threads, so we don't need the + 1 anymore.

On that note, I take it we reordered these calls because we now create the thread without the cache value and then call resize to set the cache after the thread has already been created. I was wondering: why do we need to do this post-construction resizing? I only looked at this part of the PR briefly, but it seems to me like we always have the actual cache size known when we're creating the thread, so can't we just pass that in to the StreamThread#create method/constructor? It's just a bit confusing to initialize the cache size to some random value, it took me a little while to figure out what was going on with that
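The off-by-one described above can be illustrated with a toy model. This is a hedged sketch, not the actual KafkaStreams code: the method names mirror the PR's, but the class, fields, and thread representation are hypothetical. Because the new thread is added before the resize, the live-thread count already includes it and no `+ 1` is needed.

```java
import java.util.ArrayList;
import java.util.List;

public class CacheResizeSketch {
    final List<String> threads = new ArrayList<>();
    final long totalCacheBytes;
    long perThreadCacheBytes;

    CacheResizeSketch(long totalCacheBytes) {
        this.totalCacheBytes = totalCacheBytes;
    }

    void addStreamThread() {
        threads.add("thread-" + threads.size()); // the thread is added first...
        resizeThreadCache(threads.size());       // ...so no "+ 1" is needed here
    }

    // Evenly split the total cache across the currently live threads.
    void resizeThreadCache(int numLiveThreads) {
        perThreadCacheBytes = totalCacheBytes / numLiveThreads;
    }
}
```

With the old ordering (resize before add), passing `threads.size() + 1` was correct; keeping the `+ 1` after reordering would divide the cache by one thread too many.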

Contributor

> I was wondering: why do we need to do this post-construction resizing?

The main motivation is to consolidate the resizing of the thread cache and the buffer within a single call. More details can be found in this comment thread: #11424 (comment). I suggested we initialize the new thread with a value of 0 -- it should not be a random value -- and then resize (at that point we have the correct number of threads to divide by).

@mjsax
Member

mjsax commented Feb 10, 2022

@vamossagar12 -- Any update from your side about opening a new PR?

@vamossagar12
Contributor Author

@mjsax not yet. I haven't found the time to pick this up. Will send an update soon.

ableegoldman pushed a commit that referenced this pull request Oct 21, 2022
PR implementing KIP-770 (#11424) was reverted as it brought in a regression wrt pausing/resuming the consumer. That KIP also introduced a change to deprecate config CACHE_MAX_BYTES_BUFFERING_CONFIG and replace it with STATESTORE_CACHE_MAX_BYTES_CONFIG.

Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>
guozhangwang pushed a commit to guozhangwang/kafka that referenced this pull request Jan 25, 2023
PR implementing KIP-770 (apache#11424) was reverted as it brought in a regression wrt pausing/resuming the consumer. That KIP also introduced a change to deprecate config CACHE_MAX_BYTES_BUFFERING_CONFIG and replace it with STATESTORE_CACHE_MAX_BYTES_CONFIG.

Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>
@guozhangwang
Contributor

Thanks for continuing to look into this issue and trying to get it merged. After chatting with you offline I thought a bit more about the pause/resume logic. Besides the additional logic of:

  1. pause all non-empty partitions when the global bytes threshold is exceeded;
  2. resume all paused partitions when below the global bytes threshold.

We also do the following:

  1. modify the single-partition resume logic (a.k.a. https://github.com/apache/kafka/pull/11424/files#diff-a76674468cda8772230fb8411717cf9068b1a363a792f32c602fb2ec5ba9efd7R722) so that if the partition buffer becomes empty after the last record is retrieved, we resume it if it was paused. I.e. we should remove the TODO since this logic would not be removed, and also not condition on maxBufferedSize != -1.
  2. deprecate the single-partition pause logic (a.k.a. https://github.com/apache/kafka/pull/11424/files#diff-a76674468cda8772230fb8411717cf9068b1a363a792f32c602fb2ec5ba9efd7R977), just like what you did previously, since we would still remove it in the future.

cc @mjsax @vvcephei who also looked into this.
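The global pause/resume rules sketched above could be modeled as follows. This is a toy model under stated assumptions, not the actual StreamThread code: the class, field names, and byte accounting are all hypothetical, and real partitions would be TopicPartition objects rather than strings.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class BufferBackpressure {
    final long maxBufferBytes;                 // analogous to input.buffer.max.bytes
    final Map<String, Long> bufferedBytesByPartition = new HashMap<>();
    final Set<String> paused = new HashSet<>();

    BufferBackpressure(long maxBufferBytes) {
        this.maxBufferBytes = maxBufferBytes;
    }

    long totalBufferedBytes() {
        return bufferedBytesByPartition.values().stream().mapToLong(Long::longValue).sum();
    }

    void onRecordsBuffered(String partition, long bytes) {
        bufferedBytesByPartition.merge(partition, bytes, Long::sum);
        if (totalBufferedBytes() > maxBufferBytes) {
            // rule 1: pause every partition that still has buffered data
            bufferedBytesByPartition.forEach((p, b) -> { if (b > 0) paused.add(p); });
        }
    }

    void onRecordsDrained(String partition, long bytes) {
        bufferedBytesByPartition.merge(partition, -bytes, Long::sum);
        // rule 2: resume everything once back under the global threshold
        if (totalBufferedBytes() <= maxBufferBytes) paused.clear();
    }
}
```

The single-partition resume tweak discussed above would additionally resume a paused partition as soon as its own buffer drains to empty, even if the global total is still above the threshold.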

rutvijmehta-harness pushed a commit to rutvijmehta-harness/kafka that referenced this pull request Feb 9, 2024
PR implementing KIP-770 (apache#11424) was reverted as it brought in a regression wrt pausing/resuming the consumer. That KIP also introduced a change to deprecate config CACHE_MAX_BYTES_BUFFERING_CONFIG and replace it with STATESTORE_CACHE_MAX_BYTES_CONFIG.

Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>
rutvijmehta-harness added a commit to rutvijmehta-harness/kafka that referenced this pull request Feb 9, 2024
PR implementing KIP-770 (apache#11424) was reverted as it brought in a regression wrt pausing/resuming the consumer. That KIP also introduced a change to deprecate config CACHE_MAX_BYTES_BUFFERING_CONFIG and replace it with STATESTORE_CACHE_MAX_BYTES_CONFIG.

Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>

Co-authored-by: vamossagar12 <sagarmeansocean@gmail.com>