Kinesis adaptive memory management #15360
Conversation
For estimation purposes, Druid uses a figure of 10 KB for regular records and 1 MB for [aggregated records](#deaggregation).
- `maxRecordsPerPoll`: 100 for regular records, 1 for [aggregated records](#deaggregation).
- `recordBufferSizeBytes`: 100 MB or an estimated 10% of available heap, whichever is smaller.
- `maxRecordsPerPoll`: 1.
Should this be higher? I wonder if this is too low in the case of non-aggregated records.
I wondered the same, actually. To be honest, I'm not sure. I think validating this requires extensive performance testing.
Changed it so that it polls at least one record, and at most 1_000_000 bytes when there is more than one record, which is what we were targeting before.
So does that mean we should update the maxRecordsPerPoll: 1 here?
);
int maxFetchThreads = Math.max(
    1,
    (int) (memoryToUse / 10_000_000L)
nit: maybe use a constant for the 10 MB limit, with a comment explaining that the limit comes from the Kinesis library.
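A minimal sketch of what that suggestion could look like. The constant name mirrors the `GET_RECORDS_MAX_BYTES_PER_CALL` that appears later in the diff, and `memoryToUse` is the value from the surrounding code above; the exact comment text and placement are illustrative, not the PR's final code:

```java
// GetRecords returns at most 10 MB per call, per the Kinesis service limits:
// https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html
private static final long GET_RECORDS_MAX_BYTES_PER_CALL = 10_000_000L;

int maxFetchThreads = Math.max(
    1,
    (int) (memoryToUse / GET_RECORDS_MAX_BYTES_PER_CALL)
);
```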
Boolean resetOffsetAutomatically,
Boolean skipSequenceNumberAvailabilityCheck,
Integer recordBufferSize,
@Nullable Integer recordBufferSizeBytes,
Do you think it'd make sense to log a warning if the eliminated property is provided?
Good thought, will add.
scheduleBackgroundFetch(recordBufferFullWait);
return;
recordBufferOfferWaitMillis = recordBufferFullWait;
How come the shardIterator doesn't need to be reset here as before?
Previously, when the record buffer was full here, the fetchRecords logic threw away the rest of the GetRecords result after recordBufferOfferTimeout and started a new shard iterator. This seemed excessively churny. Instead, we now wait an unbounded amount of time for the queue to stop being full. If the queue remains full, we'll end up right back waiting for it after the restarted fetch.
Class<?> kclUserRecordclass = Class.forName("com.amazonaws.services.kinesis.clientlibrary.types.UserRecord");
MethodHandles.Lookup lookup = MethodHandles.publicLookup();
try {
  Class<?> kclUserRecordclass = Class.forName("com.amazonaws.services.kinesis.clientlibrary.types.UserRecord");
Are the points about the licensing above still correct? Looks like amazon-kinesis-client is Apache licensed now: https://github.com/awslabs/amazon-kinesis-client/blob/master/LICENSE.txt
Removed the licensing comments.
Oh, it even looks like since #12370, amazon-kinesis-client with an Apache license is a regular dependency. So this reflective stuff is no longer needed. Please either rewrite it to use regular Java calls, or if you don't rewrite it, include a comment describing the situation. Something like:
The deaggregate function is implemented by the amazon-kinesis-client, whose license was formerly not compatible with Apache. The code here avoids the license issue by using reflection, but is no longer necessary since amazon-kinesis-client is now Apache-licensed and is now a dependency of Druid. This code could safely be modified to use regular calls rather than reflection.
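For reference, a hedged sketch of what the non-reflective version being asked for might look like, assuming the KCL 1.x static helper `UserRecord.deaggregate(List<Record>)`; this is illustrative only and is not the PR's actual change:

```java
import com.amazonaws.services.kinesis.clientlibrary.types.UserRecord;
import com.amazonaws.services.kinesis.model.Record;

import java.util.List;

class DeaggregationExample
{
  // Call the KCL deaggregation helper directly instead of going through MethodHandles.
  static List<UserRecord> deaggregate(List<Record> records)
  {
    return UserRecord.deaggregate(records);
  }
}
```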
.forEachOrdered(newQ::offer);
.filter(x -> !partitions.contains(x.getData().getStreamPartition()))
.forEachOrdered(x -> {
  if (!newQ.offer(x)) {
Is this a new failure mode? What would've happened in the old code if the queue size was exceeded?
It is a new failure mode. I believe that if the data was not added here, it could have resulted in data loss. Any other suggestions here? I was a little concerned about this too, but I think potential data loss is worse.
Added a comment saying that this shouldn't really happen in practice, but the check is there for safety.
Tagged "release notes" since various memory-related configs are changed.
records.drain(
    polledRecords,
    expectedSize,
    MAX_BYTES_PER_POLL,
It looks like `maxRecordsPerPoll` isn't doing anything anymore. Is that right? If so, let's get rid of it.
Removed, and added `maxBytesPerPoll`, which is used instead now.
.filter(x -> !partitions.contains(x.getData().getStreamPartition()))
.forEachOrdered(x -> {
  if (!newQ.offer(x)) {
    // this should never really happen in practice but adding check here for safety.
Checks that should never happen, but are there for safety, should use `DruidException.defensive`.
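A minimal sketch of that suggestion applied to the diff above, assuming Druid's `DruidException.defensive(String, Object...)` factory as the reviewer suggests; the message text is illustrative:

```java
if (!newQ.offer(x)) {
  // Should never happen in practice: the new queue is sized to hold everything we retain.
  throw DruidException.defensive("Could not add record [%s] to the record buffer", x);
}
```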
// If the buffer was full and we weren't able to add the message, grab a new stream iterator starting
// from this message and back off for a bit to let the buffer drain before retrying.
if (!records.offer(currRecord, recordBufferOfferTimeout, TimeUnit.MILLISECONDS)) {
The comment above is no longer accurate -- we aren't grabbing new stream iterators anymore when the buffer is full.
);
this.useListShards = useListShards;
this.awsCredentialsConfig = awsCredentialsConfig;
if (tuningConfig.getRecordBufferSizeConfigured() != null) {
Please move these two checks to run() rather than the constructor, because we don't need to log this stuff every time a task object is constructed. (That happens at various points on the Overlord due to various API calls and internal machinations, and would create a lot of log spam.)
Good catch. Moved.
(int) (memoryToUse / GET_RECORDS_MAX_BYTES_PER_CALL)
);
if (fetchThreads > maxFetchThreads) {
  log.warn("fetchThreads [%d] being lowered to [%d]", fetchThreads, maxFetchThreads);
This warning should only get logged if `configuredFetchThreads != null`. There's no reason to log it if `runtimeInfo.getAvailableProcessors() * 2` is lower than `maxFetchThreads`.
Good catch, updated.
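A hedged sketch of the guarded warning being discussed, assuming `configuredFetchThreads` holds the user-supplied value (null when unset) and that the default is `runtimeInfo.getAvailableProcessors() * 2`; the names follow the comment above, but the exact code is illustrative rather than the PR's final version:

```java
int fetchThreads = configuredFetchThreads != null
                   ? configuredFetchThreads
                   : runtimeInfo.getAvailableProcessors() * 2;

if (fetchThreads > maxFetchThreads) {
  // Only warn when the user explicitly asked for more threads than the memory-based cap allows.
  if (configuredFetchThreads != null) {
    log.warn("fetchThreads [%d] being lowered to [%d]", configuredFetchThreads, maxFetchThreads);
  }
  fetchThreads = maxFetchThreads;
}
```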
| @JsonProperty("awsAssumedRoleArn") String awsAssumedRoleArn, | ||
| @JsonProperty("awsExternalId") String awsExternalId, | ||
| @Nullable @JsonProperty("autoScalerConfig") AutoScalerConfig autoScalerConfig, | ||
| @JsonProperty("deaggregate") boolean deaggregate |
The `recordsPerFetch` and `deaggregate` properties should stay here for better compatibility during rolling updates and rollbacks. (We don't want to lose track of them prior to a potential rollback.)
So let's instead mark them deprecated, but keep them.
Added back and marked as deprecated.
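A rough sketch of keeping the deprecated properties around for serialization compatibility, assuming a Jackson-based config class like the one in the diff above; the class name and field layout here are illustrative, not the PR's exact code:

```java
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;

import javax.annotation.Nullable;

class ExampleKinesisIoConfig
{
  // Retained only so configs written by older versions still round-trip during
  // rolling updates and rollbacks; no longer used by the fetch logic.
  @Deprecated
  private final Integer recordsPerFetch;
  @Deprecated
  private final boolean deaggregate;

  @JsonCreator
  ExampleKinesisIoConfig(
      @Nullable @JsonProperty("recordsPerFetch") Integer recordsPerFetch,
      @JsonProperty("deaggregate") boolean deaggregate
  )
  {
    this.recordsPerFetch = recordsPerFetch;
    this.deaggregate = deaggregate;
  }
}
```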
records.drain(
    polledRecords,
    expectedSize,
    maxBytesPerPoll,
What happens if a single record is larger than `maxBytesPerPoll`? Would this get stuck and make no progress?
Good question: it always drains at least one record; clarified that in the docs. I added a test for this, see org.apache.druid.java.util.common.MemoryBoundLinkedBlockingQueueTest#test_drain_queueWithFirstItemSizeGreaterThanLimit_succeeds.
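To illustrate the behavior described here (this is not the actual `MemoryBoundLinkedBlockingQueue` implementation), a byte-bounded drain that always removes at least one element, so a single record larger than the byte limit cannot stall the poll:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class ByteBoundedDrainExample
{
  interface Sized
  {
    long sizeBytes();
  }

  // Drain up to maxBytes worth of items, but always take at least one so that a single
  // item larger than maxBytes still makes progress.
  static <T extends Sized> List<T> drain(Queue<T> queue, long maxBytes)
  {
    final List<T> out = new ArrayList<>();
    long bytes = 0;
    T head;
    while ((head = queue.peek()) != null) {
      if (!out.isEmpty() && bytes + head.sizeBytes() > maxBytes) {
        break;
      }
      out.add(queue.poll());
      bytes += head.sizeBytes();
    }
    return out;
  }
}
```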
Description
Our Kinesis consumer works by using the GetRecords API in some number of `fetchThreads`, each fetching some number of records (`recordsPerFetch`) and each inserting into a shared buffer that can hold a `recordBufferSize` number of records. The logic is described in our documentation at: https://druid.apache.org/docs/27.0.0/development/extensions-core/kinesis-ingestion/#determine-fetch-settings

There is a problem with the logic that this PR fixes: the memory limits rely on a hard-coded "estimated record size" that is 10 KB if `deaggregate: false` and 1 MB if `deaggregate: true`. There have been cases where a supervisor had `deaggregate: true` set even though it wasn't needed, leading to under-utilization of memory and poor ingestion performance.

Users don't always know if their records are aggregated or not. Also, even if they could figure it out, it's better not to have to. So we'd like to eliminate the `deaggregate` parameter, which means we need to do memory management more adaptively based on the actual record sizes.

We take advantage of the fact that GetRecords doesn't return more than 10 MB (https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html):
This PR:

- Eliminates `recordsPerFetch`; always use the max limit of 10,000 records (the default limit if not set).
- Eliminates `deaggregate`; always have it true.
- Caps `fetchThreads` to ensure that if each fetch returns the max (10 MB) then we don't exceed our budget (100 MB or 5% of heap). In practice this means `fetchThreads` will never be more than 10. Tasks usually don't have that many processors available to them anyway, so in practice I don't think this will change the number of threads for too many deployments.
- Adds `recordBufferSizeBytes` as a bytes-based limit rather than a records-based limit for the shared queue. We do know the byte size of Kinesis records by this point. Default should be 100 MB or 10% of heap, whichever is smaller.
- Adds `maxBytesPerPoll` as a bytes-based limit for how much data we poll from the shared buffer at a time. Default is 1,000,000 bytes.
- Deprecates `recordBufferSize`; use `recordBufferSizeBytes` instead. A warning is logged if `recordBufferSize` is specified.
- Deprecates `maxRecordsPerPoll`; use `maxBytesPerPoll` instead. A warning is logged if `maxRecordsPerPoll` is specified.
- Fixes an issue where, when the record buffer is full, the fetchRecords logic throws away the rest of the GetRecords result after `recordBufferOfferTimeout` and starts a new shard iterator. This seemed excessively churny. Instead, wait an unbounded amount of time for the queue to stop being full. If the queue remains full, we'll end up right back waiting for it after the restarted fetch.
- There was also a call to `newQ::offer` without a check in `filterBufferAndResetBackgroundFetch`, which seemed like it could cause data loss. Now checking the return value here, and failing if false.
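A hedged sketch of how the new defaults described above fit together. The constants and derivations follow the numbers in this description, but the method and variable names are illustrative and may differ from Druid's actual code:

```java
class KinesisDefaultsSketch
{
  static void printDefaults()
  {
    // Assumed defaults, per the description above (illustrative, not Druid's exact code).
    final long heapBytes = Runtime.getRuntime().maxMemory();

    // recordBufferSizeBytes: 100 MB or 10% of heap, whichever is smaller.
    final long recordBufferSizeBytes = Math.min(100_000_000L, (long) (heapBytes * 0.10));

    // Fetch memory budget: 100 MB or 5% of heap, whichever is smaller.
    final long fetchMemoryBudget = Math.min(100_000_000L, (long) (heapBytes * 0.05));

    // Each GetRecords call can return at most 10 MB, so cap fetchThreads by the budget
    // (which works out to at most 10 threads in practice).
    final int maxFetchThreads = Math.max(1, (int) (fetchMemoryBudget / 10_000_000L));

    // maxBytesPerPoll: bytes-based limit on each poll from the shared buffer.
    final long maxBytesPerPoll = 1_000_000L;

    System.out.printf(
        "recordBufferSizeBytes=%d, maxFetchThreads=%d, maxBytesPerPoll=%d%n",
        recordBufferSizeBytes, maxFetchThreads, maxBytesPerPoll
    );
  }
}
```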
Release Note

Kinesis ingestion memory tuning config has been greatly simplified, and a more adaptive approach is now taken for the configuration. Here is a summary of the changes made:

- Eliminates `recordsPerFetch`; always use the max limit of 10,000 records (the default limit if not set).
- Eliminates `deaggregate`; always have it true.
- Caps `fetchThreads` to ensure that if each fetch returns the max (10 MB) then we don't exceed our budget (100 MB or 5% of heap). In practice this means `fetchThreads` will never be more than 10. Tasks usually don't have that many processors available to them anyway, so in practice I don't think this will change the number of threads for too many deployments.
- Adds `recordBufferSizeBytes` as a bytes-based limit rather than a records-based limit for the shared queue. We do know the byte size of Kinesis records by this point. Default should be 100 MB or 10% of heap, whichever is smaller.
- Adds `maxBytesPerPoll` as a bytes-based limit for how much data we poll from the shared buffer at a time. Default is 1,000,000 bytes.
- Deprecates `recordBufferSize`; use `recordBufferSizeBytes` instead. A warning is logged if `recordBufferSize` is specified.
- Deprecates `maxRecordsPerPoll`; use `maxBytesPerPoll` instead. A warning is logged if `maxRecordsPerPoll` is specified.