[Groupby Query Metrics] Add merge buffer tracking #18731
GWphua merged 35 commits into apache:master
Conversation
if (perQueryStats.getMergeBufferAcquisitionTimeNs() > 0) {
  mergeBufferQueries++;
  mergeBufferAcquisitionTimeNs += perQueryStats.getMergeBufferAcquisitionTimeNs();
  mergeBufferTotalUsage += perQueryStats.getMergeBufferTotalUsage();
@GWphua Instead of summing here, what do you think about taking the max? Then the metric emitted would be the max merge buffer usage of a single query in that emission period. This would be a good signal for operators on whether they need to tweak the mergeBuffer size.
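The difference between the two collation strategies can be sketched as follows. This is a hypothetical illustration, not the actual Druid code; the class and method names (EmissionPeriodStats, collate) are made up for the example:

```java
// Hypothetical sketch: collating per-query merge buffer usage into an
// emission-period aggregate, showing sum vs. max side by side.
public class EmissionPeriodStats
{
  private long mergeBufferTotalUsage = 0;  // sum across all queries
  private long mergeBufferMaxUsage = 0;    // max of any single query

  public void collate(long perQueryUsage)
  {
    // Summing answers "how much buffer did all queries use in total?"
    mergeBufferTotalUsage += perQueryUsage;
    // Taking the max answers "what is the most a single query needed?",
    // which is the signal for tweaking the mergeBuffer size.
    mergeBufferMaxUsage = Math.max(mergeBufferMaxUsage, perQueryUsage);
  }

  public long getTotalUsage()
  {
    return mergeBufferTotalUsage;
  }

  public long getMaxUsage()
  {
    return mergeBufferMaxUsage;
  }
}
```

With three queries using 100, 300, and 200 bytes, the sum reports 600 while the max reports 300; only the latter tells an operator whether a single buffer was large enough.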
I was thinking about this.
There are other metrics where taking MAX will also make sense --
spilledBytes --> How much storage would be good to configure?
dictionarySize --> How large can the merge dictionary size get?
I am considering adding max variants of these metrics (maxSpilledBytes, maxDictionarySize, maxAcquisitionTimeNs). What do you think?
Yeah agreed. I do think it makes sense to have it for those 3 metrics
Even for mergeBuffer/acquisitionTimeNs I think there's value in having the max, as it gives operators a signal on whether to increase numMergeBuffers
Thanks for adding those max metrics @GWphua! What do you think about adding
Hi @aho135, thanks for the review! I also find that it will be very helpful to emit metrics for each query, so we know which query takes up a lot of resources. In our version of Druid, we simply appended each of the
Alternatively, we can look into migrating the groupBy query metrics in
We can do more of this in a separate PR.
Sounds good @GWphua, I was thinking along very similar lines to emit these from
I have a first draft on this: aho135@9f82091
Hi @aho135, since the scope of adding
I have a draft for
Hi @gianm, I would appreciate a review/feedback on this PR. Thanks!
abhishekrb19
left a comment
@GWphua, thanks for the improved observability and @aho135 for the helpful max metric suggestions! Overall, the changes look good to me - I will take a closer look at the grouper changes soon.
Checkpointing my review on the GroupByStatsProvider, docs and some test suggestions. Please let me know what you think.
@abhishekrb19, thanks for the review! I have made changes according to your suggestions.
Thanks @GWphua! I tried to take a closer look at the grouper changes - please see my latest comments. To track actual merge buffer usage the way #17902 proposes, we may need some additional thought: using buffer.capacity() in the various places is going to reflect the configured buffer size, not the actual bytes used.
My suggestion would be to split all the new max metrics into a separate PR: mergeBuffer/maxAcquisitionTimeNs, groupBy/maxSpilledBytes, groupBy/maxMergeDictionarySize. Those look more straightforward and would still be useful for operators to track.
Then we can keep this PR and the linked issue open to track mergeBuffer/bytesUsed and mergeBuffer/maxBytesUsed. We'll also want to think about and add some test coverage for the grouper changes once the approach is finalized.
    throw new ISE("Grouper is closed");
  }

  groupers.forEach(Grouper::reset);
Is this change required? Given that the ConcurrentGrouper operates on SpillingGrouper instances, I suppose this is technically correct. But calling Grouper::reset as it was earlier should already ensure that the specific reset/close methods from SpillingGrouper are invoked? If so, could we revert this change here and below?
You are correct, only SpillingGrouper's method is called here.
I changed this to make my life easier during development:
- IntelliJ can jump to the SpillingGrouper method, instead of going to the Grouper interface.
- Future readers will be able to tell that the groupers object held by ConcurrentGrouper will be SpillingGroupers.
If the purpose is to keep this PR limited to changes connected to the groupBy metrics, I can open a new PR for this. Let me know if you would prefer that.
Got it. It’s generally preferable not to mix refactoring and functional changes together, but this one is relatively straightforward, so it’s okay with me.
  );

- private final Grouper<KeyType> grouper;
+ private final AbstractBufferHashGrouper<KeyType> grouper;
Same comment as above. I think it would be better to define these methods in the Grouper interface and leave this change as it was earlier. Mixing Grouper with impls otherwise seems confusing
For this, I will need to keep the reference typed as AbstractBufferHashGrouper. The reason also explains why getMergeBufferUsedBytes is not placed in the Grouper interface.
The groupBy metrics are only collated when Grouper#close is called. This means that we never interface with getMergeBufferUsedBytes directly, but let close retrieve getMergeBufferUsedBytes. This is why SpillingGrouper#getMergeBufferUsedBytes is private, following the example set by SpillingGrouper#getDictionarySizeEstimate.
The only place we need to interface with getMergeBufferUsedBytes is in AbstractBufferHashGrouper, whose implementations are the underlying groupers used by the SpillingGrouper.
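The collation-at-close pattern described here can be sketched as follows. This is a hypothetical simplification (SketchGrouper and StatsCollector are made-up names, not the actual SpillingGrouper wiring):

```java
// Hypothetical sketch: metrics are collated only in close(), so the usage
// getter can stay private - callers never read it directly.
public class SketchGrouper
{
  private final StatsCollector stats;
  private long mergeBufferUsedBytes = 0;

  public SketchGrouper(StatsCollector stats)
  {
    this.stats = stats;
  }

  public void aggregate(int bytes)
  {
    mergeBufferUsedBytes += bytes;
  }

  // Private, mirroring how SpillingGrouper#getDictionarySizeEstimate is
  // only consulted internally at close time.
  private long getMergeBufferUsedBytes()
  {
    return mergeBufferUsedBytes;
  }

  public void close()
  {
    // The single point where usage is reported to the stats provider.
    stats.report(getMergeBufferUsedBytes());
  }

  public interface StatsCollector
  {
    void report(long bytes);
  }
}
```

Keeping the getter private narrows the public surface: the only contract callers see is that closing the grouper pushes its stats out.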
  }
  buffer.putInt(numElements * Integer.BYTES, val);
  numElements++;
  maxMergeBufferUsageBytes = Math.max(maxMergeBufferUsageBytes, numElements * Integer.BYTES);
Actually I think this variable and state tracking isn't needed in add() since we're tracking numElements already. We can just compute numElements * Integer.BYTES inline in getMaxMergeBufferUsageBytes().
I was thinking of tracking the maximum usage, because we only request getMergeBufferUsage when SpillingGrouper#close is called. I do not think we are guaranteed to see the maximum usage at the moment the grouper is closing...
I am worried about the case where maybe we configure 1GiB to the merge buffer, and the usage in the middle goes to, say 900MiB, but when the Grouper is closed, the usage shows ~300MiB. The user is then encouraged to lower the merge buffer allocation to 500MiB, which will be problematic.
This comment would also hopefully address your query about why reset does not change maxMergeBufferUsageBytes.
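The concern can be illustrated with a minimal sketch (hypothetical names; PeakTracker stands in for the grouper's internal bookkeeping): if usage shrinks before close(), reading only the current value under-reports the peak, while a running max does not.

```java
// Hypothetical sketch: a running max captures peak usage even if the
// buffer shrinks again (e.g. after spilling) before close() is called.
public class PeakTracker
{
  private long currentBytes = 0;
  private long maxBytes = 0;

  public void setUsage(long bytes)
  {
    currentBytes = bytes;
    // Updated on every change, not just at close time.
    maxBytes = Math.max(maxBytes, bytes);
  }

  // Reading only this at close() could show ~300MiB even though the
  // query peaked at ~900MiB mid-flight.
  public long currentUsage()
  {
    return currentBytes;
  }

  public long peakUsage()
  {
    return maxBytes;
  }
}
```

With a mid-query peak of 900MiB that drops to 300MiB by close time, currentUsage() would mislead an operator into shrinking the buffer, while peakUsage() reports the real requirement.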
protected void updateMaxTableBufferUsage()
{
  maxTableBufferUsage = Math.max(maxTableBufferUsage, tableBuffer.capacity());
Unless I'm missing something, the issue #17902 was created to actually get some visibility into actual merge buffer usage, but this metric would just tell us how much was actually configured instead?
tableBuffer.capacity() would just indicate the capacity of the buffers, so more or less what was configured via druid.processing.buffer.sizeBytes.
You are right that the current metric reports the allocation.
I took another look at the implementation, and found that we can better estimate the ByteBufferHashTable usage. Since the ByteBufferHashTable is an open-addressing hash table, we can use the number of elements in the table (size) * the space taken up per bucket (bucketSizeWithHash) to estimate the usage in bytes.
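The estimate described above can be sketched as a one-liner (the class name is hypothetical; size and bucketSizeWithHash mirror the fields mentioned in the comment, and the constants in the example are illustrative):

```java
// Hypothetical sketch: estimating used bytes in an open-addressing hash
// table from occupied buckets, instead of reporting buffer.capacity().
public class HashTableUsageEstimate
{
  public static long usedBytes(int size, int bucketSizeWithHash)
  {
    // Only buckets actually occupied contribute to the estimate,
    // regardless of how large the backing buffer was allocated.
    return (long) size * bucketSizeWithHash;
  }

  public static void main(String[] args)
  {
    // e.g. 1000 occupied buckets of 29 bytes each = 29000 bytes used,
    // even if the configured buffer capacity is hundreds of MiB.
    System.out.println(usedBytes(1000, 29));
  }
}
```

The cast to long guards against overflow when size * bucketSizeWithHash exceeds Integer.MAX_VALUE for large buffers.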
Thanks for the update. Please see https://github.com/apache/druid/pull/18731/changes#r2706872449
Made the AlternatingByteBufferHashTable inherit the max metrics reporting from the superclass. Should now accurately report the usage :)
long hashTableUsage = hashTable.getMaxTableBufferUsage();
long offSetListUsage = offsetList.getMaxMergeBufferUsageBytes();
return hashTableUsage + offSetListUsage;
Same comment, I think this would just more or less tell us configured size rather than actual buffer usage. (more or less because of offset list tracking)
…atsMonitorTest.java Co-authored-by: Abhishek Radhakrishnan <abhishek.rb19@gmail.com>
Hi @abhishekrb19, I have revisited your comments and made the relevant changes/replies. Please take a look at the new approach to calculating the usage. Thanks!
long currentBufferUsedBytes = 0;
for (ByteBuffer buffer : subHashTableBuffers) {
  currentBufferUsedBytes += buffer.capacity();
}
I think this effectively would just be tableArenaSize which would reflect the allocated configured size rather than actual used size?
Please update LimitedBufferHashGrouperTest, BufferHashGrouperTest and related grouper tests to validate the correctness of these implementations.
I just pulled in the latest patch locally and ran some group by queries and noticed that the bytesUsed and maxBytesUsed were more or less what was configured druid.processing.buffer.sizeBytes 🤔
I’ll try to dig into this more, but in the meantime, I’d still recommend splitting the PR into two parts:
1. max metrics
2. the bytesUsed and maxBytesUsed metrics
2 seems a bit more involved.
Hello, I have added the tests for the groupers.
I did not get the same results as you, maybe because I used queries for a smaller dataset.
What I did in my tests was to query with spill to disk enabled:
- Set druid.processing.buffer.sizeBytes = 1GB
- Query on a dataset. (Let's say the results for this is 100MB)
- Set druid.processing.buffer.sizeBytes to a much smaller value ~5MB
- Query on the same dataset, and watch the usage metrics cap at 5MB, with spillage to disk ~95MB.
Here's an example of what my max metrics look like:
[screenshot of max metrics]
I do have to admit, some of the values are kinda "blocky", like it will report ~28MB repeatedly for, say 3 consecutive metrics, then report some other value. Maybe this is because similar queries are being sent during a short period of time, and perhaps the allocated space is the same for these similar queries. Hopefully, this will be fixed by your catch -- reporting the usage instead of the capacity. 😄
Yeah, I was playing around with it and I noticed more accurate reporting with the latest change than with the previous iterations.
I do have to admit, some of the values are kinda "blocky", like it will report ~28MB repeatedly for, say 3 consecutive metrics, then report some other value.
I wonder if the "stickiness" in the reporting is coming from reset() that was observed in the test: #18731 (comment)
As suggested, moved 3 max metrics to #18934, leaving
Tests for buffer hash grouper
The conflicts have been resolved. You can look again when you're free. Thanks!
Ack, thanks @GWphua! I will take a look at the latest changes later this week/early next week |
abhishekrb19
left a comment
Sorry for the delay @GWphua. I had some questions/suggestions and thanks for adding the tests!
Assert.assertEquals(5L * expectedBucketSize, grouper.getMergeBufferUsedBytes());
Was curious about the behavior here. Any idea why, after a reset(), just adding these isn't working as expected? The 6 entries below do work (just anything under 5 aggregate() calls fails):
grouper.aggregate(new IntKey(1));
grouper.aggregate(new IntKey(2));
Assert.assertEquals(2L * expectedBucketSize, grouper.getMergeBufferUsedBytes());
For anything less than what was aggregate() from before reset(), it seems to keep track of the max bytes from before, so it's essentially overreporting.
java.lang.AssertionError:
Expected :58
Actual :145
(145 = 5 * expectedBucketSize, it keeps memory of stuff from before reset() so it's saving the value from before)
This is because grouper.getMergeBufferUsedBytes tracks the max bytes used for a single query.
During a reset(), the history of the max bytes used is not reset. So after reset(), the value of grouper.getMergeBufferUsedBytes is unchanged (stays at 5 * expectedBucketSize). The test below it shows how many bytes are used by the grouper if we try to add a bunch of new items.
The new value of grouper.getMergeBufferUsedBytes will be Math.max(used bytes before reset, used bytes now)
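The behavior being explained can be sketched with a small hypothetical tracker (MaxAcrossResets is a made-up name standing in for the grouper's internal bookkeeping, not the real class):

```java
// Hypothetical sketch: reset() clears the current element count but
// deliberately keeps the historical max, so the reported value is the
// peak across the whole query lifetime.
public class MaxAcrossResets
{
  private int numElements = 0;
  private int maxElements = 0;

  public void add()
  {
    numElements++;
    maxElements = Math.max(maxElements, numElements);
  }

  public void reset()
  {
    // Only the live count resets; the max history is intentionally kept.
    numElements = 0;
  }

  public int maxElements()
  {
    return maxElements;
  }
}
```

After 5 add() calls, a reset(), and 2 more add() calls, maxElements() still reports 5 (the peak), which matches the "Expected: 58, Actual: 145" observation in the test above: the value is Math.max(used before reset, used now), not the post-reset count.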
tableBuffer.putInt(Groupers.getUsedFlag(keyHash));
tableBuffer.put(keyBuffer);
size++;
updateMaxMergeBufferUsedBytes();
Should there be a similar call to updateMaxMergeBufferUsedBytes() from adjustTableWhenFull() as well?
The LimitedBufferHashGrouper does that too
Since updateMaxMergeBufferUsedBytes computes size * bucketSizeWithHash, I added a call to this function wherever size or bucketSizeWithHash changes.
ByteBufferHashTable does not change either value in adjustTableWhenFull, so I did not add a call there. Granted, from a code reader's perspective, the places where the max merge buffer usage is updated can seem quite random...
I have added a Javadoc to the related function.
Fixes #17902
Huge thanks to @gianm for the implementation tip in the issue!
Description
Tracking merge buffer usage
Merge buffer usage is tracked in AbstractBufferHashGrouper and its implementations, which use a ByteBufferHashTable along with an offset tracker. Usage is estimated from the ByteBufferHashTable and the maximum offset size, calculated throughout the query's lifecycle.
Incorporated a helpful suggestion by @aho135: since the sizes of the hash tables are ever-changing, it makes sense to conduct calculations by taking the maximum values across queries -- so operators can have a better understanding of how merge buffer sizes should be configured.
Edit: max metrics provided in #18934
Here's an example of the current SUM implementation vs the MAX implementation. The latter helps tell us that we should probably configure merge buffer sizes to 2G for this case:
[screenshot comparing the SUM and MAX implementations]
Release note
GroupByStatsMonitor now provides the metric "mergeBuffer/bytesUsed", and max metrics for merge buffer acquisition time, bytes used, spilled bytes, and merge dictionary size.
Key changed/added classes in this PR
GroupByStatsProvider
This PR has:
Possible further enhancements
While building this PR, I have come across some problems which we can further enhance in the future:
Nested Group-bys
The current metric is great, but will not report accurately for nested group-bys. (Do correct me on this if I'm mistaken though!)
As far as I know, nested groupBy limits merge buffer usage to 2 buffers, meaning the merge buffer will be re-used. IIUC, every ConcurrentGrouper (if concurrency is enabled) / SpillingGrouper (if concurrency is disabled) is created and closed multiple times, and hence a per-query metric will likely over-report the merge buffer usage.
Simplify Memory Management
Right now we need to configure the following for each queryable service:
It will be great if we can simplify the calculations down to simply configuring direct memory, and we can manage a memory pool instead. This allows for more flexibility (unused memory allocated for merge buffers may be used by processing threads instead).