Support limit push down for GroupBy #3873
Conversation
@jon-wei any benchmark results?
Force-pushed 30bbc5a to c63e147
Basic query benchmark results: master vs. patch

TPC-H 'lineitem' benchmark: using the tpch lineitem dataset from https://github.com/implydata/redshift-benchmark, running a cluster with a single r3.2xlarge historical and an r3.2xlarge broker, with the following configs:

With the following query, grouping on a long column with ~2 million cardinality, with a single long sum agg, I got the following total query running times using the "time" cmd:

master:
patch (with push down enabled):

I also ran the query above against a similar datasource where
Force-pushed c63e147 to f565ce5
getRowOrderingPushDown(..) is only used when limitSpec does not have an aggregator in it, so shouldn't this case of aggIndex >= 0 actually be an error?
removed the agg checks here since they weren't necessary, thanks
haven't gone through the whole code yet... but, speaking as a user, it would be nice for Druid to automatically push down limits where the results would be exactly the same,
and for other cases, push down limits only if the user asks explicitly.
The approximate results seem to mean that the group-by results may not be valid if non-grouping keys are used for sorting. This is because of the behavior of BufferGrouper: it first aggregates input rows until its hash table fills. When the hash table fills, it copies the top-n entries found so far to another hash table and continues aggregation. Here, the copied entries may not belong to the real group-by result, and this can make the result invalid.
In this case, we cannot guarantee the approximation ratio because it will depend on the input order. In the worst case, users can get totally unexpected results, and it's difficult for them to figure that out without running the same query again without limit push down.
So, IMHO, it would be better to disable push down when non-grouping keys are used for sorting. What do you think?
I updated the patch so that limit push down is enabled only when the sorting order uses grouping key fields, and noted that in the docs
re: push down when sorting on non-grouping fields, I think it's good to keep it disabled by default, but I also think it's useful to allow it if the user really wants to do it (maybe some users have data/queries that happen to work well even with the approximation), so I've kept the push down context flag, with a note in the docs that there are no guarantees on the accuracy of the results in that case.
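To illustrate the hazard discussed above, here is a small standalone simulation (hypothetical code, not Druid's actual grouper): a grouper that keeps at most `limit` groups and evicts the group with the smallest running sum when full. A row for an evicted key that arrives later restarts that key's sum, so sorting on an aggregate (non-grouping) field can produce wrong totals.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of aggregating with a size-limited table that
// evicts the group with the smallest running sum when it fills. If rows
// for an evicted key arrive later, that key's sum restarts, so results can
// be wrong when sorting on a non-grouping (aggregate) field.
public class LimitedGrouperSketch {
  static Map<String, Long> groupWithLimit(List<Map.Entry<String, Long>> rows, int limit) {
    Map<String, Long> sums = new HashMap<>();
    for (Map.Entry<String, Long> row : rows) {
      sums.merge(row.getKey(), row.getValue(), Long::sum);
      if (sums.size() > limit) {
        // evict the group with the smallest running sum
        String smallest = Collections.min(sums.entrySet(), Map.Entry.comparingByValue()).getKey();
        sums.remove(smallest);
      }
    }
    return sums;
  }

  static Map<String, Long> demo() {
    return groupWithLimit(
        List.of(Map.entry("a", 10L), Map.entry("b", 1L), Map.entry("c", 5L),
                Map.entry("b", 100L)),  // "b" arrives again after being evicted
        2
    );
  }

  public static void main(String[] args) {
    // exact sums are {a=10, b=101, c=5}; the limited grouper reports b=100
    // because b's first row was evicted before its second row arrived
    System.out.println(demo());
  }
}
```

This only models the eviction effect; the actual BufferGrouper copies the top-n entries on table fill rather than evicting one entry at a time, but the input-order dependence is the same.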
Please remove the unnecessary semicolon.
Iterator.next() should throw NoSuchElementException.
changed to throw NSEE
should throw NoSuchElementException if curr >= size.
should throw NoSuchElementException if curr >= size.
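For reference, the contract being requested looks like this (a minimal hypothetical sketch, not the actual Druid iterator):

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Minimal sketch of an offset iterator honoring the Iterator contract:
// next() throws NoSuchElementException once curr >= size, instead of
// returning stale data or relying on callers to check hasNext() first.
public class OffsetIteratorSketch implements Iterator<Integer> {
  private final int[] offsets;
  private final int size;
  private int curr = 0;

  public OffsetIteratorSketch(int[] offsets, int size) {
    this.offsets = offsets;
    this.size = size;
  }

  @Override
  public boolean hasNext() {
    return curr < size;
  }

  @Override
  public Integer next() {
    if (curr >= size) {
      throw new NoSuchElementException();
    }
    return offsets[curr++];
  }
}
```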
Please remove the debug code.
removed debug print fn
Might be useful if it can be generalized for other types as well.
Hm, I could see that. I'll leave this specialized for the current use case in this PR, though; if another use comes up later we can do a PR then
IMO, it would be better to make a new Grouper supporting limit push down, because its behavior is quite different from the original one in many parts, to achieve a different goal. It will be helpful for readability and maybe a slight performance gain as well.
split the limit handling stuff into a new LimitedBufferGrouper
Force-pushed ed3ffb5 to 52cfebe
Thanks for the comments so far, updated the PR
marking this WIP, fixing some test failures, will reopen later
Force-pushed 67dba7b to ec4ee4a
Updated this with a few more changes:
    protected int maxSize;

    // current number of available/used buckets in the table
    protected int buckets;
can we call this maxPossibleBuckets ?
maybe also document, in the comments for the above two variables, that these numbers change on table resize.
maxBuckets is probably better.
Renamed to maxBuckets and added comments re: table resize
    protected int size;

    // Maximum number of elements in the table before it must be resized
    protected int maxSize;
can we call this regrowthThreshold ?
renamed to regrowthThreshold
    protected ByteBuffer buffer;
    protected int bucketSizeWithHash;
    protected int tableArenaSize;
    protected int keySize;
the last 4 variables can be final, I think
    import java.nio.ByteBuffer;

    public class ByteBufferHashTable
@gianm is the concept of regrowth there only because at initialization time we need to set 0 in the first byte of all the buckets, and it would be too expensive to do that upfront for the whole buffer? If yes, then it would be interesting to see how Unsafe.setMemory(..) performs over a 1 GB buffer; if that is fast enough, the regrowth business could possibly be removed.
Yes, that's the reason. When I benchmarked "simple" groupBys with small result sets, zeroing out the first byte of each bucket was a big performance hog.
That seems worth looking into, though I think that would take some time/effort to evaluate and could be handled in a follow-on PR.
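To make the tradeoff concrete, here is a hypothetical sketch of the layout under discussion (the bucket size and method names are made up): each bucket's first byte is a used flag, and those flags must be zeroed before a region of the table can be used, which is the per-bucket cost that regrowth amortizes.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: each bucket begins with a one-byte "used" flag.
// Zeroing every flag upfront over a large buffer is expensive, so a
// regrowth scheme only clears the flags of the region being brought online.
public class BucketClearSketch {
  static final int BUCKET_SIZE = 16;  // made-up bucket size in bytes

  // clear the used flag of buckets in [startBucket, endBucket)
  static void clearUsedFlags(ByteBuffer buf, int startBucket, int endBucket) {
    for (int i = startBucket; i < endBucket; i++) {
      buf.put(i * BUCKET_SIZE, (byte) 0);
    }
  }

  static boolean isUsed(ByteBuffer buf, int bucket) {
    return buf.get(bucket * BUCKET_SIZE) != 0;
  }

  static boolean demoStaleFlagCleared() {
    ByteBuffer buf = ByteBuffer.allocate(8 * BUCKET_SIZE);
    buf.put(2 * BUCKET_SIZE, (byte) 1);  // simulate a stale flag
    clearUsedFlags(buf, 0, 8);           // bring all 8 buckets online
    return isUsed(buf, 2);
  }
}
```

The idea floated above is that one bulk zeroing pass over the whole buffer (e.g. via Unsafe.setMemory) might be fast enough to drop the regrowth scheme entirely; since the flags are interleaved with bucket data, the bulk approach would have to zero the full arena rather than just the flag bytes.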
Force-pushed 7b71a1d to 7250a13
    {
      private ByteBuffer[] subHashTableBuffers;
      private ByteBufferHashTable[] subHashTables;
      private ByteBufferHashTable activeHashTable;
not sure why we are creating ByteBufferHashTable structures on two halves? can't we just keep two ByteBuffers for the two halves and a reference to an activeByteBuffer?
also, maybe add some comments saying there are only two of these (them being an array gave me the impression that there were many of those)... alternated on each regrow, where elements beyond the limit are discarded.
Changed this to use two ByteBuffers directly without the sub HashTables, added a comment on there being two buffers and a description of the swapping
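As a rough model of the swapping described here (heavily simplified to maps instead of ByteBuffers; the names are hypothetical, not Druid's actual LimitedBufferGrouper):

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Simplified model of the two-halves scheme: when the active half fills,
// only the top `limit` entries (by descending sum) are copied into the
// other half, which then becomes the active one.
public class TwoHalvesSketch {
  private final int halfCapacity;
  private final int limit;
  private Map<String, Long> active = new HashMap<>();
  private Map<String, Long> spare = new HashMap<>();

  TwoHalvesSketch(int halfCapacity, int limit) {
    this.halfCapacity = halfCapacity;
    this.limit = limit;
  }

  void add(String key, long value) {
    if (!active.containsKey(key) && active.size() == halfCapacity) {
      swapAndCopyTopN();
    }
    active.merge(key, value, Long::sum);
  }

  // copy the top `limit` entries into the spare half, then swap halves
  private void swapAndCopyTopN() {
    spare.clear();
    active.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
        .limit(limit)
        .forEach(e -> spare.put(e.getKey(), e.getValue()));
    Map<String, Long> tmp = active;
    active = spare;
    spare = tmp;
  }

  Map<String, Long> results() {
    return active;
  }
}
```

Alternating between two fixed halves avoids growing the table: each swap discards everything beyond the limit, so the freed buckets in the old half can be reused on the next swap.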
    // Limit to apply to results.
    // If limit > 0, track hash table entries in a binary heap with size of limit.
    // If -1, no limit is applied; hash table entry offsets are tracked with an unordered list with no limit.
    private int limit;
what is the case where limit <= 0 is valid?
This was an outdated comment from an older revision, limit will never be <= 0 now, I deleted the comment
      return forcePushDown;
    }

    public boolean determineApplyLimitPushDown()
i think this can be implemented in one line: return validateAndGetForceLimitPushDown() || DefaultLimitSpec.sortingOrderHasNonGroupingFields(..);
you might have to add the defaultLimitSpec.getLimit() == Integer.MAX_VALUE check inside DefaultLimitSpec.sortingOrderHasNonGroupingFields(..)
Hm, I decided to change this area in a different way: I removed an unnecessary null and type check from sortingOrderHasNonGroupingFields(), but kept determineApplyLimitPushDown() the same.
The sortingOrder method is used elsewhere, where the MAX_VALUE check is no longer relevant (it would already have been checked earlier when the query object was created), so I didn't think it really belongs there.
I also felt that the conditions for when limit push down will be applied are clearer if they're all in the determineApplyLimitPushDown() method.
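The resulting decision, as described in the discussion above, can be sketched as follows (hypothetical signature and class name, not the actual Druid method):

```java
// Hypothetical sketch of the push-down decision described above: apply it
// when the user forces it, or when a real limit exists and the ordering
// only references grouping-key fields (so the results remain exact).
public class LimitPushDownDecision {
  static boolean determineApplyLimitPushDown(
      boolean forcePushDown,
      int limit,
      boolean sortingOrderHasNonGroupingFields
  ) {
    if (forcePushDown) {
      return true;
    }
    // "no explicit limit" is modeled here as Integer.MAX_VALUE
    return limit != Integer.MAX_VALUE && !sortingOrderHasNonGroupingFields;
  }
}
```

Keeping all three conditions in one method, as the author argues, makes it easy to see every case in which push down can fire.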
@jon-wei looks good overall besides some comments. it makes sense overall, but it's hard to identify bugs; hoping that you have tested it on a cluster with a few different nested/non-nested groupBy queries.
I've addressed your comments, thanks for the review! Re: testing, the patch has been tested with non-nested groupBys on a large TPC-H dataset as part of the benchmarking. I am currently doing more stress testing with more query variations like nested groupBys; I'll comment here once that testing is done.
    // clear the used bits of both buffers
    for (int i = 0; i < maxBuckets; i++) {
      subHashTableBuffers[0].put(i * bucketSizeWithHash, (byte) 0);
      subHashTableBuffers[1].put(i * bucketSizeWithHash, (byte) 0);
this one is not needed and will automatically be zeroed out in adjustTableWhenFull(..) when needed.
Got rid of the second buffer reset there
I tested the patch with some nested groupBys against the TPC-H dataset that I benchmarked against earlier; the results are correct. I've deployed the patch to Imply's internal test cluster as well. Also merged master just now.
thanks @jon-wei 👍
👍
    private boolean forcePushDownLimit = false;

    @JsonProperty
    private Class<? extends GroupByQueryMetricsFactory> queryMetricsFactory;
@jon-wei could you please remove this field, getter and setter? They were removed here: https://github.com/druid-io/druid/pull/4336/files#diff-5e30ea112240e82f233099071bb3389e but resurrected in this PR.
    private final List<AggregatorFactory> aggregatorSpecs;
    private final List<PostAggregator> postAggregatorSpecs;

    private final Function<Sequence<Row>, Sequence<Row>> limitFn;
This patch adds a context flag to the GroupBy query that enables result limiting/sorting at the merge buffer level, allowing the brokers to process less data.
This is accomplished by using a min-max heap with a size limit in the BufferGrouper instead of a list of offsets.
The table buffer in the BufferGrouper is used differently in push down mode: instead of growing as buckets are added, the hash table buffer is split in half. When a half fills, the active table buffer swaps to the other half, and the top N buckets are copied over to the new active buffer.
Note that when the sorting order uses fields that are not in the grouping key, limit push down can result in approximate results.
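The heap-based tracking can be modeled like this (a simplified sketch using a plain PriorityQueue over values rather than the actual min-max heap over bucket offsets):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Simplified model of limit tracking: keep at most `limit` entries; the
// heap root is always the current worst entry, so a new entry either
// replaces it or is discarded. In the real grouper the heap holds bucket
// offsets, letting evicted buckets be reclaimed.
public class OffsetHeapSketch {
  // return the `limit` smallest values in ascending order
  static List<Integer> topN(int[] values, int limit) {
    PriorityQueue<Integer> heap = new PriorityQueue<>(Comparator.reverseOrder());
    for (int v : values) {
      if (heap.size() < limit) {
        heap.offer(v);
      } else if (v < heap.peek()) {
        heap.poll();   // evict the current worst entry
        heap.offer(v);
      }
    }
    List<Integer> out = new ArrayList<>(heap);
    out.sort(null);
    return out;
  }
}
```

Capping the heap at the limit is what lets the historicals send at most `limit` rows per segment to the broker instead of every group.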