Add PostAggregators to generator cache keys for top-n queries by jihoonson · Pull Request #3899 · apache/druid

jihoonson · 2017-02-02T04:47:46Z

This PR is to fix #3719.

drcrallen · 2017-02-02T04:52:10Z

+import static org.junit.Assert.assertTrue;
+
+@RunWith(Parameterized.class)
+public class CacheKeyBuilderTest


Can you add a test for the following:

Item 1:
String 1: a
String 2: b

Item 2:
String 1: ab

I added tests. Thanks.

nishantmonu51

LGTM, 👍

nishantmonu51 · 2017-02-05T10:16:26Z

+
+  public CacheKeyBuilder appendCacheableList(List<? extends Cacheable> inputs)
+  {
+    for (Cacheable input : inputs) {


Can the list be null in any case? We might as well add a null check here.

Thanks. I don't see any case now, but it will be helpful for future uses. I'll add it.

drcrallen · 2017-02-06T21:04:48Z

+  public CacheKeyBuilder appendStringList(List<String> input)
+  {
+    for (String eachStr : input) {
+      appendItem(stringToByteArray(eachStr));


Ok, I'll be more explicit here.

If you call

appendStringList(ImmutableList.of("a", "b"));

in one cache key builder, and

appendStringList(ImmutableList.of("ab"))

in another, both will (incorrectly) evaluate to the same cache key, if I'm reading the code correctly. All strings should be terminated with a byte that can never start a valid UTF-8 character (like 0xFF).
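To make the collision concrete, here is a minimal sketch that assumes the builder simply concatenates the UTF-8 bytes of each string, as the loop in the diff appears to do. The class `NaiveKeyCollisionDemo` and its `naiveKey` method are hypothetical stand-ins, not the PR's code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class NaiveKeyCollisionDemo
{
  // Hypothetical naive encoder: concatenates the UTF-8 bytes of each string
  // with no separator and no length information.
  static byte[] naiveKey(List<String> strings)
  {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (String s : strings) {
      byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
      out.write(utf8, 0, utf8.length);
    }
    return out.toByteArray();
  }

  public static void main(String[] args)
  {
    byte[] twoItems = naiveKey(Arrays.asList("a", "b"));
    byte[] oneItem = naiveKey(Arrays.asList("ab"));
    // Both lists flatten to the bytes 0x61 0x62, so the keys collide.
    System.out.println(Arrays.equals(twoItems, oneItem)); // prints "true"
  }
}
```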

One technique I like for this is to write the number of things before writing the things (something like below). This has the advantage of being robust to a byte array that does actually have a 0xff in it for some reason (which byte arrays can totally have). And then separators aren't necessary.

Whether this is the approach taken or not, @drcrallen is right that ["a", "b"] and ["ab"] and ["ab", ""] need to have different cache keys.

    public CacheKeyBuilder appendStrings(List<String> strings)
    {
      // write number of strings, followed by each string
      appendInt(strings.size());
      for (String str : strings) {
        appendString(str);
      }
      return this;
    }

    public CacheKeyBuilder appendString(String string)
    {
      // write string as utf-8 bytes
      return appendBytes(StringUtils.toUtf8WithNullToEmpty(string));
    }

    public CacheKeyBuilder appendInt(int theInt)
    {
      // write int as 4 bytes
      appendItem(Ints.toByteArray(theInt));
      return this;
    }

    public CacheKeyBuilder appendBytes(byte[] bytes)
    {
      // write number of bytes, followed by the actual bytes
      appendInt(bytes.length);
      appendItem(bytes);
      return this;
    }

    private void appendItem(byte[] bytes)
    {
      items.add(bytes);
      size += bytes.length;
    }

    public byte[] build()
    {
      ByteBuffer retVal = ByteBuffer.allocate(size);
      for (byte[] item : items) {
        retVal.put(item);
      }
      return retVal.array();
    }
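As a rough check of the count-then-content idea, the sketch below re-implements it with JDK classes only (no Guava or Druid utilities) and compares the three lists mentioned above. `LengthPrefixedKeyDemo` and its method names are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class LengthPrefixedKeyDemo
{
  // Writes the number of strings, then each string as (byte length, UTF-8 bytes).
  static byte[] key(List<String> strings)
  {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeInt(out, strings.size());
    for (String s : strings) {
      byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
      writeInt(out, utf8.length);
      out.write(utf8, 0, utf8.length);
    }
    return out.toByteArray();
  }

  // Writes an int as 4 big-endian bytes.
  private static void writeInt(ByteArrayOutputStream out, int value)
  {
    byte[] bytes = ByteBuffer.allocate(4).putInt(value).array();
    out.write(bytes, 0, bytes.length);
  }

  public static void main(String[] args)
  {
    byte[] ab = key(Arrays.asList("a", "b"));
    byte[] joined = key(Arrays.asList("ab"));
    byte[] joinedPlusEmpty = key(Arrays.asList("ab", ""));
    // With counts and lengths in the key, no two of these are equal.
    System.out.println(Arrays.equals(ab, joined));              // prints "false"
    System.out.println(Arrays.equals(ab, joinedPlusEmpty));     // prints "false"
    System.out.println(Arrays.equals(joined, joinedPlusEmpty)); // prints "false"
  }
}
```

The cost is four extra bytes per item plus four per list, which is the key-size concern raised in the reply that follows.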

@drcrallen, thanks. I'll add tests.

@gianm, thank you for the good suggestion. I'm concerned with the cache key size. How about adding some keys for each type like below?

    static final byte BYTE_KEY = 0;
    static final byte BYTE_ARRAY_KEY = 1;
    ...
    static final byte INT_KEY = 10;
    static final byte INT_ARRAY_KEY = 11;
    ...

    public CacheKeyBuilder appendByte(byte input)
    {
      appendItem(BYTE_KEY, byteToByteArray(input));
      return this;
    }

    public CacheKeyBuilder appendByteArray(byte[] input)
    {
      appendItem(BYTE_ARRAY_KEY, input);
      return this;
    }

    private void appendItem(byte key, byte[] bytes)
    {
      ...
    }
    ...

@jihoonson, that sounds like a good way to save some space.

I realized that both type keys and list sizes are needed to avoid this problem. I added them to CacheKeyBuilder.
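One way to see why a type key helps on top of sizes: an empty string list and the int 0 would both serialize to four zero bytes if only sizes were written. The sketch below is an illustration under assumptions, not the merged CacheKeyBuilder; the class name, the two key constants, and their values are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class TypedKeySketch
{
  // Hypothetical type keys; the real constants and values may differ.
  static final byte INT_KEY = 10;
  static final byte STRING_LIST_KEY = 20;

  private final ByteArrayOutputStream out = new ByteArrayOutputStream();

  // Type key, then the int as 4 bytes.
  TypedKeySketch appendInt(int value)
  {
    out.write(INT_KEY);
    writeInt(value);
    return this;
  }

  // Type key, then the list size, then each string as (byte length, UTF-8 bytes).
  TypedKeySketch appendStrings(List<String> strings)
  {
    out.write(STRING_LIST_KEY);
    writeInt(strings.size());
    for (String s : strings) {
      byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
      writeInt(utf8.length);
      out.write(utf8, 0, utf8.length);
    }
    return this;
  }

  byte[] build()
  {
    return out.toByteArray();
  }

  private void writeInt(int value)
  {
    byte[] bytes = ByteBuffer.allocate(4).putInt(value).array();
    out.write(bytes, 0, bytes.length);
  }

  public static void main(String[] args)
  {
    byte[] emptyList = new TypedKeySketch().appendStrings(Collections.<String>emptyList()).build();
    byte[] zero = new TypedKeySketch().appendInt(0).build();
    // The payloads after the first byte are identical; only the type key differs.
    System.out.println(Arrays.equals(emptyList, zero)); // prints "false"
  }
}
```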

fjy · 2017-02-06T21:16:18Z

👍

gianm · 2017-02-06T22:28:28Z

+import java.nio.ByteBuffer;
+import java.util.List;
+
+public class CacheKeyBuilder


This is a great concept. I can't believe we didn't already have one of these!

gianm · 2017-02-06T22:48:14Z

+  public static final byte LONG_GREATEST = 11;
+  public static final byte LONG_LEAST = 12;
+  public static final byte MAX = 13;
+  public static final byte MIN = 14;


Would prefer to prefix the extension-based post aggregators with the name of the extension, like HISTOGRAM_MIN rather than MIN.

gianm · 2017-02-06T22:57:06Z

+            .appendCacheable(query.getGranularity())
+            .appendCacheable(query.getDimFilter())
+            .appendCacheableList(query.getAggregatorSpecs())
+            .appendCacheableList(query.getDimensions())


@jihoonson, limitBytes and havingBytes disappeared from the cache key. Is this intentional? I guess "having" actually doesn't really need to be there, since it's applied after per-segment results are generated. So removing it should be fine. Currently the same is true for "limit" but #3873 may change that (I'm not sure).

@jon-wei, what do you think?

Thanks. It's my bad. I'll fix it.

After #3873, the groupBy result can differ if limit push-down is enabled and the ordering keys contain non-dimensions. I think this should also be taken into account when building the cache key.

The limit push-down in #3873 is applied when merging the per-segment results, so taking the limit out of the cache key should be fine there as well.

gianm · 2017-02-06T23:08:48Z

  );
  private static final List<AggregatorFactory> RENAMED_AGGS = Arrays.asList(
-      new CountAggregatorFactory("rows2"),
+      new CountAggregatorFactory("rows"),


What's the idea behind this change, and the makeTopNResults -> makeTopNResultsWithoutRename change? It's not clear to me why the new test is better.

This was not for a better test, but to make the unit tests pass.

The CachingClusteredClientTest.testTopN* tests check the cached result, and they need to use the same post aggregator apart from its name. So I changed the test queries to use POST_AGG, which is also used for caching.

To make POST_AGG work properly, the FieldAccessPostAggregator.fieldName should be the one specified in AGGS. So, I changed the name of CountAggregatorFactory from rows2 to rows.

Since there is another renamed field, impers2, in RENAMED_AGGS, I thought it would not be a problem for the other group-by tests that use renaming. If it is, please let me know.

The change to makeTopNResultsWithoutRename() is just for my convenience. If the name is too verbose, I'll change it.

Ah I see what's going on. There's nothing wrong with verbose methods (usually I prefer them) so that's fine.

gianm · 2017-02-07T04:21:01Z

+            .appendCacheable(query.getGranularity())
+            .appendCacheable(query.getDimensionsFilter())
+            .appendCacheableList(query.getAggregatorSpecs())
+            .appendCacheableList(query.getPostAggregatorSpecs())


Until now, just changing the post-aggregators on a topN (maybe adding new arithmetic, something like that) would still allow the cache to be used. This is a bug if we're sorting by the post-aggregator but is fine otherwise.

What do you think about changing the cache key generation so it only includes post-aggs we're sorting on (and any transitive dependencies based on getDependentFields)?

Sounds good. I'll do that.
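The pruning idea can be sketched as a backward walk over the post-aggregator list that keeps the sort metric's post-aggregator and anything it depends on. This is a simplified illustration, not the Druid implementation: `PostAgg` is a hypothetical stand-in for a post-aggregator, and the sketch assumes a post-aggregator may only reference entries that appear before it in the list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

final class PostAggPruningSketch
{
  // Hypothetical stand-in: a post-aggregator's output name plus the fields it reads.
  static final class PostAgg
  {
    final String name;
    final Set<String> dependentFields;

    PostAgg(String name, String... dependentFields)
    {
      this.name = name;
      this.dependentFields = new HashSet<>(Arrays.asList(dependentFields));
    }
  }

  // Returns the names of the post-aggs the sort metric depends on, directly or
  // transitively, in their original order. Assumes dependencies come earlier in the list.
  static List<String> neededForSort(List<PostAgg> postAggs, String sortMetric)
  {
    Set<String> needed = new HashSet<>();
    needed.add(sortMetric);
    LinkedList<String> kept = new LinkedList<>();
    for (int i = postAggs.size() - 1; i >= 0; i--) {
      PostAgg postAgg = postAggs.get(i);
      if (needed.contains(postAgg.name)) {
        kept.addFirst(postAgg.name);
        needed.addAll(postAgg.dependentFields);
      }
    }
    return kept;
  }

  public static void main(String[] args)
  {
    List<PostAgg> postAggs = Arrays.asList(
        new PostAgg("avg", "sum", "count"),
        new PostAgg("ratio", "avg", "total"),
        new PostAgg("unrelated", "x")
    );
    System.out.println(neededForSort(postAggs, "ratio")); // prints "[avg, ratio]"
    System.out.println(neededForSort(postAggs, "count")); // prints "[]"
  }
}
```

When the topN sorts on a plain aggregator such as `count`, no post-aggregator enters the key, so changing post-aggregators still hits the cache.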

jihoonson · 2017-02-08T04:02:46Z

@drcrallen, @gianm thanks for your review. I addressed comments.
Additionally, I added the below feature to CacheKeyBuilder.

When a collection is appended to CacheKeyBuilder, it is sorted by the byte representation of its items. This guarantees that collections containing the same items in different orders produce the same cache key.

gianm · 2017-02-08T06:53:41Z

+        byteArrayList.add(byteArray);
+      }
+
+      // Sort the byte array list to guarantee that collections of same items but in different orders make the same result


This is not always safe, for example, in arithmetic post-aggregator if the op is "minus" or "div" then order matters. If you want to allow this optimization then perhaps you could use two methods like appendCacheables vs. appendCacheablesIgnoringOrder and then callers could use the appropriate one.

Good catch. Thanks. I added appendCacheablesIgnoringOrder().
I checked which of appendCacheables and appendCacheablesIgnoringOrder should be called when building each cache key, but I'm not sure about SketchSetPostAggregator. Please verify it.
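The difference between the two kinds of append can be sketched as follows. Strings stand in for the byte representation of each Cacheable, and the class and method names are hypothetical; only the idea (sort by bytes before writing, or not) comes from the discussion.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OrderSensitivitySketch
{
  // Order-preserving: [x, y] and [y, x] get different keys, as "minus" or "div" require.
  static byte[] appendInOrder(List<String> fields)
  {
    return encode(toBytes(fields));
  }

  // Order-insensitive: items are sorted by their byte representation before writing.
  static byte[] appendIgnoringOrder(List<String> fields)
  {
    List<byte[]> items = toBytes(fields);
    items.sort(OrderSensitivitySketch::compareUnsigned);
    return encode(items);
  }

  private static List<byte[]> toBytes(List<String> fields)
  {
    List<byte[]> items = new ArrayList<>();
    for (String field : fields) {
      items.add(field.getBytes(StandardCharsets.UTF_8));
    }
    return items;
  }

  // Lexicographic comparison that treats each byte as an unsigned value.
  private static int compareUnsigned(byte[] a, byte[] b)
  {
    int common = Math.min(a.length, b.length);
    for (int i = 0; i < common; i++) {
      int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(a.length, b.length);
  }

  // Writes the item count, then each item as (length, bytes).
  private static byte[] encode(List<byte[]> items)
  {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeInt(out, items.size());
    for (byte[] item : items) {
      writeInt(out, item.length);
      out.write(item, 0, item.length);
    }
    return out.toByteArray();
  }

  private static void writeInt(ByteArrayOutputStream out, int value)
  {
    byte[] bytes = ByteBuffer.allocate(4).putInt(value).array();
    out.write(bytes, 0, bytes.length);
  }

  public static void main(String[] args)
  {
    List<String> xy = Arrays.asList("x", "y");
    List<String> yx = Arrays.asList("y", "x");
    System.out.println(Arrays.equals(appendInOrder(xy), appendInOrder(yx)));             // prints "false"
    System.out.println(Arrays.equals(appendIgnoringOrder(xy), appendIgnoringOrder(yx))); // prints "true"
  }
}
```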

gianm · 2017-02-08T09:38:49Z

  }

+  @VisibleForTesting
+  static Collection<PostAggregator> findPostAggregatorsForSort(


This looks redundant with the existing private static List<PostAggregator> prunePostAggregators(TopNQuery query), which calls AggregatorUtil.pruneDependentPostAgg (used for similar functionality in other parts of the topN system).

Thanks. I changed to use that method.

gianm · 2017-02-08T17:31:46Z

@jihoonson, thx for the update. had a couple further comments on the new stuff.

jihoonson · 2017-02-09T01:48:39Z

@gianm, thanks. I addressed your comments.

I think we need more tests to verify query caching with various types of fields, aggregators, post aggregators, etc. To make the tests concrete, I think it would be better to add more test classes specialized for each query type. These tests would be parameterized with the original query, renamed parameters, and parameters in a different order. However, this would be a huge change to the tests, so I'm not sure about adding them in this PR. Maybe it's better to add them in a follow-up PR? What do you think?

gianm · 2017-02-09T03:22:14Z

That test sounds great -- we don't have amazing coverage of caching and cache key right now. I think it makes sense to do it in a followup PR.

gianm

LGTM, some minor suggestions. thx @jihoonson!

gianm · 2017-02-09T04:04:59Z

+    return builder.build();
+  }
+
+  private static boolean preserveFieldOrder(SketchHolder.Func func)


Suggest calling this preserveFieldOrderInCacheKey for better clarity.

gianm · 2017-02-09T04:05:21Z

           '}';
  }

+  private static boolean preserveFieldOrder(Ops op)


Similar, suggest calling this preserveFieldOrderInCacheKey.

gianm · 2017-02-09T04:08:25Z

  }

-  private static List<PostAggregator> prunePostAggregators(TopNQuery query)
+  public static List<PostAggregator> prunePostAggregators(TopNQuery query)


This doesn't need to be public, it's only called from within this file.

jihoonson · 2017-02-09T04:16:01Z

@gianm, fixed. Thanks!

gianm · 2017-02-09T04:30:42Z

@drcrallen do you have any other comments?

drcrallen

Couple of comments. Also, I believe this will invalidate caches during an upgrade (pre- vs. post-upgrade keys will not match). As such, it should be called out in the release notes.

drcrallen · 2017-02-13T19:34:56Z

+import java.util.Iterator;
+import java.util.List;
+
+public class CacheKeyBuilder


Can you add docs here for a high level overview of how this class works?

drcrallen · 2017-02-13T19:36:02Z

  // for testing.
-  public static final HavingSpec NEVER = new NeverHavingSpec();
-  public static final HavingSpec ALWAYS = new AlwaysHavingSpec();
+  HavingSpec NEVER = new NeverHavingSpec();


why this change?

Those modifiers don't do anything for interface fields, which are implicitly public, static, and final.
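A small reflection check illustrates this. The nested `Spec` interface and the class name are hypothetical stand-ins for HavingSpec; the field is declared with no modifiers at all.

```java
import java.lang.reflect.Modifier;

final class InterfaceFieldModifiersDemo
{
  // Hypothetical stand-in for HavingSpec: the field has no explicit modifiers.
  interface Spec
  {
    Object NEVER = new Object();
  }

  // Reports whether the named field is public, static, and final.
  static boolean isPublicStaticFinal(Class<?> type, String fieldName)
  {
    try {
      int modifiers = type.getField(fieldName).getModifiers();
      return Modifier.isPublic(modifiers)
             && Modifier.isStatic(modifiers)
             && Modifier.isFinal(modifiers);
    }
    catch (NoSuchFieldException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args)
  {
    System.out.println(isPublicStaticFinal(Spec.class, "NEVER")); // prints "true"
  }
}
```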

drcrallen · 2017-02-13T19:38:23Z

+            .appendCacheable(query.getDimensionsFilter())
+            .appendCacheablesIgnoringOrder(query.getAggregatorSpecs());
+
+        final List<PostAggregator> postAggregators = prunePostAggregators(query);


Does this mean that if I change post aggregators that are not part of the sort, I will always get a cached result?

That's the idea: only postaggs involved in the sorting could affect the topN by-segment results, so they're the only ones that need to be in the cache key.

oh, duh, because the post-aggs can be calculated from the cached results.

Are they? My mind is drawing a blank there. Do post-aggs get calculated AFTER cached results?

Yes, they are calculated after pulling from cache. But in this case, a postagg can affect the sort order of cached results, even if it's not actually in the cached results. So it still needs to be part of the topN cache key (to fix #3719).
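A toy example of the effect, using made-up rows and a top-1 instead of a real topN: the stored aggregates are identical, but the post-aggregator used for sorting decides which row survives, so results cached under one sort cannot serve the other. All names and values here are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToLongFunction;

final class SortPostAggDemo
{
  // Hypothetical per-segment row: a dimension value and two aggregated metrics.
  static final class Row
  {
    final String dim;
    final long a;
    final long b;

    Row(String dim, long a, long b)
    {
      this.dim = dim;
      this.a = a;
      this.b = b;
    }
  }

  // Keeps only the best row under the given sort post-aggregator (a top-1).
  static String top1(List<Row> rows, ToLongFunction<Row> sortPostAgg)
  {
    return rows.stream().max(Comparator.comparingLong(sortPostAgg)).get().dim;
  }

  public static void main(String[] args)
  {
    List<Row> rows = Arrays.asList(new Row("x", 10, 9), new Row("y", 6, 1));
    System.out.println(top1(rows, r -> r.a + r.b)); // prints "x" (19 vs 7)
    System.out.println(top1(rows, r -> r.a - r.b)); // prints "y" (1 vs 5)
  }
}
```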

Had to go look myself to make sure, io.druid.client.CachingClusteredClient#mergeCachedAndUncachedSequences and io.druid.client.CachingQueryRunner#run seem to be able to handle it but holy cow, I forgot how convoluted the data paths are for actually running a query.

drcrallen · 2017-02-13T20:08:51Z

Requesting more class docs for the cache helper, but not strong enough to block the PR

gianm · 2017-02-13T20:23:37Z

Going to merge this one then, but @jihoonson could you please do a follow up adding the javadocs that @drcrallen mentioned to CacheKeyBuilder? thx

jihoonson · 2017-02-13T23:52:43Z

@gianm, @drcrallen thanks for your review. I opened #3933 for adding docs.

jihoonson added 2 commits February 2, 2017 13:46

Add PostAggregators to generator cache keys for top-n queries

4d7f466

Merge branch 'master' of https://github.com/druid-io/druid into cache…

9ff17ea

…-key-for-post-aggregator

drcrallen reviewed Feb 2, 2017

Add tests for strings

44e21ce

fjy added this to the 0.10.0 milestone Feb 2, 2017

gianm added the Bug label Feb 2, 2017

Remove debug comments

ddb4eee

nishantmonu51 approved these changes Feb 5, 2017

nishantmonu51 assigned nishantmonu51 and drcrallen Feb 5, 2017

drcrallen requested changes Feb 6, 2017

gianm added the Release Notes label Feb 6, 2017

gianm reviewed Feb 6, 2017

Merge branch 'master' of https://github.com/druid-io/druid into cache…

022aefd

…-key-for-post-aggregator

gianm reviewed Feb 7, 2017

jihoonson added 3 commits February 7, 2017 21:41

Add type keys and list sizes to cache key

f30c724

Make post aggregators used for sort are considered for cache key gene…

f5caefc

…ration

Use assertArrayEquals()

ee91c35

Improve findPostAggregatorsForSort()

6faf2ab

gianm reviewed Feb 8, 2017

Address comments

521780a

fix test failure

a49cdaf

gianm approved these changes Feb 9, 2017

address comments

e186d67

Merge branch 'master' of https://github.com/druid-io/druid into cache…

d171a2f

…-key-for-post-aggregator

drcrallen requested changes Feb 13, 2017

drcrallen approved these changes Feb 13, 2017

gianm merged commit 991e285 into apache:master Feb 13, 2017

jihoonson mentioned this pull request Feb 13, 2017

Docs for CacheKeyBuiler #3933

Closed

gianm mentioned this pull request Feb 28, 2017

Druid 0.10.0 release notes #3944

Closed

gianm mentioned this pull request Mar 22, 2017

Fix some query cache key collisions. #4094

Merged

clambertus unassigned nishantmonu51 and drcrallen Jul 6, 2018

seoeun25 pushed a commit to seoeun25/incubator-druid that referenced this pull request Jan 10, 2020

Backport of apache#4280 (Remove cache keys from HavingSpecs) and part…

8995ee7

… of apache#3899
