Skip to content

Add PostAggregators to generator cache keys for top-n queries#3899

Merged
gianm merged 13 commits intoapache:masterfrom
jihoonson:cache-key-for-post-aggregator
Feb 13, 2017
Merged

Add PostAggregators to generator cache keys for top-n queries#3899
gianm merged 13 commits intoapache:masterfrom
jihoonson:cache-key-for-post-aggregator

Conversation

@jihoonson
Copy link
Copy Markdown
Contributor

@jihoonson jihoonson commented Feb 2, 2017

This PR is to fix #3719.

This change is Reviewable

import static org.junit.Assert.assertTrue;

@RunWith(Parameterized.class)
public class CacheKeyBuilderTest
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test for the following:

Item 1:
String1 : a
String2 : b

Item 2:
String 1:ab

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added tests. Thanks.

@fjy fjy added this to the 0.10.0 milestone Feb 2, 2017
@gianm gianm added the Bug label Feb 2, 2017
Copy link
Copy Markdown
Member

@nishantmonu51 nishantmonu51 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, 👍


public CacheKeyBuilder appendCacheableList(List<? extends Cacheable> inputs)
{
for (Cacheable input : inputs) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can the list be null in any case? We might as well just add null a check here.

Copy link
Copy Markdown
Contributor Author

@jihoonson jihoonson Feb 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I don't see any case now, but it will be helpful for future uses. I'll add it.

public CacheKeyBuilder appendStringList(List<String> input)
{
for (String eachStr : input) {
appendItem(stringToByteArray(eachStr));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll be more explicit here.

If you call

appendStringList(ImmutalbeList.of("a","b"));

in one cache key builder, and

appendStringList(ImmutableList.of("ab"))

in another, they will both evaluate (incorrectly) to the same cache key, if I'm reading the code correctly. All strings should be terminated with a byte that is NOT a valid UTF-8 character start (like 0xFF)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One technique I like for this is to write the number of things before writing the things (something like below). This has the advantage of being robust to a byte array that does actually have a 0xff in it for some reason (which byte arrays can totally have). And then separators aren't necessary.

Whether this is the approach taken or not, @drcrallen is right that ["a", "b"] and ["ab"] and ["ab", ""] need to have different cache keys.

public CacheKeyBuilder appendStrings(List<String> strings)
{
  // write number of strings, followed by each string
  appendInt(strings.size());
  for (String str : strings) {
    appendString(str);
  }
  return this;
}

public CacheKeyBuilder appendString(String string)
{
  // write string as utf-8 bytes
  return appendBytes(StringUtils.toUtf8WithNullToEmpty(input));
}

public CacheKeyBuilder appendInt(int theInt)
{
  // write int as 4 bytes
  appendItem(Ints.toByteArray(theInt));
  return this;
}

public CacheKeyBuilder appendBytes(byte[] bytes)
{
  // write number of bytes, followed by the actual bytes
  appendInt(bytes.length);
  appendItem(bytes);
  return this;
}

private void appendItem(byte[] bytes)
{
  items.add(bytes);
  size += bytes.length;
}

public byte[] build()
{
  ByteBuffer retVal = ByteBuffer.allocate(size);
  for (byte[] item : items) {
    retVal.put(item);
  }
  return retVal.array();
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@drcrallen, thanks. I'll add tests.

@gianm, thank you for the good suggestion. I'm concerned with the cache key size. How about adding some keys for each type like below?

static final byte BYTE_KEY = 0;
static final byte BYTE_ARRAY_KEY = 1;
...
static final byte INT_KEY = 10;
static final byte INT_ARRAY_KEY = 11;
...

public CacheKeyBuilder appendByte(byte input)
{
  appendItem(BYTE_KEY, byteToByteArray(input));
  return this;
}

public CacheKeyBuilder appendByteArray(byte[] input)
{
  appendItem(BYTE_ARRAY_KEY, input);
  return this;
}

private void appendItem(byte key, byte[] bytes)
{
  ...
}
...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jihoonson, that sounds like a good way to save some space.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized that both type keys and list sizes are needed to avoid this problem. I added them to CacheKeyBuilder.

@fjy
Copy link
Copy Markdown
Contributor

fjy commented Feb 6, 2017

👍

public CacheKeyBuilder appendStringList(List<String> input)
{
for (String eachStr : input) {
appendItem(stringToByteArray(eachStr));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One technique I like for this is to write the number of things before writing the things (something like below). This has the advantage of being robust to a byte array that does actually have a 0xff in it for some reason (which byte arrays can totally have). And then separators aren't necessary.

Whether this is the approach taken or not, @drcrallen is right that ["a", "b"] and ["ab"] and ["ab", ""] need to have different cache keys.

public CacheKeyBuilder appendStrings(List<String> strings)
{
  // write number of strings, followed by each string
  appendInt(strings.size());
  for (String str : strings) {
    appendString(str);
  }
  return this;
}

public CacheKeyBuilder appendString(String string)
{
  // write string as utf-8 bytes
  return appendBytes(StringUtils.toUtf8WithNullToEmpty(input));
}

public CacheKeyBuilder appendInt(int theInt)
{
  // write int as 4 bytes
  appendItem(Ints.toByteArray(theInt));
  return this;
}

public CacheKeyBuilder appendBytes(byte[] bytes)
{
  // write number of bytes, followed by the actual bytes
  appendInt(bytes.length);
  appendItem(bytes);
  return this;
}

private void appendItem(byte[] bytes)
{
  items.add(bytes);
  size += bytes.length;
}

public byte[] build()
{
  ByteBuffer retVal = ByteBuffer.allocate(size);
  for (byte[] item : items) {
    retVal.put(item);
  }
  return retVal.array();
}

import java.nio.ByteBuffer;
import java.util.List;

public class CacheKeyBuilder
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great concept. I can't believe we didn't already have one of these!

public static final byte LONG_GREATEST = 11;
public static final byte LONG_LEAST = 12;
public static final byte MAX = 13;
public static final byte MIN = 14;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would prefer to prefix the extension-based post aggregators with the name of the extension, like HISTOGRAM_MIN rather than MIN.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

.appendCacheable(query.getGranularity())
.appendCacheable(query.getDimFilter())
.appendCacheableList(query.getAggregatorSpecs())
.appendCacheableList(query.getDimensions())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jihoonson, limitBytes and havingBytes disappeared from the cache key. Is this intentional? I guess "having" actually doesn't really need to be there, since it's applied after per-segment results are generated. So removing it should be fine. Currently the same is true for "limit" but #3873 may change that (I'm not sure).

@jon-wei, what do you think?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. It's my bad. I'll fix it.

After #3873, the group by result can be different if the limit-push-down is enabled and the ordering keys contain non-dimensions. I think this should also be considered to make the cache key.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The limit push down in #3873 is applied when merging the per-segment results, so taking the limit out of the cache key should be fine there as well

);
private static final List<AggregatorFactory> RENAMED_AGGS = Arrays.asList(
new CountAggregatorFactory("rows2"),
new CountAggregatorFactory("rows"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the idea behind this change, and the makeTopNResults -> makeTopNResultsWithoutRename change? It's not clear to me why the new test is better.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was not for a better test, but to make unit tests passed.

CachingClusteredClientTest.testTopN* tests check the cached result, and here the same post aggregator should be used except their names. So, I just changed test queries to use POST_AGG which is also used for caching like here.

To make POST_AGG work properly, the FieldAccessPostAggregator.fieldName should be the one specified in AGGS. So, I changed the name of CountAggregatorFactory from rows2 to rows.

Since there is another renamed field impers2 in RENAMED_AGG, I thought it would not be a problem in other tests for group-by with renaming. If not, please let me know.

The change in makeTopNResultsWithoutRename() is just for my convenience. If its name is verbose, I'll change it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see what's going on. There's nothing wrong with verbose methods (usually I prefer them) so that's fine.

.appendCacheable(query.getGranularity())
.appendCacheable(query.getDimensionsFilter())
.appendCacheableList(query.getAggregatorSpecs())
.appendCacheableList(query.getPostAggregatorSpecs())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Until now, just changing the post-aggregators on a topN (maybe adding new arithmetic, something like that) would still allow the cache to be used. This is a bug if we're sorting by the post-aggregator but is fine otherwise.

What do you think about changing the cache key generation so it only includes post-aggs we're sorting on (and any transitive dependencies based on getDependentFields)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I'll do.

@jihoonson
Copy link
Copy Markdown
Contributor Author

@drcrallen, @gianm thanks for your review. I addressed comments.
Additionally, I added the below feature to CacheKeyBuilder.

  • When a collection is appended to CacheKeyBuilder, that collection is sorted by its items' byte representation. This is to guarantee that collections of same items but in different orders make the same result

byteArrayList.add(byteArray);
}

// Sort the byte array list to guarantee that collections of same items but in different orders make the same result
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not always safe, for example, in arithmetic post-aggregator if the op is "minus" or "div" then order matters. If you want to allow this optimization then perhaps you could use two methods like appendCacheables vs. appendCacheablesIgnoringOrder and then callers could use the appropriate one.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Thanks. I added appendCacheablesIgnoringOrder().
I checked which one of appendCacheables and appendCacheablesIgnoringOrder should be called when building cache keys, but am not sure for SketchSetPostAggregator. Please verify it.

}

@VisibleForTesting
static Collection<PostAggregator> findPostAggregatorsForSort(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks redundant to the already existing private static List<PostAggregator> prunePostAggregators(TopNQuery query) which calls AggregatorUtil.pruneDependentPostAgg (used for similar functionality in other parts of the topN system)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I changed to use that method.

@gianm
Copy link
Copy Markdown
Contributor

gianm commented Feb 8, 2017

@jihoonson, thx for the update. had a couple further comments on the new stuff.

@jihoonson
Copy link
Copy Markdown
Contributor Author

@gianm, thanks. I addressed your comments.

I think we need more tests to verify query caching with various types of fields, aggregators, post aggregators, etc. To make the tests concrete, I think it would be better to add more classes which are specialized for each query type. These tests are parameterized with original query, renamed parameters, and parameters of different order. However, this will cause a huge change in tests. I'm not sure about adding these tests in this PR. Maybe better to add in a following PR? What do you think?

@gianm
Copy link
Copy Markdown
Contributor

gianm commented Feb 9, 2017

That test sounds great -- we don't have amazing coverage of caching and cache key right now. I think it makes sense to do it in a followup PR.

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, some minor suggestions. thx @jihoonson!

return builder.build();
}

private static boolean preserveFieldOrder(SketchHolder.Func func)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest calling this preserveFieldOrderInCacheKey for better clarity.

'}';
}

private static boolean preserveFieldOrder(Ops op)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar, suggest calling this preserveFieldOrderInCacheKey.

}

private static List<PostAggregator> prunePostAggregators(TopNQuery query)
public static List<PostAggregator> prunePostAggregators(TopNQuery query)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't need to be public, it's only called from within this file.

@jihoonson
Copy link
Copy Markdown
Contributor Author

@gianm, fixed. Thanks!

@gianm
Copy link
Copy Markdown
Contributor

gianm commented Feb 9, 2017

@drcrallen do you have any other comments?

Copy link
Copy Markdown
Contributor

@drcrallen drcrallen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of comments. Also I believe this will cause caches to be invalid during an upgrade (and pre vs post upgrade) As such it should be called out in release notes.

import java.util.Iterator;
import java.util.List;

public class CacheKeyBuilder
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add docs here for a high level overview of how this class works?

// for testing.
public static final HavingSpec NEVER = new NeverHavingSpec();
public static final HavingSpec ALWAYS = new AlwaysHavingSpec();
HavingSpec NEVER = new NeverHavingSpec();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this change?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They don't do anything for interface fields.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha

.appendCacheable(query.getDimensionsFilter())
.appendCacheablesIgnoringOrder(query.getAggregatorSpecs());

final List<PostAggregator> postAggregators = prunePostAggregators(query);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that if I change post aggregators that are not part of the sorted that I will always get a cached result?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the idea: only postaggs involved in the sorting could affect the topN by-segment results, so they're the only ones that need to be in the cache key.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, duh, because the post-aggs can be calculated from the cached results.

Are they? My mind is drawing a blank there. Do post-aggs get calculated AFTER cached results?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they are calculated after pulling from cache. But in this case, a postagg can affect the sort order of cached results, even if it's not actually in the cached results. So it still needs to be part of the topN cache key (to fix #3719).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to go look myself to make sure, io.druid.client.CachingClusteredClient#mergeCachedAndUncachedSequences and io.druid.client.CachingQueryRunner#run seem to be able to handle it but holy cow, I forgot how convoluted the data paths are for actually running a query.

@drcrallen
Copy link
Copy Markdown
Contributor

Requesting more class docs for the cache helper, but not strong enough to block the PR

@gianm
Copy link
Copy Markdown
Contributor

gianm commented Feb 13, 2017

Going to merge this one then, but @jihoonson could you please do a follow up adding the javadocs that @drcrallen mentioned to CacheKeyBuilder? thx

@gianm gianm merged commit 991e285 into apache:master Feb 13, 2017
@jihoonson
Copy link
Copy Markdown
Contributor Author

@gianm, @drcrallen thanks for your review. I opened #3933 for adding docs.

seoeun25 pushed a commit to seoeun25/incubator-druid that referenced this pull request Jan 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improper topN caching when sorting by postaggregator

6 participants