Add PostAggregators to generator cache keys for top-n queries #3899
gianm merged 13 commits into apache:master from
Conversation
import static org.junit.Assert.assertTrue;

@RunWith(Parameterized.class)
public class CacheKeyBuilderTest
Can you add a test for the following:
Item 1:
String1 : a
String2 : b
Item 2:
String 1: ab
I added tests. Thanks.
public CacheKeyBuilder appendCacheableList(List<? extends Cacheable> inputs)
{
  for (Cacheable input : inputs) {
Can the list be null in any case? We might as well just add a null check here.
Thanks. I don't see any case now, but it will be helpful for future uses. I'll add it.
public CacheKeyBuilder appendStringList(List<String> input)
{
  for (String eachStr : input) {
    appendItem(stringToByteArray(eachStr));
Ok, I'll be more explicit here.
If you call
appendStringList(ImmutableList.of("a", "b")); in one cache key builder, and
appendStringList(ImmutableList.of("ab")) in another, they will both evaluate (incorrectly) to the same cache key, if I'm reading the code correctly. All strings should be terminated with a byte that is NOT a valid UTF-8 character start (like 0xFF).
One technique I like for this is to write the number of things before writing the things (something like below). This has the advantage of being robust to a byte array that does actually have a 0xff in it for some reason (which byte arrays can totally have). And then separators aren't necessary.
Whether this is the approach taken or not, @drcrallen is right that ["a", "b"] and ["ab"] and ["ab", ""] need to have different cache keys.
public CacheKeyBuilder appendStrings(List<String> strings)
{
// write number of strings, followed by each string
appendInt(strings.size());
for (String str : strings) {
appendString(str);
}
return this;
}
public CacheKeyBuilder appendString(String string)
{
// write string as utf-8 bytes
return appendBytes(StringUtils.toUtf8WithNullToEmpty(string));
}
public CacheKeyBuilder appendInt(int theInt)
{
// write int as 4 bytes
appendItem(Ints.toByteArray(theInt));
return this;
}
public CacheKeyBuilder appendBytes(byte[] bytes)
{
// write number of bytes, followed by the actual bytes
appendInt(bytes.length);
appendItem(bytes);
return this;
}
private void appendItem(byte[] bytes)
{
items.add(bytes);
size += bytes.length;
}
public byte[] build()
{
ByteBuffer retVal = ByteBuffer.allocate(size);
for (byte[] item : items) {
retVal.put(item);
}
return retVal.array();
}
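For illustration only, here is a self-contained sketch of the collision and of the length-prefixing fix described above (SketchKeyBuilder and its methods are hypothetical names, not the actual Druid classes):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: each string is written as (length, utf-8 bytes) and the
// list is preceded by its element count, so ["a", "b"], ["ab"], and ["ab", ""]
// can no longer collide.
class SketchKeyBuilder
{
  private final List<byte[]> items = new ArrayList<>();
  private int size = 0;

  SketchKeyBuilder appendStrings(List<String> strings)
  {
    appendInt(strings.size());                        // number of strings
    for (String s : strings) {
      byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
      appendInt(bytes.length);                        // number of bytes
      appendItem(bytes);                              // the actual bytes
    }
    return this;
  }

  SketchKeyBuilder appendInt(int n)
  {
    appendItem(ByteBuffer.allocate(Integer.BYTES).putInt(n).array());
    return this;
  }

  private void appendItem(byte[] bytes)
  {
    items.add(bytes);
    size += bytes.length;
  }

  byte[] build()
  {
    final ByteBuffer buf = ByteBuffer.allocate(size);
    for (byte[] item : items) {
      buf.put(item);
    }
    return buf.array();
  }

  public static void main(String[] args)
  {
    byte[] k1 = new SketchKeyBuilder().appendStrings(Arrays.asList("a", "b")).build();
    byte[] k2 = new SketchKeyBuilder().appendStrings(Arrays.asList("ab")).build();
    byte[] k3 = new SketchKeyBuilder().appendStrings(Arrays.asList("ab", "")).build();
    // With counts and lengths encoded, all three keys are distinct.
    System.out.println(!Arrays.equals(k1, k2) && !Arrays.equals(k1, k3) && !Arrays.equals(k2, k3));
  }
}
```

Without the count and length prefixes, all three calls would serialize to the same two bytes.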
@drcrallen, thanks. I'll add tests.
@gianm, thank you for the good suggestion. I'm concerned with the cache key size. How about adding some keys for each type like below?
static final byte BYTE_KEY = 0;
static final byte BYTE_ARRAY_KEY = 1;
...
static final byte INT_KEY = 10;
static final byte INT_ARRAY_KEY = 11;
...
public CacheKeyBuilder appendByte(byte input)
{
appendItem(BYTE_KEY, byteToByteArray(input));
return this;
}
public CacheKeyBuilder appendByteArray(byte[] input)
{
appendItem(BYTE_ARRAY_KEY, input);
return this;
}
private void appendItem(byte key, byte[] bytes)
{
...
}
...
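A minimal sketch of the type-prefix idea (the class here is hypothetical, loosely following the snippet above): tagging every item with a one-byte type id keeps, for example, a single byte and a one-element byte array from producing the same key, at a cost of one byte per item rather than a four-byte length.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical sketch of type-prefixed cache keys: each item is preceded by a
// one-byte type id, so values of different types cannot collide.
class TypedKeyBuilder
{
  static final byte BYTE_KEY = 0;
  static final byte BYTE_ARRAY_KEY = 1;

  private final ByteArrayOutputStream out = new ByteArrayOutputStream();

  TypedKeyBuilder appendByte(byte input)
  {
    appendItem(BYTE_KEY, new byte[]{input});
    return this;
  }

  TypedKeyBuilder appendByteArray(byte[] input)
  {
    appendItem(BYTE_ARRAY_KEY, input);
    return this;
  }

  private void appendItem(byte key, byte[] bytes)
  {
    out.write(key);
    out.write(bytes, 0, bytes.length);
  }

  byte[] build()
  {
    return out.toByteArray();
  }

  public static void main(String[] args)
  {
    byte[] k1 = new TypedKeyBuilder().appendByte((byte) 1).build();
    byte[] k2 = new TypedKeyBuilder().appendByteArray(new byte[]{1}).build();
    // The type ids make the two keys differ even though the payloads are equal.
    System.out.println(!Arrays.equals(k1, k2));
  }
}
```

As the thread goes on to note, type ids alone do not solve the list-boundary problem; list sizes are still needed as well.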
@jihoonson, that sounds like a good way to save some space.
I realized that both type keys and list sizes are needed to avoid this problem. I added them to CacheKeyBuilder.
👍
import java.nio.ByteBuffer;
import java.util.List;

public class CacheKeyBuilder
This is a great concept. I can't believe we didn't already have one of these!
public static final byte LONG_GREATEST = 11;
public static final byte LONG_LEAST = 12;
public static final byte MAX = 13;
public static final byte MIN = 14;
Would prefer to prefix the extension-based post aggregators with the name of the extension, like HISTOGRAM_MIN rather than MIN.
.appendCacheable(query.getGranularity())
.appendCacheable(query.getDimFilter())
.appendCacheableList(query.getAggregatorSpecs())
.appendCacheableList(query.getDimensions())
@jihoonson, limitBytes and havingBytes disappeared from the cache key. Is this intentional? I guess "having" actually doesn't really need to be there, since it's applied after per-segment results are generated. So removing it should be fine. Currently the same is true for "limit" but #3873 may change that (I'm not sure).
@jon-wei, what do you think?
Thanks. It's my bad. I'll fix it.
After #3873, the group-by result can be different if limit push-down is enabled and the ordering keys contain non-dimensions. I think this should also be considered when making the cache key.
The limit push down in #3873 is applied when merging the per-segment results, so taking the limit out of the cache key should be fine there as well
);
private static final List<AggregatorFactory> RENAMED_AGGS = Arrays.asList(
    new CountAggregatorFactory("rows2"),
    new CountAggregatorFactory("rows"),
What's the idea behind this change, and the makeTopNResults -> makeTopNResultsWithoutRename change? It's not clear to me why the new test is better.
This was not for a better test, but to make the unit tests pass.
The CachingClusteredClientTest.testTopN* tests check the cached result, and the same post aggregator (apart from its name) should be used in both places. So, I changed the test queries to use POST_AGG, which is also the one used for caching.
To make POST_AGG work properly, the FieldAccessPostAggregator.fieldName should be the one specified in AGGS, so I changed the name of the CountAggregatorFactory from rows2 to rows.
Since there is another renamed field, impers2, in RENAMED_AGGS, I thought it would not be a problem for the other group-by tests with renaming. If that's not the case, please let me know.
The change to makeTopNResultsWithoutRename() is just for my convenience. If its name is too verbose, I'll change it.
Ah I see what's going on. There's nothing wrong with verbose methods (usually I prefer them) so that's fine.
…-key-for-post-aggregator
.appendCacheable(query.getGranularity())
.appendCacheable(query.getDimensionsFilter())
.appendCacheableList(query.getAggregatorSpecs())
.appendCacheableList(query.getPostAggregatorSpecs())
Until now, just changing the post-aggregators on a topN (maybe adding new arithmetic, something like that) would still allow the cache to be used. This is a bug if we're sorting by the post-aggregator but is fine otherwise.
What do you think about changing the cache key generation so it only includes post-aggs we're sorting on (and any transitive dependencies based on getDependentFields)?
Sounds good. I'll do that.
@drcrallen, @gianm thanks for your review. I addressed comments.
byteArrayList.add(byteArray);
}

// Sort the byte array list to guarantee that collections of same items but in different orders make the same result
This is not always safe, for example, in arithmetic post-aggregator if the op is "minus" or "div" then order matters. If you want to allow this optimization then perhaps you could use two methods like appendCacheables vs. appendCacheablesIgnoringOrder and then callers could use the appropriate one.
Good catch. Thanks. I added appendCacheablesIgnoringOrder().
I checked which one of appendCacheables and appendCacheablesIgnoringOrder should be called when building cache keys, but am not sure for SketchSetPostAggregator. Please verify it.
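A sketch of what order-insensitive appending can look like (hypothetical helper, not the actual CacheKeyBuilder code): serialize each element first, sort the resulting byte arrays lexicographically, then concatenate, so lists with the same elements in different orders yield the same key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of appendCacheablesIgnoringOrder: the per-element keys
// are sorted before concatenation, so element order no longer matters.
class OrderInsensitiveKeys
{
  static byte[] concatSorted(List<byte[]> keys)
  {
    List<byte[]> sorted = new ArrayList<>(keys);
    sorted.sort(OrderInsensitiveKeys::compareBytes);
    int size = 0;
    for (byte[] k : sorted) {
      size += k.length;
    }
    byte[] out = new byte[size];
    int pos = 0;
    for (byte[] k : sorted) {
      System.arraycopy(k, 0, out, pos, k.length);
      pos += k.length;
    }
    return out;
  }

  // Unsigned lexicographic comparison of byte arrays.
  static int compareBytes(byte[] a, byte[] b)
  {
    for (int i = 0; i < Math.min(a.length, b.length); i++) {
      int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(a.length, b.length);
  }

  public static void main(String[] args)
  {
    byte[] k1 = concatSorted(Arrays.asList(new byte[]{2}, new byte[]{1}));
    byte[] k2 = concatSorted(Arrays.asList(new byte[]{1}, new byte[]{2}));
    System.out.println(Arrays.equals(k1, k2)); // same key regardless of order
  }
}
```

As the comment above points out, this is only safe where element order genuinely doesn't affect the result (e.g. not for a "minus" or "div" arithmetic post-aggregator).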
}

@VisibleForTesting
static Collection<PostAggregator> findPostAggregatorsForSort(
This looks redundant with the already existing private static List<PostAggregator> prunePostAggregators(TopNQuery query), which calls AggregatorUtil.pruneDependentPostAgg (used for similar functionality in other parts of the topN system).
Thanks. I changed it to use that method.
@jihoonson, thx for the update. had a couple further comments on the new stuff.

@gianm, thanks. I addressed your comments. I think we need more tests to verify query caching with various types of fields, aggregators, post aggregators, etc. To make the tests concrete, I think it would be better to add more classes specialized for each query type, parameterized with the original query, renamed parameters, and parameters in a different order. However, this would be a huge change to the tests, so I'm not sure about adding them in this PR. Maybe it would be better to add them in a follow-up PR? What do you think?

That test sounds great -- we don't have amazing coverage of caching and cache keys right now. I think it makes sense to do it in a follow-up PR.
gianm
left a comment
LGTM, some minor suggestions. thx @jihoonson!
return builder.build();
}

private static boolean preserveFieldOrder(SketchHolder.Func func)
Suggest calling this preserveFieldOrderInCacheKey for better clarity.
'}';
}

private static boolean preserveFieldOrder(Ops op)
Similar, suggest calling this preserveFieldOrderInCacheKey.
}

private static List<PostAggregator> prunePostAggregators(TopNQuery query)
public static List<PostAggregator> prunePostAggregators(TopNQuery query)
This doesn't need to be public, it's only called from within this file.
@gianm, fixed. Thanks!

@drcrallen do you have any other comments?
…-key-for-post-aggregator
drcrallen
left a comment
Couple of comments. Also, I believe this will cause caches to be invalidated during an upgrade (and pre- vs post-upgrade), so it should be called out in the release notes.
import java.util.Iterator;
import java.util.List;

public class CacheKeyBuilder
Can you add docs here for a high level overview of how this class works?
// for testing.
public static final HavingSpec NEVER = new NeverHavingSpec();
public static final HavingSpec ALWAYS = new AlwaysHavingSpec();
HavingSpec NEVER = new NeverHavingSpec();

The public static final modifiers don't do anything for interface fields.
.appendCacheable(query.getDimensionsFilter())
.appendCacheablesIgnoringOrder(query.getAggregatorSpecs());

final List<PostAggregator> postAggregators = prunePostAggregators(query);
Does this mean that if I change post aggregators that are not part of the sort, I will always get a cached result?
That's the idea: only postaggs involved in the sorting could affect the topN by-segment results, so they're the only ones that need to be in the cache key.
oh, duh, because the post-aggs can be calculated from the cached results.
Are they? My mind is drawing a blank there. Do post-aggs get calculated AFTER cached results?
Yes, they are calculated after pulling from cache. But in this case, a postagg can affect the sort order of cached results, even if it's not actually in the cached results. So it still needs to be part of the topN cache key (to fix #3719).
Had to go look myself to make sure, io.druid.client.CachingClusteredClient#mergeCachedAndUncachedSequences and io.druid.client.CachingQueryRunner#run seem to be able to handle it but holy cow, I forgot how convoluted the data paths are for actually running a query.
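The pruning discussed above can be sketched generically (hypothetical names; the real code uses AggregatorUtil.pruneDependentPostAgg): keep only the post-aggregators that the sort metric depends on, walking the list backwards so that transitive dependencies between post-aggregators are picked up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a "post-agg" is modeled as a name plus the set of field
// names it depends on. Only post-aggs reachable from the sort metric are kept.
class PostAggPruner
{
  static List<String> prune(LinkedHashMap<String, Set<String>> postAggs, String sortMetric)
  {
    Set<String> needed = new HashSet<>();
    needed.add(sortMetric);
    List<String> keptReversed = new ArrayList<>();
    // Walk in reverse: later post-aggs may reference earlier ones.
    List<String> names = new ArrayList<>(postAggs.keySet());
    Collections.reverse(names);
    for (String name : names) {
      if (needed.contains(name)) {
        needed.addAll(postAggs.get(name));
        keptReversed.add(name);
      }
    }
    Collections.reverse(keptReversed);
    return keptReversed;
  }

  public static void main(String[] args)
  {
    LinkedHashMap<String, Set<String>> postAggs = new LinkedHashMap<>();
    postAggs.put("avg", new HashSet<>(Arrays.asList("sum", "count")));
    postAggs.put("avgSquared", new HashSet<>(Arrays.asList("avg")));
    postAggs.put("unused", new HashSet<>(Arrays.asList("other")));
    // Sorting on "avgSquared" keeps it and its transitive dependency "avg",
    // but drops "unused" from the cache key.
    System.out.println(PostAggPruner.prune(postAggs, "avgSquared")); // prints [avg, avgSquared]
  }
}
```

This is the property being discussed: post-aggs outside the sort chain are recomputed after the cache is read, so they can be left out of the key without serving stale results.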
Requesting more class docs for the cache helper, but not strong enough to block the PR.

Going to merge this one then, but @jihoonson could you please do a follow up adding the javadocs that @drcrallen mentioned to CacheKeyBuilder? thx

@gianm, @drcrallen thanks for your review. I opened #3933 for adding docs.
This PR is to fix #3719.