Add DimensionSelector id -> X caches #5106
Conversation
Adds the method DimensionSelectorUtils.cacheIfPossible to aid with creation of caches for evaluating functions of dimension dictionary IDs. Uses an array cache when possible, an LRU cache when it seems like it might be a good idea, and skips caching otherwise.

The caches are used in three places in this patch:
- Cardinality aggregator (caches hashes)
- Single-dimension-input expression selector (caches expression results)
- Generic expression selector (caches name for any dimension inputs)

Benchmarking showed that both caches help when there are repeated values, but also that the LRU cache could introduce unwanted overhead for dimensions with few repeating values, so this patch implements two mitigations:
- Skip caching completely if cardinality is >90% of the row count.
- After iterating through a few multiples of the cache size, examine the cache hit rate and decide to either freeze it (the cache becomes read-only) or ignore it (the cache is cleared and future operations are uncached).

The caches aim to be <1MB per column.
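The three-way decision described above (array cache, LRU cache, or no cache) can be sketched as a standalone chooser. This is a hypothetical sketch, not Druid's actual code: the class name, method name, and the array-size cutoff are assumptions; only the ">90% of the row count" skip rule and the rough <1MB-per-column budget come from the patch description.

```java
// Hypothetical sketch of the cache-selection logic described above. The class
// and method names and the MAX_ARRAY_CACHE_ENTRIES cutoff are illustrative
// assumptions; the ">90% of row count" skip rule is from the patch description.
public class DimensionCacheChooser
{
  enum Strategy { ARRAY, LRU, NONE }

  // Assumed cutoff: ~128K object references keeps the cache's own footprint
  // in the neighborhood of the <1MB-per-column goal.
  static final int MAX_ARRAY_CACHE_ENTRIES = 131_072;

  static Strategy choose(final int cardinality, final int numRows)
  {
    if (cardinality < 0) {
      // Unknown cardinality: no way to size a cache safely, so skip it.
      return Strategy.NONE;
    }
    if (cardinality > numRows * 0.9) {
      // Cardinality is >90% of the row count: few repeats, caching would mostly miss.
      return Strategy.NONE;
    }
    if (cardinality <= MAX_ARRAY_CACHE_ENTRIES) {
      // Small dictionary: a plain array indexed by dictionary id fits the budget.
      return Strategy.ARRAY;
    }
    // Large dictionary with enough repeats: bounded LRU cache.
    return Strategy.LRU;
  }
}
```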
implements CardinalityAggregatorColumnSelectorStrategy<BaseLongColumnValueSelector>
public class LongCardinalityAggregatorColumnSelectorStrategy implements CardinalityAggregatorColumnSelectorStrategy
{
  private BaseLongColumnValueSelector selector;

  private byte[] hashOneValue(final int id)
  {
    final String value = selector.lookupName(id);
    return CardinalityAggregator.hashFn.hashUnencodedChars(nullToSpecial(value)).asBytes();
  }
Better to call the field HASH_FN.
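The point of the comment above is the constant-naming convention: a shared, immutable hash function belongs in a static final UPPER_SNAKE_CASE field. Druid uses a Guava HashFunction here; this dependency-free sketch substitutes CRC32 purely so the example compiles on its own, and the empty-string null replacement stands in for the real nullToSpecial helper.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.ToLongFunction;
import java.util.zip.CRC32;

public class HashConstantExample
{
  // The reviewer's suggestion: static final constant named HASH_FN, not hashFn.
  private static final ToLongFunction<String> HASH_FN = value -> {
    final CRC32 crc = new CRC32(); // CRC32 stands in for Druid's actual hash function
    crc.update(value.getBytes(StandardCharsets.UTF_8));
    return crc.getValue();
  };

  static long hashOneValue(final String value)
  {
    // Empty string stands in for the real nullToSpecial null handling.
    return HASH_FN.applyAsLong(value == null ? "" : value);
  }
}
```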
hits++;
}
// After a certain point, freeze or ignore the cache, based on hit rate.
Please extract this block into a method.
// After a certain point, freeze or ignore the cache, based on hit rate.
if (lookups > lookupsBeforeFreezing) {
  if (hits < lookups / 3) {
    // Hit rate < 33%
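The freeze-or-ignore logic in the diff fragment above, with the decision pulled out into its own method as the review asks, could look roughly like this. Field names (lookups, hits, lookupsBeforeFreezing) mirror the diff; the class name and the rest of the plumbing are illustrative assumptions, not Druid's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Illustrative sketch only: mirrors the diff's counters, not actual Druid code.
public class HitRateGatedCache<T> implements IntFunction<T>
{
  private final IntFunction<T> fn;
  private final long lookupsBeforeFreezing;

  private Map<Integer, T> cache = new HashMap<>();
  private long lookups;
  private long hits;
  private boolean frozen;

  public HitRateGatedCache(final IntFunction<T> fn, final long lookupsBeforeFreezing)
  {
    this.fn = fn;
    this.lookupsBeforeFreezing = lookupsBeforeFreezing;
  }

  @Override
  public T apply(final int id)
  {
    if (cache == null) {
      // Cache was "ignored": it has been cleared and all future calls are uncached.
      return fn.apply(id);
    }
    lookups++;
    T value = cache.get(id);
    if (value != null) {
      hits++;
    } else if (frozen) {
      // A frozen cache is read-only: compute without storing.
      value = fn.apply(id);
    } else {
      value = fn.apply(id);
      cache.put(id, value);
    }
    maybeFreezeOrIgnore();
    return value;
  }

  // After a certain point, freeze or ignore the cache, based on hit rate.
  private void maybeFreezeOrIgnore()
  {
    if (!frozen && lookups > lookupsBeforeFreezing) {
      if (hits < lookups / 3) {
        // Hit rate < 33%: ignore the cache from now on.
        cache = null;
      } else {
        // Decent hit rate: freeze; existing entries stay, nothing new is added.
        frozen = true;
      }
    }
  }
}
```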
@Override
public ValueMatcherColumnSelectorStrategy makeColumnSelectorStrategy(
    ColumnCapabilities capabilities, ColumnValueSelector selector
    ColumnCapabilities capabilities, ColumnValueSelector selector, int numRows
Each param should be on a separate line (and several similar places below in the PR)
 *
 * @see io.druid.segment.DimensionSelectorUtils#cacheIfPossible
 */
public class LruCacheIntFunction<T> implements IntFunction<T>
Please add a note that this class is unsafe for concurrent use.
 *
 * @see io.druid.segment.DimensionSelectorUtils#cacheIfPossible
 */
public class ArrayCacheIntFunction<T> implements IntFunction<T>
Please add a note that this class is unsafe for concurrent use.
Maybe just call it "ArrayDimensionCache" (and "LruDimensionCache") with a method like "getOrCompute()", and then return method reference in DimensionSelectorUtils.cacheIfPossible()
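The rename floated above could look like the following: a small "ArrayDimensionCache" with a getOrCompute() method, where cacheIfPossible() hands back the method reference cache::getOrCompute as the IntFunction. The class and method names are the reviewer's suggestion; the plumbing around them is an illustrative sketch, not actual Druid code.

```java
import java.util.function.IntFunction;

// Sketch of the suggested "ArrayDimensionCache" shape; illustrative only.
public class ArrayDimensionCache<T>
{
  private final Object[] values;
  private final boolean[] computed;
  private final IntFunction<T> fn;

  public ArrayDimensionCache(final int cardinality, final IntFunction<T> fn)
  {
    this.values = new Object[cardinality];
    this.computed = new boolean[cardinality];
    this.fn = fn;
  }

  @SuppressWarnings("unchecked")
  public T getOrCompute(final int id)
  {
    if (!computed[id]) {
      // A separate flag array lets the cache hold genuinely-null results.
      values[id] = fn.apply(id);
      computed[id] = true;
    }
    return (T) values[id];
  }

  // What cacheIfPossible() could return: simply a method reference to the cache.
  public static <T> IntFunction<T> cacheIfPossible(final int cardinality, final IntFunction<T> fn)
  {
    return new ArrayDimensionCache<>(cardinality, fn)::getOrCompute;
  }
}
```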
lookups++;
T value = cache.getAndMoveToFirst(id);
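The getAndMoveToFirst call in the diff line above comes from fastutil's Int2ObjectLinkedOpenHashMap: a hit moves the entry to the front of the iteration order so eviction can drop entries from the other end. The same access-order behavior can be sketched with the JDK's LinkedHashMap; the class name and sizes here are illustrative, not the patch's actual LRU implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Illustrative JDK-only stand-in for the fastutil-based LRU in the patch.
public class LruSketch<T> implements IntFunction<T>
{
  private final IntFunction<T> fn;
  private final LinkedHashMap<Integer, T> cache;

  public LruSketch(final int maxSize, final IntFunction<T> fn)
  {
    this.fn = fn;
    // accessOrder = true: get() moves an entry to the most-recent position,
    // mirroring getAndMoveToFirst; removeEldestEntry evicts the least recent.
    this.cache = new LinkedHashMap<Integer, T>(16, 0.75f, true)
    {
      @Override
      protected boolean removeEldestEntry(final Map.Entry<Integer, T> eldest)
      {
        return size() > maxSize;
      }
    };
  }

  @Override
  public T apply(final int id)
  {
    T value = cache.get(id);
    if (value == null) {
      value = fn.apply(id);
      cache.put(id, value);
    }
    return value;
  }
}
```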
{
  // After a certain point, freeze or ignore the cache, based on hit rate. The idea is that hopefully by then,
  // LRU should have some reasonable values in the cache if there are reasonable values to be found.
  private static final int CACHE_FREEZE_FACTOR = 10;
How many passes over a column are done in total?
}
return singleValueEvalCache.apply(row.get(0));
} else {
  // Treat non-singly-valued rows as nulls, just like ExpressionSelectors.supplierFromDimensionSelector.
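The branch in the diff fragment above reduces to a simple rule: singly-valued rows go through the per-id cache, anything else is treated as null. A minimal sketch (method and parameter names are illustrative, not the selector's actual code):

```java
import java.util.List;
import java.util.function.IntFunction;

// Illustrative sketch of the single-value-or-null branch shown in the diff.
public class SingleValueRowEval
{
  public static <T> T eval(final List<Integer> row, final IntFunction<T> singleValueEvalCache)
  {
    if (row.size() == 1) {
      return singleValueEvalCache.apply(row.get(0));
    } else {
      // Treat non-singly-valued rows as nulls, just like
      // ExpressionSelectors.supplierFromDimensionSelector.
      return null;
    }
  }
}
```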
|
Also please fix the "unused declaration" IntelliJ inspection failure.
|
Closing this, I don't have bandwidth to continue working on it at this time, and it conflicts with #6794 which is more important to me anyway. Will have to consider reviving in the future.