Skip to content

Add DimensionSelector id -> X caches.#5106

Closed
gianm wants to merge 3 commits intoapache:masterfrom
gianm:dimension-id-eval-caches
Closed

Add DimensionSelector id -> X caches.#5106
gianm wants to merge 3 commits intoapache:masterfrom
gianm:dimension-id-eval-caches

Conversation

@gianm
Copy link
Copy Markdown
Contributor

@gianm gianm commented Nov 20, 2017

Adds the method DimensionSelectorUtils.cacheIfPossible to aid with
creation of caches for evaluating functions of dimension dictionary IDs.
Uses an array cache when possible, an LRU cache when it seems like it
might be a good idea, and skips caching otherwise.

The caches are used in three places in this patch:

  • Cardinality aggregator (caches hashes)
  • Single-dimension-input expression selector (caches expression results)
  • Generic expression selector (caches name for any dimension inputs)

Benchmarking showed that both caches help when there are repeated values,
but also the LRU cache could introduce unwanted overhead for dimensions
with few repeating values, so this patch implements two mitigations:

  • Skip caching completely if cardinality is >90% of the row count.
  • After iterating through a few multiples of the cache size, examine
    the cache hit rate and decide to either freeze it (the cache becomes
    read-only) or ignore it (the cache is cleared and future operations
    are uncached).

The caches aim to be 1MB or less per column.

Adds the method DimensionSelectorUtils.cacheIfPossible to aid with
creation of caches for evaluating functions of dimension dictionary IDs.
Uses an array cache when possible, an LRU cache when it seems like it
might be a good idea, and skips caching otherwise.

The caches are used in three places in this patch:

- Cardinality aggregator (caches hashes)
- Single-dimension-input expression selector (caches expression results)
- Generic expression selector (caches name for any dimension inputs)

Benchmarking showed that the LRU cache could introduce unwanted overhead
for dimensions with few repeating values, so this patch implements two
mitigations:

- Skip caching completely if cardinality is >90% of the row count.
- After iterating through a few multiples of the cache size, examine
  the cache hit rate and decide to either freeze it (the cache becomes
  read-only) or ignore it (the cache is cleared and future operations
  are uncached).

The caches aim to be <1MB per column.
implements CardinalityAggregatorColumnSelectorStrategy<BaseLongColumnValueSelector>
public class LongCardinalityAggregatorColumnSelectorStrategy implements CardinalityAggregatorColumnSelectorStrategy
{
private BaseLongColumnValueSelector selector;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Field could be final

private byte[] hashOneValue(final int id)
{
final String value = selector.lookupName(id);
return CardinalityAggregator.hashFn.hashUnencodedChars(nullToSpecial(value)).asBytes();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better the field called HASH_FN

hits++;
}

// After a certain point, freeze or ignore the cache, based on hit rate.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please extract this block in a method

// After a certain point, freeze or ignore the cache, based on hit rate.
if (lookups > lookupsBeforeFreezing) {
if (hits < lookups / 3) {
// Hit rate < 33%
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any rational behind 33%?

@Override
public ValueMatcherColumnSelectorStrategy makeColumnSelectorStrategy(
ColumnCapabilities capabilities, ColumnValueSelector selector
ColumnCapabilities capabilities, ColumnValueSelector selector, int numRows
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each param should be on a separate line (and several similar places below in the PR)

*
* @see io.druid.segment.DimensionSelectorUtils#cacheIfPossible
*/
public class LruCacheIntFunction<T> implements IntFunction<T>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a not that this class is unsafe for concurrent use.

*
* @see io.druid.segment.DimensionSelectorUtils#cacheIfPossible
*/
public class ArrayCacheIntFunction<T> implements IntFunction<T>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a note that this class is unsafe for concurrent use.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just call it "ArrayDimensionCache" (and "LruDimensionCache") with a method like "getOrCompute()", and then return method reference in DimensionSelectorUtils.cacheIfPossible()


lookups++;

T value = cache.getAndMoveToFirst(id);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference: vigna/fastutil#86

{
// After a certain point, freeze or ignore the cache, based on hit rate. The idea is that hopefully by then,
// LRU should have some reasonable values in the cache if there are reasonable values to be found.
private static final int CACHE_FREEZE_FACTOR = 10;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many passes over a column are done in total?

}
return singleValueEvalCache.apply(row.get(0));
} else {
// Tread non-singly-valued rows as nulls, just like ExpressionSelectors.supplierFromDimensionSelector.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Treat

@leventov
Copy link
Copy Markdown
Member

Also please fix "unused declaration", IntelliJ inspection fails

@fjy fjy closed this Jan 12, 2018
@fjy fjy reopened this Jan 12, 2018
@gianm gianm added the WIP label Aug 26, 2018
@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Jan 8, 2019

Closing this, I don't have bandwidth to continue working on it at this time, and it conflicts with #6794 which is more important to me anyway. Will have to consider reviving in the future.

@gianm gianm closed this Jan 8, 2019
@gianm gianm deleted the dimension-id-eval-caches branch January 8, 2019 18:41
@gianm gianm restored the dimension-id-eval-caches branch January 8, 2019 18:41
@gianm gianm deleted the dimension-id-eval-caches branch September 23, 2022 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants