Support ingestion of long/float dimensions#3966
Conversation
cf90c60 to
1a69a20
Compare
gianm
left a comment
There was a problem hiding this comment.
@jon-wei, the patch looks good other than the comments, although the usage of makeColumnValueSelector seems a little sketchy to me.
It looks like what IncrementalIndexStorageAdapter does is that if you ask for a selector for a dimension, then based on what kind of selector you ask for, it casts the result of makeColumnValueSelector to match (so makeFloatColumnSelector casts it to FloatColumnSelector).
This seems sketchy because it should be allowable to ask for a long selector on a float column, and then you should get a selector that casts each float to a long.
Questions for you:
- Is there some reason (elsewhere in the code) that this is ok?
- Do you have tests covering this? (e.g. asking for a long selector on a float column)
- Do those tests include aggregations? (e.g. longSum on a float column of an incremental index)
- Do those tests include grouping? (e.g. grouping with outputType LONG on a float column of an incremental index)
| "dimensionsSpec" : { | ||
| "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"], | ||
| "dimensions": [ | ||
| "page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city", |
There was a problem hiding this comment.
Given the new additions, placing each of these on its own line would be more readable.
There was a problem hiding this comment.
Moved these to new lines
| "dimensions": [ | ||
| "page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city", | ||
| { | ||
| "type": "LONG", |
There was a problem hiding this comment.
Do these need to be uppercase? Lowercase is more typical for Druid options in JSON.
There was a problem hiding this comment.
They can be lowercase, made them lowercase in the docs now
| | Field | Type | Description | Required | | ||
| |-------|------|-------------|----------| | ||
| | dimensions | JSON String array | The names of the dimensions. If this is an empty array, Druid will treat all columns that are not timestamp or metric columns as dimension columns. | yes | | ||
| | dimensions | JSON Object array | A list of [dimension schema](#dimension-schema) objects or dimension names. Providing a name is equivalent to providing a String-typed dimension schema with the given name. If this is an empty array, Druid will treat all columns that are not timestamp or metric columns as String-typed dimension columns. | yes | |
There was a problem hiding this comment.
It's just a JSON array, since it can mix json strings and json objects.
There was a problem hiding this comment.
Changed to "JSON array"
| ```json | ||
| "dimensionsSpec" : { | ||
| "dimensions": [ | ||
| "page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city", |
There was a problem hiding this comment.
Similar comments to above on formatting.
There was a problem hiding this comment.
Moved dimensions to new lines
| @Override | ||
| public Indexed<Long> getSortedIndexedValues() | ||
| { | ||
| return null; |
There was a problem hiding this comment.
Is this expected to never be called? If so, throw UnsupportedOperationException rather than return null.
There was a problem hiding this comment.
Changed this and the bitmap-related method to throw UnsupportedOperationException
| @Override | ||
| public Long getMinValue() | ||
| { | ||
| return 0L; |
There was a problem hiding this comment.
Long.MIN_VALUE and Long.MAX_VALUE may work better for these two, since people may use the min/max values reported by metadata for pruning segment lists.
There was a problem hiding this comment.
Changed these to use MIN_VALUE and MAX_VALUE
| @Override | ||
| public Object convertUnsortedEncodedKeyComponentToActualArrayOrList(Long key, boolean asList) | ||
| { | ||
| return Lists.newArrayList(key); |
There was a problem hiding this comment.
ImmutableList.of may be cheaper. It just stores the key as a field.
There was a problem hiding this comment.
Changed to use ImmutableList.of
| @Override | ||
| public Float getMinValue() | ||
| { | ||
| return 0.0f; |
There was a problem hiding this comment.
Similar comment to min/max in the Long handler.
There was a problem hiding this comment.
Changed to use min/max here as well
| @Override | ||
| public Indexed<Float> getSortedIndexedValues() | ||
| { | ||
| return null; |
There was a problem hiding this comment.
Similar comment to getSortedIndexedValues in the Long handler.
There was a problem hiding this comment.
Changed this and the bitmap-related method to throw UnsupportedOperationException
| @Override | ||
| public Object convertUnsortedEncodedKeyComponentToActualArrayOrList(Float key, boolean asList) | ||
| { | ||
| return Lists.newArrayList(key); |
There was a problem hiding this comment.
Similar comment to convertUnsortedEncodedKeyComponentToActualArrayOrList in the Long handler.
There was a problem hiding this comment.
Changed to use ImmutableList.of
|
@gianm thanks, I missed that, I added a couple of tests, one in GroupBy and one in TopN that have aggregations on numeric columns ingested as dimensions, these trigger the float->long and long->float cases you mentioned in IncrementalIndexStorageAdapter I added something to makeLong/FloatColumnSelector to create wrapping selectors that cast the values returned by the base selector from the indexer if necessary |
|
@jon-wei The wrapping stuff in IISA feels a bit weird -- I think it'd be nicer for the DimensionIndexers to handle that on their own. So, that means replacing makeColumnValueSelector with makeObjectColumnSelector, makeLongColumnSelector, makeFloatColumnSelector, and makeDimensionSelector, giving full control over behavior to the type handler. The long/float handlers should probably return NullDimensionSelector for makeDimensionSelector since QISA behaves like that too. The string handler should probably return Zero-selectors for long/float since, again, QISA behaves like that. |
|
Also, the current code in your patch throws UnsupportedOperationExceptions in some cases, but I think we should avoid those and return null/zero selectors instead (like QISA). |
…t in inc index storage adapter
gianm
left a comment
There was a problem hiding this comment.
👍 latest changes look good to me
|
👍 |
| IncrementalIndexStorageAdapter.EntryHolder currEntry, IncrementalIndex.DimensionDesc desc | ||
| ) | ||
| { | ||
| return ZeroLongColumnSelector.instance(); |
There was a problem hiding this comment.
How valid is it to return zero selector here? Maybe try to convert from String to number?
There was a problem hiding this comment.
I believe QueryableIndexStorageAdapter does the same thing, so at least it's consistent with that. However, it'd be nice (for schema evolution of a column from string -> numeric) for the behavior to change to try to convert the string to a number.
| IncrementalIndexStorageAdapter.EntryHolder currEntry, IncrementalIndex.DimensionDesc desc | ||
| ) | ||
| { | ||
| return ZeroFloatColumnSelector.instance(); |
|
@licl2014 it could happen, although in that case Druid will do the best it can. If cardinality is not super high it should be fine and go quickly. If cardinality is super high then Druid will spill to disk as needed (assuming disk spilling is enabled). |
This PR adds long and float implementations of DimensionHandler/DimensionIndexer/DimensionMerger, allowing ingestion of long/float typed dimensions.
This also changes the
EncodedTypeArraytype parameter in the interfaces above toEncodedKeyComponentType, removing the restriction that the individual fields of row keys during ingestion must be arrays (to allow for single numeric values in the keys, since long/floats don't support multivalue rows)