Support ingestion of long/float dimensions #3966

Merged
fjy merged 15 commits into apache:master from jon-wei:ingestion_types
Mar 1, 2017

@jon-wei jon-wei commented Feb 23, 2017

This PR adds long and float implementations of DimensionHandler/DimensionIndexer/DimensionMerger, allowing ingestion of long/float typed dimensions.

This also changes the EncodedTypeArray type parameter in the interfaces above to EncodedKeyComponentType, removing the restriction that the individual fields of row keys during ingestion must be arrays. This allows single numeric values in the keys, since long/float dimensions don't support multi-value rows.
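
To illustrate the type-parameter change, here is a minimal sketch. `SketchIndexer`, its `encode` method, and both implementations are hypothetical simplifications invented for this example; they are not the DimensionIndexer interface from the patch. The point is only that a string indexer can keep an array-valued key component while a long indexer uses a single boxed value.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an indexer interface, generic in the key component type.
interface SketchIndexer<EncodedKeyComponentType> {
    // Converts one raw input value into the component stored in the row key.
    EncodedKeyComponentType encode(Object rawValue);
}

// String dimensions may be multi-value, so the key component is an array of dictionary ids.
class SketchStringIndexer implements SketchIndexer<int[]> {
    private final Map<String, Integer> dictionary = new HashMap<>();

    @Override
    public int[] encode(Object rawValue) {
        // Treat a List as a multi-value row and anything else as a single value.
        List<?> values = rawValue instanceof List
                ? (List<?>) rawValue
                : Collections.singletonList(rawValue);
        int[] ids = new int[values.size()];
        for (int i = 0; i < ids.length; i++) {
            String s = String.valueOf(values.get(i));
            Integer id = dictionary.get(s);
            if (id == null) {
                // Assign the next free dictionary id to an unseen value.
                id = dictionary.size();
                dictionary.put(s, id);
            }
            ids[i] = id;
        }
        return ids;
    }
}

// Long dimensions are single-value, so the key component is just one Long.
class SketchLongIndexer implements SketchIndexer<Long> {
    @Override
    public Long encode(Object rawValue) {
        if (rawValue instanceof Number) {
            return ((Number) rawValue).longValue();
        }
        return Long.parseLong(String.valueOf(rawValue));
    }
}
```

With an array-only type parameter, the long indexer would have had to wrap every value in a one-element array.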

@jon-wei jon-wei requested a review from gianm February 23, 2017 02:12
@leventov leventov self-assigned this Feb 25, 2017

@gianm gianm left a comment

@jon-wei, the patch looks good other than the comments, although the usage of makeColumnValueSelector seems a little sketchy to me.

It looks like what IncrementalIndexStorageAdapter does is that if you ask for a selector for a dimension, then based on what kind of selector you ask for, it casts the result of makeColumnValueSelector to match (so makeFloatColumnSelector casts it to FloatColumnSelector).

This seems sketchy because it should be allowable to ask for a long selector on a float column, and then you should get a selector that casts each float to a long.

Questions for you:

  • Is there some reason (elsewhere in the code) that this is ok?
  • Do you have tests covering this? (e.g. asking for a long selector on a float column)
  • Do those tests include aggregations? (e.g. longSum on a float column of an incremental index)
  • Do those tests include grouping? (e.g. grouping with outputType LONG on a float column of an incremental index)
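
The concern above can be reproduced in isolation. The sketch below is an assumption-laden illustration, not code from the patch: `SketchLongSelector`, `SketchFloatSelector`, and `BlindCastSketch` are hypothetical names. It shows what would happen if the native selector were cast unconditionally to the requested selector type.

```java
// Hypothetical selector interfaces, invented for this sketch.
interface SketchLongSelector { long get(); }
interface SketchFloatSelector { float get(); }

class BlindCastSketch {
    // Stand-in for a method that returns whatever selector type the column natively has.
    static Object makeColumnValueSelector(float value) {
        return (SketchFloatSelector) () -> value;
    }

    // The pattern in question: cast the native selector to the requested type.
    static SketchLongSelector makeLongColumnSelector(float value) {
        return (SketchLongSelector) makeColumnValueSelector(value);
    }

    // Returns true if asking for a long selector on a float column throws.
    static boolean longSelectorOnFloatColumnFails() {
        try {
            makeLongColumnSelector(1.5f);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }
}
```

The cast compiles because the source type is `Object`, so the mismatch only surfaces at query time.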

Comment thread docs/content/ingestion/index.md Outdated
"dimensionsSpec" : {
"dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
"dimensions": [
"page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city",
Contributor

Given the new additions, placing each of these on its own line would be more readable.

Contributor Author

Moved these to new lines

Comment thread docs/content/ingestion/index.md Outdated
"dimensions": [
"page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city",
{
"type": "LONG",
Contributor

Do these need to be uppercase? Lowercase is more typical for Druid options in JSON.

Contributor Author

They can be lowercase. I've made them lowercase in the docs now.

Comment thread docs/content/ingestion/index.md Outdated
| Field | Type | Description | Required |
|-------|------|-------------|----------|
| dimensions | JSON String array | The names of the dimensions. If this is an empty array, Druid will treat all columns that are not timestamp or metric columns as dimension columns. | yes |
| dimensions | JSON Object array | A list of [dimension schema](#dimension-schema) objects or dimension names. Providing a name is equivalent to providing a String-typed dimension schema with the given name. If this is an empty array, Druid will treat all columns that are not timestamp or metric columns as String-typed dimension columns. | yes |
Contributor

It's just a JSON array, since it can mix JSON strings and JSON objects.

Contributor Author

Changed to "JSON array"

Comment thread docs/content/ingestion/index.md Outdated
```json
"dimensionsSpec" : {
"dimensions": [
"page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city",
Contributor

Similar comments to above on formatting.

Contributor Author

Moved dimensions to new lines

@Override
public Indexed<Long> getSortedIndexedValues()
{
return null;
Contributor

Is this expected to never be called? If so, throw UnsupportedOperationException rather than return null.

Contributor Author

Changed this and the bitmap-related method to throw UnsupportedOperationException

@Override
public Long getMinValue()
{
return 0L;
Contributor

Long.MIN_VALUE and Long.MAX_VALUE may work better for these two, since people may use the min/max values reported by metadata for pruning segment lists.

Contributor Author

Changed these to use MIN_VALUE and MAX_VALUE
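
A small sketch of why the widest possible range is the safer placeholder. The pruning rule here is a hypothetical one written for this example (`MinMaxPruningSketch.canSkipSegment` is not a Druid method), and it assumes both the min and max placeholders were 0 before the change.

```java
class MinMaxPruningSketch {
    // Hypothetical pruning rule: a segment can be skipped only if its reported
    // [min, max] range cannot contain the value the query filters on.
    static boolean canSkipSegment(long reportedMin, long reportedMax, long filterValue) {
        return filterValue < reportedMin || filterValue > reportedMax;
    }
}
```

Under this rule, a placeholder range of [0, 0] would skip a segment when filtering on 42 even if the segment contains 42, while [Long.MIN_VALUE, Long.MAX_VALUE] never causes a segment to be skipped incorrectly.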

@Override
public Object convertUnsortedEncodedKeyComponentToActualArrayOrList(Long key, boolean asList)
{
return Lists.newArrayList(key);
Contributor

ImmutableList.of may be cheaper. It just stores the key as a field.

Contributor Author

Changed to use ImmutableList.of

@Override
public Float getMinValue()
{
return 0.0f;
Contributor

Similar comment to min/max in the Long handler.

Contributor Author

Changed to use min/max here as well

@Override
public Indexed<Float> getSortedIndexedValues()
{
return null;
Contributor

Similar comment to getSortedIndexedValues in the Long handler.

Contributor Author

Changed this and the bitmap-related method to throw UnsupportedOperationException

@Override
public Object convertUnsortedEncodedKeyComponentToActualArrayOrList(Float key, boolean asList)
{
return Lists.newArrayList(key);
Contributor

Similar comment to convertUnsortedEncodedKeyComponentToActualArrayOrList in the Long handler.

Contributor Author

Changed to use ImmutableList.of

jon-wei commented Feb 28, 2017

@gianm
Re: your comment on makeColumnValueSelector

Thanks, I missed that. I added a couple of tests, one in GroupBy and one in TopN, that have aggregations on numeric columns ingested as dimensions. These trigger the float->long and long->float cases you mentioned in IncrementalIndexStorageAdapter.

I added something to makeLong/FloatColumnSelector to create wrapping selectors that cast the values returned by the base selector from the indexer if necessary.
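
For readers following along, the wrapping approach might look roughly like the sketch below. It is not the code in the patch: the selector interfaces and `WrappingSelectorSketch` are hypothetical, simplified names, and the base selector is passed in as a plain `Object`.

```java
// Hypothetical selector interfaces, invented for this sketch.
interface SketchLongColumnSelector { long get(); }
interface SketchFloatColumnSelector { float get(); }

class WrappingSelectorSketch {
    // Returns the base selector as-is when it is already a long selector;
    // otherwise wraps a float selector and casts each value to long.
    static SketchLongColumnSelector makeLongColumnSelector(Object base) {
        if (base instanceof SketchLongColumnSelector) {
            return (SketchLongColumnSelector) base;
        }
        if (base instanceof SketchFloatColumnSelector) {
            SketchFloatColumnSelector floats = (SketchFloatColumnSelector) base;
            return () -> (long) floats.get();
        }
        throw new IllegalArgumentException("Unsupported base selector: " + base);
    }

    // Mirror image of the method above for float selectors.
    static SketchFloatColumnSelector makeFloatColumnSelector(Object base) {
        if (base instanceof SketchFloatColumnSelector) {
            return (SketchFloatColumnSelector) base;
        }
        if (base instanceof SketchLongColumnSelector) {
            SketchLongColumnSelector longs = (SketchLongColumnSelector) base;
            return () -> (float) longs.get();
        }
        throw new IllegalArgumentException("Unsupported base selector: " + base);
    }
}
```

The float-to-long cast truncates toward zero, so a float value of 3.9 is read as 3 through the long selector.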

@jon-wei jon-wei closed this Feb 28, 2017
@jon-wei jon-wei reopened this Feb 28, 2017

gianm commented Feb 28, 2017

@jon-wei The wrapping stuff in IISA (IncrementalIndexStorageAdapter) feels a bit weird. I think it'd be nicer for the DimensionIndexers to handle that on their own. That means replacing makeColumnValueSelector with makeObjectColumnSelector, makeLongColumnSelector, makeFloatColumnSelector, and makeDimensionSelector, giving full control over behavior to the type handler.

The long/float handlers should probably return NullDimensionSelector for makeDimensionSelector since QISA (QueryableIndexStorageAdapter) behaves like that too.

The string handler should probably return Zero-selectors for long/float since, again, QISA behaves like that.
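
The shape of this proposal could be sketched as follows. Everything here is a hypothetical simplification for illustration: `SketchTypeHandler`, the `Handler*Selector` interfaces, and both handler classes are invented names, and the "null dimension selector" is modeled as a selector that always yields null.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical selector interfaces, invented for this sketch.
interface HandlerLongSelector { long get(); }
interface HandlerFloatSelector { float get(); }
interface HandlerDimensionSelector { String lookupValue(); }

// One factory method per selector kind, so each type handler decides its own behavior.
interface SketchTypeHandler {
    Supplier<Object> makeObjectColumnSelector();
    HandlerLongSelector makeLongColumnSelector();
    HandlerFloatSelector makeFloatColumnSelector();
    HandlerDimensionSelector makeDimensionSelector();
}

class SketchLongHandler implements SketchTypeHandler {
    private final LongSupplier currentValue;

    SketchLongHandler(LongSupplier currentValue) { this.currentValue = currentValue; }

    @Override public Supplier<Object> makeObjectColumnSelector() { return () -> currentValue.getAsLong(); }
    @Override public HandlerLongSelector makeLongColumnSelector() { return () -> currentValue.getAsLong(); }
    // The handler itself performs the long -> float cast; no adapter-level wrapping needed.
    @Override public HandlerFloatSelector makeFloatColumnSelector() { return () -> (float) currentValue.getAsLong(); }
    // Numeric handlers expose a "null" dimension selector: every row reads as null.
    @Override public HandlerDimensionSelector makeDimensionSelector() { return () -> null; }
}

class SketchStringHandler implements SketchTypeHandler {
    private final Supplier<String> currentValue;

    SketchStringHandler(Supplier<String> currentValue) { this.currentValue = currentValue; }

    @Override public Supplier<Object> makeObjectColumnSelector() { return currentValue::get; }
    // String handlers expose "zero" selectors for numeric reads.
    @Override public HandlerLongSelector makeLongColumnSelector() { return () -> 0L; }
    @Override public HandlerFloatSelector makeFloatColumnSelector() { return () -> 0.0f; }
    @Override public HandlerDimensionSelector makeDimensionSelector() { return currentValue::get; }
}
```

Compared with wrapping in the storage adapter, this keeps every cross-type decision next to the type it concerns.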

gianm commented Feb 28, 2017

Also, the current code in your patch throws UnsupportedOperationExceptions in some cases, but I think we should avoid those and return null/zero selectors instead (like QISA).

@gianm gianm closed this Feb 28, 2017
@gianm gianm reopened this Feb 28, 2017
@gianm gianm closed this Feb 28, 2017
@gianm gianm reopened this Feb 28, 2017
@jon-wei jon-wei closed this Mar 1, 2017
@jon-wei jon-wei reopened this Mar 1, 2017
@gianm gianm closed this Mar 1, 2017
@gianm gianm reopened this Mar 1, 2017

@gianm gianm left a comment

👍 latest changes look good to me

fjy commented Mar 1, 2017

👍

@fjy fjy merged commit a08660a into apache:master Mar 1, 2017
@fjy fjy added this to the 0.10.0 milestone Mar 1, 2017
IncrementalIndexStorageAdapter.EntryHolder currEntry, IncrementalIndex.DimensionDesc desc
)
{
return ZeroLongColumnSelector.instance();
Member

How valid is it to return zero selector here? Maybe try to convert from String to number?

Contributor

I believe QueryableIndexStorageAdapter does the same thing, so at least it's consistent with that. However, it'd be nice (for schema evolution of a column from string -> numeric) for the behavior to change to try to convert the string to a number.
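
A possible shape for the string-to-number conversion suggested here, as a sketch only. `StringToLongSketch` and its methods are hypothetical, and falling back to 0 for null or unparseable values is an assumption chosen to match the zero-selector behavior; the thread does not specify what should happen in those cases.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

class StringToLongSketch {
    // Parses the string as a long, falling back to 0 when the value is null
    // or is not a valid integer (assumed fallback, see the note above).
    static long parseOrZero(String value) {
        if (value == null) {
            return 0L;
        }
        try {
            return Long.parseLong(value.trim());
        } catch (NumberFormatException e) {
            return 0L;
        }
    }

    // Wraps a string-valued source in a long-valued selector that converts on each read.
    static LongSupplier convertingSelector(Supplier<String> stringValue) {
        return () -> parseOrZero(stringValue.get());
    }
}
```

With something like this, a column ingested as a string in older segments and as a long in newer ones could be aggregated with longSum across both.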

IncrementalIndexStorageAdapter.EntryHolder currEntry, IncrementalIndex.DimensionDesc desc
)
{
return ZeroFloatColumnSelector.instance();
Member

Same

Contributor

Same as #3966 (comment).

@jon-wei jon-wei deleted the ingestion_types branch October 6, 2017 22:22
@licl2014

@jon-wei @gianm I have a concern that long/float dimensions could easily lead to high cardinality in a groupBy query. Has this scenario been considered?

gianm commented Nov 30, 2017

@licl2014 it could happen, although in that case Druid will do the best it can. If cardinality is not super high it should be fine and go quickly. If cardinality is super high then Druid will spill to disk as needed (assuming disk spilling is enabled).

5 participants