Skip to content

Add time extraction functions to project time as a dimension#1159

Merged
fjy merged 2 commits intoapache:masterfrom
metamx:time-extraction
Mar 11, 2015
Merged

Add time extraction functions to project time as a dimension#1159
fjy merged 2 commits intoapache:masterfrom
metamx:time-extraction

Conversation

@xvrl
Copy link
Copy Markdown
Member

@xvrl xvrl commented Feb 28, 2015

Time extraction functions are similar to dimension extraction functions in that they allow mapping the time column to arbitrary Strings. Unlike granularities, they allow for non-linear mapping and aggregation across the resulting values. For example, this allows aggregating by day of the week or hour of the day and return the results either as part of a topN or groupBy similar to any other dimension.

This pull request proposes the addition of TimeExtractionDimensionSpec as a new type of DimensionSpec, as well as a the TimeExtractionFn type.

Time is not yet currently available as a true dimension, and this is not the objective of this pull request.
Current topN and groupBy queries still heavily rely on having dimension value indices, which makes it difficult to operate on an arbitrary column. For the sake of cleaner code, it therefore felt more natural to leave the concept of time extraction separate from the existing dimension extraction. This also allows for specific optimizations in for topN and groupBy, given the properties of the time dimension.

Hopefully this can help shape future work towards supporting arbitrary columns as dimensions.

@cheddar
Copy link
Copy Markdown
Contributor

cheddar commented Mar 3, 2015

I just looked at the high level implementation at this point, but I'm a little worried that this is just a further step towards having an extraction interface for every possible primitive type. The fact that this requires adding a new method which is essentially just specializing on a primitive type makes me wonder if the initial "extraction fn" abstraction was correct.

My first high level thought is wondering if we can invert it so that we pass in something that allows the "extraction function" to get the type of value that it wants. Kinda like the MetricColumnSelectors. I can't claim to have thought it through all the way, but the thing that immediately jumps out as less than wonderful in what I'm proposing is that it would complicate implementations by making them care about multi-valued columns. It would also likely negate some of the "caching" of extraction values that can be done with dimensions...

Another option could be changing the interface to take in an Object instead of String. That would allow an implementation to be passed a long instead of a String. Given that we would likely implement these functions on top of JodaTime (doing new DateTime()) this would actually enable one function to handle both String dimensions and long timestamp type things. It would introduce object creation overhead when using the Time style ones, but maybe that's not the end of the world?

Anyway, I have to run now. I'm not necessarily done thinking, but going to hit comment so that the seeds of these thoughts can be planted.

@xvrl
Copy link
Copy Markdown
Member Author

xvrl commented Mar 3, 2015

For a lower-level interface I don't necessarily think that having different function methods for different primitive types is necessarily a bad thing, or at least not worse than having to do instance type checks within a method taking Objects. Hopefully Java will provide value types one day and that won't be a problem anymore. We can offer a higher-level interface for simpler functions and implement various casting operations for convenience, leaving the option to implement the low-level interface directly. Low-level functions could expose capabilities to define the supported types, which can in turn can be leveraged by query engines to optimize accordingly.

Given that the type of functions we are talking about operate on single values and not on rows, I don't think there is an immediate need to invert it. If that is what we nonetheless decide to do in the future, we can easily wrap existing functions to use the inversion mechanisms.

More immediately, I believe the most important thing to focus on is the interface to the user, so we don't have to change the query language too much. Maybe it makes more sense to provide a general ExtractionDimensionSpec, introducing a new extractionFn field and start deprecating the dimExtractionFn field. The result should still be thought of as a dimension though. That would remove the time-specific aspect from the query interface, and leave it to the query engine whether to support non-string values for columns other than time for now. This also opens up the ability to use all the existing dimExtraction Functions on the time column.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please either populate or remove.

@drcrallen
Copy link
Copy Markdown
Contributor

I agree with @xvrl on the low level aspect.

In a general sense, I think anything people may want to do can be accomplished with a javascript extraction function, but we are supplying optimized extractions as they become major use cases.

I think eventually we want the __time dimension to be just another dimension, but that's a bit too big of an ask for this PR.

Rather than ExtractionDimensionSpec how would you feel about EphemeralDimensionSpec? Specifically for a dimension that is only in existence for the life of the query. (also might be too much scope creep for this PR)

@xvrl xvrl added the Discuss label Mar 3, 2015
@xvrl xvrl force-pushed the time-extraction branch 2 times, most recently from 6a14578 to cf61345 Compare March 4, 2015 00:27
@xvrl
Copy link
Copy Markdown
Member Author

xvrl commented Mar 4, 2015

I reworked the interfaces a have a more generic ExtractionFn, with DimExtractionFn being a special case for String dimensions. This allows us to use any existing DimExtraction function on the time column, and I generalized JavaScriptExtractionFunction to not be specific to Strings anymore.

Would this seem more palatable to everyone?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the toString() increase query time for String based extraction functions?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toString() is just a default implementation, it is not called for String dimensions, those use apply(String) directly as it did before.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvmnd, I see the String override from the inheritance.

@cheddar
Copy link
Copy Markdown
Contributor

cheddar commented Mar 5, 2015

I like the changes, moving to just an ExtractionFn and giving that interface various methods that enable specialization for specific primitives makes sense to me.

@xvrl xvrl force-pushed the time-extraction branch from 931bf14 to 88a528f Compare March 6, 2015 00:25
@xvrl
Copy link
Copy Markdown
Member Author

xvrl commented Mar 6, 2015

@cheddar @drcrallen addressed your comments and added documentation.

@drcrallen
Copy link
Copy Markdown
Contributor

I'm going to have a lot of cleanup to merge this into QTL but this looks good to me

@vogievetsky
Copy link
Copy Markdown
Contributor

Very excited to see this feature. Just updated the TypeScript types in anticipation: https://github.com/facetjs/typescript-druid/commit/5a0778b1d14a085b3f43534650e83d8d4199134d

BTW. It would be great to add a full query example on the doc page.

@xvrl xvrl force-pushed the time-extraction branch from 88a528f to ef39510 Compare March 6, 2015 23:11
@xvrl
Copy link
Copy Markdown
Member Author

xvrl commented Mar 6, 2015

@vogievetsky I added an example here

@xvrl xvrl removed the Discuss label Mar 9, 2015
@xvrl xvrl added this to the 0.7.1 milestone Mar 9, 2015
@xvrl xvrl self-assigned this Mar 9, 2015
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this logic shouldn't be given to the cursor and done inside of makeDimensionSelector. I'm assuming that you chose not to do it that way because that means each implementation of StorageAdapter has to implement the time lookup properly, but it also seems like each implementation of StorageAdapter might be able to take advantage of optimizations when returning the time column...

I guess it seems like we are building in extra logic about "the time column acts like XYZ", where I think that with 0.7 there was an effort to make the time column just look like any other column. This seems to be breaking that "similarity"...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that is logic seems out of place here. We should try to centralize the places that have to special case for the time column

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure which one is better frankly. The query engine has better assumptions regarding how the data is scanned, so it can potentially make better decisions on how to optimize, based on the storage adapter capabilities. The Storage adapter may be able to optimize certain things, but would have to handle the more general case, since it doesn't know what the query engine might do.

We can push down all extraction functions to the storage adapter level, if you feel this would be more consistent?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants