Add time extraction functions to project time as a dimension by xvrl · Pull Request #1159 · apache/druid

xvrl · 2015-02-28T06:01:54Z

Time extraction functions are similar to dimension extraction functions in that they allow mapping the time column to arbitrary Strings. Unlike granularities, they allow for non-linear mapping and aggregation across the resulting values. For example, this allows aggregating by day of the week or hour of the day and returning the results as part of a topN or groupBy, like any other dimension.
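As a rough illustration of the non-linear mapping described above, here is a minimal sketch in plain Java. It assumes epoch-millisecond timestamps and UTC; the class and method names (`DayOfWeekSketch`, `dayOfWeek`, `countByDay`) are hypothetical and are not the classes added by this pull request.

```java
import java.time.DayOfWeek;
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the implementation in this pull request.
class DayOfWeekSketch {
    // Maps an epoch-millisecond timestamp to a day-of-week name (UTC assumed).
    static String dayOfWeek(long timestampMillis) {
        DayOfWeek day = Instant.ofEpochMilli(timestampMillis)
                .atZone(ZoneOffset.UTC)
                .getDayOfWeek();
        return day.toString();
    }

    // Counts rows per extracted value, similar in spirit to a groupBy
    // on the extracted dimension with a count aggregator.
    static Map<String, Long> countByDay(long[] timestamps) {
        Map<String, Long> counts = new TreeMap<>();
        for (long ts : timestamps) {
            counts.merge(dayOfWeek(ts), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long oneDay = 24L * 60 * 60 * 1000;
        // Epoch 0 is Thursday 1970-01-01; seven days later is a Thursday again.
        long[] timestamps = {0L, oneDay, 7 * oneDay};
        System.out.println(countByDay(timestamps)); // prints {FRIDAY=1, THURSDAY=2}
    }
}
```

Timestamps a week apart land in the same bucket, which is something a granularity, being a linear bucketing of time, cannot express.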

This pull request proposes the addition of TimeExtractionDimensionSpec as a new type of DimensionSpec, as well as the TimeExtractionFn type.

Time is not yet available as a true dimension, and making it one is not the objective of this pull request.
Current topN and groupBy queries still rely heavily on having dimension value indices, which makes it difficult to operate on an arbitrary column. For the sake of cleaner code, it therefore felt more natural to keep the concept of time extraction separate from the existing dimension extraction. This also allows for specific optimizations in topN and groupBy, given the properties of the time dimension.

Hopefully this can help shape future work towards supporting arbitrary columns as dimensions.

cheddar · 2015-03-03T02:10:05Z

I just looked at the high level implementation at this point, but I'm a little worried that this is just a further step towards having an extraction interface for every possible primitive type. The fact that this requires adding a new method which is essentially just specializing on a primitive type makes me wonder if the initial "extraction fn" abstraction was correct.

My first high level thought is wondering if we can invert it so that we pass in something that allows the "extraction function" to get the type of value that it wants. Kinda like the MetricColumnSelectors. I can't claim to have thought it through all the way, but the thing that immediately jumps out as less than wonderful in what I'm proposing is that it would complicate implementations by making them care about multi-valued columns. It would also likely negate some of the "caching" of extraction values that can be done with dimensions...

Another option could be changing the interface to take in an Object instead of String. That would allow an implementation to be passed a long instead of a String. Given that we would likely implement these functions on top of JodaTime (doing new DateTime()) this would actually enable one function to handle both String dimensions and long timestamp type things. It would introduce object creation overhead when using the Time style ones, but maybe that's not the end of the world?
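A minimal sketch of the Object-taking idea, assuming `java.time` as a stand-in for JodaTime and ISO-8601 strings for the String case. The class name `ObjectHourExtractionSketch` is hypothetical; nothing here is taken from the Druid code base.

```java
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical sketch: one function accepting either a long timestamp
// (boxed) or an ISO-8601 String, returning the hour of day in UTC.
class ObjectHourExtractionSketch {
    static String apply(Object value) {
        Instant instant;
        if (value instanceof Number) {
            // A primitive long arrives boxed here; this is the object
            // creation overhead mentioned above.
            instant = Instant.ofEpochMilli(((Number) value).longValue());
        } else {
            // Assumes the String dimension holds an ISO-8601 instant.
            instant = Instant.parse(value.toString());
        }
        return String.valueOf(instant.atZone(ZoneOffset.UTC).getHour());
    }

    public static void main(String[] args) {
        System.out.println(apply(3_600_000L));             // prints 1
        System.out.println(apply("1970-01-01T05:30:00Z")); // prints 5
    }
}
```

The cost is the boxing and the type check on every call, which is the trade-off being weighed here.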

Anyway, I have to run now. I'm not necessarily done thinking, but going to hit comment so that the seeds of these thoughts can be planted.

xvrl · 2015-03-03T07:05:41Z

For a lower-level interface, I don't think that having different function methods for different primitive types is necessarily a bad thing, or at least it is no worse than having to do instance type checks within a method taking Objects. Hopefully Java will provide value types one day and that won't be a problem anymore. We can offer a higher-level interface for simpler functions and implement various casting operations for convenience, leaving the option to implement the low-level interface directly. Low-level functions could expose capabilities to define the supported types, which can in turn be leveraged by query engines to optimize accordingly.

Given that the type of functions we are talking about operate on single values and not on rows, I don't think there is an immediate need to invert it. If that is what we nonetheless decide to do in the future, we can easily wrap existing functions to use the inversion mechanisms.

More immediately, I believe the most important thing to focus on is the interface to the user, so we don't have to change the query language too much. Maybe it makes more sense to provide a general ExtractionDimensionSpec, introducing a new extractionFn field and start deprecating the dimExtractionFn field. The result should still be thought of as a dimension though. That would remove the time-specific aspect from the query interface, and leave it to the query engine whether to support non-string values for columns other than time for now. This also opens up the ability to use all the existing dimExtraction Functions on the time column.

drcrallen · 2015-03-03T17:06:03Z

Please either populate or remove.

drcrallen · 2015-03-03T17:28:30Z

I agree with @xvrl on the low level aspect.

In a general sense, I think anything people may want to do can be accomplished with a javascript extraction function, but we are supplying optimized extractions as they become major use cases.

I think eventually we want the __time dimension to be just another dimension, but that's a bit too big of an ask for this PR.

Rather than ExtractionDimensionSpec how would you feel about EphemeralDimensionSpec? Specifically for a dimension that is only in existence for the life of the query. (also might be too much scope creep for this PR)

xvrl · 2015-03-04T00:30:55Z

I reworked the interfaces to have a more generic ExtractionFn, with DimExtractionFn being a special case for String dimensions. This allows us to use any existing DimExtraction function on the time column, and I generalized JavaScriptExtractionFunction to not be specific to Strings anymore.

Would this seem more palatable to everyone?

drcrallen · 2015-03-04T00:59:05Z

Does the toString() increase query time for String based extraction functions?

toString() is just a default implementation, it is not called for String dimensions, those use apply(String) directly as it did before.

nvmnd, I see the String override from the inheritance.
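The arrangement discussed in this thread could look roughly like the sketch below: a generic function interface with per-type overloads, and a String-oriented base class whose non-String overloads fall back to a string conversion. The names carry a `Sketch` suffix because the exact signatures are assumptions, not the merged code.

```java
// Hypothetical sketch; names mirror the discussion, signatures are assumed.
interface ExtractionFnSketch {
    String apply(Object value); // generic fallback
    String apply(String value); // used directly for String dimensions
    String apply(long value);   // used for the time column, no boxing
}

// Special case for String dimensions: only apply(String) must be written.
abstract class DimExtractionFnSketch implements ExtractionFnSketch {
    @Override
    public String apply(Object value) {
        // Default path only; String dimensions never go through here.
        return apply(value.toString());
    }

    @Override
    public String apply(long value) {
        return apply(Long.toString(value));
    }
}

// Example String function: keeps the first `length` characters.
class PrefixFnSketch extends DimExtractionFnSketch {
    private final int length;

    PrefixFnSketch(int length) {
        this.length = length;
    }

    @Override
    public String apply(String value) {
        return value.length() <= length ? value : value.substring(0, length);
    }
}

class ExtractionFnDemo {
    public static void main(String[] args) {
        ExtractionFnSketch fn = new PrefixFnSketch(4);
        System.out.println(fn.apply("wikipedia"));    // prints wiki
        System.out.println(fn.apply(1425600000000L)); // prints 1425
    }
}
```

Java overload resolution picks `apply(String)` for a String argument and `apply(long)` for a primitive long without boxing, so the `toString()` fallback costs nothing on the existing String path.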

cheddar · 2015-03-05T19:38:52Z

I like the changes, moving to just an ExtractionFn and giving that interface various methods that enable specialization for specific primitives makes sense to me.

xvrl · 2015-03-06T00:33:52Z

@cheddar @drcrallen addressed your comments and added documentation.

drcrallen · 2015-03-06T00:55:39Z

I'm going to have a lot of cleanup to merge this into QTL but this looks good to me

vogievetsky · 2015-03-06T22:49:29Z

Very excited to see this feature. Just updated the TypeScript types in anticipation: https://github.com/facetjs/typescript-druid/commit/5a0778b1d14a085b3f43534650e83d8d4199134d

BTW. It would be great to add a full query example on the doc page.

xvrl · 2015-03-06T23:15:15Z

@vogievetsky I added an example here

cheddar · 2015-03-10T00:53:25Z

I'm wondering if this logic shouldn't be given to the cursor and done inside of makeDimensionSelector. I'm assuming that you chose not to do it that way because that means each implementation of StorageAdapter has to implement the time lookup properly, but it also seems like each implementation of StorageAdapter might be able to take advantage of optimizations when returning the time column...

I guess it seems like we are building in extra logic about "the time column acts like XYZ", where I think that with 0.7 there was an effort to make the time column just look like any other column. This seems to be breaking that "similarity"...

I agree that this logic seems out of place here. We should try to centralize the places that have to special-case the time column.

I'm not sure which one is better frankly. The query engine has better assumptions regarding how the data is scanned, so it can potentially make better decisions on how to optimize, based on the storage adapter capabilities. The Storage adapter may be able to optimize certain things, but would have to handle the more general case, since it doesn't know what the query engine might do.

We can push down all extraction functions to the storage adapter level, if you feel this would be more consistent?

drcrallen reviewed Mar 3, 2015

xvrl added the Discuss label Mar 3, 2015

xvrl force-pushed the time-extraction branch 2 times, most recently from 6a14578 to cf61345 Compare March 4, 2015 00:27

drcrallen reviewed Mar 4, 2015

xvrl force-pushed the time-extraction branch from 931bf14 to 88a528f Compare March 6, 2015 00:25

xvrl force-pushed the time-extraction branch from 88a528f to ef39510 Compare March 6, 2015 23:11

xvrl removed the Discuss label Mar 9, 2015

xvrl added this to the 0.7.1 milestone Mar 9, 2015

xvrl self-assigned this Mar 9, 2015

cheddar reviewed Mar 10, 2015
