Add time extraction functions to project time as a dimension#1159
Add time extraction functions to project time as a dimension#1159fjy merged 2 commits intoapache:masterfrom
Conversation
|
I just looked at the high level implementation at this point, but I'm a little worried that this is just a further step towards having an extraction interface for every possible primitive type. The fact that this requires adding a new method which is essentially just specializing on a primitive type makes me wonder if the initial "extraction fn" abstraction was correct. My first high level thought is wondering if we can invert it so that we pass in something that allows the "extraction function" to get the type of value that it wants. Kinda like the MetricColumnSelectors. I can't claim to have thought it through all the way, but the thing that immediately jumps out as less than wonderful in what I'm proposing is that it would complicate implementations by making them care about multi-valued columns. It would also likely negate some of the "caching" of extraction values that can be done with dimensions... Another option could be changing the interface to take in an Object instead of String. That would allow an implementation to be passed a Anyway, I have to run now. I'm not necessarily done thinking, but going to hit comment so that the seeds of these thoughts can be planted. |
|
For a lower-level interface I don't necessarily think that having different function methods for different primitive types is necessarily a bad thing, or at least not worse than having to do instance type checks within a method taking Objects. Hopefully Java will provide value types one day and that won't be a problem anymore. We can offer a higher-level interface for simpler functions and implement various casting operations for convenience, leaving the option to implement the low-level interface directly. Low-level functions could expose capabilities to define the supported types, which can in turn can be leveraged by query engines to optimize accordingly. Given that the type of functions we are talking about operate on single values and not on rows, I don't think there is an immediate need to invert it. If that is what we nonetheless decide to do in the future, we can easily wrap existing functions to use the inversion mechanisms. More immediately, I believe the most important thing to focus on is the interface to the user, so we don't have to change the query language too much. Maybe it makes more sense to provide a general |
There was a problem hiding this comment.
Please either populate or remove.
|
I agree with @xvrl on the low level aspect. In a general sense, I think anything people may want to do can be accomplished with a javascript extraction function, but we are supplying optimized extractions as they become major use cases. I think eventually we want the Rather than |
6a14578 to
cf61345
Compare
|
I reworked the interfaces a have a more generic ExtractionFn, with DimExtractionFn being a special case for String dimensions. This allows us to use any existing DimExtraction function on the time column, and I generalized JavaScriptExtractionFunction to not be specific to Strings anymore. Would this seem more palatable to everyone? |
There was a problem hiding this comment.
Does the toString() increase query time for String based extraction functions?
There was a problem hiding this comment.
toString() is just a default implementation, it is not called for String dimensions, those use apply(String) directly as it did before.
There was a problem hiding this comment.
nvmnd, I see the String override from the inheritance.
|
I like the changes, moving to just an |
|
@cheddar @drcrallen addressed your comments and added documentation. |
|
I'm going to have a lot of cleanup to merge this into QTL but this looks good to me |
|
Very excited to see this feature. Just updated the TypeScript types in anticipation: https://github.com/facetjs/typescript-druid/commit/5a0778b1d14a085b3f43534650e83d8d4199134d BTW. It would be great to add a full query example on the doc page. |
|
@vogievetsky I added an example here |
There was a problem hiding this comment.
I'm wondering if this logic shouldn't be given to the cursor and done inside of makeDimensionSelector. I'm assuming that you chose not to do it that way because that means each implementation of StorageAdapter has to implement the time lookup properly, but it also seems like each implementation of StorageAdapter might be able to take advantage of optimizations when returning the time column...
I guess it seems like we are building in extra logic about "the time column acts like XYZ", where I think that with 0.7 there was an effort to make the time column just look like any other column. This seems to be breaking that "similarity"...
There was a problem hiding this comment.
I agree that is logic seems out of place here. We should try to centralize the places that have to special case for the time column
There was a problem hiding this comment.
I'm not sure which one is better frankly. The query engine has better assumptions regarding how the data is scanned, so it can potentially make better decisions on how to optimize, based on the storage adapter capabilities. The Storage adapter may be able to optimize certain things, but would have to handle the more general case, since it doesn't know what the query engine might do.
We can push down all extraction functions to the storage adapter level, if you feel this would be more consistent?
Time extraction functions are similar to dimension extraction functions in that they allow mapping the time column to arbitrary Strings. Unlike granularities, they allow for non-linear mapping and aggregation across the resulting values. For example, this allows aggregating by day of the week or hour of the day and return the results either as part of a topN or groupBy similar to any other dimension.
This pull request proposes the addition of TimeExtractionDimensionSpec as a new type of DimensionSpec, as well as a the TimeExtractionFn type.
Time is not yet currently available as a true dimension, and this is not the objective of this pull request.
Current topN and groupBy queries still heavily rely on having dimension value indices, which makes it difficult to operate on an arbitrary column. For the sake of cleaner code, it therefore felt more natural to leave the concept of time extraction separate from the existing dimension extraction. This also allows for specific optimizations in for topN and groupBy, given the properties of the time dimension.
Hopefully this can help shape future work towards supporting arbitrary columns as dimensions.