Expressions: Optimization for string expressions on the __time column. #5109
gianm wants to merge 2 commits into apache:master
@gianm can you color this with a couple of use cases that highlight the need for optimization?
    );

    final List<?> results = Sequences.toList(
        Sequences.map(
You can write cursors.map(...)
        null
    );

    final List<?> results = Sequences.toList(
This method is always called with either new ArrayList<>() or Lists.newArrayList() as the second argument, so the second parameter should be removed. Also, I suggest adding a default method Sequence.toList().
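To illustrate the suggestion, here is a minimal sketch of what a default `toList()` method could look like. `SimpleSequence` and its `forEach` method are hypothetical stand-ins invented for this example; they are not Druid's actual `Sequence` interface, whose real methods differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical, simplified stand-in for a sequence type (NOT Druid's Sequence).
interface SimpleSequence<T>
{
  // The single abstract method: push every element to the consumer, in order.
  void forEach(Consumer<? super T> consumer);

  // Default method: callers no longer pass in their own empty list.
  default List<T> toList()
  {
    final List<T> list = new ArrayList<>();
    forEach(list::add);
    return list;
  }
}
```

With a method like this, call sites would no longer need to supply an empty list as an accumulator.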
    }

    @Override
    public void inspectRuntimeShape(final RuntimeShapeInspector inspector)
Minor: I try to put this method last, or among the last, because it's not business logic.
It's an expression-oriented equivalent of the optimization that SingleScanTimeDimSelector does for extractionFns on the __time column. It helps because the time column tends to have a lot of runs, and with the optimization we compute the function only once per distinct value rather than computing it repeatedly.
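The idea of computing once per run can be sketched roughly as follows. `RunCachingFunction` is a hypothetical class written for illustration only; it is not the code in this patch or in SingleScanTimeDimSelector. It assumes equal timestamps arrive consecutively (as they would in a time-ordered scan), so remembering only the last input is enough.

```java
import java.util.function.LongFunction;

// Hypothetical sketch: remembers the last input/output pair and recomputes
// only when the input differs from the previous call.
class RunCachingFunction<T>
{
  private final LongFunction<T> delegate;
  private boolean hasCached = false;
  private long lastInput;
  private T lastOutput;
  private int computeCount = 0; // for illustration: how often the delegate ran

  RunCachingFunction(final LongFunction<T> delegate)
  {
    this.delegate = delegate;
  }

  T apply(final long input)
  {
    if (!hasCached || input != lastInput) {
      // New run: evaluate the (possibly expensive) function and cache the result.
      lastOutput = delegate.apply(input);
      lastInput = input;
      hasCached = true;
      computeCount++;
    }
    return lastOutput;
  }

  int getComputeCount()
  {
    return computeCount;
  }
}
```

For a sorted column, the number of runs equals the number of distinct values, so the delegate runs once per distinct timestamp. On unsorted input this sketch would still be correct but would recompute whenever a value reappears after a different one.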
I think it should be precomputed per column and stored along with the column, if such an optimization makes sense for that column, alongside min, max, and average values and other kinds of O(1) information that could be useful during querying. It should be accounted for here: https://github.com/druid-io/druid/projects/1
@gianm we should finish this off
It turns out this patch is not necessary, since #5048 accomplished the same goal. String expressions on the __time column work by first building a base primitive selector, and #5048 optimizes those. That optimization ends up being inherited by the dimension selector. #6599 extends the optimization to apply to Long columns that are not __time. |
WIP: Still need to run benchmarks