|
@navis, @jon-wei could you take a look please and let me know your opinion?

@navis, judging by comments on #3755, it sounds like you have different plans for how expression support should work. Part of the patch here involves expressions becoming only virtual columns, not being available at the ColumnSelectorFactory, Aggregator, Filter, etc levels. So that would suggest that instead of doing: You would instead do something like: And something similar for aggregators. Rather than knowing about expressions themselves, they would refer to an expression virtual column. This would extend to any other concepts we gain over time, not just math expressions.

The idea is moving towards using virtual columns as a sort of "projection" layer before we apply filters, groupings, and aggregations. The JSON is a bit more complex, but the code inside is simpler (no need for aggregators, filters, and storage adapters to know about math expressions or any other sort of transformations) and more composable. I think that's a good tradeoff since it's common for Druid users to use query generators like plywood or SQL anyway, rather than writing the queries themselves, so it shouldn't be too much of a burden.

Maybe you oversold me on the virtual column concept since now I want to use it for more things :)
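To make the "projection layer" idea above concrete, here is a minimal sketch of an aggregator that refers to a virtual column by name instead of carrying an expression itself. Everything in it (`ProjectionSketch`, the `VirtualColumn` interface, `resolve`, `sum`, the column name `x_plus_y`) is hypothetical and heavily simplified; it is not Druid's actual API, only an illustration of the layering being proposed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified types for illustration only; not Druid's actual interfaces.
final class ProjectionSketch {
    /** Computes a derived value from the base columns of one row. */
    interface VirtualColumn {
        double compute(Map<String, Double> row);
    }

    /** Resolves a column by name: virtual columns first, then base columns. */
    static double resolve(String name, Map<String, VirtualColumn> virtualColumns, Map<String, Double> row) {
        VirtualColumn vc = virtualColumns.get(name);
        if (vc != null) {
            return vc.compute(row);
        }
        Double base = row.get(name);
        if (base == null) {
            throw new IllegalArgumentException("Unknown column: " + name);
        }
        return base;
    }

    /** A sum aggregator that only knows a field name and never sees an expression. */
    static double sum(String fieldName, Map<String, VirtualColumn> virtualColumns, List<Map<String, Double>> rows) {
        double total = 0;
        for (Map<String, Double> row : rows) {
            total += resolve(fieldName, virtualColumns, row);
        }
        return total;
    }

    /** Sums the virtual column "x_plus_y" over two example rows. */
    static double demo() {
        Map<String, VirtualColumn> virtualColumns = new HashMap<>();
        virtualColumns.put("x_plus_y", row -> row.get("x") + row.get("y"));
        List<Map<String, Double>> rows = List.of(
            Map.of("x", 1.0, "y", 2.0),
            Map.of("x", 3.0, "y", 4.0)
        );
        return sum("x_plus_y", virtualColumns, rows);
    }
}
```

In this sketch the aggregator code stays the same whether `fieldName` names a base column or a derived one, which is the composability argument made in the comment.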
|
I support having I think the most important objective for |
|
@vogievetsky, yes, in my mind it is definitely an objective to make virtual columns work for all the things: selecting, grouping, filtering, and aggregating. With their introduction in #2511 they only work for selecting; this patch makes them work for aggregating, and future patches could make them work for grouping and filtering. |
|
I think things are much cleaner now after doing a quick read through |
After this patch, virtual columns work well for aggregating, but not for grouping (because the topN and groupBy engines don't deal well with DimensionSelectors that have no dictionaries) or filtering (because the CursorFactories don't do anything useful with them).

Virtual columns:
- Move VirtualColumn and related classes to their own package.
- Add additional methods to VirtualColumn to support more kinds of selectors and to support cycle detection.
- Add ExpressionVirtualColumn in core, offering math expressions as a virtual column.

Queries:
- Add VirtualColumns to timeseries, topN, and groupBy.

Aggregators:
- Remove the "expression" parameter (replaced by "expression" virtual columns).

ColumnSelectorFactory:
- Change getColumnCapabilities to getNativeType, since that's all anyone was using it for, and it's simpler.
- Remove "makeMathExpressionSelector" (replaced by "expression" virtual columns).

Storage adapters:
- Add virtual column hooks before checking base columns. I think this ordering makes more sense than checking base columns first, since it prevents surprises when a new base column is added that happens to have the same name as a virtual column.
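The description mentions cycle detection for virtual columns without showing how it works. One plausible approach, sketched below, is a depth-first search over the columns each virtual column reads. This is an assumption about the general technique, not the patch's actual implementation; the class `VirtualColumnCycleCheck`, the dependency-map representation, and the column names are all hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: virtual columns are modeled as a map from the virtual
// column's name to the names of the columns its expression reads.
final class VirtualColumnCycleCheck {
    /** Returns true if any virtual column depends on itself, directly or indirectly. */
    static boolean hasCycle(Map<String, List<String>> requiredColumns) {
        Set<String> inProgress = new HashSet<>();
        Set<String> done = new HashSet<>();
        for (String name : requiredColumns.keySet()) {
            if (visit(name, requiredColumns, inProgress, done)) {
                return true;
            }
        }
        return false;
    }

    private static boolean visit(String name, Map<String, List<String>> deps,
                                 Set<String> inProgress, Set<String> done) {
        if (done.contains(name) || !deps.containsKey(name)) {
            return false; // already verified, or a base column with no dependencies
        }
        if (!inProgress.add(name)) {
            return true; // name is already on the current path, so this is a cycle
        }
        for (String dep : deps.get(name)) {
            if (visit(dep, deps, inProgress, done)) {
                return true;
            }
        }
        inProgress.remove(name);
        done.add(name);
        return false;
    }

    /** Two virtual columns that read each other. */
    static boolean cyclicExample() {
        return hasCycle(Map.of("a", List.of("b"), "b", List.of("a")));
    }

    /** Virtual column "a" reads base column "x" and virtual column "b"; "b" reads base column "y". */
    static boolean acyclicExample() {
        return hasCycle(Map.of("a", List.of("x", "b"), "b", List.of("y")));
    }
}
```

A check like this would run once when the query's virtual columns are constructed, so a bad definition fails fast rather than recursing during selector creation.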
|
Pushed another commit containing a DimensionSelector for expressions. It doesn't actually work with topN and groupBy, though, since its value cardinality is |
|
@gianm I have some following PRs waiting on expression subject.
inline Virtual columns:
Could you postpone this until things settle down?
This is done in
good idea. Queries:
I think this is done in Aggregators:
I remember this is commented in previous PR. been thinking of replacing ColumnSelectorFactory:
good
This is for making in-line VC which is very useful. but ok, we will keep that in ours. Storage adapters:
good |
|
And actually, it's possible to group by on a virtual column with some limitations (memory pressure), something like things done in |
|
@navis thanks for the feedback. On your points:
yes, sure, I'll move the existing classes back. I'll still create a "virtual" package for the new classes though since there won't be conflicts there.
Is your impl substantially different from the one in here? If so, what's different about it? Or, if it's not really different, would the one in here work for you? I'm asking because the impl in here is necessary to keep some of the tests working properly, due to removal of makeMathExpressionSelector from ColumnSelectorFactory. If yours is different & better than my impl, could I merge it into this PR and replace my impl?
I guessed you were using it for this. I see the feature is useful, but I still prefer to remove it and use explicit virtualColumns since that simplifies the code for column selectors, aggregators, and filters. Finally, that's a good point about grouping. But I'd like to eventually modify the query engines so that we can control the memory usage. I was thinking of using a bounded memory dictionary with spilling for that (at least for groupBy, which can spill). That's a separate issue though. |
|
Moved VirtualColumn and VirtualColumns back to original locations. |
default:
return new BooleanValueMatcher(predicateFactory.makeStringPredicate().apply(null));
final ValueType type = cursor.getNativeType(dimension);
if (type == ValueType.LONG) {
Why change from a switch to if statements?
getNativeType can return null, and switch doesn't like nulls.
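For readers unfamiliar with the pitfall: a classic Java `switch` over an enum throws a `NullPointerException` when the value is null, whereas an `==` comparison handles null without error. The sketch below shows the difference using a hypothetical stand-in `ValueType` enum and made-up method names; it is not the code from the diff.

```java
// Hypothetical stand-in for the real ValueType enum, for illustration only.
final class NullSafeTypeDispatch {
    enum ValueType { LONG, FLOAT, STRING }

    /** Classic switch: throws NullPointerException when type is null. */
    static String describeWithSwitch(ValueType type) {
        switch (type) {
            case LONG:
                return "long";
            case FLOAT:
                return "float";
            default:
                return "other";
        }
    }

    /** If-chain: a null type simply falls through to the final branch. */
    static String describeWithIf(ValueType type) {
        if (type == ValueType.LONG) {
            return "long";
        } else if (type == ValueType.FLOAT) {
            return "float";
        } else {
            return "other";
        }
    }
}
```

With the if-chain, an unknown column (null type) can share the fallback branch instead of needing a separate null check before the switch.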
|
Tagged Discuss due to open questions about how to move forward with the subsystem (see also #3755 and https://groups.google.com/d/topic/druid-development/_Sd78s7yU5U/discussion). |
|
@navis would you mind PRing "virtual-column-as-dimension" so we can see what direction you are proposing there? In this PR, not much progress is being made on treating them as dimensions… there's a dimension selector maker on VirtualColumn but it's not actually usable by the query engines or for filtering as-is. I had some vague ideas there but nothing really concrete. In the end I'm happy throwing away most of my PR although it has some things I would like to keep:
To me only the last one seems really controversial. |
|
Going to close this, pending resolution of the druid-development thread, since that appears to be going in a different direction. May reopen it later, scoped down a bit. |