Expanding the VirtualColumn concept. by gianm · Pull Request #3758 · apache/druid

gianm · 2016-12-07T15:51:44Z

After this patch, virtual columns work well for aggregating, but not for grouping
(because the topN and groupBy engines don't deal well with DimensionSelectors that have
no dictionaries) or filtering (because the CursorFactories don't do anything useful
with them).

Virtual columns:

Move VirtualColumn and related classes to their own package.
Add additional methods to VirtualColumn to support more kinds of selectors and
to support cycle detection.
Add ExpressionVirtualColumn in core, offering math expressions as a virtual column.
Put cache IDs in VirtualColumnCacheHelper.

Queries:

Add VirtualColumns to timeseries, topN, and groupBy.

Aggregators:

Remove the "expression" parameter (replaced by "expression" virtual columns).

ColumnSelectorFactory:

Change getColumnCapabilities to getNativeType, since that's all anyone was using
it for, and it's simpler.
Remove "makeMathExpressionSelector" (replaced by "expression" virtual columns).

Storage adapters:

Add virtual column hooks before checking base columns. I think this ordering makes
more sense than checking base columns first, since it prevents surprises when a
new base column is added that happens to have the same name as a virtual column.

gianm · 2016-12-07T16:04:24Z

@navis, @jon-wei could you take a look please and let me know your opinion?

@navis, judging by comments on #3755, it sounds like you have different plans for how expression support should work. Part of the patch here involves expressions becoming only virtual columns, not being available at the ColumnSelectorFactory, Aggregator, Filter, etc levels. So that would suggest that instead of doing:

"filter" : {
  "type": "expression",
  "expression": "index > 100 && index < 200"
}

You would instead do something like:

"virtualColumns" : [{
  "type" : "expression",
  "name" : "expr",
  "expression" : "index > 100 && index < 200"
}],
"filter" : {
  "type" : "selector",
  "dimension" : "expr",
  "value" : "true"
}

And something similar for aggregators. Rather than knowing about expressions themselves, they would refer to an expression virtual column. This would extend to any other concepts we gain over time, not just math expressions. The idea is moving towards using virtual columns as a sort of "projection" layer before we apply filters, groupings, and aggregations.

The JSON is a bit more complex, but the code inside is simpler (no need for aggregators, filters, and storage adapters to know about math expressions or any other sort of transformations) and more composable. I think that's a good tradeoff since it's common for Druid users to use query generators like plywood or SQL anyway, rather than writing the queries themselves, so it shouldn't be too much of a burden.

Maybe you oversold me on the virtual column concept since now I want to use it for more things :)

gianm · 2016-12-07T16:06:06Z

Also, feedback from others would be great as well; I just called out @navis and @jon-wei since they've been doing work in this area recently.

vogievetsky · 2016-12-07T16:56:15Z

I support having virtualColumns being their own layer that is defined by JSON in the top level of the query. FWIW Plywood has always been built on top of that concept (it is called derivedAttributes).

I think the most important objective for virtualColumns bar none is that they work everywhere, on all query types and in all query constructs. As for the complexity of the API: I think people who make ad-hock queries will opt to use the SQL capabilities and people that are writing software that makes queries programmatically will not see much of a problem.

gianm · 2016-12-07T18:23:57Z

@vogievetsky, yes, in my mind it is definitely an objective to make virtual columns work for all the things: selecting, grouping, filtering, and aggregating. With their introduction in #2511 they just work for selecting, this patch makes them work for aggregating, and future patches could make them work for grouping and filtering.

fjy · 2016-12-07T19:33:41Z

I think things are much cleaner now after doing a quick read through

After this patch, virtual columns work well for aggregating, but not for grouping (because the topN and groupBy engines don't deal well with DimensionSelectors that have no dictionaries) or filtering (because the CursorFactories don't do anything useful with them). Virtual columns: - Move VirtualColumn and related classes to their own package. - Add additional methods to VirtualColumn to support more kinds of selectors and to support cycle detection. - Add ExpressionVirtualColumn in core, offering math expressions as a virtual column. Queries: - Add VirtualColumns to timeseries, topN, and groupBy. Aggregators: - Remove the "expression" parameter (replaced by "expression" virtual columns). ColumnSelectorFactory: - Change getColumnCapabilities to getNativeType, since that's all anyone was using it for, and it's simpler. - Remove "makeMathExpressionSelector" (replaced by "expression" virtual columns). Storage adapters: - Add virtual column hooks before checking base columns. I think this ordering makes more sense than checking base columns first, since it prevents surprises when a new base column is added that happens to have the same name as a virtual column.

gianm · 2016-12-07T22:01:01Z

Pushed another commit containing a DimensionSelector for expressions. It doesn't actually work with topN and groupBy, though, since its value cardinality is CARDINALITY_UNKNOWN which they can't handle right now.

navis · 2016-12-08T00:44:29Z

@gianm I have some following PRs waiting on expression subject.

expression-metric-filter (Support metric filters #3755)
math-virtual-column (in local, not PRed yet)
virtual-column-as-dimension (in local, not PRed yet)
etc. (small things)

inline

Virtual columns:

Move VirtualColumn and related classes to their own package.

Could you postpone this till things settled down?

Add additional methods to VirtualColumn to support more kinds of selectors and
to support cycle detection.

Add ExpressionVirtualColumn in core, offering math expressions as a virtual column.

This is done in math-virtual-column. Could you wait on it?

Put cache IDs in VirtualColumnCacheHelper.

good idea.

Queries:

Add VirtualColumns to timeseries, topN, and groupBy.

I think this is done in virtual-column-as-dimension

Aggregators:

Remove the "expression" parameter (replaced by "expression" virtual columns).

I remember this is commented in previous PR. been thinking of replacing field with expression but this code looks removing expression, but ok (we should keep it as was in our version because it's already legacy, for year, for us).

ColumnSelectorFactory:

Change getColumnCapabilities to getNativeType, since that's all anyone was using
it for, and it's simpler.

good

Remove "makeMathExpressionSelector" (replaced by "expression" virtual columns).

This is for making in-line VC which is very useful. but ok, we will keep that in ours.

Storage adapters:

Add virtual column hooks before checking base columns. I think this ordering makes
more sense than checking base columns first, since it prevents surprises when a
new base column is added that happens to have the same name as a virtual column.

good

navis · 2016-12-08T00:47:56Z

And actually, it's possible to group-by on vc with some limitations(memory pressure), something like things done in SingleScanTimeDimSelector because group-by does not rely on cardinality of dimension.

gianm · 2016-12-08T01:22:29Z

@navis thanks for the feedback. On your points:

Move VirtualColumn and related classes to their own package.

Could you postpone this till things settled down?

yes, sure, I'll move the existing classes back. I'll still create a "virtual" package for the new classes though since there won't be conflicts there.

Add additional methods to VirtualColumn to support more kinds of selectors and
to support cycle detection.

Add ExpressionVirtualColumn in core, offering math expressions as a virtual column.

This is done in math-virtual-column. Could you wait on it?

Is your impl substantially different from the one in here? If so, what's different about it? Or, if it's not really different, would the the one in here work for you?

I'm asking because the impl in here is necessary to keep some of the tests working properly, due to removal of makeMathExpressionSelector from ColumnSelectorFactory. If yours is different & better than my impl, could I merge it into this PR and replace my impl?

Remove "makeMathExpressionSelector" (replaced by "expression" virtual columns).

This is for making in-line VC which is very useful. but ok, we will keep that in ours.

I guessed you were using it for this. I see the feature is useful, but I still prefer to remove it and use explicit virtualColumns since that simplifies the code for column selectors, aggregators, and filters.

Finally, that's a good point about grouping. But I'd like to eventually modify the query engines so that we can control the memory usage. I was thinking of using a bounded memory dictionary with spilling for that (at least for groupBy, which can spill). That's a separate issue though.

gianm · 2016-12-08T01:26:46Z

Moved VirtualColumn and VirtualColumns back to original locations.

jon-wei · 2016-12-07T22:06:13Z

-        default:
-          return new BooleanValueMatcher(predicateFactory.makeStringPredicate().apply(null));
+      final ValueType type = cursor.getNativeType(dimension);
+      if (type == ValueType.LONG) {


Why change from a switch to if statements?

getNativeType can return null, and switch doesn't like nulls.

gianm · 2016-12-08T15:56:54Z

Tagged Discuss due to open questions about how to move forward with the subsystem (see also #3755 and https://groups.google.com/d/topic/druid-development/_Sd78s7yU5U/discussion).

gianm · 2016-12-08T16:39:04Z

@navis would you mind PRing "virtual-column-as-dimension" so we can see what direction you are proposing there? In this PR, not much progress is being made on treating them as dimensions… there's a dimension selector maker on VirtualColumn but it's not actually usable by the query engines or for filtering as-is. I had some vague ideas there but nothing really concrete.

In the end I'm happy throwing away most of my PR although it has some things I would like to keep:

the additional validations on VirtualColumns, including cycle detection, this helps fail fast
using VirtualColumns as the serde type instead of List, this helps fail fast too
checking VirtualColumns before base columns, this makes behavior less surprising
removing expressions from ColumnSelectorFactory and ValueMatcherFactory, this simplifies implementations (pending further discussion?)

To me only the last one seems really controversial.

gianm · 2016-12-14T19:00:47Z

Going to close this, pending resolution of the druid-development thread, since that appears to be going in a different direction. May reopen it later, scoped down a bit.

gianm added the Improvement label Dec 7, 2016

gianm assigned navis and jon-wei Dec 7, 2016

gianm force-pushed the vc-nonsense branch from 9cd8c67 to 04a91e3 Compare December 7, 2016 16:04

gianm force-pushed the vc-nonsense branch from 04a91e3 to fb53786 Compare December 7, 2016 18:09

gianm force-pushed the vc-nonsense branch from fb53786 to ad9a77d Compare December 7, 2016 18:47

gianm force-pushed the vc-nonsense branch from ad9a77d to 7402a28 Compare December 7, 2016 21:14

Add dimension selector for ExpressionVirtualColumn.

fda5588

Move VirtualColumn, VirtualColumns back to io.druid.segment package.

5899da8

Fix some imports.

c0d6472

jon-wei reviewed Dec 8, 2016

View reviewed changes

gianm mentioned this pull request Dec 8, 2016

Support metric filters #3755

Closed

gianm added the Discuss label Dec 8, 2016

gianm closed this Dec 14, 2016

gianm deleted the vc-nonsense branch December 14, 2016 19:00

gianm mentioned this pull request Jan 6, 2017

Add virtual column types, holder serde, and safety features. #3823

Merged

clambertus unassigned navis and jon-wei Jul 6, 2018

Conversation

gianm commented Dec 7, 2016

Uh oh!

gianm commented Dec 7, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gianm commented Dec 7, 2016

Uh oh!

vogievetsky commented Dec 7, 2016

Uh oh!

gianm commented Dec 7, 2016

Uh oh!

fjy commented Dec 7, 2016

Uh oh!

gianm commented Dec 7, 2016

Uh oh!

navis commented Dec 8, 2016

Uh oh!

navis commented Dec 8, 2016

Uh oh!

gianm commented Dec 8, 2016

Uh oh!

gianm commented Dec 8, 2016

Uh oh!

jon-wei Dec 7, 2016

Choose a reason for hiding this comment

Uh oh!

gianm Dec 8, 2016

Choose a reason for hiding this comment

Uh oh!

gianm commented Dec 8, 2016

Uh oh!

gianm commented Dec 8, 2016

Uh oh!

gianm commented Dec 14, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

gianm commented Dec 7, 2016 •

edited

Loading