Skip to content

Expanding the VirtualColumn concept.#3758

Closed
gianm wants to merge 4 commits intoapache:masterfrom
gianm:vc-nonsense
Closed

Expanding the VirtualColumn concept.#3758
gianm wants to merge 4 commits intoapache:masterfrom
gianm:vc-nonsense

Conversation

@gianm
Copy link
Copy Markdown
Contributor

@gianm gianm commented Dec 7, 2016

After this patch, virtual columns work well for aggregating, but not for grouping
(because the topN and groupBy engines don't deal well with DimensionSelectors that have
no dictionaries) or filtering (because the CursorFactories don't do anything useful
with them).

Virtual columns:

  • Move VirtualColumn and related classes to their own package.
  • Add additional methods to VirtualColumn to support more kinds of selectors and
    to support cycle detection.
  • Add ExpressionVirtualColumn in core, offering math expressions as a virtual column.
  • Put cache IDs in VirtualColumnCacheHelper.

Queries:

  • Add VirtualColumns to timeseries, topN, and groupBy.

Aggregators:

  • Remove the "expression" parameter (replaced by "expression" virtual columns).

ColumnSelectorFactory:

  • Change getColumnCapabilities to getNativeType, since that's all anyone was using
    it for, and it's simpler.
  • Remove "makeMathExpressionSelector" (replaced by "expression" virtual columns).

Storage adapters:

  • Add virtual column hooks before checking base columns. I think this ordering makes
    more sense than checking base columns first, since it prevents surprises when a
    new base column is added that happens to have the same name as a virtual column.

@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Dec 7, 2016

@navis, @jon-wei could you take a look please and let me know your opinion?

@navis, judging by comments on #3755, it sounds like you have different plans for how expression support should work. Part of the patch here involves expressions becoming only virtual columns, not being available at the ColumnSelectorFactory, Aggregator, Filter, etc levels. So that would suggest that instead of doing:

"filter" : {
  "type": "expression",
  "expression": "index > 100 && index < 200"
}

You would instead do something like:

"virtualColumns" : [{
  "type" : "expression",
  "name" : "expr",
  "expression" : "index > 100 && index < 200"
}],
"filter" : {
  "type" : "selector",
  "dimension" : "expr",
  "value" : "true"
}

And something similar for aggregators. Rather than knowing about expressions themselves, they would refer to an expression virtual column. This would extend to any other concepts we gain over time, not just math expressions. The idea is moving towards using virtual columns as a sort of "projection" layer before we apply filters, groupings, and aggregations.

The JSON is a bit more complex, but the code inside is simpler (no need for aggregators, filters, and storage adapters to know about math expressions or any other sort of transformations) and more composable. I think that's a good tradeoff since it's common for Druid users to use query generators like plywood or SQL anyway, rather than writing the queries themselves, so it shouldn't be too much of a burden.

Maybe you oversold me on the virtual column concept since now I want to use it for more things :)

@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Dec 7, 2016

Also, feedback from others would be great as well; I just called out @navis and @jon-wei since they've been doing work in this area recently.

@vogievetsky
Copy link
Copy Markdown
Contributor

I support having virtualColumns being their own layer that is defined by JSON in the top level of the query. FWIW Plywood has always been built on top of that concept (it is called derivedAttributes).

I think the most important objective for virtualColumns bar none is that they work everywhere, on all query types and in all query constructs. As for the complexity of the API: I think people who make ad-hock queries will opt to use the SQL capabilities and people that are writing software that makes queries programmatically will not see much of a problem.

@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Dec 7, 2016

@vogievetsky, yes, in my mind it is definitely an objective to make virtual columns work for all the things: selecting, grouping, filtering, and aggregating. With their introduction in #2511 they just work for selecting, this patch makes them work for aggregating, and future patches could make them work for grouping and filtering.

@fjy
Copy link
Copy Markdown
Contributor

fjy commented Dec 7, 2016

I think things are much cleaner now after doing a quick read through

After this patch, virtual columns work well for aggregating, but not for grouping
(because the topN and groupBy engines don't deal well with DimensionSelectors that have
no dictionaries) or filtering (because the CursorFactories don't do anything useful
with them).

Virtual columns:
- Move VirtualColumn and related classes to their own package.
- Add additional methods to VirtualColumn to support more kinds of selectors and
  to support cycle detection.
- Add ExpressionVirtualColumn in core, offering math expressions as a virtual column.

Queries:
- Add VirtualColumns to timeseries, topN, and groupBy.

Aggregators:
- Remove the "expression" parameter (replaced by "expression" virtual columns).

ColumnSelectorFactory:
- Change getColumnCapabilities to getNativeType, since that's all anyone was using
  it for, and it's simpler.
- Remove "makeMathExpressionSelector" (replaced by "expression" virtual columns).

Storage adapters:
- Add virtual column hooks before checking base columns. I think this ordering makes
  more sense than checking base columns first, since it prevents surprises when a
  new base column is added that happens to have the same name as a virtual column.
@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Dec 7, 2016

Pushed another commit containing a DimensionSelector for expressions. It doesn't actually work with topN and groupBy, though, since its value cardinality is CARDINALITY_UNKNOWN which they can't handle right now.

@navis
Copy link
Copy Markdown
Contributor

navis commented Dec 8, 2016

@gianm I have some following PRs waiting on expression subject.

  • expression-metric-filter (Support metric filters #3755)
  • math-virtual-column (in local, not PRed yet)
  • virtual-column-as-dimension (in local, not PRed yet)
  • etc. (small things)

inline

Virtual columns:

  • Move VirtualColumn and related classes to their own package.

Could you postpone this till things settled down?

  • Add additional methods to VirtualColumn to support more kinds of selectors and
    to support cycle detection.
  • Add ExpressionVirtualColumn in core, offering math expressions as a virtual column.

This is done in math-virtual-column. Could you wait on it?

  • Put cache IDs in VirtualColumnCacheHelper.

good idea.

Queries:

  • Add VirtualColumns to timeseries, topN, and groupBy.

I think this is done in virtual-column-as-dimension

Aggregators:

  • Remove the "expression" parameter (replaced by "expression" virtual columns).

I remember this is commented in previous PR. been thinking of replacing field with expression but this code looks removing expression, but ok (we should keep it as was in our version because it's already legacy, for year, for us).

ColumnSelectorFactory:

  • Change getColumnCapabilities to getNativeType, since that's all anyone was using
    it for, and it's simpler.

good

  • Remove "makeMathExpressionSelector" (replaced by "expression" virtual columns).

This is for making in-line VC which is very useful. but ok, we will keep that in ours.

Storage adapters:

  • Add virtual column hooks before checking base columns. I think this ordering makes
    more sense than checking base columns first, since it prevents surprises when a
    new base column is added that happens to have the same name as a virtual column.

good

@navis
Copy link
Copy Markdown
Contributor

navis commented Dec 8, 2016

And actually, it's possible to group-by on vc with some limitations(memory pressure), something like things done in SingleScanTimeDimSelector because group-by does not rely on cardinality of dimension.

@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Dec 8, 2016

@navis thanks for the feedback. On your points:

  • Move VirtualColumn and related classes to their own package.

Could you postpone this till things settled down?

yes, sure, I'll move the existing classes back. I'll still create a "virtual" package for the new classes though since there won't be conflicts there.

  • Add additional methods to VirtualColumn to support more kinds of selectors and
    to support cycle detection.
  • Add ExpressionVirtualColumn in core, offering math expressions as a virtual column.

This is done in math-virtual-column. Could you wait on it?

Is your impl substantially different from the one in here? If so, what's different about it? Or, if it's not really different, would the the one in here work for you?

I'm asking because the impl in here is necessary to keep some of the tests working properly, due to removal of makeMathExpressionSelector from ColumnSelectorFactory. If yours is different & better than my impl, could I merge it into this PR and replace my impl?

  • Remove "makeMathExpressionSelector" (replaced by "expression" virtual columns).

This is for making in-line VC which is very useful. but ok, we will keep that in ours.

I guessed you were using it for this. I see the feature is useful, but I still prefer to remove it and use explicit virtualColumns since that simplifies the code for column selectors, aggregators, and filters.

Finally, that's a good point about grouping. But I'd like to eventually modify the query engines so that we can control the memory usage. I was thinking of using a bounded memory dictionary with spilling for that (at least for groupBy, which can spill). That's a separate issue though.

@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Dec 8, 2016

Moved VirtualColumn and VirtualColumns back to original locations.

default:
return new BooleanValueMatcher(predicateFactory.makeStringPredicate().apply(null));
final ValueType type = cursor.getNativeType(dimension);
if (type == ValueType.LONG) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change from a switch to if statements?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getNativeType can return null, and switch doesn't like nulls.

@gianm gianm mentioned this pull request Dec 8, 2016
@gianm gianm added the Discuss label Dec 8, 2016
@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Dec 8, 2016

Tagged Discuss due to open questions about how to move forward with the subsystem (see also #3755 and https://groups.google.com/d/topic/druid-development/_Sd78s7yU5U/discussion).

@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Dec 8, 2016

@navis would you mind PRing "virtual-column-as-dimension" so we can see what direction you are proposing there? In this PR, not much progress is being made on treating them as dimensions… there's a dimension selector maker on VirtualColumn but it's not actually usable by the query engines or for filtering as-is. I had some vague ideas there but nothing really concrete.

In the end I'm happy throwing away most of my PR although it has some things I would like to keep:

  • the additional validations on VirtualColumns, including cycle detection, this helps fail fast
  • using VirtualColumns as the serde type instead of List, this helps fail fast too
  • checking VirtualColumns before base columns, this makes behavior less surprising
  • removing expressions from ColumnSelectorFactory and ValueMatcherFactory, this simplifies implementations (pending further discussion?)

To me only the last one seems really controversial.

@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Dec 14, 2016

Going to close this, pending resolution of the druid-development thread, since that appears to be going in a different direction. May reopen it later, scoped down a bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants