Fix row schema of nested queries with ORDER BY clause to get proper row type during planning validation #13173
somu-imply wants to merge 10 commits into apache:master
Conversation
+ ")";
final String legacyExplanation =
"DruidOuterQueryRel(query=[{\"queryType\":\"groupBy\",\"dataSource\":{\"type\":\"query\",\"query\":{\"queryType\":\"scan\",\"dataSource\":{\"type\":\"table\",\"name\":\"__subquery__\"},\"intervals\":{\"type\":\"intervals\",\"intervals\":[\"-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z\"]},\"resultFormat\":\"list\",\"granularity\":{\"type\":\"all\"}}},\"intervals\":{\"type\":\"intervals\",\"intervals\":[\"-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z\"]},\"granularity\":{\"type\":\"all\"},\"dimensions\":[],\"aggregations\":[{\"type\":\"count\",\"name\":\"a0\"}],\"limitSpec\":{\"type\":\"NoopLimitSpec\"},\"context\":{\"defaultTimeout\":300000,\"maxScatterGatherBytes\":9223372036854775807,\"sqlCurrentTimestamp\":\"2000-01-01T00:00:00Z\",\"sqlQueryId\":\"dummy\",\"vectorize\":\"false\",\"vectorizeVirtualColumns\":\"false\"}}], signature=[{a0:LONG}])\n"
"DruidOuterQueryRel(query=[{\"queryType\":\"timeseries\",\"dataSource\":{\"type\":\"table\",\"name\":\"__subquery__\"},\"intervals\":{\"type\":\"intervals\",\"intervals\":[\"-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z\"]},\"granularity\":{\"type\":\"all\"},\"aggregations\":[{\"type\":\"count\",\"name\":\"a0\"}],\"context\":{\"defaultTimeout\":300000,\"maxScatterGatherBytes\":9223372036854775807,\"sqlCurrentTimestamp\":\"2000-01-01T00:00:00Z\",\"sqlQueryId\":\"dummy\",\"vectorize\":\"false\",\"vectorizeVirtualColumns\":\"false\"}}], signature=[{a0:LONG}])\n"
This comes from the change in the subquery to move to a table data source from a query data source and calcite planning it differently
This is certainly simpler. But, is it correct? Do we have a test that runs this same query, but we validate the results? The query in question is:
SELECT COUNT(*)
FROM (
SELECT DISTINCT dim2
FROM druid.foo
WHERE SUBSTRING(dim2, 1, 1) IN (
SELECT SUBSTRING(dim1, 1, 1) FROM druid.foo WHERE dim1 IS NOT NULL
)
)
The DISTINCT requires grouping. For the new plan to be valid, we'd have to know that the timeseries query can do the grouping. Yet timeseries, I believe, only knows how to group on time. So the new plan may not actually produce the correct results. (That said, this stuff is complex, and perhaps there is something I'm missing.)
I validated the result for both plans; they returned the same count. Also, the outer query is now a timeseries, but one of the inner queries still does a group by (line 113), so the grouping and counting should be intact, IMO.
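The rule this thread turns on can be sketched in isolation. The helper below is purely hypothetical (not Druid's actual planner code; the class and method names are made up): it captures the idea that an aggregation with no grouping dimensions can be served by a timeseries query, while any explicit dimensions require groupBy.

```java
import java.util.List;

// Toy sketch only: Druid's real planner logic is far more involved.
public class QueryTypeChooser
{
  // Pick the native query type an aggregation could map to, given the
  // grouping dimensions of the (outer) query. COUNT(*) over a subquery
  // groups on nothing, so it can plausibly become a timeseries query.
  static String chooseQueryType(List<String> groupingDimensions)
  {
    return groupingDimensions.isEmpty() ? "timeseries" : "groupBy";
  }
}
```

Under this rule, the outer COUNT(*) has no dimensions of its own, so a timeseries plan is plausible as long as the DISTINCT grouping is handled by the inner groupBy.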
paul-rogers left a comment:
@somu-imply, Calcite is hard, thanks for digging in to track down the issue. I wonder, however, if we've actually found the root cause as some of the changes seem a bit off.
It would be super-helpful if you could add a bit more text to the description. First, what result do you see today? Why is that wrong? What is the result we expect? Why are we not getting that result? That then sets us up to explain why your fix corrects the issue and now delivers the correct results.
    );
    return null;
  }
  if (!dataSource.isConcrete()) {
This is an unfortunate limitation. There is another PR to fix this limitation. I wonder, why was the check removed in this PR?
This was introduced in #12418 when we moved to using a QueryDataSource. A TableDataSource always returns true for isConcrete, so that part of the code can be avoided.
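A minimal model of the isConcrete distinction being discussed (the class names mirror the Druid types, but this is a simplified sketch, not the real implementation):

```java
// Simplified sketch: a table data source is directly scannable (concrete),
// while a query data source wraps a subquery and is not.
public class ConcretenessSketch
{
  interface DataSource
  {
    boolean isConcrete();
  }

  static class TableDataSource implements DataSource
  {
    @Override
    public boolean isConcrete()
    {
      return true;   // a physical table is always concrete
    }
  }

  static class QueryDataSource implements DataSource
  {
    @Override
    public boolean isConcrete()
    {
      return false;  // a wrapped subquery is not directly scannable
    }
  }
}
```

This is why moving the subquery back to a TableDataSource makes the isConcrete check trivially pass.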
// fallback for cases other than group by inner query with order by with alias on __time column
sourceRowSignature = RowSignatures.fromRelDataType(
    sourceRel.getRowType().getFieldNames(),
    sourceRel.getRowType()
Why does the sourceRel return the wrong row type? A naive reading of this says that we ask the source rel for its row type. Each rel can have only one "output" row type. Why do we get the wrong one if the source rel is a DruidRel? Is there a bug earlier in the food chain? Said another way, is the getRowType() method wrong for DruidRel?
This was a case encountered during aliasing. If a query was specified with an alias, for example select __time as t, dim1 as d from foo, then for the inner query the field name was changing from __time to t; for that reason the fallback was used. #12418 should have some more details.
If aliasing is used, Calcite should ensure that the signature of the inner SELECT provides the alias names. If not, we broke something since this is pretty basic SQL functionality.
If, on the other hand, we want the column name __time, even if the user said __time AS t, then we're using non-standard SQL behavior. We can unpack the alias to work back to the base column if we need to know that this is the __time column. However, we should not need to: the outer query should not, in general, special-case __time. Our implementation might, given that native queries are focused on the event use case.
Suppose we have to know that t is really __time. The proper thing to do is to check if the column is an alias. If so, get the base column as above. It may also be possible to attach traits to the column. Columns can have traits such as their cardinality, etc. We might be able to add a trait saying the column kind (__time, dimension, measure, or whatever).
Another concern. If we return as the row signature the signature of the source, we may miss computed columns:
SELECT * FROM (
  SELECT __time AS t, myCol AS c, a + b AS d FROM foo
)
Here, the signature of the inner query is (t TIMESTAMP, c VARCHAR, d BIGINT), say. If we get the source signature (based on foo), we would omit the computed column d, which is likely to cause cascading issues.
Let's be sure to test this case to validate the workaround.
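The concern can be made concrete with a toy comparison of the two signatures. All column names and types below are invented for illustration; this is not Druid code.

```java
import java.util.List;

// Toy illustration: taking the signature from the source table instead of
// from the inner query's row type silently drops the computed column d.
public class SignatureSketch
{
  // Hypothetical signature of the base table foo
  static List<String> sourceTableColumns()
  {
    return List.of("__time", "myCol", "a", "b");
  }

  // Signature of the inner SELECT: t and c are renames, d = a + b is computed
  static List<String> innerQueryColumns()
  {
    return List.of("t", "c", "d");
  }
}
```

The computed column d appears only in the inner query's signature; a fallback to the source's signature loses it entirely.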
call.transformTo(outerQueryRel);
try {
  call.transformTo(outerQueryRel);
}
Would be super-helpful to include comments on code such as this. Calcite is hard enough to understand already: every hint helps.
This occurs in the sort rule: Sort <-- DruidRel --> DruidRel(with sort). Before, if the replacement failed, we'd just throw the exception. Here, we're catching an exception and ignoring the rewrite.
This seems a bit fishy. First, that sort has to be pushed into the query, right? We have no way to express a sort as an "operator" without it being part of a native query.
Second, waiting for it to fail, then ignoring the error, seems...odd. Should we not check whatever the condition is we want to ignore, and avoid doing the rewrite in that case? But, that raises the above question: if we don't rewrite away that sort here, when will we do it?
Point is: Calcite is really complex. Sometimes the fix is not where you first find the problem, it can be multiple steps back where we introduced a fault.
You're right to be suspicious. This try/catch is papering over some other bug. We should never be doing a call.transformTo that isn't going to work: our rules and our calls to isValidDruidQuery() are meant to be validating what is OK before we attempt to do the transformation. So, instead of doing the try/catch here, it's better to fix the bug that caused us to get to this line in the first place.
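The validate-then-transform shape being suggested could look roughly like this. The names are placeholders, not the real Calcite/Druid APIs, and the check is reduced to a single boolean for illustration.

```java
// Placeholder sketch: check validity up front rather than catching a
// failed transformTo after the fact.
public class RuleSketch
{
  interface Rel
  {
    boolean isValidDruidQuery();
  }

  static boolean tryTransform(Rel outerQueryRel)
  {
    if (!outerQueryRel.isValidDruidQuery()) {
      return false;  // skip the rewrite entirely; no exception to swallow
    }
    // In the real rule, call.transformTo(outerQueryRel) would go here.
    return true;
  }
}
```

The point of the pattern is that the rule never hands Calcite a rel it already knows cannot be planned.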
Thanks for your comments. I'll investigate further
Removed the try/catch thing as this is getting fixed upstream
public void testOrderByAlongWithInternalScanQuery()
{
  testQuery(
      "select __time as t, m1 from druid.foo where (m1 in (select distinct m1 from druid.foo)) order by 1 limit 1",
Thanks for the new test. The PR description provides a different query. However, I see no test for that specific query. Does such a test already exist? If so, then this change should have required a change to the expected results. Else, should we add a test for the actual case that this PR claims to fix?
I'll add a test mimicking the actual query
One other thought: the title of this PR is a bit generic. This PR handles a very specific use case. Part of the challenge of reviewing is the need to reverse-engineer the use case from the code. Would be super-helpful for the title to say, "Fix row schema for nested queries with ..." or whatever the specific issue is.
It will likely fix #13163
@somu-imply - I looked at that benchmark. None of those queries seem to have any complex subquery. It would be better to hand-craft a couple of nested group-by queries that are going to hit the code path you changed. The nested group-by can have a large in-filter (1000 values). That way we know how much time is getting added to planning.
Will do so @abhishekagarwal87. Based on comments from Paul and Gian, I'll look into fixing the DruidRel so that the try/catch can be avoided.
@somu-imply, thanks much for the detailed explanation! It strikes me that the problem is in how we validate queries. We really should not have to do an artificial
As a result, every step should have a correct row type (schema). If not, all heck breaks loose, same as if we randomly changed some numbers in an algorithm and expected correct answers. If we have to jimmy the code to get the right answers, it means something is wrong somewhere in that list above. Unfortunately, Calcite is horribly complex. (Not Calcite's fault; all SQL planners are horribly complex.) So it is a real PITA to find where things go wrong. As a result, we may have to do a hack just to keep things moving. But the need for the hack suggests there are deeper issues, and we should track those down at some point, else they'll just pop up again elsewhere.
@abhishekagarwal87 I have updated the benchmarks. I tested the working build (before 2022.04) and the current build with the changes. There seems to be not much increase in planning time. I added another nested query with group by as well, which also had a similar planning time.
//23: Group by and Order by with alias with large in filter nested query
"SELECT __time as t, dimSequential from foo "
+ " where dimSequential in (select dimSequential from foo where "
+ " dimSequential in (select dimSequential from foo)) "
A large foo table here does not affect planning time. But a filter with inline data, like dimSequential in ('1', '2', '3' .. '100'), is what we know does increase the planning time.
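For instance, a small helper like the following (not part of the benchmark code; purely illustrative) could generate such an inline IN filter with an arbitrary number of values:

```java
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Builds an IN filter with n literal values, e.g. for stressing planning time.
public class InFilterSketch
{
  static String largeInFilter(String column, int n)
  {
    String values = IntStream.rangeClosed(1, n)
                             .mapToObj(i -> "'" + i + "'")
                             .collect(Collectors.joining(", "));
    return column + " IN (" + values + ")";
  }
}
```

Calling largeInFilter("dimSequential", 1000) yields a 1000-value filter of the kind suggested above.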
Thanks @abhishekagarwal87 I have updated the benchmark tests by adding 2 queries with ~1000 values of IN filters.
Before (using 0.22.0 as baseline)
Benchmark (query) (rowsPerSegment) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.planSql 24 5000000 false avgt 5 189.154 ± 36.029 ms/op
SqlBenchmark.planSql 25 5000000 false avgt 5 135.821 ± 22.867 ms/op
After (this PR build )
Benchmark (query) (rowsPerSegment) (storageType) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.planSql 24 5000000 mmap false avgt 5 229.036 ± 49.337 ms/op
SqlBenchmark.planSql 25 5000000 mmap false avgt 5 145.159 ± 28.671 ms/op
There has been an increase of ~40 ms for the query with 1000 IN filter values, whereas for the other nested query the increase has been ~10 ms. Please advise whether this much change is acceptable.
@paul-rogers I have removed the dubious try/catch.
Closing this in favor of #13965, which fixes the planning errors on the most up-to-date code. All the tests added here are part of that PR as well.
Queries like
were failing during the partial query generation phase. The change in this PR just makes DruidOuterQueryRel eagerly 'validate' the subquery by explaining it, so that the exception happens early and the 'is valid' check catches it and returns false, which allows the planner to proceed to other plans.
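The eager-validation idea described above can be sketched as follows. All names are placeholders standing in for the real DruidOuterQueryRel logic and its explain path; this is a shape sketch, not the actual implementation.

```java
// Hedged sketch: explain the subquery eagerly during validation so that a
// planning failure surfaces as isValid == false instead of a late exception.
public class EagerValidationSketch
{
  interface SubQuery
  {
    void toDruidQueryForExplain();  // assumed to throw on an unplannable query
  }

  static boolean isValid(SubQuery subQuery)
  {
    try {
      subQuery.toDruidQueryForExplain();  // forces full row-type resolution
      return true;
    }
    catch (RuntimeException e) {
      return false;  // the planner can fall back to other plans
    }
  }
}
```

With this shape, an unplannable subquery is rejected during validation rather than blowing up mid-transformation.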
#12418 tried to address an issue with using an alias in an outer query with an ORDER BY clause. The root cause was that during query explanation the source relation's row type was mismatched: in the first run it was set to the column name, while in the next run it was set to the alias name. To mitigate the difference, we changed the subquery's data source from a TableDataSource to a QueryDataSource, which introduced the defect shown here.
To solve this, we move back to the old implementation and add a check of the row type with an additional toDruidQueryForExplain call that ensures the correct row types are set. This additional call does not impact performance, as the planning time remained almost the same; a run of our planning benchmarks indicates the numbers remain in the same ballpark.
Before this change
After the change
Separate test cases have been added for this failed case and also for #13163.