Fixing a data correctness issue in unnest when first row of an MVD is null #13764
Changes from all commits: bbd673d, 8091d5c, d0b7f6c, c7dd18f, 75b3670
```diff
@@ -200,11 +200,36 @@ public DruidQuery toDruidQuery(boolean finalizeAggregations)
     // This is necessary to handle the virtual columns on the unnestProject
     // Also create the unnest datasource to be used by the partial query
     PartialDruidQuery partialDruidQuery = unnestProjectNeeded ? partialQuery.withUnnest(unnestProject) : partialQuery;
+    String outputColName = unnestDatasourceRel.getUnnestProject().getRowType().getFieldNames().get(0);
+
+    // In the case of nested queries, e.g.
+    // with t AS (select * from druid.numfoo, unnest(MV_TO_ARRAY(dim3)) as unnested (d3))
+    // select d2,d3 from t, UNNEST(MV_TO_ARRAY(dim2)) as foo(d2)
+    // which plans to
+    // 186:LogicalCorrelate
+    // 178:LogicalProject
+    // 176:LogicalCorrelate
+    // 13:LogicalTableScan(subset=[rel#168:Subset#0.NONE.[]], table=[[druid, numfoo]])
+    // 172:Uncollect(subset=[rel#173:Subset#3.NONE.[]])
+    // 170:LogicalProject(subset=[rel#171:Subset#2.NONE.[]], EXPR$0=[MV_TO_ARRAY($cor0.dim3)])
+    // 14:LogicalValues(subset=[rel#169:Subset#1.NONE.[0]], tuples=[[{ 0 }]])
+    // 182:Uncollect(subset=[rel#183:Subset#8.NONE.[]])
+    // 180:LogicalProject(subset=[rel#181:Subset#7.NONE.[]], EXPR$0=[MV_TO_ARRAY($cor1.dim2)])
+    // 14:LogicalValues(subset=[rel#169:Subset#1.NONE.[0]], tuples=[[{ 0 }]])
+    //
+    // the column name cannot be EXPR$0 for both inner and outer. The inner one, which gets executed first, gets the
+    // name EXPR$0, and as we move up the tree we append a 0, making the top-level name EXPR$00.
+    // Ideally these names should be replaced by the alias names specified in the query. A future developer who can
+    // find these alias names should replace EXPR$0 with dim3 and EXPR$00 with dim2, i.e. use the correct names from Calcite
+
+    if (druidQueryRel instanceof DruidCorrelateUnnestRel) {
+      outputColName = outputColName + "0";
```
**Member:** I'm skeptical that this is always correct. Is it really cool?
**Contributor (Author):** This is a hacky way as of now; I have kept a pointer to this so it can be corrected by fetching the actual names. Will do this in a follow-up PR.
```diff
+    }
     return partialDruidQuery.build(
         UnnestDataSource.create(
             leftDataSource,
             dimOrExpToUnnest,
-            unnestDatasourceRel.getUnnestProject().getRowType().getFieldNames().get(0),
+            outputColName,
             null
         ),
         rowSignature,
```
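The suffixing step above can be sketched in isolation. The class and method names below are hypothetical (this is not Druid code); the sketch only illustrates the rule described in the diff comment: Calcite names each unnest projection EXPR$0, and each enclosing correlate level appends a "0" so inner and outer output columns stay distinct.

```java
// Hypothetical sketch of the disambiguation described in the diff comment
// above; class and method names are illustrative, not Druid's.
public class UnnestOutputNaming {
    // Append one "0" per enclosing unnest level, so the inner unnest keeps
    // Calcite's default EXPR$0 while the outer one becomes EXPR$00.
    static String outputColName(String calciteName, int enclosingUnnestLevels) {
        StringBuilder name = new StringBuilder(calciteName);
        for (int i = 0; i < enclosingUnnestLevels; i++) {
            name.append('0');
        }
        return name.toString();
    }

    public static void main(String[] args) {
        System.out.println(outputColName("EXPR$0", 0)); // inner unnest: EXPR$0
        System.out.println(outputColName("EXPR$0", 1)); // outer unnest: EXPR$00
    }
}
```

As the review thread notes, this suffixing is a stopgap; ideally the aliases from the query (d3, d2 in the example) would be used instead.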
```diff
@@ -97,6 +97,7 @@ public class CalciteTests
   public static final String DATASOURCE3 = "numfoo";
   public static final String DATASOURCE4 = "foo4";
   public static final String DATASOURCE5 = "lotsocolumns";
+  public static final String DATASOURCE6 = "unnestnumfoo";
```
**Contributor:** Better name?
```diff
   public static final String BROADCAST_DATASOURCE = "broadcast";
   public static final String FORBIDDEN_DATASOURCE = "forbiddenDatasource";
   public static final String SOME_DATASOURCE = "some_datasource";
```
```diff
@@ -351,6 +351,100 @@ public Optional<Joinable> build(
   public static final List<InputRow> ROWS1 =
       RAW_ROWS1.stream().map(TestDataBuilder::createRow).collect(Collectors.toList());

+  public static final List<ImmutableMap<String, Object>> RAW_ROWS_FOR_UNNEST = ImmutableList.of(
```
**Contributor:** Does this have all the interesting corner cases? Empty arrays or objects? Null values? Fields that appear in one nested object but not another (in both orders: (a, b), (a), (a, c))? And so on. To help future readers, it might be handy to add a comment above each row.

**Contributor (Author):** Good idea, will do.
```diff
+      ImmutableMap.<String, Object>builder()
+          .put("t", "2000-01-01")
+          .put("m1", "1.0")
+          .put("m2", "1.0")
+          .put("d1", 1.0)
+          .put("f1", 1.0f)
+          .put("l1", 7L)
+          .put("dim1", "")
+          .put("dim3", ImmutableList.of("a", ImmutableList.of("b", "c")))
```
**Member:** The string dimension indexer can't really handle nested arrays like this; I think you'll end up with something like "a" and then the toString of ["b","c"], or maybe something even weirder. I think you should stick to flat lists or single-layer strings for these tests.
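To illustrate the reviewer's point, here is a small sketch (a hypothetical class, not the actual string dimension indexer) of what a naive per-element String coercion does to a nested list: the inner list survives only as its toString() form rather than being flattened.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the reviewer's concern above, not Druid's indexer code: if a
// string-typed dimension coerces each element of a multi-value row to a
// String, a nested list degrades to its toString() form instead of being
// flattened into separate values.
public class NestedMvdCoercion {
    static List<String> coercePerElement(List<Object> multiValueRow) {
        List<String> out = new ArrayList<>();
        for (Object value : multiValueRow) {
            out.add(String.valueOf(value)); // naive per-element coercion
        }
        return out;
    }

    public static void main(String[] args) {
        // shaped like the first RAW_ROWS_FOR_UNNEST row's dim3 value
        List<Object> dim3 = Arrays.asList("a", Arrays.asList("b", "c"));
        System.out.println(coercePerElement(dim3)); // [a, [b, c]]
    }
}
```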
```diff
+          .put("dim5", ImmutableList.of("a5", "b5"))
+          .put("dim6", "1")
+          .build(),
+      ImmutableMap.<String, Object>builder()
+          .put("t", "2000-01-01")
+          .put("m1", "1.0")
+          .put("m2", "1.0")
+          .put("d1", 1.0)
+          .put("f1", 1.0f)
+          .put("l1", 7L)
+          .put("dim1", "")
+          .put("dim2", ImmutableList.of("a", "b"))
+          .put("dim3", ImmutableList.of("a", ImmutableList.of("b", "c")))
+          .put("dim4", ImmutableList.of("x", "y"))
+          .put("dim5", ImmutableList.of("c5", "d5"))
+          .put("dim6", "1")
+          .build(),
+      ImmutableMap.<String, Object>builder()
+          .put("t", "2000-01-02")
+          .put("m1", "2.0")
+          .put("m2", "2.0")
+          .put("d1", 1.7)
+          .put("d2", 1.7)
+          .put("f1", 0.1f)
+          .put("f2", 0.1f)
+          .put("l1", 325323L)
+          .put("l2", 325323L)
+          .put("dim1", "10.1")
+          .put("dim2", ImmutableList.of())
+          .put("dim3", ImmutableList.of("b", "c"))
+          .put("dim4", ImmutableList.of("p", "q"))
+          .put("dim5", ImmutableList.of("e5", "f5"))
+          .put("dim6", "2")
+          .build(),
+      ImmutableMap.<String, Object>builder()
+          .put("t", "2000-01-03")
+          .put("m1", "3.0")
+          .put("m2", "3.0")
+          .put("d1", 0.0)
+          .put("d2", 0.0)
+          .put("f1", 0.0)
+          .put("f2", 0.0)
+          .put("l1", 0)
+          .put("l2", 0)
+          .put("dim1", "2")
+          .put("dim2", ImmutableList.of(""))
+          .put("dim3", ImmutableList.of("d"))
+          .put("dim4", ImmutableList.of("", "null", ""))
+          .put("dim5", ImmutableList.of("a5", "b5"))
+          .put("dim6", "3")
+          .build(),
+      ImmutableMap.<String, Object>builder()
+          .put("t", "2001-01-01")
+          .put("m1", "4.0")
+          .put("m2", "4.0")
+          .put("dim1", "1")
+          .put("dim2", ImmutableList.of("a"))
+          .put("dim3", ImmutableList.of(""))
+          .put("dim4", "b")
+          .put("dim5", "a5")
+          .put("dim6", "4")
+          .build(),
+      ImmutableMap.<String, Object>builder()
+          .put("t", "2001-01-02")
+          .put("m1", "5.0")
+          .put("m2", "5.0")
+          .put("dim1", "def")
+          .put("dim2", ImmutableList.of("abc"))
+          .put("dim3", ImmutableList.of())
+          .put("dim4", "b")
+          .put("dim5", "b5")
+          .put("dim6", "5")
+          .build(),
+      ImmutableMap.<String, Object>builder()
+          .put("t", "2001-01-03")
+          .put("m1", "6.0")
+          .put("m2", "6.0")
+          .put("dim1", "abc")
+          .put("dim4", "b")
+          .put("dim5", "ab5")
+          .put("dim6", "6")
+          .build()
+  );

   public static final List<ImmutableMap<String, Object>> RAW_ROWS1_WITH_NUMERIC_DIMS = ImmutableList.of(
       ImmutableMap.<String, Object>builder()
           .put("t", "2000-01-01")
```
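Since the PR title is about a correctness issue when the first row of an MVD is null, a short sketch may help show the row-alignment property this fixture is designed to exercise. The class below is hypothetical and the null-handling rule is an assumption stated for illustration, not Druid's implementation: a null multi-value row should still contribute one output row so that later rows' values do not shift into the wrong positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal sketch (assumed semantics, not Druid code) of unnesting a
// multi-value column: one output row per element, and a null MVD still
// yields a single null row instead of being silently dropped.
public class UnnestNullSketch {
    static List<String> unnest(List<List<String>> column) {
        List<String> out = new ArrayList<>();
        for (List<String> row : column) {
            if (row == null) {
                out.add(null); // a null MVD still yields one null output row
            } else {
                out.addAll(row); // otherwise one output row per element
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // the first row's MVD is null, as in the scenario from the PR title
        List<List<String>> dim3 = new ArrayList<>();
        dim3.add(null);
        dim3.add(Arrays.asList("b", "c"));
        dim3.add(Arrays.asList("d"));
        System.out.println(unnest(dim3)); // [null, b, c, d]
    }
}
```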
```diff
@@ -435,6 +529,9 @@ public Optional<Joinable> build(
   public static final List<InputRow> ROWS1_WITH_NUMERIC_DIMS =
       RAW_ROWS1_WITH_NUMERIC_DIMS.stream().map(raw -> createRow(raw, NUMFOO_SCHEMA)).collect(Collectors.toList());

+  public static final List<InputRow> ROWS_WITH_NUMERIC_DIMS_FOR_UNNEST =
+      RAW_ROWS_FOR_UNNEST.stream().map(raw -> createRow(raw, NUMFOO_SCHEMA)).collect(Collectors.toList());

   public static final List<ImmutableMap<String, Object>> RAW_ROWS2 = ImmutableList.of(
       ImmutableMap.<String, Object>builder()
           .put("t", "2000-01-01")
```
```diff
@@ -811,6 +908,14 @@ public static SpecificSegmentsQuerySegmentWalker createMockWalker(
         .rows(USER_VISIT_ROWS)
         .buildMMappedIndex();

+    final QueryableIndex unnestIndex = IndexBuilder
+        .create()
+        .tmpDir(new File(tmpDir, "9"))
+        .segmentWriteOutMediumFactory(OffHeapMemorySegmentWriteOutMediumFactory.instance())
+        .schema(INDEX_SCHEMA_NUMERIC_DIMS)
+        .rows(ROWS_WITH_NUMERIC_DIMS_FOR_UNNEST)
+        .buildMMappedIndex();

     return new SpecificSegmentsQuerySegmentWalker(
         conglomerate,
         injector.getInstance(LookupExtractorFactoryContainerProvider.class),
```
```diff
@@ -906,6 +1011,15 @@ public static SpecificSegmentsQuerySegmentWalker createMockWalker(
             .size(0)
             .build(),
         userVisitIndex
+    ).add(
+        DataSegment.builder()
+            .dataSource(CalciteTests.DATASOURCE6)
+            .interval(indexNumericDims.getDataInterval())
+            .version("1")
+            .shardSpec(new LinearShardSpec(0))
+            .size(0)
+            .build(),
+        unnestIndex
     ).add(
         DataSegment.builder()
             .dataSource("wikipedia")
```
Thanks much for the detailed explanation!