[fix](orc) Should not pass selection vector when decode child column of List or Map #50136
TPC-H: Total hot run time: 33965 ms
TPC-DS: Total hot run time: 191565 ms
ClickBench: Total hot run time: 30.25 s
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
TPC-H: Total hot run time: 34071 ms
TPC-DS: Total hot run time: 191562 ms
ClickBench: Total hot run time: 29.14 s
kaka11chen
left a comment
LGTM
…of List or Map (apache#50136)

Related PR: apache#18615

Problem Summary:

The problem is like apache/doris-thirdparty#256

When performing late materialization for LIST or MAP types, filters should not be applied directly to their child fields. These complex types rely on offsets to correctly map parent-child relationships within the columnar storage layout (e.g., in ORC or Parquet files).

If filters are applied to the children of a LIST or MAP field, it may cause inconsistencies in the offset alignment, leading to incorrect data being read, such as mismatched elements, missing values, or even runtime errors. This breaks the structural integrity of the nested data and can produce incorrect query results.

```text
mysql> select * from complex_data_orc;
+------+--------------------------+-----------------+
| id   | m                        | l               |
+------+--------------------------+-----------------+
|    1 | {"a":1, "b":2}           | ["a", "b"]      |
|    2 | {"b":3, "c":4}           | ["b"]           |
|    3 | {"c":5, "a":6, "b":7}    | ["c", "a"]      |
|    4 | {"a":8, "c":9}           | ["b", "c"]      |
|    5 | {"b":10, "a":11}         | ["a"]           |
|    6 | {"c":12, "b":13}         | ["c"]           |
|    7 | {"a":15}                 | ["a", "a"]      |
|    8 | {"b":17}                 | ["b", "b"]      |
|    9 | {"c":19}                 | ["c", "c"]      |
|   10 | {"a":20, "b":21, "c":22} | ["a", "b", "c"] |
+------+--------------------------+-----------------+
10 rows in set (0.02 sec)

!!!WRONG RESULT:
mysql> select * from complex_data_orc where id > 2;
+------+--------------------------+----------------+
| id   | m                        | l              |
+------+--------------------------+----------------+
|    3 | {"c":5, "a":6, "b":7}    | ["c", "a"]     |
|    4 | {"a":8, "c":9}           | ["b", "c"]     |
|    5 | {"b":10, "":11}          | ["a"]          |
|    6 | {"":12, "":13}           | ["c"]          |
|    7 | {"":15}                  | ["a", ""]      |
|    8 | {"":17}                  | ["", ""]       |
|    9 | {"":19}                  | ["", ""]       |
|   10 | {"a":20, "b":21, "c":22} | ["", "b", "c"] |
+------+--------------------------+----------------+
8 rows in set (0.02 sec)
```

To ensure correctness, filters should only be applied at the top level of the LIST or MAP, and their children should be read in full when late materialization occurs.

After this pr:

```text
mysql> select * from complex_data_orc where id > 2;
+------+--------------------------+-----------------+
| id   | m                        | l               |
+------+--------------------------+-----------------+
|    3 | {"c":5, "a":6, "b":7}    | ["c", "a"]      |
|    4 | {"a":8, "c":9}           | ["b", "c"]      |
|    5 | {"b":10, "a":11}         | ["a"]           |
|    6 | {"c":12, "b":13}         | ["c"]           |
|    7 | {"a":15}                 | ["a", "a"]      |
|    8 | {"b":17}                 | ["b", "b"]      |
|    9 | {"c":19}                 | ["c", "c"]      |
|   10 | {"a":20, "b":21, "c":22} | ["a", "b", "c"] |
+------+--------------------------+-----------------+
8 rows in set (1.41 sec)
```
What problem does this PR solve?
Related PR: #18615
Problem Summary:
This problem is similar to apache/doris-thirdparty#256.
When performing late materialization for LIST or MAP types, filters should not be applied directly to their child fields. These complex types rely on offsets to correctly map parent-child relationships within the columnar storage layout (e.g., in ORC or Parquet files).
If filters are applied to the children of a LIST or MAP field, it may cause inconsistencies in the offset alignment, leading to incorrect data being read—such as mismatched elements, missing values, or even runtime errors. This breaks the structural integrity of the nested data and can produce incorrect query results.
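The misalignment can be illustrated with a toy model of a LIST column stored as per-row offsets plus one flat child vector. This is only a sketch under stated assumptions: the helper names (`encode`, `decode_children`, `assemble`) are hypothetical, the "selection treated as child positions" behavior is one simplified way the mismatch could arise, and none of it is the actual Doris or ORC decoder code. The data is the `l` column from the example table in this PR.

```python
# Toy model: a LIST column is stored as row offsets plus a flat child vector.
# All names here are hypothetical; this is not the Doris/ORC implementation.

rows = [["a", "b"], ["b"], ["c", "a"], ["b", "c"], ["a"],
        ["c"], ["a", "a"], ["b", "b"], ["c", "c"], ["a", "b", "c"]]
ids = list(range(1, 11))


def encode(rows):
    """Flatten rows into (offsets, children); row r owns children[offsets[r]:offsets[r+1]]."""
    offsets, children = [0], []
    for row in rows:
        children.extend(row)
        offsets.append(len(children))
    return offsets, children


def decode_children(children, selection=None):
    """Decode child values; positions not in `selection` are left as empty strings."""
    if selection is None:
        return list(children)
    out = [""] * len(children)
    for i in selection:
        if i < len(children):
            out[i] = children[i]
    return out


def assemble(offsets, decoded, selected_rows):
    """Rebuild the selected rows from decoded children using the parent's offsets."""
    return [decoded[offsets[r]:offsets[r + 1]] for r in selected_rows]


offsets, children = encode(rows)
selected_rows = [i for i, v in enumerate(ids) if v > 2]  # predicate: id > 2

# Assumed failure mode: the row-level selection is handed to the child decoder,
# which interprets row indices as child element positions.
wrong = assemble(offsets, decode_children(children, selected_rows), selected_rows)

# Children decoded in full; the selection is applied only through the offsets.
right = assemble(offsets, decode_children(children), selected_rows)

for row_id, w, r in zip(ids[2:], wrong, right):
    print(row_id, w, r)
```

In this model the two results diverge as soon as a selected row's child range falls outside the positions named by the row-level selection, because row indices and element indices only coincide when every list has exactly one element.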
To ensure correctness, filters should only be applied at the top level of the LIST or MAP, and their children should be read in full when late materialization occurs.
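The rule above can be sketched as a walk over the column type tree that decides which decoder receives the row-level selection. This is a minimal illustration, not the code changed in this PR: `ColumnType` and `plan_selection` are hypothetical names, and passing the selection through STRUCT fields is an assumption made here because struct fields are row-aligned with their parent.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnType:
    """Hypothetical type node: kind is e.g. "int", "string", "struct", "list", "map"."""
    kind: str
    children: list = field(default_factory=list)


def plan_selection(col, selection, path="root"):
    """Return {path: selection-or-None} describing what each decoder would receive.

    Children of LIST/MAP are indexed by element, not by row, so the row-level
    selection is dropped (None means "read in full") for everything below them.
    """
    plan = {path: selection}
    child_selection = None if col.kind in ("list", "map") else selection
    for i, child in enumerate(col.children):
        plan.update(plan_selection(child, child_selection, f"{path}.{i}"))
    return plan


# Schema shaped like the example table: struct<id:int, m:map<string,int>, l:list<string>>
schema = ColumnType("struct", [
    ColumnType("int"),
    ColumnType("map", [ColumnType("string"), ColumnType("int")]),
    ColumnType("list", [ColumnType("string")]),
])

plan = plan_selection(schema, [2, 3])
for path, sel in plan.items():
    print(path, sel)
```

Once the selection is dropped at a LIST or MAP, it stays dropped for any deeper nesting, so a struct inside a list is also read in full in this sketch.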
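After this PR, `select * from complex_data_orc where id > 2` returns the 8 matching rows with their map keys and list elements intact.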
Release note
None