@kaka11chen

[Fix] Fix late materialization of list type by reading all of its children.

Contributor

@morningman left a comment

LGTM

@morningman merged commit b895900 into apache:orc Nov 25, 2024
morningman pushed a commit to apache/doris that referenced this pull request Apr 18, 2025
…of List or Map (#50136)

### What problem does this PR solve?
Related PR: #18615 

Problem Summary:
This problem is similar to apache/doris-thirdparty#256.
When performing late materialization for LIST or MAP types, filters
should not be applied directly to their child fields. These complex
types rely on offsets to correctly map parent-child relationships within
the columnar storage layout (e.g., in ORC or Parquet files).

Applying filters to the children of a LIST or MAP field can misalign the
offsets, so the reader returns incorrect data: mismatched elements,
missing values, or even runtime errors. This breaks the structural
integrity of the nested data and can produce incorrect query results,
as the example below shows.

```text
mysql> select * from complex_data_orc;
+------+--------------------------+-----------------+
| id   | m                        | l               |
+------+--------------------------+-----------------+
|    1 | {"a":1, "b":2}           | ["a", "b"]      |
|    2 | {"b":3, "c":4}           | ["b"]           |
|    3 | {"c":5, "a":6, "b":7}    | ["c", "a"]      |
|    4 | {"a":8, "c":9}           | ["b", "c"]      |
|    5 | {"b":10, "a":11}         | ["a"]           |
|    6 | {"c":12, "b":13}         | ["c"]           |
|    7 | {"a":15}                 | ["a", "a"]      |
|    8 | {"b":17}                 | ["b", "b"]      |
|    9 | {"c":19}                 | ["c", "c"]      |
|   10 | {"a":20, "b":21, "c":22} | ["a", "b", "c"] |
+------+--------------------------+-----------------+
10 rows in set (0.02 sec)

!!!WRONG RESULT:
mysql> select * from complex_data_orc where id > 2;
+------+--------------------------+----------------+
| id   | m                        | l              |
+------+--------------------------+----------------+
|    3 | {"c":5, "a":6, "b":7}    | ["c", "a"]     |
|    4 | {"a":8, "c":9}           | ["b", "c"]     |
|    5 | {"b":10, "":11}          | ["a"]          |
|    6 | {"":12, "":13}           | ["c"]          |
|    7 | {"":15}                  | ["a", ""]      |
|    8 | {"":17}                  | ["", ""]       |
|    9 | {"":19}                  | ["", ""]       |
|   10 | {"a":20, "b":21, "c":22} | ["", "b", "c"] |
+------+--------------------------+----------------+
8 rows in set (0.02 sec)
```
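
A minimal Python sketch of why the offsets matter, assuming a simplified in-memory layout (one flat child array plus an offsets array) rather than the actual ORC reader internals. The helper names `build_list_column` and `first_surviving_row` are hypothetical, and the "naive skip" only models one plausible way a row-level filter can be misapplied to child elements; it is not claimed to reproduce the exact corruption shown above.

```python
def build_list_column(rows):
    """Flatten a list column into (child values, offsets).

    offsets has len(rows) + 1 entries; row i occupies
    values[offsets[i]:offsets[i + 1]].
    """
    values, offsets = [], [0]
    for r in rows:
        values.extend(r)
        offsets.append(len(values))
    return values, offsets


# The `l` column of complex_data_orc from the example above.
rows = [
    ["a", "b"], ["b"], ["c", "a"], ["b", "c"], ["a"],
    ["c"], ["a", "a"], ["b", "b"], ["c", "c"], ["a", "b", "c"],
]
values, offsets = build_list_column(rows)

# `id > 2` rejects the first two rows (ids 1 and 2).
skipped_rows = 2

# Correct: skip as many child elements as the rejected rows contain.
correct_skip = offsets[skipped_rows]
# Assumed mistake: treat the row count as a child-element count.
naive_skip = skipped_rows


def first_surviving_row(skip):
    """Read the first surviving row (id = 3) after skipping `skip` children."""
    length = offsets[skipped_rows + 1] - offsets[skipped_rows]
    return values[skip:skip + length]


print(first_surviving_row(correct_skip))  # ['c', 'a']
print(first_surviving_row(naive_skip))    # ['b', 'c']
```

The rejected rows hold three child elements in total, so skipping only two leaves every later row shifted by one element.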

To ensure correctness, filters should only be applied at the top level
of the LIST or MAP, and their children should be read in full when late
materialization occurs.
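
The rule can be sketched in Python as follows, again assuming a simplified flat-values-plus-offsets layout and not the real reader code. `filter_list_column` is a hypothetical helper: it reads the child array in full and applies the row-level mask only at the parent level, rebuilding the offsets for the surviving rows.

```python
def filter_list_column(values, offsets, keep):
    """Apply a row-level mask to a list column.

    The child array is read in full; rows are selected through the
    offsets, and new offsets are rebuilt for the surviving rows.
    """
    out_values, out_offsets = [], [0]
    for i, keep_row in enumerate(keep):
        if keep_row:
            out_values.extend(values[offsets[i]:offsets[i + 1]])
            out_offsets.append(len(out_values))
    return out_values, out_offsets


# Flattened `l` column of complex_data_orc and its offsets.
values = ["a", "b", "b", "c", "a", "b", "c", "a", "c",
          "a", "a", "b", "b", "c", "c", "a", "b", "c"]
offsets = [0, 2, 3, 5, 7, 8, 9, 11, 13, 15, 18]

# Row-level mask for `id > 2` over ids 1..10.
keep = [row_id > 2 for row_id in range(1, 11)]

new_values, new_offsets = filter_list_column(values, offsets, keep)
result = [new_values[new_offsets[i]:new_offsets[i + 1]]
          for i in range(len(new_offsets) - 1)]
print(result)
```

Because rows are always cut out through the offsets, the surviving lists keep their original contents regardless of how many elements the rejected rows held.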

After this PR:
```text
mysql> select * from complex_data_orc where id > 2;
+------+--------------------------+-----------------+
| id   | m                        | l               |
+------+--------------------------+-----------------+
|    3 | {"c":5, "a":6, "b":7}    | ["c", "a"]      |
|    4 | {"a":8, "c":9}           | ["b", "c"]      |
|    5 | {"b":10, "a":11}         | ["a"]           |
|    6 | {"c":12, "b":13}         | ["c"]           |
|    7 | {"a":15}                 | ["a", "a"]      |
|    8 | {"b":17}                 | ["b", "b"]      |
|    9 | {"c":19}                 | ["c", "c"]      |
|   10 | {"a":20, "b":21, "c":22} | ["a", "b", "c"] |
+------+--------------------------+-----------------+
8 rows in set (1.41 sec)
```
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Apr 23, 2025
…of List or Map (apache#50136)

suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Apr 23, 2025
…of List or Map (apache#50136)

suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Apr 24, 2025
…of List or Map (apache#50136)

suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Apr 24, 2025
…of List or Map (apache#50136)

suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Apr 24, 2025
…of List or Map (apache#50136)

koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…of List or Map (apache#50136)
