[Feature](Variant) Implement inner nested data type for variant type #39022
Thank you for your contribution to Apache Doris. Since 2024-03-18, the Document has been moved to doris-website.
clang-tidy made some suggestions
Force-pushed: 7b9885c to 5820cda
run buildall
TPC-H: Total hot run time: 41920 ms
TPC-DS: Total hot run time: 169921 ms
ClickBench: Total hot run time: 30.18 s
run buildall
TPC-H: Total hot run time: 39346 ms
run buildall
clang-tidy made some suggestions
  }

- void ColumnObject::finalize(bool ignore_sparse) {
+ void ColumnObject::finalize(FinalizeMode mode) {
warning: method 'finalize' can be made const [readability-make-member-function-const]
be/src/vec/columns/column_object.h:370:
- void finalize(FinalizeMode mode);
+ void finalize(FinalizeMode mode) const;
Suggested change:
- void ColumnObject::finalize(FinalizeMode mode) {
+ void ColumnObject::finalize(FinalizeMode mode) const {
  // and modified by Doris

  #pragma once
  #include <butil/compiler_specific.h>
warning: 'butil/compiler_specific.h' file not found [clang-diagnostic-error]
#include <butil/compiler_specific.h>
^
run buildall
TPC-H: Total hot run time: 39935 ms
TPC-DS: Total hot run time: 202325 ms
ClickBench: Total hot run time: 30.39 s
ClickBench: Total hot run time: 30.5 s
run buildall
TPC-H: Total hot run time: 38272 ms
TPC-DS: Total hot run time: 192313 ms
ClickBench: Total hot run time: 31.2 s
xiaokang left a comment
LGTM
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
morningman left a comment
PROTO and THRIFT LGTM
Background
Currently, importing nested data such as:
```json
{ "a": [{"nested1": 1}, {"nested2": "123"}] }
```
results in the type of column a becoming JSON, which compresses and queries worse than native arrays, mainly because low cardinality optimizations cannot be applied and JSON has to be parsed during queries.
A common example:
```json
{
  "eventId": 1,
  "firstName": "Name1",
  "lastName": "Surname1",
  "body": {
    "phoneNumbers": [
      { "number": "5550219210", "type": "GSM", "callLimit": 5 },
      { "number": "02124713252", "type": "HOME", "callLimit": 3 },
      { "number": "05550219211", "type": "WORK", "callLimit": 2 }
    ]
  }
}
```
Design
Store the expanded nested structure so that the existing schema merge logic can be used directly and querying becomes easier. For example, given these two rows:
```json
{ "n": [{"a": 1, "b": 2}, {"a": 10, "b": 11, "c": 12}, {"a": 1001, "d": "12"}] }
{ "n": [{"x": 1, "y": 2}] }
```
the data would be stored in the following format:

Column | Row 0 | Row 1
-- | -- | --
n.a (array<int>) | [1, 10, 1001] | [null]
n.b (int) | [2, 11, null] | [null]
n.c (int) | [null, 12, null] | [null]
n.d (text) | [null, null, "12"] | [null]
n.x | [null, null, null] | [1]
n.y | [null, null, null] | [2]

The offsets of these sibling columns are aligned (equal size).
With this layout, a nested path or the whole array can be queried directly:
```sql
SELECT v['n']['a'] FROM tbl; --- This outputs [1, 10, 1001].
```
```sql
SELECT v['n'] FROM tbl; --- This outputs [{"a" : 1, "b" : 2}, {"a" : 10, "b" : 11, "c" : 12}, {"a" : 1001, "d" : "12"}].
```
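The expansion into sibling columns with aligned offsets can be sketched as follows. This is a minimal illustration, not the code in this PR: `ArraySubcolumn` and `flatten_nested` are hypothetical names, scalar values are kept as text for brevity, and C++17 is assumed.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

// One element of a nested array: key -> scalar value, kept as text for simplicity.
using NestedObject = std::map<std::string, std::string>;
// The value of path "n" in one row: an array of objects.
using NestedArray = std::vector<NestedObject>;

// Hypothetical stand-in for one array subcolumn such as n.a (not a Doris class).
struct ArraySubcolumn {
    std::vector<std::optional<std::string>> values;  // elements of all rows, back to back
    std::vector<std::size_t> offsets;                // offsets[i] = end of row i in values
};

// Expands rows of nested arrays into one subcolumn per key.
// Every subcolumn receives exactly one element per object, so all offsets are equal.
std::map<std::string, ArraySubcolumn> flatten_nested(const std::vector<NestedArray>& rows) {
    std::map<std::string, ArraySubcolumn> columns;
    // Pass 1: discover every key that appears in any element of any row.
    for (const NestedArray& row : rows) {
        for (const NestedObject& object : row) {
            for (const auto& entry : object) {
                columns.try_emplace(entry.first);
            }
        }
    }
    // Pass 2: append one element per object to every column, null when the key is absent.
    std::size_t end = 0;
    for (const NestedArray& row : rows) {
        for (const NestedObject& object : row) {
            for (auto& entry : columns) {
                auto found = object.find(entry.first);
                if (found == object.end()) {
                    entry.second.values.push_back(std::nullopt);
                } else {
                    entry.second.values.push_back(found->second);
                }
            }
        }
        end += row.size();
        for (auto& entry : columns) {
            entry.second.offsets.push_back(end);
        }
    }
    return columns;
}
```

Applied to the two example rows, every subcolumn ends up with offsets `{3, 4}`: three elements for row 0 and one for row 1, with nulls wherever an object lacks the key.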
Compaction
To maintain the relationship between nested nodes such as n.a, n.b, n.c, and n.d, compaction fills the offsets of any missing column using the offsets of any sibling column.
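A rough sketch of this fill step is shown below. The names `ArraySubcolumn`, `fill_from_sibling`, and `align_siblings` are hypothetical, and filling the missing elements with nulls is an assumption based on the storage table, not something the description states about the actual compaction code.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for one array subcolumn (not a Doris class).
struct ArraySubcolumn {
    std::vector<std::optional<std::string>> values;  // elements of all rows, back to back
    std::vector<std::size_t> offsets;                // offsets[i] = end of row i in values
};

// Builds a column that copies the sibling's offsets and holds only nulls (assumed fill value).
ArraySubcolumn fill_from_sibling(const ArraySubcolumn& sibling) {
    ArraySubcolumn filled;
    filled.offsets = sibling.offsets;
    const std::size_t total = sibling.offsets.empty() ? 0 : sibling.offsets.back();
    filled.values.assign(total, std::nullopt);
    return filled;
}

// Ensures every expected nested path has a column whose offsets match its siblings.
void align_siblings(std::map<std::string, ArraySubcolumn>& columns,
                    const std::vector<std::string>& expected_paths) {
    if (columns.empty()) {
        return;  // no sibling to copy offsets from
    }
    // Any present sibling works because sibling offsets are already aligned.
    const ArraySubcolumn sibling = columns.begin()->second;
    for (const std::string& path : expected_paths) {
        if (columns.find(path) == columns.end()) {
            columns.emplace(path, fill_from_sibling(sibling));
        }
    }
}
```

Copying the offsets rather than recomputing them keeps the filled column row-for-row consistent with its siblings, which is the invariant the design relies on.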
Queries
Queries do not see a path's nested information: it is not stored in the subcolumn tree, so path evaluation ignores it.
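To illustrate what evaluating a path without nested information could look like, here is a small sketch in which stored leaves are identified by their key names alone. The flat set of dotted paths and the functions `join_path` and `resolve` are assumptions made for the example; the real subcolumn tree in Doris is structured differently.

```cpp
#include <set>
#include <string>
#include <vector>

// Joins path parts with '.', e.g. {"n", "a"} becomes "n.a".
std::string join_path(const std::vector<std::string>& parts) {
    std::string path;
    for (const std::string& part : parts) {
        if (!path.empty()) {
            path += '.';
        }
        path += part;
    }
    return path;
}

// Returns the stored leaf paths selected by `parts`: the exact leaf, if present,
// plus every leaf below it. No array or nesting marker takes part in the match.
std::vector<std::string> resolve(const std::set<std::string>& leaves,
                                 const std::vector<std::string>& parts) {
    const std::string path = join_path(parts);
    const std::string prefix = path + ".";
    std::vector<std::string> selected;
    for (const std::string& leaf : leaves) {
        if (leaf == path || leaf.compare(0, prefix.size(), prefix) == 0) {
            selected.push_back(leaf);
        }
    }
    return selected;
}
```

Under these assumptions, `v['n']['a']` corresponds to `resolve(leaves, {"n", "a"})` and selects the single leaf `n.a`, while `v['n']` selects all six sibling leaves.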