-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[Fix](parquet-reader) Fix and optimize parquet min-max filtering. #38277
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Fix](parquet-reader) Fix and optimize parquet min-max filtering. #38277
Conversation
|
Thank you for your contribution to Apache Doris. Since 2024-03-18, the Document has been moved to doris-website. |
|
run buildall |
96e0ee0 to
bd5d879
Compare
|
run buildall |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
|
|
||
| #pragma once | ||
|
|
||
| #include <gen_cpp/parquet_types.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: 'gen_cpp/parquet_types.h' file not found [clang-diagnostic-error]
#include <gen_cpp/parquet_types.h>
^| ", semver=" + (version ? *version : "null") + | ||
| ", appBuildHash=" + (appBuildHash ? *appBuildHash : "null") + ")"; | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: redundant access specifier has the same accessibility as the previous access specifier [readability-redundant-access-specifiers]
| public: |
Additional context
be/src/vec/exec/format/parquet/parquet_common.h:166: previously declared here
public:
^| return Status::OK(); | ||
| } | ||
|
|
||
| int compareTo(const SemanticVersion& other) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: statement should be inside braces [readability-braces-around-statements]
| if (int cmp = compareIntegers(_major, other._major); cmp != 0) return cmp; | |
| if (int cmp = compareIntegers(_major, other._major); cmp != 0) { return cmp; | |
| } |
| } | ||
|
|
||
| int compareTo(const SemanticVersion& other) const { | ||
| if (int cmp = compareIntegers(_major, other._major); cmp != 0) return cmp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: statement should be inside braces [readability-braces-around-statements]
| if (int cmp = compareIntegers(_minor, other._minor); cmp != 0) return cmp; | |
| if (int cmp = compareIntegers(_minor, other._minor); cmp != 0) { return cmp; | |
| } |
|
|
||
| int compareTo(const SemanticVersion& other) const { | ||
| if (int cmp = compareIntegers(_major, other._major); cmp != 0) return cmp; | ||
| if (int cmp = compareIntegers(_minor, other._minor); cmp != 0) return cmp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: statement should be inside braces [readability-braces-around-statements]
| if (int cmp = compareIntegers(_patch, other._patch); cmp != 0) return cmp; | |
| if (int cmp = compareIntegers(_patch, other._patch); cmp != 0) { return cmp; | |
| } |
| ParquetStatisticsTest() {} | ||
| }; | ||
|
|
||
| TEST_F(ParquetStatisticsTest, test_try_read_old_utf8_stats) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: function 'TEST_F' exceeds recommended size/complexity thresholds [readability-function-size]
TEST_F(ParquetStatisticsTest, test_try_read_old_utf8_stats) {
^Additional context
be/test/vec/exec/parquet/parquet_statistics_test.cpp:30: 121 lines including whitespace and comments (threshold 80)
TEST_F(ParquetStatisticsTest, test_try_read_old_utf8_stats) {
^| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| #include <gtest/gtest.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: 'gtest/gtest.h' file not found [clang-diagnostic-error]
#include <gtest/gtest.h>
^| namespace doris { | ||
| namespace vectorized { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: nested namespaces can be concatenated [modernize-concat-nested-namespaces]
| namespace doris { | |
| namespace vectorized { | |
| namespace doris::vectorized { |
be/test/vec/exec/parquet/parquet_version_test.cpp:219:
- } // namespace vectorized
- } // namespace doris
+ } // namespace doris| namespace vectorized { | ||
| class ParquetVersionTest : public testing::Test { | ||
| public: | ||
| ParquetVersionTest() {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: use '= default' to define a trivial default constructor [modernize-use-equals-default]
| ParquetVersionTest() {} | |
| ParquetVersionTest() = default; |
| ParquetVersionTest() {} | ||
| }; | ||
|
|
||
| TEST_F(ParquetVersionTest, test_version_parser) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: function 'TEST_F' exceeds recommended size/complexity thresholds [readability-function-size]
TEST_F(ParquetVersionTest, test_version_parser) {
^Additional context
be/test/vec/exec/parquet/parquet_version_test.cpp:30: 91 lines including whitespace and comments (threshold 80)
TEST_F(ParquetVersionTest, test_version_parser) {
^
TPC-H: Total hot run time: 39895 ms |
TPC-DS: Total hot run time: 174183 ms |
ClickBench: Total hot run time: 31.61 s |
bd5d879 to
e1d5ba2
Compare
|
run buildall |
e1d5ba2 to
1ce8ced
Compare
|
run buildall |
1ce8ced to
b42b73a
Compare
|
run buildall |
TPC-H: Total hot run time: 39838 ms |
TPC-DS: Total hot run time: 173837 ms |
ClickBench: Total hot run time: 30.21 s |
|
run buildall |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
| } | ||
|
|
||
| int SemanticVersion::compare_to(const SemanticVersion& other) const { | ||
| if (int cmp = _compare_integers(_major, other._major); cmp != 0) return cmp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: statement should be inside braces [readability-braces-around-statements]
| if (int cmp = _compare_integers(_major, other._major); cmp != 0) return cmp; | |
| if (int cmp = _compare_integers(_major, other._major); cmp != 0) { return cmp; | |
| } |
|
|
||
| int SemanticVersion::compare_to(const SemanticVersion& other) const { | ||
| if (int cmp = _compare_integers(_major, other._major); cmp != 0) return cmp; | ||
| if (int cmp = _compare_integers(_minor, other._minor); cmp != 0) return cmp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: statement should be inside braces [readability-braces-around-statements]
| if (int cmp = _compare_integers(_minor, other._minor); cmp != 0) return cmp; | |
| if (int cmp = _compare_integers(_minor, other._minor); cmp != 0) { return cmp; | |
| } |
| int SemanticVersion::compare_to(const SemanticVersion& other) const { | ||
| if (int cmp = _compare_integers(_major, other._major); cmp != 0) return cmp; | ||
| if (int cmp = _compare_integers(_minor, other._minor); cmp != 0) return cmp; | ||
| if (int cmp = _compare_integers(_patch, other._patch); cmp != 0) return cmp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: statement should be inside braces [readability-braces-around-statements]
| if (int cmp = _compare_integers(_patch, other._patch); cmp != 0) return cmp; | |
| if (int cmp = _compare_integers(_patch, other._patch); cmp != 0) { return cmp; | |
| } |
| if (int cmp = _compare_integers(_major, other._major); cmp != 0) return cmp; | ||
| if (int cmp = _compare_integers(_minor, other._minor); cmp != 0) return cmp; | ||
| if (int cmp = _compare_integers(_patch, other._patch); cmp != 0) return cmp; | ||
| if (int cmp = _compare_booleans(other._prerelease, _prerelease); cmp != 0) return cmp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: statement should be inside braces [readability-braces-around-statements]
| if (int cmp = _compare_booleans(other._prerelease, _prerelease); cmp != 0) return cmp; | |
| if (int cmp = _compare_booleans(other._prerelease, _prerelease); cmp != 0) { return cmp; | |
| } |
| std::vector<NumberOrString> _identifiers; | ||
| }; | ||
|
|
||
| private: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: redundant access specifier has the same accessibility as the previous access specifier [readability-redundant-access-specifiers]
| private: |
Additional context
be/src/vec/exec/format/parquet/parquet_common.h:218: previously declared here
private:
^|
run buildall |
TPC-H: Total hot run time: 41555 ms |
TPC-DS: Total hot run time: 168434 ms |
ClickBench: Total hot run time: 29.91 s |
7cc53a3 to
e8f82a9
Compare
|
run buildall |
TPC-H: Total hot run time: 42085 ms |
TPC-DS: Total hot run time: 169444 ms |
ClickBench: Total hot run time: 29.73 s |
af7382c to
7ebb634
Compare
|
run buildall |
TPC-H: Total hot run time: 41876 ms |
TPC-DS: Total hot run time: 168872 ms |
ClickBench: Total hot run time: 30.16 s |
morningman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
wuwenchi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…ache#38277) ## Proposed changes Refer to trino's implementation - Some bugs in the historical version paquet-mr. Use `CorruptStatistics::should_ignore_statistics()` to handle. - The old version of parquet uses `min` and `max` stats, and later implements `min_value` and `max_value`. `Min`/`max` stats cannot be used for some types and in some cases. This is related to the comparison and sorting method of values. - If it is double or float, special cases such as NaN, -0, and 0 must be handled. - If the string type only has min and max stats, but no min_value or max_value, use `ParquetPredicate::_try_read_old_utf8_stats()` to expand the range reading optimization method for optimization.
…ache#38277) ## Proposed changes Refer to trino's implementation - Some bugs in the historical version paquet-mr. Use `CorruptStatistics::should_ignore_statistics()` to handle. - The old version of parquet uses `min` and `max` stats, and later implements `min_value` and `max_value`. `Min`/`max` stats cannot be used for some types and in some cases. This is related to the comparison and sorting method of values. - If it is double or float, special cases such as NaN, -0, and 0 must be handled. - If the string type only has min and max stats, but no min_value or max_value, use `ParquetPredicate::_try_read_old_utf8_stats()` to expand the range reading optimization method for optimization.
…ache#38277) Refer to trino's implementation - Some bugs in the historical version paquet-mr. Use `CorruptStatistics::should_ignore_statistics()` to handle. - The old version of parquet uses `min` and `max` stats, and later implements `min_value` and `max_value`. `Min`/`max` stats cannot be used for some types and in some cases. This is related to the comparison and sorting method of values. - If it is double or float, special cases such as NaN, -0, and 0 must be handled. - If the string type only has min and max stats, but no min_value or max_value, use `ParquetPredicate::_try_read_old_utf8_stats()` to expand the range reading optimization method for optimization.
Proposed changes
Refer to trino's implementation
Some bugs in the historical version paquet-mr. Use
CorruptStatistics::should_ignore_statistics()to handle.The old version of parquet uses
minandmaxstats, and later implementsmin_valueandmax_value.Min/maxstats cannot be used for some types and in some cases. This is related to the comparison and sorting method of values.If it is double or float, special cases such as NaN, -0, and 0 must be handled.
If the string type only has min and max stats, but no min_value or max_value, use
ParquetPredicate::_try_read_old_utf8_stats()to expand the range reading optimization method for optimization.