@kaka11chen

Backport #38277.

…ache#38277)

## Proposed changes

Refer to Trino's implementation.

- Some historical versions of parquet-mr wrote buggy statistics. Use
`CorruptStatistics::should_ignore_statistics()` to detect and ignore them.

- Old versions of Parquet write `min` and `max` stats; later versions
write `min_value` and `max_value`. The old `min`/`max` stats cannot be used
for some types and in some cases, because their validity depends on how
values are compared and sorted.

- For double or float columns, special cases such as NaN, -0, and 0 must be
handled.

- If a string column has only `min` and `max` stats, but no `min_value` or
`max_value`, use `ParquetPredicate::_try_read_old_utf8_stats()` to recover
a range that can still be used for range-based read optimization.
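
To illustrate the float/double bullet above, here is a minimal sketch, not the code in this PR. It follows the guidance in the Parquet format definition for floating-point statistics: a NaN bound should be ignored, a min of +0 may hide -0 values, and a max of -0 may hide +0 values. The helper name `sanitize_fp_range` and its signature are hypothetical and exist only for this example.

```cpp
#include <cmath>
#include <optional>
#include <utility>

// Hypothetical helper (not part of Doris): returns a [min, max] range that is
// safe to use for pruning, or std::nullopt if the statistics must be ignored.
template <typename T>
std::optional<std::pair<T, T>> sanitize_fp_range(T min, T max) {
    // NaN has no defined ordering, so a NaN bound carries no information.
    if (std::isnan(min) || std::isnan(max)) {
        return std::nullopt;
    }
    // A min of +0.0 may hide -0.0 values in the data: widen it to -0.0.
    if (min == T(0) && !std::signbit(min)) {
        min = std::copysign(T(0), T(-1));
    }
    // A max of -0.0 may hide +0.0 values in the data: widen it to +0.0.
    if (max == T(0) && std::signbit(max)) {
        max = T(0);
    }
    return std::make_pair(min, max);
}
```

A caller would skip statistics-based filtering when the result is empty, and otherwise compare predicates against the widened range.
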
@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@github-actions github-actions bot left a comment

clang-tidy made some suggestions


#pragma once

#include <gen_cpp/parquet_types.h>

warning: 'gen_cpp/parquet_types.h' file not found [clang-diagnostic-error]

#include <gen_cpp/parquet_types.h>
         ^

// specific language governing permissions and limitations
// under the License.

#include <gtest/gtest.h>

warning: 'gtest/gtest.h' file not found [clang-diagnostic-error]

#include <gtest/gtest.h>
         ^

Comment on lines +24 to +25
namespace doris {
namespace vectorized {

warning: nested namespaces can be concatenated [modernize-concat-nested-namespaces]

Suggested change
- namespace doris {
- namespace vectorized {
+ namespace doris::vectorized {

be/test/vec/exec/parquet/parquet_corrupt_statistics_test.cpp:132:

- } // namespace vectorized
- } // namespace doris
+ } // namespace doris

// specific language governing permissions and limitations
// under the License.

#include <gtest/gtest.h>

warning: 'gtest/gtest.h' file not found [clang-diagnostic-error]

#include <gtest/gtest.h>
         ^

Comment on lines +24 to +25
namespace doris {
namespace vectorized {

warning: nested namespaces can be concatenated [modernize-concat-nested-namespaces]

Suggested change
- namespace doris {
- namespace vectorized {
+ namespace doris::vectorized {

be/test/vec/exec/parquet/parquet_statistics_test.cpp:-1:

+ }

ParquetStatisticsTest() = default;
};

TEST_F(ParquetStatisticsTest, test_try_read_old_utf8_stats) {

warning: function 'TEST_F' exceeds recommended size/complexity thresholds [readability-function-size]

TEST_F(ParquetStatisticsTest, test_try_read_old_utf8_stats) {
^
Additional context

be/test/vec/exec/parquet/parquet_statistics_test.cpp:30: 121 lines including whitespace and comments (threshold 80)

TEST_F(ParquetStatisticsTest, test_try_read_old_utf8_stats) {
^

// specific language governing permissions and limitations
// under the License.

#include <gtest/gtest.h>

warning: 'gtest/gtest.h' file not found [clang-diagnostic-error]

#include <gtest/gtest.h>
         ^

Comment on lines +24 to +25
namespace doris {
namespace vectorized {

warning: nested namespaces can be concatenated [modernize-concat-nested-namespaces]

Suggested change
- namespace doris {
- namespace vectorized {
+ namespace doris::vectorized {

be/test/vec/exec/parquet/parquet_version_test.cpp:219:

- } // namespace vectorized
- } // namespace doris
+ } // namespace doris

ParquetVersionTest() = default;
};

TEST_F(ParquetVersionTest, test_version_parser) {

warning: function 'TEST_F' exceeds recommended size/complexity thresholds [readability-function-size]

TEST_F(ParquetVersionTest, test_version_parser) {
^
Additional context

be/test/vec/exec/parquet/parquet_version_test.cpp:30: 91 lines including whitespace and comments (threshold 80)

TEST_F(ParquetVersionTest, test_version_parser) {
^

@kaka11chen
Contributor Author

run buildall

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.54% (9341/25562)
Line Coverage: 28.04% (76537/272968)
Region Coverage: 26.86% (39372/146556)
Branch Coverage: 23.61% (20010/84762)
Coverage Report: http://coverage.selectdb-in.cc/coverage/4f994ec0e9aa1f9c6a4df86054046d4cfcbbb4e4_4f994ec0e9aa1f9c6a4df86054046d4cfcbbb4e4/report/index.html

@yiguolei yiguolei merged commit a44a274 into apache:branch-2.1 Aug 15, 2024
@yiguolei yiguolei mentioned this pull request Sep 5, 2024
3 tasks