Skip to content

Conversation

@suxiaogang223
Copy link
Contributor

@suxiaogang223 suxiaogang223 commented Mar 25, 2024

Proposed changes

Make skip_page() in ColumnChunkReader more efficient. No more reading page headers if there are pagelocations in chunk.

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@doris-robot
Copy link

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@suxiaogang223 suxiaogang223 marked this pull request as draft March 25, 2024 10:01
@suxiaogang223 suxiaogang223 changed the title [opt](parquet) efficient page skipping by page location [enhancement](parquet) efficient page skipping by page location Mar 25, 2024
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

@suxiaogang223 suxiaogang223 force-pushed the skip_page_v2 branch 3 times, most recently from 4dcc089 to 2c68efc Compare March 25, 2024 14:50
@suxiaogang223 suxiaogang223 marked this pull request as ready for review March 25, 2024 15:29
@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

page_data.size -= dl + rl;
}

Status ColumnChunkReader::load_page_data() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: function 'load_page_data' has cognitive complexity of 80 (threshold 50) [readability-function-cognitive-complexity]

Status ColumnChunkReader::load_page_data() {
                          ^
Additional context

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:129: +1, including nesting penalty of 0, nesting level increased to 1

    if (_offset_index) {
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:131: +2, including nesting penalty of 1, nesting level increased to 2

        RETURN_IF_ERROR(_page_reader->parse_page_header());
        ^

be/src/common/status.h:541: expanded from macro 'RETURN_IF_ERROR'

    do {                                \
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:131: +3, including nesting penalty of 2, nesting level increased to 3

        RETURN_IF_ERROR(_page_reader->parse_page_header());
        ^

be/src/common/status.h:543: expanded from macro 'RETURN_IF_ERROR'

        if (UNLIKELY(!_status_.ok())) { \
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:136: +1, including nesting penalty of 0, nesting level increased to 1

    if (UNLIKELY(_state != HEADER_PARSED)) {
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:143: +1, including nesting penalty of 0, nesting level increased to 1

    if (_block_compress_codec != nullptr) {
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:145: +2, including nesting penalty of 1, nesting level increased to 2

        RETURN_IF_ERROR(_page_reader->get_page_data(compressed_data));
        ^

be/src/common/status.h:541: expanded from macro 'RETURN_IF_ERROR'

    do {                                \
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:145: +3, including nesting penalty of 2, nesting level increased to 3

        RETURN_IF_ERROR(_page_reader->get_page_data(compressed_data));
        ^

be/src/common/status.h:543: expanded from macro 'RETURN_IF_ERROR'

        if (UNLIKELY(!_status_.ok())) { \
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:146: +2, including nesting penalty of 1, nesting level increased to 2

        if (header.__isset.data_page_header_v2) {
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:155: +1

                header.__isset.data_page_header_v2 && header.data_page_header_v2.is_compressed;
                                                   ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:156: +2, including nesting penalty of 1, nesting level increased to 2

        if (header.__isset.data_page_header || is_v2_compressed) {
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:156: +1

        if (header.__isset.data_page_header || is_v2_compressed) {
                                            ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:162: +3, including nesting penalty of 2, nesting level increased to 3

            RETURN_IF_ERROR(_block_compress_codec->decompress(compressed_data, &_page_data));
            ^

be/src/common/status.h:541: expanded from macro 'RETURN_IF_ERROR'

    do {                                \
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:162: +4, including nesting penalty of 3, nesting level increased to 4

            RETURN_IF_ERROR(_block_compress_codec->decompress(compressed_data, &_page_data));
            ^

be/src/common/status.h:543: expanded from macro 'RETURN_IF_ERROR'

        if (UNLIKELY(!_status_.ok())) { \
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:163: +1, nesting level increased to 2

        } else {
          ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:167: +1, nesting level increased to 1

    } else {
      ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:168: +2, including nesting penalty of 1, nesting level increased to 2

        RETURN_IF_ERROR(_page_reader->get_page_data(_page_data));
        ^

be/src/common/status.h:541: expanded from macro 'RETURN_IF_ERROR'

    do {                                \
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:168: +3, including nesting penalty of 2, nesting level increased to 3

        RETURN_IF_ERROR(_page_reader->get_page_data(_page_data));
        ^

be/src/common/status.h:543: expanded from macro 'RETURN_IF_ERROR'

        if (UNLIKELY(!_status_.ok())) { \
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:169: +2, including nesting penalty of 1, nesting level increased to 2

        if (header.__isset.data_page_header_v2) {
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:175: +1, including nesting penalty of 0, nesting level increased to 1

    if (_max_rep_level > 0) {
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:177: +2, including nesting penalty of 1, nesting level increased to 2

        if (header.__isset.data_page_header_v2) {
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:178: +3, including nesting penalty of 2, nesting level increased to 3

            RETURN_IF_ERROR(_rep_level_decoder.init_v2(_v2_rep_levels, _max_rep_level,
            ^

be/src/common/status.h:541: expanded from macro 'RETURN_IF_ERROR'

    do {                                \
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:178: +4, including nesting penalty of 3, nesting level increased to 4

            RETURN_IF_ERROR(_rep_level_decoder.init_v2(_v2_rep_levels, _max_rep_level,
            ^

be/src/common/status.h:543: expanded from macro 'RETURN_IF_ERROR'

        if (UNLIKELY(!_status_.ok())) { \
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:180: +1, nesting level increased to 2

        } else {
          ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:181: +3, including nesting penalty of 2, nesting level increased to 3

            RETURN_IF_ERROR(_rep_level_decoder.init(
            ^

be/src/common/status.h:541: expanded from macro 'RETURN_IF_ERROR'

    do {                                \
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:181: +4, including nesting penalty of 3, nesting level increased to 4

            RETURN_IF_ERROR(_rep_level_decoder.init(
            ^

be/src/common/status.h:543: expanded from macro 'RETURN_IF_ERROR'

        if (UNLIKELY(!_status_.ok())) { \
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:186: +1, including nesting penalty of 0, nesting level increased to 1

    if (_max_def_level > 0) {
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:188: +2, including nesting penalty of 1, nesting level increased to 2

        if (header.__isset.data_page_header_v2) {
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:189: +3, including nesting penalty of 2, nesting level increased to 3

            RETURN_IF_ERROR(_def_level_decoder.init_v2(_v2_def_levels, _max_def_level,
            ^

be/src/common/status.h:541: expanded from macro 'RETURN_IF_ERROR'

    do {                                \
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:189: +4, including nesting penalty of 3, nesting level increased to 4

            RETURN_IF_ERROR(_def_level_decoder.init_v2(_v2_def_levels, _max_def_level,
            ^

be/src/common/status.h:543: expanded from macro 'RETURN_IF_ERROR'

        if (UNLIKELY(!_status_.ok())) { \
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:191: +1, nesting level increased to 2

        } else {
          ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:192: +3, including nesting penalty of 2, nesting level increased to 3

            RETURN_IF_ERROR(_def_level_decoder.init(
            ^

be/src/common/status.h:541: expanded from macro 'RETURN_IF_ERROR'

    do {                                \
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:192: +4, including nesting penalty of 3, nesting level increased to 4

            RETURN_IF_ERROR(_def_level_decoder.init(
            ^

be/src/common/status.h:543: expanded from macro 'RETURN_IF_ERROR'

        if (UNLIKELY(!_status_.ok())) { \
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:197: +1, including nesting penalty of 0, nesting level increased to 1

    auto encoding = header.__isset.data_page_header_v2 ? header.data_page_header_v2.encoding
                                                       ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:200: +1, including nesting penalty of 0, nesting level increased to 1

    if (encoding == tparquet::Encoding::PLAIN_DICTIONARY) {
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:205: +1, including nesting penalty of 0, nesting level increased to 1

    if (_decoders.find(static_cast<int>(encoding)) != _decoders.end()) {
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:207: +1, nesting level increased to 1

    } else {
      ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:209: +2, including nesting penalty of 1, nesting level increased to 2

        RETURN_IF_ERROR(Decoder::get_decoder(_metadata.type, encoding, page_decoder));
        ^

be/src/common/status.h:541: expanded from macro 'RETURN_IF_ERROR'

    do {                                \
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:209: +3, including nesting penalty of 2, nesting level increased to 3

        RETURN_IF_ERROR(Decoder::get_decoder(_metadata.type, encoding, page_decoder));
        ^

be/src/common/status.h:543: expanded from macro 'RETURN_IF_ERROR'

        if (UNLIKELY(!_status_.ok())) { \
        ^

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

page_data.size -= dl + rl;
}

Status ColumnChunkReader::load_page_data() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: function 'load_page_data' has cognitive complexity of 84 (threshold 50) [readability-function-cognitive-complexity]

Status ColumnChunkReader::load_page_data() {
                          ^
Additional context

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:127: +1, including nesting penalty of 0, nesting level increased to 1

    if (_offset_index) {
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:129: +2, including nesting penalty of 1, nesting level increased to 2

        if (UNLIKELY(!_page_reader->has_header_parsed())) {
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:130: +3, including nesting penalty of 2, nesting level increased to 3

            RETURN_IF_ERROR(_page_reader->parse_page_header());
            ^

be/src/common/status.h:541: expanded from macro 'RETURN_IF_ERROR'

    do {                                \
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:130: +4, including nesting penalty of 3, nesting level increased to 4

            RETURN_IF_ERROR(_page_reader->parse_page_header());
            ^

be/src/common/status.h:543: expanded from macro 'RETURN_IF_ERROR'

        if (UNLIKELY(!_status_.ok())) { \
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:135: +1, including nesting penalty of 0, nesting level increased to 1

    if (UNLIKELY(!_page_reader->has_header_parsed())) {
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:142: +1, including nesting penalty of 0, nesting level increased to 1

    if (_block_compress_codec != nullptr) {
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:144: +2, including nesting penalty of 1, nesting level increased to 2

        RETURN_IF_ERROR(_page_reader->get_page_data(compressed_data));
        ^

be/src/common/status.h:541: expanded from macro 'RETURN_IF_ERROR'

    do {                                \
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:144: +3, including nesting penalty of 2, nesting level increased to 3

        RETURN_IF_ERROR(_page_reader->get_page_data(compressed_data));
        ^

be/src/common/status.h:543: expanded from macro 'RETURN_IF_ERROR'

        if (UNLIKELY(!_status_.ok())) { \
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:145: +2, including nesting penalty of 1, nesting level increased to 2

        if (header.__isset.data_page_header_v2) {
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:154: +1

                header.__isset.data_page_header_v2 && header.data_page_header_v2.is_compressed;
                                                   ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:155: +2, including nesting penalty of 1, nesting level increased to 2

        if (header.__isset.data_page_header || is_v2_compressed) {
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:155: +1

        if (header.__isset.data_page_header || is_v2_compressed) {
                                            ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:161: +3, including nesting penalty of 2, nesting level increased to 3

            RETURN_IF_ERROR(_block_compress_codec->decompress(compressed_data, &_page_data));
            ^

be/src/common/status.h:541: expanded from macro 'RETURN_IF_ERROR'

    do {                                \
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:161: +4, including nesting penalty of 3, nesting level increased to 4

            RETURN_IF_ERROR(_block_compress_codec->decompress(compressed_data, &_page_data));
            ^

be/src/common/status.h:543: expanded from macro 'RETURN_IF_ERROR'

        if (UNLIKELY(!_status_.ok())) { \
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:162: +1, nesting level increased to 2

        } else {
          ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:166: +1, nesting level increased to 1

    } else {
      ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:167: +2, including nesting penalty of 1, nesting level increased to 2

        RETURN_IF_ERROR(_page_reader->get_page_data(_page_data));
        ^

be/src/common/status.h:541: expanded from macro 'RETURN_IF_ERROR'

    do {                                \
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:167: +3, including nesting penalty of 2, nesting level increased to 3

        RETURN_IF_ERROR(_page_reader->get_page_data(_page_data));
        ^

be/src/common/status.h:543: expanded from macro 'RETURN_IF_ERROR'

        if (UNLIKELY(!_status_.ok())) { \
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:168: +2, including nesting penalty of 1, nesting level increased to 2

        if (header.__isset.data_page_header_v2) {
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:174: +1, including nesting penalty of 0, nesting level increased to 1

    if (_max_rep_level > 0) {
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:176: +2, including nesting penalty of 1, nesting level increased to 2

        if (header.__isset.data_page_header_v2) {
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:177: +3, including nesting penalty of 2, nesting level increased to 3

            RETURN_IF_ERROR(_rep_level_decoder.init_v2(_v2_rep_levels, _max_rep_level,
            ^

be/src/common/status.h:541: expanded from macro 'RETURN_IF_ERROR'

    do {                                \
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:177: +4, including nesting penalty of 3, nesting level increased to 4

            RETURN_IF_ERROR(_rep_level_decoder.init_v2(_v2_rep_levels, _max_rep_level,
            ^

be/src/common/status.h:543: expanded from macro 'RETURN_IF_ERROR'

        if (UNLIKELY(!_status_.ok())) { \
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:179: +1, nesting level increased to 2

        } else {
          ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:180: +3, including nesting penalty of 2, nesting level increased to 3

            RETURN_IF_ERROR(_rep_level_decoder.init(
            ^

be/src/common/status.h:541: expanded from macro 'RETURN_IF_ERROR'

    do {                                \
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:180: +4, including nesting penalty of 3, nesting level increased to 4

            RETURN_IF_ERROR(_rep_level_decoder.init(
            ^

be/src/common/status.h:543: expanded from macro 'RETURN_IF_ERROR'

        if (UNLIKELY(!_status_.ok())) { \
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:185: +1, including nesting penalty of 0, nesting level increased to 1

    if (_max_def_level > 0) {
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:187: +2, including nesting penalty of 1, nesting level increased to 2

        if (header.__isset.data_page_header_v2) {
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:188: +3, including nesting penalty of 2, nesting level increased to 3

            RETURN_IF_ERROR(_def_level_decoder.init_v2(_v2_def_levels, _max_def_level,
            ^

be/src/common/status.h:541: expanded from macro 'RETURN_IF_ERROR'

    do {                                \
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:188: +4, including nesting penalty of 3, nesting level increased to 4

            RETURN_IF_ERROR(_def_level_decoder.init_v2(_v2_def_levels, _max_def_level,
            ^

be/src/common/status.h:543: expanded from macro 'RETURN_IF_ERROR'

        if (UNLIKELY(!_status_.ok())) { \
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:190: +1, nesting level increased to 2

        } else {
          ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:191: +3, including nesting penalty of 2, nesting level increased to 3

            RETURN_IF_ERROR(_def_level_decoder.init(
            ^

be/src/common/status.h:541: expanded from macro 'RETURN_IF_ERROR'

    do {                                \
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:191: +4, including nesting penalty of 3, nesting level increased to 4

            RETURN_IF_ERROR(_def_level_decoder.init(
            ^

be/src/common/status.h:543: expanded from macro 'RETURN_IF_ERROR'

        if (UNLIKELY(!_status_.ok())) { \
        ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:196: +1, including nesting penalty of 0, nesting level increased to 1

    auto encoding = header.__isset.data_page_header_v2 ? header.data_page_header_v2.encoding
                                                       ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:199: +1, including nesting penalty of 0, nesting level increased to 1

    if (encoding == tparquet::Encoding::PLAIN_DICTIONARY) {
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:204: +1, including nesting penalty of 0, nesting level increased to 1

    if (_decoders.find(static_cast<int>(encoding)) != _decoders.end()) {
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:206: +1, nesting level increased to 1

    } else {
      ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:208: +2, including nesting penalty of 1, nesting level increased to 2

        RETURN_IF_ERROR(Decoder::get_decoder(_metadata.type, encoding, page_decoder));
        ^

be/src/common/status.h:541: expanded from macro 'RETURN_IF_ERROR'

    do {                                \
    ^

be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:208: +3, including nesting penalty of 2, nesting level increased to 3

        RETURN_IF_ERROR(Decoder::get_decoder(_metadata.type, encoding, page_decoder));
        ^

be/src/common/status.h:543: expanded from macro 'RETURN_IF_ERROR'

        if (UNLIKELY(!_status_.ok())) { \
        ^

// Parse dictionary data when reading
// RETURN_IF_ERROR(_page_reader->next_page_header());
// RETURN_IF_ERROR(_decode_dict_page());
_page_reader->seek_page(_metadata.dictionary_page_offset);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change seek_to_page to seek_page , and remove the above comments.

return next_page();
} else if (_page_reader->get_page_header()->type == tparquet::PageType::DATA_PAGE_V2) {

if (_offset_index) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't add is-else logic, the underlying PageReader supply the unified interface.

// and skip_page() to skip pages one by one.
// todo: change this interface to seek_to_page(int64_t page_header_offset, size_t num_parsed_values)
// and set _chunk_parsed_values = num_parsed_values
// [[deprecated]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's best to keep this interface, only the page with OffsetIndex can call this function.

void _get_uncompressed_levels(const tparquet::DataPageHeaderV2& page_v2, Slice& page_data);

//Returns the number of values in the current page.
int64_t _get_page_num_values() const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move to PageReader

const tparquet::RowGroup& row_group,
const std::vector<RowRange>& row_ranges, cctz::time_zone* ctz,
io::IOContext* io_ctx,
io::IOContext* io_ctx, const tparquet::OffsetIndex* offset_index,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The complex type column will get wrong OffsetIndex

io::IOContext* _io_ctx = nullptr;
tparquet::PageHeader _cur_page_header;
Statistics _statistics;
tparquet::PageHeader _cur_page_header {};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add {}

const FieldSchema* col_schema = schema_desc.get_column(read_col);
static_cast<void>(page_index.collect_skipped_page_range(
&column_index, conjuncts, col_schema, skipped_page_range, *_ctz));
if (skipped_page_range.empty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never change the main logic in ParquetReader

_page_reader->seek_page(_metadata.data_page_offset);
}
_state = INITIALIZED;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dictionary page can be parsed lazily.

if (_offset_index) {
_page_index++;
if (_page_index == _offset_index->page_locations.size()) {
return Status::EndOfFile("End of file");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return other errors, don't return EndOfFile, because is has special meaning in VfileScanner.

@suxiaogang223 suxiaogang223 deleted the skip_page_v2 branch August 15, 2024 08:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants