
Conversation

@hubgeter (Contributor) commented Oct 8, 2023

Proposed changes

Issue Number: close #xxx
1. Restructure the Parquet decode logic: the reader first reads data according to the Parquet physical type, and then converts it to the column type planned by FE (a minimal sketch of this two-step flow follows the table below).

2. Support reading Hive tables after ALTER TABLE changes a column's type. The matrix below lists target types (rows, "to") against source types (columns, "from").

| to \ from | int | smallint | tinyint | bigint | float | double | boolean | string | char | varchar | date | timestamp | decimal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| int |  |  |  |  |  |  |  |  |  |  |  |  |  |
| smallint |  |  |  |  |  |  |  |  |  |  |  |  |  |
| tinyint |  |  |  |  |  |  |  |  |  |  |  |  |  |
| float |  |  |  |  |  |  |  |  |  |  |  |  |  |
| bigint |  |  |  |  |  |  |  |  |  |  |  |  |  |
| double |  |  |  |  |  |  |  |  |  |  |  |  |  |
| boolean |  |  |  |  |  |  |  |  |  |  |  |  |  |
| string |  |  |  |  |  |  |  |  |  |  |  |  |  |
| char |  |  |  |  |  |  |  |  |  |  |  |  |  |
| varchar |  |  |  |  |  |  |  |  |  |  |  |  |  |
| date |  |  |  |  |  |  |  |  |  |  |  |  |  |
| timestamp |  |  |  |  |  |  |  |  |  |  |  |  |  |
| decimal |  |  |  |  |  |  |  |  |  |  |  |  |  |
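As a rough illustration of the two-step decode above (a sketch only, under assumed simplified types; the real converters in the BE reader handle many more type pairs than this), the reader first materializes values in the file's Parquet physical type and only then converts them to the type the table column now declares:

```cpp
// Minimal sketch of "read by physical type, then convert" (hypothetical helper,
// not the Doris API). Example: a column written as Parquet INT32 is read after
// the table column was altered to BIGINT.
#include <cstdint>
#include <iostream>
#include <vector>

// Step 2: element-wise numeric conversion, in the spirit of NumberColumnConvert.
template <typename SrcType, typename DstType>
std::vector<DstType> convert_number_column(const std::vector<SrcType>& src) {
    std::vector<DstType> dst;
    dst.reserve(src.size());
    for (SrcType v : src) {
        dst.push_back(static_cast<DstType>(v));
    }
    return dst;
}

int main() {
    // Step 1: values exactly as decoded from the file's physical type (INT32).
    std::vector<int32_t> physical = {1, 2, 3};
    // Step 2: convert to the altered column type (BIGINT / int64).
    std::vector<int64_t> logical = convert_number_column<int32_t, int64_t>(physical);
    std::cout << "converted " << logical.size() << " values\n";
    return 0;
}
```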

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@hubgeter (Contributor, Author) commented Oct 8, 2023

run buildall

@github-actions (bot) left a comment:

clang-tidy made some suggestions

// specific language governing permissions and limitations
// under the License.

#include <gen_cpp/Metrics_types.h>

warning: 'gen_cpp/Metrics_types.h' file not found [clang-diagnostic-error]

#include <gen_cpp/Metrics_types.h>
         ^


template <typename src_type, typename dst_type, bool is_nullable>
struct NumberColumnConvert : public ColumnConvert {
virtual Status convert(const IColumn* src_col, IColumn* dst_col) override;

warning: 'virtual' is redundant since the function is already declared 'override' [modernize-use-override]

Suggested change
virtual Status convert(const IColumn* src_col, IColumn* dst_col) override;
Status convert(const IColumn* src_col, IColumn* dst_col) override;

}
template <typename src_type, bool is_nullable>
struct NumberColumnToStringConvert : public ColumnConvert {
virtual Status convert(const IColumn* src_col, IColumn* dst_col) override;

warning: 'virtual' is redundant since the function is already declared 'override' [modernize-use-override]

Suggested change
virtual Status convert(const IColumn* src_col, IColumn* dst_col) override;
Status convert(const IColumn* src_col, IColumn* dst_col) override;

struct int128totimestamp : public ColumnConvert {
int128totimestamp(DocTime* pTime) { doc = pTime; }

inline uint64_t to_timestamp_micros(uint32_t hi, uint64_t lo) const {

warning: function 'to_timestamp_micros' should be marked [[nodiscard]] [modernize-use-nodiscard]

Suggested change
inline uint64_t to_timestamp_micros(uint32_t hi, uint64_t lo) const {
[[nodiscard]] inline uint64_t to_timestamp_micros(uint32_t hi, uint64_t lo) const {

return (hi - ParquetInt96::JULIAN_EPOCH_OFFSET_DAYS) * ParquetInt96::MICROS_IN_DAY +
lo / ParquetInt96::NANOS_PER_MICROSECOND;
}
Status convert(const IColumn* src_col, IColumn* dst_col) {

warning: annotate this function with 'override' or (rarely) 'final' [modernize-use-override]

Suggested change
Status convert(const IColumn* src_col, IColumn* dst_col) {
Status convert(const IColumn* src_col, IColumn* dst_col) override {

DocTime* doc;

public:
Status convert(const IColumn* src_col, IColumn* dst_col) {

warning: annotate this function with 'override' or (rarely) 'final' [modernize-use-override]

Suggested change
Status convert(const IColumn* src_col, IColumn* dst_col) {
Status convert(const IColumn* src_col, IColumn* dst_col) override {

auto* src_data = static_cast<const ColumnVector<NumberType>*>(src_col)->get_data().data();
dst_col->resize(rows);
DecimalScaleParams& scale_params = doc->_decode_params->decimal_scale;
auto* data = static_cast<ColumnDecimal<Decimal<Int64>>*>(dst_col)->get_data().data();

warning: variable 'data' is not initialized [cppcoreguidelines-init-variables]

Suggested change
auto* data = static_cast<ColumnDecimal<Decimal<Int64>>*>(dst_col)->get_data().data();
auto* data = nullptr = static_cast<ColumnDecimal<Decimal<Int64>>*>(dst_col)->get_data().data();

return Status::OK();
}

public:

warning: redundant access specifier has the same accessibility as the previous access specifier [readability-redundant-access-specifiers]

Suggested change
public:
Additional context

be/src/vec/exec/format/convert.h:365: previously declared here

public:
^

inline uint64_t to_timestamp_micros() const {
return (hi - JULIAN_EPOCH_OFFSET_DAYS) * MICROS_IN_DAY + lo / NANOS_PER_MICROSECOND;
}
inline __int128 to_int128() const {

warning: function 'to_int128' should be marked [[nodiscard]] [modernize-use-nodiscard]

Suggested change
inline __int128 to_int128() const {
[[nodiscard]] inline __int128 to_int128() const {

}

BlockUPtr block = ctx->get_free_block();
BlockUPtr block = ctx->get_free_block(); //create block <- _output_tuple_desc / the desired result

warning: variable 'block' is not initialized [cppcoreguidelines-init-variables]

Suggested change
BlockUPtr block = ctx->get_free_block(); //create block <- _output_tuple_desc / the desired result
BlockUPtr block = 0 = ctx->get_free_block(); //create block <- _output_tuple_desc / the desired result

@wm1581066 (Collaborator) commented:

schema change

@hubgeter (Contributor, Author) commented:

run buildall

@github-actions (bot) left a comment:

clang-tidy made some suggestions

// specific language governing permissions and limitations
// under the License.

#include <gen_cpp/PlanNodes_types.h>

warning: 'gen_cpp/PlanNodes_types.h' file not found [clang-diagnostic-error]

#include <gen_cpp/PlanNodes_types.h>
         ^

@hubgeter changed the title from "Parquet shecma change" to "[feature](hive)Support hive tables after alter type." on Oct 10, 2023
@doris-robot commented:

TeamCity be ut coverage result:
Function Coverage: 36.31% (8157/22467)
Line Coverage: 28.45% (65312/229590)
Region Coverage: 27.34% (33837/123759)
Branch Coverage: 24.11% (17330/71884)
Coverage Report: http://coverage.selectdb-in.cc/coverage/c7ef2f38a4bd63638791d6c403b4b6cef603ec20_c7ef2f38a4bd63638791d6c403b4b6cef603ec20/report/index.html

@doris-robot commented:

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 44.9 seconds
stream load tsv: 576 seconds loaded 74807831229 Bytes, about 123 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.9 seconds inserted 10000000 Rows, about 346K ops/s
storage size: 17162442616 Bytes

@github-actions (bot) left a comment:

clang-tidy made some suggestions

DataTypePtr src_type;
ParquetConvert::convert_data_type_from_parquet(physical_type, src_type,type,&need_convert);

ColumnPtr src_column = doris_column;

warning: variable 'src_column' is not initialized [cppcoreguidelines-init-variables]

Suggested change
ColumnPtr src_column = doris_column;
ColumnPtr src_column = 0 = doris_column;

@hubgeter (Contributor, Author) commented:

run buildall

@github-actions (bot) left a comment:

clang-tidy made some suggestions

// do nothing
break;
}
ColumnSelectVector::DataReadType read_type;

warning: variable 'read_type' is not initialized [cppcoreguidelines-init-variables]

    ColumnSelectVector::DataReadType read_type;
                                     ^

while (size_t run_length = select_vector.get_next_run<has_filter>(&read_type)) {
switch (read_type) {
case ColumnSelectVector::CONTENT: {
std::vector<StringRef> string_values;

warning: variable 'string_values' is not initialized [cppcoreguidelines-init-variables]

Suggested change
std::vector<StringRef> string_values;
std::vector<StringRef> string_values = 0;

}
string_values.emplace_back(_data->data + _offset, length);
_offset += length;
ColumnSelectVector::DataReadType read_type;

warning: variable 'read_type' is not initialized [cppcoreguidelines-init-variables]

    ColumnSelectVector::DataReadType read_type;
                                     ^

while (size_t run_length = select_vector.get_next_run<has_filter>(&read_type)) {
switch (read_type) {
case ColumnSelectVector::CONTENT: {
std::vector<StringRef> string_values;

warning: variable 'string_values' is not initialized [cppcoreguidelines-init-variables]

Suggested change
std::vector<StringRef> string_values;
std::vector<StringRef> string_values = 0;

}

// read the bitwidth of each miniblock
uint8_t* bit_width_data = _delta_bit_widths.data();

warning: variable 'bit_width_data' is not initialized [cppcoreguidelines-init-variables]

Suggested change
uint8_t* bit_width_data = _delta_bit_widths.data();
uint8_t* bit_width_data = nullptr = _delta_bit_widths.data();

while (size_t run_length = select_vector.get_next_run<has_filter>(&read_type)) {
switch (read_type) {
case ColumnSelectVector::CONTENT: {
std::vector<StringRef> string_values;

warning: variable 'string_values' is not initialized [cppcoreguidelines-init-variables]

Suggested change
std::vector<StringRef> string_values;
std::vector<StringRef> string_values = 0;

auto& column_data = static_cast<ColumnType&>(*doris_column).get_data();
size_t data_index = column_data.size();
column_data.resize(data_index + select_vector.num_values() - select_vector.num_filtered());
ColumnSelectVector::DataReadType read_type;

warning: variable 'read_type' is not initialized [cppcoreguidelines-init-variables]

        ColumnSelectVector::DataReadType read_type;
                                         ^

// specific language governing permissions and limitations
// under the License.

#include <gen_cpp/PlanNodes_types.h>

warning: 'gen_cpp/PlanNodes_types.h' file not found [clang-diagnostic-error]

#include <gen_cpp/PlanNodes_types.h>
         ^

auto* src_data = static_cast<const ColumnVector<NumberType>*>(src_col)->get_data().data();
dst_col->resize(rows);
DecimalScaleParams& scale_params = _convert_params->decimal_scale;
auto* data = static_cast<ColumnDecimal<Decimal<Int64>>*>(dst_col)->get_data().data();

warning: variable 'data' is not initialized [cppcoreguidelines-init-variables]

Suggested change
auto* data = static_cast<ColumnDecimal<Decimal<Int64>>*>(dst_col)->get_data().data();
auto* data = nullptr = static_cast<ColumnDecimal<Decimal<Int64>>*>(dst_col)->get_data().data();

DataTypePtr src_type;
ParquetConvert::convert_data_type_from_parquet(physical_type, src_type, type, &need_convert);

ColumnPtr src_column = doris_column;

warning: variable 'src_column' is not initialized [cppcoreguidelines-init-variables]

Suggested change
ColumnPtr src_column = doris_column;
ColumnPtr src_column = 0 = doris_column;

@hubgeter (Contributor, Author) commented:

run buildall

1 similar comment
@hubgeter (Contributor, Author) commented:

run buildall

@github-actions (bot) left a comment:

clang-tidy made some suggestions

DataTypePtr src_type;
RETURN_IF_ERROR(

ParquetConvert::convert_data_type_from_parquet(physical_type, src_type, type,

warning: variable 'src_column' is not initialized [cppcoreguidelines-init-variables]

Suggested change
ParquetConvert::convert_data_type_from_parquet(physical_type, src_type, type,
ColumnPtr src_column = 0 = doris_column;

@hubgeter force-pushed the parquet_shecma_change branch from 28a1b8c to 7417819 on October 13, 2023 05:39
@hubgeter (Contributor, Author) commented:

run buildall

2 similar comments
@hubgeter (Contributor, Author) commented:

run buildall

@hubgeter (Contributor, Author) commented:

run buildall

@doris-robot commented:

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.73 seconds
stream load tsv: 556 seconds loaded 74807831229 Bytes, about 128 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.7 seconds inserted 10000000 Rows, about 348K ops/s
storage size: 17162176004 Bytes

@hubgeter (Contributor, Author) commented:

run buildall

@doris-robot commented:

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.94 seconds
stream load tsv: 570 seconds loaded 74807831229 Bytes, about 125 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 28.7 seconds inserted 10000000 Rows, about 348K ops/s
storage size: 17162342520 Bytes

@hubgeter (Contributor, Author) commented:

run buildall

@doris-robot commented:

TeamCity be ut coverage result:
Function Coverage: 36.70% (8204/22354)
Line Coverage: 28.79% (65562/227761)
Region Coverage: 27.67% (34029/122976)
Branch Coverage: 24.20% (17248/71266)
Coverage Report: http://coverage.selectdb-in.cc/coverage/48168ae419a1c40d0b8b41aeec5c12b1f5c170ce_48168ae419a1c40d0b8b41aeec5c12b1f5c170ce/report/index.html

@doris-robot commented:

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.8 seconds
stream load tsv: 573 seconds loaded 74807831229 Bytes, about 124 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 29.0 seconds inserted 10000000 Rows, about 344K ops/s
storage size: 17162245495 Bytes

@hubgeter (Contributor, Author) commented:

run buildall

@doris-robot commented:

TeamCity be ut coverage result:
Function Coverage: 36.96% (8269/22374)
Line Coverage: 28.98% (66097/228110)
Region Coverage: 27.84% (34297/123201)
Branch Coverage: 24.31% (17360/71410)
Coverage Report: http://coverage.selectdb-in.cc/coverage/b9e02ba841d4f9cb19adb44fb6756bb4c5317a51_b9e02ba841d4f9cb19adb44fb6756bb4c5317a51/report/index.html

@doris-robot commented:

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.52 seconds
stream load tsv: 557 seconds loaded 74807831229 Bytes, about 128 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 29.2 seconds inserted 10000000 Rows, about 342K ops/s
storage size: 17162001771 Bytes

@hubgeter (Contributor, Author) commented:

run p0

1 similar comment
@hubgeter (Contributor, Author) commented:

run p0

}

template <bool is_nullable>
struct int128totimestamp : public ColumnConvert {
A Member commented:

Int128ToTimestamp, use upper case.

A Member commented:

Is there int128 column in parquet, and why convert to timestamp?

A Member replied:

int128 -> decimal128 / largeint
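For context on the timestamp converter discussed here: a Parquet INT96 value stores a Julian day number (hi) and the nanoseconds elapsed within that day (lo), and the quoted to_timestamp_micros turns that pair into microseconds since the Unix epoch. Below is a self-contained sketch of that arithmetic with the standard constant values, which are assumed to match the PR's ParquetInt96; the free function is illustrative, not the PR's class.

```cpp
// Sketch of the INT96 -> microsecond-timestamp math quoted in this thread.
#include <cstdint>
#include <iostream>

constexpr uint32_t JULIAN_EPOCH_OFFSET_DAYS = 2440588;      // Julian day of 1970-01-01
constexpr uint64_t MICROS_IN_DAY = 86400ULL * 1000 * 1000;  // microseconds per day
constexpr uint64_t NANOS_PER_MICROSECOND = 1000;

// hi = Julian day number, lo = nanoseconds within that day (the INT96 layout).
uint64_t to_timestamp_micros(uint32_t hi, uint64_t lo) {
    return (hi - JULIAN_EPOCH_OFFSET_DAYS) * MICROS_IN_DAY + lo / NANOS_PER_MICROSECOND;
}

int main() {
    // One full day after the Unix epoch -> 86,400,000,000 microseconds.
    std::cout << to_timestamp_micros(JULIAN_EPOCH_OFFSET_DAYS + 1, 0) << "\n";
    return 0;
}
```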

@morningman force-pushed the parquet_shecma_change branch from bb8b178 to 0adc21f on October 31, 2023 14:47
@morningman (Contributor) commented:

run buildall

@doris-robot commented:

TeamCity be ut coverage result:
Function Coverage: 37.27% (8453/22679)
Line Coverage: 29.57% (68179/230544)
Region Coverage: 28.33% (35393/124925)
Branch Coverage: 25.06% (18054/72040)
Coverage Report: http://coverage.selectdb-in.cc/coverage/0adc21fd1ae5626447021b2e01a451b904b55b1e_0adc21fd1ae5626447021b2e01a451b904b55b1e/report/index.html

@doris-robot commented:

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.08 seconds
stream load tsv: 552 seconds loaded 74807831229 Bytes, about 129 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.8 seconds inserted 10000000 Rows, about 347K ops/s
storage size: 17162109331 Bytes

@AshinGau (Member) commented Nov 1, 2023

LGTM

@morningman merged commit a4e415a into apache:master on Nov 1, 2023
dutyu pushed a commit to dutyu/doris that referenced this pull request Nov 4, 2023
1.Reconstruct the logic of decode to read parquet. The parquet  reader first reads the data according to the parquet physical type, and then performs a type conversion.

2.Support hive alter table.
seawinde pushed a commit to seawinde/doris that referenced this pull request Nov 13, 2023
1.Reconstruct the logic of decode to read parquet. The parquet  reader first reads the data according to the parquet physical type, and then performs a type conversion.

2.Support hive alter table.
XuJianxu pushed a commit to XuJianxu/doris that referenced this pull request Dec 14, 2023
1.Reconstruct the logic of decode to read parquet. The parquet  reader first reads the data according to the parquet physical type, and then performs a type conversion.

2.Support hive alter table.
kaka11chen added a commit to kaka11chen/doris that referenced this pull request Dec 15, 2023
kaka11chen added a commit to kaka11chen/doris that referenced this pull request Dec 15, 2023
kaka11chen added a commit to kaka11chen/doris that referenced this pull request Dec 18, 2023
morningman pushed a commit that referenced this pull request Dec 18, 2023
kaka11chen added a commit to kaka11chen/doris that referenced this pull request Dec 26, 2023
AshinGau added a commit that referenced this pull request Apr 10, 2024
…ader (#32873)

Following #25138, unified schema change interface for parquet and orc reader, and can be applied to other format readers as well.
Unified schema change interface for all format readers:
- First, read the data according to the column type of the file into source column;
- Second, convert source column to the destination column with type planned by FE.
yiguolei pushed a commit that referenced this pull request Apr 12, 2024
…ader (#32873)

Following #25138, unified schema change interface for parquet and orc reader, and can be applied to other format readers as well.
Unified schema change interface for all format readers:
- First, read the data according to the column type of the file into source column;
- Second, convert source column to the destination column with type planned by FE.
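A hypothetical sketch of how that format-agnostic second step could be dispatched, picking a converter by the (type read from the file, type planned by FE) pair; the names below (ConvertFn, converter_registry, get_converter) are illustrative assumptions, not the interface added by these PRs:

```cpp
// Illustrative converter dispatch for the "read source column, then convert to
// the FE-planned destination column" split (hypothetical names and signatures).
#include <cstdint>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// A converter rewrites a source column buffer into a destination column buffer;
// void* stands in for the real column abstraction.
using ConvertFn = std::function<void(const void* src_col, void* dst_col)>;

// Registry keyed by (source type as stored in the file, destination type planned by FE).
std::map<std::pair<std::string, std::string>, ConvertFn>& converter_registry() {
    static std::map<std::pair<std::string, std::string>, ConvertFn> registry;
    return registry;
}

ConvertFn get_converter(const std::string& src_type, const std::string& dst_type) {
    auto it = converter_registry().find({src_type, dst_type});
    if (it == converter_registry().end()) {
        throw std::runtime_error("unsupported schema change: " + src_type + " -> " + dst_type);
    }
    return it->second;
}

int main() {
    // Register one pair (int32 read from the file -> bigint planned by FE) and use it.
    converter_registry()[{"int32", "bigint"}] = [](const void* src_col, void* dst_col) {
        const auto* src = static_cast<const std::vector<int32_t>*>(src_col);
        auto* dst = static_cast<std::vector<int64_t>*>(dst_col);
        dst->assign(src->begin(), src->end());
    };
    std::vector<int32_t> src = {7, 8, 9};
    std::vector<int64_t> dst;
    get_converter("int32", "bigint")(&src, &dst);
    return dst.size() == 3 ? 0 : 1;
}
```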

Labels: approved (PR has been approved by one committer), reviewed

Projects: None yet

5 participants