Skip to content

Conversation

@yinzhijian
Copy link
Contributor

Proposed changes

Issue Number: close #xxx

Problem Summary:

Optimization Todo list:

  1. FE generates the corresponding src slot desc and Expr through the parquet schema
  2. BE supports direct conversion of arrow type into dest primitive type of similar type. For example, arrow type is INT32, and dest type is TYPE_BIGINT (int64), INT32=>TYPE_BIGINT. Instead of the current way: INT32=> TYPE_INT => TYPE_BIGINT

Performance Testing:

load parquet file in vec version almost 1x faster than rowset version.
rows num:300k
test table schema:
CREATE TABLE parquet (
id int(11) NOT NULL COMMENT "",
email varchar(26) NOT NULL COMMENT "",
c_date32 DATE NOT NULL COMMENT "",
c_date64 DATETIME NOT NULL COMMENT "",
c_timestamp DATETIME NOT NULL COMMENT "",
c_decimal128 DECIMAL(27, 9) NULL COMMENT "",
c_bool BOOLEAN NULL COMMENT "",
c_float FLOAT NULL COMMENT "",
c_double DOUBLE NULL COMMENT "",
c_fixed_size_binary CHAR(20) NULL COMMENT "",
c_binary VARCHAR(32) NULL COMMENT "",
c_uint64 BIGINT NULL COMMENT ""
)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES (
"replication_num" = "1"
);

Checklist(Required)

  1. Does it affect the original behavior: (No)
  2. Has unit tests been added: (No)
  3. Has document been added or modified: (No Need)
  4. Does it need to update dependencies: (No)
  5. Are there any changes that cannot be rolled back: (No)

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@github-actions github-actions bot added area/sql/function Issues or PRs related to the SQL functions area/vectorization labels Apr 26, 2022
@github-actions github-actions bot added area/load Issues or PRs related to all kinds of load area/planner Issues or PRs related to the query planner labels Apr 29, 2022
runtime/vdata_stream_mgr.cpp
runtime/vpartition_info.cpp
runtime/vsorted_run_merger.cpp
runtime/vload_channel.cpp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code not add in this PR?

#define FOR_ARROW_TYPES(M) \
M(::arrow::Type::BOOL, TYPE_BOOLEAN) \
M(::arrow::Type::INT8, TYPE_TINYINT) \
M(::arrow::Type::UINT8, TYPE_TINYINT) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UINT8 bigger than tinyint ?may overflow

PaddedPODArray<UInt8> & column_chars_t = assert_cast<ColumnString &>(*data_column).get_chars();
PaddedPODArray<UInt32> & column_offsets = assert_cast<ColumnString &>(*data_column).get_offsets();

const auto & concrete_array = dynamic_cast<const arrow::BinaryArray &>(*array);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code format: auto&. same to other.

PaddedPODArray<UInt8> & column_chars_t = assert_cast<ColumnString &>(*data_column).get_chars();
PaddedPODArray<UInt32> & column_offsets = assert_cast<ColumnString &>(*data_column).get_offsets();

const auto & concrete_array = dynamic_cast<const arrow::BinaryArray &>(*array);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rethink here really need dynamic_cast ?

  1. dynamic_cast means bad design of class or code, replace with virtual function
  2. this code may throw exception if cast faild. it is danger here.

return Status::OK();
}

Status arrow_column_to_doris_column(const arrow::Array* arrow_column,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add UT for this class

// init arrow batch
{
Status st = init_arrow_batch_if_necessary();
if (!st.ok()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code is weird? rethink the logic?


// eval conjuncts, for example: t1 > 1
Status VParquetScanner::eval_conjunts(Block* block) {
for (auto& vctx : _pre_filter_vctxs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may should change _pre_filter_vctxs to a tree to avoid the each time of copy operation

@yinzhijian
Copy link
Contributor Author

close this pr, use this #9433

@yinzhijian yinzhijian closed this May 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/load Issues or PRs related to all kinds of load area/planner Issues or PRs related to the query planner area/sql/function Issues or PRs related to the SQL functions area/vectorization

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants