@hubgeter
Contributor

@hubgeter hubgeter commented Jul 28, 2023

Proposed changes

Issue Number: close #xxx

Follow-up to PR #21514.

  1. Add support for struct and map column types in the textfile format of Hive tables.
  2. Optimize the code that handles the array column type.
+------+------------------------------------+
| id   | perf                               |
+------+------------------------------------+
| 1    | {"key1":"value1", "key2":"value2"} |
| 1    | {"key1":"value1", "key2":"value2"} |
| 2    | {"name":"John", "age":"30"}        |
+------+------------------------------------+
+---------+------------------+
| column1 | column2          |
+---------+------------------+
|       1 | {10, "data1", 1} |
|       2 | {20, "data2", 0} |
|       3 | {30, "data3", 1} |
+---------+------------------+

Summary of support for complex types (custom delimiters can be specified):

  1. array< primitive_type > and array< array< ... > >
  2. map< primitive_type , primitive_type >
  3. Struct< primitive_type , primitive_type ... >
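
To illustrate what "custom delimiters" means for the types listed above, here is a minimal sketch of splitting one textfile field into map entries. It is not the code from this PR: `split` and `parse_map_field` are hypothetical helpers, and the sketch assumes Hive's default text layout, where `\002` separates collection items and `\003` separates a map key from its value. Both delimiters are passed in as parameters so that a table with different delimiters can be handled the same way.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Split `text` on a single-character delimiter, keeping empty pieces.
std::vector<std::string> split(const std::string& text, char delim) {
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type pos = text.find(delim, start);
        if (pos == std::string::npos) {
            parts.push_back(text.substr(start));
            break;
        }
        parts.push_back(text.substr(start, pos - start));
        start = pos + 1;
    }
    return parts;
}

// Hypothetical helper (not from the PR): parse one map<string,string> field.
// `collection_delim` separates entries; `map_key_delim` separates key and value.
std::vector<std::pair<std::string, std::string>> parse_map_field(
        const std::string& field, char collection_delim, char map_key_delim) {
    std::vector<std::pair<std::string, std::string>> entries;
    if (field.empty()) {
        return entries;  // an empty field yields an empty map
    }
    for (const std::string& item : split(field, collection_delim)) {
        std::string::size_type pos = item.find(map_key_delim);
        if (pos == std::string::npos) {
            entries.emplace_back(item, "");  // key without a value
        } else {
            entries.emplace_back(item.substr(0, pos), item.substr(pos + 1));
        }
    }
    return entries;
}
```

Under the same assumption, a struct field is simply `split(field, collection_delim)`, with each piece converted according to the declared type of the corresponding struct member.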

@hubgeter hubgeter changed the title Hive map struct [feature](hive)append support for struct and map column type on textfile format of hive table Jul 28, 2023
@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@morningman
Contributor

run buildall

@hello-stephen
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 48.07 seconds
stream load tsv: 507 seconds loaded 74807831229 Bytes, about 140 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 29.3 seconds inserted 10000000 Rows, about 341K ops/s
storage size: 17163567337 Bytes

}

//support map<primitive_type,primitive_type>
if (slot.getType().isMapType()) {
Contributor

Suggested change
if (slot.getType().isMapType()) {
else if (slot.getType().isMapType()) {

}

//support Struct< primitive_type,primitive_type ... >
if (slot.getType().isStructType()) {
Contributor

Suggested change
if (slot.getType().isStructType()) {
else if (slot.getType().isStructType()) {
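
The reasoning behind the two `else if` suggestions is that a slot has exactly one type, so the array, map, and struct branches are mutually exclusive and later checks can be skipped once one matches. A minimal sketch of that shape follows, written in C++ for illustration; `ColumnKind` and `pick_branch` are hypothetical names, not identifiers from the PR.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the slot's type category.
enum class ColumnKind { Primitive, Array, Map, Struct };

// Returns the name of the branch that handles `kind`.
// Because the kinds are mutually exclusive, an else-if chain
// stops testing as soon as one condition matches.
std::string pick_branch(ColumnKind kind) {
    std::string branch = "primitive";
    if (kind == ColumnKind::Array) {
        branch = "array";
    } else if (kind == ColumnKind::Map) {
        branch = "map";
    } else if (kind == ColumnKind::Struct) {
        branch = "struct";
    }
    return branch;
}
```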

std::string _value_separator;
std::string _line_delimiter;
std::string _array_delimiter;
// std::string _array_delimiter;
Contributor

delete it

bool TextConverter::write_vec_column(const SlotDescriptor* slot_desc,
vectorized::IColumn* nullable_col_ptr, const char* data,
size_t len, bool copy_string, bool need_escape, size_t rows) {
bool TextConverter::write_date(const TypeDescriptor& type_desc,
Contributor

Suggested change
bool TextConverter::write_date(const TypeDescriptor& type_desc,
bool TextConverter::write_data(const TypeDescriptor& type_desc,

// \N means it's NULL
if (slot_desc->is_nullable()) {
std::string col_type_name = col_ptr->get_name();
if (col_type_name.substr(0, 8) == "Nullable") {
Contributor

Why change to this?

Contributor Author

A sub-SlotDescriptor cannot be obtained from a SlotDescriptor, so the code calls get_name() on the vectorized::IColumn* and uses the returned name to determine whether the column type is nullable.
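
For readers following this thread, the name-based check can be sketched in isolation as below. This is an illustration only: `name_indicates_nullable` is a hypothetical helper, and the assumption that nullable columns report a name beginning with `Nullable` is taken from the snippet under review, not verified against the column implementations.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns true when a column type name starts with
// "Nullable". Assumes names such as "Nullable(Int32)" for nullable columns.
bool name_indicates_nullable(const std::string& column_type_name) {
    static const std::string kPrefix = "Nullable";
    // compare() with position 0 is safe even when the name is shorter than the prefix.
    return column_type_name.compare(0, kPrefix.size(), kPrefix) == 0;
}
```

A prefix test like this ties the logic to the exact spelling of the column name, which is likely why the reviewer asked about the change.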

@github-actions
Contributor

github-actions bot commented Aug 3, 2023

clang-tidy review says "All clean, LGTM! 👍"

@hubgeter
Contributor Author

hubgeter commented Aug 3, 2023

run buildall

@github-actions
Contributor

github-actions bot commented Aug 3, 2023

clang-tidy review says "All clean, LGTM! 👍"

@hello-stephen
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 47.35 seconds
stream load tsv: 518 seconds loaded 74807831229 Bytes, about 137 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 30 seconds loaded 861443392 Bytes, about 27 MB/s
insert into select: 29.9 seconds inserted 10000000 Rows, about 334K ops/s
storage size: 17162494781 Bytes

@hubgeter
Contributor Author

hubgeter commented Aug 4, 2023

run buildall

@github-actions
Contributor

github-actions bot commented Aug 4, 2023

clang-tidy review says "All clean, LGTM! 👍"

@hello-stephen
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.54 seconds
stream load tsv: 514 seconds loaded 74807831229 Bytes, about 138 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 29.6 seconds inserted 10000000 Rows, about 337K ops/s
storage size: 17161995543 Bytes

@zhangguoqiang666
Contributor

run external

@github-actions
Contributor

github-actions bot commented Aug 8, 2023

clang-tidy review says "All clean, LGTM! 👍"

Contributor

@kaka11chen kaka11chen left a comment

LGTM

@github-actions
Contributor

github-actions bot commented Aug 8, 2023

PR approved by anyone and no changes requested.

@morningman
Contributor

run buildall

@hello-stephen
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 44.84 seconds
stream load tsv: 513 seconds loaded 74807831229 Bytes, about 139 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 29.1 seconds inserted 10000000 Rows, about 343K ops/s
storage size: 17162055537 Bytes

Contributor

@morningman morningman left a comment

LGTM

@github-actions github-actions bot added the approved label (indicates a PR has been approved by one committer) Aug 9, 2023
@github-actions
Contributor

github-actions bot commented Aug 9, 2023

PR approved by at least one committer and no changes requested.

@morningman morningman merged commit f1db6bd into apache:master Aug 10, 2023
xiaokang pushed a commit that referenced this pull request Aug 11, 2023
…ile format of hive table (#22347)

1. Add support for struct and map column types in the textfile format of Hive tables.
2. Optimize the code that handles the array column type.

```mysql
+------+------------------------------------+
| id   | perf                               |
+------+------------------------------------+
| 1    | {"key1":"value1", "key2":"value2"} |
| 1    | {"key1":"value1", "key2":"value2"} |
| 2    | {"name":"John", "age":"30"}        |
+------+------------------------------------+
```

```mysql
+---------+------------------+
| column1 | column2          |
+---------+------------------+
|       1 | {10, "data1", 1} |
|       2 | {20, "data2", 0} |
|       3 | {30, "data3", 1} |
+---------+------------------+
```
Summary of support for complex types (custom delimiters can be specified):

1. array< primitive_type > and array< array< ... > >
2. map< primitive_type , primitive_type >
3. Struct< primitive_type , primitive_type ... >
@xiaokang xiaokang mentioned this pull request Aug 30, 2023

Labels

approved, dev/2.0.1-merged, reviewed

6 participants