@hubgeter
Contributor

@hubgeter hubgeter commented Jul 9, 2024

Backport (bp) of #37377

Proposed changes

Since the value of a partition column is fixed for every row read from a given partition, we can deserialize the value once and then insert it into the block repeatedly, instead of deserializing it again for each row.
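
The idea can be sketched outside of Doris as follows. This is a minimal Python illustration under stated assumptions, not the actual backend code: `deserialize`, `fill_per_row`, `fill_once`, and `count_parse_calls` are hypothetical names, and a plain list stands in for a column of the block.

```python
from typing import Callable, List


def deserialize(text: str) -> int:
    """Hypothetical stand-in for parsing a partition value from its text form."""
    return int(text)


def fill_per_row(text: str, num_rows: int,
                 parse: Callable[[str], int] = deserialize) -> List[int]:
    """Baseline: parse the same partition text once for every row."""
    return [parse(text) for _ in range(num_rows)]


def fill_once(text: str, num_rows: int,
              parse: Callable[[str], int] = deserialize) -> List[int]:
    """Optimized: parse once, then repeat the parsed value num_rows times."""
    value = parse(text)
    return [value] * num_rows


def count_parse_calls(fill, text: str, num_rows: int) -> int:
    """Return how many times `fill` invokes the parser."""
    calls = []

    def counting_parse(t: str) -> int:
        calls.append(t)
        return deserialize(t)

    fill(text, num_rows, parse=counting_parse)
    return len(calls)


# Both strategies produce the same column contents.
assert fill_per_row("1", 4) == fill_once("1", 4)
print(count_parse_calls(fill_per_row, "1", 1000))
print(count_parse_calls(fill_once, "1", 1000))
```

The two functions differ only in how often the parser runs, which is the cost this change targets.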

In Hive:
CREATE TABLE parquet_partition_tb (
    col1 STRING,
    col2 INT,
    col3 DOUBLE
) PARTITIONED BY (
    partition_col1 STRING,
    partition_col2 INT
)
STORED AS PARQUET;

INSERT INTO parquet_partition_tb PARTITION (partition_col1="hello", partition_col2=1) VALUES ("word", 2, 2.3);

INSERT INTO parquet_partition_tb PARTITION (partition_col1="hello", partition_col2=1)
SELECT col1, col2, col3 FROM parquet_partition_tb WHERE partition_col1="hello" AND partition_col2=1;

Repeat the `insert into xxx select xxx` operation several times.
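
Assuming the partition starts with the single row inserted above and nothing else writes to it, each `insert into xxx select xxx` run appends a copy of the partition's current rows, so the row count doubles. A quick Python check (the helper name `doublings_needed` is illustrative) relates this to the 33554432 rows counted in the Doris queries:

```python
def doublings_needed(target_rows: int, start_rows: int = 1) -> int:
    """Count how many doublings take start_rows to at least target_rows."""
    runs = 0
    rows = start_rows
    while rows < target_rows:
        rows *= 2
        runs += 1
    return runs


print(doublings_needed(33554432))  # 25, since 2**25 == 33554432
```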


In Doris:
Before:
mysql>  select count(partition_col1) from parquet_partition_tb;
+-----------------------+
| count(partition_col1) |
+-----------------------+
|              33554432 |
+-----------------------+
1 row in set (3.24 sec)

mysql>  select count(partition_col2) from parquet_partition_tb;
+-----------------------+
| count(partition_col2) |
+-----------------------+
|              33554432 |
+-----------------------+
1 row in set (3.34 sec)


After:
mysql>  select count(partition_col1) from parquet_partition_tb ;
+-----------------------+
| count(partition_col1) |
+-----------------------+
|              33554432 |
+-----------------------+
1 row in set (0.79 sec)

mysql> select count(partition_col2) from parquet_partition_tb;
+-----------------------+
| count(partition_col2) |
+-----------------------+
|              33554432 |
+-----------------------+
1 row in set (0.51 sec)

Summary:

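Test SQL: `select count(partition_col) from tbl;`
Number of rows: 33554432

| Partition column type | Before (sec) | After (sec) |
|---|---|---|
| boolean | 3.96 | 0.47 |
| tinyint | 3.39 | 0.47 |
| smallint | 3.14 | 0.50 |
| int | 3.34 | 0.51 |
| bigint | 3.61 | 0.51 |
| float | 4.59 | 0.51 |
| double | 4.60 | 0.55 |
| decimal(5,2) | 3.96 | 0.61 |
| date | 5.80 | 0.52 |
| timestamp | 7.68 | 0.52 |
| string | 3.24 | 0.79 |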
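
To make the results easier to compare across types, the speedup can be computed as before divided by after. The Python snippet below does only that arithmetic on the numbers reported above. It assumes the figures are query times in seconds, as in the `mysql>` output, and the variable names are illustrative.

```python
# (before, after) query times copied from the summary; assumed to be seconds.
timings = {
    "boolean": (3.96, 0.47),
    "tinyint": (3.39, 0.47),
    "smallint": (3.14, 0.50),
    "int": (3.34, 0.51),
    "bigint": (3.61, 0.51),
    "float": (4.59, 0.51),
    "double": (4.60, 0.55),
    "decimal(5,2)": (3.96, 0.61),
    "date": (5.80, 0.52),
    "timestamp": (7.68, 0.52),
    "string": (3.24, 0.79),
}

# Speedup factor per type: how many times faster the query ran after the change.
speedups = {name: before / after for name, (before, after) in timings.items()}

# Print the types from largest to smallest speedup.
for name, ratio in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<13}{ratio:5.1f}x")
```

On these numbers, timestamp gains the most (about 14.8x) and string the least (about 4.1x).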

Issue Number: close #xxx


@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@hubgeter
Contributor Author

hubgeter commented Jul 9, 2024

run buildall

@github-actions
Contributor

github-actions bot commented Jul 9, 2024

clang-tidy review says "All clean, LGTM! 👍"

@hubgeter hubgeter force-pushed the pick_21_opt_fill_partition branch from cffae82 to 6a4ae32 Compare July 10, 2024 01:48
@hubgeter
Contributor Author

run buildall

@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@hubgeter hubgeter force-pushed the pick_21_opt_fill_partition branch from 6a4ae32 to a5a6ffe Compare July 11, 2024 02:00
@hubgeter
Contributor Author

run buildall

@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@hubgeter hubgeter force-pushed the pick_21_opt_fill_partition branch from a5a6ffe to 0787c02 Compare July 15, 2024 01:57
@hubgeter
Contributor Author

run buildall

@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

… without repeated deserialization. (apache#37377)

@hubgeter hubgeter force-pushed the pick_21_opt_fill_partition branch from 0787c02 to b15f6df Compare July 15, 2024 16:10
@hubgeter
Contributor Author

run buildall

@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.50% (9261/25370)
Line Coverage: 28.02% (75657/269968)
Region Coverage: 26.86% (38904/144833)
Branch Coverage: 23.59% (19752/83714)
Coverage Report: http://coverage.selectdb-in.cc/coverage/b15f6df91902706bc03d15e6d352118e77cee7c1_b15f6df91902706bc03d15e6d352118e77cee7c1/report/index.html

@morningman morningman merged commit 6932eef into apache:branch-2.1 Jul 16, 2024
morningman added a commit that referenced this pull request Jul 17, 2024
… columns without repeated deserialization. (#37377)" (#38007)

Reverts #37530
Needs more testing; reverting it temporarily.