
Conversation

@hubgeter (Contributor) commented Aug 2, 2024

Proposed changes

Pick PR #38575 and fix the bug in that PR (#38245).

… without repeated deserialization. (apache#37377)

## Proposed changes

Since the value of a partition column is fixed when querying a partitioned table, we can deserialize it only once and then repeatedly insert the resulting value into the block (a rough sketch follows the example below).
```sql
in Hive: 
CREATE TABLE parquet_partition_tb (
    col1 STRING,
    col2 INT,
    col3 DOUBLE
) PARTITIONED BY (
    partition_col1 STRING,
    partition_col2 INT
)
STORED AS PARQUET;

insert into parquet_partition_tb partition (partition_col1="hello", partition_col2=1) values ("word", 2, 2.3);

insert into parquet_partition_tb partition (partition_col1="hello", partition_col2=1)
select col1, col2, col3 from parquet_partition_tb where partition_col1="hello" and partition_col2=1;

Repeat the `insert into ... select ...` statement several times to grow the table.


In Doris, before this PR:
mysql>  select count(partition_col1) from parquet_partition_tb;
+-----------------------+
| count(partition_col1) |
+-----------------------+
|              33554432 |
+-----------------------+
1 row in set (3.24 sec)

mysql>  select count(partition_col2) from parquet_partition_tb;
+-----------------------+
| count(partition_col2) |
+-----------------------+
|              33554432 |
+-----------------------+
1 row in set (3.34 sec)


In Doris, after this PR:
mysql>  select count(partition_col1) from parquet_partition_tb ;
+-----------------------+
| count(partition_col1) |
+-----------------------+
|              33554432 |
+-----------------------+
1 row in set (0.79 sec)

mysql> select count(partition_col2) from parquet_partition_tb;
+-----------------------+
| count(partition_col2) |
+-----------------------+
|              33554432 |
+-----------------------+
1 row in set (0.51 sec)

```
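
As a rough, standalone sketch of the idea (the `Column` alias, `parse_int`, and the two fill functions below are made up for illustration and are not the actual Doris code):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for a column of values; the real reader fills vectorized columns.
using Column = std::vector<int64_t>;

// Stand-in for the text-to-value deserialization of a partition value.
static int64_t parse_int(const std::string& text) { return std::stoll(text); }

// Before: the same partition value is deserialized once per row.
void fill_partition_column_before(Column& col, const std::string& text, size_t rows) {
    for (size_t i = 0; i < rows; ++i) {
        col.push_back(parse_int(text));  // repeated deserialization
    }
}

// After: deserialize once, then insert the resulting value `rows` times.
void fill_partition_column_after(Column& col, const std::string& text, size_t rows) {
    const int64_t value = parse_int(text);  // deserialized exactly once
    col.insert(col.end(), rows, value);     // only repeated insertion
}
```
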
## Summary
Test SQL: `select count(partition_col) from tbl;`
Number of rows: 33554432

| Column type | Before (s) | After (s) |
|---|---|---|
|boolean |  3.96|0.47  | 
|tinyint  |  3.39|0.47  |  
|smallint |  3.14|0.50   |
|int    |3.34|0.51   | 
|bigint  |   3.61|0.51  |
|float   | 4.59 |0.51  | 
|double   |4.60| 0.55  | 
|decimal(5,2)|  3.96  |0.61 | 
|date   | 5.80|0.52    | 
|timestamp |  7.68 | 0.52 | 
|string  |  3.24 |0.79   | 

Issue Number: close #xxx

…rom_fixed_json (apache#38245)

## Proposed changes
Fix a bug in `DataTypeNullableSerDe::deserialize_column_from_fixed_json`.

The expected behavior of `deserialize_column_from_fixed_json` is to `insert` n values into the column.

However, the `DataTypeNullableSerDe` implementation `resize`s the null_map column to n instead of inserting n values into it. Since this function is only used by `_fill_partition_columns` in the parquet/orc reader and is not called repeatedly for a `get_next_block`, the bug was masked.

Previous PR: apache#37377
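
A minimal, standalone sketch of the difference, assuming the null map is just a `std::vector<uint8_t>` here (the helper names are hypothetical, not the Doris serde API):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using NullMap = std::vector<uint8_t>;

// Buggy variant: sets the total size to n instead of appending n entries.
void append_not_null_buggy(NullMap& null_map, size_t n) {
    null_map.resize(n, 0);
}

// Fixed variant: appends n "not null" flags, matching the data column growth.
void append_not_null_fixed(NullMap& null_map, size_t n) {
    null_map.insert(null_map.end(), n, 0);
}

int main() {
    NullMap buggy, fixed;
    // Called once on an empty column (the _fill_partition_columns case):
    // both variants look correct.
    append_not_null_buggy(buggy, 4);
    append_not_null_fixed(fixed, 4);
    assert(buggy.size() == 4 && fixed.size() == 4);

    // Called a second time on a non-empty column: the buggy variant stays at
    // size 4 while the data column would have grown to 8.
    append_not_null_buggy(buggy, 4);
    append_not_null_fixed(fixed, 4);
    assert(buggy.size() == 4);  // out of sync with the data column
    assert(fixed.size() == 8);  // in sync
    return 0;
}
```

As the description above notes, a single call on an empty column behaves the same in both variants, which is why the bug stayed hidden.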
@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@hubgeter (Contributor, Author) commented Aug 2, 2024

run buildall

@github-actions bot commented Aug 2, 2024

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.39% (9256/25438)
Line Coverage: 27.92% (75671/271012)
Region Coverage: 26.76% (38898/145385)
Branch Coverage: 23.47% (19730/84050)
Coverage Report: http://coverage.selectdb-in.cc/coverage/2b7b903ff2aa7d61e2b6b43f01e0e852ea98bdc4_2b7b903ff2aa7d61e2b6b43f01e0e852ea98bdc4/report/index.html

@yiguolei yiguolei merged commit 607c0b8 into apache:branch-2.1 Aug 5, 2024
@yiguolei yiguolei mentioned this pull request Sep 5, 2024