Conversation

@Jibing-Li
Contributor

@Jibing-Li Jibing-Li commented Oct 9, 2023

When running a sampled analyze, the collected row count, null count, and data size must be multiplied by a scaling coefficient derived from the sample percent/rows. This PR mainly calculates that coefficient from the ratio of the sampled file size to the total file size.
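
To illustrate the idea in the description, here is a minimal sketch of size-based extrapolation. It is not the code from this PR: the names `SampledStats`, `scale_factor`, and `extrapolate` are hypothetical, and it assumes the coefficient is simply total file size divided by sampled file size, applied linearly to row count, null count, and data size.

```python
from dataclasses import dataclass


@dataclass
class SampledStats:
    """Statistics measured on the sampled files only (hypothetical type)."""
    row_count: int
    null_count: int
    data_size: int


def scale_factor(sampled_bytes: int, total_bytes: int) -> float:
    """Return the coefficient that extrapolates sampled stats to the full table.

    Assumption: the coefficient is total size over sampled size, so sampling
    a quarter of the bytes yields a factor of 4.
    """
    if sampled_bytes <= 0:
        raise ValueError("sampled_bytes must be positive")
    if total_bytes < sampled_bytes:
        raise ValueError("total_bytes must be at least sampled_bytes")
    return total_bytes / sampled_bytes


def extrapolate(stats: SampledStats, factor: float) -> SampledStats:
    """Multiply each sampled statistic by the coefficient and round to integers."""
    return SampledStats(
        row_count=round(stats.row_count * factor),
        null_count=round(stats.null_count * factor),
        data_size=round(stats.data_size * factor),
    )


# Usage: 25 bytes sampled out of 100 total gives a factor of 4.
factor = scale_factor(sampled_bytes=25, total_bytes=100)
estimated = extrapolate(SampledStats(row_count=1000, null_count=10, data_size=4096), factor)
```

Deriving the factor from file sizes rather than from the requested sample percent means the estimate reflects what was actually read, which may differ from the requested percentage when files are sampled whole.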


@Jibing-Li Jibing-Li force-pushed the sample branch 6 times, most recently from 77f7de7 to 3a0d336 Compare October 9, 2023 12:20
@Jibing-Li Jibing-Li marked this pull request as ready for review October 10, 2023 01:59
@Jibing-Li Jibing-Li force-pushed the sample branch 5 times, most recently from 25fe617 to 4656dd9 Compare October 10, 2023 06:10
@Jibing-Li
Contributor Author

run buildall

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.59 seconds
stream load tsv: 579 seconds loaded 74807831229 Bytes, about 123 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 30 seconds loaded 861443392 Bytes, about 27 MB/s
insert into select: 28.7 seconds inserted 10000000 Rows, about 348K ops/s
storage size: 17162404506 Bytes

@Jibing-Li Jibing-Li force-pushed the sample branch 3 times, most recently from bd6a208 to ad75f41 Compare October 11, 2023 03:39
@Jibing-Li
Contributor Author

run buildall

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.11 seconds
stream load tsv: 563 seconds loaded 74807831229 Bytes, about 126 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 33 seconds loaded 861443392 Bytes, about 24 MB/s
insert into select: 29.1 seconds inserted 10000000 Rows, about 343K ops/s
storage size: 17162290863 Bytes

@Jibing-Li
Contributor Author

run buildall

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 44.82 seconds
stream load tsv: 560 seconds loaded 74807831229 Bytes, about 127 MB/s
stream load json: 24 seconds loaded 2358488459 Bytes, about 93 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.8 seconds inserted 10000000 Rows, about 347K ops/s
storage size: 17162288363 Bytes

Contributor

@morningman morningman left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Oct 11, 2023
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@morningman morningman merged commit c63bf24 into apache:master Oct 12, 2023
Jibing-Li added a commit to Jibing-Li/incubator-doris that referenced this pull request Oct 13, 2023
While doing sample analyze, the result of row count, null number and datasize need to multiply a coefficient based on
the sample percent/rows. This pr is mainly to calculate the coefficient according to the sampled file size over total size.
Jibing-Li added a commit to Jibing-Li/incubator-doris that referenced this pull request Oct 15, 2023
While doing sample analyze, the result of row count, null number and datasize need to multiply a coefficient based on
the sample percent/rows. This pr is mainly to calculate the coefficient according to the sampled file size over total size.
@Jibing-Li Jibing-Li deleted the sample branch October 17, 2023 04:05
dutyu pushed a commit to dutyu/doris that referenced this pull request Oct 28, 2023
While doing sample analyze, the result of row count, null number and datasize need to multiply a coefficient based on 
the sample percent/rows. This pr is mainly to calculate the coefficient according to the sampled file size over total size.
@xiaokang xiaokang mentioned this pull request Dec 4, 2023
Labels

approved Indicates a PR has been approved by one committer. dev/2.0.3-merged merge_conflict reviewed

5 participants