@yixiutt
Contributor

@yixiutt yixiutt commented Oct 26, 2022

This feature mainly handles compaction for ordered data. It adds a min/max key to each segment and checks whether rowsets are non-overlapping, so that compaction can simply move files and update the rowset meta instead of traversing all rows.

The strategy is listed below:

  1. More than half of the rowsets are non-overlapping.
  2. All segments are larger than 10M.
  3. For base compaction, input_rowsets contains no delete version.

By the way, my tests show that calculating the min/max key does not affect load performance.
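
To make the three conditions concrete, here is a minimal Python sketch of how such an eligibility check might look. It is not the code in this PR: the `Rowset` class, its fields, and both function names are hypothetical, and it assumes that "10M" means 10 MiB, that a rowset counts as non-overlapping when its min key is strictly greater than the largest max key seen so far in input order, and that all three conditions must hold together.

```python
from dataclasses import dataclass
from typing import List

# Assumption: "10M" in the description is read as 10 MiB.
MIN_SEGMENT_BYTES = 10 * 1024 * 1024


@dataclass
class Rowset:
    """Hypothetical stand-in for a rowset's metadata (not the Doris class)."""
    min_key: bytes
    max_key: bytes
    segment_sizes: List[int]
    is_delete_version: bool = False


def count_non_overlapping(rowsets: List[Rowset]) -> int:
    """Count rowsets whose key range starts after every earlier key range.

    Assumption: rowsets are given in input (version) order, and the first
    rowset is counted because nothing precedes it.
    """
    count = 0
    prev_max = None
    for rs in rowsets:
        if prev_max is None or rs.min_key > prev_max:
            count += 1
        prev_max = rs.max_key if prev_max is None else max(prev_max, rs.max_key)
    return count


def can_use_ordered_compaction(rowsets: List[Rowset], is_base_compaction: bool) -> bool:
    """Return True when all three conditions from the description hold."""
    if not rowsets:
        return False
    # 1. More than half of the rowsets are non-overlapping.
    if count_non_overlapping(rowsets) * 2 <= len(rowsets):
        return False
    # 2. All segments are larger than the size threshold.
    if any(size <= MIN_SEGMENT_BYTES for rs in rowsets for size in rs.segment_sizes):
        return False
    # 3. For base compaction, no delete version may be among the inputs.
    if is_base_compaction and any(rs.is_delete_version for rs in rowsets):
        return False
    return True
```

When the check passes, compaction could take the fast path of moving files and updating the rowset meta; otherwise it would fall back to the regular row-by-row merge.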

Proposed changes

Issue Number: close #xxx

Problem summary

Describe your changes.

Checklist(Required)

  1. Does it affect the original behavior:
    • Yes
    • No
    • I don't know
  2. Has unit tests been added:
    • Yes
    • No
    • No Need
  3. Has document been added or modified:
    • Yes
    • No
    • No Need
  4. Does it need to update dependencies:
    • Yes
    • No
  5. Are there any changes that cannot be rolled back:
    • Yes (If Yes, please explain WHY)
    • No

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@hello-stephen
Contributor

TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 38.53 seconds
load time: 574 seconds
storage size: 17154827484 Bytes
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20221026191453_clickbench_pr_34299.html

@yixiutt yixiutt force-pushed the ordered_data_compaction branch from 078322e to 6d35086 Compare October 27, 2022 02:50
@hello-stephen
Contributor

TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 38.71 seconds
load time: 598 seconds
storage size: 17154827763 Bytes
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20221027040554_clickbench_pr_34536.html

@yixiutt yixiutt force-pushed the ordered_data_compaction branch 3 times, most recently from 499ab1b to 9b4e49d Compare October 27, 2022 11:09
@hello-stephen
Contributor

TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 38.63 seconds
load time: 565 seconds
storage size: 17154711916 Bytes
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20221027192031_clickbench_pr_34776.html

@yixiutt yixiutt force-pushed the ordered_data_compaction branch from 9b4e49d to c39e0b4 Compare October 27, 2022 11:47
Contributor

@dataroaring dataroaring left a comment

Great. We should add a unit test and a regression test to ensure it works and to check the results.

@morningman
Contributor

Hi @yixiutt, I think this is a breaking change to a core Doris feature, so I created a new branch for developing it:
https://github.com/apache/doris/tree/compaction_opt

I have already pushed the PR "opt compaction task producer and quick compaction" (#13495) to that branch.
I will close this PR; please resubmit it against the compaction_opt branch for testing.

@morningman morningman closed this Nov 1, 2022
4 participants