Skip to content

Conversation

@weizuo93
Copy link
Contributor

@weizuo93 weizuo93 commented Nov 3, 2020

Proposed changes

A large number of small segment files will lead to low efficiency for scan operations. Multiple small files can be merged into a large file by compaction operation. So we could take the tablet scan frequency into consideration when selecting an tablet for compaction and preferentially do compaction for those tablets which are scanned frequently during a latest period of time at the present.

Using the compaction strategy of Kudufor reference, scan frequency can be calculated for tablet during a latest period of time and be taken into consideration when calculating compaction score.

Types of changes

What types of changes does your code introduce to Doris?
Put an x in the boxes that apply

  • [] Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • [] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [] Documentation Update (if none of the other choices apply)
  • [] Code refactor (Modify the code structure, format the code, etc...)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plz update document:
administrator-guide/config/be_config.md

std::unique_ptr<CumulativeCompactionPolicy> _cumulative_compaction_policy;
std::string _cumulative_compaction_type;

int64_t _last_update_scan_count;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment for the new fields

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Comment on lines +335 to +336
CONF_mInt32(compaction_tablet_scan_frequency_factor, "0");
CONF_mInt32(compaction_tablet_compaction_score_factor, "1");
Copy link
Member

@acelyc111 acelyc111 Nov 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do they need to be normalized? If needed, you should define them as double.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do they need to be normalized? If needed, you should define them as double.

Normalization is not required.


### `column_dictionary_key_size_threshold`

### `compaction_tablet_compaction_score_factor`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why no document for these 2 configs?
Better give best practice for them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why no document for these 2 configs?
Better give best practice for them.

done.

}
if (table_score > highest_score) {
highest_score = table_score;
double scan_frequency = tablet_ptr->calculate_scan_frequency();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if compaction_tablet_scan_frequency_factor is zero, we can skip calling calculate_scan_frequency() to save some CPU.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if compaction_tablet_scan_frequency_factor is zero, we can skip calling calculate_scan_frequency() to save some CPU.

It's reasonable.

time_t now = time(nullptr);
int64_t current_count = query_scan_count->value();
double interval = difftime(now, _last_record_scan_count_timestamp);
double scan_frequency = (current_count - _last_record_scan_count) * 60 / interval;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why multi 60?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why multi 60?

It means the average count of tablet scans for each minute, Otherwise it will be the average count of tablet scans for each second .

@weizuo93 weizuo93 force-pushed the compaction_consider_tablet_scan_frequency branch from 6806c9d to 64ed98d Compare November 16, 2020 06:12
Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@morningman morningman added approved Indicates a PR has been approved by one committer. area/compact Issues or PRs related to the compact kind/feature Categorizes issue or PR as related to a new feature. labels Nov 17, 2020
@morningman morningman merged commit 6247408 into apache:master Nov 18, 2020
@yangzhg yangzhg mentioned this pull request Feb 9, 2021
@weizuo93 weizuo93 deleted the compaction_consider_tablet_scan_frequency branch March 9, 2022 07:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. area/compact Issues or PRs related to the compact kind/feature Categorizes issue or PR as related to a new feature.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Proposal] Take 'tablet scan frequency' into consideration when selecting a tablet for compaction

3 participants