Conversation

@chenlinzhong
Contributor

@chenlinzhong chenlinzhong commented May 26, 2022

Proposed changes

Issue Number: close #9791

Problem Summary:

A table that frequently imports small batches of data generates a new version for each import. If these versions are not merged in time, two problems occur:

  • -235: too many versions; the tablet exceeds the maximum number of versions and the import fails
  • query performance: Doris uses a merge-on-read mechanism, so all versions of the data are merged at read time, and too many versions may degrade query performance

For this reason, we want to merge imported versions faster. The simplest way is to increase the number of threads in the cumulative compaction (CC) pool. However, this may drive the IO of the entire node too high, because a CC task may select rowsets containing a large amount of data.

So we want to merge some rowsets as soon as possible without adding disk IO.

Therefore, we choose rowsets with little data to merge first. A small rowset is defined as one whose imported row count is smaller than config::quick_compaction_max_rows.

To achieve this goal, we add a new thread pool for small-rowset compaction. This thread pool only selects small rowsets for merging. Merging is triggered at the following times:

  • Import completed: when the BE receives the publish version request, it tries to select small rowsets; when the number of selected rowsets exceeds config::quick_compaction_min_rowsets, it performs a compaction
  • -235: a quick compaction is also attempted when a -235 error occurs

config parameter

key                                     desc                                                          default
config::enable_quick_compaction         whether to enable this feature                                false
config::quick_compaction_max_rows       rowsets with fewer rows than this are selected                1000
config::quick_compaction_batch_size     run a quick compaction check every this many publishes        10
config::quick_compaction_min_rowsets    minimum number of selected rowsets to trigger a compaction    10
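For example, enabling the feature in be.conf might look like the following. This is illustrative only; the keys are the ones listed above, with the defaults kept except for turning the feature on:

```
# be.conf (illustrative fragment; defaults from the table above)
enable_quick_compaction = true
quick_compaction_max_rows = 1000
quick_compaction_batch_size = 10
quick_compaction_min_rowsets = 10
```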

Test

1. We create a table and measure the quick compaction cost for different row counts with a certain number of rowsets.

CREATE TABLE table1 (
    event_day date NULL COMMENT "",
    siteid int(11) NULL DEFAULT "10" COMMENT "",
    citycode smallint(6) NULL COMMENT "",
    username varchar(32) NULL DEFAULT "" COMMENT "",
    pv bigint(20) SUM NULL DEFAULT "0" COMMENT ""
) ENGINE=OLAP
AGGREGATE KEY(event_day, siteid, citycode, username)
COMMENT "OLAP"
DISTRIBUTED BY HASH(username) BUCKETS 10
PROPERTIES (
    "replication_allocation" = "tag.location.default: 1",
    "in_memory" = "false",
    "storage_format" = "V2"
);

Here is the result:
(screenshot omitted)

2. Test the performance of high-frequency small-data imports.

Start 30 threads; each thread imports 100 batches, and each batch contains 500-1000 rows. The result:

(screenshots omitted)

We can conclude that:

  • CC compaction cannot compact rowsets in time, and the import process stops when it reaches the maximum version limit
  • quick compaction handles the small-data import scenario effectively and compacts rowsets in time

3. Test the maximum import QPS of one BE with small data, comparing with and without quick compaction.

[without quick compaction]
Start 2 threads; each thread imports 100 batches, and each batch contains 500-1000 rows.
(screenshots omitted)
We can conclude that:

  • max QPS is about 22/s
  • average QPS is about 10-15/s
  • imports are usually interrupted by -235

[with quick compaction]
Start 30 threads; each thread imports 1000 batches, and each batch contains 500-1000 rows.
(screenshots omitted)

We can conclude that:

  • max QPS is about 32/s
  • average QPS is about 30/s
  • imports are never interrupted by -235 and are very stable
  • disk IO remains at a low level

Conclusion

  • quick compaction can effectively solve the -235 problem
  • quick compaction can improve import QPS by more than three times
  • quick compaction does not cost a lot of extra disk IO
  • quick compaction is suitable for scenarios with small-data imports

Checklist(Required)

  1. Does it affect the original behavior: (Yes/No/I Don't know)
  2. Has unit tests been added: (Yes/No/No Need)
  3. Has document been added or modified: (Yes/No/No Need)
  4. Does it need to update dependencies: (Yes/No)
  5. Are there any changes that cannot be rolled back: (Yes/No)

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@morningman
Contributor

Could you provide some test results for this feature?

@morningman morningman added this to the v1.2 milestone May 27, 2022
@chenlinzhong
Contributor Author

Could you provide some test results for this feature?

ok

1. Merge small versions of rowsets as soon as possible to increase the import frequency of small-version data.
2. A small version means that the number of rows is less than config::small_compaction_rowset_rows (default 1000).
@yangzhg
Member

yangzhg commented Jun 9, 2022

Can you provide some test data on the achievable import frequency with and without this feature?

@caiconghui
Contributor

Will there be any side effects on normal compaction performance if there are many stream loads with data that is not small?

@chenlinzhong
Contributor Author

Will there be any side effects on normal compaction performance if there are many stream loads with data that is not small?

Maybe. Quick compaction uses the same lock as CC compaction, so quick compaction may block CC compaction for a tablet in that round. But quick compaction usually finishes very quickly (in less than 1s), and CC compaction can execute in the next round.

@caiconghui
Contributor

Will there be any side effects on normal compaction performance if there are many stream loads with data that is not small?

Maybe. Quick compaction uses the same lock as CC compaction, so quick compaction may block CC compaction for a tablet in that round. But quick compaction usually finishes very quickly (in less than 1s), and CC compaction can execute in the next round.

So maybe it would be better to make this optimization an optional choice for the user?

@chenlinzhong
Contributor Author

Will there be any side effects on normal compaction performance if there are many stream loads with data that is not small?

Maybe. Quick compaction uses the same lock as CC compaction, so quick compaction may block CC compaction for a tablet in that round. But quick compaction usually finishes very quickly (in less than 1s), and CC compaction can execute in the next round.

So maybe it would be better to make this optimization an optional choice for the user?

ok

@morningman morningman added the dev/1.0.1-deprecated should be merged into dev-1.0.1 branch label Jun 10, 2022
@morningman morningman modified the milestones: v1.2, v1.1 Jun 10, 2022
Member

@yangzhg yangzhg left a comment


LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jun 13, 2022
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@yangzhg yangzhg merged commit 4dfebb9 into apache:master Jun 15, 2022
morningman pushed a commit that referenced this pull request Jun 16, 2022
* compaction quickly for small data import #9791
1. Merge small versions of rowsets as soon as possible to increase the import frequency of small-version data.
2. A small version means that the number of rows is less than config::small_compaction_rowset_rows (default 1000).
@morningman morningman added dev/merged-1.0.1-deprecated PR has been merged into dev-1.0.1 and removed dev/1.0.1-deprecated should be merged into dev-1.0.1 branch labels Jun 16, 2022
Labels

approved Indicates a PR has been approved by one committer. area/vectorization dev/merged-1.0.1-deprecated PR has been merged into dev-1.0.1 reviewed

Development

Successfully merging this pull request may close these issues.

[Feature] compaction quickly for small data import

7 participants