Skip to content

PageStorage: Fix peak memory usage when running GC on PageDirectory#6168

Merged
ti-chi-bot merged 11 commits into
pingcap:masterfrom
JaySon-Huang:fix_directory_snapshot
Oct 26, 2022
Merged

PageStorage: Fix peak memory usage when running GC on PageDirectory#6168
ti-chi-bot merged 11 commits into
pingcap:masterfrom
JaySon-Huang:fix_directory_snapshot

Conversation

@JaySon-Huang
Copy link
Copy Markdown
Contributor

@JaySon-Huang JaySon-Huang commented Oct 19, 2022

What problem does this PR solve?

Issue Number: close #6163

Problem Summary:

The snapshot file getting bigger and bigger, causing peak memory usage after TiFlash running for a long time.
Introduced by #5357

#5357 want to resolve a bug that cause by concurrency between deleting a page and Full GC. But that fixes will make the entries in PageDirectory snapshot file getting bigger and bigger. Cause it can not ensure whether there are more edit on the pages with a FileSnapshot

What is changed and how it works?

When running into compact log-files, we freeze the current writing log-file. Because compact log-files and full gc won't run concurrently, once we freeze the current writing log-file, we can get all upsert that may happen after del for the same page id. So the new compacted log files can correctly remove the page id.

Another approach that dose not work Another approach is to avoid persisting a "upsert" after "del". However, because PageStorage support RefPages, it is hard to avoid this behavior.

For example, if we want to avoid persisting the wrong "upsert", we can check whether the page is deleted when the upsert entry edit is being committed to PageDirectory. If the page is deleted, then rollback the entries edit for this page id.
However, PageStorage supports RefPages. If one page itself is marked as deleted, but still being ref by another page id. We still need to commit the upsert entry, or "full gc" can not reduce space amplification as expected.

image

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Run the HeavySkewWriteRead testing pageworkload (based on #6178), we can see that before this PR, the wal snapshot file getting bigger and bigger as it does not remove the deleted page from snapshot file. After this PR, the wal snapshot file only keep the available page.

Before this PR

... ["Dumped directory snapshot ... [num_records=1085673] ... [size=76022852]."] [source=stress_test] [thread_id=23]
... ["Dumped directory snapshot ... [num_records=1125073] ... [size=78780852]."] [source=stress_test] [thread_id=23]
... ["Dumped directory snapshot ... [num_records=1168805] ... [size=81842092]."] [source=stress_test] [thread_id=23]
... ["Dumped directory snapshot ... [num_records=1213833] ... [size=84994052]."] [source=stress_test] [thread_id=23]
... ["Dumped directory snapshot ... [num_records=1258217] ... [size=88100932]."] [source=stress_test] [thread_id=23]
... ["Dumped directory snapshot ... [num_records=1307151] ... [size=91526312]."] [source=stress_test] [thread_id=23]

After this PR

... ["Dumped directory snapshot ... [num_records=754] ... [size=78420]."] [source=stress_test] [thread_id=24]
... ["Dumped directory snapshot ... [num_records=754] ... [size=78420]."] [source=stress_test] [thread_id=24]
... ["Dumped directory snapshot ... [num_records=755] ... [size=78524]."] [source=stress_test] [thread_id=24]
... ["Dumped directory snapshot ... [num_records=754] ... [size=78420]."] [source=stress_test] [thread_id=24]
... ["Dumped directory snapshot ... [num_records=754] ... [size=78420]."] [source=stress_test] [thread_id=24]

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Fix the bug that leads to peak memory usage when running GC

@ti-chi-bot
Copy link
Copy Markdown
Member

ti-chi-bot commented Oct 19, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • hehechen
  • lidezhu

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Details

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. needs-cherry-pick-release-6.3 size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 19, 2022
@JaySon-Huang
Copy link
Copy Markdown
Contributor Author

/run-all-tests

@JaySon-Huang
Copy link
Copy Markdown
Contributor Author

Unit-tests are added, but still need some fullstack tests and polish the metrics

@JaySon-Huang JaySon-Huang force-pushed the fix_directory_snapshot branch from 89c6da3 to 1d0fcfd Compare October 19, 2022 15:52
Comment thread dbms/src/Storages/Page/V3/WALStore.cpp Outdated
@JaySon-Huang
Copy link
Copy Markdown
Contributor Author

/run-all-tests

Comment thread dbms/src/Storages/Page/V3/WALStore.cpp Outdated
Comment thread dbms/src/Storages/Page/V3/WALStore.cpp
Copy link
Copy Markdown
Contributor

@lidezhu lidezhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Oct 25, 2022
@JaySon-Huang
Copy link
Copy Markdown
Contributor Author

JaySon-Huang commented Oct 26, 2022

Metrics (grafana panels with this PR: #6175):
image
Before this fix
image
After this fix
image

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Oct 26, 2022
@JaySon-Huang
Copy link
Copy Markdown
Contributor Author

/merge

@ti-chi-bot
Copy link
Copy Markdown
Member

@JaySon-Huang: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Copy link
Copy Markdown
Member

This pull request has been accepted and is ready to merge.

DetailsCommit hash: a8d677d

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Oct 26, 2022
@ti-chi-bot
Copy link
Copy Markdown
Member

@JaySon-Huang: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@JaySon-Huang
Copy link
Copy Markdown
Contributor Author

/run-unit-test

@ti-chi-bot
Copy link
Copy Markdown
Member

In response to a cherrypick label: new pull request created: #6184.

ti-chi-bot pushed a commit to ti-chi-bot/tiflash that referenced this pull request Oct 26, 2022
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@JaySon-Huang JaySon-Huang deleted the fix_directory_snapshot branch October 26, 2022 07:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-cherry-pick-release-6.3 release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Compaction on PageStorage WAL causes OOM

4 participants