[Fix](cloud-mow) Check partition's version to avoid wrongly update visible versions' delete bitmaps #49710
zhannngchen
left a comment
LGTM
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
…sible versions' delete bitmaps (apache#49710)

Consider the following problem:

1. Transaction X acquires the lock and attempts to publish with version a. This task is sent to the BE. At this point, the tablet's maximum version is a-1, and task (1) starts computation.
2. Transaction X fails on the FE due to a timeout and releases the lock.
3. Transaction Y acquires the lock, attempts to publish with version a, and succeeds.
4. Transaction X retries, acquires the lock again, and attempts to publish with version b.
5. Meanwhile, task (1) from Transaction X completes its computation on the BE and writes the generated delete bitmap to the MS with version a. **Since Transaction X currently holds the lock, this write succeeds, overwriting the delete bitmaps that Transaction Y wrote for the actual version a.**
6. Subsequent transactions on the tablet will use the pending delete bitmap to delete the version a delete bitmap written by task (1) in the MS.

The root cause is that each time a load txn retries the publish phase, the lock it acquires should count as a different lock, but the current implementation treats them as the same lock because they share the same lock_id and initiator.

This PR checks the target partition's version when updating delete bitmaps to avoid this problem.
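The interleaving in steps 1 to 5 can be replayed with a small toy model. This is only a sketch for illustration, not Doris code: `ToyMetaService`, its methods, and the concrete lock and version values are all hypothetical, and the only behavior taken from the description is that a lock holder is identified by its `(lock_id, initiator)` pair.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ToyMetaService:
    """Hypothetical stand-in for the MS: one lock, delete bitmaps keyed by version."""
    lock_holder: Optional[Tuple[int, int]] = None  # (lock_id, initiator)
    delete_bitmaps: Dict[int, str] = field(default_factory=dict)

    def acquire_lock(self, lock_id: int, initiator: int) -> bool:
        if self.lock_holder is None:
            self.lock_holder = (lock_id, initiator)
            return True
        return self.lock_holder == (lock_id, initiator)

    def release_lock(self, lock_id: int, initiator: int) -> None:
        if self.lock_holder == (lock_id, initiator):
            self.lock_holder = None

    def update_delete_bitmap(self, lock_id: int, initiator: int,
                             version: int, bitmap: str) -> bool:
        # Only the lock identity is checked, not the version being written.
        if self.lock_holder != (lock_id, initiator):
            return False
        self.delete_bitmaps[version] = bitmap
        return True


def replay_race():
    ms = ToyMetaService()
    txn_x, txn_y = (100, -1), (200, -1)  # made-up (lock_id, initiator) pairs
    a = 5                                # made-up version number

    assert ms.acquire_lock(*txn_x)       # step 1: X locks, task (1) starts for version a
    ms.release_lock(*txn_x)              # step 2: X times out on the FE
    assert ms.acquire_lock(*txn_y)       # step 3: Y publishes version a
    assert ms.update_delete_bitmap(*txn_y, a, "bitmap from Y")
    ms.release_lock(*txn_y)
    assert ms.acquire_lock(*txn_x)       # step 4: X retries with the same identity

    # step 5: the stale task (1) still targets version a and passes the lock check
    stale_ok = ms.update_delete_bitmap(*txn_x, a, "stale bitmap from X task (1)")
    return stale_ok, ms.delete_bitmaps[a]
```

In this model `replay_race()` reports that the stale write was accepted and that version a now holds X's bitmap instead of Y's, because the retried lock is indistinguishable from the one task (1) originally ran under.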
… bitmap response regardless of status code (#52547)

### What problem does this PR solve?

PR #49710 added a check in the MS that prevents a stale calc delete bitmap task from wrongly updating delete bitmaps in the MS. However, this can cause a load to fail because of the check on the FE. This PR makes the FE retry committing the txn whenever it encounters a stale calc delete bitmap response, regardless of the task's status code.
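The retry rule described for #52547 can be sketched as a small decision function. This is a hedged illustration only: the `CalcDeleteBitmapResponse` type, its `is_stale` field, and the `Action` names are hypothetical, and how the FE actually detects a stale response is not stated here.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"
    RETRY_COMMIT = "retry_commit"
    FAIL_LOAD = "fail_load"


@dataclass
class CalcDeleteBitmapResponse:
    status_ok: bool  # whether the task reported success
    is_stale: bool   # hypothetical flag: the response belongs to an outdated attempt


def next_action(resp: CalcDeleteBitmapResponse) -> Action:
    # A stale response triggers a retry no matter what status code it carries.
    if resp.is_stale:
        return Action.RETRY_COMMIT
    return Action.PROCEED if resp.status_ok else Action.FAIL_LOAD
```

The point of checking staleness first is that a failed status on a stale response no longer fails the load; only a failure on a current response does.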
What problem does this PR solve?
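Consider the following problem: a stale calc delete bitmap task left over from a retried load txn can overwrite the delete bitmaps of a version that another txn has already made visible.
The root cause is that each time a load txn retries the publish phase, the lock it acquires should count as a different lock, but the current implementation treats them as the same lock because they share the same lock_id and initiator.
This PR checks the target partition's version when updating delete bitmaps to avoid this problem.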
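A minimal sketch of such a version check follows. It is not the MS implementation: the function name, its parameters, and the comparison rule (reject any write whose version is not newer than the partition's current visible version) are assumptions made for illustration, since the description does not spell out the exact condition.

```python
from typing import Dict


def update_delete_bitmap_checked(partition_visible_version: int,
                                 delete_bitmaps: Dict[int, str],
                                 version: int,
                                 bitmap: str) -> bool:
    """Write `bitmap` for `version` only if that version is not yet visible.

    Assumed rule: a write targeting a version at or below the partition's
    visible version comes from a stale task and is rejected.
    """
    if version <= partition_visible_version:
        return False
    delete_bitmaps[version] = bitmap
    return True


# Hypothetical state after Transaction Y has published version 5.
bitmaps = {5: "bitmap from Y"}
visible = 5

stale_accepted = update_delete_bitmap_checked(visible, bitmaps, 5, "stale bitmap from X")
retry_accepted = update_delete_bitmap_checked(visible, bitmaps, 6, "bitmap from X retry")
```

With this rule the stale write for version 5 is refused and Y's bitmap stays in place, while the retried publish for the next version goes through, even though both writes arrive under the same lock identity.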
Release note
None