@yujun777
Contributor

@yujun777 yujun777 commented Oct 10, 2023

Proposed changes

FE should sync a replica's version with the BE's report:

I. In some cases, the replica version recorded in FE may be greater than the version BE reports, due to program bugs. If no write happens afterwards, the FE replica's last failed version stays at -1, and the replica can never be repaired.

To fix this, if FE's version > BE's reported version, then after 5 minutes FE marks the replica as missing versions, so that it can be repaired later.
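
The rule in case I can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the class `ReplicaVersionTracker`, its fields, and `onBackendReport` are hypothetical names, and modelling "mark as missing versions" as `lastFailedVersion = version + 1` is an assumption borrowed from step e of case II.

```java
/** Hypothetical, simplified FE-side replica record; not the Doris Replica class. */
class ReplicaVersionTracker {
    // The 5-minute wait comes from the description; the constant name is made up.
    static final long GRACE_MS = 5 * 60 * 1000L;

    long version;                 // version FE believes the replica has
    long lastFailedVersion = -1;  // -1 means "no known missing version"
    long behindSinceMs = -1;      // when FE first saw BE report a lower version

    ReplicaVersionTracker(long version) {
        this.version = version;
    }

    /** Handles one BE report; returns true if the replica was marked as missing versions. */
    boolean onBackendReport(long reportedVersion, long nowMs) {
        if (reportedVersion >= version) {
            behindSinceMs = -1;          // BE is not behind, clear the timer
            return false;
        }
        if (behindSinceMs < 0) {
            behindSinceMs = nowMs;       // first time BE is seen behind, start the timer
            return false;
        }
        if (nowMs - behindSinceMs < GRACE_MS) {
            return false;                // still inside the grace period
        }
        // Assumption: marking as "missing versions" means setting a last failed version.
        lastFailedVersion = version + 1;
        return true;
    }
}
```

Waiting before marking the replica avoids reacting to a report that is merely stale, for example one taken just before a publish finished on BE.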

II. In other cases, a replica can end up with version = partition's visible version = partition's committed version, while its last failed version = partition's committed version + 1. An example:
a. Suppose the partition's visible version = 10 and committed version = 11; the committed txn T (version 11) is not published yet.
b. Clone replica A to B. After cloning, B's version = 10.
c. Publish txn T; A's version = 11. Due to a txn publish bug, B's version also becomes 11 (PR #23706 has fixed this). (Another case: even with that PR applied and BE publishing txns correctly, if BE powers off and restarts, it may miss version 11 because the data had not yet been synced to disk, so the writes are lost.)
d. Publish another txn; the versions of A and B become 12.
e. Now B's versions on the backend are [1, 10], [12, 12], so BE reports that it is missing version 11. FE then sets this replica's last failed version = version + 1 = 12 + 1 = 13.
f. FE lets B clone version 11 from A. After cloning, B contains all the versions. But Replica.updateVersion has a bug: if it finds that the version did not change (12 -> 12), it does not reset the replica's last failed version to -1.
From then on, B stays at version = 12, last failed version = 13.
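
The arithmetic in step e can be checked with a small sketch. The class `VersionGap` and both methods are hypothetical helpers written for this illustration; they are not part of Doris.

```java
/** Hypothetical helper for illustrating step e; not part of Doris. */
class VersionGap {
    /**
     * Returns the first version missing from sorted, non-overlapping
     * [start, end] ranges, or -1 if the ranges are contiguous.
     */
    static long firstMissingVersion(long[][] ranges) {
        if (ranges.length == 0) {
            return -1;
        }
        long expected = ranges[0][0];
        for (long[] range : ranges) {
            if (range[0] > expected) {
                return expected;         // gap found before this range starts
            }
            expected = range[1] + 1;     // next version that should follow
        }
        return -1;
    }

    /** Per step e: FE records last failed version = replica version + 1. */
    static long lastFailedVersionAfterReport(long replicaVersion) {
        return replicaVersion + 1;
    }
}
```

With B's ranges `{{1, 10}, {12, 12}}`, `firstMissingVersion` returns 11 and `lastFailedVersionAfterReport(12)` returns 13, matching the values in the example.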

To fix this problem, after cloning or when BE reports, if the replica's version >= the partition's committed version (more precisely, replica's version = partition's committed version) and the replica's last failed version > the partition's committed version (more precisely, replica's last failed version = partition's committed version + 1), then FE should set the replica's last failed version to -1.
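
A minimal sketch of this condition, under the assumption that it can be expressed as a pure function of three versions. `LastFailedVersionFix.reconcile` is a hypothetical name and does not reflect how the PR actually structures the change.

```java
/** Hypothetical illustration of the reset rule for case II; not the PR's code. */
class LastFailedVersionFix {
    /**
     * Returns the last failed version the replica should have after a clone
     * or a BE report. -1 means the replica has no known failed version.
     */
    static long reconcile(long replicaVersion, long lastFailedVersion,
                          long partitionCommittedVersion) {
        boolean caughtUp = replicaVersion >= partitionCommittedVersion;
        boolean staleFailure = lastFailedVersion > partitionCommittedVersion;
        if (caughtUp && staleFailure) {
            // The failed version points past anything committed, so it is stale.
            return -1;
        }
        return lastFailedVersion;
    }
}
```

For replica B in the example, `reconcile(12, 13, 12)` returns -1, while a replica that is still behind, such as `reconcile(10, 13, 12)`, keeps its last failed version of 13.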


@yujun777
Contributor Author

run buildall

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.62 seconds
stream load tsv: 561 seconds loaded 74807831229 Bytes, about 127 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.9 seconds inserted 10000000 Rows, about 346K ops/s
storage size: 17161924956 Bytes

dataroaring
dataroaring previously approved these changes Oct 11, 2023
Contributor

@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Oct 11, 2023
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@yujun777
Contributor Author

run buildall

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Oct 11, 2023
@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.38 seconds
stream load tsv: 561 seconds loaded 74807831229 Bytes, about 127 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.7 seconds inserted 10000000 Rows, about 348K ops/s
storage size: 17162360855 Bytes

Contributor

@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Oct 12, 2023
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

xiaokang added a commit that referenced this pull request Oct 20, 2023
xiaokang added a commit that referenced this pull request Oct 20, 2023
dutyu pushed a commit to dutyu/doris that referenced this pull request Oct 28, 2023
@xiaokang xiaokang mentioned this pull request Dec 4, 2023
XuJianxu pushed a commit to XuJianxu/doris that referenced this pull request Dec 14, 2023

Labels

approved Indicates a PR has been approved by one committer. dev/2.0.3-merged merge_conflict reviewed

5 participants