[fix](replica) handle replica version missing info to avoid -214 error #8209
be/src/olap/tablet.cpp
- OLAPStatus Tablet::check_version_integrity(const Version& version) {
+ OLAPStatus Tablet::check_version_integrity(const Version& version, bool quite) {
What does `quite` mean?
If `!quite`, it logs the missing versions.
You mean `quiet`? `silent` may be better.
yangzhg
left a comment
LGTM
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
apache#8209)

In the original tablet reporting information, missing versions were indicated by combining two pieces of information:

1. the maximum consecutive version number
2. the `version_miss` field

This logic is confusing and inconsistent with how missing versions are checked at query time. After the change, we directly use the version checking logic used in the query and set `version_miss` to true if a missing version is found.

On the FE processing side, originally only the **bad replica** information was synchronized among FEs, but not the **version missing** information. As a result, non-master FEs were not aware of missing versions. In the new design, we deprecate the original log persistence class `BackendTabletsInfo` and use the new `BackendReplicasInfo` to record replica reporting information, writing both **bad** and **version missing** information to metadata so that other FEs can synchronize this information.
#8444)

1. This bug was introduced by #8209. Error in fe.warn.log:

```
java.lang.IllegalStateException: 560278
    at com.google.common.base.Preconditions.checkState(Preconditions.java:508) ~[spark-dpp-0.15-SNAPSHOT.jar:0.15-SNAPSHOT]
    at org.apache.doris.catalog.TabletInvertedIndex.getReplica(TabletInvertedIndex.java:462) ~[palo-fe.jar:0.15-SNAPSHOT]
    at org.apache.doris.catalog.Catalog.replayBackendReplicasInfo(Catalog.java:6941) ~[palo-fe.jar:0.15-SNAPSHOT]
    at org.apache.doris.persist.EditLog.loadJournal(EditLog.java:626) [palo-fe.jar:0.15-SNAPSHOT]
    at org.apache.doris.catalog.Catalog.replayJournal(Catalog.java:2446) [palo-fe.jar:0.15-SNAPSHOT]
    at org.apache.doris.master.Checkpoint.doCheckpoint(Checkpoint.java:116) [palo-fe.jar:0.15-SNAPSHOT]
    at org.apache.doris.master.Checkpoint.runAfterCatalogReady(Checkpoint.java:74) [palo-fe.jar:0.15-SNAPSHOT]
    at org.apache.doris.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:58) [palo-fe.jar:0.15-SNAPSHOT]
    at org.apache.doris.common.util.Daemon.run(Daemon.java:116) [palo-fe.jar:0.15-SNAPSHOT]
```

Reporting a tablet and deleting a tablet are two independent events and are not mutually exclusive, so a tablet may be deleted first and its report processed later.

2. Change the tablet report info. The version that a BE now reports for a tablet is the largest continuous version. E.g., for versions [1,2,3,5,7], the reported version of this tablet will be 3.
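The "largest continuous version" rule in item 2 can be illustrated with a minimal sketch. The helper name `max_continuous_version`, the flat list of version numbers, and the `-1` return value for an empty list are assumptions made for illustration; this is not the code from the Doris BE.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the Doris implementation): returns the largest
// version v such that every version from the smallest one present up to v
// is also present. Returns -1 for an empty list (an assumption of this sketch).
int64_t max_continuous_version(std::vector<int64_t> versions) {
    if (versions.empty()) {
        return -1;
    }
    std::sort(versions.begin(), versions.end());
    int64_t last = versions.front();
    for (std::size_t i = 1; i < versions.size(); ++i) {
        if (versions[i] == last) {
            continue;  // ignore duplicates
        }
        if (versions[i] != last + 1) {
            break;  // first gap found: stop at the previous version
        }
        last = versions[i];
    }
    return last;
}
```

With the versions from the example above, `max_continuous_version({1, 2, 3, 5, 7})` returns 3, because version 4 is the first gap.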
this bug was introduced from apache#8209
Proposed changes
Issue Number: close #8208
Problem Summary:
In the original tablet reporting information, missing versions were indicated by combining two pieces of information:

1. the maximum consecutive version number
2. the `version_miss` field

This logic is confusing and inconsistent with how missing versions are checked at query time.
After the change, we directly use the version checking logic used in the query and set `version_miss` to true if a missing version is found.

On the FE processing side, originally only the bad replica information was synchronized among FEs, but not the version missing information. As a result, non-master FEs were not aware of missing versions.
In the new design, we deprecate the original log persistence class `BackendTabletsInfo` and use the new `BackendReplicasInfo` to record replica reporting information, writing both bad and version missing information to metadata so that other FEs can synchronize this information.
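As a rough illustration of how a report could derive `version_miss` from a query-style integrity check, here is a simplified sketch. It assumes rowsets are plain `[start, end]` version ranges and that a query at version `target` needs `[0, target]` covered without a gap; the function name `covers_without_gap` and the `quiet` flag (the role played by the `quite` parameter discussed in the review) are hypothetical, and the real `Tablet::check_version_integrity` may work differently.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

// A rowset is modelled as an inclusive version range [first, second].
using VersionRange = std::pair<int64_t, int64_t>;

// Hypothetical check: returns true if the rowsets cover every version in
// [0, target]. When quiet is false, the first missing range is logged.
bool covers_without_gap(std::vector<VersionRange> rowsets, int64_t target, bool quiet) {
    std::sort(rowsets.begin(), rowsets.end());
    int64_t next = 0;  // smallest version not yet covered
    for (const auto& r : rowsets) {
        if (next > target) {
            break;  // everything up to target is already covered
        }
        if (r.first > next) {
            if (!quiet) {
                std::cerr << "missing versions [" << next << ", " << r.first - 1 << "]\n";
            }
            return false;
        }
        next = std::max(next, r.second + 1);
    }
    if (next <= target) {
        if (!quiet) {
            std::cerr << "missing versions [" << next << ", " << target << "]\n";
        }
        return false;
    }
    return true;
}
```

Under these assumptions, a report would set `version_miss = !covers_without_gap(rowsets, target, /*quiet=*/true)`, so the periodic report does not log each gap, while a query path could pass `quiet = false` to log it.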