[Bug] Fix the bug data balance causes tablet loss #6063
qidaye wants to merge 7 commits into apache:master
be/src/agent/task_worker_pool.cpp
Outdated
| // get path hash of the created tablet | ||
| TabletSharedPtr tablet = StorageEngine::instance()->tablet_manager()->get_tablet( | ||
| create_tablet_req.tablet_id, create_tablet_req.tablet_schema.schema_hash); | ||
| create_tablet_req.tablet_id, create_tablet_req.replica_id, create_tablet_req.tablet_schema.schema_hash); |
Check whether create_tablet_req.replica_id is set before using it. We cannot rely on its default value, because that depends on Thrift's implementation. Alternatively, set a default value in the Thrift file.
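A minimal sketch of the check, written against a hypothetical stand-in for the Thrift-generated struct (the real TCreateTabletReq has many more fields). The helper name replica_id_or and the fallback parameter are assumptions for illustration only; the __isset pattern itself is the one the clone path in this PR uses.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the Thrift-generated types, for illustration only.
using TReplicaId = int64_t;

struct TCreateTabletReqIsSet {
    bool replica_id = false;
};

struct TCreateTabletReq {
    int64_t tablet_id = 0;
    TReplicaId replica_id = 0;
    TCreateTabletReqIsSet __isset;

    // Mirrors the generated setter: assigning the field also marks it as set.
    void __set_replica_id(TReplicaId val) {
        replica_id = val;
        __isset.replica_id = true;
    }
};

// Returns the replica id carried by the request, or `fallback` when the
// sender left the optional field unset (for example, an older FE).
TReplicaId replica_id_or(const TCreateTabletReq& req, TReplicaId fallback) {
    return req.__isset.replica_id ? req.replica_id : fallback;
}
```

With this shape the caller decides explicitly what an unset replica_id means, instead of inheriting whatever value the generated constructor happens to leave in the field.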
| TabletSharedPtr exist_tablet = StorageEngine::instance()->tablet_manager()->get_tablet( | ||
| clone_req.tablet_id, 0 /*replica_id*/, clone_req.schema_hash, &err); | ||
| if (exist_tablet != nullptr) { | ||
| exist_tablet->set_clone_mode(true); |
Why not check the replica id here?
be/src/agent/task_worker_pool.cpp
Outdated
| // clone done, set clone mode false | ||
| // Retrieve once again to prevent tablet from being dropped | ||
| exist_tablet = StorageEngine::instance()->tablet_manager()->get_tablet( | ||
| clone_req.tablet_id, 0 /*replica_id*/, clone_req.schema_hash, &err); |
Why not check the replica id here?
| StorageEngine::instance()->tablet_manager()->load_tablet_from_dir( | ||
| OLAPStatus load_header_status; | ||
| if (old_version_tablet != nullptr) { | ||
| // drop old version tablet first, then and new tablet |
| // drop old version tablet first, then and new tablet | |
| // drop old version tablet first, then add new tablet |
| LOG(WARNING) << "errors while set tablet uid: '" << header_path; | ||
| _error_msgs->push_back("errors while set tablet uid."); | ||
| // reset_replica_id here. before load tablet to tablet_manager | ||
| OLAPStatus reset_replica_id_status = TabletMeta::reset_tablet_replica_id(header_path, _clone_req.replica_id); |
Could reset_tablet_replica_id() be merged into reset_tablet_uid(), to avoid saving the meta twice?
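A rough sketch of the idea, not the actual TabletMeta code: FakeTabletMeta, FakeMetaFile, and reset_tablet_uid_and_replica_id are hypothetical names, and the new uid is passed in as a parameter purely to keep the example deterministic. The point is only that both fields are changed inside one load/modify/save cycle.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for the tablet header contents.
struct FakeTabletMeta {
    int64_t tablet_uid = 0;
    int64_t replica_id = 0;
};

// Hypothetical in-memory "header file" that counts how often it is written.
struct FakeMetaFile {
    FakeTabletMeta stored;
    int save_count = 0;

    FakeTabletMeta load() const { return stored; }

    void save(const FakeTabletMeta& meta) {
        stored = meta;
        ++save_count;
    }
};

// Resets both the tablet uid and the replica id with a single save.
void reset_tablet_uid_and_replica_id(FakeMetaFile& file, int64_t new_uid,
                                     int64_t replica_id) {
    FakeTabletMeta meta = file.load();
    meta.tablet_uid = new_uid;
    meta.replica_id = replica_id;
    file.save(meta); // one write instead of two
}
```

Besides saving a write, a single save means the header is never persisted with a new uid but a stale replica id.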
be/src/agent/task_worker_pool.cpp
Outdated
| if (dropped_tablet != nullptr) { | ||
| if (dropped_tablet->clone_mode()) { | ||
| LOG(WARNING) << "drop table cancelled as tablet is in clone mode! signature: " << agent_task_req.signature; | ||
| error_msgs.push_back("drop table cancelled!"); |
Add the reason to error_msgs too.
Change-Id: Ibfb98ccf52966f6995f55f7ccb0e470abe347e29
| 13: optional TStorageFormat storage_format | ||
| 14: optional TTabletType tablet_type | ||
| 15: optional Types.TReplicaId replica_id | ||
| 16: optional Types.TReplicaId base_replica_id |
Add a comment explaining replica_id and base_replica_id.
| drop_tablet_req.tablet_id, drop_tablet_req.schema_hash, false, &err); | ||
| drop_tablet_req.tablet_id, drop_tablet_req.schema_hash, replica_id, false, &err); | ||
| if (dropped_tablet != nullptr) { | ||
| if (dropped_tablet->clone_mode()) { |
Why not just do this check in tablet_manager()->drop_tablet?
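A minimal sketch of what moving the check into the manager could look like. FakeTablet, FakeTabletManager, and DropResult are hypothetical names, not the real TabletManager API, and treating replica_id == 0 as "not provided, skip the comparison" is an assumption based on the 0 fallback used elsewhere in this PR. A real implementation would also hold the tablet map lock across the check and the erase.

```cpp
#include <cstdint>
#include <map>

// Hypothetical, simplified tablet state.
struct FakeTablet {
    int64_t replica_id = 0;
    bool in_clone_mode = false;
};

enum class DropResult { kDropped, kNotFound, kReplicaMismatch, kInCloneMode };

class FakeTabletManager {
public:
    void add_tablet(int64_t tablet_id, int64_t replica_id) {
        _tablets[tablet_id] = FakeTablet{replica_id, false};
    }

    void set_clone_mode(int64_t tablet_id, bool on) {
        auto it = _tablets.find(tablet_id);
        if (it != _tablets.end()) it->second.in_clone_mode = on;
    }

    // All drop preconditions live here, so every caller gets the same checks.
    DropResult drop_tablet(int64_t tablet_id, int64_t replica_id) {
        auto it = _tablets.find(tablet_id);
        if (it == _tablets.end()) return DropResult::kNotFound;
        // A tablet that is the target of a running clone must not be dropped.
        if (it->second.in_clone_mode) return DropResult::kInCloneMode;
        // Assumption: 0 means the caller did not send a replica id.
        if (replica_id != 0 && it->second.replica_id != replica_id) {
            return DropResult::kReplicaMismatch;
        }
        _tablets.erase(it);
        return DropResult::kDropped;
    }

private:
    std::map<int64_t, FakeTablet> _tablets;
};
```

The task worker would then only translate the returned status into a log line and an error_msgs entry.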
| TReplicaId replica_id = clone_req.__isset.replica_id ? clone_req.replica_id : 0; | ||
| // check tablet with the same tabletId existance, if exist, set tablet in clone mode | ||
| TabletSharedPtr exist_tablet = StorageEngine::instance()->tablet_manager()->get_tablet( |
I think we can do this in EngineCloneTask.
| task_status.__set_error_msgs(error_msgs); | ||
| finish_task_request.__set_task_status(task_status); | ||
| // clone done, set clone mode false |
1. Provide an FE conf to test reliability in the single-replica case when tablet scheduling is frequent. 2. Following #6063, apply almost the same fix to the current code.
fix at #9971
Proposed changes
fix #6061
Add version information in BE tablet meta, which is replica_id in FE meta.
Add replica_id information in CloneTask/CreateReplicaTask/DropReplicaTask of FE.
There are 4 main changes in BE:
1. … replica_id information when creating a tablet
2. … replica_id, and add in_clone_mode flag in tablet_meta
3. … replica_id) and in_clone_mode before drop a tablet
4. … replica_id information when reporting tablet to FE
Two other issues have also been considered.