Skip to content

[Bug]Fix the bug data balance causes tablet loss#6063

Closed
qidaye wants to merge 7 commits intoapache:masterfrom
qidaye:tablet_lost
Closed

[Bug]Fix the bug data balance causes tablet loss#6063
qidaye wants to merge 7 commits intoapache:masterfrom
qidaye:tablet_lost

Conversation

@qidaye
Copy link
Contributor

@qidaye qidaye commented Jun 18, 2021

Proposed changes

fix #6061

Add version information in BE tablet meta, which is replica_id in FE meta.

Add replica_id information in CloneTask/CreateReplicaTask/DropReplicaTask of FE.

There are 4 main changes in BE:

  1. add replica_id information when creating a tablet
  2. clone tablet with checking replica_id, and add in_clone_mode flag in tablet_meta
  3. check version information(replica_id) and in_clone_mode before drop a tablet
  4. add replica_id information when reporting tablet to FE

Two other issues have also been considered.

  1. Do not affect other tasks.
  2. Compatibility of old and new data, with and without version information.

Types of changes

What types of changes does your code introduce to Doris?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)
  • Code refactor (Modify the code structure, format the code, etc...)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have created an issue on (Fix [Bug] Tablet is lost due to data balancing #6061) and described the bug/feature there in detail
  • Compiling and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • If these changes need document changes, I have updated the document
  • Any dependent changes have been merged

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@morningman morningman added area/balance Issues or PRs related to data balance kind/fix Categorizes issue or PR as related to a bug. labels Jun 28, 2021
// get path hash of the created tablet
TabletSharedPtr tablet = StorageEngine::instance()->tablet_manager()->get_tablet(
create_tablet_req.tablet_id, create_tablet_req.tablet_schema.schema_hash);
create_tablet_req.tablet_id, create_tablet_req.replica_id, create_tablet_req.tablet_schema.schema_hash);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check if create_tablet_req.replica_id is set before using it. We can not rely on the default value of it. It depends on thrift's implementation. Or you can set the default value in thrift file.

TabletSharedPtr exist_tablet = StorageEngine::instance()->tablet_manager()->get_tablet(
clone_req.tablet_id, 0 /*replica_id*/, clone_req.schema_hash, &err);
if (exist_tablet != nullptr) {
exist_tablet->set_clone_mode(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not checking replica id here?

// clone done, set clone mode false
// Retrieve once again to prevent tablet from being dropped
exist_tablet = StorageEngine::instance()->tablet_manager()->get_tablet(
clone_req.tablet_id, 0 /*replica_id*/, clone_req.schema_hash, &err);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not checking replica id here?

StorageEngine::instance()->tablet_manager()->load_tablet_from_dir(
OLAPStatus load_header_status;
if (old_version_tablet != nullptr) {
// drop old version tablet first, then and new tablet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// drop old version tablet first, then and new tablet
// drop old version tablet first, then add new tablet

LOG(WARNING) << "errors while set tablet uid: '" << header_path;
_error_msgs->push_back("errors while set tablet uid.");
// reset_replica_id here. before load tablet to tablet_manager
OLAPStatus reset_replica_id_status = TabletMeta::reset_tablet_replica_id(header_path, _clone_req.replica_id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge reset_tablet_replica_id() into reset_tablet_uid()? To avoid save meta twice?

if (dropped_tablet != nullptr) {
if (dropped_tablet->clone_mode()) {
LOG(WARNING) << "drop table cancelled as tablet is in clone mode! signature: " << agent_task_req.signature;
error_msgs.push_back("drop table cancelled!");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add reason to error_msgs too.

13: optional TStorageFormat storage_format
14: optional TTabletType tablet_type
15: optional Types.TReplicaId replica_id
16: optional Types.TReplicaId base_replica_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment to explain replica_id and base_replica_id

drop_tablet_req.tablet_id, drop_tablet_req.schema_hash, false, &err);
drop_tablet_req.tablet_id, drop_tablet_req.schema_hash, replica_id, false, &err);
if (dropped_tablet != nullptr) {
if (dropped_tablet->clone_mode()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just do this check in tablet_manager()->drop_tablet?


TReplicaId replica_id = clone_req.__isset.replica_id ? clone_req.replica_id : 0;
// check tablet with the same tabletId existance, if exist, set tablet in clone mode
TabletSharedPtr exist_tablet = StorageEngine::instance()->tablet_manager()->get_tablet(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do this in EngineCloneTask.

task_status.__set_error_msgs(error_msgs);
finish_task_request.__set_task_status(task_status);

// clone done, set clone mode false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do this in clone task.

@morningman morningman self-assigned this Jan 6, 2022
morningman pushed a commit that referenced this pull request Jun 15, 2022
1. Provide a FE conf to test the reliability in single replica case when tablet scheduling are frequent.
2. According to #6063, almost apply this fix on current code.
@qidaye
Copy link
Contributor Author

qidaye commented Jun 15, 2022

fix at #9971

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/balance Issues or PRs related to data balance kind/fix Categorizes issue or PR as related to a bug.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Tablet is lost due to data balancing

2 participants