[Bug] Failed to load data because replica's state is not right #9422

@morningman

Search before asking

  • I had searched in the issues and found no similar issues.

Version

1.0.0

What's Wrong?

Loading data fails with the following error:

Failed to commit txn xxx, Tablet [xxxx] success replica num 0 is less than quorum replica ...

The replicas are in the following states:
One replica is in state DECOMMISSION, but its version is correct (e.g., 21).
The other replica is in state NORMAL, but its version is stale (e.g., 20), and its last failed version is 21.

The table has replication_num = 1, and the cluster has 3 BEs.
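
To make the arithmetic behind the error message concrete, here is a minimal sketch of a majority-quorum check applied to the two replicas described above. It is not the actual Doris FE code: the `Replica` class, the `canCount` rule, and the `replicationNum / 2 + 1` quorum formula are assumptions chosen to be consistent with the reported message, and the replica IDs are the ones used in the reproduction steps.

```java
import java.util.Arrays;
import java.util.List;

final class QuorumSketch {
    enum State { NORMAL, DECOMMISSION }

    // Hypothetical, simplified replica model (not the Doris FE class).
    static final class Replica {
        final long id;
        final State state;
        final long version;
        final long lastFailedVersion; // -1 means no failed load recorded

        Replica(long id, State state, long version, long lastFailedVersion) {
            this.id = id;
            this.state = state;
            this.version = version;
            this.lastFailedVersion = lastFailedVersion;
        }
    }

    // Assumed majority quorum: more than half of the configured replicas.
    static int quorum(int replicationNum) {
        return replicationNum / 2 + 1;
    }

    // Assumed rule: a replica counts toward the quorum for a transaction that
    // writes nextVersion only if it is NORMAL, already holds nextVersion - 1,
    // and has no recorded failed version.
    static boolean canCount(Replica r, long nextVersion) {
        return r.state == State.NORMAL
                && r.version == nextVersion - 1
                && r.lastFailedVersion < 0;
    }

    static long successCount(List<Replica> replicas, long nextVersion) {
        return replicas.stream().filter(r -> canCount(r, nextVersion)).count();
    }

    public static void main(String[] args) {
        List<Replica> replicas = Arrays.asList(
                new Replica(10001, State.DECOMMISSION, 21, -1), // correct version, wrong state
                new Replica(10002, State.NORMAL, 20, 21));      // right state, stale version
        long ok = successCount(replicas, 22);
        // prints: success replica num 0, quorum 1
        System.out.println("success replica num " + ok + ", quorum " + quorum(1));
    }
}
```

Under these assumptions, neither replica qualifies: one is excluded by its state and the other by its stale version, so even a quorum of 1 cannot be reached.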

What You Expected?

The replica should be recovered automatically so that subsequent loads can succeed.

How to Reproduce?

The following steps may lead to the error:

  0. Tablet 10000 has 1 replica, 10001, on Backend A, at version 20.
  1. Transaction 100 begins; it is about to write version 21.
  2. A balance clone task begins, cloning from Backend A to Backend B.
  3. The clone task finishes. There are now 2 replicas (10001 and 10002) with version 20, on Backend A and Backend B.
  4. Tablet 10000 is scheduled again, and replica 10001's state is set to DECOMMISSION.
  5. Transaction 100 finishes and sets replica 10001's version to 21. Replica 10002 fails to load, so its version remains 20.
  6. There are now 2 replicas: one in state DECOMMISSION with version 21, and one in state NORMAL with version 20.
  7. Subsequent load jobs cannot find a normal replica, so they fail.

In this situation, the only way to recover is to restart the FE.
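
The reproduction steps can also be replayed as plain state transitions, which makes it easier to see where the tablet ends up. The sketch below is only an illustration: `ReproTimeline`, its `Replica` class, and the "healthy" criterion in `healthyNormalReplicas` are hypothetical and do not correspond to the real scheduler or transaction code.

```java
import java.util.ArrayList;
import java.util.List;

final class ReproTimeline {
    enum State { NORMAL, DECOMMISSION }

    // Hypothetical stand-in for a tablet replica; field names are illustrative.
    static final class Replica {
        final long id;
        final String backend;
        State state = State.NORMAL;
        long version;
        long lastFailedVersion = -1; // -1 means no failed load recorded

        Replica(long id, String backend, long version) {
            this.id = id;
            this.backend = backend;
            this.version = version;
        }

        @Override
        public String toString() {
            return "replica " + id + " on Backend " + backend + ": state=" + state
                    + ", version=" + version + ", lastFailedVersion=" + lastFailedVersion;
        }
    }

    // Replays steps 0-6 and returns the resulting replicas of tablet 10000.
    static List<Replica> replay() {
        List<Replica> tablet = new ArrayList<>();
        Replica r10001 = new Replica(10001, "A", 20);             // step 0
        tablet.add(r10001);
        long txnVersion = 21;                                     // step 1: txn 100 will write version 21
        Replica r10002 = new Replica(10002, "B", r10001.version); // steps 2-3: clone lands at version 20
        tablet.add(r10002);
        r10001.state = State.DECOMMISSION;                        // step 4: tablet scheduled again
        r10001.version = txnVersion;                              // step 5: txn 100 succeeds on 10001
        r10002.lastFailedVersion = txnVersion;                    // step 5: 10002 fails, stays at 20
        return tablet;
    }

    // Step 7, under an assumed criterion: a usable replica is NORMAL and has
    // no recorded failed version.
    static long healthyNormalReplicas(List<Replica> tablet) {
        return tablet.stream()
                .filter(r -> r.state == State.NORMAL && r.lastFailedVersion < 0)
                .count();
    }

    public static void main(String[] args) {
        List<Replica> tablet = replay();
        tablet.forEach(System.out::println);
        System.out.println("healthy NORMAL replicas: " + healthyNormalReplicas(tablet));
    }
}
```

The replay ends with zero replicas that are both NORMAL and up to date, which matches the state described in step 6.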

Anything Else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Labels: usercase