[Bug] Replica state is in DECOMMISSION forever #6561

@morningman

Description

Search before asking

  • I have searched the issues and found no similar issues.

Version

trunk 79fd117

What's Wrong?

At some point, a replica of a tablet may get stuck in the DECOMMISSION state and never recover.
The tablet belongs to a colocation table, so the colocation group containing that table stays unstable indefinitely and the colocation plan cannot be carried out.

What You Expected?

The replica state should return to NORMAL after some time, and the colocation group should become STABLE.

How to Reproduce?

Hard to reproduce.
You need multiple colocation tables under high-frequency load.

It may happen as follows:

  1. A tablet of a colocation table is in the COLOCATION_REDUNDANT state.
  2. The tablet is scheduled, and one of its replicas is set to DECOMMISSION in TabletScheduler.deleteReplicaInternal().
  3. The tablet is then scheduled again.
  4. By that time, the BE node hosting the replica that was set to DECOMMISSION in step 2 has been returned to the colocation group, so the replica is no longer redundant. The tablet's health status becomes VERSION_INCOMPLETE instead, because a replica in the DECOMMISSION state does not accept loads and its version falls behind.
  5. Because the replica in the DECOMMISSION state never receives load tasks, it cannot catch up, and the health status of the tablet stays VERSION_INCOMPLETE forever.
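
The sequence above can be sketched as a small standalone simulation. This is only an illustrative model, not the Doris FE code: the class DecommissionStuckSketch, the helpers check(), load() and simulate(), the backend IDs, and the ordering of the health checks are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the scenario; none of these types are Doris classes.
public class DecommissionStuckSketch {
    public enum ReplicaState { NORMAL, DECOMMISSION }
    public enum Health { HEALTHY, COLOCATION_REDUNDANT, VERSION_INCOMPLETE }

    static final class Replica {
        final long backendId;
        ReplicaState state = ReplicaState.NORMAL;
        long version;

        Replica(long backendId, long version) {
            this.backendId = backendId;
            this.version = version;
        }
    }

    // Assumed check order: version completeness on group backends first, then redundancy.
    static Health check(List<Replica> replicas, Set<Long> groupBackends, long visibleVersion) {
        for (Replica r : replicas) {
            if (groupBackends.contains(r.backendId) && r.version < visibleVersion) {
                return Health.VERSION_INCOMPLETE;
            }
        }
        return replicas.size() > groupBackends.size()
                ? Health.COLOCATION_REDUNDANT
                : Health.HEALTHY;
    }

    // One load: only NORMAL replicas receive the new version.
    static long load(List<Replica> replicas, long visibleVersion) {
        long next = visibleVersion + 1;
        for (Replica r : replicas) {
            if (r.state == ReplicaState.NORMAL) {
                r.version = next;
            }
        }
        return next;
    }

    // Runs steps 1 to 5 with the given number of loads after the replica is decommissioned.
    public static Health simulate(int loads) {
        List<Replica> replicas = new ArrayList<>();
        for (long be = 1; be <= 4; be++) {
            replicas.add(new Replica(be, 1L));
        }
        long visibleVersion = 1L;

        // Step 1: the group uses backends 1, 2, 3, so the replica on backend 4 is redundant.
        Set<Long> groupBackends = Set.of(1L, 2L, 3L);
        if (check(replicas, groupBackends, visibleVersion) != Health.COLOCATION_REDUNDANT) {
            throw new IllegalStateException("expected a redundant replica in step 1");
        }

        // Step 2: the scheduler marks the redundant replica as DECOMMISSION.
        replicas.get(3).state = ReplicaState.DECOMMISSION;

        // Loads keep arriving; the DECOMMISSION replica is skipped and falls behind.
        for (int i = 0; i < loads; i++) {
            visibleVersion = load(replicas, visibleVersion);
        }

        // Step 4: backend 4 is back in the colocation group before the replica was deleted.
        groupBackends = Set.of(1L, 2L, 4L);

        // Steps 3 and 5: rescheduling sees a stale replica that can never catch up.
        return check(replicas, groupBackends, visibleVersion);
    }

    public static void main(String[] args) {
        System.out.println(simulate(3));
    }
}
```

In this model, any load that arrives after step 2 leaves the replica on backend 4 behind the visible version, so every later check returns VERSION_INCOMPLETE. Nothing in the model ever moves the replica back to NORMAL, which is the missing transition the issue describes.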

Anything Else?

Why:

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Labels

kind/fix: Categorizes issue or PR as related to a bug.