
KAFKA-12181; Loosen raft fetch offset validation of remote replicas #10309

Merged
hachikuji merged 7 commits into apache:trunk from hachikuji:KAFKA-12181
Mar 22, 2021

Conversation

@hachikuji
Contributor

Currently the Raft leader raises an exception if there is a non-monotonic update to the fetch offset of a replica. In a situation where the replica had lost its disk state, this would prevent the replica from being able to recover. In this patch, we relax the validation to address this problem. It is worth pointing out that this validation could not be relied on to protect from data loss after a voter has lost committed state.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Member

@jsancio jsancio left a comment


Thanks for the PR. Looks good in general.

Comment thread raft/src/main/java/org/apache/kafka/raft/LeaderState.java Outdated
Comment thread raft/src/test/java/org/apache/kafka/raft/LeaderStateTest.java Outdated
Comment thread raft/src/test/java/org/apache/kafka/raft/RaftEventSimulationTest.java Outdated
Comment thread raft/src/test/java/org/apache/kafka/raft/RaftEventSimulationTest.java Outdated
Member

@jsancio jsancio left a comment


LGTM. Thanks for the improvement.


@abbccdda abbccdda left a comment


Thanks for the patch, left some comments.

        Set<Integer> voters,
-       Set<Integer> grantingVoters
+       Set<Integer> grantingVoters,
+       LogContext logContext

Just for my own education, when is it preferable to use the enclosing class's log context vs creating our own log context?

Contributor Author


The log context is useful because it carries with it a logging prefix which can be used to distinguish log messages. For example, in a streams application, the fact that we have multiple producers can make debugging difficult. Or in the context of integration/system/simulation testing, we often get logs from multiple nodes mixed together. With a common prefix, it is easy to grep messages for a particular instance so long as the LogContext is carried through to all the dependencies. Sometimes it is a little annoying to add the extra parameter, but it is worthwhile for improved debugging whenever the parent object already has a log context.
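The prefix idea can be sketched in a few lines. The `LogContext` and `PrefixedLogger` classes below are simplified stand-ins for Kafka's `org.apache.kafka.common.utils.LogContext` (which wraps SLF4J loggers), written to be self-contained rather than to mirror the real implementation:

```java
// Simplified stand-in for Kafka's LogContext: every logger created from
// the context shares a prefix, so log lines from one node/instance are
// easy to grep in interleaved output.
public class LogContextSketch {

    static final class LogContext {
        private final String prefix;

        LogContext(String prefix) {
            this.prefix = prefix;
        }

        // In Kafka this returns an SLF4J Logger wrapper; here we return
        // a tiny formatter to keep the sketch self-contained.
        PrefixedLogger logger(Class<?> clazz) {
            return new PrefixedLogger(prefix + clazz.getSimpleName() + ": ");
        }
    }

    static final class PrefixedLogger {
        private final String prefix;

        PrefixedLogger(String prefix) {
            this.prefix = prefix;
        }

        String format(String message) {
            return prefix + message;
        }
    }

    public static void main(String[] args) {
        LogContext ctx = new LogContext("[RaftManager nodeId=1] ");
        PrefixedLogger log = ctx.logger(LogContextSketch.class);
        System.out.println(log.format("Became leader in epoch 5"));
    }
}
```

Passing the same `LogContext` down to dependencies is what makes the shared prefix appear on every line for that instance.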

Comment thread raft/src/test/java/org/apache/kafka/raft/RaftEventSimulationTest.java Outdated
    public void testNoOpForNegativeRemoteNodeId() {
        int observerId = -1;
        long endOffset = 10L;
        long epochStartOffset = 10L;

So this offset was named wrong previously?

Contributor Author


I'm not sure I'd call it wrong. The epoch start offset is initialized as the current log end offset. But I thought it was better to choose a more explicit name.
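The relationship being described can be shown in a toy snippet (variable names are illustrative, not taken from the test):

```java
// When a leader is elected, the start offset of its new epoch is simply
// the current end offset of its log, which is why both test values above
// are 10L and why "epochStartOffset" is the more explicit name.
public class EpochStartOffsetSketch {
    public static void main(String[] args) {
        long logEndOffset = 10L;              // offsets 0..9 already written
        long epochStartOffset = logEndOffset; // new epoch begins where the log ends
        System.out.println("epochStartOffset=" + epochStartOffset);
    }
}
```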

            throw new IllegalStateException("Detected non-monotonic update of local " +
                "end offset: " + currentEndOffset.offset + " -> " + endOffsetMetadata.offset);
        } else {
            log.warn("Detected non-monotonic update of fetch offset from nodeId {}: {} -> {}",
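The shape of the relaxed validation can be sketched as follows. A non-monotonic end-offset update for the leader's own log is treated as a fatal invariant violation, while one from a remote replica (which may have lost its disk state) is only logged so the replica can catch back up. `ReplicaState` and `updateEndOffset` are illustrative names for this sketch, not the actual `LeaderState.java` code:

```java
// Hedged sketch of the relaxed check: fatal for the local leader log,
// warn-and-accept for remote replicas.
public class FetchOffsetValidationSketch {

    static final class ReplicaState {
        final int nodeId;
        final boolean isLocalLeader;
        long endOffset = -1L;

        ReplicaState(int nodeId, boolean isLocalLeader) {
            this.nodeId = nodeId;
            this.isLocalLeader = isLocalLeader;
        }

        void updateEndOffset(long newEndOffset) {
            if (newEndOffset < endOffset) {
                if (isLocalLeader) {
                    // The leader's own log must never move backwards.
                    throw new IllegalStateException("Detected non-monotonic update of local " +
                        "end offset: " + endOffset + " -> " + newEndOffset);
                } else {
                    // Warn and accept: the follower will re-fetch and catch up.
                    System.out.printf("Detected non-monotonic update of fetch offset " +
                        "from nodeId %d: %d -> %d%n", nodeId, endOffset, newEndOffset);
                }
            }
            endOffset = newEndOffset;
        }
    }

    public static void main(String[] args) {
        ReplicaState follower = new ReplicaState(2, false);
        follower.updateEndOffset(10L);
        follower.updateEndOffset(0L); // accepted with a warning
        System.out.println("follower endOffset=" + follower.endOffset);

        ReplicaState leader = new ReplicaState(1, true);
        leader.updateEndOffset(10L);
        try {
            leader.updateEndOffset(5L);
        } catch (IllegalStateException e) {
            System.out.println("leader rejected: " + e.getMessage());
        }
    }
}
```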

I wonder whether the current approach is too loose. Maybe this is already done, but do we want to inform failed replica to cleanup or truncate in the FetchResponse?

Contributor Author


The situation we are trying to handle is when a follower loses its disk. Basically the damage is already done by the time we receive the Fetch and the only thing we can do is let the follower try to catch back up. The problem with the old logic is that it prevented this even in situations which would not violate guarantees. I am planning to file a follow-up jira to think of some ways to handle disk loss situations more generally. We would like to at least detect the situation and see if we can prevent it from causing too much damage.


@abbccdda abbccdda left a comment


LGTM

@hachikuji hachikuji merged commit f5f66b9 into apache:trunk Mar 22, 2021
Terrdi pushed a commit to Terrdi/kafka that referenced this pull request Apr 1, 2021
…pache#10309)

Currently the Raft leader raises an exception if there is a non-monotonic update to the fetch offset of a replica. In a situation where the replica had lost its disk state, this would prevent the replica from being able to recover. In this patch, we relax the validation to address this problem. It is worth pointing out that this validation could not be relied on to protect from data loss after a voter has lost committed state.

Reviewers: José Armando García Sancio <jsancio@gmail.com>, Boyang Chen <boyang@confluent.io>
@ijuma ijuma added the kraft label Aug 11, 2021
