De-flake snapshot test#17120
Conversation
dhiaayachi
left a comment
There was a problem hiding this comment.
Thank you for making this test more reliable @banks
I have a question inline, but it's more for my understanding 😅
| @@ -105,11 +108,27 @@ func testSnapShotRestoreForLogStore(t *testing.T, logStore libcluster.LogStore) | |||
| require.Equal(r, LeaderLogIndex, followerLogIndex) | |||
There was a problem hiding this comment.
@banks Isn't this supposed to guarantee that the snapshot is applied to the follower? What am I missing here ?
There was a problem hiding this comment.
Not quite. It almost does in that it won't return until the NoOp log after the restore on leader is committed, and for it to be committed it needs to have been replicated on a quorum of servers, and for it to have been replicated on a follower, that follower needs to have also applied the snapshot...
But I suspect what is happening here when this fails (and causes a nil pointer panic in the test) is that only one of the followers plus leader has committed it but we pick a different follower to query who hasn't yet and didn't participate in that quorum.
Honestly, I'm most confused how I saw this fail multiple times in a row on one specific PR when it doesn't seem to fail as often on others.
There was a problem hiding this comment.
Isn't fc.Agent().Self() return the local last_log_index? which is for the same follower we are querying after to read the entries. I'm wondering if this is related to some kind of race inside consul state where the stat is updated before the data is actually available 🤔
In all cases I think it make sense to add this as it make the test more robust
There was a problem hiding this comment.
I actually missed the context of your question totally and didn't see this loop... Yeah I think you're right.
In general last_log_index of a follower only tells us that that log was written to the log store not necessarily comitted globally or applied to the FSM yet. (We could use lastAppliedIndex of lastSnapshotIndex here instead to be sure, but probably doesn't matter because...)
But in this case, the leader must have written the NoOp log after the snapshot for that to have returned cleanly and so it's index must be at least that record. The follower should only get to that record once it has restored that new snapshot since it will hit the "gap" before it and so leader will call installSnapshot on it. InstallSnapshot waits for the FSM restore to complete synchronously before returning and the leader will not continue to replicate the NoOp log until it's done that. So yeah it would seem like this loop should already be waiting for snapshot restore to have completed at least on the specific follower that fc is pointing to.
I can't see any way that stat could be updated before the write to disk happened either 🤔 .
Which means the nil panic on fetch below could be a different issue and maybe an actual bug. If so at least it would still fail with this change although take a few seconds longer. I'll go back and look at the logs harder to see if there are any other clues.
There was a problem hiding this comment.
Check out the logs here: https://github.com/hashicorp/consul-enterprise/actions/runs/4788691965/jobs/8519472432
You can actually see a lot - the Get request goes to server-1 which has only just started installing the snapshot but not completed it yet. server 2 the other follower already had.
It also looks like the leader started shutting down during this time but I wonder if that's just races in log collection between the containers or something...
There was a problem hiding this comment.
Here's the actual line of the read. https://github.com/hashicorp/consul-enterprise/actions/runs/4788691965/jobs/8519472432#step:9:1325
It happens right before the same server completes it's snapshot restore (although it was already started). 🤔
There was a problem hiding this comment.
This is all consistent with my hypothesis, but I have no idea how it makes sense given the retry-until-caught-up you pointed out above Dhia!
|
Going to merge this since CI fails are all lint issues in unrelated code being fixed elsewhere or unrelated to this test and I want to get this test to be un-flaked for other PRs! |
* De-flake snapshot test (#17120) * Add missing import to test. * Swap retry t reference. --------- Co-authored-by: Paul Banks <pbanks@hashicorp.com>
Description
I found the snapshot container test was failing a lot with a nil pointer panic in the test runner itslelf.
It seems like there is some non-determinism in exactly when the follower has the snapshot installed - if we try to read from it before the snapshot install completes we get a nil key and panic.
This fixes that flake with a retry on the first Get from the follower so we will "wait" for a few seconds for the snapshot install to happen before running through and asserting all the keys are present.