De-flake snapshot test by banks · Pull Request #17120 · hashicorp/consul

banks · 2023-04-25T11:54:03Z

Description

I found the snapshot container test was failing a lot with a nil pointer panic in the test runner itslelf.

It seems like there is some non-determinism in exactly when the follower has the snapshot installed - if we try to read from it before the snapshot install completes we get a nil key and panic.

This fixes that flake with a retry on the first Get from the follower so we will "wait" for a few seconds for the snapshot install to happen before running through and asserting all the keys are present.

dhiaayachi

Thank you for making this test more reliable @banks
I have a question inline, but it's more for my understanding 😅

dhiaayachi · 2023-04-25T12:52:54Z

@@ -105,11 +108,27 @@ func testSnapShotRestoreForLogStore(t *testing.T, logStore libcluster.LogStore)
 		require.Equal(r, LeaderLogIndex, followerLogIndex)


@banks Isn't this supposed to guarantee that the snapshot is applied to the follower? What am I missing here ?

Not quite. It almost does in that it won't return until the NoOp log after the restore on leader is committed, and for it to be committed it needs to have been replicated on a quorum of servers, and for it to have been replicated on a follower, that follower needs to have also applied the snapshot...

But I suspect what is happening here when this fails (and causes a nil pointer panic in the test) is that only one of the followers plus leader has committed it but we pick a different follower to query who hasn't yet and didn't participate in that quorum.

Honestly, I'm most confused how I saw this fail multiple times in a row on one specific PR when it doesn't seem to fail as often on others.

Isn't fc.Agent().Self() return the local last_log_index? which is for the same follower we are querying after to read the entries. I'm wondering if this is related to some kind of race inside consul state where the stat is updated before the data is actually available 🤔

In all cases I think it make sense to add this as it make the test more robust

I actually missed the context of your question totally and didn't see this loop... Yeah I think you're right.

In general last_log_index of a follower only tells us that that log was written to the log store not necessarily comitted globally or applied to the FSM yet. (We could use lastAppliedIndex of lastSnapshotIndex here instead to be sure, but probably doesn't matter because...)

But in this case, the leader must have written the NoOp log after the snapshot for that to have returned cleanly and so it's index must be at least that record. The follower should only get to that record once it has restored that new snapshot since it will hit the "gap" before it and so leader will call installSnapshot on it. InstallSnapshot waits for the FSM restore to complete synchronously before returning and the leader will not continue to replicate the NoOp log until it's done that. So yeah it would seem like this loop should already be waiting for snapshot restore to have completed at least on the specific follower that fc is pointing to.

I can't see any way that stat could be updated before the write to disk happened either 🤔 .

Which means the nil panic on fetch below could be a different issue and maybe an actual bug. If so at least it would still fail with this change although take a few seconds longer. I'll go back and look at the logs harder to see if there are any other clues.

Check out the logs here: https://github.com/hashicorp/consul-enterprise/actions/runs/4788691965/jobs/8519472432

You can actually see a lot - the Get request goes to server-1 which has only just started installing the snapshot but not completed it yet. server 2 the other follower already had.

It also looks like the leader started shutting down during this time but I wonder if that's just races in log collection between the containers or something...

Here's the actual line of the read. https://github.com/hashicorp/consul-enterprise/actions/runs/4788691965/jobs/8519472432#step:9:1325

It happens right before the same server completes it's snapshot restore (although it was already started). 🤔

This is all consistent with my hypothesis, but I have no idea how it makes sense given the retry-until-caught-up you pointed out above Dhia!

banks · 2023-04-25T14:25:18Z

Going to merge this since CI fails are all lint issues in unrelated code being fixed elsewhere or unrelated to this test and I want to get this test to be un-flaked for other PRs!

* De-flake snapshot test (#17120) * Add missing import to test. * Swap retry t reference. --------- Co-authored-by: Paul Banks <pbanks@hashicorp.com>

De-flake snapshot test

c763e34

banks added pr/no-changelog PR does not need a corresponding .changelog entry pr/no-backport labels Apr 25, 2023

banks requested a review from dhiaayachi April 25, 2023 11:54

boxofrad approved these changes Apr 25, 2023

View reviewed changes

dhiaayachi approved these changes Apr 25, 2023

View reviewed changes

banks merged commit 9e35c47 into main Apr 25, 2023

banks deleted the deflake-snapshot-test branch April 25, 2023 14:25

hashi-derek pushed a commit that referenced this pull request May 11, 2023

De-flake snapshot test (#17120)

e25f1d2

hashi-derek mentioned this pull request May 11, 2023

Backport flaky-test fix to 1.15 #17309

Merged

hashi-derek added a commit that referenced this pull request May 11, 2023

Backport flaky-test fix to 1.15 (#17309)

4f99d24

* De-flake snapshot test (#17120) * Add missing import to test. * Swap retry t reference. --------- Co-authored-by: Paul Banks <pbanks@hashicorp.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

De-flake snapshot test#17120

De-flake snapshot test#17120
banks merged 1 commit into
mainfrom
deflake-snapshot-test

banks commented Apr 25, 2023

Uh oh!

dhiaayachi left a comment

Uh oh!

dhiaayachi Apr 25, 2023

Uh oh!

banks Apr 25, 2023

Uh oh!

dhiaayachi Apr 25, 2023

Uh oh!

banks Apr 26, 2023

Uh oh!

banks Apr 26, 2023

Uh oh!

banks Apr 26, 2023 •

edited

Loading

Uh oh!

banks Apr 26, 2023 •

edited

Loading

Uh oh!

banks commented Apr 25, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		@@ -105,11 +108,27 @@ func testSnapShotRestoreForLogStore(t *testing.T, logStore libcluster.LogStore)
		require.Equal(r, LeaderLogIndex, followerLogIndex)

Conversation

banks commented Apr 25, 2023

Description

Uh oh!

dhiaayachi left a comment

Choose a reason for hiding this comment

Uh oh!

dhiaayachi Apr 25, 2023

Choose a reason for hiding this comment

Uh oh!

banks Apr 25, 2023

Choose a reason for hiding this comment

Uh oh!

dhiaayachi Apr 25, 2023

Choose a reason for hiding this comment

Uh oh!

banks Apr 26, 2023

Choose a reason for hiding this comment

Uh oh!

banks Apr 26, 2023

Choose a reason for hiding this comment

Uh oh!

banks Apr 26, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

banks Apr 26, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

banks commented Apr 25, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

banks Apr 26, 2023 •

edited

Loading

banks Apr 26, 2023 •

edited

Loading