node config: add default_ineligible configuration option#27965
Conversation
Specifies if scheduling eligibility should be disabled by default for new client nodes.
| var out structs.SingleNodeResponse | ||
|
|
||
| // Register should succeed | ||
| testutil.WaitForResult(func() (bool, error) { |
There was a problem hiding this comment.
For future work, this testutil.WaitForResult helper is really dodgy and has weird local-vs-CI behaviors. We have must.Wait in shoenig/test that's a little easier to use correctly and gives you fine-grained control.
| return false, fmt.Errorf("missing reg") | ||
| } | ||
| must.Eq(t, structs.NodeSchedulingEligible, out3.Node.SchedulingEligibility) | ||
| return out3.Node.ID == req.NodeID, nil |
There was a problem hiding this comment.
This is a weird assertion. Why would we ever get a different node here than the one we requested? And even if we did, why does that belong in this test?
There was a problem hiding this comment.
Yeah, I see how this just smells weird. I re-used the original request since the query is identical. I can create a req3 so it doesn't make a person's eyes cross.
I'd done the same thing intentionally on ln 2399 to demonstrate that c1 and c2 are the same node, but I can rip that out also.
There was a problem hiding this comment.
That's not really the issue so much as we're checking that we got the node ID we asked for, which would maybe be meaningful in a test of the node RPCs, but this is a test for the client configuration on startup.
| // Update eligibility should succeed | ||
| must.Wait(t, wait.InitialSuccess( | ||
| wait.ErrorFunc(func() error { | ||
| err := s1.RPC("Node.UpdateEligibility", &req2, &out2) | ||
| if err != nil { | ||
| return err | ||
| } | ||
| must.NotEq(t, out2.NodeModifyIndex, out.Index) | ||
| return nil | ||
| }), | ||
| )) |
There was a problem hiding this comment.
Why are we waiting here? Isn't this a synchronous update?
| return false, fmt.Errorf("missing reg") | ||
| } | ||
| must.Eq(t, structs.NodeSchedulingIneligible, out.Node.SchedulingEligibility) | ||
| return out.Node.ID == req.NodeID, nil |
There was a problem hiding this comment.
| return out.Node.ID == req.NodeID, nil | |
| return true, nil |
| must.Wait(t, wait.InitialSuccess( | ||
| wait.ErrorFunc(func() error { | ||
| err := s1.RPC("Node.GetNode", &req3, &out3) | ||
| if err != nil { | ||
| return err | ||
| } | ||
| must.Eq(t, structs.NodeSchedulingEligible, out3.Node.SchedulingEligibility) | ||
| return nil | ||
| }), | ||
| wait.Timeout(time.Second*1), | ||
| wait.Gap(time.Millisecond*30), | ||
| )) |
There was a problem hiding this comment.
Same here. The state store write was synchronous, so why are we waiting? Should we be waiting to see if it changes with wait.ContinualSuccesss? How long would we need to wait for if we did?
| must.Wait(t, wait.InitialSuccess( | ||
| wait.ErrorFunc(func() error { | ||
| err := s1.RPC("Node.GetNode", &req4, &out4) | ||
| if err != nil { | ||
| return err | ||
| } | ||
| must.Eq(t, structs.NodeSchedulingEligible, out4.Node.SchedulingEligibility) | ||
| return nil | ||
| }), | ||
| wait.Timeout(time.Second*1), | ||
| wait.Gap(time.Millisecond*30), | ||
| )) |
There was a problem hiding this comment.
This will immediately succeed even if it flips, because the node just started and it takes a little bit to send a fingerprint. I think we need to wait until the initial fingerprint is sent as well, right? Can we detect that from this server RPC?
This PR picks up where @wjordan left off in #13082 as suggested in NMD-1461.
The original PR creates a
DefaultIneligibleconfig option on the client that allows a user to set nodes as ineligible at initialization.Restart behavior came up in the original review and @wjordan correctly identified that
upsertNodeTxnin state_store.go overwrites eligibility with the last known state of recognized nodes.I added some steps to their TestClient_DefaultIneligble test to capture that behavior. The test now marks the node as
Eligiblequeries to confirm desired state, restarts the node, and queries once again to confirm desired state.Links
Refs:
NMD-1461
Closes: GH #10329
Rebase & inclusion of @wjordan's PR
Contributor Checklist
changelog entry using the
make clcommand.ensure regressions will be caught.
and job configuration, please update the Nomad product documentation, which is stored in the
web-unified-docsrepo. Refer to theweb-unified-docscontributor guide for docs guidelines.Please also consider whether the change requires notes within the upgrade
guide. If you would like help with the docs, tag the
nomad-docsteam in this PR.Reviewer Checklist
backporting document.
in the majority of situations. The main exceptions are long-lived feature branches or merges where
history should be preserved.
within the public repository.
Changes to Security Controls
Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.