orchestrator/restart: Reset restart history when task spec changes #2304
```diff
@@ -30,6 +30,13 @@ type instanceRestartInfo struct {
 	// Restart.MaxAttempts and Restart.Window are both
 	// nonzero.
 	restartedInstances *list.List
+	// Why is specVersion in this structure and not in the map key? While
+	// putting it in the key would be a very simple solution, it wouldn't
+	// be easy to clean up map entries corresponding to old specVersions.
+	// Making the key version-agnostic and clearing the value whenever the
+	// version changes avoids the issue of stale map entries for old
+	// versions.
+	specVersion api.Version
 }

 type delayedStart struct {
```
```diff
@@ -54,8 +61,7 @@ type Supervisor struct {
 	mu               sync.Mutex
 	store            *store.MemoryStore
 	delays           map[string]*delayedStart
-	history          map[instanceTuple]*instanceRestartInfo
-	historyByService map[string]map[instanceTuple]struct{}
+	historyByService map[string]map[instanceTuple]*instanceRestartInfo
 	TaskTimeout      time.Duration
 }
```
```diff
@@ -64,8 +70,7 @@ func NewSupervisor(store *store.MemoryStore) *Supervisor {
 	return &Supervisor{
 		store:            store,
 		delays:           make(map[string]*delayedStart),
-		history:          make(map[instanceTuple]*instanceRestartInfo),
-		historyByService: make(map[string]map[instanceTuple]struct{}),
+		historyByService: make(map[string]map[instanceTuple]*instanceRestartInfo),
 		TaskTimeout:      defaultOldTaskTimeout,
 	}
 }
```
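The two hunks above replace the flat `history` map plus a separate `historyByService` index with a single nested map, so forgetting a whole service becomes one `delete` on the outer map. Below is a minimal sketch of that pattern, using simplified, hypothetical stand-ins for `instanceTuple` and `instanceRestartInfo` (the real types carry more fields, such as the slot/node distinction and the spec version):

```go
package main

import "fmt"

// Hypothetical simplified stand-in: the real instanceTuple also
// distinguishes between a slot number and a node ID.
type instanceTuple struct {
	ServiceID string
	Slot      uint64
}

type instanceRestartInfo struct {
	totalRestarts uint64
}

// Supervisor (sketch): nesting per-instance history under the service ID
// makes per-service cleanup a single map delete instead of a scan.
type Supervisor struct {
	historyByService map[string]map[instanceTuple]*instanceRestartInfo
}

func (s *Supervisor) record(t instanceTuple) {
	// Lazily create the inner map for this service.
	if s.historyByService[t.ServiceID] == nil {
		s.historyByService[t.ServiceID] = make(map[instanceTuple]*instanceRestartInfo)
	}
	info := s.historyByService[t.ServiceID][t]
	if info == nil {
		info = &instanceRestartInfo{}
		s.historyByService[t.ServiceID][t] = info
	}
	info.totalRestarts++
}

func (s *Supervisor) clearService(serviceID string) {
	delete(s.historyByService, serviceID) // drops every instance entry at once
}

func main() {
	s := &Supervisor{historyByService: make(map[string]map[instanceTuple]*instanceRestartInfo)}
	s.record(instanceTuple{ServiceID: "svc1", Slot: 1})
	s.record(instanceTuple{ServiceID: "svc1", Slot: 1})
	s.record(instanceTuple{ServiceID: "svc1", Slot: 2})
	fmt.Println(s.historyByService["svc1"][instanceTuple{ServiceID: "svc1", Slot: 1}].totalRestarts) // 2
	s.clearService("svc1")
	fmt.Println(len(s.historyByService)) // 0
}
```

A convenient property of this layout in Go: reading through a nil inner map is safe and yields the zero value, so lookups like `r.historyByService[t.ServiceID][instanceTuple]` need no existence check on the outer entry.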
|
|
```diff
@@ -214,8 +219,8 @@ func (r *Supervisor) shouldRestart(ctx context.Context, t *api.Task, service *ap
 	r.mu.Lock()
 	defer r.mu.Unlock()
 
-	restartInfo := r.history[instanceTuple]
-	if restartInfo == nil {
+	restartInfo := r.historyByService[t.ServiceID][instanceTuple]
```
**Member:** Perhaps I'm mis-interpreting the TODO at the top; but doesn't this make it even more dependent on …

**Author (Collaborator):** I believe the TODO is saying that …

**Member:** Was thinking if the TODO meant: instance IDs are globally unique, thus we only need to track per instance.

**Author (Collaborator):** No, we don't have a concept of an instance ID. We track instances as a combination of service ID and either a slot number or node ID.
```diff
+	if restartInfo == nil || (t.SpecVersion != nil && *t.SpecVersion != restartInfo.specVersion) {
 		return true
 	}
```
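The updated condition reads: a task may restart with a clean slate when no history exists, or when the recorded history belongs to an older spec version. A standalone restatement of that check, with a hypothetical simplified `Version` standing in for `api.Version`:

```go
package main

import "fmt"

// Version is a hypothetical stand-in for api.Version, which identifies
// a revision of the service spec.
type Version struct{ Index uint64 }

// historyApplies mirrors the shouldRestart check: recorded restart history
// only counts against the task if it exists and was recorded under the
// task's current spec version.
func historyApplies(taskVersion, recorded *Version) bool {
	if recorded == nil {
		return false // no history at all
	}
	if taskVersion != nil && *taskVersion != *recorded {
		return false // service was updated; old failures don't count
	}
	return true
}

func main() {
	v1, v2 := &Version{Index: 10}, &Version{Index: 11}
	fmt.Println(historyApplies(v1, v1)) // true: same spec, history counts
	fmt.Println(historyApplies(v2, v1)) // false: spec changed, start fresh
	fmt.Println(historyApplies(v1, nil)) // false: nothing recorded yet
}
```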
|
|
|
|
```diff
@@ -268,17 +273,26 @@ func (r *Supervisor) recordRestartHistory(restartTask *api.Task) {
 	r.mu.Lock()
 	defer r.mu.Unlock()
 
-	if r.history[tuple] == nil {
-		r.history[tuple] = &instanceRestartInfo{}
+	if r.historyByService[restartTask.ServiceID] == nil {
+		r.historyByService[restartTask.ServiceID] = make(map[instanceTuple]*instanceRestartInfo)
 	}
+	if r.historyByService[restartTask.ServiceID][tuple] == nil {
+		r.historyByService[restartTask.ServiceID][tuple] = &instanceRestartInfo{}
+	}
 
-	restartInfo := r.history[tuple]
-	restartInfo.totalRestarts++
+	restartInfo := r.historyByService[restartTask.ServiceID][tuple]
 
-	if r.historyByService[restartTask.ServiceID] == nil {
-		r.historyByService[restartTask.ServiceID] = make(map[instanceTuple]struct{})
+	if restartTask.SpecVersion != nil && *restartTask.SpecVersion != restartInfo.specVersion {
+		// This task has a different SpecVersion from the one we're
+		// tracking. Most likely, the service was updated. Past failures
+		// shouldn't count against the new service definition, so clear
+		// the history for this instance.
+		*restartInfo = instanceRestartInfo{
+			specVersion: *restartTask.SpecVersion,
+		}
 	}
-	r.historyByService[restartTask.ServiceID][tuple] = struct{}{}
 
+	restartInfo.totalRestarts++
```
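The reset in the hunk above works by overwriting the struct through its pointer: every field returns to its zero value in place, so the map entry is reused while all accumulated counters are dropped. A minimal sketch of that mechanism, again assuming a simplified hypothetical `Version` in place of `api.Version`:

```go
package main

import "fmt"

// Version is a hypothetical stand-in for api.Version.
type Version struct{ Index uint64 }

// Simplified instanceRestartInfo; the real one also holds a
// *list.List of recent restart timestamps.
type instanceRestartInfo struct {
	totalRestarts uint64
	specVersion   Version
}

// resetIfUpdated mirrors the reset in recordRestartHistory: if the task's
// spec version differs from the tracked one, zero the whole struct in place
// so past failures no longer count against the new service definition.
func resetIfUpdated(info *instanceRestartInfo, taskVersion *Version) {
	if taskVersion != nil && *taskVersion != info.specVersion {
		*info = instanceRestartInfo{specVersion: *taskVersion}
	}
}

func main() {
	info := &instanceRestartInfo{totalRestarts: 5, specVersion: Version{Index: 1}}

	resetIfUpdated(info, &Version{Index: 1})
	fmt.Println(info.totalRestarts) // 5: same spec version, history kept

	resetIfUpdated(info, &Version{Index: 2})
	fmt.Println(info.totalRestarts) // 0: spec changed, history cleared
}
```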
|
**Member:** So do we track max attempted restarts for a task, not for a service? Does this mean that if I have a service with two replicas, and one of those replicas keeps failing, that replica is no longer restarted, but the other replica is kept running? (i.e., the service running in degraded mode)

**Member:** Guess I didn't realise that (it's confusing which options apply to the service as a whole, and which ones apply to individual tasks).

**Author (Collaborator):** It's per "instance", so either per-replica or per-node depending on whether it's a replicated or global service. I believe the reasoning for this was that if you have one node that's broken or misbehaving and has tasks restarting in a loop, you don't want that to impact the ability to restart other tasks when they encounter occasional problems. By tracking restarts per-instance, the instances don't share a global maximum number of restarts. It seems more useful overall. But I agree it can be confusing.

**Member:** Perhaps a future addition; on failure, re-schedule the task 🤔

**Author (Collaborator):** We actually do have a mechanism that reschedules tasks after they have failed on the same node repeatedly: #1552. Of course, it only applies to replicated services.

**Member:** TIL. @mstanleyjones, not sure if we documented that?

**Member:** I believe we have. It sounds familiar.
```diff
 	if restartTask.Spec.Restart.Window != nil && (restartTask.Spec.Restart.Window.Seconds != 0 || restartTask.Spec.Restart.Window.Nanos != 0) {
 		if restartInfo.restartedInstances == nil {
```
|
|
```diff
@@ -432,16 +446,6 @@ func (r *Supervisor) CancelAll() {
 // ClearServiceHistory forgets restart history related to a given service ID.
 func (r *Supervisor) ClearServiceHistory(serviceID string) {
 	r.mu.Lock()
+	defer r.mu.Unlock()
 
-	tuples := r.historyByService[serviceID]
-	if tuples == nil {
-		return
-	}
-
 	delete(r.historyByService, serviceID)
```
**Member:** nit: Wondering if this should also re-initialize the history.

**Author (Collaborator):** This is called when the service is deleted, so we want the map entry for it to be gone.

**Member:** Ah, makes sense.
```diff
-	for t := range tuples {
-		delete(r.history, t)
-	}
-	r.mu.Unlock()
 }
```