-
Notifications
You must be signed in to change notification settings - Fork 656
Fix task reaper batching #2669
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix task reaper batching #2669
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -40,6 +40,12 @@ type TaskReaper struct { | |
| cleanup []string | ||
| stopChan chan struct{} | ||
| doneChan chan struct{} | ||
|
|
||
| // tickSignal is a channel that, if non-nil and available, will be written | ||
| // to to signal that a tick has occurred. its sole purpose is for testing | ||
| // code, to verify that take cleanup attempts are happening when they | ||
| // should be. | ||
| tickSignal chan struct{} | ||
| } | ||
|
|
||
| // New creates a new TaskReaper. | ||
|
|
@@ -115,7 +121,34 @@ func (tr *TaskReaper) Run(ctx context.Context) { | |
|
|
||
| // Clean up when we hit TaskHistoryRetentionLimit or when the timer expires, | ||
| // whichever happens first. | ||
| // | ||
| // Specifically, the way this should work: | ||
| // - Create a timer and immediately stop it. We don't want to fire the | ||
| // cleanup routine yet, because we just did a cleanup as part of the | ||
| // initialization above. | ||
| // - Launch into an event loop | ||
| // - When we receive an event, handle the event as needed | ||
| // - After receiving the event: | ||
| // - If minimum batch size (maxDirty) is exceeded with dirty + cleanup, | ||
| // then immediately launch into the cleanup routine | ||
| // - Otherwise, if the timer is stopped, start it (reset). | ||
| // - If the timer expires and the timer channel is signaled, then Stop the | ||
| // timer (so that it will be ready to be started again as needed), and | ||
| // execute the cleanup routine (tick) | ||
| timer := time.NewTimer(reaperBatchingInterval) | ||
| timer.Stop() | ||
|
|
||
| // If stop is somehow called AFTER the timer has expired, there will be a | ||
| // value in the timer.C channel. If there is such a value, we should drain | ||
| // it out. This select statement allows us to drain that value if it's | ||
| // present, or continue straight through otherwise. | ||
| select { | ||
| case <-timer.C: | ||
| default: | ||
| } | ||
|
|
||
| // keep track with a boolean of whether the timer is currently stopped | ||
| isTimerStopped := true | ||
|
|
||
| // Watch for: | ||
| // 1. EventCreateTask for cleaning slots, which is the best time to cleanup that node/slot. | ||
|
|
@@ -153,23 +186,52 @@ func (tr *TaskReaper) Run(ctx context.Context) { | |
| } | ||
|
|
||
| if len(tr.dirty)+len(tr.cleanup) > maxDirty { | ||
| // stop the timer, so we don't fire it. if we get another event | ||
| // after we do this cleaning, we will reset the timer then | ||
| timer.Stop() | ||
| // if the timer had fired, drain out the value. | ||
| select { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is used in a few places. Can we move this to a function called drainTimer() ?
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think it's worthwhile to move this to a function. it's essentially just an if statement for a channel read. sort of like but you can't do such an if statement, so you do this select with an empty default.
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. alternatively you can do if !timer.Stop() {
<-timer.C
}which is how the docs drain a channel, but for some reason doing it that way was deadlocking, so i did it the easy way instead of trying to figure out what was weird.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see your point. The second option is more readable than the first one. My point was that its 4 lines of code thats repeated at least 3x, which can be replaced by a single function call. But I'll let you decide. |
||
| case <-timer.C: | ||
| default: | ||
| } | ||
| isTimerStopped = true | ||
| tr.tick() | ||
| } else { | ||
| timer.Reset(reaperBatchingInterval) | ||
| if isTimerStopped { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. does it matter here if you always call Reset() without checking if the timer was stopped ?
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you've got the added and removed lines confused.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No :). I'm asking why not always call Reset() without checking isTimerStopped. At the same time, I'm also questioning if we need the bool isTimerStopped and whether we can do without it. @dperny
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Non-blocking - this code's logic seems fine, but looking at some of our vendored code like GRPC (see clientconn.go) it seems like the pattern is to create a timer at the beginning of the loop. If the timer fires, set the timer to I am fine with this as is, though.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @anshulpundir If I'm understanding the intent correctly, we want to clean up whenever |
||
| timer.Reset(reaperBatchingInterval) | ||
| isTimerStopped = false | ||
| } | ||
| } | ||
| case <-timer.C: | ||
| timer.Stop() | ||
| // we can safely ignore draining off of the timer channel, because | ||
| // we already know that the timer has expired. | ||
| isTimerStopped = true | ||
| tr.tick() | ||
| case <-tr.stopChan: | ||
| // even though this doesn't really matter in this context, it's | ||
| // good hygiene to drain the value. | ||
| timer.Stop() | ||
| select { | ||
| case <-timer.C: | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. move to a function. |
||
| default: | ||
| } | ||
| return | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // tick performs task history cleanup. | ||
| func (tr *TaskReaper) tick() { | ||
| // this signals that a tick has occurred. it exists solely for testing. | ||
| if tr.tickSignal != nil { | ||
| // try writing to this channel, but if it's full, fall straight through | ||
| // and ignore it. | ||
| select { | ||
| case tr.tickSignal <- struct{}{}: | ||
| default: | ||
| } | ||
| } | ||
|
|
||
| if len(tr.dirty) == 0 && len(tr.cleanup) == 0 { | ||
| return | ||
| } | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we instead use the side-effects caused by tick() to test ?
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not cleanly, as far as i could tell, unless you have a better idea?
additionally, it would make the test fragile to changes in the implementation of
tick. right now, the test is only relevant to the implementation of the code under test.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe. Can we just check to see if the tasks were actually cleaned up to verify that tick() was called ?
Alternatively, we could just add a counter to count the number of times tick() was executed. The benefit of this is that we could additionally use that counter to expose stats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bringing in a conversation @anshulpundir and i had in meatspace:
i've chosen to add this code for the purpose of determining affirmatively or negatively that
tickhas or has not been called. Relying on side effects can tell that tick has been called, but cannot prove definitively that it has not been called.