KAFKA-7790: Fix Bugs in Trogdor Task Expiration #6103
cmccabe merged 5 commits into apache:trunk from stanislavkozlovski:trogdor-timeout
This commit changes a Trogdor agent/coordinator's behavior to not run tasks that have expired. We define an expired task as one whose sum of `startedMs` and `durationMs` is less than the current time in milliseconds.
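The expiry rule in the description can be written as a single comparison. The following is only a minimal sketch: the class `TaskExpirySketch` and the explicit `nowMs` parameter are hypothetical names used for illustration and are not taken from the Trogdor code.

```java
/** Hypothetical helper illustrating the expiry rule; not the actual Trogdor code. */
final class TaskExpirySketch {
    private TaskExpirySketch() { }

    /**
     * Returns true when startedMs + durationMs is strictly less than nowMs,
     * which is the definition of an expired task given in the description.
     * All three values are wall-clock timestamps or durations in milliseconds.
     */
    static boolean hasExpired(long startedMs, long durationMs, long nowMs) {
        return startedMs + durationMs < nowMs;
    }
}
```

Under this rule, a task that started at 1000 ms with a duration of 500 ms is still live at exactly 1500 ms and counts as expired from 1501 ms onward.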
    try {
        client.createWorker(new CreateWorkerRequest(workerId, taskId, spec));
        if (startedMs == -1)
            startedMs = time.milliseconds();
Here the Coordinator keeps track of when it first ran the task, which is useful in cases where spec.startMs==0
    if (!worker.hasExpired()) {
        worker.tryCreate();
    } else {
        log.info("{}: Will not create worker state {} as it has expired. ", node.name(), worker.state);
This covers the case where a Coordinator re-schedules a task when an Agent is detected via heartbeats
    } catch (Throwable t) {
        failure = "Failed to create TaskController: " + t.getMessage();
    }
    if (spec.hasExpired(time, -1))
This is where a Coordinator is given a brand new task that has already expired
…nt NPathComplexity
Retest this please
    public boolean hasExpired(Time time, long startedMs) {
        long startMs = this.startMs > 0 ? this.startMs : startedMs;
        if (startMs <= 0) // task doesn't have a start time yet
            return false;
We shouldn't be special-casing 0 here. In general, I think the only thing we need here is an accessor function like endMs which returns startMs + durationMs.
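The accessor suggested in this comment could look roughly like the sketch below. `TaskSpecSketch` is a hypothetical stand-in for the real spec class; only the names `startMs`, `durationMs`, and `endMs` come from the discussion, and the rest is assumed for illustration.

```java
/** Hypothetical stand-in for a task spec; only the field names follow the discussion. */
final class TaskSpecSketch {
    private final long startMs;
    private final long durationMs;

    TaskSpecSketch(long startMs, long durationMs) {
        this.startMs = startMs;
        this.durationMs = durationMs;
    }

    long startMs() {
        return startMs;
    }

    long durationMs() {
        return durationMs;
    }

    /** End of the task window. A startMs of 0 gets no special treatment here. */
    long endMs() {
        return startMs + durationMs;
    }
}
```

With an accessor like this, a caller only needs to compare `endMs()` against the current time, so the expiry decision stays at the call site rather than inside the spec.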
        this.reference = shutdownManager.takeReference();
    }

    boolean hasExpired() {
I don't know if it makes sense to add this function just for a single caller.
We could just check if (spec.endMs() >= time.milliseconds()) below...
Also reverts back changes to NodeManager
…t NPathComplexity
retest this please
JDK11 failure seems unrelated -
LGTM
The Trogdor Coordinator now overwrites a task's startMs with the time it received the task if startMs is in the past.
The Trogdor Agent now correctly expires a task after the expiry time (startMs + durationMs) passes. Previously, it would ignore startMs and expire the task durationMs milliseconds after the task started locally.
Reviewed-by: Colin P. McCabe <cmccabe@apache.org>
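The timing rules in the commit message can be sketched as plain arithmetic. This is an illustration under assumptions, not the merged code: the class `ExpirationTimingSketch`, its method names, and the explicit timestamp parameters are all hypothetical.

```java
/** Hypothetical illustration of the timing rules described in the commit message. */
final class ExpirationTimingSketch {
    private ExpirationTimingSketch() { }

    /** Coordinator side: a requested start time in the past is replaced by the receive time. */
    static long effectiveStartMs(long requestedStartMs, long receivedMs) {
        return Math.max(requestedStartMs, receivedMs);
    }

    /** Agent side, new behavior: milliseconds left until startMs + durationMs, never negative. */
    static long remainingMs(long startMs, long durationMs, long nowMs) {
        return Math.max(0L, startMs + durationMs - nowMs);
    }

    /** Agent side, previous behavior: expiry measured from the local start, ignoring startMs. */
    static long previousExpiryMs(long localStartMs, long durationMs) {
        return localStartMs + durationMs;
    }
}
```

For example, with `startMs = 1000` and `durationMs = 500`, an Agent that only starts the task locally at 1300 would previously have kept it running until 1800, whereas the expiry time based on `startMs` is 1500.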
https://issues.apache.org/jira/browse/KAFKA-7790