KAFKA-7790: Fix Bugs in Trogdor Task Expiration #6103

Merged: cmccabe merged 5 commits into apache:trunk from stanislavkozlovski:trogdor-timeout on Jan 11, 2019

@stanislavkozlovski
Contributor

@stanislavkozlovski stanislavkozlovski commented Jan 8, 2019

https://issues.apache.org/jira/browse/KAFKA-7790

Changes:

  • The Trogdor Coordinator now overwrites a task's startMs with the time it received the task if startMs is in the past.
  • The Trogdor Agent now correctly expires a task once the expiry time (startMs + durationMs) passes. Previously, it ignored startMs and expired the task durationMs milliseconds after the task started locally.

This commit changes a Trogdor agent/coordinator's behavior to not run tasks that have expired. We define an expired task as one whose sum of `startedMs` and `durationMs` is less than the current time in milliseconds.
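
To make the two changes above concrete, here is a minimal sketch of the arithmetic involved. It is not the Trogdor code from this PR: the class `ExpirySketch` and all of its method names are hypothetical, and times are passed in as plain longs rather than read from a clock.

```java
/** Hypothetical illustration only; these are not the actual Trogdor classes. */
final class ExpirySketch {
    private ExpirySketch() { }

    /** Coordinator side: a startMs in the past is replaced by the time the task was received. */
    static long effectiveStartMs(long specStartMs, long receivedMs) {
        return Math.max(specStartMs, receivedMs);
    }

    /** Old agent behavior: the deadline was counted from the local start of the task. */
    static long oldDeadlineMs(long localStartMs, long durationMs) {
        return localStartMs + durationMs;
    }

    /** New agent behavior: the deadline is counted from the task's startMs. */
    static long newDeadlineMs(long startMs, long durationMs) {
        return startMs + durationMs;
    }

    /** Expired means startMs + durationMs is less than the current time. */
    static boolean hasExpired(long startMs, long durationMs, long nowMs) {
        return startMs + durationMs < nowMs;
    }
}
```

For example, with startMs = 1000 and durationMs = 500, an agent that only starts the task locally at 1400 would previously have run it until 1900; under the new behavior the deadline stays at 1500.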
try {
    client.createWorker(new CreateWorkerRequest(workerId, taskId, spec));
    if (startedMs == -1)
        startedMs = time.milliseconds();

Contributor Author

Here the Coordinator keeps track of when it first ran the task, which is useful in cases where spec.startMs == 0.

if (!worker.hasExpired()) {
    worker.tryCreate();
} else {
    log.info("{}: Will not create worker state {} as it has expired. ", node.name(), worker.state);

Contributor Author

This covers the case where the Coordinator reschedules a task after an Agent is detected via heartbeats.

} catch (Throwable t) {
    failure = "Failed to create TaskController: " + t.getMessage();
}
if (spec.hasExpired(time, -1))

Contributor Author

@stanislavkozlovski stanislavkozlovski Jan 8, 2019

This is where the Coordinator is given a brand-new task that has already expired.

@stanislavkozlovski
Contributor Author

Retest this please

public boolean hasExpired(Time time, long startedMs) {
    long startMs = this.startMs > 0 ? this.startMs : startedMs;
    if (startMs <= 0) // task doesn't have a start time yet
        return false;

Contributor

We shouldn't be special-casing 0 here. In general, I think the only thing we need here is an accessor function like endMs which returns startMs + durationMs.
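
As a rough sketch of the accessor suggested here (not the code that was eventually merged), `endMs` could be as small as the following. `TaskSpecSketch` is a hypothetical stand-in for the real spec class, and only the names `startMs`, `durationMs` and `endMs` come from this thread.

```java
/** Hypothetical stand-in for a task spec; not the real Trogdor class. */
final class TaskSpecSketch {
    private final long startMs;
    private final long durationMs;

    TaskSpecSketch(long startMs, long durationMs) {
        this.startMs = startMs;
        this.durationMs = durationMs;
    }

    /** The time at which the task should end, with no special case for startMs == 0. */
    long endMs() {
        return startMs + durationMs;
    }
}
```

A caller would then compare `endMs()` against the current time itself, so the spec needs neither a `Time` argument nor a special case for a missing start time.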

    this.reference = shutdownManager.takeReference();
}

boolean hasExpired() {

Contributor

I don't know if it makes sense to add this function just for a single caller.

We could just check if (spec.endMs() >= time.milliseconds()) below...

@cmccabe cmccabe changed the title KAFKA-7790: Expire Trogdor tasks KAFKA-7790" Fix Bugs in Trogdor Task Expiration Jan 10, 2019
@cmccabe cmccabe changed the title KAFKA-7790" Fix Bugs in Trogdor Task Expiration KAFKA-7790: Fix Bugs in Trogdor Task Expiration Jan 10, 2019
@cmccabe
Contributor

cmccabe commented Jan 10, 2019

retest this please

@stanislavkozlovski
Contributor Author

The JDK11 failure seems unrelated: org.apache.kafka.streams.KafkaStreamsTest.shouldThrowOnCleanupWhileRunning

@cmccabe
Contributor

cmccabe commented Jan 11, 2019

LGTM

@cmccabe cmccabe merged commit 625e0d8 into apache:trunk Jan 11, 2019
pengxiaolong pushed a commit to pengxiaolong/kafka that referenced this pull request Jun 14, 2019
The Trogdor Coordinator now overwrites a task's startMs to the time it received it if startMs is in the past.

The Trogdor Agent now correctly expires a task after the expiry time (startMs + durationMs) passes. Previously, it would ignore startMs and expire after durationMs milliseconds of local start of the task.

Reviewed-by: Colin P. McCabe <cmccabe@apache.org>