Trogdor Improvements: execute Trogdor tasks directly and run the Trogdor worker in a separate process #6087
yangxi wants to merge 10 commits into apache:trunk from yangxi:local-trogdor
Conversation
It is handy to run Trogdor tasks directly without setting up a distributed Trogdor coordinator and agents. For example, we can quickly run produce and consume performance tests.
- Add TrogdorLocalRunner that reads a task from the spec, executes the task, and reports results.
- Add two SPEC examples: consume-bench-spec.json and produce-bench-spec.json.
- Add/update helper scripts.
- Update TROGDOR.md.
cc @cmccabe
@@ -0,0 +1,6 @@
#!/usr/bin/env bash
It would be good if we could add the ability to read JSON from a file to the Java code, rather than hacking it up in the shell. One reason is that the shell has some issues with escaping special characters. Another is that this would be useful more generally in all the other trogdor command-line commands that take JSON arguments.
So we could just write ./bin/trogdor.sh exec --spec @/my/spec/path
And then we could also write something like bin/trogdor.sh client -create-task @/my/task/path, etc.
I think argparse4j has built-in support for this:
https://argparse4j.github.io/apidocs/net/sourceforge/argparse4j/ArgumentParserBuilder.html#fromFilePrefix-java.lang.String-
fromFilePrefix does not help here, because what it does is read arguments from a file. So you would have to put "--spec {....}" into a file, then call ./bin/trogdor.sh exec @/my/spec/path.
I think it is better to read the SPEC file directly. We can also ignore lines starting with #, so we can put comments in the SPEC files.
@cmccabe Updated, please have a look. It reads the SPEC file if the argument starts with @.
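The @-prefix handling described here could look roughly like this (a sketch only; the class and method names are hypothetical, not from the patch): if the argument starts with @, read the spec from that path, dropping lines that start with # so the SPEC files can carry comments.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Collectors;

public class SpecArgumentReader {
    // If the argument starts with '@', treat the remainder as a file path and
    // read the spec JSON from that file, skipping '#'-comment lines.
    // Otherwise, the argument itself is the JSON spec.
    public static String resolveSpec(String arg) throws IOException {
        if (!arg.startsWith("@")) {
            return arg;
        }
        return Files.readAllLines(Paths.get(arg.substring(1))).stream()
            .filter(line -> !line.trim().startsWith("#"))
            .collect(Collectors.joining(System.lineSeparator()));
    }
}
```

This keeps the quoting problems out of the shell entirely, which was the concern raised above.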
This patch adds the Trogdor RuntimeProcessWorker that runs the worker in a separate process. With this, Trogdor can execute diverse workloads, such as non-Java clients.
- Users pass the worker command in spec.workerCommand, e.g. "workerCommand": ["python", "./producer.py"].
- RuntimeProcessWorker starts the worker in a process by executing `workerCommand --spec TASK_SPEC`.
- RuntimeProcessWorker monitors the worker's stderr and stdout line by line; if a stdout line is a JSON value, it updates its status with that value.
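A minimal sketch of the process-worker loop described above (not the PR's actual RuntimeProcessWorker; in particular, the JSON check below is a crude stand-in for the Jackson readTree parse the patch uses):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class ProcessWorkerSketch {
    private final AtomicReference<String> status = new AtomicReference<>("");

    public String status() { return status.get(); }

    // Launch the worker command with "--spec TASK_SPEC" appended, then scan
    // its output line by line; a line that looks like JSON becomes the status.
    public int run(List<String> workerCommand, String taskSpec)
            throws IOException, InterruptedException {
        List<String> cmd = new ArrayList<>(workerCommand);
        cmd.add("--spec");
        cmd.add(taskSpec);
        Process worker = new ProcessBuilder(cmd).redirectErrorStream(true).start();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(worker.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (looksLikeJson(line)) {   // stand-in for a real Jackson parse
                    status.set(line.trim());
                }
            }
        }
        return worker.waitFor();
    }

    // Placeholder heuristic; the actual patch parses each line with Jackson.
    private static boolean looksLikeJson(String line) {
        String t = line.trim();
        return t.startsWith("{") && t.endsWith("}");
    }
}
```

Because the worker only communicates over argv, stdout, and stderr, it can be written in any language, which is the point of feature 2.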
stanislavkozlovski left a comment
Thanks a lot for the PR, Xi!
These changes (especially RuntimeProcessWorker ) are very useful and will unlock a lot of use cases with Trogdor. It's looking great to me.
I left some small comments and want to discuss the status of the RuntimeProcessWorker a bit more
@Override
public TaskWorker newTaskWorker(String id) {
    return new ConsumeBenchWorker(id, this);
    if (this.workerCommand() == null) {
If we are to have workerCommand() be in TaskSpec, I think it would make sense to have this check and the RuntimeProcessWorker instantiation be in TaskSpec#newTaskWorker. We could then have a new abstract method (e.g. TaskSpec#newSpecificTaskWorker, or whatever) that inheriting classes would override.
Well, the problem is that RuntimeProcessWorker is a special case. Starting RuntimeProcessWorker in TaskSpec makes it feel like it is the default worker.
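For comparison, the reviewer's template-method suggestion might look like this (all names are hypothetical stubs, not the PR's code): the abstract spec owns the workerCommand check, and concrete specs only supply their default worker.

```java
interface TaskWorker {}

// Stand-in for the process-based worker; real code would hold the command.
class RuntimeProcessWorkerStub implements TaskWorker {
    RuntimeProcessWorkerStub(String id) { }
}

abstract class TaskSpecSketch {
    abstract java.util.List<String> workerCommand();

    // Each concrete spec overrides this to build its own default worker.
    abstract TaskWorker newSpecificTaskWorker(String id);

    // Template method: a configured workerCommand always wins.
    final TaskWorker newTaskWorker(String id) {
        if (workerCommand() != null) {
            return new RuntimeProcessWorkerStub(id);
        }
        return newSpecificTaskWorker(id);
    }
}
```

The objection above is that this makes the process worker look like the default path for every spec, which is a design judgment rather than a correctness issue.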
public class ErrorUpdater implements Runnable {
    BufferedReader br;
    ErrorUpdater() {
        br = new BufferedReader(new InputStreamReader(worker.getErrorStream(), StandardCharsets.UTF_8));
Does this get closed automatically by Java or are we leaking it? As far as I know this would keep the system resources until it's closed, but I guess garbage collection would eventually close it? Same question for StatusUpdated#br
Yeah, I think so, since the stream is gone after the process is gone.
Which process? Trogdor Agent processes are long-lived
The worker process. After it is gone, its stdout and stderr streams should be closed.
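One way to make the lifetime explicit, rather than relying on the worker process exiting, is to scope the reader inside the updater's run() with try-with-resources so it closes deterministically when the stream ends. A sketch (names are hypothetical, not the PR's ErrorUpdater):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Drain a worker stream on its own thread, closing the reader when the
// stream is exhausted instead of waiting for garbage collection.
public class StreamDrainer implements Runnable {
    private final InputStream stream;
    private final StringBuilder captured = new StringBuilder();

    public StreamDrainer(InputStream stream) { this.stream = stream; }

    public String captured() { return captured.toString(); }

    @Override
    public void run() {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                captured.append(line).append('\n');
            }
        } catch (IOException e) {
            // The stream closed because the worker exited; nothing to clean up.
        }
    }
}
```

With this shape the leak question goes away, because the reader's scope is exactly the drain loop.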
log.info("Worker (stdout):{}", line);
try {
    JsonNode resp = JsonUtil.JSON_SERDE.readTree(line);
    status.update(resp);
There's just one thing I'm not sure about here. In ConsumeBenchWorker and RoundTripWorker we update the status every 30 seconds. In ProduceBenchWorker we update the status every minute.
Here, we update it on every single line, which means that it might be very easy to miss certain outputs when monitoring this task through the coordinator.
I'm wondering whether it would make sense to update this status once every 30s/1m with a concatenation of all the JSON read in that time frame (up to a certain limit, I imagine).
I'd like to hear your thoughts on this @yangxi @cmccabe
Both ConsumeBenchWorker and ProduceBenchWorker overwrite the status, and both expect the worker to return final results at the end of the task. This follows the same idea. It does not update the status for every line; it only updates the status if the line is a JSON string. That means we expect the worker to know this and only return the final results as a JSON string at the end of the task.
Let's keep it like this. If overwriting becomes a problem, we can make some improvements:
- We can define the response API. For example, we only update the status for lines like {task: TASK_ID, response: { ..... } }
- We can concatenate output lines.
- We can send out status as a Kafka stream :)
They do overwrite the status, but they do it once every 30 seconds, so as a user you know that if you query once every 30 seconds you'll get the latest status.
In this case you don't; you may query once every 30 seconds, but you're not sure whether some message got written and then overwritten in that timespan.
@cmccabe is working on https://issues.apache.org/jira/browse/KAFKA-7793, which covers usability improvements to the Trogdor CLI, and I thought that having a way to follow a task's status might be useful. If we settled on a pre-defined status update interval, we could implement that easily.
Anyway, that's just my thoughts on this. In the end, we do log every line via log.info("Worker (stdout):{}", line); so it shouldn't hurt.
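The buffering idea discussed in this thread could be sketched like this (the interval, limit, and class name are illustrative, not from the PR): collect incoming JSON lines and publish them as one status at most once per interval, so a poller at that cadence sees every line rather than only the last one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Buffer JSON lines and publish them together at a fixed cadence.
// Time is passed in explicitly to keep the sketch testable.
public class ThrottledStatus {
    private final long intervalMs;
    private final int maxBuffered;
    private final List<String> buffer = new ArrayList<>();
    private final AtomicReference<List<String>> published =
        new AtomicReference<>(new ArrayList<>());
    private long lastPublishMs;

    public ThrottledStatus(long intervalMs, int maxBuffered, long nowMs) {
        this.intervalMs = intervalMs;
        this.maxBuffered = maxBuffered;
        this.lastPublishMs = nowMs;
    }

    // Record a line; publish the accumulated batch once the interval elapses.
    public void offer(String jsonLine, long nowMs) {
        if (buffer.size() < maxBuffered) {   // cap memory, as suggested above
            buffer.add(jsonLine);
        }
        if (nowMs - lastPublishMs >= intervalMs) {
            published.set(new ArrayList<>(buffer));
            buffer.clear();
            lastPublishMs = nowMs;
        }
    }

    public List<String> latest() { return published.get(); }
}
```

This gives pollers the same guarantee the 30-second workers provide: querying at the publish interval never misses an update.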
Hi @yangxi, thanks for a great idea for a PR, and for taking the time to put this together. As I was working on #6133, I thought it would be easier to just roll this change into that one. The reasoning is that they touch a lot of the same files. For example, that patch removes the "start a task" bash scripts in favor of just being able to specify a file with the JSON in it. The approach in #6133 is a bit different. I think that is the right way to go, since it means that we have fewer different ways of doing the same thing. In particular, the command-line tools for querying the agent over its REST API will continue to work with the change in #6133, whereas they wouldn't here if we were creating a separate process. This could be very useful if you want to get status updates over time of something that you are running.
This PR reads from the JSON config file too, if the spec name starts with @. Are you going to commit #6133 soon? I can make the TrogdorLocalRunner use the same arguments.
Sorry about that; this PR squeezes in two unrelated features: 1) directly execute Trogdor tasks without setting up coordinators and agents, and 2) make agents able to execute shell commands as the worker in another process, so we have the flexibility of running workloads such as non-Java clients, docker images, etc. I guess you want to compare #6133 with feature 1. Feature 1 does not create a separate process to run the spec; it runs the task directly.
Sorry for the confusion of combining two unrelated features in one PR. Feature 2 does not change how agents work. The agent starts the worker process and updates results the same as other agents. So you can query results from agents the same way as before.
It might make sense to query for the results if your task were to run for 1 hour, for example.
You are right.
@cmccabe Please have a look at the second feature (run the Trogdor worker in a separate process). If you think it is helpful, I can submit it as a PR to your #6133 branch.
This PR adds two features: 1) directly execute Trogdor tasks without setting up coordinators and agents, and 2) make agents able to execute shell commands as the worker in another process, so we have the flexibility of running workloads such as non-Java clients, docker images, etc.

Directly execute Trogdor tasks (feature 1):
- Add TrogdorLocalRunner that reads a task from the spec, executes the task, and reports results.
- Add two SPEC examples: consume-bench-spec.json and produce-bench-spec.json.

Run the Trogdor worker in a separate process (feature 2):
- Users pass the worker command in spec.workerCommand, e.g. "workerCommand": ["python", "./producer.py"].
- RuntimeProcessWorker starts the worker in a process by executing `workercommand --spec TASK_SPEC`.
- RuntimeProcessWorker monitors the worker's output line by line; if a stdout line is a JSON value, it updates its status with the JSON value.