Skip to content

Trogdor Improvements: execute Trogdor tasks directly and run the Trogdor worker in a separate process#6087

Closed
yangxi wants to merge 10 commits intoapache:trunkfrom
yangxi:local-trogdor
Closed

Trogdor Improvements: execute Trogdor tasks directly and run the Trogdor worker in a separate process#6087
yangxi wants to merge 10 commits intoapache:trunkfrom
yangxi:local-trogdor

Conversation

@yangxi
Copy link
Copy Markdown
Contributor

@yangxi yangxi commented Jan 4, 2019

This PR adds two features: 1) directly execute Trogdor tasks without setting up coordinators and agents, 2) make agents be able to execute shell commands as the worker in another process, so we have the flexibility of running workloads, such as non-Java clients, docker images, etc.

  • Directly execute Trogdor tasks (feature 1).

    • Add TrogdorLocalRunner that reads a task from the spec, executes the task, and
      reports results.
    • Add two SPEC examples: consume-bench-spec.json and
      produce-bench-spec.json.
    • Add/update helper scripts.
    • Update TROGDOR.md
  • Run the Trogdor worker in a separate process (feature 2).

    • Users pass the worker command in spec.workerCommand, such as
      "workerCommand": ["python", "./producer.py"]
    • RuntimeProcessWorker starts the worker in a process by executing
      workercommand --spec TASK_SPEC.
    • RuntimeProcessWorker monitors the worker's stderr and stdout line by
      line, if the stdout line is a JSON value, it updates its status with
      the JSON value.

More detailed description of your change,
if necessary. The PR title and PR message become
the squashed commit message, so use a separate
comment to ping reviewers.

Summary of testing strategy (including rationale)
for the feature or bug fix. Unit and/or integration
tests are expected for any behaviour change and
system tests should be considered for larger changes.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

It is handy to run Trogdor tasks directly without setting up
distributed Trogdor coordinator/agents. For example, we can quickly
run produce and consume performance tests.

+ Add TrogdorLocalRunner that reads a task from the spec, executes the task, and
reports results.
+ Add two SPEC examples: consume-bench-spec.json and
produce-bench-spec.json
+ Add/update helper scripts.
+ Update TROGDOR.md
@hachikuji
Copy link
Copy Markdown
Contributor

cc @cmccabe

Comment thread bin/trogdor.sh Outdated
Comment thread tests/trogdor/bin/trogdor-local-run.sh Outdated
@@ -0,0 +1,6 @@
#!/usr/bin/env bash
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good if we could add the ability to read JSON from a file to the Java code, rather than hacking it up in the shell. The reason is because the shell has some issues with escaping special characters. Another reason is because this would be useful more generally in all the other trogdor command line commands that take JSON arguments.

So we could just write ./bin/trogdor.sh exec --spec @/my/spec/path

And then we could also write something like bin/trogdor.sh client -create-task @/my/task/path, etc.

I think argparse4j has built-in support for this:
https://argparse4j.github.io/apidocs/net/sourceforge/argparse4j/ArgumentParserBuilder.html#fromFilePrefix-java.lang.String-

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

Copy link
Copy Markdown
Contributor Author

@yangxi yangxi Jan 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fromFilePrefix does not help here because what it does is to read arguments from a file. So you have to put "--spce {....}" into a file, then call ./bin/trogdor.sh exec @/my/spec/path.

I think it is better to read the SPEC file directly. We can also ignore lines starting with #, so we can put comments in the SPEC files.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmccabe Updated, please have a look. It reads the SPEC file if the arguments starting with @.

Xi Yang added 5 commits January 7, 2019 22:22
This patch adds the Trogdor RuntimeProcessWorker that runs the worker
in a separate process. With this, Trogdor can execute diverse
workloads, such as non-Java clients.

+ Users pass the worker command in spec.workerCommand, such as
"workerCommand": ["python", "./producer.py"]

+ RuntimeProcessWorker starts the worker in a process by executing
`workercommand --spec TASK_SPEC`.

+ RuntimeProcessWorker monitors the worker's stderr and stdout line by
line, if the stdout line is a JSON value, it updates its status with
the JSON value.
@yangxi yangxi changed the title Execute Trogdor tasks directly Trogdor Improvements: execute Trogdor tasks directly and run Trogdor worker in a process Jan 11, 2019
@yangxi yangxi changed the title Trogdor Improvements: execute Trogdor tasks directly and run Trogdor worker in a process Trogdor Improvements: execute Trogdor tasks directly and run the Trogdor worker in a separate process Jan 11, 2019
Copy link
Copy Markdown
Contributor

@stanislavkozlovski stanislavkozlovski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the PR, Xi!
These changes (especially RuntimeProcessWorker ) are very useful and will unlock a lot of use cases with Trogdor. It's looking great to me.

I left some small comments and want to discuss the status of the RuntimeProcessWorker a bit more

Comment thread tests/trogdor/spec/consume-bench-spec.json Outdated
Comment thread tools/src/main/java/org/apache/kafka/trogdor/common/SpecUtils.java Outdated
Comment thread tools/src/main/java/org/apache/kafka/trogdor/task/TaskSpec.java Outdated
@Override
public TaskWorker newTaskWorker(String id) {
return new ConsumeBenchWorker(id, this);
if (this.workerCommand() == null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are to have workerCommand() be in TaskSpec, I think it would make sense to have this check and RuntimeProcessWorker instantiation be in TaskSpec#newTaskWorker. We could maybe then have a new abstract method (e.g TaskSpec#newSpecificTaskWorker or whatever) that inheriting classes would override

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the problem is that RuntimeProcessWorker is a special case. Starting RuntimeProcessWorker in TaskSpec makes it feel like that is a default worker.

Comment thread tools/src/main/java/org/apache/kafka/trogdor/workload/ProduceBenchSpec.java Outdated
Comment thread tools/src/main/java/org/apache/kafka/trogdor/workload/RuntimeProcessWorker.java Outdated
Comment thread tools/src/main/java/org/apache/kafka/trogdor/workload/RuntimeProcessWorker.java Outdated
public class ErrorUpdater implements Runnable {
BufferedReader br;
ErrorUpdater() {
br = new BufferedReader(new InputStreamReader(worker.getErrorStream(), StandardCharsets.UTF_8));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this get closed automatically by Java or are we leaking it? As far as I know this would keep the system resources until it's closed, but I guess garbage collection would eventually close it? Same question for StatusUpdated#br

Copy link
Copy Markdown
Contributor Author

@yangxi yangxi Jan 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think so since the stream is gone after either the process is gone.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which process? Trogdor Agent processes are long-lived

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The worker process. After it is gone, its stdout and stderr streams should be closed.

log.info("Worker (stdout):{}", line);
try {
JsonNode resp = JsonUtil.JSON_SERDE.readTree(line);
status.update(resp);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's just one thing I'm not sure about here. In ConsumeBenchWorker and RoundTripWorker we update the status every 30 seconds. In ProduceBenchWorker we update the status every minute.
Here, we update it on every single line which means that it might be very easy to miss certain outputs when monitoring this task through the coordinator.

I'm wondering whether it would make sense to update this status once every 30s/1m with a concatenation of all the read json in that time frame (up to a certain limit I imagine).}
I'd like to hear your thoughts on this @yangxi @cmccabe

Copy link
Copy Markdown
Contributor Author

@yangxi yangxi Jan 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both ConsumeBenchWorker and ProduceBenchWorker overwrites the status, and both expect the worker returns final results at the end of task. This follows the same idea. It does not update status for every line, but only updates the status if the result is a JSON string. Which means that we expect the worker know this and only return the final results in JSON string at the end of the task.

Let's keep it like this. If overwriting becomes a problem, we can do some improvements:

  1. We can decide the response API. For example, we only update the status for lines like this {task: TASK_ID, response: { ..... } }
  2. We can concatenate output lines.
  3. We can send out status as a Kafka stream :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They do overwrite the status but they do it once every 30 seconds, so as a user you know that if you query once every 30 seconds you'll get the latest status.
In this case you don't, you may query once every 30 seconds but you're not sure whether some message got written and then overwritten in that timespan.

@cmccabe is working on https://issues.apache.org/jira/browse/KAFKA-7793, which are usability improvements to the Trogdor CLI and I thought that having a way to follow a tasks' status might be useful. If we settled on a pre-defined status update time we could implement that easily.

Anyway that's just my thoughts on this. In the end, we do log every line vialog.info("Worker (stdout):{}", line); so it shoudn't hurt

@cmccabe
Copy link
Copy Markdown
Contributor

cmccabe commented Jan 12, 2019

Hi @yangxi,

Thanks for a great idea for a PR, and for taking the time to put this together.

As I was working on #6133, I thought it would be easier to just roll this change into that one. The reasoning is because they touch on a lot of the same files. For example, that patch removes the "start a task" bash scripts in favor of just being able to specify a file with JSON in it via trogdor.sh client startTask -i taskId -s /path/to/spec. I think this is a much cleaner approach which eliminates the need for a lot of bash glue code.

The approach in #6133 is a bit different because it just adds an --exec command-line flag to the Agent process, rather than creating a separate process. When the agent is given this --exec flag, it immediately starts the given task spec, and waits for it to complete.

I think that is the right way to go, since it means that we have fewer different ways of doing the same thing. In particular, the command-line tools for querying the agent over its REST API will continue to work with the change in #6133, whereas they wouldn't here if we were creating a separate process. This could be very useful if you want to get status updates over time of something that you are running.

@yangxi
Copy link
Copy Markdown
Contributor Author

yangxi commented Jan 12, 2019

@cmccabe

Hi @yangxi,
As I was working on #6133, I thought it would be easier to just roll this change into that one. The reasoning is because they touch on a lot of the same files. For example, that patch removes the "start a task" bash scripts in favor of just being able to specify a file with JSON in it via trogdor.sh client startTask -i taskId -s /path/to/spec. I think this is a much cleaner approach which eliminates the need for a lot of bash glue code.

This PR reads from the JSON config file too, if the spec name starts with @, it reads the file, ignores lines starting with #.

Are you going to commit the #6133 soon, I can make the TrogdorLocalRunner use same arguments.

The approach in #6133 is a bit different because it just adds an --exec command-line flag to the Agent process, rather than creating a separate process. When the agent is given this --exec flag, it immediately starts the given task spec, and waits for it to complete.

Sorry about that this PR squeezes two not related features: 1) directly execute Trogdor tasks without setting up coordinators and agents, 2) make agents be able to execute shell commands as the worker in another process, so we have the flexibility of running workloads, such as non-Java clients, docker images, etc.

I guess you want to compare #6133 with the feature 1. For feature 1, it does not create a separate process to run the spec. It runs the task directly.

I think that is the right way to go, since it means that we have fewer different ways of doing the same thing. In particular, the command-line tools for querying the agent over its REST API will continue to work with the change in #6133, whereas they wouldn't here if we were creating a separate process. This could be very useful if you want to get status updates over time of something that you are running.

Sorry for the confusion of combing two un-related features in one PR.
Feature 1 directly executes the tasks. The goal is to run tasks without setting up agents, etc. So there is no need to query the results via REST API since you run it locally.

Feature 2 does not change how agents work. The agent starts the worker process, and the agent updates results same as other agents. So, you just query results as same as before from agents.

@stanislavkozlovski
Copy link
Copy Markdown
Contributor

Sorry for the confusion of combing two un-related features in one PR.
Feature 1 directly executes the tasks. The goal is to run tasks without setting up agents, etc. So there is no need to query the results via REST API since you run it locally.

It might make sense to query for the results if your task were to run for 1 hour, for example

@yangxi
Copy link
Copy Markdown
Contributor Author

yangxi commented Jan 14, 2019

Sorry for the confusion of combing two un-related features in one PR.
Feature 1 directly executes the tasks. The goal is to run tasks without setting up agents, etc. So there is no need to query the results via REST API since you run it locally.

It might make sense to query for the results if your task were to run for 1 hour, for example

You are right. Adding --exec is a better approach. Looks like we don't need the TrogdorLocalRunner then. I wish I knew #6133.

@cmccabe Please have a look the second feature (run the Trogdor worker in a separate process). If you think it is helpful, I can submit it as a PR to your #6133 branch.

@yangxi yangxi closed this Jan 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants