Trogdor Improvements: execute Trogdor tasks directly and run the Trogdor worker in a separate process #6087
yangxi wants to merge 10 commits into apache:trunk from yangxi:local-trogdor
Conversation
It is handy to run Trogdor tasks directly without setting up a distributed Trogdor coordinator and agents. For example, we can quickly run produce and consume performance tests.
- Add TrogdorLocalRunner that reads a task from the spec, executes the task, and reports results.
- Add two SPEC examples: consume-bench-spec.json and produce-bench-spec.json.
- Add/update helper scripts.
- Update TROGDOR.md.
cc @cmccabe
@@ -0,0 +1,6 @@
#!/usr/bin/env bash
It would be good if we could add the ability to read JSON from a file to the Java code, rather than hacking it up in the shell. One reason is that the shell has some issues with escaping special characters. Another is that this would be useful more generally in all the other trogdor command-line commands that take JSON arguments.
So we could just write ./bin/trogdor.sh exec --spec @/my/spec/path
And then we could also write something like bin/trogdor.sh client -create-task @/my/task/path, etc.
I think argparse4j has built-in support for this:
https://argparse4j.github.io/apidocs/net/sourceforge/argparse4j/ArgumentParserBuilder.html#fromFilePrefix-java.lang.String-
fromFilePrefix does not help here, because what it does is read arguments from a file. So you would have to put "--spec {....}" into a file, then call ./bin/trogdor.sh exec @/my/spec/path.
I think it is better to read the SPEC file directly. We can also ignore lines starting with #, so we can put comments in the SPEC files.
@cmccabe Updated, please have a look. It reads the SPEC file if the argument starts with @.
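The @-prefix handling described here could look roughly like this (a sketch only; the class and method names are hypothetical, not from the patch): if the argument starts with @, read the spec from that path, dropping lines that start with # so the SPEC files can carry comments.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Collectors;

public class SpecArgumentReader {
    // If the argument starts with '@', treat the remainder as a file path and
    // read the spec JSON from that file, skipping '#'-comment lines.
    // Otherwise, the argument itself is the JSON spec.
    public static String resolveSpec(String arg) throws IOException {
        if (!arg.startsWith("@")) {
            return arg;
        }
        return Files.readAllLines(Paths.get(arg.substring(1))).stream()
            .filter(line -> !line.trim().startsWith("#"))
            .collect(Collectors.joining(System.lineSeparator()));
    }
}
```

This keeps the quoting problems out of the shell entirely, which was the concern raised above.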
This patch adds the Trogdor RuntimeProcessWorker that runs the worker in a separate process. With this, Trogdor can execute diverse workloads, such as non-Java clients.
- Users pass the worker command in spec.workerCommand, e.g. "workerCommand": ["python", "./producer.py"].
- RuntimeProcessWorker starts the worker in a process by executing `workerCommand --spec TASK_SPEC`.
- RuntimeProcessWorker monitors the worker's stderr and stdout line by line; if a stdout line is a JSON value, it updates its status with that value.
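A minimal sketch of the process-worker loop described above (not the PR's actual RuntimeProcessWorker; in particular, the JSON check below is a crude stand-in for the Jackson readTree parse the patch uses):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class ProcessWorkerSketch {
    private final AtomicReference<String> status = new AtomicReference<>("");

    public String status() { return status.get(); }

    // Launch the worker command with "--spec TASK_SPEC" appended, then scan
    // its output line by line; a line that looks like JSON becomes the status.
    public int run(List<String> workerCommand, String taskSpec)
            throws IOException, InterruptedException {
        List<String> cmd = new ArrayList<>(workerCommand);
        cmd.add("--spec");
        cmd.add(taskSpec);
        Process worker = new ProcessBuilder(cmd).redirectErrorStream(true).start();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(worker.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (looksLikeJson(line)) {   // stand-in for a real Jackson parse
                    status.set(line.trim());
                }
            }
        }
        return worker.waitFor();
    }

    // Placeholder heuristic; the actual patch parses each line with Jackson.
    private static boolean looksLikeJson(String line) {
        String t = line.trim();
        return t.startsWith("{") && t.endsWith("}");
    }
}
```

Because the worker only communicates over argv, stdout, and stderr, it can be written in any language, which is the point of feature 2.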
stanislavkozlovski left a comment
Thanks a lot for the PR, Xi!
These changes (especially RuntimeProcessWorker ) are very useful and will unlock a lot of use cases with Trogdor. It's looking great to me.
I left some small comments and want to discuss the status of the RuntimeProcessWorker a bit more
@Override
public TaskWorker newTaskWorker(String id) {
    return new ConsumeBenchWorker(id, this);
    if (this.workerCommand() == null) {
If we are to have workerCommand() be in TaskSpec, I think it would make sense to have this check and the RuntimeProcessWorker instantiation be in TaskSpec#newTaskWorker. We could then have a new abstract method (e.g. TaskSpec#newSpecificTaskWorker, or whatever) that inheriting classes would override.
Well, the problem is that RuntimeProcessWorker is a special case. Starting RuntimeProcessWorker in TaskSpec makes it feel like it is the default worker.
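For comparison, the reviewer's template-method suggestion might look like this (all names are hypothetical stubs, not the PR's code): the abstract spec owns the workerCommand check, and concrete specs only supply their default worker.

```java
interface TaskWorker {}

// Stand-in for the process-based worker; real code would hold the command.
class RuntimeProcessWorkerStub implements TaskWorker {
    RuntimeProcessWorkerStub(String id) { }
}

abstract class TaskSpecSketch {
    abstract java.util.List<String> workerCommand();

    // Each concrete spec overrides this to build its own default worker.
    abstract TaskWorker newSpecificTaskWorker(String id);

    // Template method: a configured workerCommand always wins.
    final TaskWorker newTaskWorker(String id) {
        if (workerCommand() != null) {
            return new RuntimeProcessWorkerStub(id);
        }
        return newSpecificTaskWorker(id);
    }
}
```

The objection above is that this makes the process worker look like the default path for every spec, which is a design judgment rather than a correctness issue.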
public class ErrorUpdater implements Runnable {
    BufferedReader br;
    ErrorUpdater() {
        br = new BufferedReader(new InputStreamReader(worker.getErrorStream(), StandardCharsets.UTF_8));
Does this get closed automatically by Java or are we leaking it? As far as I know this would keep the system resources until it's closed, but I guess garbage collection would eventually close it? Same question for StatusUpdated#br
Yeah, I think so, since the stream is gone after the process is gone.
Which process? Trogdor Agent processes are long-lived
The worker process. After it is gone, its stdout and stderr streams should be closed.
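One way to make the lifetime explicit, rather than relying on the worker process exiting, is to scope the reader inside the updater's run() with try-with-resources so it closes deterministically when the stream ends. A sketch (names are hypothetical, not the PR's ErrorUpdater):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Drain a worker stream on its own thread, closing the reader when the
// stream is exhausted instead of waiting for garbage collection.
public class StreamDrainer implements Runnable {
    private final InputStream stream;
    private final StringBuilder captured = new StringBuilder();

    public StreamDrainer(InputStream stream) { this.stream = stream; }

    public String captured() { return captured.toString(); }

    @Override
    public void run() {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                captured.append(line).append('\n');
            }
        } catch (IOException e) {
            // The stream closed because the worker exited; nothing to clean up.
        }
    }
}
```

With this shape the leak question goes away, because the reader's scope is exactly the drain loop.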
log.info("Worker (stdout):{}", line);
try {
    JsonNode resp = JsonUtil.JSON_SERDE.readTree(line);
    status.update(resp);
There's just one thing I'm not sure about here. In ConsumeBenchWorker and RoundTripWorker we update the status every 30 seconds. In ProduceBenchWorker we update the status every minute.
Here, we update it on every single line, which means that it might be very easy to miss certain outputs when monitoring this task through the coordinator.
I'm wondering whether it would make sense to update this status once every 30s/1m with a concatenation of all the JSON read in that time frame (up to a certain limit, I imagine).
I'd like to hear your thoughts on this @yangxi @cmccabe
Both ConsumeBenchWorker and ProduceBenchWorker overwrite the status, and both expect the worker to return final results at the end of the task. This follows the same idea. It does not update the status for every line; it only updates the status if the line is a JSON string. That means we expect the worker to know this and only return the final results as a JSON string at the end of the task.
Let's keep it like this. If overwriting becomes a problem, we can make some improvements:
- We can define the response API. For example, we only update the status for lines like {task: TASK_ID, response: { ..... } }
- We can concatenate output lines.
- We can send out status as a Kafka stream :)
They do overwrite the status, but they do it once every 30 seconds, so as a user you know that if you query once every 30 seconds you'll get the latest status.
In this case you don't; you may query once every 30 seconds, but you're not sure whether some message got written and then overwritten in that timespan.
@cmccabe is working on https://issues.apache.org/jira/browse/KAFKA-7793, which covers usability improvements to the Trogdor CLI, and I thought that having a way to follow a task's status might be useful. If we settled on a pre-defined status update interval, we could implement that easily.
Anyway, that's just my thoughts on this. In the end, we do log every line via log.info("Worker (stdout):{}", line); so it shouldn't hurt.
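The buffering idea discussed in this thread could be sketched like this (the interval, limit, and class name are illustrative, not from the PR): collect incoming JSON lines and publish them as one status at most once per interval, so a poller at that cadence sees every line rather than only the last one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Buffer JSON lines and publish them together at a fixed cadence.
// Time is passed in explicitly to keep the sketch testable.
public class ThrottledStatus {
    private final long intervalMs;
    private final int maxBuffered;
    private final List<String> buffer = new ArrayList<>();
    private final AtomicReference<List<String>> published =
        new AtomicReference<>(new ArrayList<>());
    private long lastPublishMs;

    public ThrottledStatus(long intervalMs, int maxBuffered, long nowMs) {
        this.intervalMs = intervalMs;
        this.maxBuffered = maxBuffered;
        this.lastPublishMs = nowMs;
    }

    // Record a line; publish the accumulated batch once the interval elapses.
    public void offer(String jsonLine, long nowMs) {
        if (buffer.size() < maxBuffered) {   // cap memory, as suggested above
            buffer.add(jsonLine);
        }
        if (nowMs - lastPublishMs >= intervalMs) {
            published.set(new ArrayList<>(buffer));
            buffer.clear();
            lastPublishMs = nowMs;
        }
    }

    public List<String> latest() { return published.get(); }
}
```

This gives pollers the same guarantee the 30-second workers provide: querying at the publish interval never misses an update.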
Hi @yangxi, thanks for a great idea for a PR, and for taking the time to put this together. As I was working on #6133, I thought it would be easier to just roll this change into that one. The reasoning is that they touch a lot of the same files. For example, that patch removes the "start a task" bash scripts in favor of just being able to specify a file with the JSON in it. The approach in #6133 is a bit different. I think that is the right way to go, since it means that we have fewer different ways of doing the same thing. In particular, the command-line tools for querying the agent over its REST API will continue to work with the change in #6133, whereas they wouldn't here if we were creating a separate process. This could be very useful if you want to get status updates over time of something that you are running.
This PR reads from the JSON config file too, if the spec name starts with @. Are you going to commit #6133 soon? I can make the TrogdorLocalRunner use the same arguments.
Sorry about that; this PR squeezes in two unrelated features: 1) directly execute Trogdor tasks without setting up coordinators and agents, and 2) make agents able to execute shell commands as the worker in another process, so we have the flexibility of running workloads such as non-Java clients, docker images, etc. I guess you want to compare #6133 with feature 1. Feature 1 does not create a separate process to run the spec; it runs the task directly.
Sorry for the confusion of combining two unrelated features in one PR. Feature 2 does not change how agents work. The agent starts the worker process and updates results the same as other agents. So you can query results from agents the same way as before.
It might make sense to query for the results if your task were to run for 1 hour, for example.
You are right.
@cmccabe Please have a look at the second feature (run the Trogdor worker in a separate process). If you think it is helpful, I can submit it as a PR to your #6133 branch.
This PR adds two features: 1) directly execute Trogdor tasks without setting up coordinators and agents, and 2) make agents able to execute shell commands as the worker in another process, so we have the flexibility of running workloads such as non-Java clients, docker images, etc.

Directly execute Trogdor tasks (feature 1):
- Add TrogdorLocalRunner that reads a task from the spec, executes the task, and reports results.
- Add two SPEC examples: consume-bench-spec.json and produce-bench-spec.json.

Run the Trogdor worker in a separate process (feature 2):
- Users pass the worker command in spec.workerCommand, e.g. "workerCommand": ["python", "./producer.py"].
- RuntimeProcessWorker starts the worker in a process by executing `workercommand --spec TASK_SPEC`.
- RuntimeProcessWorker monitors the worker's output line by line; if a stdout line is a JSON value, it updates its status with the JSON value.