ARROW-15877: [C++] Add a C++ query testing tool #12586
…oaded datasets in the future and is needed for the query tester to recognize the root directory.
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also:
namespace compute
{

std::shared_ptr<ExecPlan> Plan_Q1(AsyncGenerator<util::optional<ExecBatch>> &sink_gen, int scale_factor)
This code is almost identical to tpch1() in builtin_queries.cc
Yes, I created tpch1() by leveraging this. Once the PR adding tpch_benchmark.cc merges I will update this PR to remove tpch_benchmark.cc in favor of this (or perhaps keep the benchmark but link to builtin queries).
This PR modifies the `SubmitTask` and `Finish` methods of MapNode in `ExecPlan` to avoid scheduling extra thread tasks.

Performed the TPC-H Benchmark developed in PR #12537 with and without the changes.

```
TPC-H Benchmark (With Extra Thread Tasks)
-------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
-------------------------------------------------------------------
BM_Tpch_Q1/ScaleFactor:1    95035633 ns       178700 ns          100

TPC-H Benchmark (Without Extra Thread Tasks)
-------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
-------------------------------------------------------------------
BM_Tpch_Q1/ScaleFactor:1    91511754 ns       182060 ns          100
```

Also, tested with the Query Tester as proposed in PR #12586

```
With Thread Tasks (batch size = 4096)
./query_tester tpch-1
Average Duration: 0.106694s (+/- 0s)
Average Output Rows/S: 37.4902rps
Average Output Bytes/S: 4573.81bps

Without Thread Tasks (batch size = 4096)
./query_tester tpch-1
Average Duration: 0.104658s (+/- 0s)
Average Output Rows/S: 38.2198rps
Average Output Bytes/S: 4662.82bps
```

Closes #12720 from sanjibansg/thread_tasks

Authored-by: Sanjiban Sengupta <sanjiban.sg@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
This PR is based on #12537 and will need to remain in draft until that is merged. In addition, I'd like to fit in a few more builtin queries in this initial PR.
This PR creates a standalone query testing executable, `query_tester`.
The tool takes a number of command line options today:
The tool will first look for a Substrait query in the `queries` folder (there is an example of TPC-H Q1 in JSON format). At the moment this isn't very useful as our Substrait support is very limited.

The tool will then look for builtin queries. I'd like to add support for all of the TPC-H builtin queries.
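The lookup order described above (Substrait file first, builtin query second) can be sketched roughly as follows. This is only an illustration, not the code in this PR: `ResolveQuery`, `QuerySource`, and `ResolvedQuery` are hypothetical names, and the `<name>.json` file naming inside `queries` is an assumption.

```cpp
#include <cassert>
#include <filesystem>
#include <optional>
#include <set>
#include <string>
#include <system_error>

// Hypothetical types for illustration; not taken from the PR.
enum class QuerySource { kSubstraitFile, kBuiltin };

struct ResolvedQuery {
  QuerySource source;
  std::filesystem::path path;  // empty for builtin queries
};

// Resolve a query name: prefer a Substrait JSON file under <root>/queries,
// then fall back to the set of builtin query names.
// ASSUMPTION: Substrait queries are stored as <root>/queries/<name>.json.
std::optional<ResolvedQuery> ResolveQuery(
    const std::filesystem::path& root, const std::string& name,
    const std::set<std::string>& builtin_names) {
  assert(!name.empty());
  const std::filesystem::path candidate = root / "queries" / (name + ".json");
  std::error_code ec;  // non-throwing overload; a missing file is not an error
  if (std::filesystem::is_regular_file(candidate, ec)) {
    return ResolvedQuery{QuerySource::kSubstraitFile, candidate};
  }
  if (builtin_names.count(name) > 0) {
    return ResolvedQuery{QuerySource::kBuiltin, {}};
  }
  return std::nullopt;  // neither a Substrait file nor a builtin query
}
```

With a builtin set containing `tpch-1` and no matching file on disk, `ResolveQuery(root, "tpch-1", builtins)` would fall through to the builtin branch. An unknown name yields `std::nullopt`.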
There is a `datasets` folder that is also created. In the future I'd like to add support for downloading remote datasets to this folder.

The current output simply prints a few statistics:
Note that `Output Rows/S` and `Output Bytes/S` are not very useful for TPC-H queries (which aggregate most of their data, so they have very little output for the amount of work done). I've prototyped adding a much more exhaustive breakdown of time spent by intercepting OT events but I'd like to save that work for a future PR.
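For reference, here is a minimal sketch of how statistics of this kind could be derived from a set of timed runs. It is not the PR's implementation: `RunStats` and `Summarize` are hypothetical names, and treating the `+/-` figure as the sample standard deviation of the run durations is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical summary of repeated query runs; not taken from the PR.
struct RunStats {
  double mean_seconds;
  double stddev_seconds;  // ASSUMPTION: sample standard deviation of durations
  double rows_per_second;
  double bytes_per_second;
};

// Summarize wall-clock durations (in seconds) of repeated runs of one query
// that produces `output_rows` rows and `output_bytes` bytes per run.
RunStats Summarize(const std::vector<double>& durations_seconds,
                   int64_t output_rows, int64_t output_bytes) {
  assert(!durations_seconds.empty());
  const double n = static_cast<double>(durations_seconds.size());

  double sum = 0.0;
  for (double d : durations_seconds) sum += d;
  const double mean = sum / n;

  double squared_error = 0.0;
  for (double d : durations_seconds) squared_error += (d - mean) * (d - mean);
  // With a single run there is no spread to report.
  const double stddev = n > 1 ? std::sqrt(squared_error / (n - 1)) : 0.0;

  return RunStats{mean, stddev,
                  static_cast<double>(output_rows) / mean,
                  static_cast<double>(output_bytes) / mean};
}
```

Because the rates divide output size by elapsed time, a query that aggregates millions of input rows down to a handful of output rows reports a tiny rows-per-second figure no matter how fast the engine is. That is the limitation noted above.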