@westonpace
Member

This PR is based on #12537 and will need to remain in draft until that is merged. In addition, I'd like to fit in a few more builtin queries in this initial PR.

This PR creates a standalone query-testing executable, query_tester.

The tool takes a number of command line options today:

Usage: query_tester [options] query 

Positional arguments:
query            	name of the query to run [required]

Optional arguments:
-h --help        	shows help message and exits
-v --version     	prints version information and exits
--num-iterations 	[default: 1]
--cpu-threads    	size to use for the CPU thread pool, default controlled by Arrow
--io-threads     	size to use for the I/O thread pool, default controlled by Arrow
--validate       	if set the program will validate the query results [default: false] (not yet implemented)
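
For readers who want to see how the options above fit together, here is a minimal sketch of parsing them into a struct. It is not the parser used in this PR: `Options` and `ParseOptions` are hypothetical names, only the standard library is used, and `-h`/`-v` are left out for brevity. The thread counts are kept optional so that an unset value can mean "leave Arrow's default alone".

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for the documented options (not the PR's actual type).
struct Options {
  std::string query;               // positional, required
  int num_iterations = 1;          // --num-iterations, default 1
  std::optional<int> cpu_threads;  // --cpu-threads, unset means Arrow's default
  std::optional<int> io_threads;   // --io-threads, unset means Arrow's default
  bool validate = false;           // --validate, a flag with no value
};

// Parses arguments (without the program name). Throws std::invalid_argument
// on unknown options, missing values, or a missing query name.
Options ParseOptions(const std::vector<std::string>& args) {
  Options opts;
  bool have_query = false;
  for (std::size_t i = 0; i < args.size(); ++i) {
    const std::string& arg = args[i];
    // Consumes the value that follows a "--flag value" pair.
    auto next_int = [&]() -> int {
      if (i + 1 >= args.size()) {
        throw std::invalid_argument(arg + " requires a value");
      }
      return std::stoi(args[++i]);
    };
    if (arg == "--num-iterations") {
      opts.num_iterations = next_int();
    } else if (arg == "--cpu-threads") {
      opts.cpu_threads = next_int();
    } else if (arg == "--io-threads") {
      opts.io_threads = next_int();
    } else if (arg == "--validate") {
      opts.validate = true;
    } else if (!arg.empty() && arg[0] == '-') {
      throw std::invalid_argument("unknown option: " + arg);
    } else if (!have_query) {
      opts.query = arg;
      have_query = true;
    } else {
      throw std::invalid_argument("unexpected extra argument: " + arg);
    }
  }
  if (!have_query) {
    throw std::invalid_argument("query name is required");
  }
  return opts;
}
```

With this sketch, the arguments `tpch-1 --num-iterations 10` would yield `query == "tpch-1"`, `num_iterations == 10`, and both thread counts unset.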

The tool will first look for a Substrait query in the queries folder (there is an example of TPC-H Q1 in JSON format). At the moment this isn't very useful as our Substrait support is very limited.

The tool will then look for a builtin query with that name. I'd like to add builtin versions of all of the TPC-H queries.
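
The lookup order described above (Substrait file first, builtin second) could be sketched roughly as follows. This is an illustration only, not the code in this PR: `ResolveQuery`, `ResolvedQuery`, and `QuerySource` are hypothetical names, and the assumption that a Substrait plan lives at `<queries>/<name>.json` is mine.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <optional>
#include <set>
#include <sstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Where a query name was found. All names in this sketch are hypothetical.
enum class QuerySource { kSubstraitFile, kBuiltin };

struct ResolvedQuery {
  QuerySource source;
  std::string payload;  // plan JSON text for Substrait, query name for a builtin
};

// Tries "<queries_dir>/<name>.json" first (assumed layout), then falls back to
// the set of builtin query names. Returns nullopt if neither lookup succeeds.
std::optional<ResolvedQuery> ResolveQuery(const std::string& name,
                                          const fs::path& queries_dir,
                                          const std::set<std::string>& builtins) {
  assert(!name.empty());
  const fs::path candidate = queries_dir / (name + ".json");
  std::error_code ec;  // use the non-throwing overload for a missing directory
  if (fs::is_regular_file(candidate, ec)) {
    std::ifstream in(candidate);
    std::ostringstream text;
    text << in.rdbuf();  // read the whole plan file into memory
    return ResolvedQuery{QuerySource::kSubstraitFile, text.str()};
  }
  if (builtins.count(name) > 0) {
    return ResolvedQuery{QuerySource::kBuiltin, name};
  }
  return std::nullopt;
}
```

Returning `std::nullopt` rather than throwing lets the caller decide how to report an unknown query name.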

There is a datasets folder that is also created. In the future I'd like to add support for downloading remote datasets to this folder.

The current output simply prints a few statistics:

(conbench3) pace@pace-desktop:~/dev/arrow/dev/qtester/debug-build$ ./query_tester tpch-1 --num-iterations 10
Average       Duration: 1.19292s (+/- 0.00569406s)
Average Output  Rows/S: 3.35311rps
Average Output Bytes/S: 409.08bps

Note that Output Rows/S and Output Bytes/S are not very useful for TPC-H queries, which aggregate most of their data and so produce very little output for the amount of work done. I've prototyped a much more exhaustive breakdown of time spent by intercepting OT events, but I'd like to save that work for a future PR.
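
As a rough illustration of how statistics like the three printed above can be derived, here is a standalone sketch. It is not the implementation in this PR: `RunOutput`, `Summary`, `Summarize`, and `Measure` are hypothetical names, the spread is assumed to be a population standard deviation (which is 0 for a single iteration), and throughput is assumed to be total output divided by total time. The sample output above is consistent with roughly 4 output rows per run (3.35311 rows/s × 1.19292 s ≈ 4), which is why the rate looks so small.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical result of one query run: how much data reached the sink.
struct RunOutput {
  int64_t rows;
  int64_t bytes;
};

// Hypothetical summary mirroring the three printed lines.
struct Summary {
  double mean_seconds = 0;
  double stddev_seconds = 0;
  double rows_per_second = 0;
  double bytes_per_second = 0;
};

// Pure aggregation step, kept separate from timing so it is easy to check.
Summary Summarize(const std::vector<double>& seconds,
                  const std::vector<RunOutput>& outputs) {
  assert(!seconds.empty() && seconds.size() == outputs.size());
  const double n = static_cast<double>(seconds.size());
  double total_seconds = 0, total_rows = 0, total_bytes = 0;
  for (std::size_t i = 0; i < seconds.size(); ++i) {
    total_seconds += seconds[i];
    total_rows += static_cast<double>(outputs[i].rows);
    total_bytes += static_cast<double>(outputs[i].bytes);
  }
  Summary s;
  s.mean_seconds = total_seconds / n;
  double squared_error = 0;
  for (double d : seconds) {
    squared_error += (d - s.mean_seconds) * (d - s.mean_seconds);
  }
  s.stddev_seconds = std::sqrt(squared_error / n);  // population stddev (assumed)
  if (total_seconds > 0) {  // avoid dividing by zero for trivially fast runs
    s.rows_per_second = total_rows / total_seconds;
    s.bytes_per_second = total_bytes / total_seconds;
  }
  return s;
}

// Times `run_query` num_iterations times using a monotonic clock.
Summary Measure(const std::function<RunOutput()>& run_query, int num_iterations) {
  std::vector<double> seconds;
  std::vector<RunOutput> outputs;
  for (int i = 0; i < num_iterations; ++i) {
    const auto start = std::chrono::steady_clock::now();
    outputs.push_back(run_query());
    const auto stop = std::chrono::steady_clock::now();
    seconds.push_back(std::chrono::duration<double>(stop - start).count());
  }
  return Summarize(seconds, outputs);
}
```

Splitting `Summarize` from `Measure` keeps the arithmetic testable with fixed durations, independent of wall-clock noise.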

@github-actions

github-actions bot commented Mar 9, 2022

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

@westonpace changed the title from "Feature/arrow 15877 query testing tool" to "ARROW-15877: [C++] Add a C++ query testing tool" on Mar 9, 2022
namespace compute
{

std::shared_ptr<ExecPlan> Plan_Q1(AsyncGenerator<util::optional<ExecBatch>> &sink_gen, int scale_factor)
Contributor

This code is almost identical to tpch1() in builtin_queries.cc

Member Author

Yes, I created tpch1() by leveraging this. Once the PR adding tpch_benchmark.cc merges I will update this PR to remove tpch_benchmark.cc in favor of this (or perhaps keep the benchmark but link to builtin queries).

westonpace pushed a commit that referenced this pull request Mar 30, 2022
This PR modifies the `SubmitTask` and `Finish` methods of MapNode in `ExecPlan` to avoid scheduling extra thread tasks.

Performed the TPC-H Benchmark developed in PR #12537 with and without the changes.
```
TPC-H Benchmark (With Extra Thread Tasks)
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_Tpch_Q1/ScaleFactor:1   95035633 ns       178700 ns          100

TPC-H Benchmark (Without Extra Thread Tasks)
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_Tpch_Q1/ScaleFactor:1   91511754 ns       182060 ns          100

```

Also, tested with the Query Tester as proposed in PR #12586
```
With Thread Tasks (batch size = 4096)
./query_tester tpch-1
Average       Duration: 0.106694s (+/- 0s)
Average Output  Rows/S: 37.4902rps
Average Output Bytes/S: 4573.81bps

Without Thread Tasks (batch size = 4096)
./query_tester tpch-1
Average       Duration: 0.104658s (+/- 0s)
Average Output  Rows/S: 38.2198rps
Average Output Bytes/S: 4662.82bps
```

Closes #12720 from sanjibansg/thread_tasks

Authored-by: Sanjiban Sengupta <sanjiban.sg@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@westonpace closed this Jan 13, 2023