ARROW-15877: [C++] Add a C++ query testing tool #12586

westonpace · 2022-03-09T01:54:47Z

This PR is based on #12537 and will need to remain in draft until that is merged. In addition, I'd like to fit in a few more builtin queries in this initial PR.

This PR creates a standalone query testing executable, `query_tester`.

The tool takes a number of command line options today:

Usage: query_tester [options] query 

Positional arguments:
query            	name of the query to run [required]

Optional arguments:
-h --help        	shows help message and exits
-v --version     	prints version information and exits
--num-iterations 	[default: 1]
--cpu-threads    	size to use for the CPU thread pool, default controlled by Arrow
--io-threads     	size to use for the I/O thread pool, default controlled by Arrow
--validate       	if set the program will validate the query results [default: false] (not yet implemented)
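
As an illustration of how the options above could map onto a plain C++ structure, here is a minimal sketch that uses only the standard library. `QueryTesterOptions` and `ParseOptions` are hypothetical names and are not taken from this PR; `--help` and `--version` are omitted for brevity, and the "must be positive" check is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container mirroring the options in the usage text.
struct QueryTesterOptions {
  std::string query;               // positional, required
  int num_iterations = 1;          // --num-iterations [default: 1]
  std::optional<int> cpu_threads;  // unset: keep Arrow's default pool size
  std::optional<int> io_threads;   // unset: keep Arrow's default pool size
  bool validate = false;           // --validate [default: false]
};

// Parses the arguments that follow argv[0].
// Throws std::invalid_argument on unknown flags or missing/invalid values.
inline QueryTesterOptions ParseOptions(const std::vector<std::string>& args) {
  QueryTesterOptions opts;
  bool have_query = false;
  std::size_t i = 0;

  // Consumes the argument after the current flag and parses it as an int > 0.
  auto next_positive_int = [&](const std::string& flag) {
    if (i + 1 >= args.size()) {
      throw std::invalid_argument(flag + " requires a value");
    }
    const int value = std::stoi(args[++i]);
    if (value <= 0) {
      throw std::invalid_argument(flag + " must be positive");
    }
    return value;
  };

  for (; i < args.size(); ++i) {
    const std::string& arg = args[i];
    if (arg == "--num-iterations") {
      opts.num_iterations = next_positive_int(arg);
    } else if (arg == "--cpu-threads") {
      opts.cpu_threads = next_positive_int(arg);
    } else if (arg == "--io-threads") {
      opts.io_threads = next_positive_int(arg);
    } else if (arg == "--validate") {
      opts.validate = true;
    } else if (arg.size() > 1 && arg[0] == '-') {
      throw std::invalid_argument("unknown option: " + arg);
    } else if (!have_query) {
      opts.query = arg;
      have_query = true;
    } else {
      throw std::invalid_argument("unexpected extra argument: " + arg);
    }
  }
  if (!have_query) {
    throw std::invalid_argument("missing required argument: query");
  }
  assert(opts.num_iterations >= 1);
  return opts;
}
```

Keeping the thread counts as `std::optional<int>` lets the caller tell "not specified" apart from an explicit value, which matches the "default controlled by Arrow" wording in the help text.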

The tool will first look for a Substrait query in the queries folder (there is an example of TPC-H Q1 in JSON format). At the moment this isn't very useful as our Substrait support is very limited.

The tool will then look for builtin queries. I'd like to add support for all of the TPC-H queries as builtins.
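
The lookup order described above (a Substrait file first, then a builtin) could be sketched roughly as follows. This is not the code in this PR: `ResolveQuery`, `ResolvedQuery`, and `QuerySource` are hypothetical names, and the `<queries_dir>/<name>.json` file layout is an assumption made for the example.

```cpp
#include <cassert>
#include <filesystem>
#include <optional>
#include <set>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

enum class QuerySource { kSubstraitFile, kBuiltin };

struct ResolvedQuery {
  QuerySource source;
  fs::path substrait_path;  // only meaningful for kSubstraitFile
};

// Resolves a query name, preferring a Substrait file over a builtin.
// Returns std::nullopt when the name matches neither.
inline std::optional<ResolvedQuery> ResolveQuery(
    const fs::path& queries_dir, const std::string& name,
    const std::set<std::string>& builtin_names) {
  assert(!name.empty());  // the query name is a required argument

  // 1. Look for a Substrait plan stored as <queries_dir>/<name>.json
  //    (assumed naming scheme).
  const fs::path candidate = queries_dir / (name + ".json");
  std::error_code ec;  // non-throwing overload: a missing directory is not an error
  if (fs::is_regular_file(candidate, ec)) {
    return ResolvedQuery{QuerySource::kSubstraitFile, candidate};
  }

  // 2. Fall back to the set of builtin query names.
  if (builtin_names.count(name) > 0) {
    return ResolvedQuery{QuerySource::kBuiltin, fs::path{}};
  }
  return std::nullopt;
}
```

With this ordering, dropping a JSON plan into the queries folder overrides a builtin of the same name, which may or may not be the behavior the tool ends up with.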

There is a datasets folder that is also created. In the future I'd like to add support for downloading remote datasets to this folder.

The current output simply prints a few statistics:

(conbench3) pace@pace-desktop:~/dev/arrow/dev/qtester/debug-build$ ./query_tester tpch-1 --num-iterations 10
Average       Duration: 1.19292s (+/- 0.00569406s)
Average Output  Rows/S: 3.35311rps
Average Output Bytes/S: 409.08bps

Note that Output Rows/S and Output Bytes/S are not very useful for TPC-H queries, which aggregate most of their data and so produce very little output for the amount of work done. I've prototyped a much more exhaustive breakdown of time spent by intercepting OT events, but I'd like to save that work for a future PR.
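
To make the three reported statistics concrete, here is a minimal sketch of how they could be derived from per-iteration measurements. The names (`IterationResult`, `Summary`, `Summarize`) are hypothetical, and two choices are assumptions rather than the tool's confirmed behavior: the spread is a population standard deviation, and throughput is total output divided by total time.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// One measured run of a query (hypothetical structure).
struct IterationResult {
  double seconds;             // wall-clock duration of the run
  std::int64_t output_rows;   // rows emitted by the plan's sink
  std::int64_t output_bytes;  // bytes emitted by the plan's sink
};

struct Summary {
  double mean_seconds;
  double stddev_seconds;    // population standard deviation (assumption)
  double rows_per_second;   // total rows / total seconds (assumption)
  double bytes_per_second;  // total bytes / total seconds (assumption)
};

inline Summary Summarize(const std::vector<IterationResult>& runs) {
  assert(!runs.empty());
  const double n = static_cast<double>(runs.size());

  double total_seconds = 0.0;
  double total_rows = 0.0;
  double total_bytes = 0.0;
  for (const IterationResult& run : runs) {
    total_seconds += run.seconds;
    total_rows += static_cast<double>(run.output_rows);
    total_bytes += static_cast<double>(run.output_bytes);
  }
  assert(total_seconds > 0.0);
  const double mean = total_seconds / n;

  // Sum of squared deviations from the mean duration.
  double squared_deviation = 0.0;
  for (const IterationResult& run : runs) {
    const double delta = run.seconds - mean;
    squared_deviation += delta * delta;
  }

  return Summary{mean, std::sqrt(squared_deviation / n),
                 total_rows / total_seconds, total_bytes / total_seconds};
}
```

Under these definitions a single iteration always reports a spread of zero, and a rate of 3.35311 rows/s over 1.19292 s corresponds to roughly four output rows per run, which illustrates why the throughput figures say little about an aggregation-heavy query.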

github-actions · 2022-03-09T01:55:06Z

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

github-actions · 2022-03-09T03:59:33Z

https://issues.apache.org/jira/browse/ARROW-15877

joosthooz · 2022-03-10T14:15:20Z

cpp/src/arrow/compute/exec/tpch_benchmark.cc

+namespace compute
+{
+
+std::shared_ptr<ExecPlan> Plan_Q1(AsyncGenerator<util::optional<ExecBatch>> &sink_gen, int scale_factor)


This code is almost identical to tpch1() in builtin_queries.cc

westonpace · 2022-03-10

Yes, I created tpch1() by adapting this code. Once the PR adding tpch_benchmark.cc merges, I will update this PR to remove tpch_benchmark.cc in favor of tpch1() (or perhaps keep the benchmark but have it link to the builtin queries).

This PR modifies the `SubmitTask` and `Finish` methods of MapNode in `ExecPlan` to avoid scheduling extra thread tasks.

Performed the TPC-H Benchmark developed in PR #12537 with and without the changes.

```
TPC-H Benchmark (With Extra Thread Tasks)
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_Tpch_Q1/ScaleFactor:1   95035633 ns       178700 ns          100

TPC-H Benchmark (Without Extra Thread Tasks)
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_Tpch_Q1/ScaleFactor:1   91511754 ns       182060 ns          100
```

Also, tested with the Query Tester as proposed in PR #12586

```
With Thread Tasks (batch size = 4096)
./query_tester tpch-1
Average       Duration: 0.106694s (+/- 0s)
Average Output  Rows/S: 37.4902rps
Average Output Bytes/S: 4573.81bps

Without Thread Tasks (batch size = 4096)
./query_tester tpch-1
Average       Duration: 0.104658s (+/- 0s)
Average Output  Rows/S: 38.2198rps
Average Output Bytes/S: 4662.82bps
```

Closes #12720 from sanjibansg/thread_tasks

Authored-by: Sanjiban Sengupta <sanjiban.sg@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
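
For a sense of scale, the before/after timings quoted above can be turned into a relative reduction. The helper below is a hypothetical illustration, not part of either PR. Applied to the quoted numbers, it gives a reduction of about 3.7% for the benchmark (95035633 ns to 91511754 ns) and about 1.9% for the query tester run (0.106694 s to 0.104658 s). The query tester figures come from a single iteration (+/- 0s), so they carry no information about run-to-run noise.

```cpp
#include <cassert>

// Fraction by which `after` is smaller than `before`,
// e.g. 0.25 means the new time is 25% lower than the old one.
inline double RelativeReduction(double before, double after) {
  assert(before > 0.0);
  return (before - after) / before;
}
```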

save-buffer and others added 10 commits March 2, 2022 12:02

Add TPC-H Generator

342c3c0

Draft of R bindings

c4495dc

Fix bugs, parallel text generation, rudimentary tests

d7c508c

Uncommenting R tests, and a first stab at the filewriter C++

289337e

Make it actually multithreaded

2c580ac

Fill new arrays with empty Datums explicitly

de2305a

Add some tests, fix some bugs

3eb99c6

Merge remote-tracking branch 'save-buffer/sasha_tpch' into feature/qtester-base

4fc7bef

First pass at a query testing tool.

7f3e6bc

Added in empty datasets directory. It will be a destination for downloaded datasets in the future and is needed for the query tester to recognize the root directory.

6848a8b

github-actions bot added Component: C++ Component: R labels Mar 9, 2022

ARROW-15877: Moved the standalone query-tester executable into the cpp directory

239b20f

westonpace changed the title ~~Feature/arrow 15877 query testing tool~~ ARROW-15877: [C++] Add a C++ query testing tool Mar 9, 2022

ARROW-15877: ExecContext was not using the thread pool

a3b4362

joosthooz reviewed Mar 10, 2022

sanjibansg mentioned this pull request Mar 25, 2022

ARROW-15994: [C++] Back out taskify changes #12720

Closed

westonpace closed this Jan 13, 2023

4 participants
