ARROW-3998: [C++] Add TPC-H Generator #12537
Conversation
jonkeane
left a comment
This is fantastic! I'll pull this locally and try to wire it up in R tomorrow. One comment about licensing the code snippet.
Is it possible to add tests for these generators? Even if we don't necessarily want generated data to go with the tests, even just confirming that they ran + generated the right shape of data + other details would be fantastic.
   `"' ___.,'` j,-'
   `-.__.,--'
*/
// Source: https://stackoverflow.com/questions/1068849/how-do-i-determine-the-number-of-digits-of-an-integer-in-c
We should confirm that this is ok to copy/paste. https://www.apache.org/legal/resolved.html#stackoverflow says no, though I thought I remembered mention of a date after which Stack Overflow content is ok.
It's actually not really a copy - I have the DCHECK, made it work with 64-bit ints, return -1 in the unreachable scenario, and the variable and function names are different. I was mainly using it as a source for justifying why this is the fastest way.
But yes, I'm not sure how different something like this has to be in order for it to count as "original" vs "copied and modified"...
*nods* Modified is ok, and it might be changed enough, but I'll get a second opinion on it as well to ensure we're good here.
Modification doesn't really matter from a legal perspective. Derivative works are only allowed if you have rights to the source.
That being said, we are in a terrible gray area here that I fear will become a can of worms and a waste of everyone's time. Technically the only "proper" approach would be a clean room approach where @save-buffer describes what is needed to someone that has never seen the SO code and that person writes the code. However, legally documenting such a process is a headache.
In this case I think we are (at least ethically) in the clear. The answer author has this statement on their profile page:
All code I post on Stack Overflow is covered by the "Do whatever the heck you want with it" licence, the full text of which is:
Do whatever the heck you want with it.
There are a few edits by other authors, but none of them touches the code included here. My opinion would be to proceed as is (but get rid of the ASCII art, I can't abide fun).
Good point about tests - I will add some.
#include "arrow/result.h"
#include "arrow/status.h"
#include "arrow/type.h"
#include "arrow/util/pcg_random.h"
I don't think these headers (or the ones in vendored) are actually pulled in by default; at least they weren't pulled in when I built using the instructions: https://arrow.apache.org/docs/r/articles/developers/setup.html#step-2---configure-the-libarrow-build
I copied the vendored/pcg/*.hpp over manually, but I suspect we'll need to add those to CMake somewhere so that happens as part of the build, now that these are used in compute.
Good point. The pcg subdirectory needs to be added to https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/CMakeLists.txt
TIL about that file!
And one more thing I thought of as I was doing this: for the tables that have keys shared across them (e.g. lineitems and orders), is it possible to generate those at the same time so that the keys are the same across both? I can figure out whether generating them separately is an issue by comparing against duckdb or running the queries through arrow bench.
@jonkeane They get generated at the same time if you use the same TpchGen object.
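To make that concrete, here is a rough sketch of what sharing one generator looks like on the C++ side. This is a sketch only: the header path, namespace, and the per-table factory methods (`Orders()`, `Lineitem()`) are assumptions based on this PR's API, not verbatim code.

```cpp
#include "arrow/compute/exec/exec_plan.h"  // header paths assumed
#include "arrow/compute/exec/tpch_node.h"  // assumed location of TpchGen
#include "arrow/result.h"

namespace cp = arrow::compute;

arrow::Status GenerateLinkedTables() {
  ARROW_ASSIGN_OR_RAISE(auto plan, cp::ExecPlan::Make());
  // One TpchGen instance per "database": both source nodes draw from the same
  // generator state, so o_orderkey and l_orderkey refer to the same orders.
  ARROW_ASSIGN_OR_RAISE(auto gen, cp::TpchGen::Make(plan.get(), /*scale_factor=*/1.0));
  ARROW_ASSIGN_OR_RAISE(cp::ExecNode* orders, gen->Orders());
  ARROW_ASSIGN_OR_RAISE(cp::ExecNode* lineitem, gen->Lineitem());
  // ... attach sinks or write nodes to `orders` and `lineitem`, then start the plan.
  (void)orders;
  (void)lineitem;
  return arrow::Status::OK();
}
```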
Cool, so then the next thing would be to figure out if there's a way to emit and write these record batch readers simultaneously. (And that might actually need to be done all in C++; the R execution paradigm might make it hard to have one plan that is emitting two sets of record batches and writes them simultaneously.)
Yeah, one thing to think about is what the R code would look like. You'd need some way of "linking" the two tables in R, right?
From the R side, yeah. The hard part is that we want to take each of these record batch readers and write them to parquet files, but we don't have an object in the R bindings that is designed to hold multiple record batch readers and kick off the writing for both.
Though we should be able to construct a plan that includes write nodes for both of those record batch readers in C++ and then let C++ handle the rest... (no need to try and shoehorn that multithreading into R, which will be reluctant to make that easy!) I'm digging into those now to see if I can wire that up myself...
Yeah, I'm pretty sure a single plan could have one tpch gen node for each table and then each of those nodes would be attached to a dedicated write node. I have done very little testing with multiple sink nodes though so it would be an interesting experiment.
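For what it's worth, a sketch of what that plan could look like, building on the snippet above with the same assumptions. The write options mirror the FileSystemDatasetWriteOptions fields shown further down in this thread; the registration call and header paths are assumptions.

```cpp
#include "arrow/compute/exec/exec_plan.h"  // header paths assumed
#include "arrow/dataset/file_base.h"
#include "arrow/dataset/file_parquet.h"
#include "arrow/dataset/plan.h"
#include "arrow/filesystem/filesystem.h"
#include "arrow/result.h"

namespace cp = arrow::compute;
namespace ds = arrow::dataset;

arrow::Status WriteOrdersAndLineitem(std::shared_ptr<arrow::fs::FileSystem> fs,
                                     const std::string& base_dir) {
  // Registers the dataset exec nodes such as "write" (assumed to be needed here).
  arrow::dataset::internal::Initialize();

  ARROW_ASSIGN_OR_RAISE(auto plan, cp::ExecPlan::Make());
  ARROW_ASSIGN_OR_RAISE(auto gen, cp::TpchGen::Make(plan.get()));
  ARROW_ASSIGN_OR_RAISE(cp::ExecNode* orders, gen->Orders());
  ARROW_ASSIGN_OR_RAISE(cp::ExecNode* lineitem, gen->Lineitem());

  auto make_write_options = [&](const std::string& table_dir) {
    ds::FileSystemDatasetWriteOptions opts;
    opts.file_write_options = std::make_shared<ds::ParquetFileFormat>()->DefaultWriteOptions();
    opts.filesystem = fs;
    opts.base_dir = base_dir + "/" + table_dir;
    opts.partitioning = ds::Partitioning::Default();
    opts.basename_template = "part{i}.parquet";
    return opts;
  };

  // One dedicated write node per generator node, all inside the same plan.
  ARROW_RETURN_NOT_OK(cp::MakeExecNode(
      "write", plan.get(), {orders},
      ds::WriteNodeOptions{make_write_options("orders"), orders->output_schema()}));
  ARROW_RETURN_NOT_OK(cp::MakeExecNode(
      "write", plan.get(), {lineitem},
      ds::WriteNodeOptions{make_write_options("lineitem"), lineitem->output_schema()}));

  ARROW_RETURN_NOT_OK(plan->StartProducing());
  return plan->finished().status();
}
```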
Yeah that was what I was thinking. I just did a bit of tinkering with one gen node and a write node, but don't have it working quite yet. I'll push what I have when I get home and maybe it'll be obvious what I'm doing wrong. Tangentially: ideally, I would like each table to go to a single parquet file (with multiple row groups, and written batch by batch). Do we have the node machinery for that already? Or just machinery for writing to datasets?
I'm not sure we need a dedicated write node for single-file writes. What you're describing is the "default partitioning", but looking at the code I'm not sure if we default to that or if we default to a segmentation fault when no partitioning is specified. 😬
Well that might explain the seg fault I got! Thanks for that, I'll try piping that in. I have some thoughts about whether that actually works, but I need to dig some more. My issues with trying to use it that way previously might have been from the R bindings and not C++, so I might be able to work around them. I know the single file write is not standard, and unlikely to be the right thing for most people, but in this case it helps a lot with compatibility.
r/src/compute-exec.cpp
Outdated
}

// [[arrow::export]]
void Tpch_Dbgen_Write(
This is not the ultimate shape that this function will take, but it was my first attempt at using the write node. It is currently segfaulting and I've commented some of the silly things I've done in-line.
This function is a bit separate from the PR, so if getting this working will delay merging the larger PR, I'm happy to pull it into a separate one.
r/src/compute-exec.cpp
Outdated
auto base_path = base_dir + "/parquet_dataset";
filesystem->CreateDir(base_path);

auto format = std::make_shared<ds::ParquetFileFormat>();

ds::FileSystemDatasetWriteOptions write_options;
write_options.file_write_options = format->DefaultWriteOptions();
write_options.existing_data_behavior = ds::ExistingDataBehavior::kDeleteMatchingPartitions;
write_options.filesystem = filesystem;
write_options.base_dir = base_path;
write_options.partitioning = arrow::dataset::Partitioning::Default();
write_options.basename_template = "part{i}.parquet";
write_options.max_partitions = 1024;
A lot of this is hard-coded for now to get it working. One thing that surprised me a little bit is that we have slightly different write options for this than we do for datasets. I think I need to wire up FileSystemDatasetWriteOptions (or figure out if it gets translated in the right way to get to that point with our R wiring of FileWriteOptions).
r/src/compute-exec.cpp
Outdated
// TODO: this had a checked_cast in front of it in the code I adapted it from
// but I ran into namespace issues when doing it so I took it out to see if it
// worked, but maybe that's what's causing the segfault?
const ds::WriteNodeOptions options =
    ds::WriteNodeOptions{write_options, table->output_schema()};
This might be exactly what's causing the segfault; the code I adapted this from had a checked_cast in front of it.
void AppendNumberPaddedToNineDigits(char *out, int64_t x)
{
  // We do all of this to avoid calling snprintf, which does a lot of crazy
  // locale stuff. On Windows and MacOS this can get suuuuper slow
We have formatting utilities in arrow/util/formatting.h, no need to reinvent them.
@save-buffer looks like this comment is still unaddressed. Also, if speed is a concern, I noticed ~2x faster performance with arrow::internal::detail::FormatAllDigitsLeftPadded than with AppendNumberPaddedToNineDigits.
Presumably this involves vendoring some official TPC code? If so, we should confirm its license and add a note to LICENSE.txt.
No, this is all original content. I followed the official spec, but no code was taken.
pitrou
left a comment
+1. I pushed a couple minor changes.
Hmm, there's an ASAN issue that needs to be looked at before merging this. I can do that tomorrow if you want.
static Result<TpchGen> Make(ExecPlan* plan, double scale_factor = 1.0,
                            int64_t batch_size = 4096,
                            util::optional<int64_t> seed = util::nullopt);
static Result<std::unique_ptr<TpchGen>> Make(
This does not need to be a unique_ptr. The whole point of using shared_ptr for the order/lineitem and part/partsupp generators is so that this TpchGen object can safely die.
Also none of the methods in this class need to be virtual.
As the commit title suggested, the point here is to hide implementation details from the .h, and minimize header inclusion.
It can be a unique_ptr, a shared_ptr or the pimpl pattern, but each of these patterns requires some form of dynamic allocation.
This reverts commit 31e693e.
@save-buffer I would have appreciated it if you had asked for the changeset motivation instead of bluntly reverting it. The point here was to minimize the exposure of implementation details in a public header.
Apologies, I'll not revert like that in the future. We can always revert the revert. As for the problem of increasing transitive inclusions, I agree keeping compilation time down is a worthy goal. I guess this header file now will transitively include […]. I think either way we don't have to make the factory return a unique_ptr.
Indeed, we don't. We can use the pimpl idiom instead (which might be slightly more verbose). The virtual methods seem harmless to me, given that the function call cost is not critical here, but either way is ok :-)
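For reference, a minimal pimpl sketch of the alternative being discussed (hypothetical names, not the code in this PR). It keeps implementation-only includes such as pcg_random.h out of the public header, at the cost of one heap allocation and an out-of-line destructor:

```cpp
// tpch_gen.h (hypothetical): the public class owns an opaque Impl, so the
// header only needs forward declarations and <memory>.
#include <memory>
#include "arrow/result.h"

namespace arrow {
namespace compute {

class ExecPlan;   // forward declarations instead of including exec_plan.h
class ExecNode;

class TpchGen {
 public:
  static Result<TpchGen> Make(ExecPlan* plan, double scale_factor = 1.0);

  ~TpchGen();                   // defined in the .cc, where Impl is complete
  TpchGen(TpchGen&&) noexcept;
  TpchGen& operator=(TpchGen&&) noexcept;

  Result<ExecNode*> Orders();   // non-virtual; dispatch happens inside Impl
  Result<ExecNode*> Lineitem();

 private:
  class Impl;                   // all generator state lives here
  explicit TpchGen(std::unique_ptr<Impl> impl);
  std::unique_ptr<Impl> impl_;  // the one dynamic allocation pimpl requires
};

}  // namespace compute
}  // namespace arrow
```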
I guess my biggest gripe is with the unnecessary heap allocation. But I guess it's not a big deal (until we begin measuring ExecPlans-per-second 😛) |
This reverts commit df884c1.
This PR modifies the `SubmitTask` and `Finish` methods of MapNode in `ExecPlan` to avoid scheduling extra thread tasks. Performed the TPC-H Benchmark developed in PR #12537 with and without the changes.

```
TPC-H Benchmark (With Extra Thread Tasks)
-------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
-------------------------------------------------------------------
BM_Tpch_Q1/ScaleFactor:1    95035633 ns       178700 ns          100

TPC-H Benchmark (Without Extra Thread Tasks)
-------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
-------------------------------------------------------------------
BM_Tpch_Q1/ScaleFactor:1    91511754 ns       182060 ns          100
```

Also, tested with the Query Tester as proposed in PR #12586

```
With Thread Tasks (batch size = 4096)
./query_tester tpch-1
Average Duration: 0.106694s (+/- 0s)
Average Output Rows/S: 37.4902rps
Average Output Bytes/S: 4573.81bps

Without Thread Tasks (batch size = 4096)
./query_tester tpch-1
Average Duration: 0.104658s (+/- 0s)
Average Output Rows/S: 38.2198rps
Average Output Bytes/S: 4662.82bps
```

Closes #12720 from sanjibansg/thread_tasks

Authored-by: Sanjiban Sengupta <sanjiban.sg@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Great! The s3fs failures on macOS are unrelated, so I'm gonna merge now. Thank you @save-buffer.
Benchmark runs are scheduled for baseline = 0deefd1 and contender = 50fab73. 50fab73 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
This PR contains an implementation of a multithreaded TPC-H dbgen, as well as an implementation of Q1 as a Google Benchmark. The advantage of this dbgen approach is that it is a scan node: it generates data on the fly and streams it over. As a result, I was, for instance, able to run Q1 at scale factor 1000 on my desktop with only 32 GB of RAM.
I did verify the results of Q1. They don't exactly match the reference results, but they are quite close and well within what I'd expect the variance between random number generators to be.
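To illustrate the streaming aspect, here is a rough sketch of consuming one table through a sink node. The sink and generator-reader calls are from the ExecPlan machinery of this era; the TpchGen method names, namespace, and header paths are assumptions, as above.

```cpp
#include "arrow/compute/exec/exec_plan.h"  // header paths assumed
#include "arrow/compute/exec/options.h"
#include "arrow/record_batch.h"
#include "arrow/result.h"
#include "arrow/util/async_generator.h"

namespace cp = arrow::compute;

arrow::Status StreamLineitem(double scale_factor) {
  ARROW_ASSIGN_OR_RAISE(auto plan, cp::ExecPlan::Make());
  ARROW_ASSIGN_OR_RAISE(auto gen, cp::TpchGen::Make(plan.get(), scale_factor));
  ARROW_ASSIGN_OR_RAISE(cp::ExecNode* lineitem, gen->Lineitem());

  // The sink node exposes the node's output as an async generator of batches.
  arrow::AsyncGenerator<arrow::util::optional<cp::ExecBatch>> sink_gen;
  ARROW_RETURN_NOT_OK(cp::MakeExecNode("sink", plan.get(), {lineitem},
                                       cp::SinkNodeOptions{&sink_gen}));

  auto reader = cp::MakeGeneratorReader(lineitem->output_schema(),
                                        std::move(sink_gen),
                                        arrow::default_memory_pool());
  ARROW_RETURN_NOT_OK(plan->StartProducing());

  // Batches arrive incrementally, so even a large scale factor never
  // materializes the whole table in memory.
  std::shared_ptr<arrow::RecordBatch> batch;
  do {
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    // ... feed `batch` into a query, a writer, etc.
  } while (batch != nullptr);

  return plan->finished().status();
}
```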