Conversation

@thisisnic (Member) commented May 7, 2021

This also adds missing spaces in some unrelated R files

@github-actions bot commented May 7, 2021

@ianmcook (Member) commented May 7, 2021

Ultimately I think we should try to do scalar value recycling in C++ code, but I think this a great temporary solution in the meantime.

What happens if you pass data frames instead of vectors and one of them has length one (i.e. only one row)? Maybe add a test to check the behavior in that case.

And what happens if you pass Arrow arrays instead of R vectors and one of them has length 1? E.g.

Table$create(a = Array$create(1:10), b = Array$create(5))
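The recycling behaviour under discussion can be sketched with plain R vectors (a hypothetical helper, not arrow's actual implementation): length-1 inputs are repeated to the longest length, and any other mismatch is an error.

```r
# Hypothetical sketch of scalar recycling across columns (not arrow's
# implementation). Length-1 columns are repeated to the longest length;
# any other length mismatch raises an error.
recycle_columns <- function(columns) {
  lens <- lengths(columns)
  max_len <- max(lens)
  if (!all(lens %in% c(1, max_len))) {
    stop("All columns must have the same length or length 1")
  }
  columns[lens == 1] <- lapply(columns[lens == 1], rep, times = max_len)
  columns
}

cols <- recycle_columns(list(a = 1:10, b = 5))
length(cols$b)  # 10
```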

@thisisnic (Member Author)

> What happens if you pass data frames instead of vectors and one of them has length one (i.e. only one row)? Maybe add a test to check the behavior in that case.

Will do.

> And what happens if you pass Arrow arrays instead of R vectors and one of them has length 1? E.g.
>
> Table$create(a = Array$create(1:10), b = Array$create(5))

An error; I will make changes to handle this too and add appropriate tests.

@thisisnic (Member Author)

@ianmcook I think this is something for a separate ticket/PR, but when I was testing things you mentioned above, I found that it is possible to create Table and RecordBatch objects with duplicated column names, which then results in errors when I try to analyse them, e.g.

Table$create(iris, iris) %>% filter(Species == "versicolor")

Error in schm$GetFieldByName(name)$ToString() : attempt to apply non-function

Shall I create a new ticket to add code which checks column names are unique when combining objects?
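For illustration, a minimal uniqueness check of the kind being proposed might look like this (a base-R sketch with a hypothetical helper name, not the eventual fix):

```r
# Hypothetical sketch: error if any column names are duplicated
# (not the actual arrow fix).
check_unique_names <- function(nms) {
  dupes <- unique(nms[duplicated(nms)])
  if (length(dupes) > 0) {
    stop("Duplicated column names: ", paste(dupes, collapse = ", "))
  }
  invisible(nms)
}

check_unique_names(c("a", "b"))  # passes silently
# check_unique_names(c("a", "a"))  # errors
```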

@ianmcook (Member)

> Shall I create a new ticket to add code which checks column names are unique when combining objects?

Yes, please create a Jira for that, thanks! And if you happen to solve it in this PR, then you can resolve that Jira with a comment indicating that it was solved in this PR.

@thisisnic thisisnic requested a review from ianmcook May 10, 2021 17:31
@ianmcook (Member)

This looks pretty good to me! Just a few final things:

  • Could you please ensure there are spaces added: `if(` → `if (` and `){` → `) {`
  • Could you search all the tests for "ARROW-11705" and see if the two skipped tests work now that you've implemented this fix?

@thisisnic (Member Author)

> This looks pretty good to me! Just a few final things:
>
> * Could you please ensure there are spaces added: `if(` → `if (` and `){` → `) {`
> * Could you search all the tests for "[ARROW-11705](https://issues.apache.org/jira/browse/ARROW-11705)" and see if the two skipped tests work now that you've implemented this fix?

Done both now, and fixed spacing in a load of other places as well.

@ianmcook ianmcook requested a review from nealrichardson May 11, 2021 15:42
@ianmcook (Member)

@nealrichardson you want to take a look before I merge? Thanks

@ianmcook (Member) commented May 11, 2021

@thisisnic please also build the docs so the new .Rd file is included here

@ianmcook (Member)

In theory, this would be a good opportunity to use Conbench to test whether the multiple length() calls added here might have a meaningful effect on the performance of Table/RecordBatch creation. In practice, I'm not sure whether Conbench would help us here. @jonkeane do you know?

@jonkeane (Member)

It's worth checking — there might be too much noise right now to see a small difference, but large ones should jump out.

@jonkeane (Member)

@ursabot please benchmark lang=R

@ursabot commented May 11, 2021

Benchmark runs are scheduled for baseline = 4e0f0cf and contender = dbe74918019e172f8bdd3a2085f1ec7481fa79f4. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed ⬇️0.0% ⬆️0.0%] ec2-t3-large-us-east-2 (mimalloc)
[Failed ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2 (mimalloc)
[Finished ⬇️5.8% ⬆️0.0%] ursa-i9-9960x (mimalloc)
[Failed ⬇️0.0% ⬆️0.0% Warning: Contender and baseline run contexts do not match] ursa-thinkcentre-m75q (mimalloc)
⚠️ ursa-i9-9960x agent is disconnected or machine is offline.

@thisisnic (Member Author)

> Benchmark runs are scheduled for baseline = 4e0f0cf and contender = dbe7491. Results will be available as each benchmark for each run completes.
> Conbench compare runs links:
> [Failed ⬇️0.0% ⬆️0.0%] ec2-t3-large-us-east-2 (mimalloc)
> [Failed ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2 (mimalloc)
> [Scheduled] ursa-i9-9960x (mimalloc)
> [Failed ⬇️0.0% ⬆️0.0% Warning: Contender and baseline run contexts do not match] ursa-thinkcentre-m75q (mimalloc)
> ⚠️ ursa-i9-9960x agent is disconnected or machine is offline.

@jonkeane It looks like Conbench has used different benchmarks for each of those PRs - do you know why that's happened?

@ianmcook (Member)

I think Conbench needs some tweaking before it can help us here.

I'll go ahead and resolve the conflicts, wait for checks to pass, and merge this if there aren't any objections

@ianmcook (Member)

@thisisnic ooh we finally have some benchmark results to look at!: ursa-i9-9960x (mimalloc)
The dataframe-to-table results are the pertinent ones here I think.
@jonkeane can you help us interpret these results?

@thisisnic (Member Author) commented May 14, 2021

> @thisisnic ooh we finally have some benchmark results to look at!: ursa-i9-9960x (mimalloc)
> The dataframe-to-table results are the pertinent ones here I think.
> @jonkeane can you help us interpret these results?

I'm going to take a stab at interpreting them just to see how I do. Overall, the changes I've made appear to have made things about 5% slower. I'm not sure how important this is, as I don't have a good sense of a cut-off, or of whether any of it is just noise. Looking more closely at the results, the biggest differences are in file-read, where the update is at least 8% slower.

Does this feel like a significant slowdown? I'd say so. I think it lends support to the idea that this really should be implemented at the C++ level rather than the R level. Is this OK to merge? I think "perhaps", as long as we open a JIRA for implementing it at the C++ level (which I have done here: https://issues.apache.org/jira/browse/ARROW-12789).

Let me know your thoughts!

@jonkeane (Member)

A couple of comments/additions, though I think you're generally right.

The R benchmarks tend to be stable (https://conbench.ursa.dev/compare/runs/8b6fef07829948998502a7677dec6e03...0cbd9dcbe2594e06ab95cf0e088cf25b/ is a run on the master branch with changes between -3% and 1%, and that -3% is an outlier; the next largest decrease is -0.8%). So we can have decent confidence that we're not observing noise alone here. We're actively working to improve this, but I wanted to put it out there as part of the assumptions I'm using.

There are some file-read benchmarks that are >5% slower; interestingly, it is all (and only) the fanniemae dataset that is slower (both reading from parquet and from feather), and only when it is being converted to a data.frame, not when it is left as a table. This seems a little suspect to me, since the only places where I can see you've meaningfully changed the code are RecordBatch$create, Table$create, and MakeArrayFromScalar. Do any of those get called when reading parquet or feather files?

Note: I don't see csv reads run here, IIRC those were proactively disabled due to memory issues, but I will confirm that (and I thought this machine should have been able to handle these and there is https://issues.apache.org/jira/browse/ARROW-12519 to track).

There are also a number of benchmarks in the 1-5% slower range (the other file-read benchmarks, as well as the df to R conversions, and a handful of the writing benchmarks). The df to R conversions seem more in line with the code that was changed, and those are in the 3-6% range (though most are closer to 3%, with one outlier at 6%).

The next 28/138 or ~20% of the benchmarks are 0-1% slower, and then 19/138 or ~14% of the benchmarks are 0-1% faster. These are probably all just noise.

@thisisnic (Member Author)

> There are some file-read benchmarks that are >5% slower, interestingly it is all (and only) the fanniemae dataset that is slower (both reading from parquet and from feather) and only when it is being converted to a data.frame, not when it is being left as a table. This seems a little suspect to me since the only places that I'm seeing you've meaningfully changed the code is RecordBatch$create, Table$create, and MakeArrayFromScalar. Do any of those get called when reading parquet or feather files?

They do not, which does make it strange; I'd completely overlooked the fact that those shouldn't be relevant here.

Member

This code block seems to be repeated in both the Table and RecordBatch code, so it might be better factored out as a helper function.

I also wonder whether the logic would actually be simpler in C++, because you could do it at a later point where you know exactly that you have a vector of Arrays, and don't have to worry about whether it is an R vector, a Scalar, an Array, etc. See the check_consistent_array_size function in r/src, for example; you could drop in around there and, instead of erroring when lengths aren't consistent, handle the recycling case. (Also ok to (1) ignore this suggestion or (2) defer to a followup.)

Member Author

I opened up a ticket to do this in C++, so I figured probably no point duplicating that effort? Though if this seems like a special case that's better off implemented in the R package's C++ layer rather than the source C++, I can look into it. See discussion on this ticket, @nealrichardson : https://issues.apache.org/jira/browse/ARROW-12789

Member

FWIW there is a base R function lengths() that does this (though I don't recall what version it was added in)
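For reference, `lengths()` returns the length of each element of a list in one call, equivalent to mapping `length()` over the list:

```r
# lengths() gives the per-element lengths of a list as a named
# integer vector, without an explicit map.
arrays <- list(a = 1:10, b = 5, c = letters[1:3])
lengths(arrays)                      # a: 10, b: 1, c: 3
vapply(arrays, length, integer(1))   # same result
```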

Member Author

Apparently 3.2.0. I think in our test-r-versions job we only go back to 3.3.0, and I can't think of anywhere else that we go back to 3.2.0 or before, so I will update.

Member

Per DESCRIPTION we require R >= 3.3 (because we depend on packages that require R >= 3.3)

Comment on lines 169 to 173
Member

Call me 👴 but I personally find

arrays[arr_lens == 1] <- lapply(arrays[arr_lens == 1], repeat_value_as_array, max_array_len)

easier to read than modify2(...).

Member Author

I don't disagree, will update

r/R/util.R Outdated
Member

If you're checking length again here (unnecessary IMO since you're only calling this in the case where you've validated length == 1 already), you could simplify your modify2() wrapper and just map over all arrays, and in here only do the recycling if length == 1.
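The simplification being suggested can be sketched with plain vectors and a hypothetical helper name (using `rep()` in place of arrow's array handling): let the helper recycle only when needed, then map over every element.

```r
# Hypothetical sketch: the helper itself decides whether to recycle,
# so the caller can simply map over all elements without pre-filtering.
repeat_value_if_scalar <- function(x, target_len) {
  if (length(x) == 1) rep(x, times = target_len) else x
}

arrays <- list(a = 1:10, b = 5)
arrays <- lapply(arrays, repeat_value_if_scalar, max(lengths(arrays)))
lengths(arrays)  # a: 10, b: 10
```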

Member Author

I think I was trying to make this function a bit more generic in case it's useful elsewhere, but you make a good point; I'll remove the check.

@ElenaHenderson (Contributor)

@ursabot please benchmark name=dataframe-to-table lang=R

@thisisnic thisisnic force-pushed the ARROW-11705_scalar_recycling branch from c01dfff to 95463e3 Compare June 8, 2021 14:34
@thisisnic thisisnic marked this pull request as draft June 8, 2021 14:39
@thisisnic thisisnic marked this pull request as ready for review June 8, 2021 15:42
@nealrichardson (Member) left a comment

Some notes but generally LGTM, thanks

r/R/util.R Outdated
if(all(map_lgl(arrays, ~inherits(.x, "data.frame")))){
abort(c(
"All input tibbles or data.frames must have the same number of rows",
x = paste("Number of rows in inputs:",oxford_paste(map_int(arrays, ~nrow(.x))))
Member

right?

Suggested change
- x = paste("Number of rows in inputs:",oxford_paste(map_int(arrays, ~nrow(.x))))
+ x = paste("Number of rows in inputs:", oxford_paste(arr_lens))

Also, are we worried that arr_lens could be large (and thus this error message would be huge)?

Member Author

I wasn't before, but I am now that you mention it. I have updated the error message to print just the longest and shortest lengths, as I think that is still sufficient to be useful. Let me know if it looks OK!
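Summarising only the extremes could look something like this (a hypothetical sketch, not the committed code):

```r
# Hypothetical sketch: report only the shortest and longest lengths so
# the error message stays short even with many columns.
describe_length_mismatch <- function(arr_lens) {
  paste0("Column lengths range from ", min(arr_lens), " to ", max(arr_lens))
}

describe_length_mismatch(c(10, 1, 7))  # "Column lengths range from 1 to 10"
```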

sjperkins pushed a commit to sjperkins/arrow that referenced this pull request Jun 23, 2021
…create()

This also adds missing spaces in some unrelated R files

Closes apache#10269 from thisisnic/ARROW-11705_scalar_recycling

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Nic <thisisnic@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>