Conversation

@thisisnic (Member) commented May 7, 2021

This also adds missing spaces in some unrelated R files

@github-actions bot commented May 7, 2021

@ianmcook (Member) commented May 7, 2021

Ultimately I think we should try to do scalar value recycling in C++ code, but I think this a great temporary solution in the meantime.

What happens if you pass data frames instead of vectors and one of them has length one (i.e. only one row)? Maybe add a test to check the behavior in that case.

And what happens if you pass Arrow arrays instead of R vectors and one of them has length 1? E.g.

Table$create(a = Array$create(1:10), b = Array$create(5))
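The recycling behaviour under discussion can be sketched with plain R vectors (a hypothetical helper, not arrow's actual implementation): length-1 inputs are repeated to the longest length, and any other mismatch is an error.

```r
# Hypothetical sketch of scalar recycling across columns (not arrow's
# implementation). Length-1 columns are repeated to the longest length;
# any other length mismatch raises an error.
recycle_columns <- function(columns) {
  lens <- lengths(columns)
  max_len <- max(lens)
  if (!all(lens %in% c(1, max_len))) {
    stop("All columns must have the same length or length 1")
  }
  columns[lens == 1] <- lapply(columns[lens == 1], rep, times = max_len)
  columns
}

cols <- recycle_columns(list(a = 1:10, b = 5))
length(cols$b)  # 10
```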

@thisisnic (Member Author)

> What happens if you pass data frames instead of vectors and one of them has length one (i.e. only one row)? Maybe add a test to check the behavior in that case.

Will do.

> And what happens if you pass Arrow arrays instead of R vectors and one of them has length 1? E.g.
>
> Table$create(a = Array$create(1:10), b = Array$create(5))

An error; I will make changes to handle this too and add appropriate tests.

@thisisnic (Member Author)

@ianmcook I think this is something for a separate ticket/PR, but when I was testing things you mentioned above, I found that it is possible to create Table and RecordBatch objects with duplicated column names, which then results in errors when I try to analyse them, e.g.

Table$create(iris, iris) %>% filter(Species == "versicolor")

Error in schm$GetFieldByName(name)$ToString() : attempt to apply non-function

Shall I create a new ticket to add code which checks column names are unique when combining objects?
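For illustration, a minimal uniqueness check of the kind being proposed might look like this (a base-R sketch with a hypothetical helper name, not the eventual fix):

```r
# Hypothetical sketch: error if any column names are duplicated
# (not the actual arrow fix).
check_unique_names <- function(nms) {
  dupes <- unique(nms[duplicated(nms)])
  if (length(dupes) > 0) {
    stop("Duplicated column names: ", paste(dupes, collapse = ", "))
  }
  invisible(nms)
}

check_unique_names(c("a", "b"))  # passes silently
# check_unique_names(c("a", "a"))  # errors
```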

@ianmcook (Member)

> Shall I create a new ticket to add code which checks column names are unique when combining objects?

Yes, please create a Jira for that, thanks! And if you happen to solve it in this PR, then you can resolve that Jira with a comment indicating that it was solved in this PR.

@thisisnic thisisnic requested a review from ianmcook May 10, 2021 17:31
@ianmcook (Member)

This looks pretty good to me! Just a few final things:

  • Could you please ensure there are spaces added: `if(` → `if (` and `){` → `) {`
  • Could you search all the tests for "ARROW-11705" and see if the two skipped tests work now that you've implemented this fix?

@thisisnic (Member Author)

> This looks pretty good to me! Just a few final things:
>
> * Could you please ensure there are spaces added: `if(` → `if (` and `){` → `) {`
> * Could you search all the tests for "[ARROW-11705](https://issues.apache.org/jira/browse/ARROW-11705)" and see if the two skipped tests work now that you've implemented this fix?

Done both now, and fixed spacing in a load of other places as well.

@ianmcook ianmcook requested a review from nealrichardson May 11, 2021 15:42
@ianmcook (Member)

@nealrichardson you want to take a look before I merge? Thanks

@ianmcook (Member) commented May 11, 2021

@thisisnic please also build the docs so the new .Rd file is included here

@ianmcook (Member)

In theory, this would be a good opportunity to use Conbench to test whether the multiple length() calls added here might have a meaningful effect on the performance of Table/RecordBatch creation. In practice, I'm not sure whether Conbench would help us here. @jonkeane do you know?

@jonkeane (Member)

It's worth checking — there might be too much noise right now to see a small difference, but large ones should jump out.

@jonkeane (Member)

@ursabot please benchmark lang=R

@ursabot commented May 11, 2021

Benchmark runs are scheduled for baseline = 4e0f0cf and contender = dbe74918019e172f8bdd3a2085f1ec7481fa79f4. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed ⬇️0.0% ⬆️0.0%] ec2-t3-large-us-east-2 (mimalloc)
[Failed ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2 (mimalloc)
[Finished ⬇️5.8% ⬆️0.0%] ursa-i9-9960x (mimalloc)
[Failed ⬇️0.0% ⬆️0.0% Warning: Contender and baseline run contexts do not match] ursa-thinkcentre-m75q (mimalloc)
⚠️ ursa-i9-9960x agent is disconnected or machine is offline.

@thisisnic (Member Author)

> Benchmark runs are scheduled for baseline = 4e0f0cf and contender = dbe7491. Results will be available as each benchmark for each run completes.
> Conbench compare runs links:
> [Failed ⬇️0.0% ⬆️0.0%] ec2-t3-large-us-east-2 (mimalloc)
> [Failed ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2 (mimalloc)
> [Scheduled] ursa-i9-9960x (mimalloc)
> [Failed ⬇️0.0% ⬆️0.0% Warning: Contender and baseline run contexts do not match] ursa-thinkcentre-m75q (mimalloc)
> ⚠️ ursa-i9-9960x agent is disconnected or machine is offline.

@jonkeane It looks like Conbench has used different benchmarks for each of those PRs - do you know why that's happened?

@ianmcook (Member)

I think Conbench needs some tweaking before it can help us here.

I'll go ahead and resolve the conflicts, wait for checks to pass, and merge this if there aren't any objections

@ianmcook (Member)

@thisisnic ooh we finally have some benchmark results to look at!: ursa-i9-9960x (mimalloc)
The dataframe-to-table results are the pertinent ones here I think.
@jonkeane can you help us interpret these results?

@thisisnic (Member Author) commented May 14, 2021

> @thisisnic ooh we finally have some benchmark results to look at!: ursa-i9-9960x (mimalloc)
> The dataframe-to-table results are the pertinent ones here I think.
> @jonkeane can you help us interpret these results?

I'm going to take a stab at interpreting them just to see how I do. Overall, the changes I've made appear to have made things about 5% slower. I'm not sure how important this is, as I don't have a good sense of a cut-off, or of whether any of it is just noise. Looking more closely at the results, the biggest differences are in file-read, where the update is at least 8% slower.

Does this feel like a significant slowdown? I'd say so. I think it lends support to the idea that this really should be implemented at the C++ level rather than the R level. Is this OK to merge? I think "perhaps", as long as we open a JIRA for implementing it at the C++ level (which I have done here: https://issues.apache.org/jira/browse/ARROW-12789).

Let me know your thoughts!

@jonkeane (Member)

A couple of comments/additions, though I think you're generally right.

The R benchmarks tend to be stable (https://conbench.ursa.dev/compare/runs/8b6fef07829948998502a7677dec6e03...0cbd9dcbe2594e06ab95cf0e088cf25b/ is a run on the master branch with changes between -3% and 1%, and that -3% is an outlier; the next largest decrease is -0.8%). So we can have decent confidence that we're not observing noise alone here. We're actively working to improve this, but I wanted to put it out there as part of the assumptions I'm using.

There are some file-read benchmarks that are >5% slower; interestingly, it is all (and only) the fanniemae dataset that is slower (both reading from parquet and from feather), and only when it is being converted to a data.frame, not when it is left as a table. This seems a little suspect to me, since the only places where I can see you've meaningfully changed the code are RecordBatch$create, Table$create, and MakeArrayFromScalar. Do any of those get called when reading parquet or feather files?

Note: I don't see csv reads run here, IIRC those were proactively disabled due to memory issues, but I will confirm that (and I thought this machine should have been able to handle these and there is https://issues.apache.org/jira/browse/ARROW-12519 to track).

There are also a number of benchmarks in the 1-5% slower range (the other file-read benchmarks, as well as the df to R conversions, and a handful of the writing benchmarks). The df to R conversions seem more in line with the code that was changed, and those are in the 3-6% range (though most are closer to 3%, with one outlier at 6%).

The next 28/138 or ~20% of the benchmarks are 0-1% slower, and then 19/138 or ~14% of the benchmarks are 0-1% faster. These are probably all just noise.

@thisisnic (Member Author)

> There are some file-read benchmarks that are >5% slower, interestingly it is all (and only) the fanniemae dataset that is slower (both reading from parquet and from feather) and only when it is being converted to a data.frame, not when it is being left as a table. This seems a little suspect to me since the only places that I'm seeing you've meaningfully changed the code is RecordBatch$create, Table$create, and MakeArrayFromScalar. Do any of those get called when reading parquet or feather files?

They do not, which does make it strange; I'd completely overlooked the fact that those shouldn't be relevant here.

Member

This code block seems to be repeated in both the Table and RecordBatch code, so it might be better factored out as a helper function.

I also wonder whether the logic would actually be simpler in C++, because you could do it at a later point where you know exactly that you have a vector of Arrays, and don't have to worry about whether it is an R vector, a Scalar, an Array, etc. See the check_consistent_array_size function in r/src, for example; you could drop in around there and, instead of erroring when lengths aren't consistent, handle the recycling case. (Also ok to (1) ignore this suggestion or (2) defer to a followup.)

Member Author

I opened up a ticket to do this in C++, so I figured probably no point duplicating that effort? Though if this seems like a special case that's better off implemented in the R package's C++ layer rather than the source C++, I can look into it. See discussion on this ticket, @nealrichardson : https://issues.apache.org/jira/browse/ARROW-12789

Member

FWIW there is a base R function lengths() that does this (though I don't recall what version it was added in)
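For reference, `lengths()` returns the length of each element of a list in one call, equivalent to mapping `length()` over the list:

```r
# lengths() gives the per-element lengths of a list as a named
# integer vector, without an explicit map.
arrays <- list(a = 1:10, b = 5, c = letters[1:3])
lengths(arrays)                      # a: 10, b: 1, c: 3
vapply(arrays, length, integer(1))   # same result
```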

Member Author

Apparently 3.2.0. I think in our test-r-versions job we only go back to 3.3.0, and I can't think of anywhere else that we go back to 3.2.0 or before, so I will update.

Member

Per DESCRIPTION we require R >= 3.3 (because we depend on packages that require R >= 3.3)

Comment on lines 169 to 173
Member

Call me 👴 but I personally find

arrays[arr_lens == 1] <- lapply(arrays[arr_lens == 1], repeat_value_as_array, max_array_len)

easier to read than modify2(...).

Member Author

I don't disagree, will update

r/R/util.R Outdated
Member

If you're checking length again here (unnecessary IMO since you're only calling this in the case where you've validated length == 1 already), you could simplify your modify2() wrapper and just map over all arrays, and in here only do the recycling if length == 1.
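The simplification being suggested can be sketched with plain vectors and a hypothetical helper name (using `rep()` in place of arrow's array handling): let the helper recycle only when needed, then map over every element.

```r
# Hypothetical sketch: the helper itself decides whether to recycle,
# so the caller can simply map over all elements without pre-filtering.
repeat_value_if_scalar <- function(x, target_len) {
  if (length(x) == 1) rep(x, times = target_len) else x
}

arrays <- list(a = 1:10, b = 5)
arrays <- lapply(arrays, repeat_value_if_scalar, max(lengths(arrays)))
lengths(arrays)  # a: 10, b: 10
```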

Member Author

I think I was trying to make this function a bit more generic in case it's useful elsewhere, but you make a good point; I'll remove the check.

@ElenaHenderson (Contributor)

@ursabot please benchmark name=dataframe-to-table lang=R

@thisisnic thisisnic force-pushed the ARROW-11705_scalar_recycling branch from c01dfff to 95463e3 Compare June 8, 2021 14:34
@thisisnic thisisnic marked this pull request as draft June 8, 2021 14:39
@thisisnic thisisnic marked this pull request as ready for review June 8, 2021 15:42
@nealrichardson (Member) left a comment

Some notes but generally LGTM, thanks

r/R/util.R Outdated
if(all(map_lgl(arrays, ~inherits(.x, "data.frame")))){
abort(c(
"All input tibbles or data.frames must have the same number of rows",
x = paste("Number of rows in inputs:",oxford_paste(map_int(arrays, ~nrow(.x))))
Member

right?

Suggested change
- x = paste("Number of rows in inputs:",oxford_paste(map_int(arrays, ~nrow(.x))))
+ x = paste("Number of rows in inputs:", oxford_paste(arr_lens))

Also, are we worried that arr_lens could be large (and thus this error message would be huge)?

Member Author

I wasn't before, but I am now that you mention it. I have updated the error message to print just the longest and shortest lengths, as I think that is still sufficient to be useful. Let me know if it looks OK!
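Summarising only the extremes could look something like this (a hypothetical sketch, not the committed code):

```r
# Hypothetical sketch: report only the shortest and longest lengths so
# the error message stays short even with many columns.
describe_length_mismatch <- function(arr_lens) {
  paste0("Column lengths range from ", min(arr_lens), " to ", max(arr_lens))
}

describe_length_mismatch(c(10, 1, 7))  # "Column lengths range from 1 to 10"
```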

sjperkins pushed a commit to sjperkins/arrow that referenced this pull request Jun 23, 2021
…create()

This also adds missing spaces in some unrelated R files

Closes apache#10269 from thisisnic/ARROW-11705_scalar_recycling

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Nic <thisisnic@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>