ARROW-11705: [R] Support scalar value recycling in RecordBatch/Table$create() #10269
Conversation
|
Ultimately I think we should try to do scalar value recycling in C++ code, but I think this is a great temporary solution in the meantime. What happens if you pass data frames instead of vectors and one of them has length one (i.e. only one row)? Maybe add a test to check the behavior in that case. And what happens if you pass Arrow arrays instead of R vectors and one of them has length 1? E.g. Table$create(a = Array$create(1:10), b = Array$create(5)) |
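For reference, a quick sketch of the recycling behavior being added (the outputs here are my expectation from the PR description, not verified output):

library(arrow)
# A length-1 value should be recycled to the length of the longest column:
tab <- Table$create(a = 1:10, b = 5)
tab$num_rows  # 10; b is repeated to fill the column
# The same should eventually hold when the inputs are Arrow arrays:
Table$create(a = Array$create(1:10), b = Array$create(5))$num_rows  # 10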
Will do.
An error; I will make changes to handle this too, and add appropriate tests. |
|
@ianmcook I think this is something for a separate ticket/PR, but when I was testing the things you mentioned above, I found that it is possible to run Table$create(iris, iris) %>% filter(Species == "versicolor") even though the column names are duplicated.
Shall I create a new ticket to add code which checks column names are unique when combining objects? |
Yes, please create a Jira for that, thanks! And if you happen to solve it in this PR, then you can resolve that Jira with a comment indicating that it was solved in this PR. |
|
This looks pretty good to me! Just a few final things:
|
Done both now, and fixed spacing in a load of other places as well. |
|
@nealrichardson you want to take a look before I merge? Thanks |
|
@thisisnic please also build the docs so the new |
|
In theory, this would be a good opportunity to use Conbench to test whether the multiple |
|
It's worth checking — there might be too much noise right now to see a small difference, but large ones should jump out. |
|
@ursabot please benchmark lang=R |
|
Benchmark runs are scheduled for baseline = 4e0f0cf and contender = dbe74918019e172f8bdd3a2085f1ec7481fa79f4. Results will be available as each benchmark for each run completes. |
@jonkeane It looks like Conbench has used different benchmarks for each of those PRs - do you know why that's happened? |
|
I think Conbench needs some tweaking before it can help us here. I'll go ahead and resolve the conflicts, wait for checks to pass, and merge this if there aren't any objections. |
|
@thisisnic ooh, we finally have some benchmark results to look at: ursa-i9-9960x (mimalloc)
I'm going to take a stab at interpreting them, just to see how I do. Overall, the changes I've made have made things 5% slower. I'm not sure whether this is important, as I don't have a good sense of a cut-off, or of whether any of this is just noise.

Looking more closely at the results, the biggest differences are to file-read, where the update is at least 8% slower. Does this feel like a significant slowdown? I'd say so. I think it supports the idea that this really should be implemented at the C++ level rather than the R level.

Is this OK to be merged in? I think "perhaps", as long as we open a JIRA for implementing this at the C++ level (which I have done here: https://issues.apache.org/jira/browse/ARROW-12789). Let me know your thoughts! |
|
A couple of comments/additions, though I think you're generally right.

The R benchmarks tend to be stable: https://conbench.ursa.dev/compare/runs/8b6fef07829948998502a7677dec6e03...0cbd9dcbe2594e06ab95cf0e088cf25b/ is a run on the master branch, and it shows between -3% and 1% change, where that -3% is an outlier (the next largest decrease is -0.8%). So we can have decent confidence that we're not observing noise alone here. We're working actively to improve this, but I wanted to put it out there as part of the assumptions I'm using.

There are some file-read benchmarks that are >5% slower. Interestingly, it is all (and only) the fanniemae dataset that is slower (both reading from parquet and from feather), and only when it is being converted to a data.frame, not when it is being left as a table. This seems a little suspect to me, since the only places that I'm seeing you've meaningfully changed the code is

Note: I don't see csv reads run here; IIRC those were proactively disabled due to memory issues, but I will confirm that (I thought this machine should have been able to handle them, and there is https://issues.apache.org/jira/browse/ARROW-12519 to track).

There are also a number of other benchmarks in the 1-5% slower range (the other file-reads, as well as the df-to-R conversions and a handful of the writing benchmarks). The df-to-R conversions seem more in line with the code that was changed, and those are in the 3-6% range (though most are closer to 3%, with one outlier at 6%).

The next 28/128 or ~20% of the benchmarks are 0-1% slower, and 19/138 or ~14% of the benchmarks are 0-1% faster. These are probably all just noise. |
They do not, which does make it strange; I completely overlooked the fact that those shouldn't be relevant here. |
r/R/record-batch.R
Outdated
This code block seems to be repeated in both the Table and RecordBatch code, so it might be better factored out as a helper function.
I also wonder whether the logic would actually be simpler in C++ because you could do it at a later point where you know exactly that you have a vector of Arrays and don't have to worry about whether it is an R vector, a Scalar, an Array, etc. See the check_consistent_array_size function for example in r/src--you could drop in around there and instead of erroring if you don't have consistent lengths, handle the recycling case. (Also ok to (1) ignore this suggestion or (2) defer to a followup)
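For illustration, the factored-out helper might look something like this. This is a sketch only: recycle_scalar_arrays is a hypothetical name, and repeat_value_as_array is the PR's recycling helper discussed below.

# Hypothetical shared helper, callable from both Table$create() and
# RecordBatch$create(); everything except repeat_value_as_array is assumed.
recycle_scalar_arrays <- function(arrays) {
  arr_lens <- map_int(arrays, length)
  max_array_len <- max(arr_lens)
  if (max_array_len > 1) {
    arrays[arr_lens == 1] <- lapply(
      arrays[arr_lens == 1], repeat_value_as_array, max_array_len
    )
  }
  arrays
}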
I opened up a ticket to do this in C++, so I figured there's probably no point duplicating that effort? Though if this seems like a special case that's better off implemented in the R package's C++ layer rather than the source C++, I can look into it. See the discussion on this ticket, @nealrichardson: https://issues.apache.org/jira/browse/ARROW-12789
r/R/record-batch.R
Outdated
FWIW there is a base R function lengths() that does this (though I don't recall what version it was added in)
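For example:

# lengths() (base R >= 3.2.0) returns the length of each element of a list,
# equivalent to map_int(x, length) but as a single base function:
lengths(list(a = 1:10, b = 5))
#>  a  b
#> 10  1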
Apparently 3.2.0. I think in our test-r-versions job we only go back to 3.3.0, and I can't think of anywhere else that we go back to 3.2.0 or earlier, so I will update.
Per DESCRIPTION we require R >= 3.3 (because we depend on packages that require R >= 3.3)
r/R/record-batch.R
Outdated
Call me 👴 but I personally find
arrays[arr_lens == 1] <- lapply(arrays[arr_lens == 1], repeat_value_as_array, max_array_len)
easier to read than modify2(...).
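For comparison, the modify2() version being referred to would look roughly like this (purrr-style semantics assumed; I haven't reproduced the exact line from the diff):

arrays <- modify2(arrays, arr_lens, function(arr, len) {
  if (len == 1) repeat_value_as_array(arr, max_array_len) else arr
})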
I don't disagree, will update
r/R/util.R
Outdated
If you're checking length again here (unnecessary IMO since you're only calling this in the case where you've validated length == 1 already), you could simplify your modify2() wrapper and just map over all arrays, and in here only do the recycling if length == 1.
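A sketch of that simplification, with an illustrative body for repeat_value_as_array (this is not the PR's actual implementation):

# Recycle inside the helper so the caller can simply map over everything:
repeat_value_as_array <- function(object, n) {
  if (length(object) != 1) {
    return(object)  # leave full-length inputs untouched
  }
  Array$create(rep(as.vector(object), n))
}
arrays <- lapply(arrays, repeat_value_as_array, max_array_len)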
I think I was trying to make this function a bit more generic in case it's useful elsewhere, but you make a good point; I'll remove the check.
|
@ursabot please benchmark name=dataframe-to-table lang=R |
Co-authored-by: Ian Cook <ianmcook@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Force-pushed from c01dfff to 95463e3
nealrichardson left a comment
Some notes but generally LGTM, thanks
r/R/util.R
Outdated
if(all(map_lgl(arrays, ~inherits(.x, "data.frame")))){
  abort(c(
    "All input tibbles or data.frames must have the same number of rows",
    x = paste("Number of rows in inputs:",oxford_paste(map_int(arrays, ~nrow(.x))))
right?
- x = paste("Number of rows in inputs:",oxford_paste(map_int(arrays, ~nrow(.x))))
+ x = paste("Number of rows in inputs:", oxford_paste(arr_lens))
Also, are we worried that arr_lens could be large (and thus this error message would be huge)?
I wasn't before, but I am now that you mention it. I have updated my error message to print just the longest and shortest lengths, as I think this is still sufficient to be useful. Let me know if it looks OK!
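Roughly like this, using the rlang::abort() style from the diff above (a sketch of the trimmed message, not the exact code):

abort(c(
  "All arrays must have the same length",
  x = paste0("Input lengths range from ", min(arr_lens), " to ", max(arr_lens))
))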
…eated expression into variable
…create()

This also adds missing spaces in some unrelated R files.

Closes apache#10269 from thisisnic/ARROW-11705_scalar_recycling

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Nic <thisisnic@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>