-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-9140: [R] Zero-copy Arrow to R where possible #10445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-9140: [R] Zero-copy Arrow to R where possible #10445
Conversation
|
An interesting side benefit from this is that we don't get altrep R vectors from slices: library(arrow, warn.conflicts = FALSE)
#> See arrow_info() for available features
x <- 1:1e3+ 1L
v <- Array$create(x)
x1 <- v$as_vector()
.Internal(inspect(x1))
#> @7face1b4ba18 13 INTSXP g0c0 [REF(65535)] std::shared_ptr<arrow::Array, int32, NONULL> (len=1000, ptr=0x7facdc27e998)
x2 <- v$Slice(500)$as_vector()
.Internal(inspect(x2))
#> @7facde2bfd80 13 INTSXP g0c0 [REF(65535)] std::shared_ptr<arrow::Array, int32, NONULL> (len=500, ptr=0x7facdb0a2198)Created on 2021-06-08 by the reprex package (v2.0.0) |
f2db09c to
320b33f
Compare
|
@nealrichardson I've turned off making INT64 arrays with no nulls into altrep bit64::integer64 vectors (i.e. reusing the payload of the array as the data of the underlying double array in R ...) because of the auto downcasting to int vectors when it fits: library(arrow, warn.conflicts = FALSE)
i64 <- Array$create(bit64::as.integer64(1:10))
.Internal(inspect(as.vector(i64)))
#> @7ff0b2099c38 13 INTSXP g0c4 [REF(2)] (len=10, tl=0) 1,2,3,4,5,...Created on 2021-06-08 by the reprex package (v2.0.0) I can altogether remove the int64 altrep code if needed ... ? |
|
This is super cool. We should get some benchmarking on this. Tangentially related, should we have an |
|
Interesting, we could make that opt-in. For now it's all happening here so would be easy enough to make conditional: // [[arrow::export]]
SEXP Array__as_vector(const std::shared_ptr<arrow::Array>& array) {
auto type = array->type();
#if defined(HAS_ALTREP)
if (array->null_count() == 0) {
switch (type->id()) {
case arrow::Type::DOUBLE:
return arrow::r::Make_array_nonull_dbl_vector(array);
case arrow::Type::INT32:
return arrow::r::Make_array_nonull_int_vector(array);
// case arrow::Type::INT64:
// return arrow::r::Make_array_nonull_int64_vector(array);
default:
break;
}
}
#endif
return arrow::r::ArrayVector__as_vector(array->length(), type, {array});
} |
|
Cool, so that would just be |
|
I've used I've put
|
SGTM
That surprises me, I'd expect that checking an R boolean would be faster than counting nulls in a (potentially long) array.
I think defaulting to ALTREP on is good (particularly if/when we quantify its benefits) but leaving an escape hatch in case of error is nice. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not an R user, but here are some comments.
|
Ok, I've run some benchmarks on this branch and I'm seeing a huge speed up for floats + integers with It might be out of scope for this PR, but chunked arrays don't see a similar speed up (which makes sense given they call library(arrow, warn.conflicts = FALSE)
x <- 1:1e3+ 1L
v <- Array$create(x)
x1 <- v$as_vector()
.Internal(inspect(x1))
#> @7f9077f5a1a8 13 INTSXP g0c0 [REF(65535)] std::shared_ptr<arrow::Array, int32, NONULL> (len=1000, ptr=0x7f90975a9a08)
v_chunked <- ChunkedArray$create(x)
x2 <- v_chunked$as_vector()
.Internal(inspect(x2))
#> @7f908312c000 13 INTSXP g0c7 [REF(2)] (len=1000, tl=0) 2,3,4,5,6,...Created on 2021-06-10 by the reprex package (v2.0.0) arrowbench results (using the new benchmarks in voltrondata-labs/arrowbench#28): |
… no nulls into an altrep R numeric vector w/ no copy
…e `arrow.use_altrep` option (with default to TRUE: opt out)
50a88df to
8721259
Compare
- const std::shared_ptr<Array>& Get()
|
@jonkeane we might be able to deal with a fake I believe this can be a follow up pull request with some altrep vectors to wrap chunked arrays, and arrays with nulls. Implementation will be different, e.g. the |
|
That sounds good. The faked With csv reading, we frequently (with sufficiently large csvs) get multiple chunks, but I've tried reading parquet a few times to get a table with chunked arrays where there are more than one chunk and haven't so far seen it happen. I need to dig more into the parquet reading source code to see if it is ever possible to get more-than-1-chunks- |
This comment has been minimized.
This comment has been minimized.
library(arrow, warn.conflicts = FALSE)
#> See arrow_info() for available features
c_int <- ChunkedArray$create(1:1000)
c_dbl <- ChunkedArray$create(as.numeric(1:1000))
c_int$num_chunks
#> [1] 1
c_dbl$num_chunks
#> [1] 1
.Internal(inspect(as.vector(c_int)))
#> @7fc1314ce2a8 13 INTSXP g0c0 [REF(65535)] std::shared_ptr<arrow::Array, int32, NONULL> (len=1000, ptr=0x7fc1274a0408)
.Internal(inspect(as.vector(c_dbl)))
#> @7fc131528a90 14 REALSXP g0c0 [REF(65535)] std::shared_ptr<arrow::Array, double, NONULL> (len=1000, ptr=0x7fc12b2b5178)Created on 2021-06-18 by the reprex package (v2.0.0.9000) I think I'll deal with this https://issues.apache.org/jira/browse/ARROW-13114 here as well, and then do a follow up for using RTasks (i.e. https://issues.apache.org/jira/browse/ARROW-13113). |
|
Actually, I now think it is better to do the RTasks thing first https://issues.apache.org/jira/browse/ARROW-13113 and then rework |
|
I'm not sure why we get this: https://github.com/apache/arrow/pull/10445/checks?check_run_id=2858594249#step:8:2641 |
Looks like linking the arrow binding failed. There's a cpp11 symbol missing ... which is odd because it's a template, so it ought to be inlined everywhere |
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is neat, thanks for doing it. Some minor comments:
| } | ||
|
|
||
| SEXP MakeInt32ArrayNoNull(const std::shared_ptr<Array>& array) { | ||
| return Int32ArrayNoNull::Make(array); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edge case worthy of mention in a comment somewhere: IIUC this doesn't check for instances of NA_integer_ in array and these will be considered null by R even though is_altrep_int_nonull(as.vector(array)) (and we report no_NA = 1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll think about where to mention this, but this isn't limited to these altrep variants. any int32 array that happens to have a non null NA_integer_ will be mismanaged by R with this sentinel approach.
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
|
@nealrichardson I believe this is ready to go |
This makes altrep R vectors of type
INTSXPorREALSXPfromarrow::Arrayof typeInt32Type/DoubleTypethat don't have any nulls:the altrep vector holds an external pointer so that the
Arraystays around, and its payload is shared. The R vector is marked as not mutable.create a “big” arrow Array with no nulls (just for testing purposes)
turn into R vector, using altrep, sharing the payload
verify it’s an altrep with the inspect method
it’s marked as not mutable so check that modify -> duplicate
timings for double vector
when a copy is needed, the data is copied entirely:
timings for integer vector
Created on 2021-06-08 by the reprex package (v2.0.0)