@romainfrancois
Contributor

@romainfrancois romainfrancois commented Jun 3, 2021

This creates ALTREP R vectors of type INTSXP or REALSXP from an arrow::Array of type Int32Type / DoubleType that has no nulls.

The ALTREP vector holds an external pointer so that the Array stays alive, and its payload is shared. The R vector is marked as not mutable.
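To make the sharing and copy-on-modify behaviour concrete, here is a minimal standard-library sketch. It is an analogy only, not the code in this PR: `FakeArray` and `SharedDoubleVector` are hypothetical names, and in R it is the runtime itself that duplicates a vector marked as not mutable. The sketch shows a wrapper that keeps the source alive through a `std::shared_ptr`, reads without copying, and copies the whole payload on the first write.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an arrow::Array with a contiguous double buffer.
struct FakeArray {
  std::vector<double> values;
};

// Hypothetical stand-in for the ALTREP vector: it either borrows the array's
// buffer (zero copy) or owns a private copy once a write has been requested.
class SharedDoubleVector {
 public:
  explicit SharedDoubleVector(std::shared_ptr<const FakeArray> array)
      : array_(std::move(array)) {}

  std::size_t size() const {
    return array_ ? array_->values.size() : copy_.size();
  }

  // Read-only access never copies.
  const double* data() const {
    return array_ ? array_->values.data() : copy_.data();
  }

  // Write access duplicates the whole payload first, then releases the array.
  double* mutable_data() {
    if (array_) {
      copy_ = array_->values;  // full O(n) copy
      array_.reset();
    }
    return copy_.data();
  }

  bool is_shared() const { return array_ != nullptr; }

 private:
  std::shared_ptr<const FakeArray> array_;  // keeps the source buffer alive
  std::vector<double> copy_;                // used only after the first write
};
```

The full copy in `mutable_data()` corresponds to the `v[1] <- 0` step in the timings below, which is the only expensive step after the array has been created.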

library(arrow, warn.conflicts = FALSE)
#> See arrow_info() for available features

create a “big” arrow Array with no nulls (just for testing purposes)

a <- arrow:::Test_array_nonull_dbl_vector(1e7)

turn into R vector, using altrep, sharing the payload

v <- a$as_vector()

verify it’s an altrep with the inspect method

.Internal(inspect(v))
#> @7f9abf8ba470 14 REALSXP g0c0 [REF(65535)] std::shared_ptr<arrow::Array, double, NONULL> (len=10000000, ptr=0x7f9ab5c8cd18)

it’s marked as not mutable so check that modify -> duplicate

v[1] <- 0
#> Duplicate
.Internal(inspect(v))
#> @7f9ac0000000 14 REALSXP g1c7 [MARK,REF(1)] (len=10000000, tl=0) 0,42,42,42,42,...

timings for double vector

bench::workout({
  a <- arrow:::Test_array_nonull_dbl_vector(1e7)
  v <- a$as_vector()
  .Internal(inspect(v))
  v[1] <- 0
  .Internal(inspect(v))
})
#> @7f9abc122190 14 REALSXP g0c0 [REF(65535)] std::shared_ptr<arrow::Array, double, NONULL> (len=10000000, ptr=0x7f9aba2109c8)
#> Duplicate
#> @7f9aa5c00000 14 REALSXP g1c7 [MARK,REF(1)] (len=10000000, tl=0) 0,42,42,42,42,...
#> # A tibble: 5 x 3
#>   exprs                                             process     real
#>   <bch:expr>                                       <bch:tm> <bch:tm>
#> 1 a <- arrow:::Test_array_nonull_dbl_vector(1e+07)   70.3ms   70.6ms
#> 2 v <- a$as_vector()                                   13µs   14.3µs
#> 3 .Internal(inspect(v))                                12µs   11.9µs

when a copy is needed, the data is copied entirely:

#> 4 v[1] <- 0                                          53.1ms   53.2ms
#> 5 .Internal(inspect(v))                                20µs   22.6µs

timings for integer vector

bench::workout({
  a <- arrow:::Test_array_nonull_int_vector(1e7)
  v <- a$as_vector()
  .Internal(inspect(v))
  v[1] <- 0
  .Internal(inspect(v))
})
#> @7f9abc5bd780 13 INTSXP g0c0 [REF(65535)] std::shared_ptr<arrow::Array, int32, NONULL> (len=10000000, ptr=0x7f9ab8997378)
#> @7f9ac0000000 14 REALSXP g1c7 [MARK,REF(1)] (len=10000000, tl=0) 0,42,42,42,42,...
#> # A tibble: 5 x 3
#>   exprs                                             process     real
#>   <bch:expr>                                       <bch:tm> <bch:tm>
#> 1 a <- arrow:::Test_array_nonull_int_vector(1e+07)   54.5ms   54.7ms
#> 2 v <- a$as_vector()                                   12µs   13.2µs
#> 3 .Internal(inspect(v))                                11µs   11.3µs
#> 4 v[1] <- 0                                         851.4ms  854.6ms
#> 5 .Internal(inspect(v))                                17µs   18.8µs

Created on 2021-06-08 by the reprex package (v2.0.0)

@github-actions

github-actions bot commented Jun 8, 2021

@romainfrancois
Contributor Author

An interesting side benefit from this is that we also get altrep R vectors from slices:

library(arrow, warn.conflicts = FALSE)
#> See arrow_info() for available features

x <- 1:1e3+ 1L
v <- Array$create(x)
x1 <- v$as_vector()  
.Internal(inspect(x1))
#> @7face1b4ba18 13 INTSXP g0c0 [REF(65535)] std::shared_ptr<arrow::Array, int32, NONULL> (len=1000, ptr=0x7facdc27e998)

x2 <- v$Slice(500)$as_vector()
.Internal(inspect(x2))
#> @7facde2bfd80 13 INTSXP g0c0 [REF(65535)] std::shared_ptr<arrow::Array, int32, NONULL> (len=500, ptr=0x7facdb0a2198)

Created on 2021-06-08 by the reprex package (v2.0.0)

@romainfrancois romainfrancois force-pushed the ARROW_9140_zero_copy branch from f2db09c to 320b33f Compare June 8, 2021 11:08
@romainfrancois
Contributor Author

@nealrichardson I've turned off making INT64 arrays with no nulls into altrep bit64::integer64 vectors (i.e. reusing the payload of the array as the data of the underlying double array in R ...) because of the auto downcasting to int vectors when it fits:

library(arrow, warn.conflicts = FALSE)

i64 <- Array$create(bit64::as.integer64(1:10))
.Internal(inspect(as.vector(i64)))
#> @7ff0b2099c38 13 INTSXP g0c4 [REF(2)] (len=10, tl=0) 1,2,3,4,5,...

Created on 2021-06-08 by the reprex package (v2.0.0)

I can altogether remove the int64 altrep code if needed ... ?

@romainfrancois romainfrancois marked this pull request as ready for review June 8, 2021 11:52
@nealrichardson
Member

This is super cool. We should get some benchmarking on this.

Tangentially related, should we have an options(arrow.altrep) or something to govern whether to use ALTREP? I know vroom has a function argument to enable or disable it (though I don't know why exactly). Might be useful to have in case there are unforeseen issues, or if nothing else it would make it easy for people to test side-by-side the performance.

@romainfrancois
Contributor Author

Interesting, we could make that opt-in. For now it's all happening here so would be easy enough to make conditional:

// [[arrow::export]]
SEXP Array__as_vector(const std::shared_ptr<arrow::Array>& array) {
  auto type = array->type();

#if defined(HAS_ALTREP)
  if (array->null_count() == 0) {
    switch (type->id()) {
      case arrow::Type::DOUBLE:
        return arrow::r::Make_array_nonull_dbl_vector(array);
      case arrow::Type::INT32:
        return arrow::r::Make_array_nonull_int_vector(array);
        // case arrow::Type::INT64:
        //   return arrow::r::Make_array_nonull_int64_vector(array);

      default:
        break;
    }
  }
#endif

  return arrow::r::ArrayVector__as_vector(array->length(), type, {array});
}

@nealrichardson
Member

Cool, so that would just be if(GetBoolOption("arrow.altrep", true) && array->null_count() == 0)

@nealrichardson nealrichardson requested a review from bkietz June 9, 2021 02:51
@romainfrancois
Contributor Author

I've used arrow.use_altrep but I can use arrow.altrep all the same.

I've put GetBoolOption("arrow.altrep", true) after the array->null_count() == 0 because it's cheaper to call null_count().

GetBoolOption("arrow.altrep", true) means that this is opt-out, I'm fine with this but it might be risky.

@nealrichardson
Member

> I've used arrow.use_altrep but I can use arrow.altrep all the same.

SGTM

> I've put GetBoolOption("arrow.altrep", true) after the array->null_count() == 0 because it's cheaper to call null_count().

That surprises me, I'd expect that checking an R boolean would be faster than counting nulls in a (potentially long) array.

> GetBoolOption("arrow.altrep", true) means that this is opt-out, I'm fine with this but it might be risky.

I think defaulting to ALTREP on is good (particularly if/when we quantify its benefits) but leaving an escape hatch in case of error is nice.
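The ordering question above can be illustrated with a small sketch. It assumes the null count is computed lazily and cached after the first call; I believe Arrow arrays behave this way, but treat that as an assumption. `FakeArray`, `Options`, and `ShouldUseAltrep` are hypothetical names and do not come from this PR. Because `&&` short-circuits, the option is only consulted when the array has no nulls, and repeated calls do not rescan the validity data.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical array with a validity vector and a lazily cached null count.
class FakeArray {
 public:
  explicit FakeArray(std::vector<bool> valid) : valid_(std::move(valid)) {}

  // The first call scans the validity vector; later calls reuse the result.
  std::int64_t null_count() const {
    if (cached_nulls_ < 0) {
      std::int64_t n = 0;
      for (bool v : valid_) n += v ? 0 : 1;
      cached_nulls_ = n;
      ++scans_;
    }
    return cached_nulls_;
  }

  // Number of full scans performed so far (for illustration only).
  int scans() const { return scans_; }

 private:
  std::vector<bool> valid_;
  mutable std::int64_t cached_nulls_ = -1;  // -1 means "not computed yet"
  mutable int scans_ = 0;
};

// Hypothetical stand-in for the R option lookup discussed above.
struct Options {
  bool use_altrep = true;  // opt-out: ALTREP is on unless disabled
};

bool ShouldUseAltrep(const FakeArray& array, const Options& opts) {
  // && short-circuits: the option is read only when there are no nulls.
  return array.null_count() == 0 && opts.use_altrep;
}
```

Under that caching assumption the scan is paid once per array, so which check comes first mostly matters on the first call for a long array.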

Member

@pitrou pitrou left a comment

Not an R user, but here are some comments.

@jonkeane
Member

Ok, I've run some benchmarks on this branch and I'm seeing a huge speed up for floats + integers with as.vector(array). 🎉

It might be out of scope for this PR, but chunked arrays don't see a similar speed up (which makes sense given they call ArrayVector__as_vector directly rather than routing through Array__as_vector, so they aren't using ALTREP). I can't quite tell from the cpp whether Table__to_dataframe would just work with ALTREP once ChunkedArrays support it, or if we would need to do more to facilitate that.

library(arrow, warn.conflicts = FALSE)

x <- 1:1e3+ 1L
v <- Array$create(x)
x1 <- v$as_vector()  
.Internal(inspect(x1))
#> @7f9077f5a1a8 13 INTSXP g0c0 [REF(65535)] std::shared_ptr<arrow::Array, int32, NONULL> (len=1000, ptr=0x7f90975a9a08)


v_chunked <- ChunkedArray$create(x)
x2 <- v_chunked$as_vector()  
.Internal(inspect(x2))
#> @7f908312c000 13 INTSXP g0c7 [REF(2)] (len=1000, tl=0) 2,3,4,5,6,...

Created on 2021-06-10 by the reprex package (v2.0.0)

arrowbench results (using the new benchmarks in voltrondata-labs/arrowbench#28):
zero-copy-data-conversion.html.zip

@romainfrancois
Contributor Author

@jonkeane we might be able to deal with a fake ChunkedArray that has only one array, but otherwise we can't use the same trick in the general case (more than one array), because by construction the payload of these arrays is not contiguous, and contiguity is what we leverage here.

I believe this can be a follow up pull request with some altrep vectors to wrap chunked arrays, and arrays with nulls. Implementation will be different, e.g. the DATAPTR() will cause materialization (and so be expensive) but for things that are altrep aware and know how to iterate by region, we might be able to only materialize regions as needed.
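As a rough illustration of materializing only a region, here is a standard-library sketch. `Chunks` and `GetRegion` are hypothetical names and this is not the planned implementation; it only shows the idea of copying a requested range out of non-contiguous chunks without building the whole vector, in the spirit of ALTREP's Get_region hook.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical chunked column: values live in several separate buffers.
using Chunks = std::vector<std::vector<int>>;

// Copy up to `n` values starting at logical index `start` into `out`,
// touching only the chunks that overlap the requested region.
// Returns the number of values actually copied.
std::size_t GetRegion(const Chunks& chunks, std::size_t start, std::size_t n,
                      int* out) {
  std::size_t copied = 0;
  std::size_t offset = 0;  // logical index of the current chunk's first value
  for (const auto& chunk : chunks) {
    if (copied == n) break;
    const std::size_t chunk_end = offset + chunk.size();
    const std::size_t pos = start + copied;  // next logical index to read
    if (pos < chunk_end) {
      const std::size_t local = pos - offset;
      const std::size_t take = std::min(n - copied, chunk.size() - local);
      std::copy_n(chunk.data() + local, take, out + copied);
      copied += take;
    }
    offset = chunk_end;
  }
  return copied;
}
```

A caller that iterates by region would invoke this with a small fixed-size buffer, whereas a caller that asks for a single contiguous data pointer would still force a full copy.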

@jonkeane
Member

That sounds good. The faked ChunkedArray with one chunk might be (surprisingly) beneficial, at least in (some? see below) parquet reading situations.

With csv reading, we frequently (with sufficiently large csvs) get multiple chunks, but I've tried reading parquet a few times to get a table with chunked arrays where there are more than one chunk and haven't so far seen it happen. I need to dig more into the parquet reading source code to see if it is ever possible to get more-than-1-chunks-ChunkedArrays from reading a parquet.

tab <- Table$create(x = 1:100)
tf <- tempfile(); on.exit(unlink(tf))
write_parquet(tab, tf, chunk_size = 10)
read_tab <- read_parquet(tf, as_data_frame = FALSE) 
read_tab[[1]]$num_chunks

@romainfrancois
Contributor Author

romainfrancois commented Jun 18, 2021

library(arrow, warn.conflicts = FALSE)
#> See arrow_info() for available features

c_int <- ChunkedArray$create(1:1000)
c_dbl <- ChunkedArray$create(as.numeric(1:1000))
c_int$num_chunks
#> [1] 1
c_dbl$num_chunks
#> [1] 1
.Internal(inspect(as.vector(c_int)))
#> @7fc1314ce2a8 13 INTSXP g0c0 [REF(65535)] std::shared_ptr<arrow::Array, int32, NONULL> (len=1000, ptr=0x7fc1274a0408)
.Internal(inspect(as.vector(c_dbl)))
#> @7fc131528a90 14 REALSXP g0c0 [REF(65535)] std::shared_ptr<arrow::Array, double, NONULL> (len=1000, ptr=0x7fc12b2b5178)

Created on 2021-06-18 by the reprex package (v2.0.0.9000)

I think I'll deal with this https://issues.apache.org/jira/browse/ARROW-13114 here as well, and then do a follow up for using RTasks (i.e. https://issues.apache.org/jira/browse/ARROW-13113).

@romainfrancois
Contributor Author

Actually, I now think it is better to do the RTasks thing first https://issues.apache.org/jira/browse/ARROW-13113 and then rework RecordBatch__to_dataframe() and Table__to_dataframe()

@romainfrancois
Contributor Author

@bkietz
Member

bkietz commented Jun 18, 2021

I'm not sure why we get this: https://github.com/apache/arrow/pull/10445/checks?check_run_id=2858594249#step:8:2641

Looks like linking the arrow binding failed. There's a cpp11 symbol missing
https://github.com/apache/arrow/pull/10445/checks?check_run_id=2858594151#step:9:5579

... which is odd because it's a template, so it ought to be inlined everywhere

Member

@bkietz bkietz left a comment

This is neat, thanks for doing it. Some minor comments:

}

SEXP MakeInt32ArrayNoNull(const std::shared_ptr<Array>& array) {
  return Int32ArrayNoNull::Make(array);
Member

Edge case worth mentioning in a comment somewhere: IIUC this doesn't check for instances of NA_integer_ in array, and R will treat these as null even though is_altrep_int_nonull(as.vector(array)) is true (and we report no_NA = 1).

Contributor Author

I'll think about where to mention this, but this isn't limited to these altrep variants: any int32 array that happens to contain a non-null value equal to NA_integer_ will be mishandled by R with this sentinel approach.
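For reference, the collision described here can be detected with a simple scan. The sketch below is not part of this PR and `CountSentinelCollisions` is a hypothetical helper. It relies on the fact that R represents NA_integer_ as the smallest 32-bit signed integer, and it counts values in a null-free int32 buffer that R would read as NA.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// R stores NA_integer_ as the smallest 32-bit signed integer.
constexpr std::int32_t kRNaInteger = std::numeric_limits<std::int32_t>::min();

// Count values that are valid in the source data but would read as NA in R.
// `values` is a hypothetical stand-in for a null-free int32 buffer.
std::size_t CountSentinelCollisions(const std::vector<std::int32_t>& values) {
  std::size_t hits = 0;
  for (std::int32_t v : values) {
    if (v == kRNaInteger) ++hits;
  }
  return hits;
}
```

Such a scan is O(n), so running it on every conversion would give up part of the zero-copy benefit; that trade-off is presumably why the thread only proposes documenting the edge case.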

@romainfrancois
Copy link
Contributor Author

@nealrichardson I believe this is ready to go
