Conversation

@romainfrancois
Contributor

No description provided.

@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

@romainfrancois romainfrancois changed the title Use Converter api to convert SEXP to Array ARROW-10530: [R] Use Converter API to convert SEXP to Array/ChunkedArray Nov 12, 2020
@romainfrancois
Contributor Author

Thanks @kszucs for the direct help. This is very far from done, but it's a start, and perhaps we can resume the conversation here.

AFAIK, there is no R equivalent to PyObject*, so I'm trying to mimic that here with this:

struct RObject {
  RVectorType rtype;  // which R storage type the value came from
  void* data;         // pointer to the underlying scalar value
  bool null;          // true when the value is NA
};

I'm sure this will evolve (perhaps into some sort of hierarchy) as other types join the implementation; for now this only handles going from an R integer vector (INTSXP) to an int32 Array.
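
To make the one-by-one flow concrete, here is a minimal sketch (with illustrative names, not the exact code in this PR) of how a per-value loop could wrap each element of an INTSXP in such a tagged value before handing it to a converter:

#include <cstdint>

enum class RVectorType { INT32 };

struct RObject {
  RVectorType rtype;  // which R storage type the value came from
  void* data;         // pointer to the underlying scalar value
  bool null;          // true when the value is NA
};

// R encodes NA_integer_ as INT32_MIN.
constexpr int32_t kNaInteger = INT32_MIN;

// Visit each element of an int vector the way a Converter::Append() loop
// would: one tagged RObject at a time.
template <typename AppendFn>
void VisitIntVector(int32_t* p, int64_t n, AppendFn&& append) {
  for (int64_t i = 0; i < n; i++) {
    append(RObject{RVectorType::INT32, p + i, p[i] == kNaInteger});
  }
}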

@romainfrancois
Contributor Author

romainfrancois commented Nov 13, 2020

library(arrow, warn.conflicts = FALSE)

arrow:::vec_to_arrow(1:2, int32())
#> Array
#> <int32>
#> [
#>   1,
#>   2
#> ]
arrow:::vec_to_arrow(c(1,2), float64())
#> Array
#> <double>
#> [
#>   1,
#>   2
#> ]
arrow:::vec_to_arrow(as.raw(c(1,2)), uint8())
#> Array
#> <uint8>
#> [
#>   1,
#>   2
#> ]

Created on 2020-11-13 by the reprex package (v0.3.0.9001)

@kszucs
Member

kszucs commented Nov 13, 2020

cc @bkietz since we co-authored the python-side refactor

@romainfrancois
Contributor Author

This is still quite a work in progress, as I navigate how the various Converter classes work together.

It all feels a bit weird to operate in terms of single values, i.e. via the Append() methods; since R has no such concept, this has to resort to a lot of casting.

It feels particularly odd with the struct converter (converting from a data frame). Right now, this bypasses Append() and instead uses a Visit() method that appends several values at once (column by column).

Would it make sense to have some sort of vectorised approach?

Should this instead figure out a way to treat data frames so that the struct array gets built row by row instead of column by column?
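
For illustration, a sketch of the column-wise alternative hinted at here (all names hypothetical; this is not the PR's code): each column of the data frame is handed, whole, to the converter for the corresponding struct field.

#include <Rinternals.h>
#include <arrow/status.h>

#include <memory>
#include <vector>

// Hypothetical child-converter interface, just enough for the sketch.
struct FieldConverter {
  virtual ~FieldConverter() = default;
  virtual arrow::Status Extend(SEXP column, R_xlen_t size) = 0;
};

// Convert a data frame column by column rather than row by row: no
// per-row Append(), so each field converter can process its column in bulk.
arrow::Status ExtendStruct(SEXP df, R_xlen_t num_rows,
                           std::vector<std::unique_ptr<FieldConverter>>& fields) {
  for (R_xlen_t i = 0; i < XLENGTH(df); i++) {
    ARROW_RETURN_NOT_OK(fields[i]->Extend(VECTOR_ELT(df, i), num_rows));
  }
  return arrow::Status::OK();
}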

@romainfrancois
Contributor Author

ping @kszucs @bkietz I don't necessarily need a full review at this point, as this is far from done and will need further changes, but perhaps you can have a look and let me know if this is going in the right direction? cc @nealrichardson

@nealrichardson
Member

FYI, the Windows build failure is:

r_to_arrow.cpp: In instantiation of 'arrow::Status arrow::r::VisitInt64Vector(SEXP, R_xlen_t, VisitorFunc&&) [with VisitorFunc = arrow::r::VisitVector(SEXP, R_xlen_t, T*) [with T = arrow::internal::Converter<arrow::r::RScalar*, arrow::r::RConversionOptions>; SEXP = SEXPREC*; R_xlen_t = int]::<lambda(arrow::r::RScalar*)>; SEXP = SEXPREC*; R_xlen_t = int]':
r_to_arrow.cpp:512:84:   required from 'arrow::Status arrow::r::VisitVector(SEXP, R_xlen_t, T*) [with T = arrow::internal::Converter<arrow::r::RScalar*, arrow::r::RConversionOptions>; SEXP = SEXPREC*; R_xlen_t = int]'
r_to_arrow.cpp:535:40:   required from 'arrow::Status arrow::r::Extend(T*, SEXP, R_xlen_t) [with T = arrow::internal::Converter<arrow::r::RScalar*, arrow::r::RConversionOptions>; SEXP = SEXPREC*; R_xlen_t = int]'
r_to_arrow.cpp:762:7:   required from here
r_to_arrow.cpp:451:66: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
     obj.null = is_NA<int64_t>(*reinterpret_cast<int64_t*>(&value));
                                                                  ^
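
For reference, one standard way to address this warning (a sketch, not necessarily the fix applied here) is to copy the bit pattern with std::memcpy instead of dereferencing a type-punned pointer:

#include <cstdint>
#include <cstring>

// Well-defined type punning: memcpy the bits, which compilers optimize to
// a single register move, instead of *reinterpret_cast<int64_t*>(&value).
inline int64_t DoubleBitsAsInt64(double value) {
  int64_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  return bits;
}

// so the offending line could become something like:
//   obj.null = is_NA<int64_t>(DoubleBitsAsInt64(value));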

@kszucs
Member

kszucs commented Dec 7, 2020

It all feels a bit weird to operate in terms of single values, i.e. via the Append() methods; since R has no such concept, this has to resort to a lot of casting.

This converter API was designed to operate on list-like collections of generic objects one-by-one. We have different code paths for more efficient conversion from numpy/pandas, see here.

Would it make sense to have some sort of vectorised approach?

I think so, yes. We could extend the converter API to consume columns instead of individual values.
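
Something along these lines, perhaps (a hypothetical interface sketch, not an existing Arrow API): the one-by-one Append() stays, and a batch entry point receives a whole R column.

#include <Rinternals.h>
#include <arrow/status.h>

class RConverterSketch {
 public:
  virtual ~RConverterSketch() = default;

  // Existing per-value path: convert and append one element.
  virtual arrow::Status Append(SEXP value) = 0;

  // Proposed batch path: convert `size` elements of `column` in one call,
  // so e.g. the contents of an INTSXP can be bulk-appended to an
  // Int32Builder without boxing each element.
  virtual arrow::Status AppendMultiple(SEXP column, R_xlen_t size) = 0;
};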

@nealrichardson
Member

nealrichardson commented Dec 7, 2020

It all feels a bit weird to operate in terms of single values, i.e. via the Append() methods; since R has no such concept, this has to resort to a lot of casting.

This converter API was designed to operate on list-like collections of generic objects one-by-one. We have different code paths for more efficient conversion from numpy/pandas, see here.

I'll observe that the PR that added the new converter API made only a cosmetic change to numpy_to_arrow.cc, so maybe the new API is not so relevant for R after all?

Row-based conversion might make sense when converting an R list-type object, but otherwise you'd expect something vectorized to be better. For better or worse, now that all of this row-based conversion has been implemented here, we can benchmark both ways and compare.

@kszucs
Member

kszucs commented Dec 7, 2020

Row-based conversion might make sense when converting an R list-type object, but otherwise you'd expect something vectorized to be better. For better or worse, now that all of this row-based conversion has been implemented here, we can benchmark both ways and compare.

Sorry, it's not about row- vs. column-wise. Currently the converters have an Append method to convert and append values one-by-one. We can provide an AppendMultiple or something similar to extend the underlying arrays in batches so we can optimize for R's representation.

I'm going to take a look at the PR tomorrow to properly answer Romain's questions.

@romainfrancois
Contributor Author

I believe AppendMultiple() is what I would be looking for. It would, for example, solve my dilemma about converting data frames to struct types ...

I had missed the more efficient Python paths; having a look.

@nealrichardson
Member

nealrichardson commented Jan 5, 2021

Since the latest commits aren't compiling, I did some benchmarking on bcb1be7. Summary of findings:

  • Character to string conversion is usually (but not always) faster with the new code, around 20-30% better. Because string conversion is generally slower than other types, a small percentage improvement can be significant.
  • Integer and integer64 conversion is slower by an order of magnitude or more in the new code.
  • bench::mark didn't report results for numeric vectors because the results were not equal.

Not sure where things will stand with the latest changes, but I think this suggests that (numpy-like) special handling for vector types that can simply be copied/moved to Arrow is important where applicable. Otherwise, the string results suggest that there is some performance gain to be had with this work, and if the new approach handles chunking and parallelization, we can do even better.

Code:

library(arrow)

download.file("https://ursa-qa.s3.amazonaws.com/fanniemae_loanperf/2016Q4.csv.gz", "fanniemae.csv.gz")
df <- read_delim_arrow("fanniemae.csv.gz", delim="|", col_names=FALSE)
dim(df)
## [1] 22180168       31
for (n in names(df)) {
  print(n)
  print(class(df[[n]]))
  try(print(bench::mark(arrow:::Array__from_vector(df[[n]], NULL), arrow:::vec_to_arrow(df[[n]], NULL))))
}

There's also a NYC taxi CSV at https://ursa-qa.s3.amazonaws.com/nyctaxi/yellow_tripdata_2010-01.csv.gz you can test with (just use read_csv_arrow(); it has column names).

@romainfrancois
Contributor Author

I'm at the point now where I've dropped the artificial RScalar class and everything goes through AppendRange(), which is only slightly different from @kszucs's take in #8886.

Mostly because AppendRange() takes a start and a size, whereas IIUC Extend() assumes start = 0.

I could "easily" align with what #8886 does, because at the moment I'm only ever using start = 0 anyway, but the logic behind having a start was to let one R vector contribute to two different chunks in the chunker ...
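
The idea being something like this sketch (AppendRangeSketch is a stand-in for the PR's AppendRange()): when the current chunk only has room for part of the vector, the start offset lets the remainder go into the next chunk.

#include <Rinternals.h>
#include <arrow/status.h>

#include <algorithm>

// Stand-in for the PR's AppendRange(): append `size` values of `x`,
// beginning at element `start`.
arrow::Status AppendRangeSketch(SEXP x, R_xlen_t start, R_xlen_t size);

arrow::Status ChunkedAppend(SEXP x, R_xlen_t n, R_xlen_t room) {
  R_xlen_t first = std::min(n, room);
  // Fill what remains of the current chunk ...
  ARROW_RETURN_NOT_OK(AppendRangeSketch(x, /*start=*/0, first));
  if (first < n) {
    // ... then the same R vector feeds the head of the next chunk.
    ARROW_RETURN_NOT_OK(AppendRangeSketch(x, first, n - first));
  }
  return arrow::Status::OK();
}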

@nealrichardson
Member

the logic behind having a start was to facilitate one R vector contributing to two different chunks in the chunker ...

Right, we want to be able to chunk the inputs--that's ARROW-9293

@nealrichardson
Member

That last commit gets us back to parity-ish on character vector performance, though it's a bit noisy.

@nealrichardson
Member

Running benchmarks on the character vectors with 10 iterations, the performance is basically the same as on master.

I haven't tested factors, but that file does test integer, numeric, integer64, and character, with variation in the number of NAs.

So, something with equivalent performance, but that allows chunking and parallelization by columns (which the old converter does not, IIRC), would in practice have significant benefits.

@romainfrancois
Contributor Author

I've now pulled in the commit from #8886, so this uses Converter::Extend() instead of the previous R-specific AppendRange().

I believe that for the chunker class to be useful in this case, we would need to act on that comment:

// we could get bit smarter here since the whole batch of appendable values
// will be rejected if a capacity error is raised
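
One way to "get a bit smarter" there might be to bisect on a capacity error instead of rejecting the whole batch (a sketch under that assumption, not the PR's code):

#include <arrow/status.h>

#include <cstdint>

// Retry with smaller ranges when the builder reports a capacity error, so
// the part of the batch that fits still lands in the current chunk.
template <typename AppendRangeFn>
arrow::Status AppendWithSplit(AppendRangeFn&& append_range, int64_t start,
                              int64_t size) {
  arrow::Status st = append_range(start, size);
  if (st.IsCapacityError() && size > 1) {
    int64_t half = size / 2;
    ARROW_RETURN_NOT_OK(AppendWithSplit(append_range, start, half));
    // (a real implementation would finish the current chunk here)
    return AppendWithSplit(append_range, start + half, size - half);
  }
  return st;
}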

@romainfrancois
Contributor Author

Oops, I only saw your message now @nealrichardson, after I had essentially gone ahead with the migration/elimination of code. Can we envision running the same code with this pull request vs. master? Maybe by setting up continuous benchmarking, with https://github.com/r-lib/bench#continuous-benchmarking

I'm arriving at the end of this week's sprint soon; I'll mark this as ready for review and stay vigilant.

Otherwise I can roll back today's commits and add the option 😬.

@romainfrancois romainfrancois marked this pull request as ready for review February 10, 2021 12:26
@romainfrancois
Contributor Author

e.g. dplyr has this action: https://github.com/tidyverse/dplyr/blob/master/.github/workflows/continuous-benchmarks.yaml

This might need to be adapted to arrow's offering of many actions, but perhaps we can set it up and then rebase, so that we don't have to deal with options etc.

@nealrichardson
Member

No worries, we've been working on some other continuous benchmarking tools and can test things out with them. Thanks for pushing this forward.

Comment on lines +912 to +913
// TODO: this probably should be disabled when x is an ALTREP object
// because MakeSimpleArray below will force materialization
Member

Isn't this unavoidable though? We have to get the data into Arrow format somehow.

I guess the one case where you might not want to materialize is if you had an ALTREP object that was backed by an Arrow array (which of course we don't do yet)--then you could just grab the Arrow array.

Contributor Author


Maybe that's a reach, but perhaps there are cases where this would mean materializing only in Arrow without materializing in R, for example if the Arrow data type is not representable in R, e.g. uint32.
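
A sketch of the guard being discussed (can_reuse_memory is a hypothetical name): the zero-copy path would be skipped for ALTREP vectors, whose DATAPTR may itself force materialization, leaving them to the regular converter path.

#include <Rinternals.h>

// Only take the no-copy MakeSimpleArray-style path when the vector's
// contiguous data already exists in memory.
bool can_reuse_memory(SEXP x) {
  return !ALTREP(x);  // ALTREP() is true for alternate representations
}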

@jonkeane
Member

I've been helping out benchmarking these changes against the 3.0 release, and everything I'm seeing is in line: no performance regressions. I've used a handful of our real-world datasets along with some synthetic datasets made up of individual types, and they all match the 3.0 release performance-wise, which is great. I've been adding the benchmarks to arrowbench; they are currently in a PR there in case you're curious about them.

One thing I did notice is that simple feature (sf) columns are having issues that aren't there in the release.

Here's a test that exercises the issue (I dug a bit to see if I could find the bug but haven't yet). The structure of the list column is meant to be minimal, but it is based on the failure I saw with a real sf tibble (see below):

test_that("sf-like list columns", {
  df <- tibble::tibble(col = list(structure(list(1), class = c("one"))))
  expect_array_roundtrip(df)
})

The error and traceback are:

<error/vctrs_error_scalar_type>
Input must be a vector, not a `one` object.
Backtrace:
    █
 1. ├─Table$create(df)
 2. │ └─arrow:::Table__from_dots(dots, schema)
 3. └─vctrs:::stop_scalar_type(...)
 4.   └─vctrs:::stop_vctrs(msg, "vctrs_error_scalar_type", actual = x)

A more naturalistic example of this is the following, which works in 3.0 but not on this branch:

library(arrow, warn.conflicts = FALSE)

df_simple <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
tab_simple <- Table$create(df_simple)

@romainfrancois
Contributor Author

A slightly simpler reprex:

arrow::Array$create(list(structure(list(1), class = c("one"))))
#> Error: Input must be a vector, not a `one` object.

Created on 2021-03-01 by the reprex package (v0.3.0)

@romainfrancois
Contributor Author

romainfrancois commented Mar 1, 2021

Which essentially is:

vctrs::vec_size(structure(list(1), class = c("one")))
#> Error: Input must be a vector, not a `one` object.

Created on 2021-03-01 by the reprex package (v0.3.0)

as used in the ListConverter:

template <typename T>
class RListConverter : public ListConverter<T, RConverter, RConverterTrait> {
 public:
  Status Extend(SEXP x, int64_t size) override {
    RETURN_NOT_OK(this->Reserve(size));

    RVectorType rtype = GetVectorType(x);
    if (rtype != LIST) {
      return Status::Invalid("Cannot convert to list type");
    }

    auto append_value = [this](SEXP value) {
      auto n = vctrs::short_vec_size(value);    // <---------- HERE
      RETURN_NOT_OK(this->list_builder_->ValidateOverflow(n));
      RETURN_NOT_OK(this->list_builder_->Append());
      return this->value_converter_.get()->Extend(value, n);
    };
    auto append_null = [this]() { return this->list_builder_->AppendNull(); };
    return RVectorVisitor<SEXP>::Visit(x, size, append_null, append_value);
  }
};
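
A hypothetical workaround (not necessarily the fix that was pushed): treat any classed VECSXP as a plain list and use its length, falling back to vctrs only for the cases it can handle.

#include <Rinternals.h>

namespace vctrs {
// Declared in the PR's r_to_arrow.cpp.
R_xlen_t short_vec_size(SEXP x);
}

R_xlen_t SafeVecSize(SEXP x) {
  // Classed lists (e.g. sf's "sfg" objects) may have no vctrs proxy
  // registered, in which case vec_size() errors; their plain length is a
  // reasonable answer for list conversion.
  if (TYPEOF(x) == VECSXP && Rf_isObject(x)) {
    return XLENGTH(x);
  }
  return vctrs::short_vec_size(x);
}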

@romainfrancois
Contributor Author

In terms of the sf example, I get:

df_simple <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
vctrs::vec_size(df_simple$geometry)
#> [1] 100
class(df_simple$geometry)
#> [1] "sfc_MULTIPOLYGON" "sfc"

vctrs::vec_size(df_simple$geometry[[1]])
#> Error: Input must be a vector, not a `XY/MULTIPOLYGON/sfg` object.
class(df_simple$geometry[[1]])
#> [1] "XY"           "MULTIPOLYGON" "sfg"

Created on 2021-03-01 by the reprex package (v0.3.0)

I'm not sure whether this is a vctrs or an sf issue. cc @lionel- @edzer

@romainfrancois
Contributor Author

I've put a fix in place, at least momentarily.

@jonkeane
Member

jonkeane commented Mar 1, 2021

OK, I've looked at this more, and it looks like my first reprex was a bit too minimal. Here is another example/test we could/should add to exercise this (I'm happy to push a commit including it to this branch if you would like):

test_that("sf-like list columns", {
  df <- structure(list(col = structure(list(structure(list(list(structure(1))), class = "inner")), class = "outer")), class = "data.frame")
  expect_array_roundtrip(df)
})

The fix that you made does address the error on the inner "class", but I believe that

options.size = vctrs::short_vec_size(x);

is tripping on the outer "class".

This was extra fun to debug (and it explains why the sf example above worked): it looks like sf registers some vctrs methods, which means these will work so long as those methods have been registered (i.e. whenever sf is loaded).

I suspect it would be very infrequent for someone to want to round-trip a parquet file including sf data without also having sf loaded (so in practice the current state would be fine), but attempting to debug what's going on is super complicated (and any other non-standard vctrs objects that use that outer-level style of classing would fail similarly).

@nealrichardson
Member


Thanks @romainfrancois! I'll merge now and make a followup issue for the edge case.

@nealrichardson
Member

Followup is ARROW-11832. Thanks @jonkeane for digging!

@romainfrancois
Contributor Author

The df above is not a valid data frame, so that's OK. However, it does indeed look like there is a follow-up problem:

library(arrow, warn.conflicts = FALSE)

df <- structure(list(col = structure(list(structure(list(list(structure(1))), class = "inner")), class = "outer")), class = "data.frame")
nrow(df)
#> [1] 0
vctrs::vec_size(df)
#> Error: Corrupt data frame: row.names are missing
Array$create(df)
#> Error: Corrupt data frame: row.names are missing

df <- structure(list(col = structure(list(structure(list(list(structure(1))), class = "inner")), class = "outer")), class = "data.frame", row.names = c(NA, -1L))
nrow(df)
#> [1] 1
vctrs::vec_size(df)
#> [1] 1
Array$create(df)
#> Error: Input must be a vector, not a <outer> object.

Created on 2021-04-06 by the reprex package (v0.3.0)
