Conversation

@romainfrancois
Contributor

No description provided.

@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

@romainfrancois romainfrancois changed the title Use Converter api to convert SEXP to Array ARROW-10530: [R] Use Converter API to convert SEXP to Array/ChunkedArray Nov 12, 2020
@romainfrancois
Contributor Author

Thanks @kszucs for the direct help. This is very far from done, but it's a start, and perhaps we can resume the conversation here.

AFAIK, there is no R equivalent to PyObject*, so I'm trying to mimic that here with this:

struct RObject {
  RVectorType rtype;  // which R storage type the value came from
  void* data;         // pointer to the underlying scalar value
  bool null;          // true when the value is NA
};

I'm sure this will evolve (perhaps into some sort of hierarchy) as other types join the implementation; for now this only handles going from an R integer vector (INTSXP) to an int32 Array.
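
To make the one-by-one flow concrete, here is a minimal sketch (with illustrative names, not the exact code in this PR) of how a per-value loop could wrap each element of an INTSXP in such a tagged value before handing it to a converter:

#include <cstdint>

enum class RVectorType { INT32 };

struct RObject {
  RVectorType rtype;  // which R storage type the value came from
  void* data;         // pointer to the underlying scalar value
  bool null;          // true when the value is NA
};

// R encodes NA_integer_ as INT32_MIN.
constexpr int32_t kNaInteger = INT32_MIN;

// Visit each element of an int vector the way a Converter::Append() loop
// would: one tagged RObject at a time.
template <typename AppendFn>
void VisitIntVector(int32_t* p, int64_t n, AppendFn&& append) {
  for (int64_t i = 0; i < n; i++) {
    append(RObject{RVectorType::INT32, p + i, p[i] == kNaInteger});
  }
}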

@romainfrancois
Contributor Author

romainfrancois commented Nov 13, 2020

library(arrow, warn.conflicts = FALSE)

arrow:::vec_to_arrow(1:2, int32())
#> Array
#> <int32>
#> [
#>   1,
#>   2
#> ]
arrow:::vec_to_arrow(c(1,2), float64())
#> Array
#> <double>
#> [
#>   1,
#>   2
#> ]
arrow:::vec_to_arrow(as.raw(c(1,2)), uint8())
#> Array
#> <uint8>
#> [
#>   1,
#>   2
#> ]

Created on 2020-11-13 by the reprex package (v0.3.0.9001)

@kszucs
Member

kszucs commented Nov 13, 2020

cc @bkietz since we co-authored the python-side refactor

@romainfrancois
Contributor Author

This is still quite a work in progress, as I navigate how the various Converter classes work together.

It all feels a bit weird to operate in terms of single values, i.e. via the Append() methods; since R has no such concept, this has to resort to a lot of casting.

It feels particularly odd with the struct converter (converting from a data frame). Right now, this bypasses Append() and instead uses a Visit() method that appends several values at once (column by column).

Would it make sense to have some sort of vectorised approach?

Should this instead figure out a way to treat data frames so that the struct array gets built row by row instead of column by column?
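
For illustration, a sketch of the column-wise alternative hinted at here (all names hypothetical; this is not the PR's code): each column of the data frame is handed, whole, to the converter for the corresponding struct field.

#include <Rinternals.h>
#include <arrow/status.h>

#include <memory>
#include <vector>

// Hypothetical child-converter interface, just enough for the sketch.
struct FieldConverter {
  virtual ~FieldConverter() = default;
  virtual arrow::Status Extend(SEXP column, R_xlen_t size) = 0;
};

// Convert a data frame column by column rather than row by row: no
// per-row Append(), so each field converter can process its column in bulk.
arrow::Status ExtendStruct(SEXP df, R_xlen_t num_rows,
                           std::vector<std::unique_ptr<FieldConverter>>& fields) {
  for (R_xlen_t i = 0; i < XLENGTH(df); i++) {
    ARROW_RETURN_NOT_OK(fields[i]->Extend(VECTOR_ELT(df, i), num_rows));
  }
  return arrow::Status::OK();
}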

@romainfrancois
Contributor Author

ping @kszucs @bkietz I don't necessarily need a full review at this point, as this is far from done and will need further changes, but perhaps you can have a look and let me know if this is going in the right direction? cc @nealrichardson

@nealrichardson
Member

FYI, the Windows build failure is:

r_to_arrow.cpp: In instantiation of 'arrow::Status arrow::r::VisitInt64Vector(SEXP, R_xlen_t, VisitorFunc&&) [with VisitorFunc = arrow::r::VisitVector(SEXP, R_xlen_t, T*) [with T = arrow::internal::Converter<arrow::r::RScalar*, arrow::r::RConversionOptions>; SEXP = SEXPREC*; R_xlen_t = int]::<lambda(arrow::r::RScalar*)>; SEXP = SEXPREC*; R_xlen_t = int]':
r_to_arrow.cpp:512:84:   required from 'arrow::Status arrow::r::VisitVector(SEXP, R_xlen_t, T*) [with T = arrow::internal::Converter<arrow::r::RScalar*, arrow::r::RConversionOptions>; SEXP = SEXPREC*; R_xlen_t = int]'
r_to_arrow.cpp:535:40:   required from 'arrow::Status arrow::r::Extend(T*, SEXP, R_xlen_t) [with T = arrow::internal::Converter<arrow::r::RScalar*, arrow::r::RConversionOptions>; SEXP = SEXPREC*; R_xlen_t = int]'
r_to_arrow.cpp:762:7:   required from here
r_to_arrow.cpp:451:66: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
     obj.null = is_NA<int64_t>(*reinterpret_cast<int64_t*>(&value));
                                                                  ^
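
For reference, one standard way to address this warning (a sketch, not necessarily the fix applied here) is to copy the bit pattern with std::memcpy instead of dereferencing a type-punned pointer:

#include <cstdint>
#include <cstring>

// Well-defined type punning: memcpy the bits, which compilers optimize to
// a single register move, instead of *reinterpret_cast<int64_t*>(&value).
inline int64_t DoubleBitsAsInt64(double value) {
  int64_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  return bits;
}

// so the offending line could become something like:
//   obj.null = is_NA<int64_t>(DoubleBitsAsInt64(value));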

@kszucs
Member

kszucs commented Dec 7, 2020

It all feels a bit weird to operate in terms of single values, i.e. via the Append() methods; since R has no such concept, this has to resort to a lot of casting.

This converter API was designed to operate on list-like collections of generic objects one-by-one. We have different code paths for more efficient conversion from numpy/pandas, see here.

Would it make sense to have some sort of vectorised approach?

I think so, yes. We could extend the converter API to consume columns instead of individual values.
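
Something along these lines, perhaps (a hypothetical interface sketch, not an existing Arrow API): the one-by-one Append() stays, and a batch entry point receives a whole R column.

#include <Rinternals.h>
#include <arrow/status.h>

class RConverterSketch {
 public:
  virtual ~RConverterSketch() = default;

  // Existing per-value path: convert and append one element.
  virtual arrow::Status Append(SEXP value) = 0;

  // Proposed batch path: convert `size` elements of `column` in one call,
  // so e.g. the contents of an INTSXP can be bulk-appended to an
  // Int32Builder without boxing each element.
  virtual arrow::Status AppendMultiple(SEXP column, R_xlen_t size) = 0;
};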

@nealrichardson
Member

nealrichardson commented Dec 7, 2020

It all feels a bit weird to operate in terms of single values, i.e. via the Append() methods; since R has no such concept, this has to resort to a lot of casting.

This converter API was designed to operate on list-like collections of generic objects one-by-one. We have different code paths for more efficient conversion from numpy/pandas, see here.

I'll observe that the PR that added the new converter API made only a cosmetic change to numpy_to_arrow.cc, so maybe the new API is not so relevant for R after all?

Row-based conversion might make sense when converting an R list-type object, but otherwise you'd expect something vectorized to be better. For better or worse, now that all of this row-based conversion has been implemented here, we can benchmark both ways and compare.

@kszucs
Member

kszucs commented Dec 7, 2020

Row-based conversion might make sense when converting an R list-type object, but otherwise you'd expect something vectorized to be better. For better or worse, now that all of this row-based conversion has been implemented here, we can benchmark both ways and compare.

Sorry, it's not about row- vs. column-wise. Currently the converters have an Append method to convert and append values one-by-one. We can provide an AppendMultiple or something similar to extend the underlying arrays in batches so we can optimize for R's representation.

I'm going to take a look at the PR tomorrow to properly answer Romain's questions.

@romainfrancois
Contributor Author

I believe AppendMultiple() is what I would be looking for. It would, for example, solve my dilemma about converting data frames to struct types ...

I had missed the more efficient Python paths; having a look.

@nealrichardson
Member

nealrichardson commented Jan 5, 2021

Since the latest commits aren't compiling, I did some benchmarking on bcb1be7. Summary of findings:

  • Character to string conversion is usually (but not always) faster with the new code, around 20-30% better. Because string conversion is generally slower than other types, a small percentage improvement can be significant.
  • Integer and integer64 conversion is slower by an order of magnitude or more in the new code.
  • bench::mark didn't report results for numeric vectors because the results were not equal.

Not sure where things will stand with the latest changes, but I think this suggests that (numpy-like) special handling for vector types that can simply be copied/moved to Arrow is important where applicable. Otherwise, the string results suggest that there is some performance gain to be had with this work, and if the new approach handles chunking and parallelization, we can do even better.

Code:

library(arrow)

download.file("https://ursa-qa.s3.amazonaws.com/fanniemae_loanperf/2016Q4.csv.gz", "fanniemae.csv.gz")
df <- read_delim_arrow("fanniemae.csv.gz", delim="|", col_names=FALSE)
dim(df)
## [1] 22180168       31
for (n in names(df)) {
  print(n)
  print(class(df[[n]]))
  try(print(bench::mark(arrow:::Array__from_vector(df[[n]], NULL), arrow:::vec_to_arrow(df[[n]], NULL))))
}

There's also a NYC taxi CSV at https://ursa-qa.s3.amazonaws.com/nyctaxi/yellow_tripdata_2010-01.csv.gz you can test with (just use read_csv_arrow(); it has column names).

@romainfrancois
Contributor Author

I'm at the point now where I've dropped the artificial RScalar class and everything goes through AppendRange(), which is only slightly different from @kszucs's take in #8886.

Mostly because AppendRange() takes a start and a size, whereas IIUC Extend() assumes start = 0.

I could "easily" align with what #8886 does, because at the moment I'm only ever using start = 0 anyway, but the logic behind having a start was to let one R vector contribute to two different chunks in the chunker ...
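
The idea being something like this sketch (AppendRangeSketch is a stand-in for the PR's AppendRange()): when the current chunk only has room for part of the vector, the start offset lets the remainder go into the next chunk.

#include <Rinternals.h>
#include <arrow/status.h>

#include <algorithm>

// Stand-in for the PR's AppendRange(): append `size` values of `x`,
// beginning at element `start`.
arrow::Status AppendRangeSketch(SEXP x, R_xlen_t start, R_xlen_t size);

arrow::Status ChunkedAppend(SEXP x, R_xlen_t n, R_xlen_t room) {
  R_xlen_t first = std::min(n, room);
  // Fill what remains of the current chunk ...
  ARROW_RETURN_NOT_OK(AppendRangeSketch(x, /*start=*/0, first));
  if (first < n) {
    // ... then the same R vector feeds the head of the next chunk.
    ARROW_RETURN_NOT_OK(AppendRangeSketch(x, first, n - first));
  }
  return arrow::Status::OK();
}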

@nealrichardson
Member

the logic behind having a start was to facilitate one R vector contributing to two different chunks in the chunker ...

Right, we want to be able to chunk the inputs--that's ARROW-9293

@nealrichardson
Member

That last commit gets us back to parity-ish on character vector performance, though it's a bit noisy.

@nealrichardson
Member

Running benchmarks on the character vectors with 10 iterations, the performance is basically the same as on master.

I haven't tested factors, but that file does test integer, numeric, integer64, and character, with variation in the number of NAs.

So, something with equivalent performance, but that allows chunking and parallelization by columns (which the old converter does not, IIRC), would in practice have significant benefits.

@romainfrancois
Contributor Author

I've now pulled in the commit from #8886, so this uses Converter::Extend() instead of the previous R-specific AppendRange().

I believe that for the chunker class to be useful in this case, we would need to act on that comment:

// we could get bit smarter here since the whole batch of appendable values
// will be rejected if a capacity error is raised
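
One way to "get a bit smarter" there might be to bisect on a capacity error instead of rejecting the whole batch (a sketch under that assumption, not the PR's code):

#include <arrow/status.h>

#include <cstdint>

// Retry with smaller ranges when the builder reports a capacity error, so
// the part of the batch that fits still lands in the current chunk.
template <typename AppendRangeFn>
arrow::Status AppendWithSplit(AppendRangeFn&& append_range, int64_t start,
                              int64_t size) {
  arrow::Status st = append_range(start, size);
  if (st.IsCapacityError() && size > 1) {
    int64_t half = size / 2;
    ARROW_RETURN_NOT_OK(AppendWithSplit(append_range, start, half));
    // (a real implementation would finish the current chunk here)
    return AppendWithSplit(append_range, start + half, size - half);
  }
  return st;
}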

@romainfrancois
Contributor Author

Oops, I only saw your message now @nealrichardson, after I had essentially gone ahead with the migration/elimination of code. Can we envision running the same code with this pull request vs. master? Maybe by setting up continuous benchmarking, with https://github.com/r-lib/bench#continuous-benchmarking

I'm arriving at the end of this week's sprint soon; I'll mark this as ready for review and stay vigilant.

Otherwise I can roll back today's commits and add the option 😬.

@romainfrancois romainfrancois marked this pull request as ready for review February 10, 2021 12:26
@romainfrancois
Contributor Author

e.g. dplyr has this action: https://github.com/tidyverse/dplyr/blob/master/.github/workflows/continuous-benchmarks.yaml

This might need to be adapted to arrow's offering of many actions, but perhaps we can set it up and then rebase, so that we don't have to deal with options etc.

@nealrichardson
Member

No worries, we've been working on some other continuous benchmarking tools and can test things out with them. Thanks for pushing this forward.

Comment on lines +912 to +913
// TODO: this probably should be disabled when x is an ALTREP object
// because MakeSimpleArray below will force materialization
Member

Isn't this unavoidable though? We have to get the data into Arrow format somehow.

I guess the one case where you might not want to materialize is if you had an ALTREP object that was backed by an Arrow array (which of course we don't do yet)--then you could just grab the Arrow array.

Contributor Author


Maybe that's a reach, but perhaps there are cases where this would mean materializing only in Arrow without materializing in R, for example if the Arrow data type is not representable in R, e.g. uint32.
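
A sketch of the guard being discussed (can_reuse_memory is a hypothetical name): the zero-copy path would be skipped for ALTREP vectors, whose DATAPTR may itself force materialization, leaving them to the regular converter path.

#include <Rinternals.h>

// Only take the no-copy MakeSimpleArray-style path when the vector's
// contiguous data already exists in memory.
bool can_reuse_memory(SEXP x) {
  return !ALTREP(x);  // ALTREP() is true for alternate representations
}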

@jonkeane
Member

I've been helping out benchmarking these changes against the 3.0 release, and everything I'm seeing is in line: no performance regressions. I've used a handful of our real-world datasets along with some synthetic datasets made up of individual types, and they all match the 3.0 release performance-wise, which is great. I've been adding the benchmarks to arrowbench; they are currently in a PR there in case you're curious about them.

One thing I did notice is that simple feature (sf) columns are having issues that aren't there in the release.

Here's a test that exercises the issue (I dug a bit to see if I could find the bug but haven't yet). The structure of the list column is meant to be minimal, but it is based on the failure I saw with a real sf tibble (see below):

test_that("sf-like list columns", {
  df <- tibble::tibble(col = list(structure(list(1), class = c("one"))))
  expect_array_roundtrip(df)
})

The error and traceback are:

<error/vctrs_error_scalar_type>
Input must be a vector, not a `one` object.
Backtrace:
    █
 1. ├─Table$create(df)
 2. │ └─arrow:::Table__from_dots(dots, schema)
 3. └─vctrs:::stop_scalar_type(...)
 4.   └─vctrs:::stop_vctrs(msg, "vctrs_error_scalar_type", actual = x)

A more naturalistic example of this is the following, which works in 3.0 but not on this branch:

library(arrow, warn.conflicts = FALSE)

df_simple <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
tab_simple <- Table$create(df_simple)

@romainfrancois
Contributor Author

A slightly simpler reprex:

arrow::Array$create(list(structure(list(1), class = c("one"))))
#> Error: Input must be a vector, not a `one` object.

Created on 2021-03-01 by the reprex package (v0.3.0)

@romainfrancois
Contributor Author

romainfrancois commented Mar 1, 2021

Which essentially is:

vctrs::vec_size(structure(list(1), class = c("one")))
#> Error: Input must be a vector, not a `one` object.

Created on 2021-03-01 by the reprex package (v0.3.0)

as used in the ListConverter:

template <typename T>
class RListConverter : public ListConverter<T, RConverter, RConverterTrait> {
 public:
  Status Extend(SEXP x, int64_t size) override {
    RETURN_NOT_OK(this->Reserve(size));

    RVectorType rtype = GetVectorType(x);
    if (rtype != LIST) {
      return Status::Invalid("Cannot convert to list type");
    }

    auto append_value = [this](SEXP value) {
      auto n = vctrs::short_vec_size(value);    // <---------- HERE
      RETURN_NOT_OK(this->list_builder_->ValidateOverflow(n));
      RETURN_NOT_OK(this->list_builder_->Append());
      return this->value_converter_.get()->Extend(value, n);
    };
    auto append_null = [this]() { return this->list_builder_->AppendNull(); };
    return RVectorVisitor<SEXP>::Visit(x, size, append_null, append_value);
  }
};
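
A hypothetical workaround (not necessarily the fix that was pushed): treat any classed VECSXP as a plain list and use its length, falling back to vctrs only for the cases it can handle.

#include <Rinternals.h>

namespace vctrs {
// Declared in the PR's r_to_arrow.cpp.
R_xlen_t short_vec_size(SEXP x);
}

R_xlen_t SafeVecSize(SEXP x) {
  // Classed lists (e.g. sf's "sfg" objects) may have no vctrs proxy
  // registered, in which case vec_size() errors; their plain length is a
  // reasonable answer for list conversion.
  if (TYPEOF(x) == VECSXP && Rf_isObject(x)) {
    return XLENGTH(x);
  }
  return vctrs::short_vec_size(x);
}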

@romainfrancois
Contributor Author

In terms of the sf example, I get:

df_simple <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
vctrs::vec_size(df_simple$geometry)
#> [1] 100
class(df_simple$geometry)
#> [1] "sfc_MULTIPOLYGON" "sfc"

vctrs::vec_size(df_simple$geometry[[1]])
#> Error: Input must be a vector, not a `XY/MULTIPOLYGON/sfg` object.
class(df_simple$geometry[[1]])
#> [1] "XY"           "MULTIPOLYGON" "sfg"

Created on 2021-03-01 by the reprex package (v0.3.0)

I'm not sure whether this is a vctrs or an sf issue. cc @lionel- @edzer

@romainfrancois
Contributor Author

I've put a fix in place, at least momentarily.

@jonkeane
Member

jonkeane commented Mar 1, 2021

OK, I've looked at this more, and it looks like my first reprex was a bit too minimal. Here is another example/test we could/should add to exercise this (I'm happy to push a commit including it to this branch if you would like):

test_that("sf-like list columns", {
  df <- structure(list(col = structure(list(structure(list(list(structure(1))), class = "inner")), class = "outer")), class = "data.frame")
  expect_array_roundtrip(df)
})

The fix that you made does address the error on the inner "class", but I believe that

options.size = vctrs::short_vec_size(x);

is tripping on the outer "class".

This was extra fun to debug (and it explains why the sf example above worked): it looks like sf registers some vctrs methods, which means these will work so long as those methods have been registered (i.e. whenever sf is loaded).

I suspect it would be very infrequent for someone to want to round-trip a parquet file including sf data without also having sf loaded (so in practice the current state would be fine), but attempting to debug what's going on is super complicated (and any other non-standard vctrs objects that use that outer-level style of classing would fail similarly).

@nealrichardson
Member


Thanks @romainfrancois! I'll merge now and make a followup issue for the edge case.

@nealrichardson
Member

Followup is ARROW-11832. Thanks @jonkeane for digging!

@romainfrancois
Contributor Author

The df above is not a valid data frame, so that's OK. However, it does indeed look like there is a follow-up problem:

library(arrow, warn.conflicts = FALSE)

df <- structure(list(col = structure(list(structure(list(list(structure(1))), class = "inner")), class = "outer")), class = "data.frame")
nrow(df)
#> [1] 0
vctrs::vec_size(df)
#> Error: Corrupt data frame: row.names are missing
Array$create(df)
#> Error: Corrupt data frame: row.names are missing

df <- structure(list(col = structure(list(structure(list(list(structure(1))), class = "inner")), class = "outer")), class = "data.frame", row.names = c(NA, -1L))
nrow(df)
#> [1] 1
vctrs::vec_size(df)
#> [1] 1
Array$create(df)
#> Error: Input must be a vector, not a <outer> object.

Created on 2021-04-06 by the reprex package (v0.3.0)
