ARROW-10570: [R] Use Converter API to convert SEXP to Array/ChunkedArray #8650
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename the pull request title in the following format? See also: …
Thanks @kszucs for the direct help. This is very far from done, but it's a start, and perhaps we can resume the conversation here. AFAIK, there is no R equivalent to

```cpp
struct RObject {
  RVectorType rtype;
  void* data;
  bool null;
};
```

This will evolve (perhaps into some sort of hierarchy), I'm sure, as other types join the implementation (for now this only handles going from an R integer vector (INTSXP) to an …)
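As a rough illustration of how such a tagged value might be consumed, here is a minimal self-contained sketch (the enum values and the `GetInt32` helper are hypothetical, not code from this PR): a converter would check `rtype` and `null` before casting `data`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the PR's types, just so the sketch compiles.
enum RVectorType { INT32, FLOAT64 };

struct RObject {
  RVectorType rtype;  // which R vector type the value came from
  void* data;         // pointer to the underlying value
  bool null;          // true when the value is NA
};

// Hypothetical helper: read an int32 out of an RObject, falling back to
// null_value when the object is NA or holds a different type.
int32_t GetInt32(const RObject& obj, int32_t null_value) {
  if (obj.null || obj.rtype != INT32) return null_value;
  return *static_cast<int32_t*>(obj.data);
}
```
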
```r
library(arrow, warn.conflicts = FALSE)
arrow:::vec_to_arrow(1:2, int32())
#> Array
#> <int32>
#> [
#>   1,
#>   2
#> ]
arrow:::vec_to_arrow(c(1,2), float64())
#> Array
#> <double>
#> [
#>   1,
#>   2
#> ]
arrow:::vec_to_arrow(as.raw(c(1,2)), uint8())
#> Array
#> <uint8>
#> [
#>   1,
#>   2
#> ]
```

Created on 2020-11-13 by the reprex package (v0.3.0.9001)
cc @bkietz since we co-authored the python-side refactor
This is still quite WIP, as I navigate how the various … It all feels a bit weird to operate in terms of single values, i.e. via the … It feels particularly odd with the struct converter (converting from a data frame). As of right now, this is bypassing …

Would it make sense to have some sort of vectorised approach? Should this instead figure out a way to treat data frames so that the struct array gets built row by row instead of column by column?
ping @kszucs @bkietz: I don't necessarily need a full review at this point, as this is far from done and will need further changes, but perhaps you can have a look and let me know if this goes in the right direction? cc @nealrichardson
FYI the Windows build failure is …
This converter API was designed to operate on list-like collections of generic objects one-by-one. We have different code paths for more efficient conversion from numpy/pandas, see here.

I think so, yes. We could extend the converter API to consume columns instead of individual values.
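To make the value-wise vs. column-wise contrast concrete, here is a minimal self-contained sketch (the stub `Int32Builder` and the `kNaInteger` sentinel are hypothetical, not the real Arrow builder API): the generic converter path appends one scalar at a time and branches on null per value, while a column-consuming path can copy a whole validated buffer in one call.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an Arrow int32 builder.
struct Int32Builder {
  std::vector<int32_t> values;
  std::vector<bool> valid;
  void Append(int32_t v) { values.push_back(v); valid.push_back(true); }
  void AppendNull() { values.push_back(0); valid.push_back(false); }
  // Bulk path: copy a contiguous buffer of known-valid values in one call.
  void AppendValues(const int32_t* data, std::size_t n) {
    values.insert(values.end(), data, data + n);
    valid.insert(valid.end(), n, true);
  }
};

constexpr int32_t kNaInteger = INT32_MIN;  // R's NA_integer_ sentinel

// Value-wise: what a generic Extend does, one scalar at a time.
void ExtendValueWise(Int32Builder& b, const std::vector<int32_t>& column) {
  for (int32_t v : column) {
    if (v == kNaInteger) b.AppendNull();
    else b.Append(v);
  }
}

// Column-wise: what a vectorized Extend could do when there are no nulls.
void ExtendColumnWise(Int32Builder& b, const std::vector<int32_t>& column) {
  b.AppendValues(column.data(), column.size());
}
```

The bulk path avoids a per-element branch and function call, which is where the numpy-style fast paths get their speedup.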
I'll observe that the PR that added the new converter API made only a cosmetic change to numpy_to_arrow.cc, so maybe the new API is not so relevant for R after all? Row-based conversion might make sense when converting an R list-type object, but otherwise you'd expect something vectorized to be better. For better or worse, now that all of this row-based conversion has been implemented here, we can benchmark both ways and compare.
Sorry, it's not about row- vs. column-wise. Currently the converters have an … I'm going to take a look at the PR tomorrow to properly answer Romain's questions.
I believe I had missed the more efficient Python paths; having a look.
Since the latest commits aren't compiling, I did some benchmarking on bcb1be7. Summary of findings: …

Not sure where things will stand with the latest changes, but I think this suggests that the (numpy-like) special handling for vector types that can be just copied/moved to Arrow is important where appropriate. Otherwise, the string results suggest that there is some performance gain to be had with this work, and if the new approach will handle chunking and parallelization, we can do even better.

Code:

```r
download.file("https://ursa-qa.s3.amazonaws.com/fanniemae_loanperf/2016Q4.csv.gz", "fanniemae.csv.gz")
df <- read_delim_arrow("fanniemae.csv.gz", delim = "|", col_names = FALSE)
dim(df)
## [1] 22180168 31
for (n in names(df)) {
  print(n)
  print(class(df[[n]]))
  try(print(bench::mark(arrow:::Array__from_vector(df[[n]], NULL), arrow:::vec_to_arrow(df[[n]], NULL))))
}
```

There's also a NYC taxi CSV at https://ursa-qa.s3.amazonaws.com/nyctaxi/yellow_tripdata_2010-01.csv.gz you can test with (just …)
I'm at the point now where I dropped the artificial … Mostly because I could "easily" align to what #8886 does, because at the moment I'm only ever using …
Right, we want to be able to chunk the inputs. That's ARROW-9293.
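A minimal sketch of what input chunking could look like (self-contained and hypothetical; the real work in ARROW-9293 would slice the R vector and convert each slice into one chunk of a ChunkedArray):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: split the input into fixed-size chunks and convert
// each independently. Each inner vector stands in for one converted array,
// i.e. one chunk of the eventual ChunkedArray.
std::vector<std::vector<int>> ConvertChunked(const std::vector<int>& input,
                                             std::size_t chunk_size) {
  std::vector<std::vector<int>> chunks;
  for (std::size_t i = 0; i < input.size(); i += chunk_size) {
    std::size_t end = std::min(i + chunk_size, input.size());
    chunks.emplace_back(input.begin() + i, input.begin() + end);
  }
  return chunks;
}
```

Because each slice is converted independently, the slices could also be handed to a thread pool, which is the parallelization opportunity mentioned above.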
That last commit gets us back to parity-ish on character vector performance, though it's a bit noisy.
Running benchmarks on the character vectors with 10 iterations, the performance is basically the same as on master. I haven't tested factors, but that file does test integer, numeric, integer64, and character, with variation in the number of NAs. So something with equivalent performance, but that allows chunking and parallelization by columns (which the old converter does not, IIRC), would in practice have significant benefits.
Now pulled in the commit from #8886 so that this uses …

I believe that for the chunker class to be useful in this case, we would need to act on that comment: …
Oops, I only saw your message now @nealrichardson, after I essentially went ahead with the migration/elimination of code. Can we envision running the same code with this pull request vs. master? Maybe by setting up continuous benchmarking with https://github.com/r-lib/bench#continuous-benchmarking.

I'm arriving at the end of this week's sprint soon; I'll mark this as ready for review and stay vigilant. Otherwise I can roll back today's commits and add the option 😬.
e.g. … This might need to be adapted to …
No worries, we've been working on some other continuous benchmarking tools and can test things out with them. Thanks for pushing this forward.
```cpp
// TODO: this probably should be disabled when x is an ALTREP object
// because MakeSimpleArray below will force materialization
```
Isn't this unavoidable though? We have to get the data into Arrow format somehow.
I guess the one case where you might not want to materialize is if you had an ALTREP object that was backed by an Arrow array (which of course we don't do yet); then you could just grab the Arrow array.
Maybe that's a reach, but perhaps there are cases where this would mean only an Arrow materialization without an R materialization. For example, if the Arrow data type is not representable in R, e.g. uint32.
I've been helping to benchmark these changes against the 3.0 release, and everything I'm seeing is in line: no performance regressions. I've used a handful of our real-world datasets along with some synthetic datasets made up of individual types, and they are all in line with the 3.0 release performance-wise, which is great. I've been adding the benchmarks to arrowbench; they are currently in a PR there in case you're curious about them.

One thing I did notice is that simple feature (sf) columns are having issues that aren't there in the release. Here's a test that exercises the issue (I dug a bit to see if I could find the bug but haven't yet): … The structure of the list column is meant to be minimal, but is based off of the failure I saw with a real sf tibble (see below). The error + traceback is: … A more naturalistic example of this is the following, which works in 3.0 but not on this branch: …
Slightly simpler reprex:

```r
arrow::Array$create(list(structure(list(1), class = c("one"))))
#> Error: Input must be a vector, not a `one` object.
```

Created on 2021-03-01 by the reprex package (v0.3.0)
Which essentially is:

```r
vctrs::vec_size(structure(list(1), class = c("one")))
#> Error: Input must be a vector, not a `one` object.
```

Created on 2021-03-01 by the reprex package (v0.3.0)

as used in the RListConverter:

```cpp
template <typename T>
class RListConverter : public ListConverter<T, RConverter, RConverterTrait> {
 public:
  Status Extend(SEXP x, int64_t size) override {
    RETURN_NOT_OK(this->Reserve(size));
    RVectorType rtype = GetVectorType(x);
    if (rtype != LIST) {
      return Status::Invalid("Cannot convert to list type");
    }
    auto append_value = [this](SEXP value) {
      auto n = vctrs::short_vec_size(value);  // <---------- HERE
      RETURN_NOT_OK(this->list_builder_->ValidateOverflow(n));
      RETURN_NOT_OK(this->list_builder_->Append());
      return this->value_converter_.get()->Extend(value, n);
    };
    auto append_null = [this]() { return this->list_builder_->AppendNull(); };
    return RVectorVisitor<SEXP>::Visit(x, size, append_null, append_value);
  }
};
```
In terms of the …

```r
df_simple <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
vctrs::vec_size(df_simple$geometry)
#> [1] 100
class(df_simple$geometry)
#> [1] "sfc_MULTIPOLYGON" "sfc"
vctrs::vec_size(df_simple$geometry[[1]])
#> Error: Input must be a vector, not a `XY/MULTIPOLYGON/sfg` object.
class(df_simple$geometry[[1]])
#> [1] "XY"           "MULTIPOLYGON" "sfg"
```

Created on 2021-03-01 by the reprex package (v0.3.0)

I'm not sure this is a …
…ast momentarily, to help with sf geometry columns.
I've put a fix in place, at least momentarily.
Ok, I've looked at this more, and it looks like my first reprex was a bit too minimal. Here is another example/test we could/should add to exercise this (I'm happy to push a commit including it to this branch if you would like): …

The fix that you made does fix an error on the … (Line 1014 in fe1c774) … outer "class".

This was extra-fun to debug (and it explains why the sf example above worked) because it looks like sf registers some vctrs methods, which means that these will work so long as those have been registered (i.e. whenever …). I suspect it would be very infrequent for someone to want to round-trip a parquet file including sf data without also having sf loaded (so in practice the current state would be fine), but attempting to debug what's going on is super complicated (and any other non-standard vctrs that use that outer-level style of classing would similarly fail).
nealrichardson left a comment
Thanks @romainfrancois! I'll merge now and make a followup issue for the edge case.
Follow-up is ARROW-11832. Thanks @jonkeane for digging!
```r
library(arrow, warn.conflicts = FALSE)
df <- structure(list(col = structure(list(structure(list(list(structure(1))), class = "inner")), class = "outer")), class = "data.frame")
nrow(df)
#> [1] 0
vctrs::vec_size(df)
#> Error: Corrupt data frame: row.names are missing
Array$create(df)
#> Error: Corrupt data frame: row.names are missing
df <- structure(list(col = structure(list(structure(list(list(structure(1))), class = "inner")), class = "outer")), class = "data.frame", row.names = c(NA, -1L))
nrow(df)
#> [1] 1
vctrs::vec_size(df)
#> [1] 1
Array$create(df)
#> Error: Input must be a vector, not a <outer> object.
```

Created on 2021-04-06 by the reprex package (v0.3.0)