-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-7662: [R] Support creating ListArray from R list #6275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename pull request title in the following format? See also: |
|
It would indicate that it can't find Arrow's C++ source. |
|
Thanks @fsaintjacques . I tried following the R package's README for building the package & can confirm the package works (so arrow) is installed. Maybe it's a version mismatch problem? I know current master is more recent than I'll try building completely from source... |
nealrichardson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! A few suggestions here. Let me know if you want help setting up a dev C++ build--I can advise if you tell me what platform you're on.
r/src/array_from_vector.cpp
Outdated
| R_xlen_t n = XLENGTH(x); | ||
| if (n == 0) | ||
| Rcpp::stop("received length-0 list"); | ||
| std::shared_ptr<arrow::Type> element_type = InferType(VECTOR_ELT(x, 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like you weren't able to compile this locally, so see the results from CI: https://github.com/apache/arrow/pull/6275/checks?check_run_id=406491019#step:5:2868
There are a few things to fix here.
Co-Authored-By: Neal Richardson <neal.p.richardson@gmail.com>
|
Since you're on macOS, try this (from the root of the arrow repository) If you don't have it, you may need to |
|
Thanks @nealrichardson Before reading this I had already begun a "plain" build via instructions here: https://arrow.apache.org/docs/developers/cpp.html And advantage of the flags you cited? Faster build time? |
|
The flags I gave are the "works on my machine" flags. It turns on the features we need for the R package and covers some macOS idiosyncrasies (at least for my machine). YMMV of course but figured it might save you some unnecessary debugging. And the |
|
FWIW my build worked without those flags but I was still getting compilation errors from the R build: Re-built with your flags & now I'm up & running 💪 |
|
Have got the PS I also noticed |
|
I think the StructType issue is a broader limitation of the C++ library that's been on the roadmap for some time: https://issues.apache.org/jira/browse/ARROW-1644 The relevant issue for ListType is still in the R-to-Arrow conversion, so before we even get to the question of writing Parquet (we have to go R to Arrow, then Arrow to Parquet). |
|
Scratch that, table.cpp is probably irrelevant because you can reproduce this error with less overhead: Your current change should let us call that without specifying |
|
it's from array_from_vector.cpp but nothing in there jumped out at me. would need to dive into the macros there but don't have time at the moment |
|
https://github.com/apache/arrow/blob/master/r/src/array_from_vector.cpp#L797 is the error so it looks like there's another missing case statement |
|
Maybe the better approach is to follow how Struct is handled again, so before that switch statement gets called, check for ListType and call MakeListArray (which needs to be written). Like here: https://github.com/apache/arrow/blob/master/r/src/array_from_vector.cpp#L1040 The approach would be similar to MakeStructArray I think, iterate over the list and convert each component to an array of the given type. |
|
A quick note, we don't have support to read Struct data from parquet files yet. |
|
Right @fsaintjacques, that's https://issues.apache.org/jira/browse/ARROW-1644, but we can do List Array right? |
r/src/array_from_vector.cpp
Outdated
| Rcpp::stop("received length-0 list"); | ||
| } | ||
| std::shared_ptr<arrow::DataType> element_type = InferType(VECTOR_ELT(x, 0)); | ||
| for (R_xlen_t i = 1; i < n; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also see this util function arrow::Status CheckCompatibleStruct, would it make more sense to define CheckCompatibleList and do this across-row consistency check there?
r/src/array_from_vector.cpp
Outdated
| auto array_indices = MakeArray(array_indices_data); | ||
|
|
||
| return std::make_shared<ListArray>( | ||
| type, n, offset_buffer, array_indices, null_buffer, null_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to convert the value_buffer into an array, I think the way is with ArrayData::Make but still not quite there
@fsaintjacques we can read structs-of-structs and lists-of-lists but not a mix of the two |
|
@MichaelChirico are you still on this, I started looking into it last week and got sucked into a refactor of some parts of array_from_vector, do you want me to take over? |
|
@fsaintjacques I got stuck as noted, would be happy to go back at it if you have any advice on this part:
either way feel free to push to this branch PS FYI I'll be on vacation the next week |
- Refactor InferArrowType for better readability - Create VectorToArrayConverter class as a basis to use recursive builders. Future refactor can move other conversions, e.g. we could have UnionArray support with heterogeneous lists. - Various other cleanups.
|
@MichaelChirico I rewrote the List/String to use C++'s builder facility. This should allow easier integration with lists of nested types. I encourage you to check the functionality. I saw some low hanging fruits, notably re-writing the factor -> DictionaryArray conversion could benefit from this recursion to support more factor types (it only does utf8() for now). |
nealrichardson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! In addition to the comments, I think it would be good to add an e2e test that does round trip to Parquet with a list array type, as the original issue reported.
c46f031 to
2f28b6e
Compare
|
Failures are unrelated, still have to wait for appveyor... |
|
Rebase should resolve the lint/release changes |
2f28b6e to
a8f26a1
Compare
| } else if (n_factors < INT16_MAX) { | ||
| return arrow::int16(); | ||
| } else { | ||
| return arrow::int32(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the R storage is int32, would it be better to always use int32 (so we could zero-copy the indices)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
nealrichardson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@fsaintjacques I think you mentioned some followup issues; can you link to those Jiras for future reference?
|
Followup ticket https://issues.apache.org/jira/browse/ARROW-7798 |
r/NEWS.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fsaintjacques could you please move this bullet up to the current release (L23) (since this did not get included in 0.16.0)?
Other than that, if you're done here, please merge. Merci!
c0c7637 to
3ebe953
Compare
|
@nealrichardson The failure in https://github.com/apache/arrow/pull/6275/checks?check_run_id=432684028#step:5:1135 seems spurious. I had it locally but went away with a make clean. I'm not gonna merge yet to break the CI. |
r/tests/testthat/test-parquet.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're hitting a misfeature in dplyr so let's work around it. See https://github.com/apache/arrow/blob/master/r/tests/testthat/helper-expectation.R#L26-L37
| expect_identical(df, df_read) | |
| expect_equivalent(df, df_read) |
3ebe953 to
2914c50
Compare
|
Awesome stuff! Thanks so much @nealrichardson and @fsaintjacques for your help & for pushing it over the top with a big refactor 🙇 |
Closes apache#6275 from MichaelChirico/r-parquet-array and squashes the following commits: 2914c50 <Neal Richardson> Merge branch 'master' into r-parquet-array a8f26a1 <François Saint-Jacques> Address comments fc75d86 <François Saint-Jacques> Implement R's vector to arrow::Array conversion a3a406b <Michael Chirico> half step forward 93a0b90 <Michael Chirico> progress (?) by mimicking MakeStringArray 6d172df <Michael Chirico> more intermediate 45663c2 <Michael Chirico> initial attempt at CheckCompatibleList cacd799 <Michael Chirico> skeleton of way forward 9cdd2fe <Michael Chirico> linting 62cc75c <Michael Chirico> linting again 331fa3d <Michael Chirico> linting 761d083 <Michael Chirico> Merge branch 'master' into r-parquet-array 7e1d331 <Michael Chirico> Merge branch 'r-parquet-array' of github.com:MichaelChirico/arrow into r-parquet-array 1de3dc4 <Michael Chirico> fix typos 35ae9c8 <Michael Chirico> Update r/NEWS.md c234788 <Michael Chirico> initial foray into adding list column support for parquet writing Lead-authored-by: Michael Chirico <michaelchirico4@gmail.com> Co-authored-by: Michael Chirico <michael.chirico@grabtaxi.com> Co-authored-by: François Saint-Jacques <fsaintjacques@gmail.com> Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
No description provided.