@MichaelChirico
Contributor

@MichaelChirico MichaelChirico commented Jan 24, 2020

No description provided.

@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

@fsaintjacques
Contributor

fsaintjacques commented Jan 24, 2020

It would indicate that it can't find Arrow's C++ source.

@MichaelChirico MichaelChirico changed the title initial foray into adding list column support for parquet writing ARROW-7662: [R] initial foray into adding list column support for parquet writing Jan 24, 2020
@MichaelChirico
Contributor Author

Thanks @fsaintjacques. I tried following the R package's README for building the package & can confirm the package works (so arrow is installed). Maybe it's a version mismatch problem? I know current master is more recent than 0.14.1, but that's the version the brew install approach found:

Arrow C++ libraries found via pkg-config
PKG_CFLAGS=-DNDEBUG -I/usr/local/Cellar/apache-arrow/0.14.1/include -DARROW_R_WITH_ARROW
PKG_LIBS=-L/usr/local/Cellar/apache-arrow/0.14.1/lib -larrow -lparquet
** libs
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -DNDEBUG -I/usr/local/Cellar/apache-arrow/0.14.1/include -DARROW_R_WITH_ARROW -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/Rcpp/include" -I/usr/local/opt/gettext/include -I/usr/local/opt/llvm/include  -fPIC  -Wall -g -O2  -c array.cpp -o array.o

I'll try building completely from source...

Member

@nealrichardson nealrichardson left a comment

Thanks! A few suggestions here. Let me know if you want help setting up a dev C++ build--I can advise if you tell me what platform you're on.

R_xlen_t n = XLENGTH(x);
if (n == 0)
Rcpp::stop("received length-0 list");
std::shared_ptr<arrow::Type> element_type = InferType(VECTOR_ELT(x, 0));
Member

Sounds like you weren't able to compile this locally, so see the results from CI: https://github.com/apache/arrow/pull/6275/checks?check_run_id=406491019#step:5:2868

There are a few things to fix here.

Co-Authored-By: Neal Richardson <neal.p.richardson@gmail.com>
@nealrichardson
Member

Since you're on macOS, try this (from the root of the arrow repository)

mkdir cpp/build
cd cpp/build
cmake -DARROW_PARQUET=ON -DARROW_INSTALL_NAME_RPATH=OFF -DBOOST_SOURCE=BUNDLED -DARROW_DEPENDENCY_SOURCE=SYSTEM -DARROW_WITH_ZLIB=ON -DARROW_WITH_SNAPPY=ON -DARROW_BUILD_TESTS=OFF -DCMAKE_BUILD_TYPE=debug -DARROW_DATASET=ON -DARROW_EXTRA_ERROR_CONTEXT=ON -DARROW_CSV=ON -DARROW_JSON=ON -DARROW_COMPUTE=ON -DARROW_FILESYSTEM=ON -DRapidJSON_SOURCE=BUNDLED ..
make -j$(sysctl -n hw.ncpu) install

If you don't have it, you may need to brew install cmake first.

@MichaelChirico
Contributor Author

Thanks @nealrichardson

Before reading this I had already begun a "plain" build via instructions here: https://arrow.apache.org/docs/developers/cpp.html

Any advantage to the flags you cited? Faster build time?

@nealrichardson
Member

The flags I gave are the "works on my machine" flags. They turn on the features we need for the R package and cover some macOS idiosyncrasies (at least for my machine). YMMV of course but figured it might save you some unnecessary debugging. And the make -j$(sysctl -n hw.ncpu) will make it faster, yes.

Michael Chirico added 2 commits January 25, 2020 02:28
@MichaelChirico
Contributor Author

FWIW my build worked without those flags but I was still getting compilation errors from the R build:

arrowExports.cpp:1360:21: error: no member named 'SourceFactory' in namespace 'arrow::dataset'
std::shared_ptr<ds::SourceFactory> dataset___FSSFactory__Make2(const std::shared_ptr<fs::FileSystem>& fs, const std::shared_ptr<fs::FileSelector>& selector, co...
                ~~~~^
arrowExports.cpp:1360:133: error: no member named 'FileSelector' in namespace 'arrow::fs'
std::shared_ptr<ds::SourceFactory> dataset___FSSFactory__Make2(const std::shared_ptr<fs::FileSystem>& fs, const std::shared_ptr<fs::FileSelector>& selector, co...
                                                                                                                                ~~~~^
arrowExports.cpp:1360:231: error: no member named 'Partitioning' in namespace 'arrow::dataset'
  ...fs, const std::shared_ptr<fs::FileSelector>& selector, const std::shared_ptr<ds::FileFormat>& format, const std::shared_ptr<ds::Partitioning>& partitioning);
                                                                                                                                 ~~~~^
arrowExports.cpp:1364:58: error: no member named 'FileSelector' in namespace 'arrow::fs'
        Rcpp::traits::input_parameter<const std::shared_ptr<fs::FileSelector>&>::type selector(selector_sexp);
                                                            ~~~~^
arrowExports.cpp:1364:75: error: no type named 'type' in the global namespace
        Rcpp::traits::input_parameter<const std::shared_ptr<fs::FileSelector>&>::type selector(selector_sexp);
                                                                               ~~^
arrowExports.cpp:1366:58: error: no member named 'Partitioning' in namespace 'arrow::dataset'
        Rcpp::traits::input_parameter<const std::shared_ptr<ds::Partitioning>&>::type partitioning(partitioning_sexp);
                                                            ~~~~^
arrowExports.cpp:1366:75: error: no type named 'type' in the global namespace
        Rcpp::traits::input_parameter<const std::shared_ptr<ds::Partitioning>&>::type partitioning(partitioning_sexp);
                                                                               ~~^
arrowExports.cpp:1378:21: error: no member named 'SourceFactory' in namespace 'arrow::dataset'
std::shared_ptr<ds::SourceFactory> dataset___FSSFactory__Make1(const std::shared_ptr<fs::FileSystem>& fs, const std::shared_ptr<fs::FileSelector>& selector, co...
                ~~~~^
arrowExports.cpp:1378:133: error: no member named 'FileSelector' in namespace 'arrow::fs'
std::shared_ptr<ds::SourceFactory> dataset___FSSFactory__Make1(const std::shared_ptr<fs::FileSystem>& fs, const std::shared_ptr<fs::FileSelector>& selector, co...
                                                                                                                                ~~~~^
arrowExports.cpp:1382:58: error: no member named 'FileSelector' in namespace 'arrow::fs'
        Rcpp::traits::input_parameter<const std::shared_ptr<fs::FileSelector>&>::type selector(selector_sexp);
                                                            ~~~~^
arrowExports.cpp:1382:75: error: no type named 'type' in the global namespace
        Rcpp::traits::input_parameter<const std::shared_ptr<fs::FileSelector>&>::type selector(selector_sexp);
                                                                               ~~^
arrowExports.cpp:1395:21: error: no member named 'SourceFactory' in namespace 'arrow::dataset'
std::shared_ptr<ds::SourceFactory> dataset___FSSFactory__Make3(const std::shared_ptr<fs::FileSystem>& fs, const std::shared_ptr<fs::FileSelector>& selector, co...
                ~~~~^
arrowExports.cpp:1395:133: error: no member named 'FileSelector' in namespace 'arrow::fs'
std::shared_ptr<ds::SourceFactory> dataset___FSSFactory__Make3(const std::shared_ptr<fs::FileSystem>& fs, const std::shared_ptr<fs::FileSelector>& selector, co...
                                                                                                                                ~~~~^
arrowExports.cpp:1395:231: error: no member named 'PartitioningFactory' in namespace 'arrow::dataset'
  ...fs, const std::shared_ptr<fs::FileSelector>& selector, const std::shared_ptr<ds::FileFormat>& format, const std::shared_ptr<ds::PartitioningFactory>& factory);
                                                                                                                                 ~~~~^
arrowExports.cpp:1399:58: error: no member named 'FileSelector' in namespace 'arrow::fs'
        Rcpp::traits::input_parameter<const std::shared_ptr<fs::FileSelector>&>::type selector(selector_sexp);
                                                            ~~~~^
arrowExports.cpp:1399:75: error: no type named 'type' in the global namespace
        Rcpp::traits::input_parameter<const std::shared_ptr<fs::FileSelector>&>::type selector(selector_sexp);
                                                                               ~~^
arrowExports.cpp:1401:58: error: no member named 'PartitioningFactory' in namespace 'arrow::dataset'
        Rcpp::traits::input_parameter<const std::shared_ptr<ds::PartitioningFactory>&>::type factory(factory_sexp);
                                                            ~~~~^
arrowExports.cpp:1401:82: error: no type named 'type' in the global namespace
        Rcpp::traits::input_parameter<const std::shared_ptr<ds::PartitioningFactory>&>::type factory(factory_sexp);
                                                                                      ~~^
arrowExports.cpp:1413:21: error: no member named 'ParquetFileFormat' in namespace 'arrow::dataset'
std::shared_ptr<ds::ParquetFileFormat> dataset___ParquetFileFormat__Make();
                ~~~~^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated.

Re-built with your flags & now I'm up & running 💪

@MichaelChirico
Contributor Author

Have got the ListType inference working. Now to the more complicated part of actually writing the object as parquet, we run into the issue you cited earlier:

DF = data.frame(a = 1:10)
DF$b = as.list(DF$a)
arrow::write_parquet(DF, 'test.parquet')
# Error in Table__from_dots(dots, schema) : 
#   NotImplemented: type not implemented

PS I noticed StructType is also not supported for parquet, which seems beyond the scope of this PR:

DF$b = as.data.frame(DF$a)
arrow::write_parquet(DF, 'test.parquet')
Error in parquet___arrow___FileWriter__WriteTable(self, table, chunk_size) : 
  NotImplemented: Level generation for Struct not supported yet
In /Users/michael.chirico/github/arrow/cpp/src/parquet/arrow/writer.cc, line 161, code: VisitInline(array)
In /Users/michael.chirico/github/arrow/cpp/src/parquet/arrow/writer.cc, line 341, code: level_builder.GenerateLevels( data, &values_offset, &num_values, &num_levels, ctx_->def_levels_buffer, &def_levels_buffer, &rep_levels_buffer, &_values_array)
In /Users/michael.chirico/github/arrow/cpp/src/parquet/arrow/writer.cc, line 388, code: Write(*array_to_write)
In /Users/michael.chirico/github/arrow/cpp/src/parquet/arrow/writer.cc, line 471, code: arrow_writer.Write(*data, offset, size)
In /Users/michael.chirico/github/arrow/cpp/src/parquet/arrow/writer.cc, line 497, code: WriteColumnChunk(table.column(i), offset, size)

@nealrichardson
Member

I think the StructType issue is a broader limitation of the C++ library that's been on the roadmap for some time: https://issues.apache.org/jira/browse/ARROW-1644

The relevant issue for ListType is still in the R-to-Arrow conversion, so it comes up before we even get to the question of writing Parquet (we have to go R to Arrow, then Arrow to Parquet). Table__from_dots is in https://github.com/apache/arrow/blob/master/r/src/table.cpp. I'll dig around a little as well to see where that NotImplemented is coming from.

@nealrichardson
Member

Scratch that, table.cpp is probably irrelevant because you can reproduce this error with less overhead:

> lcol <- list(1, 2, 3)
> Array$create(lcol, type = list_of(float64()))
Error in Array__from_vector(x, type) : 
  NotImplemented: type not implemented

Your current change should let us call that without specifying type but now there is missing conversion code.

@MichaelChirico
Contributor Author

it's from array_from_vector.cpp but nothing in there jumped out at me. would need to dive into the macros there but don't have time at the moment

@nealrichardson
Member

https://github.com/apache/arrow/blob/master/r/src/array_from_vector.cpp#L797 is the error so it looks like there's another missing case statement

@nealrichardson
Member

Maybe the better approach is to follow how Struct is handled again, so before that switch statement gets called, check for ListType and call MakeListArray (which needs to be written). Like here: https://github.com/apache/arrow/blob/master/r/src/array_from_vector.cpp#L1040

The approach would be similar to MakeStructArray I think, iterate over the list and convert each component to an array of the given type.
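
A minimal sketch of that iterate-and-convert idea, assuming a list whose components are all vectors of doubles. It uses only the standard library: `FlatList` and `MakeFlatList` are hypothetical names for illustration, not functions from Arrow or the R package. The layout (one flat child values vector plus n + 1 offsets) follows the Arrow list layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for a list array of doubles. Not the Arrow API.
struct FlatList {
  std::vector<int32_t> offsets;  // slot i spans [offsets[i], offsets[i + 1])
  std::vector<double> values;    // all components concatenated
};

// Iterate over the list, append each component to the shared child values,
// and record where each component ends.
FlatList MakeFlatList(const std::vector<std::vector<double>>& components) {
  FlatList out;
  out.offsets.reserve(components.size() + 1);
  out.offsets.push_back(0);
  for (const auto& component : components) {
    out.values.insert(out.values.end(), component.begin(), component.end());
    // 32-bit offsets cap the total number of child values.
    assert(out.values.size() <=
           static_cast<std::size_t>(std::numeric_limits<int32_t>::max()));
    out.offsets.push_back(static_cast<int32_t>(out.values.size()));
  }
  return out;
}
```

For the components (1), (2, 3) and an empty vector, this yields offsets 0, 1, 3, 3 and three child values. A real MakeListArray would also need a validity bitmap for NULL entries, which this sketch leaves out.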

@fsaintjacques
Contributor

A quick note, we don't have support to read Struct data from parquet files yet.

@nealrichardson
Member

Right @fsaintjacques, that's https://issues.apache.org/jira/browse/ARROW-1644, but we can do List Array right?

Rcpp::stop("received length-0 list");
}
std::shared_ptr<arrow::DataType> element_type = InferType(VECTOR_ELT(x, 0));
for (R_xlen_t i = 1; i < n; i++) {
Contributor Author

I also see this util function arrow::Status CheckCompatibleStruct, would it make more sense to define CheckCompatibleList and do this across-row consistency check there?
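
To make the question concrete, here is a rough sketch of what such a check could look like, written against the standard library only. `Kind`, `KindName` and `InferListElementKind` are hypothetical names; the real code would compare the `arrow::DataType` values returned by InferType rather than an enum.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical element kinds standing in for the types InferType returns.
enum class Kind { Int32, Float64, String };

const char* KindName(Kind k) {
  switch (k) {
    case Kind::Int32: return "int32";
    case Kind::Float64: return "float64";
    case Kind::String: return "string";
  }
  assert(false && "unhandled Kind");
  return "unknown";
}

// Returns the common kind of all list elements. Throws if the list is empty
// or if any element disagrees with element 0.
Kind InferListElementKind(const std::vector<Kind>& element_kinds) {
  if (element_kinds.empty()) {
    throw std::invalid_argument("received length-0 list");
  }
  const Kind first = element_kinds[0];
  for (std::size_t i = 1; i < element_kinds.size(); ++i) {
    if (element_kinds[i] != first) {
      throw std::invalid_argument(
          "list element " + std::to_string(i) + " is " +
          KindName(element_kinds[i]) + ", expected " + KindName(first));
    }
  }
  return first;
}
```

Keeping the across-row check in one helper means type inference and the later conversion can share it, which is the appeal of mirroring CheckCompatibleStruct.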

auto array_indices = MakeArray(array_indices_data);

return std::make_shared<ListArray>(
type, n, offset_buffer, array_indices, null_buffer, null_count);
Contributor Author

We need to convert the value_buffer into an array, I think the way is with ArrayData::Make but still not quite there

@wesm
Member

wesm commented Jan 28, 2020

"we don't have support to read Struct data from parquet files yet."

@fsaintjacques we can read structs-of-structs and lists-of-lists but not a mix of the two
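
One way to read that rule, as a standalone sketch: a nested type is a problem only if its tree contains both a list and a struct. Everything here (`Nested`, `TypeNode`, `MixesListAndStruct` and the small constructors) is hypothetical and only models the sentence above. It is not how the Parquet reader decides anything.

```cpp
#include <cassert>
#include <utility>
#include <vector>

enum class Nested { Leaf, List, Struct };

// Hypothetical type tree: a leaf has no children, a list has one,
// a struct has one per field.
struct TypeNode {
  Nested kind;
  std::vector<TypeNode> children;
};

TypeNode Leaf() { return TypeNode{Nested::Leaf, {}}; }

TypeNode ListOf(TypeNode child) {
  TypeNode t{Nested::List, {}};
  t.children.push_back(std::move(child));
  return t;
}

TypeNode StructOf(std::vector<TypeNode> fields) {
  return TypeNode{Nested::Struct, std::move(fields)};
}

// Walk the tree and record which nested kinds appear anywhere in it.
void Collect(const TypeNode& t, bool& has_list, bool& has_struct) {
  assert(t.kind != Nested::Leaf || t.children.empty());
  if (t.kind == Nested::List) has_list = true;
  if (t.kind == Nested::Struct) has_struct = true;
  for (const auto& child : t.children) Collect(child, has_list, has_struct);
}

// True when the type contains both a list and a struct somewhere.
bool MixesListAndStruct(const TypeNode& t) {
  bool has_list = false;
  bool has_struct = false;
  Collect(t, has_list, has_struct);
  return has_list && has_struct;
}
```

Under this reading, a list of lists and a struct of structs both pass, while a list of structs or a struct with a list field is flagged.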

@fsaintjacques
Contributor

@MichaelChirico are you still on this? I started looking into it last week and got sucked into a refactor of some parts of array_from_vector. Do you want me to take over?

@MichaelChirico
Contributor Author

@fsaintjacques I got stuck as noted, would be happy to go back at it if you have any advice on this part:

"We need to convert the value_buffer into an array, I think the way is with ArrayData::Make but still not quite there"

either way feel free to push to this branch

PS FYI I'll be on vacation the next week

- Refactor InferArrowType for better readability
- Create VectorToArrayConverter class as a basis to use recursive
  builders. Future refactor can move other conversions, e.g. we could
  have UnionArray support with heterogeneous lists.
- Various other cleanups.
@fsaintjacques
Contributor

@MichaelChirico I rewrote the List/String conversion to use C++'s builder facility. This should allow easier integration with lists of nested types. I encourage you to check the functionality. I saw some low-hanging fruit; notably, rewriting the factor -> DictionaryArray conversion could benefit from this recursion to support more factor types (it only does utf8() for now).
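
For readers unfamiliar with the builder pattern, here is a toy model of why it makes nesting easier. It is a sketch with the standard library only: `DoubleBuilder` and `ListBuilder` below are hypothetical classes that imitate the shape of Arrow's builders (open a list slot, then append to the child builder), not the Arrow classes themselves.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical leaf builder: collects doubles.
class DoubleBuilder {
 public:
  void Append(double v) { values_.push_back(v); }
  int32_t length() const { return static_cast<int32_t>(values_.size()); }

 private:
  std::vector<double> values_;
};

// Hypothetical list builder. The child can be any builder with length(),
// including another ListBuilder, which is what makes nesting recursive.
template <typename ChildBuilder>
class ListBuilder {
 public:
  // Open a new list slot; later appends to child() belong to it.
  void Append() {
    offsets_.push_back(child_.length());
    validity_.push_back(true);
  }
  // Open an empty slot that is marked as missing.
  void AppendNull() {
    offsets_.push_back(child_.length());
    validity_.push_back(false);
  }
  ChildBuilder& child() { return child_; }
  int32_t length() const { return static_cast<int32_t>(validity_.size()); }
  bool IsValid(int32_t i) const { return validity_[i]; }
  // Offsets with the closing entry added: n + 1 values for n slots.
  std::vector<int32_t> FinishOffsets() const {
    assert(offsets_.empty() || offsets_.back() <= child_.length());
    std::vector<int32_t> out = offsets_;
    out.push_back(child_.length());
    return out;
  }

 private:
  ChildBuilder child_;
  std::vector<int32_t> offsets_;
  std::vector<bool> validity_;
};
```

With this shape, `ListBuilder<ListBuilder<DoubleBuilder>>` handles a list of lists without any new code, which is the kind of recursion the comment above refers to.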

Member

@nealrichardson nealrichardson left a comment

Thanks! In addition to the comments, I think it would be good to add an e2e test that does a round trip to Parquet with a list array type, as the original issue reported.

@nealrichardson nealrichardson changed the title ARROW-7662: [R] initial foray into adding list column support for parquet writing ARROW-7662: [R] Support creating ListArray from R list Feb 6, 2020
@fsaintjacques
Contributor

Failures are unrelated, still have to wait for appveyor...

@nealrichardson
Member

Rebase should resolve the lint/release changes

} else if (n_factors < INT16_MAX) {
return arrow::int16();
} else {
return arrow::int32();
Member

Since the R storage is int32, would it be better to always use int32 (so we could zero-copy the indices)?

Contributor

Good point.
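
The trade-off can be sketched in plain C++ as follows. The function names are hypothetical, the int8 branch is assumed from the visible int16/int32 branches of the diff, and the sketch ignores any index adjustment the real factor conversion may need.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Smallest signed index width, in bits, for n_levels dictionary entries.
// Assumed to mirror the branch structure shown in the diff above.
int SmallestIndexBits(int64_t n_levels) {
  assert(n_levels >= 0);
  if (n_levels < INT8_MAX) return 8;
  if (n_levels < INT16_MAX) return 16;
  return 32;
}

// Narrowing 32-bit codes to 8 bits needs a new buffer and a full pass.
std::vector<int8_t> NarrowToInt8(const std::vector<int32_t>& codes) {
  std::vector<int8_t> out;
  out.reserve(codes.size());
  for (int32_t c : codes) {
    assert(c >= INT8_MIN && c <= INT8_MAX);
    out.push_back(static_cast<int8_t>(c));
  }
  return out;
}

// Keeping 32-bit indices lets the existing storage be used as-is:
// the returned pointer aliases the input, so nothing is copied.
const int32_t* ViewAsInt32(const std::vector<int32_t>& codes) {
  return codes.data();
}
```

So the narrow types save memory in the resulting array at the cost of a copy on conversion, while always choosing int32 skips the copy.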

Member

@nealrichardson nealrichardson left a comment

Thanks!

@fsaintjacques I think you mentioned some followup issues; can you link to those Jiras for future reference?

@fsaintjacques
Contributor

fsaintjacques commented Feb 7, 2020

Followup ticket https://issues.apache.org/jira/browse/ARROW-7798

r/NEWS.md Outdated
Member

@fsaintjacques could you please move this bullet up to the current release (L23) (since this did not get included in 0.16.0)?

Other than that, if you're done here, please merge. Merci!

@fsaintjacques
Contributor

@nealrichardson The failure in https://github.com/apache/arrow/pull/6275/checks?check_run_id=432684028#step:5:1135 seems spurious. I had it locally but it went away with a make clean. I'm not going to merge yet, to avoid breaking the CI.

Member

You're hitting a misfeature in dplyr so let's work around it. See https://github.com/apache/arrow/blob/master/r/tests/testthat/helper-expectation.R#L26-L37

Suggested change
- expect_identical(df, df_read)
+ expect_equivalent(df, df_read)

@MichaelChirico MichaelChirico deleted the r-parquet-array branch February 13, 2020 15:51
@MichaelChirico
Contributor Author

Awesome stuff! Thanks so much @nealrichardson and @fsaintjacques for your help & for pushing it over the top with a big refactor 🙇

nealrichardson added a commit to nealrichardson/arrow that referenced this pull request Feb 13, 2020
Closes apache#6275 from MichaelChirico/r-parquet-array and squashes the following commits:

2914c50 <Neal Richardson> Merge branch 'master' into r-parquet-array
a8f26a1 <François Saint-Jacques> Address comments
fc75d86 <François Saint-Jacques> Implement R's vector to arrow::Array conversion
a3a406b <Michael Chirico> half step forward
93a0b90 <Michael Chirico> progress (?) by mimicking MakeStringArray
6d172df <Michael Chirico> more intermediate
45663c2 <Michael Chirico> initial attempt at CheckCompatibleList
cacd799 <Michael Chirico> skeleton of way forward
9cdd2fe <Michael Chirico> linting
62cc75c <Michael Chirico> linting again
331fa3d <Michael Chirico> linting
761d083 <Michael Chirico> Merge branch 'master' into r-parquet-array
7e1d331 <Michael Chirico> Merge branch 'r-parquet-array' of github.com:MichaelChirico/arrow into r-parquet-array
1de3dc4 <Michael Chirico> fix typos
35ae9c8 <Michael Chirico> Update r/NEWS.md
c234788 <Michael Chirico> initial foray into adding list column support for parquet writing

Lead-authored-by: Michael Chirico <michaelchirico4@gmail.com>
Co-authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Co-authored-by: François Saint-Jacques <fsaintjacques@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>