Skip to content

Conversation

@nealrichardson
Copy link
Member

@nealrichardson nealrichardson commented Apr 11, 2020

This patch does the foundational work to conditionally build the S3 bindings if the C++ library was built with ARROW_S3=ON. It also adds bindings for FileSystemFromUri and enough wiring up in the datasets code so that open_dataset("s3://ursa-labs-taxi-data", partitioning = c("year", "month")) works.

There's lots of other S3FileSystem methods that probably should be implemented and aren't here. Also, calling S3FileSystem$create() segfaults. But FileSystemFromUri works fine.

@github-actions
Copy link

@nealrichardson nealrichardson marked this pull request as ready for review April 13, 2020 19:51
@fsaintjacques fsaintjacques self-requested a review April 16, 2020 17:36
Copy link
Contributor

@fsaintjacques fsaintjacques left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed the segfault. :shipit:

> ds <- open_dataset("file:///home/fsaintjacques/datasets/nyc-tlc/parquet", format="parquet")
> ds %>% filter(total_amount > 1000) %>% select(passenger_count) %>% collect()
# A tibble: 746 x 1
   passenger_count
             <int>
 1               1
 2               1
 3               1
 4               1
 5               1
 6               1
 7               1
 8               1
 9               4
10               1
# … with 736 more rows
> ds_s3 <- open_dataset("s3://123:12345678@nyc-tlc/parquet?scheme=http&endpoint_override=localhost:9000", format="parquet")
> ds_s3 %>% filter(total_amount > 1000) %>% select(passenger_count) %>% collect()
# A tibble: 746 x 1
   passenger_count
             <int>
 1               1
 2               1
 3               1
 4               1
 5               1
 6               1
 7               1
 8               1
 9               1
10               4
# … with 736 more rows

#' @export
S3FileSystem <- R6Class("S3FileSystem", inherit = FileSystem)
S3FileSystem$create <- function() {
fs___EnsureS3Initialized()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you just hide this in fs___S3FileSystem__create instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could, sure. Wasn't sure how many functions I would need to write and whether they'd all need it.

// [[arrow::export]]
Rcpp::List fs___FileSystemFromUri(const std::string& path) {
std::string out_path;
auto file_system = VALUE_OR_STOP(fs::FileSystemFromUri(path, &out_path));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created ticket https://issues.apache.org/jira/browse/ARROW-8488, prefer ValueOrStop instead.

test_that("FileSystem$from_uri", {
skip_on_cran()
skip_if_not_available("s3")
fs_and_path <- FileSystem$from_uri("s3://ursa-labs-taxi-data")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just make sure this works with a non-existing bucket so we don't depend on the presence of ursa-labs-taxi-data bucket.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it work with a bucket that doesn't exist? I actually didn't try.

We already talk about the ursa-labs-taxi-data bucket in r/vignettes/dataset.Rmd, so the presence of the bucket is already "depended on".

pmap_chr(function(name, return_type, args, file, line){

all_decorations <- cpp_decorations()
arrow_exports <- get_exported_functions(all_decorations, c("arrow", "s3"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WIll you want to do this for other component, e.g. json, csv, parquet, dataset...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change we could if we wanted to enable building slimmer versions of the package, but I'm not planning on doing that unless there's a compelling reason to.

@fsaintjacques
Copy link
Contributor

Will wait for CI, then merge.

Copy link
Member Author

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#' @export
S3FileSystem <- R6Class("S3FileSystem", inherit = FileSystem)
S3FileSystem$create <- function() {
fs___EnsureS3Initialized()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could, sure. Wasn't sure how many functions I would need to write and whether they'd all need it.

pmap_chr(function(name, return_type, args, file, line){

all_decorations <- cpp_decorations()
arrow_exports <- get_exported_functions(all_decorations, c("arrow", "s3"))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change we could if we wanted to enable building slimmer versions of the package, but I'm not planning on doing that unless there's a compelling reason to.

test_that("FileSystem$from_uri", {
skip_on_cran()
skip_if_not_available("s3")
fs_and_path <- FileSystem$from_uri("s3://ursa-labs-taxi-data")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it work with a bucket that doesn't exist? I actually didn't try.

We already talk about the ursa-labs-taxi-data bucket in r/vignettes/dataset.Rmd, so the presence of the bucket is already "depended on".

Copy link
Member Author

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the comments, will accept the suggestions and merge

@nealrichardson nealrichardson deleted the r-s3 branch April 16, 2020 23:09
kszucs pushed a commit that referenced this pull request Apr 20, 2020
This patch does the foundational work to conditionally build the S3 bindings if the C++ library was built with ARROW_S3=ON. It also adds bindings for `FileSystemFromUri` and enough wiring up in the datasets code so that `open_dataset("s3://ursa-labs-taxi-data", partitioning = c("year", "month"))` works.

There's lots of other S3FileSystem methods that probably should be implemented and aren't here. Also, calling `S3FileSystem$create()` segfaults. But `FileSystemFromUri` works fine.

Closes #6901 from nealrichardson/r-s3

Lead-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: François Saint-Jacques <fsaintjacques@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants