-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-6439: [R] Implement S3 file-system interface in R #6901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
fsaintjacques
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed the segfault. ![]()
> ds <- open_dataset("file:///home/fsaintjacques/datasets/nyc-tlc/parquet", format="parquet")
> ds %>% filter(total_amount > 1000) %>% select(passenger_count) %>% collect()
# A tibble: 746 x 1
passenger_count
<int>
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 4
10 1
# … with 736 more rows
> ds_s3 <- open_dataset("s3://123:12345678@nyc-tlc/parquet?scheme=http&endpoint_override=localhost:9000", format="parquet")
> ds_s3 %>% filter(total_amount > 1000) %>% select(passenger_count) %>% collect()
# A tibble: 746 x 1
passenger_count
<int>
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 4
# … with 736 more rows| #' @export | ||
| S3FileSystem <- R6Class("S3FileSystem", inherit = FileSystem) | ||
| S3FileSystem$create <- function() { | ||
| fs___EnsureS3Initialized() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you just hide this in fs___S3FileSystem__create instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could, sure. Wasn't sure how many functions I would need to write and whether they'd all need it.
| // [[arrow::export]] | ||
| Rcpp::List fs___FileSystemFromUri(const std::string& path) { | ||
| std::string out_path; | ||
| auto file_system = VALUE_OR_STOP(fs::FileSystemFromUri(path, &out_path)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created ticket https://issues.apache.org/jira/browse/ARROW-8488, prefer ValueOrStop instead.
| test_that("FileSystem$from_uri", { | ||
| skip_on_cran() | ||
| skip_if_not_available("s3") | ||
| fs_and_path <- FileSystem$from_uri("s3://ursa-labs-taxi-data") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would just make sure this works with a non-existing bucket so we don't depend on the presence of ursa-labs-taxi-data bucket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it work with a bucket that doesn't exist? I actually didn't try.
We already talk about the ursa-labs-taxi-data bucket in r/vignettes/dataset.Rmd, so the presence of the bucket is already "depended on".
| pmap_chr(function(name, return_type, args, file, line){ | ||
|
|
||
| all_decorations <- cpp_decorations() | ||
| arrow_exports <- get_exported_functions(all_decorations, c("arrow", "s3")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WIll you want to do this for other component, e.g. json, csv, parquet, dataset...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change we could if we wanted to enable building slimmer versions of the package, but I'm not planning on doing that unless there's a compelling reason to.
|
Will wait for CI, then merge. |
nealrichardson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fsaintjacques
| #' @export | ||
| S3FileSystem <- R6Class("S3FileSystem", inherit = FileSystem) | ||
| S3FileSystem$create <- function() { | ||
| fs___EnsureS3Initialized() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could, sure. Wasn't sure how many functions I would need to write and whether they'd all need it.
| pmap_chr(function(name, return_type, args, file, line){ | ||
|
|
||
| all_decorations <- cpp_decorations() | ||
| arrow_exports <- get_exported_functions(all_decorations, c("arrow", "s3")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change we could if we wanted to enable building slimmer versions of the package, but I'm not planning on doing that unless there's a compelling reason to.
| test_that("FileSystem$from_uri", { | ||
| skip_on_cran() | ||
| skip_if_not_available("s3") | ||
| fs_and_path <- FileSystem$from_uri("s3://ursa-labs-taxi-data") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it work with a bucket that doesn't exist? I actually didn't try.
We already talk about the ursa-labs-taxi-data bucket in r/vignettes/dataset.Rmd, so the presence of the bucket is already "depended on".
nealrichardson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the comments, will accept the suggestions and merge
This patch does the foundational work to conditionally build the S3 bindings if the C++ library was built with ARROW_S3=ON. It also adds bindings for `FileSystemFromUri` and enough wiring up in the datasets code so that `open_dataset("s3://ursa-labs-taxi-data", partitioning = c("year", "month"))` works.
There's lots of other S3FileSystem methods that probably should be implemented and aren't here. Also, calling `S3FileSystem$create()` segfaults. But `FileSystemFromUri` works fine.
Closes #6901 from nealrichardson/r-s3
Lead-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: François Saint-Jacques <fsaintjacques@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
This patch does the foundational work to conditionally build the S3 bindings if the C++ library was built with ARROW_S3=ON. It also adds bindings for
FileSystemFromUriand enough wiring up in the datasets code so thatopen_dataset("s3://ursa-labs-taxi-data", partitioning = c("year", "month"))works.There's lots of other S3FileSystem methods that probably should be implemented and aren't here. Also, calling
S3FileSystem$create()segfaults. ButFileSystemFromUriworks fine.