From 0d1d538f3c92c0787bc9eee0de289994580c1c6e Mon Sep 17 00:00:00 2001 From: Will Jones Date: Fri, 1 Jul 2022 08:52:36 -0700 Subject: [PATCH 01/13] Start drafting todos --- r/vignettes/fs.Rmd | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/r/vignettes/fs.Rmd b/r/vignettes/fs.Rmd index a0c92bb6be2..32dbf89e7b1 100644 --- a/r/vignettes/fs.Rmd +++ b/r/vignettes/fs.Rmd @@ -10,10 +10,13 @@ vignette: > The Arrow C++ library includes a generic filesystem interface and specific implementations for some cloud storage systems. This setup allows various parts of the project to be able to read and write data with different storage -backends. In the `arrow` R package, support has been enabled for AWS S3. -This vignette provides an overview of working with S3 data using Arrow. +backends. In the `arrow` R package, support has been enabled for AWS S3 and +Google Cloud Storage (GCS). This vignette provides an overview of working with +S3 and GCS data using Arrow. -> In Windows and macOS binary packages, S3 support is included. On Linux when + + +> In Windows and macOS binary packages, S3 and GCS support are included. On Linux when installing from source, S3 support is not enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` for details. @@ -31,6 +34,8 @@ s3://[access_key:secret_key@]bucket/path[?region=] For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at + + ``` s3://ursa-labs-taxi-data/2019/06/data.parquet ``` @@ -95,6 +100,8 @@ june2019 <- SubTreeFileSystem$create("s3://ursa-labs-taxi-data/2019/06") ## Authentication +### S3 Authentication + To access private S3 buckets, you need typically need two secret parameters: a `access_key`, which is like a user id, and `secret_key`, which is like a token or password. There are a few options for passing these credentials: @@ -110,6 +117,10 @@ or password. There are a few options for passing these credentials: - Use an [AccessRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) for temporary access by passing the `role_arn` identifier to `S3FileSystem$create()` or `s3_bucket()`. 
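As a concrete sketch of the option above of passing `access_key` and `secret_key` directly to `s3_bucket()` — the bucket name and key values below are placeholders, not real credentials:

```r
library(arrow)

# Hypothetical private bucket with placeholder keys -- substitute your own values.
bucket <- s3_bucket(
  "my-private-bucket",
  access_key = "AKIAXXXXXXXXXXXXXXXX",
  secret_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
)

# Read a file relative to the bucket, as elsewhere in this vignette.
df <- read_parquet(bucket$path("path/to/data.parquet"))
```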
+### GCS Authentication + + + ## Using a proxy server If you need to use a proxy server to connect to an S3 bucket, you can provide From 79281c3327504192d3d87147755aaf76d5dd730b Mon Sep 17 00:00:00 2001 From: Will Jones Date: Wed, 13 Jul 2022 15:27:51 -0700 Subject: [PATCH 02/13] chore: replace ursa bucket with voltrondata-labs-datasets --- cpp/src/arrow/filesystem/s3fs_test.cc | 2 +- docs/source/python/dataset.rst | 8 +++--- python/pyarrow/_s3fs.pyx | 4 +-- python/pyarrow/tests/test_fs.py | 10 ++++---- r/R/filesystem.R | 2 +- r/man/s3_bucket.Rd | 2 +- r/tests/testthat/test-filesystem.R | 6 ++--- r/vignettes/dataset.Rmd | 36 ++++++++++++++++++--------- 8 files changed, 40 insertions(+), 30 deletions(-) diff --git a/cpp/src/arrow/filesystem/s3fs_test.cc b/cpp/src/arrow/filesystem/s3fs_test.cc index 7216af297a0..1d89e2da711 100644 --- a/cpp/src/arrow/filesystem/s3fs_test.cc +++ b/cpp/src/arrow/filesystem/s3fs_test.cc @@ -322,7 +322,7 @@ TEST_F(S3OptionsTest, FromAssumeRole) { class S3RegionResolutionTest : public AwsTestMixin {}; TEST_F(S3RegionResolutionTest, PublicBucket) { - ASSERT_OK_AND_EQ("us-east-2", ResolveS3BucketRegion("ursa-labs-taxi-data")); + ASSERT_OK_AND_EQ("us-east-2", ResolveS3BucketRegion("voltrondata-labs-datasets")); // Taken from a registry of open S3-hosted datasets // at https://github.com/awslabs/open-data-registry diff --git a/docs/source/python/dataset.rst b/docs/source/python/dataset.rst index 4808457355d..2ac592d8d0c 100644 --- a/docs/source/python/dataset.rst +++ b/docs/source/python/dataset.rst @@ -355,7 +355,7 @@ specifying a S3 path: .. code-block:: python - dataset = ds.dataset("s3://ursa-labs-taxi-data/", partitioning=["year", "month"]) + dataset = ds.dataset("s3://voltrondata-labs-datasets/nyc-taxi/") Typically, you will want to customize the connection parameters, and then a file system object can be created and passed to the ``filesystem`` keyword: @@ -365,8 +365,7 @@ a file system object can be created and passed to the ``filesystem`` keyword: from pyarrow import fs s3 = fs.S3FileSystem(region="us-east-2") - dataset = ds.dataset("ursa-labs-taxi-data/", filesystem=s3, - partitioning=["year", "month"]) + dataset = ds.dataset("voltrondata-labs-datasets/nyc-taxi/", filesystem=s3) The currently available classes are :class:`~pyarrow.fs.S3FileSystem` and :class:`~pyarrow.fs.HadoopFileSystem`. See the :ref:`filesystem` docs for more @@ -387,8 +386,7 @@ useful for testing or benchmarking. # By default, MinIO will listen for unencrypted HTTP traffic. 
minio = fs.S3FileSystem(scheme="http", endpoint_override="localhost:9000") - dataset = ds.dataset("ursa-labs-taxi-data/", filesystem=minio, - partitioning=["year", "month"]) + dataset = ds.dataset("voltrondata-labs-datasets/nyc-taxi/", filesystem=minio) Working with Parquet Datasets diff --git a/python/pyarrow/_s3fs.pyx b/python/pyarrow/_s3fs.pyx index d9335995dc2..f668038e623 100644 --- a/python/pyarrow/_s3fs.pyx +++ b/python/pyarrow/_s3fs.pyx @@ -74,8 +74,8 @@ def resolve_s3_region(bucket): Examples -------- - >>> fs.resolve_s3_region('registry.opendata.aws') - 'us-east-1' + >>> fs.resolve_s3_region('voltrondata-labs-datasets') + 'us-east-2' """ cdef: c_string c_bucket diff --git a/python/pyarrow/tests/test_fs.py b/python/pyarrow/tests/test_fs.py index 41c242ff83b..e1b4604bbd2 100644 --- a/python/pyarrow/tests/test_fs.py +++ b/python/pyarrow/tests/test_fs.py @@ -1616,15 +1616,15 @@ def test_s3_real_aws(): assert fs.region == default_region fs = S3FileSystem(anonymous=True, region='us-east-2') - entries = fs.get_file_info(FileSelector('ursa-labs-taxi-data')) + entries = fs.get_file_info(FileSelector('voltrondata-labs-datasets/nyc-taxi')) assert len(entries) > 0 - with fs.open_input_stream('ursa-labs-taxi-data/2019/06/data.parquet') as f: + with fs.open_input_stream('voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet') as f: md = f.metadata() assert 'Content-Type' in md - assert md['Last-Modified'] == b'2020-01-17T16:26:28Z' + assert md['Last-Modified'] == b'2022-07-12T23:32:00Z' # For some reason, the header value is quoted # (both with AWS and Minio) - assert md['ETag'] == b'"f1efd5d76cb82861e1542117bfa52b90-8"' + assert md['ETag'] == b'"4c6a76826a695c6ac61592bc30cda3df-16"' @pytest.mark.s3 @@ -1653,7 +1653,7 @@ def test_s3_real_aws_region_selection(): @pytest.mark.s3 def test_resolve_s3_region(): from pyarrow.fs import resolve_s3_region - assert resolve_s3_region('ursa-labs-taxi-data') == 'us-east-2' + assert resolve_s3_region('voltrondata-labs-datasets') == 'us-east-2' assert resolve_s3_region('mf-nwp-models') == 'eu-west-1' with pytest.raises(ValueError, match="Not a valid bucket name"): diff --git a/r/R/filesystem.R b/r/R/filesystem.R index 3cebbc30c85..b99dd633b82 100644 --- a/r/R/filesystem.R +++ b/r/R/filesystem.R @@ -426,7 +426,7 @@ default_s3_options <- list( #' relative path. Note that this function's success does not guarantee that you #' are authorized to access the bucket's contents. #' @examplesIf FALSE -#' bucket <- s3_bucket("ursa-labs-taxi-data") +#' bucket <- s3_bucket("voltrondata-labs-datasets") #' @export s3_bucket <- function(bucket, ...) { assert_that(is.string(bucket)) diff --git a/r/man/s3_bucket.Rd b/r/man/s3_bucket.Rd index 7baeb49b698..2ab7d4962ed 100644 --- a/r/man/s3_bucket.Rd +++ b/r/man/s3_bucket.Rd @@ -23,6 +23,6 @@ relative path. 
} \examples{ \dontshow{if (FALSE) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf} -bucket <- s3_bucket("ursa-labs-taxi-data") +bucket <- s3_bucket("voltrondata-labs-datasets") \dontshow{\}) # examplesIf} } diff --git a/r/tests/testthat/test-filesystem.R b/r/tests/testthat/test-filesystem.R index 1852634ac99..fdca3d9b420 100644 --- a/r/tests/testthat/test-filesystem.R +++ b/r/tests/testthat/test-filesystem.R @@ -147,7 +147,7 @@ test_that("FileSystem$from_uri", { skip_on_cran() skip_if_not_available("s3") skip_if_offline() - fs_and_path <- FileSystem$from_uri("s3://ursa-labs-taxi-data") + fs_and_path <- FileSystem$from_uri("s3://voltrondata-labs-datasets") expect_r6_class(fs_and_path$fs, "S3FileSystem") expect_identical(fs_and_path$fs$region, "us-east-2") }) @@ -156,11 +156,11 @@ test_that("SubTreeFileSystem$create() with URI", { skip_on_cran() skip_if_not_available("s3") skip_if_offline() - fs <- SubTreeFileSystem$create("s3://ursa-labs-taxi-data") + fs <- SubTreeFileSystem$create("s3://voltrondata-labs-datasets") expect_r6_class(fs, "SubTreeFileSystem") expect_identical( capture.output(print(fs)), - "SubTreeFileSystem: s3://ursa-labs-taxi-data/" + "SubTreeFileSystem: s3://voltrondata-labs-datasets/" ) }) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 5c430c4be0d..e78f6851c3e 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -44,7 +44,9 @@ directory. If your arrow build has S3 support, you can sync the data locally with: ```{r, eval = FALSE} -arrow::copy_files("s3://ursa-labs-taxi-data", "nyc-taxi") +arrow::copy_files("s3://voltrondata-labs-datasets/nyc-taxi", "nyc-taxi") +# Alternatively, with GCS: +# arrow::copy_files("gs://voltrondata-labs-datasets/nyc-taxi", "nyc-taxi") ``` If your arrow build doesn't have S3 support, you can download the files @@ -53,7 +55,7 @@ you may need to increase R's download timeout from the default of 60 seconds, e. `options(timeout = 300)`. ```{r, eval = FALSE} -bucket <- "https://ursa-labs-taxi-data.s3.us-east-2.amazonaws.com" +bucket <- "https://voltrondata-labs-datasets.s3.us-east-2.amazonaws.com" for (year in 2009:2019) { if (year == 2019) { # We only have through June 2019 there @@ -64,8 +66,8 @@ for (year in 2009:2019) { for (month in sprintf("%02d", months)) { dir.create(file.path("nyc-taxi", year, month), recursive = TRUE) try(download.file( - paste(bucket, year, month, "data.parquet", sep = "/"), - file.path("nyc-taxi", year, month, "data.parquet"), + paste(bucket, "nyc-taxi", paste0("year=", year), paste0("month=", month), "data.parquet", sep = "/"), + file.path("nyc-taxi", paste0("year=", year), paste0("month=", month), "data.parquet"), mode = "wb" ), silent = TRUE) } @@ -99,7 +101,7 @@ library(dplyr, warn.conflicts = FALSE) The first step is to create a Dataset object, pointing at the directory of data. ```{r, eval = file.exists("nyc-taxi")} -ds <- open_dataset("nyc-taxi", partitioning = c("year", "month")) +ds <- open_dataset("nyc-taxi") ``` The file format for `open_dataset()` is controlled by the `format` parameter, @@ -122,9 +124,18 @@ For text files, you can pass the following parsing options to `open_dataset()`: For more information on the usage of these parameters, see `?read_delim_arrow()`. -The `partitioning` argument lets you specify how the file paths provide information -about how the dataset is chunked into different files. 
The files in this example -have file paths like +`open_dataset()` was able to automatically infer column values for `year` and `month` +--which are not present in the data files--based on the directory structure. The +Hive-style partitioning structure is self-describing, with file paths like + +``` +year=2009/month=1/data.parquet +year=2009/month=2/data.parquet +... +``` + +But sometimes the directory partitioning isn't self describing; that is, it doesn't +contain field names. For example, if instead we had file paths like ``` 2009/01/data.parquet @@ -132,12 +143,13 @@ have file paths like ... ``` -By providing `c("year", "month")` to the `partitioning` argument, you're saying that the first -path segment gives the value for `year`, and the second segment is `month`. -Every row in `2009/01/data.parquet` has a value of 2009 for `year` +then `open_dataset()` would need some hints as to how to use the file paths. In this +case, you could provide `c("year", "month")` to the `partitioning` argument, +saying that the first path segment gives the value for `year`, and the second +segment is `month`. Every row in `2009/01/data.parquet` has a value of 2009 for `year` and 1 for `month`, even though those columns may not be present in the file. -Indeed, when you look at the dataset, you can see that in addition to the columns present +In either case, when you look at the dataset, you can see that in addition to the columns present in every file, there are also columns `year` and `month` even though they are not present in the files themselves. ```{r, eval = file.exists("nyc-taxi")} From 91d3211ee9f245f8bb4b491eb57b59ac684b20a4 Mon Sep 17 00:00:00 2001 From: Will Jones Date: Thu, 14 Jul 2022 10:01:13 -0700 Subject: [PATCH 03/13] feat: add gs_bucket() function --- r/NAMESPACE | 1 + r/R/filesystem.R | 22 +++++++++ r/man/ArrayData.Rd | 6 +-- r/man/Scalar.Rd | 6 +-- r/man/array.Rd | 6 +-- r/man/arrow-package.Rd | 2 +- r/man/gs_bucket.Rd | 27 +++++++++++ r/tests/testthat/test-filesystem.R | 15 +++++++ r/vignettes/fs.Rmd | 72 ++++++++++++++++++++++-------- 9 files changed, 125 insertions(+), 32 deletions(-) create mode 100644 r/man/gs_bucket.Rd diff --git a/r/NAMESPACE b/r/NAMESPACE index 750a815f9ff..0fa23fd01e9 100644 --- a/r/NAMESPACE +++ b/r/NAMESPACE @@ -301,6 +301,7 @@ export(float) export(float16) export(float32) export(float64) +export(gs_bucket) export(halffloat) export(hive_partition) export(infer_type) diff --git a/r/R/filesystem.R b/r/R/filesystem.R index b99dd633b82..1339818e2f9 100644 --- a/r/R/filesystem.R +++ b/r/R/filesystem.R @@ -448,6 +448,28 @@ s3_bucket <- function(bucket, ...) { SubTreeFileSystem$create(fs_and_path$path, fs) } +#' Connect to an Google Storage Service (GCS) bucket +#' +#' `gs_bucket()` is a convenience function to create an `GcsFileSystem` object +#' that holds onto its relative path +#' +#' @param bucket string GCS bucket name or path +#' @param ... Additional connection options, passed to `GcsFileSystem$create()` +#' @return A `SubTreeFileSystem` containing an `GcsFileSystem` and the bucket's +#' relative path. Note that this function's success does not guarantee that you +#' are authorized to access the bucket's contents. +#' @examplesIf FALSE +#' bucket <- gs_bucket("voltrondata-labs-datasets") +#' @export +gs_bucket <- function(bucket, ...) { + assert_that(is.string(bucket)) + args <- list2(...) 
+ + fs <- exec(Gcs3FileSystem$create, !!!args) + + SubTreeFileSystem(bucket, fs) +} + #' @usage NULL #' @format NULL #' @rdname FileSystem diff --git a/r/man/ArrayData.Rd b/r/man/ArrayData.Rd index 2e27c6cfca5..383ab317d1e 100644 --- a/r/man/ArrayData.Rd +++ b/r/man/ArrayData.Rd @@ -9,16 +9,14 @@ The \code{ArrayData} class allows you to get and inspect the data inside an \code{arrow::Array}. } \section{Usage}{ - - -\if{html}{\out{
}}\preformatted{data <- Array$create(x)$data() +\preformatted{data <- Array$create(x)$data() data$type data$length data$null_count data$offset data$buffers -}\if{html}{\out{
}} +} } \section{Methods}{ diff --git a/r/man/Scalar.Rd b/r/man/Scalar.Rd index e9eac70776b..d814c623372 100644 --- a/r/man/Scalar.Rd +++ b/r/man/Scalar.Rd @@ -17,14 +17,12 @@ The \code{Scalar$create()} factory method instantiates a \code{Scalar} and takes } \section{Usage}{ - - -\if{html}{\out{
}}\preformatted{a <- Scalar$create(x) +\preformatted{a <- Scalar$create(x) length(a) print(a) a == a -}\if{html}{\out{
}} +} } \section{Methods}{ diff --git a/r/man/array.Rd b/r/man/array.Rd index 5a4bc40d95e..371c53ac87a 100644 --- a/r/man/array.Rd +++ b/r/man/array.Rd @@ -41,14 +41,12 @@ but not limited to strings only) } \section{Usage}{ - - -\if{html}{\out{
}}\preformatted{a <- Array$create(x) +\preformatted{a <- Array$create(x) length(a) print(a) a == a -}\if{html}{\out{
}} +} } \section{Methods}{ diff --git a/r/man/arrow-package.Rd b/r/man/arrow-package.Rd index e1b6808f6bf..2a0143d02e5 100644 --- a/r/man/arrow-package.Rd +++ b/r/man/arrow-package.Rd @@ -6,7 +6,7 @@ \alias{arrow-package} \title{arrow: Integration to 'Apache' 'Arrow'} \description{ -'Apache' 'Arrow' \url{https://arrow.apache.org/} is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. This package provides an interface to the 'Arrow C++' library. +'Apache' 'Arrow' is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. This package provides an interface to the 'Arrow C++' library. } \seealso{ Useful links: diff --git a/r/man/gs_bucket.Rd b/r/man/gs_bucket.Rd new file mode 100644 index 00000000000..d45626a6beb --- /dev/null +++ b/r/man/gs_bucket.Rd @@ -0,0 +1,27 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/filesystem.R +\name{gs_bucket} +\alias{gs_bucket} +\title{Connect to an Google Storage Service (GCS) bucket} +\usage{ +gs_bucket(bucket, ...) +} +\arguments{ +\item{bucket}{string GCS bucket name or path} + +\item{...}{Additional connection options, passed to \code{GcsFileSystem$create()}} +} +\value{ +A \code{SubTreeFileSystem} containing an \code{GcsFileSystem} and the bucket's +relative path. Note that this function's success does not guarantee that you +are authorized to access the bucket's contents. +} +\description{ +\code{gs_bucket()} is a convenience function to create an \code{GcsFileSystem} object +that holds onto its relative path +} +\examples{ +\dontshow{if (FALSE) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf} +bucket <- gs_bucket("voltrondata-labs-datasets") +\dontshow{\}) # examplesIf} +} diff --git a/r/tests/testthat/test-filesystem.R b/r/tests/testthat/test-filesystem.R index fdca3d9b420..ee1dd890787 100644 --- a/r/tests/testthat/test-filesystem.R +++ b/r/tests/testthat/test-filesystem.R @@ -190,3 +190,18 @@ test_that("s3_bucket", { skip_on_os("windows") # FIXME expect_identical(bucket$base_path, "ursa-labs-r-test/") }) + +test_that("gs_bucket", { + skip_on_cran() + skip_if_not_available("gcs") + skip_if_offline() + bucket <- s3_bucket("voltrondata-labs-datasets") + expect_r6_class(bucket, "SubTreeFileSystem") + expect_r6_class(bucket$base_fs, "GcsFileSystem") + expect_identical( + capture.output(print(bucket)), + "SubTreeFileSystem: gs://voltrondata-labs-datasets/" + ) + skip_on_os("windows") # FIXME + expect_identical(bucket$base_path, "voltrondata-labs-datasets/") +}) diff --git a/r/vignettes/fs.Rmd b/r/vignettes/fs.Rmd index 32dbf89e7b1..e3322c1612d 100644 --- a/r/vignettes/fs.Rmd +++ b/r/vignettes/fs.Rmd @@ -1,8 +1,8 @@ --- -title: "Working with Cloud Storage (S3)" +title: "Working with Cloud Storage (S3, GCS)" output: rmarkdown::html_vignette vignette: > - %\VignetteIndexEntry{Working with Cloud Storage (S3)} + %\VignetteIndexEntry{Working with Cloud Storage (S3, GCS)} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- @@ -14,8 +14,6 @@ backends. In the `arrow` R package, support has been enabled for AWS S3 and Google Cloud Storage (GCS). This vignette provides an overview of working with S3 and GCS data using Arrow. 
- - > In Windows and macOS binary packages, S3 and GCS support are included. On Linux when installing from source, S3 support is not enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` @@ -32,30 +30,64 @@ An S3 URI looks like: s3://[access_key:secret_key@]bucket/path[?region=] ``` -For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at +A GCS URI looks like: + +``` +gs://[access_key:secret_key@]bucket/path[?region=] +gs://anonymous@bucket/path[?region=] +``` - +For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at ``` -s3://ursa-labs-taxi-data/2019/06/data.parquet +s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet +# Or in GCS: +# gs://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet ``` + + Given this URI, you can pass it to `read_parquet()` just as if it were a local file path: ```r -df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet") +df <- read_parquet("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") +# Or in GCS: +# df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") ``` Note that this will be slower to read than if the file were local, though if you're running on a machine in the same AWS region as the file in S3, the cost of reading the data over the network should be much lower. +### URI options + +URIs accept additional options in the query parameters (the part after the `?`) +that are passed down to configure the underlying file system. They are separated +by `&`. For example, + +``` +s3://voltrondata-labs-datasets/nyc-taxi/?endpoint_override=https://storage.googleapis.com&allow_bucket_creation=true +``` + +tells the `S3FileSystem` that it should allow the creation of new buckets and to +talk to Google Storage instead of S3. The latter is a neat trick that exists +because GCS implements an S3-compatible API, but for better support for GCS use +the GCSFileSystem with `gs://`. + +In GCS, a useful option is `retry_limit_seconds`, which sets the number of seconds +a request may spend retrying before returning an error. The current default is +15 minutes, so in many interactive contexts it's nice to set a lower value: + +``` +gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=10 +``` + ## Creating a FileSystem object -Another way to connect to S3 is to create a `FileSystem` object once and pass -that to the read/write functions. -`S3FileSystem` objects can be created with the `s3_bucket()` function, which -automatically detects the bucket's AWS region. Additionally, the resulting +Instead of configuring filesystems through URIs, you can create `FileSystem` +objects. `S3FileSystem` objects can be created with the `s3_bucket()` function, which +automatically detects the bucket's AWS region. Similarly, `GcsFileSystem` objects +can be created with the `gs_bucket()` function. Additionally, the resulting `FileSystem` will consider paths relative to the bucket's path (so for example you don't need to prefix the bucket path when listing a directory). 
This may be convenient when dealing with @@ -66,19 +98,21 @@ With a `FileSystem` object, you can point to specific files in it with the `$pat In the previous example, this would look like: ```r -bucket <- s3_bucket("ursa-labs-taxi-data") -df <- read_parquet(bucket$path("2019/06/data.parquet")) +bucket <- s3_bucket("voltrondata-labs-datasets") +df <- read_parquet(bucket$path("nyc-taxi/year=2019/month=6/data.parquet")) ``` You can list the files and/or directories in an S3 bucket or subdirectory using the `$ls()` method: ```r -bucket$ls() +bucket$ls("nyc-taxi") ``` -See `help(FileSystem)` for a list of options that `s3_bucket()` and `S3FileSystem$create()` -can take. `region`, `scheme`, and `endpoint_override` can be encoded as query +See `help(FileSystem)` for a list of options that `s3_bucket()`/`S3FileSystem$create()` +and `gs_bucket()`/`GcsFileSystem$create()` can take. + +For S3, only the following options can be included in t`region`, `scheme`, `endpoint_override` can be encoded as query parameters in the URI (though `region` will be auto-detected in `s3_bucket()` or from the URI if omitted). `access_key` and `secret_key` can also be included, but other options are not supported in the URI. @@ -95,7 +129,7 @@ df <- read_parquet(june2019$path("data.parquet")) `SubTreeFileSystem` can also be made from a URI: ```r -june2019 <- SubTreeFileSystem$create("s3://ursa-labs-taxi-data/2019/06") +june2019 <- SubTreeFileSystem$create("s3://voltrondata-labs-datasets/nyc-taxi/2019/06") ``` ## Authentication @@ -128,7 +162,7 @@ a URI in the form `http://user:password@host:port` to `proxy_options`. For example, a local proxy server running on port 1316 can be used like this: ```r -bucket <- s3_bucket("ursa-labs-taxi-data", proxy_options = "http://localhost:1316") +bucket <- s3_bucket("voltrondata-labs-datasets", proxy_options = "http://localhost:1316") ``` ## File systems that emulate S3 From 5725e8c19bcd0993968bc43014e69e37d3ab860e Mon Sep 17 00:00:00 2001 From: Will Jones Date: Fri, 15 Jul 2022 10:11:29 -0700 Subject: [PATCH 04/13] fix: address failures in gs_bucket() --- r/R/filesystem.R | 4 ++-- r/tests/testthat/test-filesystem.R | 2 +- r/vignettes/fs.Rmd | 5 +++++ 3 files changed, 8 insertions(+), 3 deletions(-) diff --git a/r/R/filesystem.R b/r/R/filesystem.R index 1339818e2f9..0df5608a0cb 100644 --- a/r/R/filesystem.R +++ b/r/R/filesystem.R @@ -465,9 +465,9 @@ gs_bucket <- function(bucket, ...) { assert_that(is.string(bucket)) args <- list2(...) 
- fs <- exec(Gcs3FileSystem$create, !!!args) + fs <- exec(GcsFileSystem$create, !!!args) - SubTreeFileSystem(bucket, fs) + SubTreeFileSystem$create(bucket, fs) } #' @usage NULL diff --git a/r/tests/testthat/test-filesystem.R b/r/tests/testthat/test-filesystem.R index ee1dd890787..a4ac45a47dd 100644 --- a/r/tests/testthat/test-filesystem.R +++ b/r/tests/testthat/test-filesystem.R @@ -195,7 +195,7 @@ test_that("gs_bucket", { skip_on_cran() skip_if_not_available("gcs") skip_if_offline() - bucket <- s3_bucket("voltrondata-labs-datasets") + bucket <- gs_bucket("voltrondata-labs-datasets") expect_r6_class(bucket, "SubTreeFileSystem") expect_r6_class(bucket$base_fs, "GcsFileSystem") expect_identical( diff --git a/r/vignettes/fs.Rmd b/r/vignettes/fs.Rmd index e3322c1612d..cd2816d0fa0 100644 --- a/r/vignettes/fs.Rmd +++ b/r/vignettes/fs.Rmd @@ -99,6 +99,8 @@ In the previous example, this would look like: ```r bucket <- s3_bucket("voltrondata-labs-datasets") +# Or in GCS (anonymous = TRUE is required): +# bucket <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE) df <- read_parquet(bucket$path("nyc-taxi/year=2019/month=6/data.parquet")) ``` @@ -107,6 +109,9 @@ the `$ls()` method: ```r bucket$ls("nyc-taxi") +# Or recursive: +# bucket$ls("nyc-taxi", recursive = TRUE) +# NOTE: in GCS, you must use recursive as directories often don't appear in ls results ``` See `help(FileSystem)` for a list of options that `s3_bucket()`/`S3FileSystem$create()` From c4ffa744193c63a749fca00e54b3962fa201a6b9 Mon Sep 17 00:00:00 2001 From: Will Jones Date: Mon, 18 Jul 2022 14:11:18 -0700 Subject: [PATCH 05/13] chore: update dataset.Rmd fake output --- r/vignettes/dataset.Rmd | 62 ++++++++++++++++++++++------------------- 1 file changed, 33 insertions(+), 29 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index e78f6851c3e..3774e6445d6 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -157,29 +157,31 @@ ds ``` ```{r, echo = FALSE, eval = !file.exists("nyc-taxi")} cat(" -FileSystemDataset with 125 Parquet files -vendor_id: string -pickup_at: timestamp[us] -dropoff_at: timestamp[us] -passenger_count: int8 -trip_distance: float -pickup_longitude: float -pickup_latitude: float -rate_code_id: null -store_and_fwd_flag: string -dropoff_longitude: float -dropoff_latitude: float +FileSystemDataset with 158 Parquet files +vendor_name: string +pickup_datetime: timestamp[ms] +dropoff_datetime: timestamp[ms] +passenger_count: int64 +trip_distance: double +pickup_longitude: double +pickup_latitude: double +rate_code: string +store_and_fwd: string +dropoff_longitude: double +dropoff_latitude: double payment_type: string -fare_amount: float -extra: float -mta_tax: float -tip_amount: float -tolls_amount: float -total_amount: float +fare_amount: double +extra: double +mta_tax: double +tip_amount: double +tolls_amount: double +total_amount: double +improvement_surcharge: double +congestion_surcharge: double +pickup_location_id: int64 +dropoff_location_id: int64 year: int32 month: int32 - -See $metadata for additional Schema metadata ") ``` @@ -283,7 +285,7 @@ ds %>% ```{r, echo = FALSE, eval = !file.exists("nyc-taxi")} cat(" FileSystemDataset (query) -passenger_count: int8 +passenger_count: int64 median_tip_pct: double n: int32 @@ -324,19 +326,20 @@ percentage of rows from each batch: sampled_data <- ds %>% filter(year == 2015) %>% select(tip_amount, total_amount, passenger_count) %>% - map_batches(~ sample_frac(as.data.frame(.), 1e-4)) %>% - mutate(tip_pct = tip_amount / total_amount) 
+ map_batches(~ as_record_batch(sample_frac(as.data.frame(.), 1e-4))) %>% + mutate(tip_pct = tip_amount / total_amount) %>% + collect() str(sampled_data) ``` ```{r, echo = FALSE, eval = !file.exists("nyc-taxi")} cat(" -'data.frame': 15603 obs. of 4 variables: - $ tip_amount : num 0 0 1.55 1.45 5.2 ... - $ total_amount : num 5.8 16.3 7.85 8.75 26 ... - $ passenger_count: int 1 1 1 1 1 6 5 1 2 1 ... - $ tip_pct : num 0 0 0.197 0.166 0.2 ... +tibble [10,918 × 4] (S3: tbl_df/tbl/data.frame) + $ tip_amount : num [1:10918] 3 0 4 1 1 6 0 1.35 0 5.9 ... + $ total_amount : num [1:10918] 18.8 13.3 20.3 15.8 13.3 ... + $ passenger_count: int [1:10918] 3 2 1 1 1 1 1 1 1 3 ... + $ tip_pct : num [1:10918] 0.1596 0 0.197 0.0633 0.0752 ... ") ``` @@ -357,7 +360,8 @@ ds %>% as.data.frame() %>% mutate(pred_tip_pct = predict(model, newdata = .)) %>% filter(!is.nan(tip_pct)) %>% - summarize(sse_partial = sum((pred_tip_pct - tip_pct)^2), n_partial = n()) + summarize(sse_partial = sum((pred_tip_pct - tip_pct)^2), n_partial = n()) %>% + as_record_batch() }) %>% summarize(mse = sum(sse_partial) / sum(n_partial)) %>% pull(mse) From 6fd2e3fdf02475bcb22ce81ad9a45c69bf9a5080 Mon Sep 17 00:00:00 2001 From: Will Jones Date: Mon, 18 Jul 2022 14:11:49 -0700 Subject: [PATCH 06/13] doc: add GCS to fs.Rmd vignette --- r/.gitignore | 1 + r/_pkgdown.yml | 3 +- r/vignettes/fs.Rmd | 75 +++++++++++++++++++++++++++++++++------------- 3 files changed, 58 insertions(+), 21 deletions(-) diff --git a/r/.gitignore b/r/.gitignore index 695e42b7593..e607d2662f2 100644 --- a/r/.gitignore +++ b/r/.gitignore @@ -18,6 +18,7 @@ vignettes/nyc-taxi/ arrow_*.tar.gz arrow_*.tgz extra-tests/files +.deps # C++ sources for an offline build. They're copied from the ../cpp directory, so ignore them here. /tools/cpp/ diff --git a/r/_pkgdown.yml b/r/_pkgdown.yml index c0f599fb8a5..12ab3ccadcf 100644 --- a/r/_pkgdown.yml +++ b/r/_pkgdown.yml @@ -90,7 +90,7 @@ navbar: href: articles/install.html - text: Working with Arrow Datasets and dplyr href: articles/dataset.html - - text: Working with Cloud Storage (S3) + - text: Working with Cloud Storage (S3, GCS) href: articles/fs.html - text: Apache Arrow in Python and R with reticulate href: articles/python.html @@ -198,6 +198,7 @@ reference: - title: File systems contents: - s3_bucket + - gs_bucket - FileSystem - FileInfo - FileSelector diff --git a/r/vignettes/fs.Rmd b/r/vignettes/fs.Rmd index cd2816d0fa0..1b0a3b1f7f0 100644 --- a/r/vignettes/fs.Rmd +++ b/r/vignettes/fs.Rmd @@ -15,15 +15,14 @@ Google Cloud Storage (GCS). This vignette provides an overview of working with S3 and GCS data using Arrow. > In Windows and macOS binary packages, S3 and GCS support are included. On Linux when -installing from source, S3 support is not enabled by default, and it has +installing from source, S3 and GCS support is not enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` for details. ## URIs File readers and writers (`read_parquet()`, `write_feather()`, et al.) -accept an S3 URI as the source or destination file, -as do `open_dataset()` and `write_dataset()`. +accept a URI as the source or destination file, as do `open_dataset()` and `write_dataset()`. 
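For instance — as a sketch only, where the destination bucket is a made-up name you would replace with one you control — a dataset can be both opened from and written to a URI:

```r
library(arrow)

# Read the public example dataset directly from S3.
ds <- open_dataset("s3://voltrondata-labs-datasets/nyc-taxi")

# Write a copy to a hypothetical bucket of your own (requires write credentials).
write_dataset(ds, "s3://my-own-bucket/nyc-taxi-copy", format = "parquet")
```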
An S3 URI looks like: ``` @@ -41,18 +40,16 @@ For example, one of the NYC taxi data files used in `vignette("dataset", package ``` s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet -# Or in GCS: -# gs://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet +# Or in GCS (anonymous required on public buckets): +# gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet ``` - - Given this URI, you can pass it to `read_parquet()` just as if it were a local file path: ```r df <- read_parquet("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") # Or in GCS: -# df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") +df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") ``` Note that this will be slower to read than if the file were local, @@ -70,9 +67,9 @@ s3://voltrondata-labs-datasets/nyc-taxi/?endpoint_override=https://storage.googl ``` tells the `S3FileSystem` that it should allow the creation of new buckets and to -talk to Google Storage instead of S3. The latter is a neat trick that exists -because GCS implements an S3-compatible API, but for better support for GCS use -the GCSFileSystem with `gs://`. +talk to Google Storage instead of S3. The latter works because GCS implements an +S3-compatible API--see [File systems that emulate S3](#file-systems-that-emulate-s3) +below--but for better support for GCS use the GCSFileSystem with `gs://`. In GCS, a useful option is `retry_limit_seconds`, which sets the number of seconds a request may spend retrying before returning an error. The current default is @@ -100,7 +97,7 @@ In the previous example, this would look like: ```r bucket <- s3_bucket("voltrondata-labs-datasets") # Or in GCS (anonymous = TRUE is required): -# bucket <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE) +bucket <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE) df <- read_parquet(bucket$path("nyc-taxi/year=2019/month=6/data.parquet")) ``` @@ -110,19 +107,25 @@ the `$ls()` method: ```r bucket$ls("nyc-taxi") # Or recursive: -# bucket$ls("nyc-taxi", recursive = TRUE) -# NOTE: in GCS, you must use recursive as directories often don't appear in ls results +bucket$ls("nyc-taxi", recursive = TRUE) ``` +**NOTE**: in GCS, you *should always* use recursive as directories often don't appear in +`$ls()` results. + + + See `help(FileSystem)` for a list of options that `s3_bucket()`/`S3FileSystem$create()` and `gs_bucket()`/`GcsFileSystem$create()` can take. -For S3, only the following options can be included in t`region`, `scheme`, `endpoint_override` can be encoded as query -parameters in the URI (though `region` will be auto-detected in `s3_bucket()` or from the URI if omitted). -`access_key` and `secret_key` can also be included, -but other options are not supported in the URI. +For S3, only the following options can be included in the URI as query parameters +are `region`, `scheme`, `endpoint_override`, `access_key`, `secret_key`, `allow_bucket_creation`, +and `allow_bucket_deletion`. For GCS, the supported parameters are `scheme`, `endpoint_override`, +and `retry_limit_seconds`. -The object that `s3_bucket()` returns is technically a `SubTreeFileSystem`, which holds a path and a file system to which it corresponds. `SubTreeFileSystem`s can be useful for holding a reference to a subdirectory somewhere (on S3 or elsewhere). 
+The object that `s3_bucket()` and `gs_bucket()` return is technically a `SubTreeFileSystem`, +which holds a path and a file system to which it corresponds. `SubTreeFileSystem`s can be +useful for holding a reference to a subdirectory somewhere (on S3, GCS, or elsewhere). One way to get a subtree is to call the `$cd()` method on a `FileSystem` @@ -158,7 +161,39 @@ for temporary access by passing the `role_arn` identifier to `S3FileSystem$creat ### GCS Authentication - +To access public buckets, you must pass `anonymous = TRUE` or `anonymous` as the +user in a URI: + +```r +bucket <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE) +fs <- GcsFileSystem$create(anonymous = TRUE) +df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") +``` + +To access private buckets, you can pass either `access_token` and `expiration`, for using +temporary tokens designed for interactive use, or `json_credentials`, to directly reference +a service account credentials file desgined for application use. +Temporary tokens must be refreshed manually, while the latter method will handle refreshing +authentication automatically. + +You can get token credentials with the [gargle](https://gargle.r-lib.org/index.html) library: + +```r +library(gargle) + +token <- token_fetch() + +fs <- GcsFileSystem$create( + access_token=token$credentials$access_token, + expiration=Sys.time() + token$credentials$expires_in +) +``` + +For service accounts, you can provide the path to the credentials file with the +`json_credentials` or using the `GOOGLE_APPLICATION_CREDENTIALS` environment +variable. See Google's guide to +[Authenticating as a service account](https://cloud.google.com/docs/authentication/production#create_service_account) +for more information. ## Using a proxy server From 909b84c7ba479dba2e52dc1ea7431bb65382a441 Mon Sep 17 00:00:00 2001 From: Will Jones Date: Tue, 19 Jul 2022 08:28:39 -0700 Subject: [PATCH 07/13] docs: leave auth for another PR --- python/pyarrow/tests/test_fs.py | 3 ++- r/vignettes/fs.Rmd | 30 ++++++------------------------ 2 files changed, 8 insertions(+), 25 deletions(-) diff --git a/python/pyarrow/tests/test_fs.py b/python/pyarrow/tests/test_fs.py index e1b4604bbd2..3e35efa74f0 100644 --- a/python/pyarrow/tests/test_fs.py +++ b/python/pyarrow/tests/test_fs.py @@ -1616,7 +1616,8 @@ def test_s3_real_aws(): assert fs.region == default_region fs = S3FileSystem(anonymous=True, region='us-east-2') - entries = fs.get_file_info(FileSelector('voltrondata-labs-datasets/nyc-taxi')) + entries = fs.get_file_info(FileSelector( + 'voltrondata-labs-datasets/nyc-taxi')) assert len(entries) > 0 with fs.open_input_stream('voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet') as f: md = f.metadata() diff --git a/r/vignettes/fs.Rmd b/r/vignettes/fs.Rmd index 1b0a3b1f7f0..ef10ab930ac 100644 --- a/r/vignettes/fs.Rmd +++ b/r/vignettes/fs.Rmd @@ -63,7 +63,7 @@ that are passed down to configure the underlying file system. They are separated by `&`. 
For example, ``` -s3://voltrondata-labs-datasets/nyc-taxi/?endpoint_override=https://storage.googleapis.com&allow_bucket_creation=true +s3://voltrondata-labs-datasets/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true ``` tells the `S3FileSystem` that it should allow the creation of new buckets and to @@ -170,30 +170,12 @@ fs <- GcsFileSystem$create(anonymous = TRUE) df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") ``` -To access private buckets, you can pass either `access_token` and `expiration`, for using -temporary tokens designed for interactive use, or `json_credentials`, to directly reference -a service account credentials file desgined for application use. -Temporary tokens must be refreshed manually, while the latter method will handle refreshing -authentication automatically. +To access private buckets, you can pass either `access_token` and `expiration`, for using +temporary tokens generated elsewhere, or `json_credentials`, to reference a downloaded +credentials file. -You can get token credentials with the [gargle](https://gargle.r-lib.org/index.html) library: - -```r -library(gargle) - -token <- token_fetch() - -fs <- GcsFileSystem$create( - access_token=token$credentials$access_token, - expiration=Sys.time() + token$credentials$expires_in -) -``` - -For service accounts, you can provide the path to the credentials file with the -`json_credentials` or using the `GOOGLE_APPLICATION_CREDENTIALS` environment -variable. See Google's guide to -[Authenticating as a service account](https://cloud.google.com/docs/authentication/production#create_service_account) -for more information. + ## Using a proxy server From cd659af708144b763205c3c8e66a00a9d7b645db Mon Sep 17 00:00:00 2001 From: Will Jones Date: Tue, 19 Jul 2022 10:33:27 -0700 Subject: [PATCH 08/13] doc: fix regression in docs format --- r/man/ArrayData.Rd | 6 ++++-- r/man/Scalar.Rd | 6 ++++-- r/man/array.Rd | 6 ++++-- r/man/arrow-package.Rd | 2 +- r/vignettes/dataset.Rmd | 2 +- 5 files changed, 14 insertions(+), 8 deletions(-) diff --git a/r/man/ArrayData.Rd b/r/man/ArrayData.Rd index 383ab317d1e..2e27c6cfca5 100644 --- a/r/man/ArrayData.Rd +++ b/r/man/ArrayData.Rd @@ -9,14 +9,16 @@ The \code{ArrayData} class allows you to get and inspect the data inside an \code{arrow::Array}. } \section{Usage}{ -\preformatted{data <- Array$create(x)$data() + + +\if{html}{\out{
}}\preformatted{data <- Array$create(x)$data() data$type data$length data$null_count data$offset data$buffers -} +}\if{html}{\out{
}} } \section{Methods}{ diff --git a/r/man/Scalar.Rd b/r/man/Scalar.Rd index d814c623372..e9eac70776b 100644 --- a/r/man/Scalar.Rd +++ b/r/man/Scalar.Rd @@ -17,12 +17,14 @@ The \code{Scalar$create()} factory method instantiates a \code{Scalar} and takes } \section{Usage}{ -\preformatted{a <- Scalar$create(x) + + +\if{html}{\out{
}}\preformatted{a <- Scalar$create(x) length(a) print(a) a == a -} +}\if{html}{\out{
}} } \section{Methods}{ diff --git a/r/man/array.Rd b/r/man/array.Rd index 371c53ac87a..5a4bc40d95e 100644 --- a/r/man/array.Rd +++ b/r/man/array.Rd @@ -41,12 +41,14 @@ but not limited to strings only) } \section{Usage}{ -\preformatted{a <- Array$create(x) + + +\if{html}{\out{
}}\preformatted{a <- Array$create(x) length(a) print(a) a == a -} +}\if{html}{\out{
}} } \section{Methods}{ diff --git a/r/man/arrow-package.Rd b/r/man/arrow-package.Rd index 2a0143d02e5..e1b6808f6bf 100644 --- a/r/man/arrow-package.Rd +++ b/r/man/arrow-package.Rd @@ -6,7 +6,7 @@ \alias{arrow-package} \title{arrow: Integration to 'Apache' 'Arrow'} \description{ -'Apache' 'Arrow' is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. This package provides an interface to the 'Arrow C++' library. +'Apache' 'Arrow' \url{https://arrow.apache.org/} is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. This package provides an interface to the 'Arrow C++' library. } \seealso{ Useful links: diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 3774e6445d6..1a969f979c6 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -46,7 +46,7 @@ If your arrow build has S3 support, you can sync the data locally with: ```{r, eval = FALSE} arrow::copy_files("s3://voltrondata-labs-datasets/nyc-taxi", "nyc-taxi") # Alternatively, with GCS: -# arrow::copy_files("gs://voltrondata-labs-datasets/nyc-taxi", "nyc-taxi") +arrow::copy_files("gs://voltrondata-labs-datasets/nyc-taxi", "nyc-taxi") ``` If your arrow build doesn't have S3 support, you can download the files From 21149690b45d4c2fb5de129d0024faa4ad147a6c Mon Sep 17 00:00:00 2001 From: Will Jones Date: Tue, 19 Jul 2022 10:54:54 -0700 Subject: [PATCH 09/13] fix: lint python --- python/pyarrow/tests/test_fs.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/python/pyarrow/tests/test_fs.py b/python/pyarrow/tests/test_fs.py index 3e35efa74f0..05ebf4ed4c7 100644 --- a/python/pyarrow/tests/test_fs.py +++ b/python/pyarrow/tests/test_fs.py @@ -1619,7 +1619,8 @@ def test_s3_real_aws(): entries = fs.get_file_info(FileSelector( 'voltrondata-labs-datasets/nyc-taxi')) assert len(entries) > 0 - with fs.open_input_stream('voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet') as f: + key = 'voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet' + with fs.open_input_stream(key) as f: md = f.metadata() assert 'Content-Type' in md assert md['Last-Modified'] == b'2022-07-12T23:32:00Z' From 176a7f5fd7c5e421d4a89774be49e9d0178138c2 Mon Sep 17 00:00:00 2001 From: Will Jones Date: Tue, 19 Jul 2022 15:27:36 -0700 Subject: [PATCH 10/13] test: try not skipping unnecessarily on Windows --- r/tests/testthat/test-filesystem.R | 2 -- 1 file changed, 2 deletions(-) diff --git a/r/tests/testthat/test-filesystem.R b/r/tests/testthat/test-filesystem.R index a4ac45a47dd..7957743a2aa 100644 --- a/r/tests/testthat/test-filesystem.R +++ b/r/tests/testthat/test-filesystem.R @@ -187,7 +187,6 @@ test_that("s3_bucket", { capture.output(print(bucket)), "SubTreeFileSystem: s3://ursa-labs-r-test/" ) - skip_on_os("windows") # FIXME expect_identical(bucket$base_path, "ursa-labs-r-test/") }) @@ -202,6 +201,5 @@ test_that("gs_bucket", { capture.output(print(bucket)), "SubTreeFileSystem: gs://voltrondata-labs-datasets/" ) - skip_on_os("windows") # FIXME expect_identical(bucket$base_path, "voltrondata-labs-datasets/") }) From 72a64d09463a782532044b66d1ed15d33474af11 Mon Sep 17 00:00:00 2001 From: Will Jones Date: Thu, 21 Jul 2022 12:59:11 -0700 Subject: [PATCH 11/13] doc: 
clarify use of gcloud to get credentials --- r/vignettes/fs.Rmd | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/r/vignettes/fs.Rmd b/r/vignettes/fs.Rmd index ef10ab930ac..962f1a024f0 100644 --- a/r/vignettes/fs.Rmd +++ b/r/vignettes/fs.Rmd @@ -96,7 +96,7 @@ In the previous example, this would look like: ```r bucket <- s3_bucket("voltrondata-labs-datasets") -# Or in GCS (anonymous = TRUE is required): +# Or in GCS (anonymous = TRUE is required if credentials are not configured): bucket <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE) df <- read_parquet(bucket$path("nyc-taxi/year=2019/month=6/data.parquet")) ``` @@ -161,8 +161,19 @@ for temporary access by passing the `role_arn` identifier to `S3FileSystem$creat ### GCS Authentication -To access public buckets, you must pass `anonymous = TRUE` or `anonymous` as the -user in a URI: +The simplest way to authenticate with GCS is to run the [gcloud](https://cloud.google.com/sdk/docs/) +command to setup application default credentials: + +``` +gcloud auth application-default login +``` + +To manually configure credentials, you can pass either `access_token` and `expiration`, for using +temporary tokens generated elsewhere, or `json_credentials`, to reference a downloaded +credentials file. + +If you haven't configured credentials, then to access *public* buckets, you +must pass `anonymous = TRUE` or `anonymous` as the user in a URI: ```r bucket <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE) @@ -170,10 +181,6 @@ fs <- GcsFileSystem$create(anonymous = TRUE) df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") ``` -To access private buckets, you can pass either `access_token` and `expiration`, for using -temporary tokens generated elsewhere, or `json_credentials`, to reference a downloaded -credentials file. - From 3aae2e922f68dd0ba8e9c8831162ad29f049c802 Mon Sep 17 00:00:00 2001 From: Will Jones Date: Fri, 22 Jul 2022 10:45:01 -0700 Subject: [PATCH 12/13] fix: minor PR feedback --- r/R/filesystem.R | 2 +- r/vignettes/fs.Rmd | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/r/R/filesystem.R b/r/R/filesystem.R index 0df5608a0cb..048d90e098f 100644 --- a/r/R/filesystem.R +++ b/r/R/filesystem.R @@ -448,7 +448,7 @@ s3_bucket <- function(bucket, ...) { SubTreeFileSystem$create(fs_and_path$path, fs) } -#' Connect to an Google Storage Service (GCS) bucket +#' Connect to a Google Cloud Storage (GCS) bucket #' #' `gs_bucket()` is a convenience function to create an `GcsFileSystem` object #' that holds onto its relative path diff --git a/r/vignettes/fs.Rmd b/r/vignettes/fs.Rmd index 962f1a024f0..615f1656511 100644 --- a/r/vignettes/fs.Rmd +++ b/r/vignettes/fs.Rmd @@ -15,7 +15,7 @@ Google Cloud Storage (GCS). This vignette provides an overview of working with S3 and GCS data using Arrow. > In Windows and macOS binary packages, S3 and GCS support are included. On Linux when -installing from source, S3 and GCS support is not enabled by default, and it has +installing from source, S3 and GCS support is not always enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` for details. 
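Taken together, the GCS authentication options described in this change might be used as in the following sketch. The service-account file path and the private bucket name are placeholders, and `json_credentials` is shown as the documented alternative to `gcloud auth application-default login`:

```r
library(arrow)

# Public data: explicit anonymous access, so no credential lookup is attempted.
taxi <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE)
df <- read_parquet(taxi$path("nyc-taxi/year=2019/month=6/data.parquet"))

# Private data: point at a downloaded service-account file (placeholder path)
# and wrap the bucket (placeholder name) in a SubTreeFileSystem.
fs <- GcsFileSystem$create(json_credentials = "/path/to/service-account.json")
private <- SubTreeFileSystem$create("my-private-bucket", fs)
df2 <- read_parquet(private$path("data/file.parquet"))
```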
From 7bd2f8a05da2b24d9a5536349190518665756f71 Mon Sep 17 00:00:00 2001 From: Will Jones Date: Fri, 22 Jul 2022 11:46:05 -0700 Subject: [PATCH 13/13] docs: refactor fs vignette and add description of GcsFileSystem params --- r/R/filesystem.R | 20 ++++++ r/man/FileSystem.Rd | 21 +++++++ r/man/gs_bucket.Rd | 2 +- r/vignettes/fs.Rmd | 148 ++++++++++++++++++++++++-------------------- 4 files changed, 122 insertions(+), 69 deletions(-) diff --git a/r/R/filesystem.R b/r/R/filesystem.R index 048d90e098f..2f0b1cfd585 100644 --- a/r/R/filesystem.R +++ b/r/R/filesystem.R @@ -155,6 +155,26 @@ FileSelector$create <- function(base_dir, allow_not_found = FALSE, recursive = F #' - `allow_bucket_deletion`: logical, if TRUE, the filesystem will delete #' buckets if`$DeleteDir()` is called on the bucket level (default `FALSE`). #' +#' `GcsFileSystem$create()` optionally takes arguments: +#' +#' - `anonymous`: logical, default `FALSE`. If true, will not attempt to look up +#' credentials using standard GCS configuration methods. +#' - `access_token`: optional string for authentication. Should be provided along +#' with `expiration` +#' - `expiration`: optional date representing point at which `access_token` will +#' expire. +#' - `json_credentials`: optional string for authentication. Point to a JSON +#' credentials file downloaded from GCS. +#' - `endpoint_override`: if non-empty, will connect to provided host name / port, +#' such as "localhost:9001", instead of default GCS ones. This is primarily useful +#' for testing purposes. +#' - `scheme`: connection transport (default "https") +#' - `default_bucket_location`: the default location (or "region") to create new +#' buckets in. +#' - `retry_limit_seconds`: the maximum amount of time to spend retrying if +#' the filesystem encounters errors. Default is 15 seconds. +#' - `default_metadata`: default metadata to write in new objects. +#' #' @section Methods: #' #' - `$GetFileInfo(x)`: `x` may be a [FileSelector][FileSelector] or a character diff --git a/r/man/FileSystem.Rd b/r/man/FileSystem.Rd index 41d9e925140..f4f6cb57ffc 100644 --- a/r/man/FileSystem.Rd +++ b/r/man/FileSystem.Rd @@ -56,6 +56,27 @@ buckets if \verb{$CreateDir()} is called on the bucket level (default \code{FALS \item \code{allow_bucket_deletion}: logical, if TRUE, the filesystem will delete buckets if\verb{$DeleteDir()} is called on the bucket level (default \code{FALSE}). } + +\code{GcsFileSystem$create()} optionally takes arguments: +\itemize{ +\item \code{anonymous}: logical, default \code{FALSE}. If true, will not attempt to look up +credentials using standard GCS configuration methods. +\item \code{access_token}: optional string for authentication. Should be provided along +with \code{expiration} +\item \code{expiration}: optional date representing point at which \code{access_token} will +expire. +\item \code{json_credentials}: optional string for authentication. Point to a JSON +credentials file downloaded from GCS. +\item \code{endpoint_override}: if non-empty, will connect to provided host name / port, +such as "localhost:9001", instead of default GCS ones. This is primarily useful +for testing purposes. +\item \code{scheme}: connection transport (default "https") +\item \code{default_bucket_location}: the default location (or "region") to create new +buckets in. +\item \code{retry_limit_seconds}: the maximum amount of time to spend retrying if +the filesystem encounters errors. Default is 15 seconds. +\item \code{default_metadata}: default metadata to write in new objects. 
+} } \section{Methods}{ diff --git a/r/man/gs_bucket.Rd b/r/man/gs_bucket.Rd index d45626a6beb..7dc39a42c3d 100644 --- a/r/man/gs_bucket.Rd +++ b/r/man/gs_bucket.Rd @@ -2,7 +2,7 @@ % Please edit documentation in R/filesystem.R \name{gs_bucket} \alias{gs_bucket} -\title{Connect to an Google Storage Service (GCS) bucket} +\title{Connect to a Google Cloud Storage (GCS) bucket} \usage{ gs_bucket(bucket, ...) } diff --git a/r/vignettes/fs.Rmd b/r/vignettes/fs.Rmd index 615f1656511..6fb7e2d1af9 100644 --- a/r/vignettes/fs.Rmd +++ b/r/vignettes/fs.Rmd @@ -19,9 +19,68 @@ installing from source, S3 and GCS support is not always enabled by default, and additional system requirements. See `vignette("install", package = "arrow")` for details. +## Creating a FileSystem object + +One way of working with filesystems is to create `?FileSystem` objects. +`?S3FileSystem` objects can be created with the `s3_bucket()` function, which +automatically detects the bucket's AWS region. Similarly, `?GcsFileSystem` objects +can be created with the `gs_bucket()` function. The resulting +`FileSystem` will consider paths relative to the bucket's path (so for example +you don't need to prefix the bucket path when listing a directory). + +With a `FileSystem` object, you can point to specific files in it with the `$path()` method +and pass the result to file readers and writers (`read_parquet()`, `write_feather()`, et al.). +For example, to read a parquet file from the example NYC taxi data +(used in `vignette("dataset", package = "arrow")`): + +```r +bucket <- s3_bucket("voltrondata-labs-datasets") +# Or in GCS (anonymous = TRUE is required if credentials are not configured): +bucket <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE) +df <- read_parquet(bucket$path("nyc-taxi/year=2019/month=6/data.parquet")) +``` + +Note that this will be slower to read than if the file were local, +though if you're running on a machine in the same AWS region as the file in S3, +the cost of reading the data over the network should be much lower. + +You can list the files and/or directories in a bucket or subdirectory using +the `$ls()` method: + +```r +bucket$ls("nyc-taxi") +# Or recursive: +bucket$ls("nyc-taxi", recursive = TRUE) +``` + +**NOTE**: in GCS, you *should always* use `recursive = TRUE` as directories often don't appear in +`$ls()` results. + + + +See `help(FileSystem)` for a list of options that `s3_bucket()`/`S3FileSystem$create()` +and `gs_bucket()`/`GcsFileSystem$create()` can take. + +The object that `s3_bucket()` and `gs_bucket()` return is technically a `SubTreeFileSystem`, +which holds a path and a file system to which it corresponds. `SubTreeFileSystem`s can be +useful for holding a reference to a subdirectory somewhere (on S3, GCS, or elsewhere). + +One way to get a subtree is to call the `$cd()` method on a `FileSystem` + +```r +june2019 <- bucket$cd("2019/06") +df <- read_parquet(june2019$path("data.parquet")) +``` + +`SubTreeFileSystem` can also be made from a URI: + +```r +june2019 <- SubTreeFileSystem$create("s3://voltrondata-labs-datasets/nyc-taxi/2019/06") +``` + ## URIs -File readers and writers (`read_parquet()`, `write_feather()`, et al.) +File readers and writers (`read_parquet()`, `write_feather()`, et al.) also accept a URI as the source or destination file, as do `open_dataset()` and `write_dataset()`. 
An S3 URI looks like: @@ -32,8 +91,8 @@ s3://[access_key:secret_key@]bucket/path[?region=] A GCS URI looks like: ``` -gs://[access_key:secret_key@]bucket/path[?region=] -gs://anonymous@bucket/path[?region=] +gs://[access_key:secret_key@]bucket/path +gs://anonymous@bucket/path ``` For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at @@ -41,7 +100,7 @@ For example, one of the NYC taxi data files used in `vignette("dataset", package ``` s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet # Or in GCS (anonymous required on public buckets): -# gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet +gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet ``` Given this URI, you can pass it to `read_parquet()` just as if it were a local file path: @@ -52,10 +111,6 @@ df <- read_parquet("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/da df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") ``` -Note that this will be slower to read than if the file were local, -though if you're running on a machine in the same AWS region as the file in S3, -the cost of reading the data over the network should be much lower. - ### URI options URIs accept additional options in the query parameters (the part after the `?`) @@ -66,78 +121,35 @@ by `&`. For example, s3://voltrondata-labs-datasets/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true ``` -tells the `S3FileSystem` that it should allow the creation of new buckets and to -talk to Google Storage instead of S3. The latter works because GCS implements an -S3-compatible API--see [File systems that emulate S3](#file-systems-that-emulate-s3) -below--but for better support for GCS use the GCSFileSystem with `gs://`. - -In GCS, a useful option is `retry_limit_seconds`, which sets the number of seconds -a request may spend retrying before returning an error. The current default is -15 minutes, so in many interactive contexts it's nice to set a lower value: - -``` -gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=10 -``` - -## Creating a FileSystem object - -Instead of configuring filesystems through URIs, you can create `FileSystem` -objects. `S3FileSystem` objects can be created with the `s3_bucket()` function, which -automatically detects the bucket's AWS region. Similarly, `GcsFileSystem` objects -can be created with the `gs_bucket()` function. Additionally, the resulting -`FileSystem` will consider paths relative to the bucket's path (so for example -you don't need to prefix the bucket path when listing a directory). -This may be convenient when dealing with -long URIs, and it's necessary for some options and authentication methods -that aren't supported in the URI format. - -With a `FileSystem` object, you can point to specific files in it with the `$path()` method. 
-In the previous example, this would look like: +is equivlant to: ```r -bucket <- s3_bucket("voltrondata-labs-datasets") -# Or in GCS (anonymous = TRUE is required if credentials are not configured): -bucket <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE) -df <- read_parquet(bucket$path("nyc-taxi/year=2019/month=6/data.parquet")) -``` - -You can list the files and/or directories in an S3 bucket or subdirectory using -the `$ls()` method: - -```r -bucket$ls("nyc-taxi") -# Or recursive: -bucket$ls("nyc-taxi", recursive = TRUE) +fs <- S3FileSystem$create( + endpoint_override="https://storage.googleapis.com", + allow_bucket_creation=TRUE +) +fs$path("voltrondata-labs-datasets/") ``` -**NOTE**: in GCS, you *should always* use recursive as directories often don't appear in -`$ls()` results. - - - -See `help(FileSystem)` for a list of options that `s3_bucket()`/`S3FileSystem$create()` -and `gs_bucket()`/`GcsFileSystem$create()` can take. +Both tell the `S3FileSystem` that it should allow the creation of new buckets and to +talk to Google Storage instead of S3. The latter works because GCS implements an +S3-compatible API--see [File systems that emulate S3](#file-systems-that-emulate-s3) +below--but for better support for GCS use the GCSFileSystem with `gs://`. Also note +that parameters in the URI need to be +[percent encoded](https://en.wikipedia.org/wiki/Percent-encoding), which is why +`://` is written as `%3A%2F%2F`. For S3, only the following options can be included in the URI as query parameters are `region`, `scheme`, `endpoint_override`, `access_key`, `secret_key`, `allow_bucket_creation`, and `allow_bucket_deletion`. For GCS, the supported parameters are `scheme`, `endpoint_override`, and `retry_limit_seconds`. -The object that `s3_bucket()` and `gs_bucket()` return is technically a `SubTreeFileSystem`, -which holds a path and a file system to which it corresponds. `SubTreeFileSystem`s can be -useful for holding a reference to a subdirectory somewhere (on S3, GCS, or elsewhere). - -One way to get a subtree is to call the `$cd()` method on a `FileSystem` +In GCS, a useful option is `retry_limit_seconds`, which sets the number of seconds +a request may spend retrying before returning an error. The current default is +15 minutes, so in many interactive contexts it's nice to set a lower value: -```r -june2019 <- bucket$cd("2019/06") -df <- read_parquet(june2019$path("data.parquet")) ``` - -`SubTreeFileSystem` can also be made from a URI: - -```r -june2019 <- SubTreeFileSystem$create("s3://voltrondata-labs-datasets/nyc-taxi/2019/06") +gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=10 ``` ## Authentication