From d18702d87a92c3b5db5546f1e180a7f3d7cd1fdb Mon Sep 17 00:00:00 2001
From: Will Jones
Date: Sun, 24 Jul 2022 17:47:20 -0400
Subject: [PATCH 1/3] doc: write news

---
 r/NEWS.md | 47 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 41 insertions(+), 6 deletions(-)

diff --git a/r/NEWS.md b/r/NEWS.md
index 560e484c33e..1041fa51f4a 100644
--- a/r/NEWS.md
+++ b/r/NEWS.md
@@ -19,19 +19,54 @@
 # arrow 8.0.0.9000
 
-* The `arrow.dev_repo` for nightly builds of the R package and prebuilt
-  libarrow binaries is now https://nightlies.apache.org/arrow/r/.
-* `lubridate::parse_date_time()` datetime parser:
-  * `orders` with year, month, day, hours, minutes, and seconds components are supported.
-  * the `orders` argument in the Arrow binding works as follows: `orders` are transformed into `formats` which subsequently get applied in turn. There is no `select_formats` parameter and no inference takes place (like is the case in `lubridate::parse_date_time()`).
+## Arrays and Tables
+
+* Table and RecordBatch `$num_rows()` method returns a double (previously integer), avoiding integer overflow on larger tables. (ARROW-14989, ARROW-16977)
+
+## Reading and Writing
+
 * New functions `read_ipc_file()` and `write_ipc_file()` are added.
   These functions are almost the same as `read_feather()` and
   `write_feather()`, but differ in that they only target IPC files
   (Feather V2 files), not Feather V1 files.
 * `read_arrow()` and `write_arrow()`, deprecated since 1.0.0 (July 2020), have
   been removed. Instead of these, use the `read_ipc_file()` and
   `write_ipc_file()` for IPC files, or,
   `read_ipc_stream()` and `write_ipc_stream()` for IPC streams.
-* `write_parquet()` now defaults to writing Parquet format version 2.4 (was 1.0). Previously deprecated arguments `properties` and `arrow_properties` have been removed; if you need to deal with these lower-level properties objects directly, use `ParquetFileWriter`, which `write_parquet()` wraps.
+* `write_parquet()` now defaults to writing Parquet format version 2.4 (was 1.0). Previously deprecated arguments `properties` and `arrow_properties` have been removed; if you need to deal with these lower-level properties objects directly, use `ParquetFileWriter`, which `write_parquet()` wraps. (ARROW-16715)
+* UnionDatasets can unify schemas of multiple InMemoryDatasets with varying
+  schemas. (ARROW-16085)
+* `write_dataset()` preserves all schema metadata again. In 8.0.0, it would drop most metadata, breaking packages such as sfarrow. (ARROW-16511)
+* Reading and writing functions (such as `write_csv_arrow()`) will automatically (de-)compress data if the file path contains a compression extension (e.g. `"data.csv.gz"`). This works locally as well as on remote filesystems like S3 and GCS. (ARROW-16144)]
+* `FileSystemFactoryOptions` can be provided to `open_dataset()`, allowing you to pass options such as which file prefixes to ignore. (ARROW-15280)
+* By default, `S3FileSystem` will not create or delete buckets. To enable that, pass the configuration option `allow_bucket_creation` or `allow_bucket_deletion`. (ARROW-15906)
+* `GcsFileSystem` and `gs_bucket()` allow connecting to Google Cloud Storage. (ARROW-13404, ARROW-16887)
+* Removed `read_arrow()` and `write_arrow()` functions. They have been deprecated for several versions in favor of corresponding "ipc" and "feather" functions. (ARROW-16268)
+
+## Arrow dplyr queries
+
+* Bugfixes:
+  * Count distinct now gives correct result across multiple row groups. (ARROW-16807)
+  * Aggregations over partition columns return correct results. (ARROW-16700)
+* `dplyr::union` and `dplyr::union_all` are supported. (ARROW-15622)
+* `dplyr::glimpse` is supported. (ARROW-16776)
+* `show_exec_plan()` can be added to the end of a dplyr pipeline to show the underlying plan, similar to `dplyr::show_query`. (ARROW-15016)
+* User-defined functions are supported in queries. Use `register_scalar_function()` to create them. (ARROW-16444)
+* `lubridate::parse_date_time()` datetime parser: (ARROW-14848, ARROW-16407)
+  * `orders` with year, month, day, hours, minutes, and seconds components are supported.
+  * the `orders` argument in the Arrow binding works as follows: `orders` are transformed into `formats` which subsequently get applied in turn. There is no `select_formats` parameter and no inference takes place (like is the case in `lubridate::parse_date_time()`).
+* `lubridate::ymd()` and related string date parsers supported. (ARROW-16394). Month (`ym`, `my`) and quarter (`yq`) resolution parsers are also added. (ARROW-16516)
+* lubridate family of `ymd_hms` datetime parsing functions are supported. (ARROW-16395)
+* `lubridate::fast_strptime()` supported. (ARROW-16439)
+* `lubridate::floor_date()`, `lubridate::ceiling_date()`, and `lubridate::round_date()` are supported. (ARROW-14821)
+* `strptime()` supports the `tz` argument to pass timezones. (ARROW-16415)
 * added `lubridate::qday()` (day of quarter)
+* `map_batches()` returns a `RecordBatchReader` and requires that the function it maps returns something coercible to a `RecordBatch` through the `as_record_batch()` S3 function. It can also run in streaming fashion if passed `.lazy = TRUE`. (ARROW-15271, ARROW-16703)
+* `exp()` and `sqrt()` supported in dplyr queries. (ARROW-16871)
+
+## Packaging
+
+* The `arrow.dev_repo` for nightly builds of the R package and prebuilt
+  libarrow binaries is now https://nightlies.apache.org/arrow/r/.
+* Brotli and BZ2 are shipped with MacOS binaries. BZ2 is shipped with Windows binaries. (ARROW-16828)
 
 # arrow 8.0.0

From 9cc9caa668b2e39e39c0b8e9e0e75826fb9e7369 Mon Sep 17 00:00:00 2001
From: Will Jones
Date: Wed, 27 Jul 2022 11:48:27 -0400
Subject: [PATCH 2/3] docs: cleanup

---
 r/NEWS.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/r/NEWS.md b/r/NEWS.md
index 1041fa51f4a..67efa090ab6 100644
--- a/r/NEWS.md
+++ b/r/NEWS.md
@@ -19,27 +19,26 @@
 # arrow 8.0.0.9000
 
-## Arrays and Tables
+## Arrays and tables
 
 * Table and RecordBatch `$num_rows()` method returns a double (previously integer), avoiding integer overflow on larger tables. (ARROW-14989, ARROW-16977)
 
-## Reading and Writing
+## Reading and writing
 
 * New functions `read_ipc_file()` and `write_ipc_file()` are added.
   These functions are almost the same as `read_feather()` and
   `write_feather()`, but differ in that they only target IPC files
   (Feather V2 files), not Feather V1 files.
 * `read_arrow()` and `write_arrow()`, deprecated since 1.0.0 (July 2020), have
   been removed. Instead of these, use the `read_ipc_file()` and
   `write_ipc_file()` for IPC files, or,
-  `read_ipc_stream()` and `write_ipc_stream()` for IPC streams.
+  `read_ipc_stream()` and `write_ipc_stream()` for IPC streams. (ARROW-16268)
 * `write_parquet()` now defaults to writing Parquet format version 2.4 (was 1.0). Previously deprecated arguments `properties` and `arrow_properties` have been removed; if you need to deal with these lower-level properties objects directly, use `ParquetFileWriter`, which `write_parquet()` wraps. (ARROW-16715)
 * UnionDatasets can unify schemas of multiple InMemoryDatasets with varying
   schemas. (ARROW-16085)
 * `write_dataset()` preserves all schema metadata again. In 8.0.0, it would drop most metadata, breaking packages such as sfarrow. (ARROW-16511)
-* Reading and writing functions (such as `write_csv_arrow()`) will automatically (de-)compress data if the file path contains a compression extension (e.g. `"data.csv.gz"`). This works locally as well as on remote filesystems like S3 and GCS. (ARROW-16144)]
+* Reading and writing functions (such as `write_csv_arrow()`) will automatically (de-)compress data if the file path contains a compression extension (e.g. `"data.csv.gz"`). This works locally as well as on remote filesystems like S3 and GCS. (ARROW-16144)
 * `FileSystemFactoryOptions` can be provided to `open_dataset()`, allowing you to pass options such as which file prefixes to ignore. (ARROW-15280)
 * By default, `S3FileSystem` will not create or delete buckets. To enable that, pass the configuration option `allow_bucket_creation` or `allow_bucket_deletion`. (ARROW-15906)
 * `GcsFileSystem` and `gs_bucket()` allow connecting to Google Cloud Storage. (ARROW-13404, ARROW-16887)
-* Removed `read_arrow()` and `write_arrow()` functions. They have been deprecated for several versions in favor of corresponding "ipc" and "feather" functions. (ARROW-16268)
 
 ## Arrow dplyr queries
 
@@ -48,7 +47,8 @@
   * Aggregations over partition columns return correct results. (ARROW-16700)
 * `dplyr::union` and `dplyr::union_all` are supported. (ARROW-15622)
 * `dplyr::glimpse` is supported. (ARROW-16776)
-* `show_exec_plan()` can be added to the end of a dplyr pipeline to show the underlying plan, similar to `dplyr::show_query`. (ARROW-15016)
+* `show_exec_plan()` can be added to the end of a dplyr pipeline to show the underlying plan, similar to `dplyr::show_query()`. `dplyr::show_query()` and `dplyr::explain()` also work in Arrow dplyr pipelines. (ARROW-15016)
+* Functions can be called with package namespace prefixes (e.g. `stringr::`, `lubridate::`) within queries. For example, `stringr::str_length` will now dispatch to the same kernel as `str_length`. (ARROW-14575)
 * User-defined functions are supported in queries. Use `register_scalar_function()` to create them. (ARROW-16444)
 * `lubridate::parse_date_time()` datetime parser: (ARROW-14848, ARROW-16407)
   * `orders` with year, month, day, hours, minutes, and seconds components are supported.

From 2084df2b59639fb3d0d870e75781eada2612c206 Mon Sep 17 00:00:00 2001
From: Will Jones
Date: Wed, 27 Jul 2022 15:15:53 -0400
Subject: [PATCH 3/3] docs: reorder and consolidate

---
 r/NEWS.md | 46 ++++++++++++++++++++++++----------------------
 1 file changed, 24 insertions(+), 22 deletions(-)

diff --git a/r/NEWS.md b/r/NEWS.md
index 67efa090ab6..c2ad7f86ddb 100644
--- a/r/NEWS.md
+++ b/r/NEWS.md
@@ -19,9 +19,28 @@
 # arrow 8.0.0.9000
 
-## Arrays and tables
+## Arrow dplyr queries
 
-* Table and RecordBatch `$num_rows()` method returns a double (previously integer), avoiding integer overflow on larger tables. (ARROW-14989, ARROW-16977)
+* New dplyr verbs:
+  * `dplyr::union` and `dplyr::union_all` (ARROW-15622)
+  * `dplyr::glimpse` (ARROW-16776)
+  * `show_exec_plan()` can be added to the end of a dplyr pipeline to show the underlying plan, similar to `dplyr::show_query()`. `dplyr::show_query()` and `dplyr::explain()` also work and show the same output, but may change in the future. (ARROW-15016)
+* User-defined functions are supported in queries. Use `register_scalar_function()` to create them. (ARROW-16444)
+* `map_batches()` returns a `RecordBatchReader` and requires that the function it maps returns something coercible to a `RecordBatch` through the `as_record_batch()` S3 function. It can also run in streaming fashion if passed `.lazy = TRUE`. (ARROW-15271, ARROW-16703)
+* Functions can be called with package namespace prefixes (e.g. `stringr::`, `lubridate::`) within queries. For example, `stringr::str_length` will now dispatch to the same kernel as `str_length`. (ARROW-14575)
+* Support for new functions:
+  * `lubridate::parse_date_time()` datetime parser: (ARROW-14848, ARROW-16407)
+    * `orders` with year, month, day, hours, minutes, and seconds components are supported.
+    * the `orders` argument in the Arrow binding works as follows: `orders` are transformed into `formats` which subsequently get applied in turn. There is no `select_formats` parameter and no inference takes place (like is the case in `lubridate::parse_date_time()`).
+  * `lubridate` date and datetime parsers such as `lubridate::ymd()`, `lubridate::yq()`, and `lubridate::ymd_hms()` (ARROW-16394, ARROW-16516, ARROW-16395)
+  * `lubridate::fast_strptime()` (ARROW-16439)
+  * `lubridate::floor_date()`, `lubridate::ceiling_date()`, and `lubridate::round_date()` (ARROW-14821)
+  * `strptime()` supports the `tz` argument to pass timezones. (ARROW-16415)
+  * `lubridate::qday()` (day of quarter)
+  * `exp()` and `sqrt()`. (ARROW-16871)
+* Bugfixes:
+  * Count distinct now gives correct result across multiple row groups. (ARROW-16807)
+  * Aggregations over partition columns return correct results. (ARROW-16700)
 
 ## Reading and writing
 
@@ -40,27 +59,10 @@
 * By default, `S3FileSystem` will not create or delete buckets. To enable that, pass the configuration option `allow_bucket_creation` or `allow_bucket_deletion`. (ARROW-15906)
 * `GcsFileSystem` and `gs_bucket()` allow connecting to Google Cloud Storage. (ARROW-13404, ARROW-16887)
 
-## Arrow dplyr queries
-
-* Bugfixes:
-  * Count distinct now gives correct result across multiple row groups. (ARROW-16807)
-  * Aggregations over partition columns return correct results. (ARROW-16700)
-* `dplyr::union` and `dplyr::union_all` are supported. (ARROW-15622)
-* `dplyr::glimpse` is supported. (ARROW-16776)
-* `show_exec_plan()` can be added to the end of a dplyr pipeline to show the underlying plan, similar to `dplyr::show_query()`. `dplyr::show_query()` and `dplyr::explain()` also work in Arrow dplyr pipelines. (ARROW-15016)
-* Functions can be called with package namespace prefixes (e.g. `stringr::`, `lubridate::`) within queries. For example, `stringr::str_length` will now dispatch to the same kernel as `str_length`. (ARROW-14575)
-* User-defined functions are supported in queries. Use `register_scalar_function()` to create them. (ARROW-16444)
-* `lubridate::parse_date_time()` datetime parser: (ARROW-14848, ARROW-16407)
-  * `orders` with year, month, day, hours, minutes, and seconds components are supported.
-  * the `orders` argument in the Arrow binding works as follows: `orders` are transformed into `formats` which subsequently get applied in turn. There is no `select_formats` parameter and no inference takes place (like is the case in `lubridate::parse_date_time()`).
-* `lubridate::ymd()` and related string date parsers supported. (ARROW-16394). Month (`ym`, `my`) and quarter (`yq`) resolution parsers are also added. (ARROW-16516)
-* lubridate family of `ymd_hms` datetime parsing functions are supported. (ARROW-16395)
-* `lubridate::fast_strptime()` supported. (ARROW-16439)
-* `lubridate::floor_date()`, `lubridate::ceiling_date()`, and `lubridate::round_date()` are supported. (ARROW-14821)
-* `strptime()` supports the `tz` argument to pass timezones. (ARROW-16415)
-* added `lubridate::qday()` (day of quarter)
-* `map_batches()` returns a `RecordBatchReader` and requires that the function it maps returns something coercible to a `RecordBatch` through the `as_record_batch()` S3 function. It can also run in streaming fashion if passed `.lazy = TRUE`. (ARROW-15271, ARROW-16703)
-* `exp()` and `sqrt()` supported in dplyr queries. (ARROW-16871)
+
+## Arrays and tables
+
+* Table and RecordBatch `$num_rows()` method returns a double (previously integer), avoiding integer overflow on larger tables. (ARROW-14989, ARROW-16977)
 
 ## Packaging
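A short sketch of the `show_exec_plan()` entry in the NEWS changes above (assuming the arrow and dplyr packages from this release are installed; the plan text varies by build, so no output is shown):

```r
library(arrow)
library(dplyr)

# Build a query against an Arrow Table, then print the underlying
# ExecPlan instead of executing it, much like dplyr::show_query().
arrow_table(mtcars) %>%
  filter(cyl == 6) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  show_exec_plan()
```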
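The `register_scalar_function()` entry above can be sketched as follows; `times_two` is an invented example name, and the argument shape follows the function's documentation in this release:

```r
library(arrow)
library(dplyr)

# Register a scalar UDF. The wrapped R function receives a kernel
# context as its first argument; auto_convert = TRUE converts Arrow
# arrays to plain R vectors before the function is called.
register_scalar_function(
  "times_two",
  function(context, x) x * 2L,
  in_type = int32(),
  out_type = int32(),
  auto_convert = TRUE
)

# The registered name is then usable inside an Arrow dplyr query.
arrow_table(val = 1:3) %>%
  mutate(doubled = times_two(val)) %>%
  collect()
```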
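The reworked `map_batches()` contract described above (a `RecordBatchReader` result, and a mapped function whose return value must be coercible with `as_record_batch()`) might be exercised like this; treat the exact argument list as per this release's reference documentation:

```r
library(arrow)

ds <- InMemoryDataset$create(mtcars)

# The mapped function receives one RecordBatch at a time; returning a
# data frame is acceptable because as_record_batch() can coerce it.
reader <- map_batches(ds, function(batch) {
  head(as.data.frame(batch), 2)
})

# map_batches() now returns a RecordBatchReader rather than a Table,
# so the result is consumed (streamed) here rather than above.
tbl <- reader$read_table()
```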
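To make the `write_parquet()` default-version change above concrete (file names here are invented for illustration):

```r
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# As of this release, writes Parquet format version 2.4 by default.
write_parquet(df, "example.parquet")

# Readers limited to the old default can be accommodated by requesting
# format version 1.0 explicitly via the `version` argument.
write_parquet(df, "example-v1.parquet", version = "1.0")
```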
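The compression-by-extension entry above can be sketched locally (per the NEWS note, the same file-name convention applies on remote filesystems such as S3 and GCS):

```r
library(arrow)

# The ".gz" suffix alone selects gzip compression on write...
write_csv_arrow(mtcars, "mtcars.csv.gz")

# ...and transparent decompression on read; no extra arguments needed.
df <- read_csv_arrow("mtcars.csv.gz")
```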