
Conversation


@wjones127 wjones127 commented Dec 7, 2021

Updating map_batches() function to use RecordBatchReader instead of Scanner$ScanBatches() so that only one record batch is in memory at a time.

As part of this, I considered refactoring do_exec_plan to always return a RBR instead of a materialized Table, but I don't think I can do that until we get arrange, head, and tail operations to work outside of a sink node. See: https://issues.apache.org/jira/browse/ARROW-15271
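As a rough sketch of the usage this change enables (the dataset path is illustrative, and `total_amount` is borrowed from the vignette example quoted below; this is not code from the PR itself):

```r
library(arrow)
library(dplyr)

# Hypothetical multi-file Parquet dataset; any arrow Dataset works.
ds <- open_dataset("path/to/dataset")

# Each record batch is pulled from a RecordBatchReader and passed to
# the function one at a time, so only one batch is in memory at once.
row_counts <- ds %>%
  filter(total_amount > 100) %>%
  map_batches(~ .$num_rows, .data.frame = FALSE) %>%
  unlist()
```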


github-actions bot commented Dec 7, 2021

@@ -290,6 +290,64 @@ rows match the filter. Relatedly, since Parquet files contain row groups with
statistics on the data within, there may be entire chunks of data you can
avoid scanning because they have no rows where `total_amount > 100`.

### Processing data in batches
wjones127 (Member, Author):

This example was useful in testing, and hopefully gives some ideas for usage. Though perhaps it belongs more in the cookbook? LMK what you think.

Member:

I like it a lot. And I think it totally belongs here in a vignette (especially in the tone you have here). But it wouldn't be bad to make an issue to add to the cookbook as well (though don't feel obligated to do that right now if you don't want to!).


@wjones127 wjones127 marked this pull request as ready for review December 9, 2021 21:25

@paleolimbot paleolimbot left a comment


Just a note on the use of append() that may or may not be useful here!

@wjones127 wjones127 force-pushed the ARROW-14029-r-map-batches branch from 1ab8643 to adbd79e Compare December 21, 2021 16:27
wjones127 (Member, Author):

@jonkeane If you have time to review, I think it would be cool to get this into 7.0.0.


@jonkeane jonkeane left a comment


Thank you for this fix-up, especially in a part of the code that is not easy to grok and not super well documented! A few questions and comments, but this looks really good so far.

c(5, 10)
)

# $Take returns RecordBatch
Member:

Is this comment accurate here? It looks like it's returning a tibble? Or is that a side effect of arrange()?

wjones127 (Member, Author):

It returns a record batch within that function. But since the .data.frame option on map_batches() is TRUE by default, the results are combined into a tibble using dplyr::bind_rows().
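A minimal sketch of the two modes, assuming the map_batches() interface discussed in this PR (the dataset path is hypothetical):

```r
library(arrow)
library(dplyr)

ds <- open_dataset("path/to/dataset")  # hypothetical path

# Default (.data.frame = TRUE): each per-batch result is coerced to a
# data frame and the pieces are combined with dplyr::bind_rows(),
# so the caller sees a single tibble.
tbl <- ds %>% map_batches(~ .)

# .data.frame = FALSE: per-batch results come back as a plain list.
counts <- ds %>% map_batches(~ .$num_rows, .data.frame = FALSE)
```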

Member:

Ah, I see. It might be good to add that to the comment so that it's clear what's going on there.

wjones127 (Member, Author):

Yes, I've added a couple clarifying comments.

Comment on lines 473 to 476
map_batches(~ .$num_rows, .data.frame = FALSE) %>%
as.numeric() %>%
sort(),
c(5, 10)
Member:

Suggested change:
-   map_batches(~ .$num_rows, .data.frame = FALSE) %>%
-     as.numeric() %>%
-     sort(),
-   c(5, 10)
+   map_batches(~ .$num_rows, .data.frame = FALSE) %>%
+     sort(),
+   c(5L, 10L)

Does this work? as.numeric() in there is a little suspicious — what issues do you have if you take it out?

wjones127 (Member, Author):

map_batches() will return a list, but sort() only takes atomic vectors.
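In base R terms (independent of arrow), the issue looks like this:

```r
# What map_batches(..., .data.frame = FALSE) hands back: a plain list.
batch_rows <- list(5, 10)

# sort(batch_rows)           # error: 'x' must be atomic

sort(unlist(batch_rows))     # flatten to an atomic vector first: 5 10
sort(as.numeric(batch_rows)) # also works here, but hides the intent
```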

Member:

Aaaaaah, I see now. Maybe it would be a bit clearer with unlist() instead of as.numeric()?

lapply <- map_dfr
}
scanner <- Scanner$create(ensure_group_vars(X))
# TODO: possibly refactor do_exec_plan to return a RecordBatchReader
Member:

Would you mind making the Jira for this and putting the number here? I don't know of one off the top of my head, but we should get one if we think we'll (possibly) want to move in that direction.

wjones127 (Member, Author):

Jira created: https://issues.apache.org/jira/browse/ARROW-15271

Added to the comment as well.

@@ -174,8 +174,6 @@ ScanTask <- R6Class("ScanTask",
#' a `data.frame` for further aggregation, even if you couldn't fit the whole
#' `Dataset` result in memory.
#'
#' This is experimental and not recommended for production use.
Member:

We might still want to keep the experimental label here: it's working now, but as you mention, we might refactor it, and it could behave differently in the future.

wjones127 (Member, Author):

I think the do_exec_plan refactor wouldn't affect the behavior of this, but I'm not sure about the "wrap in arrow_dplyr_query" one.

If we keep it experimental, we should either remove the vignette below or mark that as experimental as well.

Member:

I think the vignette is helpful and useful. I would lean (slightly) towards marking this as experimental in both places, but only weakly; I'm happy to go with it if you would prefer to take the experimental marks off.

wjones127 (Member, Author):

Okay, I've marked both as experimental.

@wjones127 wjones127 force-pushed the ARROW-14029-r-map-batches branch from adbd79e to 7457c9d Compare January 6, 2022 18:08
@wjones127 wjones127 requested a review from jonkeane January 6, 2022 20:40

@jonkeane jonkeane left a comment


This looks great, thank you! One super minor suggestion that you can take or leave. I'll merge this sometime tomorrow, to give you a chance to look at that.

filter(int > 5) %>%
select(int, lgl) %>%
map_batches(~ .$num_rows, .data.frame = FALSE) %>%
unlist() %>% # Returns list because .data.frame is FALSE
Member:

Thanks for this fantastic extra clarification!


Co-authored-by: Jonathan Keane <jkeane@gmail.com>

ursabot commented Jan 7, 2022

Benchmark runs are scheduled for baseline = e64480d and contender = f054440. f054440 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.35% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.7% ⬆️0.04%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
