# ARROW-14029: [R] Repair map_batches() #11894
@@ -290,6 +290,66 @@

rows match the filter. Relatedly, since Parquet files contain row groups with
statistics on the data within, there may be entire chunks of data you can
avoid scanning because they have no rows where `total_amount > 100`.
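For illustration only (this code is not part of the diff), a query along these lines is the kind that benefits from that row-group pruning; it assumes the `ds` Dataset and `total_amount` column used throughout this vignette:

```r
# Sketch: row groups whose statistics show max(total_amount) <= 100 can be
# skipped entirely, so only matching chunks are read from disk.
library(arrow)
library(dplyr)

ds %>%
  filter(total_amount > 100) %>%   # pushed down to the scan
  select(tip_amount, total_amount) %>%
  collect() %>%                    # pull only the matching rows into R
  nrow()
```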
### Processing data in batches

Sometimes you want to run R code on the entire dataset, but that dataset is much
larger than memory. You can use `map_batches` on a dataset query to process
it batch-by-batch.

**Note**: `map_batches` is experimental and not recommended for production use.

As an example, to randomly sample a dataset, use `map_batches` to sample a
percentage of rows from each batch:

```{r, eval = file.exists("nyc-taxi")}
sampled_data <- ds %>%
  filter(year == 2015) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  map_batches(~ sample_frac(as.data.frame(.), 1e-4)) %>%
  mutate(tip_pct = tip_amount / total_amount)

str(sampled_data)
```

```{r, echo = FALSE, eval = !file.exists("nyc-taxi")}
cat("
'data.frame': 15603 obs. of  4 variables:
 $ tip_amount     : num  0 0 1.55 1.45 5.2 ...
 $ total_amount   : num  5.8 16.3 7.85 8.75 26 ...
 $ passenger_count: int  1 1 1 1 1 6 5 1 2 1 ...
 $ tip_pct        : num  0 0 0.197 0.166 0.2 ...
")
```

This function can also be used to aggregate summary statistics over a dataset by
computing partial results for each batch and then aggregating those partial
results. Extending the example above, you could fit a model to the sample data
and then use `map_batches` to compute the MSE on the full dataset.

```{r, eval = file.exists("nyc-taxi")}
model <- lm(tip_pct ~ total_amount + passenger_count, data = sampled_data)

ds %>%
  filter(year == 2015) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = tip_amount / total_amount) %>%
  map_batches(function(batch) {
    batch %>%
      as.data.frame() %>%
      mutate(pred_tip_pct = predict(model, newdata = .)) %>%
      filter(!is.nan(tip_pct)) %>%
      summarize(sse_partial = sum((pred_tip_pct - tip_pct)^2), n_partial = n())
  }) %>%
  summarize(mse = sum(sse_partial) / sum(n_partial)) %>%
  pull(mse)
```

```{r, echo = FALSE, eval = !file.exists("nyc-taxi")}
cat("
[1] 0.1304284
")
```

## More dataset options

There are a few ways you can control the Dataset creation to adapt to special use cases.
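As a pointer ahead (a sketch, not part of this PR's changes), those options are passed to `arrow::open_dataset()`; for example, the `partitioning`, `format`, and `schema` arguments control how files are discovered and read. The path below is the same `nyc-taxi` directory used earlier in the vignette:

```r
# Minimal sketch, assuming the nyc-taxi directory used earlier: name the
# partition fields and file format explicitly rather than relying on defaults.
library(arrow)

ds <- open_dataset(
  "nyc-taxi",
  partitioning = c("year", "month"), # directory levels become year/month columns
  format = "parquet"                 # parquet is the default; stated explicitly here
)
```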
> Thanks for this fantastic extra clarification!