ARROW-14745: [R] Enable true duckdb streaming #11730
Closed
13 commits
- 9cbf065 True DuckDB streaming (jonkeane)
- b99e361 Add arrange (jonkeane)
- 495d489 force tbl_name, take out validity check (which leads to segfaults (jonkeane)
- e418f1f clean up tests back to original (jonkeane)
- 138c689 Make sure that we test parquet dataset roundtrips through the C-inter… (jonkeane)
- f21e364 Also test `$read_table()` for now (jonkeane)
- 5b065be Add a projection/filter test for arrow streaming (jonkeane)
- 2c77276 Clean up, always emit a RecordBatchReader (jonkeane)
- aa84d12 Remove extra new line (jonkeane)
- f5d1ed6 We don't need to `stream` anymore (jonkeane)
- 9c56ed0 Simplified RBR -> stream test (jonkeane)
- 115d651 add back in (jonkeane)
- ec1b8c3 bump CI (jonkeane)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -104,12 +104,11 @@ unique_arrow_tablename <- function() { |
| 104 | 104 | |
| 105 | 105 | # Creates an environment that disconnects the database when it's GC'd |
| 106 | 106 | duckdb_disconnector <- function(con, tbl_name) { |
| | 107 | + force(tbl_name) |
| 107 | 108 | reg.finalizer(environment(), function(...) { |
| 108 | 109 | # remote the table we ephemerally created (though only if the connection is |
| 109 | 110 | # still valid) |
| 110 | | - if (DBI::dbIsValid(con)) { |
| 111 | | - duckdb::duckdb_unregister_arrow(con, tbl_name) |
| 112 | | - } |
| | 111 | + duckdb::duckdb_unregister_arrow(con, tbl_name) |
| 113 | 112 | }) |
| 114 | 113 | environment() |
| 115 | 114 | } |
| | | @@ -120,8 +119,11 @@ duckdb_disconnector <- function(con, tbl_name) { |
| 120 | 119 | #' other processes (like DuckDB). |
| 121 | 120 | #' |
| 122 | 121 | #' @param .data the object to be converted |
| | 122 | + #' @param as_arrow_query should the returned object be wrapped as an |
| | 123 | + #' `arrow_dplyr_query`? (logical, default: `TRUE`) |
| 123 | 124 | #' |
| 124 | | - #' @return an `arrow_dplyr_query` object, to be used in dplyr pipelines. |
| | 125 | + #' @return a `RecordBatchReader` object, wrapped as an arrow dplyr query which |
| | 126 | + #' can be used in dplyr pipelines. |
| 125 | 127 | #' @export |
| 126 | 128 | #' |
| 127 | 129 | #' @examplesIf getFromNamespace("run_duckdb_examples", "arrow")() |
| | | @@ -136,7 +138,7 @@ duckdb_disconnector <- function(con, tbl_name) { |
| 136 | 138 | #' summarize(mean_mpg = mean(mpg, na.rm = TRUE)) %>% |
| 137 | 139 | #' to_arrow() %>% |
| 138 | 140 | #' collect() |
| 139 | | - to_arrow <- function(.data) { |
| | 141 | + to_arrow <- function(.data, as_arrow_query = TRUE) { |
| 140 | 142 | # If this is an Arrow object already, return quickly since we're already Arrow |
| 141 | 143 | if (inherits(.data, c("arrow_dplyr_query", "ArrowObject"))) { |
| 142 | 144 | return(.data) |
| | | @@ -155,6 +157,9 @@ to_arrow <- function(.data) { |
| 155 | 157 | # Run the query |
| 156 | 158 | res <- DBI::dbSendQuery(dbplyr::remote_con(.data), dbplyr::remote_query(.data), arrow = TRUE) |
| 157 | 159 | |
| 158 | | - # TODO: we shouldn't need $read_table(), but we get segfaults when we do. |
| 159 | | - arrow_dplyr_query(duckdb::duckdb_fetch_record_batch(res)$read_table()) |
| | 160 | + if (as_arrow_query) { |
| | 161 | + arrow_dplyr_query(duckdb::duckdb_fetch_record_batch(res)) |
| | 162 | + } else { |
| | 163 | + duckdb::duckdb_fetch_record_batch(res) |
| | 164 | + } |
| 160 | 165 | } |

Comment on lines +160 to +164 (Member):
How useful is this argument?

Probably (and hopefully) not very, but I wanted to have an escape route in case we see another issue like at the start of this PR where `duckdb::duckdb_fetch_record_batch(res)` fails, but `duckdb::duckdb_fetch_record_batch(res)$read_table()` works. (Over time, the DuckDB master branch got into a state where both failed consistently, but at the beginning reading the table worked just fine while accessing the RBR did not.) Since we are at the mercy of both of our release cycles for fixing this, the cost of having this escape hatch doesn't seem so bad to me, but I can remove it.

It also helps a bit when debugging / testing (one doesn't have to recreate `to_arrow()` without the wrapper).

You can get the RBR from `$.data` in the query object; is that sufficient for debugging purposes?

Oh nm, it gets wrapped in an InMemoryDataset. Alright, I don't like this, but it's fine; we can prune it later.

I could also mark it as a temporary workaround for when we simplify this in the future? It's a little silly to effectively say "this is deprecated" when we introduce it, but I'm not sure when we'll get to the simplification and improvements that would let this simply emit RBRs, with the right signaling/methods on them to make it clear that they are a good thing to use in dplyr queries.

I think `to_arrow()` is a good fit... it gives the thing that a user probably wants. A possible future `as_record_batch_reader()` would be the right incantation for a user who wants it!
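
For readers unfamiliar with the finalizer pattern that `duckdb_disconnector()` uses in the first hunk of the diff, here is a minimal base-R sketch of the same shape. It is an illustration only, not the package's code: `make_disconnector()` and `fake_unregister()` are hypothetical names, and `fake_unregister()` stands in for `duckdb::duckdb_unregister_arrow()` so the example runs without a database. The `force()` calls mirror the `force(tbl_name)` line added in this PR; the assumption here is that evaluating the arguments eagerly avoids resolving a lazy promise during garbage collection.

```r
# Hypothetical stand-in for duckdb::duckdb_unregister_arrow(): it only records
# which table names were cleaned up, so no database is needed.
cleaned_up <- character()
fake_unregister <- function(tbl_name) {
  cleaned_up <<- c(cleaned_up, tbl_name)
}

# Same shape as duckdb_disconnector(): the function returns its own
# environment, and a finalizer registered on that environment runs once the
# environment is garbage collected.
make_disconnector <- function(tbl_name, unregister = fake_unregister) {
  # Evaluate the arguments now so the finalizer does not have to force a
  # lazy promise later, during garbage collection.
  force(tbl_name)
  force(unregister)
  reg.finalizer(environment(), function(...) {
    unregister(tbl_name)
  })
  environment()
}

handle <- make_disconnector("arrow_001")
rm(handle)        # drop the last reference to the environment
invisible(gc())   # a collection gives the finalizer a chance to run
print(cleaned_up)
```

Holding on to the returned environment (for example, by attaching it to the object that depends on the registered table) keeps the registration alive; once nothing references it, the cleanup happens on a later garbage collection.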