@gadenbuie
Contributor

Closes #51

How does this work?

pkgload::load_all("~/work/posit-dev/querychat/pkg-r")
#> ℹ Loading querychat
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

con <- DBI::dbConnect(duckdb::duckdb())
duckdb::dbWriteTable(con, "mtcars", mtcars)

mtcars_db <- tbl(con, "mtcars")

Simple tbl source

First, we can create a new data source from the tbl() object.

src <- TblLazySource$new(mtcars_db)
(res <- src$execute_query("SELECT * FROM mtcars WHERE cyl > 4"))
#> # Source:   SQL [?? x 11]
#> # Database: DuckDB 1.4.1 [root@Darwin 25.0.0:R 4.5.2/:memory:]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  4  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  5  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  6  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  7  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#>  8  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
#>  9  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
#> 10  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
#> # ℹ more rows

This returns a tbl() that can be chained into further dplyr operations.

res |> count(cyl, gear)
#> # Source:   SQL [?? x 3]
#> # Database: DuckDB 1.4.1 [root@Darwin 25.0.0:R 4.5.2/:memory:]
#>     cyl  gear     n
#>   <dbl> <dbl> <dbl>
#> 1     6     5     1
#> 2     6     3     2
#> 3     8     3    12
#> 4     6     4     4
#> 5     8     5     2

Complicated tbl source

This same process even works for more complicated tibbles, like the result of
a dplyr pipeline on SQL tibbles.

mtcars_6_8_cyl <- mtcars_db |> inner_join(mtcars_db |> dplyr::filter(cyl > 4))
#> Joining with `by = join_by(mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear,
#> carb)`
src <- TblLazySource$new(mtcars_6_8_cyl)

And again, the result is a tbl() that can be folded into further dplyr
operations.

(res2 <- src$execute_query("SELECT * FROM mtcars_6_8_cyl WHERE gear < 6"))
#> # Source:   SQL [?? x 11]
#> # Database: DuckDB 1.4.1 [root@Darwin 25.0.0:R 4.5.2/:memory:]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  4  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  5  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  6  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  7  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#>  8  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
#>  9  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
#> 10  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
#> # ℹ more rows
res2 |> count(cyl, gear)
#> # Source:   SQL [?? x 3]
#> # Database: DuckDB 1.4.1 [root@Darwin 25.0.0:R 4.5.2/:memory:]
#>     cyl  gear     n
#>   <dbl> <dbl> <dbl>
#> 1     6     3     2
#> 2     8     5     2
#> 3     6     4     4
#> 4     6     5     1
#> 5     8     3    12

This works by extracting the SQL for the dplyr pipeline as it stands when the
data source is created. For complicated queries at least, we then wrap that
SQL in a local CTE, letting the LLM write queries against that CTE as if it
were a fixed table.

src$complete_query("SELECT * FROM mtcars_6_8_cyl WHERE gear < 6") |> cat()
#> Error in cat(src$complete_query("SELECT * FROM mtcars_6_8_cyl WHERE gear < 6")): attempt to apply non-function
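To illustrate the CTE idea independently of the package internals, here is a minimal sketch. It is not the implementation in this PR: `wrap_in_cte()` and the table name `mtcars_cyl` are hypothetical, and the sketch only assumes that `dbplyr::remote_query()` returns the SQL for a lazy tbl. The table name is pasted in unquoted, so it should be a plain identifier.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "mtcars", mtcars)

# A lazy dplyr pipeline standing in for the "complicated" tbl source
pipeline <- tbl(con, "mtcars") |> filter(cyl > 4)

# Hypothetical helper (not part of querychat): expose the pipeline's SQL
# as a CTE called `table_name`, then append the user- or LLM-written query.
wrap_in_cte <- function(tbl_lazy, table_name, query) {
  base_sql <- as.character(dbplyr::remote_query(tbl_lazy))
  paste0("WITH ", table_name, " AS (\n", base_sql, "\n)\n", query)
}

full_sql <- wrap_in_cte(
  pipeline,
  "mtcars_cyl",
  "SELECT * FROM mtcars_cyl WHERE gear < 6"
)
cat(full_sql)

# The combined statement runs as ordinary SQL on the same connection
res <- DBI::dbGetQuery(con, full_sql)
DBI::dbDisconnect(con)
```

The query passed in only ever refers to `mtcars_cyl`, so whoever writes it does not need to know how the underlying pipeline was built.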

Amazingly, we can even apply this strategy to get the schema of the CTE. This
took a small amount of updating to get_schema_impl() to make it work, but
the core logic is exactly the same.

src$get_schema() |> cat()
#> Table: mtcars_6_8_cyl
#> Columns:
#> - mpg (FLOAT)
#>   Range: 10.4 to 21.4
#> - cyl (FLOAT)
#>   Range: 6 to 8
#> - disp (FLOAT)
#>   Range: 145 to 472
#> - hp (FLOAT)
#>   Range: 105 to 335
#> - drat (FLOAT)
#>   Range: 2.76 to 4.22
#> - wt (FLOAT)
#>   Range: 2.62 to 5.424
#> - qsec (FLOAT)
#>   Range: 14.5 to 20.22
#> - vs (FLOAT)
#>   Range: 0 to 1
#> - am (FLOAT)
#>   Range: 0 to 1
#> - gear (FLOAT)
#>   Range: 3 to 5
#> - carb (FLOAT)
#>   Range: 1 to 8
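The schema ranges can be obtained with the same CTE wrapping. The following is a rough sketch of the idea for a single column, not the actual get_schema_impl() logic. The CTE name `mtcars_cyl` and the column aliases are illustrative assumptions.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "mtcars", mtcars)

# Lazy pipeline whose result we want to describe
pipeline <- tbl(con, "mtcars") |> filter(cyl > 4)
base_sql <- as.character(dbplyr::remote_query(pipeline))

# Aggregate against the CTE exactly as one would against a real table
range_sql <- paste0(
  "WITH mtcars_cyl AS (\n", base_sql, "\n)\n",
  "SELECT MIN(mpg) AS mpg_min, MAX(mpg) AS mpg_max FROM mtcars_cyl"
)

mpg_range <- DBI::dbGetQuery(con, range_sql)
mpg_range
DBI::dbDisconnect(con)
```

Because the aggregation runs in the database, the pipeline's rows never need to be collected into R just to compute the summary.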

Successfully merging this pull request may close these issues.

(R) Return a dbplyr::tbl object from querychat_server

2 participants