WIP: extend jplyr interface #2

davidagold · 2016-09-08T20:37:00Z

This PR reframes the SQL translation regime implemented in SQLQuery as an extension of the jplyr querying framework. The translation logic is left essentially untouched.

This PR introduces the following broad changes:

It introduces the abstract SQLTable <: AbstractTable and the concrete SQLiteTable <: SQLTable. The latter is a thin wrapper around a SQLite.DB connection and a field that names the table.
The present package now relies on jplyr to provide the graph-generating @query macro. The intended user-facing interface is @query src |> ..., where src is an object of type T <: SQLTable; this generates a Query{T}, which users can then collect against src. Note that jplyr does not support all of the QueryNode leaf subtypes that the present package does, in particular DistinctNode, LimitNode, and OffsetNode. The present package therefore introduces these types and illustrates (or will illustrate) jplyr's QueryNode registration mechanism, which still needs some work.
The SQL translation machinery is thinly wrapped by Base.collect(tbl::SQLTable, graph::jplyr.QueryNode), which translates the graph into a SQL string via translatesql and runs the query against tbl. By default, the result set is streamed into a Tables.Table.
A bunch of file reorganization. The philosophy is to segregate functionality by the primary type it concerns (e.g. QueryNode objects, SQLTables, or SQLiteTables). In the sqltable and sqlitetable folders, a query folder houses the code relevant to extending the jplyr collect machinery. translate.jl lives in sqltable/query.
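
To make the shape of these changes concrete, here is a minimal sketch of the type hierarchy and of a graph-to-SQL translation step. It is an illustration under stated assumptions, not the code in this PR: it uses Julia 1.x syntax, the node types (SourceNode, FilterNode, SelectNode) and their fields are hypothetical stand-ins for jplyr's QueryNode subtypes, the db field holds a plain file path rather than a SQLite.DB, and this translatesql only assembles a string. The real collect would additionally run that string against the connection.

```julia
# Hypothetical stand-ins; the real types live in AbstractTables, jplyr and SQLite.jl.
abstract type AbstractTable end
abstract type SQLTable <: AbstractTable end

# Thin wrapper: a connection handle (here just a file path) plus the table name.
struct SQLiteTable <: SQLTable
    db::String
    name::String
end

# A tiny query graph: each node points at the node it takes its input from.
abstract type QueryNode end
struct SourceNode <: QueryNode end
struct FilterNode <: QueryNode
    input::QueryNode
    predicate::String        # already rendered as a SQL expression (assumption)
end
struct SelectNode <: QueryNode
    input::QueryNode
    columns::Vector{String}
end

# Walk from the outermost verb back to the source, gathering clauses on the way.
function translatesql(tbl::SQLTable, node::QueryNode)
    columns = String[]
    conditions = String[]
    while !(node isa SourceNode)
        if node isa SelectNode
            # The outermost select decides the output columns.
            isempty(columns) && append!(columns, node.columns)
        elseif node isa FilterNode
            push!(conditions, node.predicate)
        end
        node = node.input
    end
    sql = "SELECT " * (isempty(columns) ? "*" : join(columns, ", ")) * " FROM " * tbl.name
    if !isempty(conditions)
        # Conditions were gathered outermost-first; restore source order.
        sql *= " WHERE " * join(reverse(conditions), " AND ")
    end
    return sql
end

iris = SQLiteTable("iris.db", "iris")
graph = SelectNode(FilterNode(SourceNode(), "sepal_length > 5.0"),
                   ["species", "sepal_length", "petal_width"])
println(translatesql(iris, graph))
# prints: SELECT species, sepal_length, petal_width FROM iris WHERE sepal_length > 5.0
```

Dispatching on SQLTable rather than SQLiteTable keeps the translation step backend-agnostic, which is the point of having the abstract type sit between AbstractTable and the concrete wrapper.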

TO-DO (not exhaustive):

finalize and implement jplyr.QueryNode registration framework
implement primitives for QueryNode leaf subtypes, for SQLTables/SQLiteTables
polish DataStreams interfacing
documentation!
tests
figure out story of registering/requiring AbstractTables/Tables/jplyr

Here's a teaser (relying on new work in Tables/AbstractTables/jplyr that I haven't yet pushed):

julia> using SQLQuery

julia> iris_sql = SQLiteTable("/Users/David/.julia/v0.5/SQLQuery/db/iris.db", "iris")
SQLQuery.SQLiteTable(SQLite.DB("/Users/David/.julia/v0.5/SQLQuery/db/iris.db"),"iris")

julia> iris_tbl = Table(CSV.Source("/Users/David/.julia/v0.5/SQLQuery/csv/iris.csv"))
Tables.Table
│ Row │ sepal_length │ sepal_width │ petal_length │ petal_width │ species  │
├─────┼──────────────┼─────────────┼──────────────┼─────────────┼──────────┤
│ 1   │ 5.1          │ 3.5         │ 1.4          │ 0.2         │ "setosa" │
│ 2   │ 4.9          │ 3.0         │ 1.4          │ 0.2         │ "setosa" │
│ 3   │ 4.7          │ 3.2         │ 1.3          │ 0.2         │ "setosa" │
│ 4   │ 4.6          │ 3.1         │ 1.5          │ 0.2         │ "setosa" │
│ 5   │ 5.0          │ 3.6         │ 1.4          │ 0.2         │ "setosa" │
│ 6   │ 5.4          │ 3.9         │ 1.7          │ 0.4         │ "setosa" │
│ 7   │ 4.6          │ 3.4         │ 1.4          │ 0.3         │ "setosa" │
│ 8   │ 5.0          │ 3.4         │ 1.5          │ 0.2         │ "setosa" │
│ 9   │ 4.4          │ 2.9         │ 1.4          │ 0.2         │ "setosa" │
│ 10  │ 4.9          │ 3.1         │ 1.5          │ 0.1         │ "setosa" │
⋮
with 140 more rows.

julia> qry = @query :src |>
           filter(sepal_length > 5.0) |>
           select(species, sepal_length, petal_width)
Query with dummy source src

julia> collect(qry, src=iris_tbl)
Tables.Table
│ Row │ species  │ sepal_length │ petal_width │
├─────┼──────────┼──────────────┼─────────────┤
│ 1   │ "setosa" │ 5.1          │ 0.2         │
│ 2   │ "setosa" │ 5.4          │ 0.4         │
│ 3   │ "setosa" │ 5.4          │ 0.2         │
│ 4   │ "setosa" │ 5.8          │ 0.2         │
│ 5   │ "setosa" │ 5.7          │ 0.4         │
│ 6   │ "setosa" │ 5.4          │ 0.4         │
│ 7   │ "setosa" │ 5.1          │ 0.3         │
│ 8   │ "setosa" │ 5.7          │ 0.3         │
│ 9   │ "setosa" │ 5.1          │ 0.3         │
│ 10  │ "setosa" │ 5.4          │ 0.2         │
⋮
with 108 more rows.

julia> collect(qry, src=iris_sql)
Tables.Table
│ Row │ species  │ sepal_length │ petal_width │
├─────┼──────────┼──────────────┼─────────────┤
│ 1   │ "setosa" │ "5.1"        │ "0.2"       │
│ 2   │ "setosa" │ "5.4"        │ "0.4"       │
│ 3   │ "setosa" │ "5.4"        │ "0.2"       │
│ 4   │ "setosa" │ "5.8"        │ "0.2"       │
│ 5   │ "setosa" │ "5.7"        │ "0.4"       │
│ 6   │ "setosa" │ "5.4"        │ "0.4"       │
│ 7   │ "setosa" │ "5.1"        │ "0.3"       │
│ 8   │ "setosa" │ "5.7"        │ "0.3"       │
│ 9   │ "setosa" │ "5.1"        │ "0.3"       │
│ 10  │ "setosa" │ "5.4"        │ "0.2"       │
⋮
with 108 more rows.

davidagold · 2016-09-08T21:41:28Z

Also -- and this is just a suggestion -- would you be amenable to changing the name of the package to SQLTables in keeping with the Julia style of naming packages after types they provide? I kind of like that idea, but I'm really not wedded to it if you prefer SQLQuery or SQLQueries or something like that.

yeesian · 2016-09-09T14:58:05Z

I like the direction you're going with this PR, and think it should precede the work in #1. I also agree the package should be renamed to SQLTables.jl after this PR is merged into master.

It remains unclear to me what needs to be done on my end, so I'll wait for clearer directives from you after the dust settles.

davidagold · 2016-09-09T17:44:58Z

That sounds good. One thing I think you and I need to clear up is how to support manipulation verbs that come after a groupby invocation. IIRC, you prefer to have things like having and summarize/aggregate passed as arguments to the groupby verb, e.g.:

@query iris |>
    groupby(species, 
        having(mean(petal_length) > 1.5))

@query iris |>
    groupby(species, 
        aggregate(avg_petal_length = mean(petal_length)))

whereas I was leaning towards representing such qualifiers as their own manipulation verbs with QueryNode objects, e.g.

@query iris |>
    groupby(species) |>
    having(mean(petal_length) > 1.5)

@query iris |>
    groupby(species) |>
    summarize(avg_petal_length = mean(petal_length))

Note that summarize is its own manipulation verb regardless, since its application makes sense even outside the context of a groupby invocation. So users will be able to create the graph representing the lattermost query regardless. The question is whether or not we support this syntax in this package. I support it in the AbstractTables query interface, and it would be nice if code that relied on it were portable. Where are you on this issue?

EDIT: I should give my arguments in favor of at least supporting the latter syntax in the present package:

As mentioned above, summarize is supported in jplyr as a manipulation verb itself, hence users will be able to create graphs representing

@query iris |>
    groupby(species) |>
    summarize(avg_petal_length = mean(petal_length))

right out of the box. I think it would be strange not to support this syntax for SQL backends, especially when dplyr does support it (and hence users coming from dplyr may expect it).

Detecting new manipulation verbs and rendering them as nodes in a QueryNode graph is easier on the "parser" than detecting special arguments within a manipulation verb invocation.
The latter syntax better allows users to factor queries, e.g.

qry = @query iris |>
    groupby(species)

qrya = @query qry |>
    having(mean(petal_length) > 1.5)

qryb = @query qry |>
    summarize(avg_petal_width = mean(petal_width))
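
A sketch of what the factoring argument means at the graph level: if having and summarize are nodes of their own, one groupby node can serve as the shared input of two different queries. Everything here is hypothetical (the node types, their fields, and the translategrouped helper are not jplyr's or this package's API), and the aggregate expressions are assumed to be rendered as SQL already, e.g. mean(...) as AVG(...).

```julia
abstract type QueryNode end
struct SourceNode <: QueryNode end
struct GroupbyNode <: QueryNode
    input::QueryNode
    by::Vector{String}
end
struct HavingNode <: QueryNode
    input::QueryNode
    condition::String                          # SQL expression (assumption)
end
struct SummarizeNode <: QueryNode
    input::QueryNode
    assignments::Vector{Pair{String,String}}   # alias => SQL aggregate expression
end

# Hypothetical translator: gather GROUP BY, HAVING and aggregate clauses from the graph.
function translategrouped(tablename::String, node::QueryNode)
    groupcols, aggregates, conditions = String[], String[], String[]
    while !(node isa SourceNode)
        if node isa GroupbyNode
            append!(groupcols, node.by)
        elseif node isa HavingNode
            push!(conditions, node.condition)
        elseif node isa SummarizeNode
            for (alias, expr) in node.assignments
                push!(aggregates, "$expr AS $alias")
            end
        end
        node = node.input
    end
    columns = vcat(groupcols, aggregates)
    sql = "SELECT " * (isempty(columns) ? "*" : join(columns, ", ")) * " FROM " * tablename
    if !isempty(groupcols)
        sql *= " GROUP BY " * join(groupcols, ", ")
    end
    if !isempty(conditions)
        sql *= " HAVING " * join(conditions, " AND ")
    end
    return sql
end

# One groupby node, reused as the input of two separate queries.
qry  = GroupbyNode(SourceNode(), ["species"])
qrya = HavingNode(qry, "AVG(petal_length) > 1.5")
qryb = SummarizeNode(qry, ["avg_petal_width" => "AVG(petal_width)"])

println(translategrouped("iris", qrya))
# prints: SELECT species FROM iris GROUP BY species HAVING AVG(petal_length) > 1.5
println(translategrouped("iris", qryb))
# prints: SELECT species, AVG(petal_width) AS avg_petal_width FROM iris GROUP BY species
```

With the argument-style syntax, the qualifier would live inside the groupby node itself, so qrya and qryb could not share qry without copying and modifying it.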

yeesian · 2016-09-09T18:45:40Z

When we were discussing it some weeks back, there wasn't very good support for the notion of a GroupedTable. If we do have representations for them, I'll be okay with the suggestion to represent groupby qualifiers as their own manipulation verbs.

davidagold · 2016-10-01T00:32:08Z

All of those TO-DO items may be more appropriate for future PRs.

@yeesian, should the return type for collecting against a SQLTable be a Table or a DataFrame? I'm fine with either.

yeesian · 2016-10-01T03:45:31Z

I'm fine with either.

Let's make it a DataFrame then.

davidagold · 2016-10-19T04:09:22Z

@yeesian So, here's my current thinking about the groupby semantics.

For in-memory column-indexable Julia tables ("Julia tables" for short), I have a Grouped wrapper type, so @collect groupby(tbl, ...) for tbl::T returns a Grouped{T}. Verbs (select, filter, summarize) collected against a Grouped{T} return a Grouped{T} (but groupby would return a Grouped{Grouped{T}}, now that I think of it... got to fix that). So, there's something of a closure property holding with Grouped{T}. The Grouped wrapper type includes, among other things, information about the indices of the group levels.

However, collecting q = @query groupby(tbl, ...) against a tbl::SQLTable should not return a Grouped{DataFrame} (given that, in general, collecting a query against a SQLTable should return a DataFrame), because a SQLTable is not a Julia table, and hence there is no index information: we have no idea how whatever SQL backend is in use stores the grouped data. I propose that collecting q above should just return a DataFrame and a warning saying that all grouping information is lost when collecting as a DataFrame. A query doesn't return a Grouped{T} unless it can construct meaningful index representations of the groups.
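
For readers trying to picture the Grouped{T} wrapper and its closure property, here is a minimal sketch. It is not the AbstractTables implementation: the "table" is assumed to be a NamedTuple of equal-length column vectors, the field names and the filterrows helper are hypothetical, and regrouping by the combined keys is only one possible way to avoid the Grouped{Grouped{T}} result mentioned above.

```julia
# Hypothetical wrapper: the underlying table plus row indices for each group level.
struct Grouped{T}
    source::T
    by::Vector{Symbol}
    indices::Dict{Any,Vector{Int}}   # group level (a tuple of key values) => row numbers
end

# Group a column table (a NamedTuple of equal-length vectors) by one or more columns.
function groupby(tbl::NamedTuple, by::Vector{Symbol})
    indices = Dict{Any,Vector{Int}}()
    for row in 1:length(first(tbl))
        level = Tuple(tbl[k][row] for k in by)
        push!(get!(() -> Int[], indices, level), row)
    end
    return Grouped(tbl, by, indices)
end

# Assumed fix for nesting: regroup the underlying table by the combined keys,
# so the result is a Grouped{T} rather than a Grouped{Grouped{T}}.
groupby(g::Grouped, by::Vector{Symbol}) = groupby(g.source, vcat(g.by, by))

# Closure property: a verb applied to a Grouped{T} yields a Grouped{T} again.
function filterrows(g::Grouped, column::Symbol, pred)
    keep = findall(pred, g.source[column])
    filtered = map(col -> col[keep], g.source)   # map over a NamedTuple keeps the names
    return groupby(filtered, g.by)               # rebuild the group indices
end

tbl = (species = ["setosa", "setosa", "virginica"], petal_length = [1.4, 1.3, 6.0])
g  = groupby(tbl, [:species])
g2 = filterrows(g, :petal_length, x -> x > 1.35)
g3 = groupby(g, [:petal_length])

println(g.indices[("setosa",)])    # prints: [1, 2]
println(typeof(g2) == typeof(g))   # prints: true
println(g3 isa Grouped{typeof(tbl)})   # prints: true
```

The indices field is exactly what a SQL backend cannot supply, since row positions inside the database are not visible to Julia. That is why the proposal above falls back to an ungrouped DataFrame plus a warning.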

yeesian · 2016-10-19T12:59:39Z

I'm okay with the decision. Just some questions for my understanding:

> I propose that collecting q above should just return a DataFrame and a warning saying that all grouping information is lost when collecting as a DataFrame.

What do we mean by grouping information is lost?

> I have a Grouped wrapper type, so @collect groupby(tbl, ...) for tbl::T returns a Grouped{T}.

Is it meant to be an eventual replacement for GroupedDataFrames?

> However, collecting q = @query groupby(tbl, ...) against a tbl::SQLTable should not return a Grouped{DataFrame} ... A query doesn't return a Grouped{T} unless it can construct meaningful index representations of the groups.

I was under the impression a @query groupby(t::SQLTable, ...) would return a Grouped{SQLTable} instead. But collect(t::Grouped{SQLTable}) would return a DataFrame. A Grouped{SQLTable} might then just store information about the groupby expression (rather than indices of the group levels [like for "Julia Tables"]). Or am I missing something?

Initial implementation of SQLTable interface

2e5ad8f

davidagold mentioned this pull request Sep 8, 2016

Roadmap to 0.1.0 davidagold/StructuredQueries.jl#19

Open

11 tasks

davidagold mentioned this pull request Sep 9, 2016

Name this package davidagold/StructuredQueries.jl#8

Closed
