@davidagold davidagold commented Sep 8, 2016

This PR reframes the SQL translation regime implemented in SQLQuery as an extension of the jplyr querying framework. The translation logic is left essentially untouched.

This PR introduces the following broad changes:

  1. It introduces the abstract SQLTable <: AbstractTable and the concrete SQLiteTable <: SQLTable. The latter is a thin wrapper around a SQLite.DB connection and a field that names the table.
  2. The present package now relies on jplyr to provide the graph-generating @query macro. The intended user-facing interface is @query src |> ..., where src is an object of type T <: SQLTable; this generates a Query{T}, which users can then collect against src. (Note that jplyr does not support all of the QueryNode leaf subtypes that the present package does, in particular DistinctNode, LimitNode, and OffsetNode. The present package therefore introduces these types and illustrates (or will illustrate) jplyr's QueryNode registration mechanism, which still needs some work.)
  3. The SQL translation machinery is thinly wrapped by Base.collect(tbl::SQLTable, graph::jplyr.QueryNode), which translates the graph into a SQL string via translatesql and runs the query against tbl. By default, the result set is streamed into a Tables.Table.
  4. A bunch of file reorganization. The philosophy is to segregate functionality by the primary type it concerns (e.g. QueryNode objects, SQLTables or SQLiteTables). In the sqltable and sqlitetable folders, a query folder houses the code relevant to extending the jplyr collect machinery. translate.jl lives in sqltable/query.
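
A minimal sketch of how the type hierarchy and the collect wrapper described in items 1 and 3 might fit together. It is not the package's code: AbstractTable, QueryNode, and SelectNode are hypothetical stand-ins defined locally (the real ones live in AbstractTables and jplyr), the database connection is replaced by a plain string, the toy translatesql handles a single node, and the syntax assumes Julia 1.x.

```julia
# Hypothetical stand-ins: the real AbstractTable and QueryNode come from
# AbstractTables and jplyr, which are not loaded here.
abstract type AbstractTable end
abstract type QueryNode end

# Abstract SQL-backed table and a concrete SQLite flavour.
abstract type SQLTable <: AbstractTable end

struct SQLiteTable <: SQLTable
    db::String         # stand-in for a SQLite.DB connection (assumption)
    tablename::String  # name of the table inside the database
end

# A single toy node type so the example has a graph to translate.
struct SelectNode <: QueryNode
    columns::Vector{Symbol}
end

# Toy translation of one node; the real translatesql handles whole graphs.
translatesql(tbl::SQLTable, node::SelectNode) =
    "SELECT " * join(string.(node.columns), ", ") * " FROM " * tbl.tablename

# Thin wrapper: translate the graph, then (in the real package) run the SQL
# against the database. This sketch only returns the SQL string.
function Base.collect(tbl::SQLTable, graph::QueryNode)
    sql = translatesql(tbl, graph)
    return sql
end

iris = SQLiteTable("iris.db", "iris")
println(collect(iris, SelectNode([:species, :sepal_length])))
```

Dispatching on the abstract SQLTable keeps the translation shared across backends, while a concrete type such as SQLiteTable only has to supply the connection details.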

TO-DO (not exhaustive):

  • finalize and implement jplyr.QueryNode registration framework
  • implement primitives for QueryNode leaf subtypes, for SQLTables/SQLiteTables
  • polish DataStreams interfacing
  • documentation!
  • tests
  • figure out story of registering/requiring AbstractTables/Tables/jplyr

Here's a teaser (relying on new work in Tables/AbstractTables/jplyr that I haven't yet pushed):

julia> using SQLQuery

julia> iris_sql = SQLiteTable("/Users/David/.julia/v0.5/SQLQuery/db/iris.db", "iris")
SQLQuery.SQLiteTable(SQLite.DB("/Users/David/.julia/v0.5/SQLQuery/db/iris.db"),"iris")

julia> iris_tbl = Table(CSV.Source("/Users/David/.julia/v0.5/SQLQuery/csv/iris.csv"))
Tables.Table
│ Row │ sepal_length │ sepal_width │ petal_length │ petal_width │ species  │
├─────┼──────────────┼─────────────┼──────────────┼─────────────┼──────────┤
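│ 1   │ 5.1          │ 3.5         │ 1.4          │ 0.2         │ "setosa" │
│ 2   │ 4.9          │ 3.0         │ 1.4          │ 0.2         │ "setosa" │
│ 3   │ 4.7          │ 3.2         │ 1.3          │ 0.2         │ "setosa" │
│ 4   │ 4.6          │ 3.1         │ 1.5          │ 0.2         │ "setosa" │
│ 5   │ 5.0          │ 3.6         │ 1.4          │ 0.2         │ "setosa" │
│ 6   │ 5.4          │ 3.9         │ 1.7          │ 0.4         │ "setosa" │
│ 7   │ 4.6          │ 3.4         │ 1.4          │ 0.3         │ "setosa" │
│ 8   │ 5.0          │ 3.4         │ 1.5          │ 0.2         │ "setosa" │
│ 9   │ 4.4          │ 2.9         │ 1.4          │ 0.2         │ "setosa" │
│ 10  │ 4.9          │ 3.1         │ 1.5          │ 0.1         │ "setosa" │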
with 140 more rows.

julia> qry = @query :src |>
           filter(sepal_length > 5.0) |>
           select(species, sepal_length, petal_width)
Query with dummy source src

julia> collect(qry, src=iris_tbl)
Tables.Table
│ Row │ species  │ sepal_length │ petal_width │
├─────┼──────────┼──────────────┼─────────────┤
│ 1   │ "setosa" │ 5.1          │ 0.2         │
│ 2   │ "setosa" │ 5.4          │ 0.4         │
│ 3   │ "setosa" │ 5.4          │ 0.2         │
│ 4   │ "setosa" │ 5.8          │ 0.2         │
│ 5   │ "setosa" │ 5.7          │ 0.4         │
│ 6   │ "setosa" │ 5.4          │ 0.4         │
│ 7   │ "setosa" │ 5.1          │ 0.3         │
│ 8   │ "setosa" │ 5.7          │ 0.3         │
│ 9   │ "setosa" │ 5.1          │ 0.3         │
│ 10  │ "setosa" │ 5.4          │ 0.2         │
with 108 more rows.

julia> collect(qry, src=iris_sql)
Tables.Table
│ Row │ species  │ sepal_length │ petal_width │
├─────┼──────────┼──────────────┼─────────────┤
│ 1   │ "setosa" │ "5.1"        │ "0.2"       │
│ 2   │ "setosa" │ "5.4"        │ "0.4"       │
│ 3   │ "setosa" │ "5.4"        │ "0.2"       │
│ 4   │ "setosa" │ "5.8"        │ "0.2"       │
│ 5   │ "setosa" │ "5.7"        │ "0.4"       │
│ 6   │ "setosa" │ "5.4"        │ "0.4"       │
│ 7   │ "setosa" │ "5.1"        │ "0.3"       │
│ 8   │ "setosa" │ "5.7"        │ "0.3"       │
│ 9   │ "setosa" │ "5.1"        │ "0.3"       │
│ 10  │ "setosa" │ "5.4"        │ "0.2"       │
with 108 more rows.
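
To illustrate what collecting the teaser query against a SQL source roughly involves, here is a small sketch that flattens a filter |> select chain into one SQL string. Node, Source, Filter, Select, and to_sql are hypothetical names invented for this example, predicates are kept as pre-rendered strings, and the flattening ignores subtleties such as a filter that refers to a column dropped by an earlier select. The actual translatesql works on jplyr's graph and may differ substantially.

```julia
# Hypothetical node types; each verb holds a reference to its input.
abstract type Node end

struct Source <: Node
    name::String
end
struct Filter <: Node
    input::Node
    predicate::String   # already rendered as SQL text (assumption)
end
struct Select <: Node
    input::Node
    columns::Vector{String}
end

# Walk from the outermost verb down to the source, gathering clauses.
function to_sql(node::Node)
    columns = nothing
    predicates = String[]
    while !(node isa Source)
        if node isa Select
            # The outermost select decides the projected columns.
            columns === nothing && (columns = join(node.columns, ", "))
        elseif node isa Filter
            # pushfirst! restores the order in which the filters were written.
            pushfirst!(predicates, node.predicate)
        end
        node = node.input
    end
    sql = "SELECT " * something(columns, "*") * " FROM " * node.name
    isempty(predicates) || (sql *= " WHERE " * join(predicates, " AND "))
    return sql
end

qry = Select(Filter(Source("iris"), "sepal_length > 5.0"),
             ["species", "sepal_length", "petal_width"])
println(to_sql(qry))
```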

@davidagold
Author

Also -- and this is just a suggestion -- would you be amenable to changing the name of the package to SQLTables in keeping with the Julia style of naming packages after types they provide? I kind of like that idea, but I'm really not wedded to it if you prefer SQLQuery or SQLQueries or something like that.

@yeesian
Owner

yeesian commented Sep 9, 2016

I like the direction you're going with this PR, and think it should precede the work in #1. I also agree the package should be renamed to SQLTables.jl after this PR is merged into master.

It remains unclear to me what needs to be done on my end, so I'll wait for clearer directives from you after the dust settles.

@davidagold
Author

davidagold commented Sep 9, 2016

That sounds good. One thing I think you and I need to clear up is how to support manipulation verbs that come after a groupby mention. IIRC, you prefer to have things like having and summarize/aggregate passed as arguments to the groupby verb, e.g.:

@query iris |>
    groupby(species, 
        having(mean(petal_length) > 1.5))

@query iris |>
    groupby(species, 
        aggregate(avg_petal_length = mean(petal_length)))

whereas I was leaning towards representing such qualifiers as their own manipulation verbs with QueryNode objects, e.g.

@query iris |>
    groupby(species) |>
    having(mean(petal_length) > 1.5)

@query iris |>
    groupby(species) |>
    summarize(avg_petal_length = mean(petal_length))

Note that summarize is its own manipulation verb regardless, since its application makes sense even outside the context of a groupby invocation. So users will be able to create the graph representing the last query above regardless. The question is whether or not we support this syntax in this package. I support it in the AbstractTables query interface, and it would be nice if code that relied on it were portable. Where are you on this issue?

EDIT: I should give my arguments in favor of at least supporting the latter syntax in the present package:

  1. As mentioned above, summarize is supported in jplyr as a manipulation verb itself, hence users will be able to create graphs representing
@query iris |>
    groupby(species) |>
    summarize(avg_petal_length = mean(petal_length))

right out of the box. I think it would be strange not to support this syntax for SQL backends, especially when dplyr does support it (and hence users coming from dplyr may expect it).

  2. Detecting new manipulation verbs and rendering them as nodes in a QueryNode graph is easier on the "parser" than detecting special arguments within a manipulation verb invocation.
  3. The latter syntax better allows users to factor queries, e.g.
qry = @query iris |>
    groupby(species)

qrya = @query qry |>
    having(mean(petal_length) > 1.5)

qryb = @query qry |>
    summarize(avg_petal_width = mean(petal_width))
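
A rough sketch of why separate verbs factor well: when having and summarize are nodes of their own, one groupby node can be shared by several downstream queries, each of which still renders to a single SQL statement. Verb, TableRef, GroupBy, Having, Summarize, and grouped_sql are hypothetical names, conditions and aggregates are pre-rendered SQL strings, and the mapping of mean to AVG is an assumption of this example rather than something taken from the package.

```julia
# Hypothetical verb nodes; each one points at its input.
abstract type Verb end

struct TableRef <: Verb
    name::String
end
struct GroupBy <: Verb
    input::Verb
    by::Vector{String}
end
struct Having <: Verb
    input::Verb
    condition::String   # pre-rendered SQL, e.g. mean(...) written as AVG(...)
end
struct Summarize <: Verb
    input::Verb
    alias::String
    expr::String
end

# Render a chain that contains a GroupBy node (assumed present) as SQL.
function grouped_sql(node::Verb)
    by = String[]
    aggregates = String[]
    having = ""
    while !(node isa TableRef)
        if node isa GroupBy
            by = node.by
        elseif node isa Having
            having = node.condition
        elseif node isa Summarize
            pushfirst!(aggregates, node.expr * " AS " * node.alias)
        end
        node = node.input
    end
    sql = "SELECT " * join(vcat(by, aggregates), ", ") * " FROM " * node.name
    isempty(by) || (sql *= " GROUP BY " * join(by, ", "))
    isempty(having) || (sql *= " HAVING " * having)
    return sql
end

# One shared groupby node, two queries built on top of it.
grouped = GroupBy(TableRef("iris"), ["species"])
qrya = Having(grouped, "AVG(petal_length) > 1.5")
qryb = Summarize(grouped, "avg_petal_width", "AVG(petal_width)")

println(grouped_sql(qrya))
println(grouped_sql(qryb))
```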

@yeesian
Owner

yeesian commented Sep 9, 2016

When we were discussing it some weeks back, there wasn't very good support for the notion of a GroupedTable. If we do have representations for them, I'll be okay with the suggestion to represent groupby qualifiers as their own manipulation verbs.

@davidagold
Author

All of those TO-DO items may be more appropriate for future PRs.

@yeesian, should the return type for collecting against a SQLTable be a Table or a DataFrame? I'm fine with either.

@yeesian
Owner

yeesian commented Oct 1, 2016

I'm fine with either.

Let's make it a DataFrame then.

@davidagold
Author

@yeesian So, here's my current thinking about the groupby semantics.

For in-memory column-indexable Julia tables ("Julia tables" for short), I have a Grouped wrapper type, so @collect groupby(tbl, ...) for tbl::T returns a Grouped{T}. Verbs (select, filter, summarize) collected against a Grouped{T} return a Grouped{T} (but groupby would return a Grouped{Grouped{T}}, now that I think of it... got to fix that). So, there's something of a closure property holding with Grouped{T}. The Grouped wrapper type includes, among other things, information about the indices of the group levels.

However, collecting q = @query groupby(tbl, ...) against a tbl::SQLTable should not return a Grouped{DataFrame} (given that, in general, collecting a query against a SQLTable should return a DataFrame), because a SQLTable is not a Julia table: there is no index information, since we have no idea how the SQL backend stores the grouped data. I propose that collecting q above should just return a DataFrame and a warning saying that all grouping information is lost when collecting as a DataFrame. A query doesn't return a Grouped{T} unless it can construct meaningful index representations of the groups.
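
To make the Grouped{T} idea concrete, here is a minimal sketch under loud assumptions: a "Julia table" is modelled as a NamedTuple of column vectors, and the field names (source, keys, indices) are invented for illustration rather than taken from the actual wrapper. It shows the closure property for groupby: grouping an already grouped table refines the groups and stays a Grouped{T} instead of becoming a Grouped{Grouped{T}}.

```julia
# Hypothetical wrapper: a table plus the row indices of each group level.
struct Grouped{T}
    source::T
    keys::Vector{Symbol}
    indices::Dict{Any,Vector{Int}}
end

# A "Julia table" here is just a NamedTuple of equal-length column vectors.
function groupby(tbl::NamedTuple, key::Symbol)
    indices = Dict{Any,Vector{Int}}()
    for (i, level) in enumerate(tbl[key])
        push!(get!(indices, level, Int[]), i)
    end
    return Grouped(tbl, [key], indices)
end

# Grouping a Grouped{T} splits each existing group by the new key, so the
# result is again a Grouped{T} over the same underlying table.
function groupby(g::Grouped, key::Symbol)
    indices = Dict{Any,Vector{Int}}()
    for (level, rows) in g.indices, i in rows
        push!(get!(indices, (level, g.source[key][i]), Int[]), i)
    end
    return Grouped(g.source, vcat(g.keys, key), indices)
end

tbl = (species = ["setosa", "virginica", "setosa"], size = [1, 2, 1])
g = groupby(tbl, :species)
g2 = groupby(g, :size)
println(typeof(g2) == typeof(g))
```

A SQL backend cannot supply the indices field, which is the reason given above for returning a plain DataFrame in that case.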

@yeesian
Owner

yeesian commented Oct 19, 2016

I'm okay with the decision. Just some questions for my understanding:

> I propose that collecting q above should just return a DataFrame and a warning saying that all grouping information is lost when collecting as a DataFrame.

What do we mean by grouping information is lost?

> I have a Grouped wrapper type, so @collect groupby(tbl, ...) for tbl::T returns a Grouped{T}.

Is it meant to be an eventual replacement for GroupedDataFrames?

> However, collecting q = @query groupby(tbl, ...) against a tbl::SQLTable should not return a Grouped{DataFrame} ... A query doesn't return a Grouped{T} unless it can construct meaningful index representations of the groups.

I was under the impression a @query groupby(t::SQLTable, ...) would return a Grouped{SQLTable} instead. But collect(t::Grouped{SQLTable}) would return a DataFrame. A Grouped{SQLTable} might then just store information about the groupby expression (rather than indices of the group levels [like for "Julia Tables"]). Or am I missing something?
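
One way to picture the alternative raised here is a wrapper that records only the groupby expression and defers everything else to SQL. The sketch below is purely illustrative: SQLTableStub, LazyGrouped, lazygroupby, and summarize_sql are hypothetical names, and the aggregate is passed as pre-rendered SQL text.

```julia
# Hypothetical stand-in for a SQL-backed table.
struct SQLTableStub
    name::String
end

# Stores the grouping columns themselves, not row indices.
struct LazyGrouped{T}
    source::T
    by::Vector{Symbol}
end

lazygroupby(tbl::SQLTableStub, by::Symbol...) = LazyGrouped(tbl, Symbol[by...])

# The stored expression is enough to emit a GROUP BY clause later on.
function summarize_sql(g::LazyGrouped{SQLTableStub}, alias::Symbol, expr::String)
    by = join(string.(g.by), ", ")
    return "SELECT $by, $expr AS $alias FROM $(g.source.name) GROUP BY $by"
end

g = lazygroupby(SQLTableStub("iris"), :species)
println(summarize_sql(g, :avg_petal_length, "AVG(petal_length)"))
```

Collecting such a wrapper would still yield a plain table, since the grouping lives only in the generated SQL.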
