Adding support for Pinot#6719
Conversation
1305f3e to
70e8d2d
Compare
eacca98 to
8450d52
Compare
Codecov Report
@@ Coverage Diff @@
## master #6719 +/- ##
==========================================
+ Coverage 56.16% 56.16% +<.01%
==========================================
Files 527 527
Lines 23318 23376 +58
Branches 2794 2794
==========================================
+ Hits 13096 13130 +34
- Misses 9812 9836 +24
Partials 410 410
Continue to review full report at Codecov.
|
|
@mistercrunch / @betodealmeida ... can I haz review on this please :-). |
|
As this is very similar to stuff I've worked on, I can review, test and check for regressions against existing dbs if needed. |
|
Taking a look. :) |
|
@villebro I meant to tag you here. Please review! |
|
Will do @mistercrunch , will try to assemble my comments by the end of the weekend. One quick catch: @agrawaldevesh Can you add instructions to |
|
@agrawaldevesh I was able to get queries working from SQL Lab using your fork of However, I am unable to add a table (schema: |
@villebro , thanks for checking this out. I am not able to repro it with our Pinot installation but let me give the open source docker one a spin. I am wondering if you could please copy/paste (pastebin preferred) me the superset server's logs. I did notice that I have to make the schema name match the table name. (Although pinot does not really have any notion of schema per se, so I can change that). Can you try with choosing schema=baseballStats please. Thanks. |
|
Using the table name as schema ( |
villebro
left a comment
There was a problem hiding this comment.
I have tested against Postgres, Snowflake, Oracle and MSSQL which seem to work fine. However, I think the PR would benefit from moving all Pinot-specific logic to db_engine_specs.py, while making as few changes to sqla/models.py and viz.py as possible.
There was a problem hiding this comment.
In my limited testing using an int column I was unable to reproduce this requirement. For instance, select * from baseballStats where hits >= 1 and hits <= 10 produced the same result as select * from baseballStats where hits between 1 and 10. Is this different?
There was a problem hiding this comment.
I will remove this. They fixed this bug in the pinot server that it no longer requires a between :-).
There was a problem hiding this comment.
Moving this to db_engine_specs is probably a good thing even in the absence of pinot, reducing some unnecessary clutter. Perhaps compact pdf, time_grain and grain to the same row?
There was a problem hiding this comment.
Perhaps this could be moved to db_engine_specs.py to a method along the lines of def make_select_compatible(select_exprs, group_exprs), which for BaseEngineSpec returns select_exprsand in the case of Pinot performs the above logic. Check mutate_label(label) in db_engine_specs.py for an example.
There was a problem hiding this comment.
This seems very specific to Pinot. I wonder if this could be dealt with in db_engine_specs, too by adding a method called mutate_column_labels(...) or similar?
There was a problem hiding this comment.
I considered that (After looking at your diff :-) ). But I decided against that for this reason:
I think that the "labels" you want in a query is a property of the query and not a property of the table/db.
In your case, you were creating unique labels mapping to real labels. But I don't have unique labels. Further, the problem is that pinot does not return me the labels.
So if I do "select foo as bar from baz", I am going to get a column named "foo" from pinot instead of "baz" :(. I am simply working around that fact here.
Basically, this "labels_expected" has to be passed as another return value from the get_query_str. I feel it shouldn't be some "state" in either the "db" or the "table".
An alternative I did try was to re-parse the query in Pinot and extract the labels there, but it was proving a bit difficult to reliably do this re-parsing with all the query strings that superset can concoct. So I gave up there :)
There was a problem hiding this comment.
As I understand it, labels_expected is what we want the columns in the resulting dataframe to be called based on the original unmutated select_exprs above. In this sense I believe it is quite similar to what mutate_label() in db_engine_spec does, as it is something that we have to do due to quirkiness of the db. To illustrate with an extreme example, we might have to implement a db_engine_spec for an engine that decides to always return an extra trailing dummy column with null values, i.e. len(df.columns) == len(labels_expected) + 1. To cater for this possibility, it would be optimal for sqla/models.py to not have to be aware of this, but rather be able to push that handling to the respective db engine spec. For this scenario I think having a def mutate_df_columns(df: pd.DataFrame, labels_expected: Sequence[str]) -> pd.DataFrame in db_engine_specs.py would be quite appropriate; BaseEngineSpec would just return the unmutated df, while PinotEngineSpec would perform the above logic, and hypothetical TrailingDummyColEngineSpec would remove the redundant column. Etc etc.
P.S. The rename logic I have done is rather controversial and does bend the rules, and could probably have been solved more elegantly, but would have required a bigger refactor that would have been unfeasible. But as it was a common problem for several dbs, the current solution was deemed a good compromise.
There was a problem hiding this comment.
I think that the logic belongs in the DB API driver instead of Superset — it would also benefit other tools that use the driver.
I wrote a Google Spreadsheets connector that has the same problem, since the SQL API supported is very crude and doesn't accept aliases. Maybe you can reuse the helper function I created? It's based on moz-sql-parser instead of sqlparse:
https://github.com/betodealmeida/gsheets-db-api/blob/master/gsheetsdb/query.py#L93
There was a problem hiding this comment.
I am wondering if I could do that as a part of the future work :). ?
I agree with you that the ideal way to do this is to have the DBAPI deal with this and that requires doing a re-parse of the query to infer the required label names. Given all the difficulty of parsing (or unparsing) PQL, I would rather do this more properly than hack something in here to unblock this diff.
I anyway have to rework the pinot-dbapi to integrate it with dbapihelper, so I can also consider integrating it with this sql parser then too.
I am just anxious of hanging onto this PR in our internal branch here that increases our merge burden.
Thanks
There was a problem hiding this comment.
This is slightly difficult to follow; a comment explaining what is happening here might be helpful.
There was a problem hiding this comment.
Okay. This UDF is weird. I will add a link to its description and some rationale.
There was a problem hiding this comment.
It's super weird, I never understood how it works.
There was a problem hiding this comment.
This seems quite specific to Pinot, and is slightly difficult to follow.
There was a problem hiding this comment.
Will try to fix and/or comment. Thanks.
d309176 to
b614286
Compare
|
@villebro, I was able to confirm the issue with the OSS docker pinot. The issue was a bug in my pinotdb-dbapi code and I have updated that PR. Pinot does not really have the notion of schema. Usually the tableName is kept same as the schemaName but that's not the case in the provided test table baseballStats. Anyway, changing the sql editor query to read "select * from baseballStats" (instead of "select * from baseball.baseballStats") did the trick (after the pinotdb-dbapi fix). Thanks for your testing and helping me fix this !. Otherwise, I have incorporated the changes you requested and so I think this PR is safe to merge in. |
|
I will add Pinot to the docs/installation.rst after I am able to push betodealmeida/pinot-dbapi#3 and update the package pinotdb on pypi. Otherwise, people will get the older version that does not quite work. So I will do that in a different diff. |
|
@agrawaldevesh I can confirm that adding a table now works if schema is left empty. Also viewing table schema works in SQL Lab. I don't have much to add at this point, so handing this back to @betodealmeida and @mistercrunch . |
b614286 to
f8538cf
Compare
|
Handing it over to @betodealmeida who has a better sense of what should be in Superset vs in the dbapi driver and/or SQLAlchemy dialect |
betodealmeida
left a comment
There was a problem hiding this comment.
@agrawaldevesh this looks good, I just think that the expected aliases and the _num_metric_expressions should be done upstream. The main reason is that we have a button to send the user from the Explore View to SQL Lab with the query pre-populated, and they would get a query that doesn't work as expected.
I linked my Google Spreadsheets DB API driver, which does similar manipulations, maybe you can reuse some of the code in the Pinot DB API driver?
There was a problem hiding this comment.
These changes don't do anything — I assume they're the result of multiple commits in this PR?
There was a problem hiding this comment.
Why not just write:
qry._num_metric_expressions = len(metrics_exprs)But this kind of hack worries me a bit. What are you using for parsing the query? For my Google Spreadsheets DB API driver I used moz-sql-parser, and I found it much easier to work with, since it translates SQL into JSON.
I think it would be better to push this and the expected labels upstream. Otherwise, when we send the user from Explore to SQL Lab with the query pre-populated, the query wouldn't run as expected.
|
@betodealmeida , Thank you for the code review ! and the very valid concern that this approach won't work for the "SQL Lab <-> Explore" use case. I am having some success with the moz-sql-parser as you suggested (thank you !). I am gonna update this PR in a couple of days to work with the revised pinot-dbapi changes. |
d26ca6b to
60caae7
Compare
|
@betodealmeida, Thanks for the comments and the suggestions. I tried them but it won't work :( The problem is that Pinot does not like having grouping expressions in the select clause. A query like this will give some PQL NullPointerException: SELECT a as A, It is okay with the "select a as A" part but not by the time function in the select clause. ie, I have to rewrite this query like this: SELECT a as A, But note that we want the timestamp field to show up as a special column name __timestamp in the visualization, and the group-by cannot take a column alias (ie I cannot do "group by a, blah as foo"). Which means that I somehow need to tell the visualization code which column is __timestamp and stamp that in so. I could hack this in another way by keeping track of the __timestamp column only and stamping that in, but no matter what I do, I need some way to "rename" something in the returned dataframe. The real way to fix this would be to fix this in Pinot itself. Were it not for this parse error, we can have the "select X as Y" in the query and then use the moz-sql-parser to get the aliases and stamp them back. But as it stands now, this is not this is not doable :(. (I can work on that on the pinot side that would take a while to get pushed through and I think that some pinot support rather than none is better until then) I decided against creating a PQL string that I then reparse and change under the covers before giving it to Pinot: I would like the generated query to be something that the user can take it and directly do a REST query to Pinot and it should work. It shouldn't be a Pinot-QL dialect that is private to Superset. The labels_expected aspect is local to the visualization logic: it expects the dataframe to have the __timestamp column to identify the time dimension. Alternatively, we can enforce that the time column is always the first column or such but that would be a bigger change everywhere. Now, as for the Explore to SQL-Lab exporting is concerned: All that will happen here is that the column names in the SQL-lab will show up as different from what the visualization wants them to be. I think that is okay ? The user can always choose not to expose the Pinot database in the SQL-Lab if he finds that annoying. We can say in the docs that the pinot support is experimental and clearly state the issues with the SQL-lab column naming issue. Thanks for the review and let me know if you have a suggestion out of this conundrum. I removed the hack for the num_metrics_expr and instead am fixing it in the pinot-dbapi in this PR: betodealmeida/pinot-dbapi#5 |
60caae7 to
475cff4
Compare
Summary: Added limited support for visualizations with Pinot via Sqlalchemy. Pinot QL (PQL) is a bit weird and limited, and this patch hacks superset to deal with that weirdness: 1. Pinot's grouping by time is best done as a long epoch. Grouping by a time string is really slow and times out. 2. Pinot's response does not respect column aliases. So columns are not named what they are expected to. So we remember the given column aliases and then stamp them back onto the dataframe 3. Pinot's Json rest call does not return the output types. Instead everything is cast to string. So when grouping by time, the group key is integral and has to be treated specially when casting back to the dataframe __timestamp column. 4. Finally, pinot does support grouping by on expressions. But those expressions cannot then appear on the select clause. They are returned regardless in the response. ie, 'select foo, count(*) from bar group by foo' is okay, but 'select expr(foo), count(*) from bar group by expr(foo)' ain't. One must use 'select count(*) from bar group by expr(foo)'. I also fixed a couple of things that looked like bugs to me: for example, the row-ordering-limit should come at the end always. Test Plan: Tested with the modified pinotdb sqlalchemy driver and an internal pinot cluster. The pinotdb driver changes are in https://github.com/agrawaldevesh/pinot-dbapi. Pinot does not support orderby-limit for aggregated queries. To annotate a query as an aggregate query, this patch adds a hint to the prepared select statement that the pinotdb sqlalchemy driver then heeds.
475cff4 to
f9da83f
Compare
|
@betodealmeida, Please take a look :). Thank you !. |
I see. I think this is a valid design decision. Essentially we need to make a choice:
For the Google Spreadsheets connector I went with (1), mainly because the user will never send the SQL to Google Spreadsheets since it only support very basic queries and everything else is done through pre and post processors, so it was an easy decision. For Druid (1) and (2) are identical, so no choice needs to be done. Ideally the changes should be pushed upstream to Pinot. When I presented the first draft of the Python DB driver they seemed excited and open to improving the SQL parser, so that might be our best option. |
betodealmeida
left a comment
There was a problem hiding this comment.
I think this is a good approach, given the requirements. Hopefully we can improve the SQL parser in Pinot and remove some of this code in the near future.
|
|
||
| select_exprs += metrics_exprs | ||
|
|
||
| labels_expected = [str(c.name) for c in select_exprs] |
There was a problem hiding this comment.
This actually also fixes a bug I found with Postgres, where the query is:
SELECT SUM(cnt) AS "SUM(cnt)" FROM ...
But the cursor returns a column named sum(cnt), lowercase, and then Pandas errors out when pivoting the table.
|
Hi @betodealmeida ... sorry I forgot to follow up on your comments here: I will surely be driving for SQL parser changes in Pinot. We have communicated that with the Pinot team(s) and they are onboard with that modulo backward compatibility. |
| try: | ||
| int(one_ts_val) | ||
| is_integral = True | ||
| except ValueError: |
Summary: Added limited support for visualizations with Pinot via Sqlalchemy. Pinot QL (PQL) is a bit weird and limited, and this patch hacks superset to deal with that weirdness: 1. Pinot's grouping by time is best done as a long epoch. Grouping by a time string is really slow and times out. 2. Pinot's response does not respect column aliases. So columns are not named what they are expected to. So we remember the given column aliases and then stamp them back onto the dataframe 3. Pinot's Json rest call does not return the output types. Instead everything is cast to string. So when grouping by time, the group key is integral and has to be treated specially when casting back to the dataframe __timestamp column. 4. Finally, pinot does support grouping by on expressions. But those expressions cannot then appear on the select clause. They are returned regardless in the response. ie, 'select foo, count(*) from bar group by foo' is okay, but 'select expr(foo), count(*) from bar group by expr(foo)' ain't. One must use 'select count(*) from bar group by expr(foo)'. I also fixed a couple of things that looked like bugs to me: for example, the row-ordering-limit should come at the end always. Test Plan: Tested with the modified pinotdb sqlalchemy driver and an internal pinot cluster. The pinotdb driver changes are in https://github.com/agrawaldevesh/pinot-dbapi. Pinot does not support orderby-limit for aggregated queries. To annotate a query as an aggregate query, this patch adds a hint to the prepared select statement that the pinotdb sqlalchemy driver then heeds.
Summary: Added limited support for visualizations with Pinot via
Sqlalchemy.
Pinot QL (PQL) is a bit weird and limited, and this patch hacks superset to
deal with that weirdness:
Pinot's grouping by time is best done as a long epoch. Grouping by a
time string is really slow and times out.
Pinot's response does not respect column aliases. So columns are not
named what they are expected to. So we remember the given column aliases
and then stamp them back onto the dataframe
Pinot's Json rest call does not return the output types. Instead
everything is cast to string. So when grouping by time, the group key
is integral and has to be treated specially when casting back to the
dataframe __timestamp column.
Finally, pinot does support grouping by on expressions. But those
expressions cannot then appear on the select clause. They are returned
regardless in the response. ie, 'select foo, count() from bar group by
foo' is okay, but 'select expr(foo), count() from bar group by
expr(foo)' ain't. One must use 'select count(*) from bar group by
expr(foo)'.
Pinot does not allow order-by on queries that have any aggregates (whether with group by or without)
I also fixed a couple of things that looked like bugs to me: for
example, the row-ordering-limit should come at the end always.
Test Plan: Tested with the modified pinotdb sqlalchemy driver and an
internal pinot cluster. The pinotdb driver changes are in
https://github.com/agrawaldevesh/pinot-dbapi.
Pinot does not support orderby-limit for aggregated queries. To annotate
a query as an aggregate query, this patch adds a hint to the prepared
select statement that the pinotdb sqlalchemy driver then heeds.