
Big Query Data Samples #948

Merged: emtwo merged 1 commit into master from emtwo/bq_schema, May 10, 2019

Conversation


@emtwo emtwo commented May 3, 2019

No description provided.

@emtwo emtwo force-pushed the emtwo/bq_schema branch from 688e19a to b30d6fe on May 3, 2019 17:26
@emtwo emtwo requested a review from washort May 3, 2019 18:44
@jezdez jezdez requested review from jezdez and removed request for washort May 8, 2019 18:10

@jezdez jezdez left a comment


Some Python API design changes are needed, plus a bit more extensive use of the BigQuery API.

Comment thread redash/query_runner/big_query.py Outdated

Isn't that really a "field name" and not a "column name"?

Comment thread redash/query_runner/big_query.py Outdated

I'm not a huge fan of extending a method by returning tuples of values that are related but not quite the same: it makes type checking of the return value harder, encourages wrong data unpacking, and invites even more extended return types in the future.

And since it seems this method isn't used elsewhere, can we just make this a _get_column_metadata method that returns the list of metadata for the given column, including the name that _get_columns_schema needs?

In _get_columns_schema the columns value could then be extended with columns.extend(map(operator.itemgetter('name'), metadatum)). What do you think?

Author

I like this idea, thanks!

Comment thread redash/query_runner/big_query.py Outdated

More like field_name not column name here, right?

Comment thread redash/query_runner/big_query.py Outdated

Let's move this line into an else clause below.

Comment thread redash/query_runner/big_query.py Outdated

Hmmm, not happy about this type check here since it'll be expensive. Could we do this differently?


Oh, also isinstance like below.
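The style point being made can be illustrated with a small, purely hypothetical example (the function and branch values below are invented for illustration, not taken from the PR):

```python
# isinstance() is the idiomatic way to branch on a value's type in Python;
# unlike `type(value) == dict`, it also handles subclasses correctly.
def describe(value):
    if isinstance(value, dict):
        return 'record'
    if isinstance(value, list):
        return 'repeated'
    return 'scalar'

print(describe({'a': 1}))  # record
print(describe([1, 2]))    # repeated
print(describe(3))         # scalar
```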

Comment thread redash/query_runner/big_query.py Outdated

Is there a chance that the provided table name contains more than one dot? Passing a limiting second parameter, table_name.split('.', 1), could prevent accidental ValueErrors during unpacking.

Comment thread redash/query_runner/big_query.py Outdated

Mind writing this out rather than in shorthand? It's easier to debug that way.

Comment thread redash/query_runner/big_query.py Outdated

This needs to use the fields parameter for partial responses and pagination to be a functioning query call, like I did in https://github.com/getredash/redash/pull/3673/files#diff-79a49f870dc6fe9bd78c6c81e5d3b200R267.

The while loop handles the pagination; you must request the nextPageToken to make it work.

Docs for the "partial response": https://developers.google.com/api-client-library/python/guide/performance#partial-response-fields-parameter

API docs for tabledata listing: https://developers.google.com/apis-explorer/#p/bigquery/v2/bigquery.tabledata.list

API docs for table listing: https://developers.google.com/apis-explorer/#p/bigquery/v2/bigquery.tables.list
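A hedged sketch of the pagination pattern described above, in the google-api-python-client style. The service object, project/dataset names, and the exact fields string are assumptions for illustration; the real call in the linked PR may differ.

```python
def list_tables(service, project_id, dataset_id):
    """Collect all tables across pages (illustrative sketch).

    `fields` requests a partial response so only the data we need is
    returned; the while loop follows nextPageToken until exhausted.
    """
    tables = []
    page_token = None
    while True:
        res = service.tables().list(
            projectId=project_id,
            datasetId=dataset_id,
            pageToken=page_token,
            fields='nextPageToken,tables(tableReference(tableId))',
        ).execute()
        tables.extend(res.get('tables', []))
        page_token = res.get('nextPageToken')
        if not page_token:
            break
    return tables
```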

Author

Hm, I've looked into this a bit and I have a few comments on stuff that works and stuff that I think doesn't quite work:

  • Adding fields to both these API calls seems to work well and makes sense. Thanks for pointing this out!
  • I'm actually not using tables().list() but tables().get() (https://developers.google.com/apis-explorer/#p/bigquery/v2/bigquery.tables.get), which doesn't return or use a nextPageToken. I think since it's fetching only one table it doesn't need to paginate.
  • In terms of my use of tabledata().list(), I've also added the parameter maxResults=1 since I'm only requesting one example for a given table. For this reason, I didn't paginate the result. We could potentially add pagination as a precaution, but as far as I can tell it's not necessary.
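The single-row fetch described in the last bullet can be sketched roughly as follows. The function name and parameter values are assumptions; only the tabledata().list() call shape and maxResults=1 come from the discussion above.

```python
def get_sample_row(service, project_id, dataset_id, table_id):
    """Fetch at most one example row from a table (illustrative sketch).

    maxResults=1 asks the API for a single row, which is why the
    pagination loop is deliberately omitted here.
    """
    res = service.tabledata().list(
        projectId=project_id,
        datasetId=dataset_id,
        tableId=table_id,
        maxResults=1,
        fields='rows',
    ).execute()
    rows = res.get('rows', [])
    return rows[0] if rows else None
```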


Okay, that makes sense, apologies for misreading the API that is used here, my fault.

About pagination, the only risk I see is if for some reason we'd remove the maxResults parameter in the future and not notice that pagination is missing, since the code is not expressive enough to show that. So either just add the while loop or maybe add a comment? Not sure what makes more sense 😬

Author

Agreed, I'll leave a comment on this.

Comment thread redash/query_runner/big_query.py Outdated

Let's be more specific about which exceptions could happen here, so we don't accidentally hide things we aren't expecting. The logger.exception call doesn't do much right now since we don't look for it.
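The pattern being asked for looks something like the sketch below. Everything here is illustrative: KeyError/IndexError stand in for the specific API errors the real code would name (e.g. the Google API client's HttpError), and the function is invented for the example.

```python
import logging

logger = logging.getLogger(__name__)

def get_sample(record):
    """Catch only the errors we expect; let everything else propagate.

    A bare `except Exception` would silently swallow bugs; naming the
    expected exceptions keeps unexpected failures visible.
    """
    try:
        return record['rows'][0]
    except (KeyError, IndexError):
        logger.exception('No sample rows available')
        return None

print(get_sample({'rows': ['first']}))  # first
print(get_sample({}))                   # None
```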

Comment thread tests/query_runner/test_bigquery.py Outdated

Let's use parentheses instead of backslashes.
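For readers unfamiliar with this style rule, the two continuation styles look like this (values are arbitrary):

```python
# Discouraged: backslash line continuation is fragile (trailing
# whitespace after the backslash is a syntax error).
total = 1 + 2 + \
    3 + 4

# Preferred: implicit continuation inside parentheses.
total = (1 + 2 +
         3 + 4)
print(total)  # 10
```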


emtwo commented May 10, 2019

I wanted to make a note here so I don't forget. When processing some queries in the BQ samples locally, a lot of Access Denied errors came up. Before merging, we should check in with Jason whether this was just because of my personal permissions and it should work fine otherwise.


jezdez commented May 10, 2019

@emtwo Good idea, were those Access Denied errors showing up in a place where the exception handling could catch them?


jezdez commented May 10, 2019

FWIW, this is good to go with the caveat around pagination.


emtwo commented May 10, 2019

@jezdez The HttpError handler catches the access denied issue, so it's not something the code can't handle; rather, a bunch of samples end up not getting populated. I suppose we can just see how this does in staging.
