Only load fields that are needed from Google API for Big Query schema #787

Closed
jezdez wants to merge 32 commits into master from bug785

Conversation


@jezdez jezdez commented Jan 25, 2019

Refs #785.

I used the Google API Explorer to build the values for the fields parameters: e.g. https://developers.google.com/apis-explorer/#p/bigquery/v2/bigquery.tables.get

More info for the fields parameter: https://developers.google.com/api-client-library/python/guide/performance#partial-response-fields-parameter
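For context, a partial-response request with the google-api-python-client looks roughly like this. This is a sketch, not the PR's actual diff: the `service` object is assumed to be the BigQuery service that Redash's query runner builds, and the exact `fields` string here is illustrative.

```python
# Sketch of the partial-response technique this PR applies (illustrative,
# not the actual diff). `service` is assumed to be a google-api-python-client
# BigQuery service object, as built by Redash's BigQuery query runner.
def list_table_names(service, project_id, dataset_id):
    """List table names, asking the API to return only the fields we use."""
    request = service.tables().list(
        projectId=project_id,
        datasetId=dataset_id,
        # Partial response: only each table's tableId plus the paging token.
        fields="tables/tableReference/tableId,nextPageToken",
    )
    names = []
    while request is not None:
        response = request.execute()
        for table in response.get("tables", []):
            names.append(table["tableReference"]["tableId"])
        # list_next() returns None once all pages are consumed.
        request = service.tables().list_next(request, response)
    return names
```

The server does the filtering, so only the named fields cross the wire.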

Allen Short and others added 30 commits December 17, 2018 13:20
- Use new master / rc release release strategy (#440)
- Migrate Circle CI 2.0 (#488, #502)
I tried to make
`JSON.stringify(this.visualization.options.columnMapping)` a variable
to avoid repeating it, but if I make it a `let` the linter throws an
error and if I make it a `const` then it doesn’t change with the UI and
the logic doesn’t work. :(

updated based on PR comments
In the long run we'll be able to install additional dependencies by
having our own Dockerfile that builds images based on the Redash image
but also installs additional Python dependencies. But until we have
a fork with lots of changes of our own, we need to do it this way.

Redash-stmo contains the ability to hook up our own Dockerflow
library.

Refs #13
Refs #37
Extend the Remote User Auth backend with the ability to pass remote user groups via a configurable request header similar to the REMOTE_USER header.

Refs #37.

If enabled, the feature checks the header value against a configured list of group names, including the ability to use UNIX shell-style wildcards.
* Use --max-tasks-per-child as per celery documentation

* Set --max-memory-per-child to 1/4th of total system memory

* Split exec command over multiple lines

* Fix memory variable typo
Refs #537, #553.

Co-authored-by: Marina Samuel <msamuel@mozilla.com>
Co-authored-by: Allen Short <ashort@mozilla.com>
@jezdez jezdez requested review from emtwo and jasonthomas January 25, 2019 17:55

jezdez commented Jan 25, 2019

It'd be neat to try this out eventually, not sure how I can do this locally to be honest :(


emtwo commented Jan 28, 2019

@jezdez I haven't looked into the cause of the failing tests here, but I've tried this out locally and it's working in that less data is passed along per request, which is good.

However, I'm not entirely sure that this will resolve the issue #785 for a couple of reasons.

  1. It did not provide a clear improvement in speed for fetching BigQuery schema since the main bottleneck seems to be making a request for every single table, and this PR does not change that. Unfortunately, it seems that this is just the nature of the BigQuery schema API.

  2. I could not reproduce the error @jasonthomas was seeing, {"error": {"message": "Error retrieving schema.", "code": 2}}, either on master or in this PR when testing with a BigQuery data source that has ~270 tables. The only way I could get that error to show up was by using an invalid JSON key file when setting up the data source.

That said, I think once these tests are fixed, we can land this PR anyway (perhaps it should be done upstream?). It may or may not fix the issue, but it's still useful to be sending around less data.

I also think if we could connect to the BigQuery data source locally that is failing on stage, it might be easier to reproduce and debug the issue. Is there any chance we could do that, @jasonthomas?
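Point 1 above comes down to an N+1 request pattern. A minimal sketch of the shape of the schema loader (hypothetical names, not Redash's actual code) shows why trimming `fields` cannot remove the per-table round trips:

```python
# Illustrative sketch of the N+1 pattern behind the bottleneck: one
# tables().get() HTTP round trip per table. Partial responses (`fields`)
# shrink each payload but leave the number of requests unchanged.
def load_schema(service, project_id, dataset_id, table_ids):
    schema = []
    for table_id in table_ids:  # latency grows linearly with table count
        table = service.tables().get(
            projectId=project_id,
            datasetId=dataset_id,
            tableId=table_id,
            fields="id,schema",  # trims bytes per response, not round trips
        ).execute()
        columns = [field["name"] for field in table["schema"]["fields"]]
        schema.append({"name": table["id"], "columns": columns})
    return schema
```

With ~270 tables that is ~270 sequential HTTP requests regardless of how small each response gets.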


emtwo commented Jan 28, 2019

For reference, here are some differences in data that is passed along with the change in this PR:

[Before] Fetching a dataset

[{
    u'kind': u'bigquery#dataset',
    u'location': u'US',
    u'id': u'cosmic-antenna-155218:babynames',
    u'datasetReference': {
        u'projectId': u'cosmic-antenna-155218',
        u'datasetId': u'babynames'
    }
}]

[After] Fetching a dataset

[{
    u'datasetReference': {
        u'datasetId': u'babynames'
    }
}]

[Before] Fetching table names

[{
    u'creationTime': u'1541093458826',
    u'kind': u'bigquery#table',
    u'type': u'TABLE',
    u'id': u'cosmic-antenna-155218:babynames.copy',
    u'tableReference': {
        u'projectId': u'cosmic-antenna-155218',
        u'tableId': u'copy',
        u'datasetId': u'babynames'
    }
}, ...]

[After] Fetching table names

[{
    u'tableReference': {
        u'tableId': u'copy'
    }
}, ...]

[Before] Fetching a table schema:

{
    u'kind': u'bigquery#table',
    u'creationTime': u'1541093458826',
    u'location': u'US',
    u'tableReference': {
        u'projectId': u'cosmic-antenna-155218',
        u'tableId': u'copy',
        u'datasetId': u'babynames'
    },
    u'numRows': u'33228',
    u'numBytes': u'637183',
    u'etag': u'3AWXArPupz0SpiQIXDh8RQ==',
    u'numLongTermBytes': u'0',
    u'lastModifiedTime': u'1541093458826',
    u'type': u'TABLE',
    u'id': u'cosmic-antenna-155218:babynames.copy',
    u'selfLink': u'https://www.googleapis.com/bigquery/v2/projects/cosmic-antenna-155218/datasets/babynames/tables/copy',
    u'schema': {
        u'fields': [{
            u'type': u'STRING',
            u'name': u'name',
            u'mode': u'NULLABLE'
        }, {
            u'type': u'STRING',
            u'name': u'gender',
            u'mode': u'NULLABLE'
        }, {
            u'type': u'INTEGER',
            u'name': u'count',
            u'mode': u'NULLABLE'
        }]
    }
}

[After] Fetching a table schema:

{
    u'id': u'cosmic-antenna-155218:babynames.copy',
    u'schema': {
        u'fields': [{
            u'type': u'STRING',
            u'name': u'name',
            u'mode': u'NULLABLE'
        }, {
            u'type': u'STRING',
            u'name': u'gender',
            u'mode': u'NULLABLE'
        }, {
            u'type': u'INTEGER',
            u'name': u'count',
            u'mode': u'NULLABLE'
        }]
    }
}
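The trimming shown above can be mimicked locally to see the size difference. This toy snippet (abridged payload; the `fields='id,schema'` value is an assumption reconstructed from the [After] payload, and pruning really happens server-side) keeps only the keys the schema loader uses:

```python
import json

def prune(table, keep=("id", "schema")):
    """Mimic fields='id,schema': keep only the listed top-level keys."""
    return {key: table[key] for key in keep if key in table}

# Abridged version of the [Before] table payload above.
before = {
    "kind": "bigquery#table",
    "id": "cosmic-antenna-155218:babynames.copy",
    "numRows": "33228",
    "numBytes": "637183",
    "schema": {"fields": [
        {"type": "STRING", "name": "name", "mode": "NULLABLE"},
    ]},
}
after = prune(before)
# The pruned payload serializes to noticeably fewer bytes than the full one.
saved = len(json.dumps(before)) - len(json.dumps(after))
```

The real savings come per table, so they add up across the per-table requests.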


jasonthomas commented Jan 28, 2019

Is there any chance we could do that, @jasonthomas?

@emtwo I've sent you an email with the json key. The project id information is also listed in the json key file.

One thing to mention is that the Redash documentation [1] recommends either using the BigQuery Admin role or creating a custom role for the service account with the necessary permissions. I've created a custom role with the following permissions:

    "bigquery.datasets.get",
    "bigquery.jobs.create",
    "bigquery.jobs.get",
    "bigquery.jobs.list",
    "bigquery.jobs.listAll",
    "bigquery.jobs.update",
    "bigquery.tables.get",
    "bigquery.tables.list",

The documentation also includes 'bigquery.datasets.list' and 'bigquery.tables.getData', which are not included in the above permissions: 'bigquery.datasets.list' does not exist, and 'bigquery.tables.getData' would give access to all table data [2]. Instead, the service account user has access to specific datasets/tables.

[1] https://redash.io/help/data-sources/setup/bigquery
[2] https://cloud.google.com/bigquery/docs/access-control
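For reference, a custom role with this permission set could be created with `gcloud` roughly like this (a sketch: the role ID, project ID, and title are placeholders, and the permission list mirrors the one above):

```shell
# Hypothetical invocation; redash_bigquery and my-project-id are placeholders.
gcloud iam roles create redash_bigquery \
    --project=my-project-id \
    --title="Redash BigQuery" \
    --stage=GA \
    --permissions=bigquery.datasets.get,bigquery.jobs.create,bigquery.jobs.get,bigquery.jobs.list,bigquery.jobs.listAll,bigquery.jobs.update,bigquery.tables.get,bigquery.tables.list
```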

jasonthomas commented

As per today's STMO meeting, we decided to test the BigQuery data source with a GCP service account that has the 'BigQuery Admin' role, to rule out a permissions issue.

I created a new service account and a new BigQuery data source on stage STMO, and was unable to view the schema. The error was the same as above: {"error": {"message": "Error retrieving schema.", "code": 2}}. I also ran a query against the data source, which was successful and returned data.


emtwo commented Feb 1, 2019

Note: bug #785 has a fix that was merged upstream: getredash#3382


jezdez commented Feb 25, 2019

Hey @emtwo, @jasonthomas, do you think it would be a good idea to open a PR upstream?


emtwo commented Mar 28, 2019

Do you think this would be a good idea to open a PR upstream?

Yes definitely.


jezdez commented Apr 3, 2019

Closing in favor of getredash#3673.

@jezdez jezdez closed this Apr 3, 2019
@jezdez jezdez deleted the bug785 branch April 3, 2019 07:59
