Skip to content

WIP: [AIRFLOW-1894] Google cloud bigquery#4607

Closed
jmcarp wants to merge 1 commit intoapache:masterfrom
jmcarp:google-cloud-bigquery
Closed

WIP: [AIRFLOW-1894] Google cloud bigquery#4607
jmcarp wants to merge 1 commit intoapache:masterfrom
jmcarp:google-cloud-bigquery

Conversation

@jmcarp
Copy link
Contributor

@jmcarp jmcarp commented Jan 28, 2019

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • When adding new operators/hooks/sensors, the autoclass documentation generation needs to be added.
    • All the public functions and the classes in the PR contain docstrings that explain what it does

Code Quality

  • Passes flake8

@jmcarp jmcarp force-pushed the google-cloud-bigquery branch 4 times, most recently from 94a6905 to d131b00 Compare January 30, 2019 04:25
@jmcarp jmcarp changed the title [AIRFLOW-3776] Google cloud bigquery [AIRFLOW-1894] Google cloud bigquery Jan 30, 2019
@jmcarp jmcarp force-pushed the google-cloud-bigquery branch 2 times, most recently from 8bd9564 to dc502cb Compare January 30, 2019 05:03
@kaxil kaxil changed the title [AIRFLOW-1894] Google cloud bigquery WIP: [AIRFLOW-1894] Google cloud bigquery Jan 30, 2019
@jmcarp jmcarp force-pushed the google-cloud-bigquery branch 2 times, most recently from 89b048d to 292d467 Compare January 31, 2019 01:43
@codecov-io
Copy link

codecov-io commented Jan 31, 2019

Codecov Report

Merging #4607 into master will increase coverage by 0.23%.
The diff coverage is 47.05%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4607      +/-   ##
==========================================
+ Coverage    74.3%   74.53%   +0.23%     
==========================================
  Files         426      426              
  Lines       27867    27577     -290     
==========================================
- Hits        20706    20554     -152     
+ Misses       7161     7023     -138
Impacted Files Coverage Δ
...rflow/contrib/operators/bigquery_check_operator.py 0% <ø> (ø) ⬆️
airflow/contrib/operators/gcs_to_bq.py 0% <0%> (ø) ⬆️
airflow/contrib/operators/bigquery_get_data.py 0% <0%> (ø) ⬆️
airflow/contrib/operators/bigquery_to_bigquery.py 0% <0%> (ø) ⬆️
...ontrib/operators/bigquery_table_delete_operator.py 0% <0%> (ø) ⬆️
airflow/contrib/operators/bigquery_to_gcs.py 0% <0%> (ø) ⬆️
airflow/contrib/operators/bigquery_operator.py 95.93% <100%> (+2.36%) ⬆️
airflow/contrib/hooks/bigquery_hook.py 60.69% <47.29%> (+2.62%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f07f3a8...d903f24. Read the comment docs.

@jmcarp jmcarp force-pushed the google-cloud-bigquery branch 5 times, most recently from 7562a62 to 8698ae7 Compare February 2, 2019 02:13
@jmcarp jmcarp force-pushed the google-cloud-bigquery branch from 8698ae7 to d903f24 Compare February 2, 2019 02:18
@jmcarp
Copy link
Contributor Author

jmcarp commented Feb 4, 2019

@kaxil @potiuk: tests are passing, so this is ready for review when you have time. A few differences to point out:

  • The current python bigquery client library implements the dbapi connection and cursor interfaces, so we can drop our custom implementations
  • The current client also allows for polling a job by calling its result method, so we can also drop our custom polling logic
  • Because the current client includes classes for most job options that include basic validation, we're also able to drop some custom validation
  • Extra methods on the bigquery cursor have been moved to the hook

@jmcarp
Copy link
Contributor Author

jmcarp commented Feb 8, 2019

Ping @Fokko @feng-tao @mik-laj (not sure who knows this code best). I'm hoping to add some features after this is ready, so would be great to get feedback when you all have time.

sql,
bigquery_conn_id='bigquery_default',
use_legacy_sql=True,
use_legacy_sql=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't change this behaviour

allow_jagged_rows=self.allow_jagged_rows,
src_fmt_configs=self.src_fmt_configs,
labels=self.labels
external_config_options=self.external_config_options,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will have to be for 2.0 as it will break things. Also, it needs to be backward-compatible to make updation smooth from 1.X to 2.0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't yet reviewed the entire PR but schemed through it. Will try to find some time to look at it more thoroughly

@Fokko
Copy link
Contributor

Fokko commented Feb 27, 2019

I'm seeing some good things in this PR, and it looks like it will simplify the logic. Any plans of moving this forward @jmcarp ?

:type project_id: str
"""
service = self.get_service()
project_id = project_id if project_id is not None else self.project_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could use fallback_to_default_project_id decorator. It has additional logic to raise the exception if none of the project_id s is specified. and you could remove this if altogether then. It forces to use keyword parameters though.

project_id=project,
use_legacy_sql=self.use_legacy_sql,
def get_client(self, project_id=None):
project_id = project_id if project_id is not None else self.project_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as below - fallback_to_default_project_id decorator is nicer way I think

}
})
if external_config_options is not None:
if not isinstance(external_config_options, type(external_config.options)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should use dict explicitly here. What if external_config.options are None (seem to be default).

https://cloud.google.com/bigquery/docs/locations#specifying_your_location
:type location: str
"""
project_id = project_id if project_id is not None else self.project_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again - I think using fallback decorator is nicer as it keeps the project_id logic in one place.

passed to BigQuery
:type labels: dict
"""
project_id = project_id if project_id is not None else self.project_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here with decorator.

time_partitioning. The order of columns given determines the sort order.
:type cluster_fields: list of str
"""
project_id = project_id if project_id is not None else self.project_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here.

"""
# check to see if the table exists
table_id = table_resource['tableReference']['tableId']
project_id = project_id if project_id is not None else self.project_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here :)

even if any insertion errors occur.
:type fail_on_error: bool
"""
project_id = project_id if project_id is not None else self.project_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here too :)

private_key=private_key)

def table_exists(self, project_id, dataset_id, table_id):
def table_exists(self, dataset_id, table_id, project_id=None):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks for putting this together, this is great stuff!

Regarding these convenience methods, I'd argue that it would be better if the functionality would be implemented on the connection object returned by get_conn, and the here we just delegate, like this:

def table_exists(self, dataset_id, table_id, project_id=None):
    return self.get_conn().table_exists(dataset_id, table_id, project_id)

The reason is that sometimes it is very useful to have explicit control over when and where connections get created, for example when using multi-threading for I/O optimization. We've also had code that spuriously crashed because it was creating too many new connections (implicitly in such convenience methods), and at some point the authentication requests hit a rate-limit.

@potiuk
Copy link
Member

potiuk commented Apr 20, 2019

Hey @jmcarp -> are you still working on this? As I am now committer, I am happy to review that one as well as soon as it gets rebased (if it's still something you want to add).

@kaxil
Copy link
Member

kaxil commented Apr 30, 2019

@jmcarp Are you still working on this?

@mik-laj mik-laj added the provider:google Google (including GCP) related issues label Jul 22, 2019
@mik-laj
Copy link
Member

mik-laj commented Jul 27, 2019

@jmcarp Can I help you with this? This looks very interesting to me.

@mik-laj
Copy link
Member

mik-laj commented Sep 5, 2019

Hi.

I made a change in the base class - GoogleCloudBaseHook. Your PR may need to be changed. Could you do rebase?

Cheers

Refenence:
#5907

@potiuk
Copy link
Member

potiuk commented Sep 6, 2019

@jmcarp - is it possible to rebase/complete the work on this one ? It blocks us from moving operators/hook from contrib to core - after we move the operators/hooks it will be much more difficult to merge.

@potiuk
Copy link
Member

potiuk commented Sep 21, 2019

Hey @jmcarp - are you still doing it? Or should we close that one?

@stale
Copy link

stale bot commented Nov 20, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Nov 20, 2019
@stale stale bot closed this Nov 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

provider:google Google (including GCP) related issues stale Stale PRs per the .github/workflows/stale.yml policy file

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants