WIP: [AIRFLOW-1894] Google cloud bigquery by jmcarp · Pull Request #4607 · apache/airflow

jmcarp · 2019-01-28T04:08:01Z

Make sure you have checked all steps below.

Jira

My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
- https://issues.apache.org/jira/browse/AIRFLOW-1894
- In case you are fixing a typo in the documentation you can prepend your commit with [AIRFLOW-XXX], code changes always need a Jira issue.

Description

Here are some details about my PR, including screenshots of any UI changes:

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

In case of new functionality, my PR adds documentation that describes how to use it.
- When adding new operators/hooks/sensors, the autoclass documentation generation needs to be added.
- All the public functions and classes in the PR contain docstrings that explain what they do

Code Quality

Passes flake8

codecov-io · 2019-01-31T02:33:31Z

Codecov Report

Merging #4607 into master will increase coverage by 0.23%.
The diff coverage is 47.05%.

@@            Coverage Diff             @@
##           master    #4607      +/-   ##
==========================================
+ Coverage    74.3%   74.53%   +0.23%     
==========================================
  Files         426      426              
  Lines       27867    27577     -290     
==========================================
- Hits        20706    20554     -152     
+ Misses       7161     7023     -138

Impacted Files	Coverage Δ
...rflow/contrib/operators/bigquery_check_operator.py	`0% <ø> (ø)`
airflow/contrib/operators/gcs_to_bq.py	`0% <0%> (ø)`
airflow/contrib/operators/bigquery_get_data.py	`0% <0%> (ø)`
airflow/contrib/operators/bigquery_to_bigquery.py	`0% <0%> (ø)`
...ontrib/operators/bigquery_table_delete_operator.py	`0% <0%> (ø)`
airflow/contrib/operators/bigquery_to_gcs.py	`0% <0%> (ø)`
airflow/contrib/operators/bigquery_operator.py	`95.93% <100%> (+2.36%)`
airflow/contrib/hooks/bigquery_hook.py	`60.69% <47.29%> (+2.62%)`

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f07f3a8...d903f24.

jmcarp · 2019-02-04T17:07:00Z

@kaxil @potiuk: tests are passing, so this is ready for review when you have time. A few differences to point out:

- The current Python BigQuery client library implements the DB-API connection and cursor interfaces, so we can drop our custom implementations.
- The current client also allows polling a job by calling its result method, so we can drop our custom polling logic as well.
- Because the current client includes classes for most job options, with basic validation built in, we can also drop some custom validation.
- Extra methods on the BigQuery cursor have been moved to the hook.
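
For readers unfamiliar with the client library, the first two points can be sketched roughly as below. This is only an illustration, not the code in this PR: the helper names `run_query` and `run_query_dbapi` are hypothetical, and `client` is assumed to be a `google.cloud.bigquery.Client` (or an object with the same `query()` method).

```python
def run_query(client, sql):
    """Run `sql` and block until the job finishes, returning rows as a list.

    Hypothetical helper: `client` is assumed to be a
    google.cloud.bigquery.Client or a compatible object.
    """
    job = client.query(sql)  # submits the query and returns a job object
    rows = job.result()      # blocks until the job completes; no custom polling loop
    return list(rows)


def run_query_dbapi(client, sql):
    """Same query through the library's DB-API connection and cursor.

    The import is done lazily so this module loads even where
    google-cloud-bigquery is not installed.
    """
    from google.cloud.bigquery import dbapi

    connection = dbapi.connect(client)
    cursor = connection.cursor()
    cursor.execute(sql)
    return cursor.fetchall()
```

Either helper would replace a hand-written submit-then-poll loop; which of the two a hook should expose is a design choice this sketch does not settle.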

jmcarp · 2019-02-08T04:52:43Z

Ping @Fokko @feng-tao @mik-laj (not sure who knows this code best). I'm hoping to add some features after this is ready, so it would be great to get feedback when you all have time.

kaxil · 2019-02-12T23:30:42Z

airflow/contrib/operators/bigquery_check_operator.py

                 sql,
                 bigquery_conn_id='bigquery_default',
-                 use_legacy_sql=True,
+                 use_legacy_sql=False,


We shouldn't change this behaviour

kaxil · 2019-02-12T23:32:18Z

airflow/contrib/operators/bigquery_operator.py

            allow_jagged_rows=self.allow_jagged_rows,
-            src_fmt_configs=self.src_fmt_configs,
-            labels=self.labels
+            external_config_options=self.external_config_options,


This will have to wait for 2.0, as it will break things. It also needs to be backward-compatible so that upgrading from 1.X to 2.0 is smooth.

I haven't reviewed the entire PR yet, but I have skimmed through it. I will try to find some time to look at it more thoroughly.
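
One possible shape for that backward compatibility is a small deprecation shim, sketched below. Everything here is an assumption for illustration: the helper name `resolve_external_config_options` is hypothetical, and the old `src_fmt_configs` dict is simply passed through unchanged, which may not match how the two options actually relate.

```python
import warnings


def resolve_external_config_options(external_config_options=None,
                                    src_fmt_configs=None):
    """Return the options dict, still accepting the old keyword.

    Hypothetical helper: assumes the old `src_fmt_configs` dict can be
    used as-is in place of `external_config_options`.
    """
    if src_fmt_configs is not None:
        if external_config_options is not None:
            raise ValueError(
                "Pass either external_config_options or src_fmt_configs, not both"
            )
        warnings.warn(
            "src_fmt_configs is deprecated; use external_config_options",
            DeprecationWarning,
            stacklevel=2,
        )
        return dict(src_fmt_configs)
    # New-style call, or neither argument given: fall back to an empty dict.
    return dict(external_config_options or {})
```

With something like this, existing 1.X DAGs that pass the old keyword would keep working and get a warning, instead of failing on upgrade.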

Fokko · 2019-02-27T11:55:32Z

I'm seeing some good things in this PR, and it looks like it will simplify the logic. Any plans to move this forward, @jmcarp?

potiuk · 2019-02-05T13:02:51Z