WIP: [AIRFLOW-1894] Google cloud bigquery #4607
Codecov Report
@@            Coverage Diff             @@
##           master    #4607      +/-   ##
==========================================
+ Coverage    74.3%   74.53%   +0.23%
==========================================
  Files         426      426
  Lines       27867    27577     -290
==========================================
- Hits        20706    20554     -152
+ Misses       7161     7023     -138
|
@kaxil @potiuk: tests are passing, so this is ready for review when you have time. A few differences to point out:
|
| sql, | ||
| bigquery_conn_id='bigquery_default', | ||
| use_legacy_sql=True, | ||
| use_legacy_sql=False, |
We shouldn't change this behaviour
| allow_jagged_rows=self.allow_jagged_rows, | ||
| src_fmt_configs=self.src_fmt_configs, | ||
| labels=self.labels | ||
| external_config_options=self.external_config_options, |
This will have to be for 2.0 as it will break things. Also, it needs to be backward-compatible to make upgrading from 1.X to 2.0 smooth.
Haven't yet reviewed the entire PR but skimmed through it. Will try to find some time to look at it more thoroughly.
|
I'm seeing some good things in this PR, and it looks like it will simplify the logic. Any plans of moving this forward @jmcarp ? |
| :type project_id: str | ||
| """ | ||
| service = self.get_service() | ||
| project_id = project_id if project_id is not None else self.project_id |
I think you could use the fallback_to_default_project_id decorator. It has additional logic to raise an exception if no project_id is specified, so you could then remove this if altogether. It does force callers to use keyword parameters, though.
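To make the suggestion concrete, here is a minimal sketch of how a decorator of this kind could behave. It is not Airflow's actual implementation: ExampleHook, MissingProjectIdError, and table_path are hypothetical names used only for illustration, and the real decorator may differ in its details.

```python
import functools


class MissingProjectIdError(ValueError):
    """Hypothetical error raised when no project id can be resolved."""


def fallback_to_default_project_id(func):
    """Sketch: fill in project_id from self.project_id when the caller omits it."""

    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if args:
            # Keyword-only calls keep the project_id lookup unambiguous.
            raise TypeError(
                "{} must be called with keyword arguments only".format(func.__name__)
            )
        if kwargs.get("project_id") is None:
            # Fall back to the default configured on the hook.
            kwargs["project_id"] = self.project_id
        if kwargs["project_id"] is None:
            raise MissingProjectIdError(
                "project_id was not passed and the hook has no default"
            )
        return func(self, **kwargs)

    return wrapper


class ExampleHook:
    """Hypothetical hook, used only to exercise the decorator."""

    def __init__(self, project_id=None):
        self.project_id = project_id

    @fallback_to_default_project_id
    def table_path(self, dataset_id, table_id, project_id=None):
        return "{}.{}.{}".format(project_id, dataset_id, table_id)
```

With this shape, each method body can drop its own `project_id = project_id if project_id is not None else self.project_id` line, at the cost of requiring keyword arguments at every call site.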
| project_id=project, | ||
| use_legacy_sql=self.use_legacy_sql, | ||
| def get_client(self, project_id=None): | ||
| project_id = project_id if project_id is not None else self.project_id |
Same as below - the fallback_to_default_project_id decorator is a nicer way, I think.
| } | ||
| }) | ||
| if external_config_options is not None: | ||
| if not isinstance(external_config_options, type(external_config.options)): |
I think you should use dict explicitly here. What if external_config.options is None (which seems to be the default)?
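A minimal sketch of the explicit check being suggested, assuming the options are meant to be a plain dict. The helper name validate_external_config_options is hypothetical and is not part of the PR or of the BigQuery client.

```python
def validate_external_config_options(options):
    """Return a dict of options, rejecting anything that is not a dict.

    Checking against dict directly avoids comparing with
    type(external_config.options), which would be NoneType whenever
    that attribute is unset.
    """
    if options is None:
        # Nothing supplied: treat it as "no extra options".
        return {}
    if not isinstance(options, dict):
        raise TypeError(
            "external_config_options must be a dict, got {}".format(
                type(options).__name__
            )
        )
    # Copy so later mutation does not affect the caller's dict.
    return dict(options)
```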
| https://cloud.google.com/bigquery/docs/locations#specifying_your_location | ||
| :type location: str | ||
| """ | ||
| project_id = project_id if project_id is not None else self.project_id |
Again - I think using the fallback decorator is nicer, as it keeps the project_id logic in one place.
| passed to BigQuery | ||
| :type labels: dict | ||
| """ | ||
| project_id = project_id if project_id is not None else self.project_id |
| time_partitioning. The order of columns given determines the sort order. | ||
| :type cluster_fields: list of str | ||
| """ | ||
| project_id = project_id if project_id is not None else self.project_id |
| """ | ||
| # check to see if the table exists | ||
| table_id = table_resource['tableReference']['tableId'] | ||
| project_id = project_id if project_id is not None else self.project_id |
| even if any insertion errors occur. | ||
| :type fail_on_error: bool | ||
| """ | ||
| project_id = project_id if project_id is not None else self.project_id |
| private_key=private_key) | ||
|
|
||
| def table_exists(self, project_id, dataset_id, table_id): | ||
| def table_exists(self, dataset_id, table_id, project_id=None): |
Hi, thanks for putting this together, this is great stuff!
Regarding these convenience methods, I'd argue that it would be better if the functionality were implemented on the connection object returned by get_conn, and here we just delegate, like this:
    def table_exists(self, dataset_id, table_id, project_id=None):
        return self.get_conn().table_exists(dataset_id, table_id, project_id)
The reason is that sometimes it is very useful to have explicit control over when and where connections get created, for example when using multi-threading for I/O optimization. We've also had code that spuriously crashed because it was creating too many new connections (implicitly in such convenience methods), and at some point the authentication requests hit a rate limit.
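A small sketch of the delegation idea, under the assumption that the connection object carries the actual logic. FakeConnection and ExampleHook are hypothetical stand-ins, not the real BigQuery hook or client, and the counter exists only to make connection creation visible.

```python
class FakeConnection:
    """Stand-in for the object returned by get_conn; holds the real logic."""

    def __init__(self, tables):
        self._tables = set(tables)

    def table_exists(self, dataset_id, table_id, project_id=None):
        return (project_id, dataset_id, table_id) in self._tables


class ExampleHook:
    """Hypothetical hook whose convenience method only delegates."""

    def __init__(self, tables):
        self._tables = tables
        self.connections_created = 0

    def get_conn(self):
        # Each call creates a new connection, as an implicit helper would.
        self.connections_created += 1
        return FakeConnection(self._tables)

    def table_exists(self, dataset_id, table_id, project_id=None):
        # Convenience wrapper: creates a connection on every call.
        return self.get_conn().table_exists(dataset_id, table_id, project_id)
```

A caller that needs control over connection creation can call get_conn() once and reuse the returned object across many checks, while casual callers keep using hook.table_exists(...).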
|
Hey @jmcarp -> are you still working on this? As I am now a committer, I am happy to review this one as well, as soon as it gets rebased (if it's still something you want to add). |
|
@jmcarp Are you still working on this? |
|
@jmcarp Can I help you with this? This looks very interesting to me. |
|
Hi. I made a change in the base class, GoogleCloudBaseHook. Your PR may need to be changed. Could you rebase? Cheers. Reference: |
|
@jmcarp - is it possible to rebase/complete the work on this one? It blocks us from moving operators/hooks from contrib to core - after we move the operators/hooks it will be much more difficult to merge. |
|
Hey @jmcarp - are you still doing it? Or should we close that one? |
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Make sure you have checked all steps below.
Jira
Description
Tests
Commits
Documentation
Code Quality
flake8