[AIRFLOW-4588] Add GoogleDiscoveryApiHook and GoogleApiToS3Transfer#5335
Conversation
|
I'm not that familiar with the GCP hooks etc., but is it worth using this as a base class for any of the existing hooks? |
|
So far we have used a slightly different approach for GCP services:
The only part that is not in gcp_api_base_hook.py (from this proposal) is some common way of handling pagination. @feluelle -> are you aware of that implementation? Do you see any problem with following the approach of the other GCP operators? Maybe the pagination could be implemented in gcp_api_base_hook instead, if you think it's worth having some common code? |
Yes, I am. I think the pagination implementation is only compatible with the google-api-python-client library. |
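To illustrate the pagination convention being discussed, here is a minimal sketch of the `list`/`list_next` request pattern that google-api-python-client exposes. `query_all_pages` is a hypothetical helper written for illustration, not code from this PR:

```python
def query_all_pages(endpoint, **params):
    """Collect responses across all pages using the Discovery client's
    list / list_next convention: list_next returns the request for the
    next page, or None when there are no more pages."""
    results = []
    request = endpoint.list(**params)
    while request is not None:
        response = request.execute()
        results.append(response)
        request = endpoint.list_next(request, response)
    return results
```

This is the pattern that ties the pagination to google-api-python-client specifically: the `list_next` method exists on Discovery-generated endpoints, not on the idiomatic google-cloud-* clients.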
|
@feluelle - do you want to follow up on that? My only concern is that it introduces a new base GCP hook where we already have one. Maybe you could simply start by rewriting your code to use the GCP API base hook (including the credentials retrieval that is already there), and then you could move some of the methods (like the pagination) to the base hook. Then you could also contribute the whole Google API to S3 operator. It seems useful to have such a generic solution. |
|
I've made the changes you suggested, but I'm keeping them on my fork for now: https://github.com/louisguitton/airflow/pull/1 |
|
OK. I will be happy to collaborate on it (but starting next week likely :)) |
Force-pushed 719524e to 4d96c6e (Compare)
|
PTAL @potiuk @louisguitton :) |
|
I added @mik-laj and @kaxil as I am going on holidays soon, and it's a very interesting approach - different from what we've been using so far, but pretty interesting as it provides a single implementation to handle many Google APIs - pretty useful in a number of cases (like the API-to-S3 transfer already implemented in this PR). I'd love your (@kaxil and @mik-laj) opinion on that approach. Also @feluelle and @louisguitton -> if we go in that direction we should definitely update integration.rst to describe this as an alternative, more "generic" way to access Google APIs than the rest of the operators. But let's hold off on adding the docs until @kaxil and @mik-laj have their say. |
|
Google provides two types of API libraries: the generic, Discovery-based google-api-python-client and the idiomatic google-cloud-* libraries.
It is worth noting that the Discovery library supports all Google services, including those less popular among our clients. The google-cloud-* libraries support only Google Cloud Platform services. I do not know of other libraries that integrate with the Google API. Both libraries have identical authorization mechanisms, but they differ in implementation, e.g. the Google APIs Python Client uses HTTPS only, while in most cases the google-cloud-* libraries use protobuf (in rare cases they also use HTTPS). In my opinion, it is worth introducing separate hooks for the two types of libraries to improve the developer experience. Currently, it is necessary to write long descriptions warning users that a given function can be used only with a given type of library. Despite the warnings, even experienced developers run into these problems and try to use functions designed for one type of library with the other. CC: @nuclearpinguin |
|
As for the operator, I like the idea. The generic operator is well suited to the generic API client. However, I have concerns about its performance. It can be used with many services, including those that generate a lot of data. I think it is worth thinking about memory limitation. What do you think about sending data in chunks in JSON Lines format? That would give much better performance. |
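A rough sketch of what chunked JSON Lines output could look like, so a large API response never has to be held in memory as a single document. The helper name `pages_to_jsonlines` and the `items` response key are assumptions for illustration, not part of the PR:

```python
import io
import json


def pages_to_jsonlines(pages, chunk_size=1000):
    """Yield JSON Lines chunks built incrementally from paged API
    responses, flushing every chunk_size records instead of
    serializing the whole result set at once."""
    buf = io.StringIO()
    count = 0
    for page in pages:
        for item in page.get("items", []):
            buf.write(json.dumps(item) + "\n")
            count += 1
            if count >= chunk_size:
                yield buf.getvalue()
                buf = io.StringIO()
                count = 0
    if count:
        yield buf.getvalue()
```

Each yielded chunk could then be uploaded to S3 (e.g. as a multipart upload part) before the next pages are fetched.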
|
I like the approach and the idea of this kind of operator. We should, however, think about how we want to limit the response size, as it would probably be passed to XCom. |
|
@kaxil You are totally right. We should limit it. I only used it for some APIs whose responses were quite small. I could also implement a way to pass a JSONPath as an arg to push only specific response data to XCom, for example like this |
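A minimal sketch of such a response-field selector; `extract_field` and the dotted-path syntax are hypothetical (a real implementation might use a JSONPath library instead):

```python
def extract_field(response, field_path):
    """Walk a dotted path like 'items.0.id' into a nested response,
    so only the selected fragment would be pushed to XCom.
    Hypothetical helper, not code from this PR."""
    value = response
    for part in field_path.split("."):
        if isinstance(value, list):
            value = value[int(part)]  # numeric segments index into lists
        else:
            value = value[part]       # other segments index into dicts
    return value
```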
|
@potiuk Agree - plus a note about it in the "side note" (what @mik-laj just wrote)
|
Force-pushed 4d96c6e to 9ada8da (Compare)
kaxil left a comment
For the memory limitation @mik-laj @kaxil:
- Should we just have a `memory_limit` arg?
- Should one be able to specify a `google_api_response_field_via_xcom` as a JSON path, and push to XCom only if it is set?
I also updated it to include documentation about the integration.
We can have both. We can re-use
airflow/airflow/models/xcom.py (line 37 in 7fb729d)
airflow/airflow/contrib/operators/gcs_download_operator.py (lines 85 to 91 in ec7c67f)
google_api_response_field_via_xcom sounds like a good idea too.
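A sketch of the size guard being discussed. The `MAX_XCOM_SIZE` value below is taken from airflow/models/xcom.py as it stood at the time, but treat it as an assumption; `push_if_small_enough` is a hypothetical helper, not the PR's actual code:

```python
import json

# Assumed to match the constant defined in airflow/models/xcom.py
MAX_XCOM_SIZE = 49344  # bytes


def push_if_small_enough(xcom_push, key, response):
    """Push the response to XCom only when its serialized size fits,
    otherwise raise so the task fails loudly instead of bloating
    the metadata database."""
    payload = json.dumps(response)
    if len(payload.encode("utf-8")) >= MAX_XCOM_SIZE:
        raise RuntimeError(
            "Response is too large (%d bytes) to push to XCom" % len(payload)
        )
    xcom_push(key=key, value=response)
```

In an operator, `xcom_push` would be the task instance's `xcom_push` method; here it is injected as a parameter so the guard is easy to test.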
|
@kaxil That sounds good! 👍 |
Force-pushed 9ada8da to 875375c (Compare)
|
I think I would like to open another PR regarding:
Because I think it is not required. But I added it anyway. WDYT now @kaxil? |
Force-pushed 875375c to 7568957 (Compare)
|
@kaxil there are two pipelines running for this PR. Can you cancel the first one? https://travis-ci.org/apache/airflow/builds/574902955 |
Done |
|
I like this approach because it's really generic. On the other hand, I also like service-specific hooks where the user can see how the discovery request is built: `self.get_conn().projects().locations().functions().get().execute()` - especially when I think about "Explicit is better than implicit.". I would opt for using this generic hook in operators based on the Discovery API, meaning that we would not require an operator + hook pair for every service. The main advantage of this is having the code in one place. Moreover, it would be really easy to create new operators using this hook. However, I would suggest incorporating the proposed functionalities into the already present |
I wouldn't mix the implementations using the Discovery API with those using the gcp libraries. But I could imagine putting both into one module/file - not into a single class. (Though I think I would prefer leaving it as it is, in separate modules.) |
|
I think having it as a separate module is good, like @feluelle mentioned. One main reason is that for all the other GCP services we are moving towards a client-based API instead of a Discovery-based one. |
|
@kaxil @nuclearpinguin @mik-laj @potiuk What do you guys think of a deprecation warning if this hook is called with a service for which we already have a dedicated hook? That way users can switch to the dedicated hook if it has the functionality they need. The second option is probably too complex, and in the end we would be mixing the client-based and Discovery-based APIs - which we probably don't want. So... |
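A sketch of what such a deprecation warning could look like; the `DEDICATED_HOOKS` mapping and the helper name are made up for illustration:

```python
import warnings

# Hypothetical mapping of Discovery service names to existing dedicated hooks
DEDICATED_HOOKS = {
    "bigquery": "BigQueryHook",
    "storage": "GoogleCloudStorageHook",
}


def warn_if_dedicated_hook_exists(api_service_name):
    """Emit a DeprecationWarning when the generic Discovery hook is used
    for a service that already has a dedicated, client-based hook."""
    hook = DEDICATED_HOOKS.get(api_service_name)
    if hook:
        warnings.warn(
            "A dedicated hook (%s) exists for '%s'; consider using it "
            "instead of the generic Discovery hook." % (hook, api_service_name),
            DeprecationWarning,
        )
```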
|
Hi. I made a change in the base class - GoogleCloudBaseHook. Your PR may need to be changed. Could you rebase? Cheers. Reference: |
- add documentation to integration.rst

The hook provides:
- a get_conn function to authenticate to the Google API via an Airflow connection
- a query function to dynamically query all data available for a specific endpoint and given parameters (you are able to retrieve either one page of data or all data)

The transfer operator provides:
- basic transfer between the Google API and S3
- passing an XCom variable to dynamically set the endpoint params for a request
- exposing the response data to XCom, raising an exception when it exceeds MAX_XCOM_SIZE

Co-authored-by: louisguitton <louisguitton@users.noreply.github.com>
Force-pushed 7568957 to b4d75c9 (Compare)
Codecov Report
@@ Coverage Diff @@
## master #5335 +/- ##
==========================================
- Coverage 80.02% 9.56% -70.47%
==========================================
Files 594 596 +2
Lines 34769 34871 +102
==========================================
- Hits 27824 3334 -24490
- Misses 6945 31537 +24592
Continue to review full report at Codecov.
|
|
Thanks 😁 Thanks to @louisguitton for helping out on this :) |
|
congrats on the merge, and thanks for including me on this @feluelle; I really did not do much |
Description
This PR adds a hook based on Google's Discovery API to communicate with any Google service, plus an implementation of this hook in an operator that transfers data from the Google API to S3.
The hook provides:
- a get_conn function to authenticate to the Google API via an Airflow connection
- a query function to dynamically query all data available for a specific endpoint and given parameters (one page of data or all data)
The transfer operator provides:
- basic transfer between the Google API and S3
- passing an XCom variable to dynamically set the endpoint params for a request
- exposing the response data to XCom (raising an exception when it exceeds MAX_XCOM_SIZE)