Skip to content

[AIRFLOW-4588] Add GoogleDiscoveryApiHook and GoogleApiToS3Transfer#5335

Merged
feluelle merged 1 commit intoapache:masterfrom
feluelle:feature/AIRFLOW-4588-google-api-client-hook
Sep 11, 2019
Merged

[AIRFLOW-4588] Add GoogleDiscoveryApiHook and GoogleApiToS3Transfer#5335
feluelle merged 1 commit intoapache:masterfrom
feluelle:feature/AIRFLOW-4588-google-api-client-hook

Conversation

@feluelle
Copy link
Member

@feluelle feluelle commented May 29, 2019

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
    • https://issues.apache.org/jira/browse/AIRFLOW-4588
    • In case you are fixing a typo in the documentation you can prepend your commit with [AIRFLOW-XXX], code changes always need a Jira issue.
    • In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal (AIP).
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

This PR adds a hook based on Google's Discovery API to communicate to any Google Services.
And also a implementation of this hook in an operator that transfers data from Google to S3.

  • add documentation to integration.rst
    The hook provides:
  • a get_conn function to authenticate to the Google API via an airflow connection
  • a query function to dynamically query all data available for a specific endpoint and given parameters. (You are able to either retrieve one page of data or all data)
    The transfer operator provides:
  • basic transfer between google api and s3
  • passing an xcom variable to dynamically set the endpoint params for a request
  • exposing the response data to xcom, but raises exception when it exceeds MAX_XCOM_SIZE

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
  • tests/contrib/hooks/test_google_discovery_api_hook.py
  • tests/contrib/operators/test_google_api_to_s3_transfer.py

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
    • If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to a appropriate release

Code Quality

  • Passes flake8

@ashb
Copy link
Member

ashb commented May 29, 2019

I'm not that familiar with the GCP hooks etc, but is it worth using this in as a base class for any of the existing hooks?

@feluelle
Copy link
Member Author

Me neither. @potiuk @feng-tao ?

@potiuk
Copy link
Member

potiuk commented May 29, 2019

So far we used slightly different approach for GCP services:

  1. In most cases (and this is preferred per official google recommendations) instead of discovery API we used the new python-idiomatic APIs, specific for each service. See for example here: https://github.com/apache/airflow/blob/master/airflow/contrib/hooks/gcp_spanner_hook.py
    Discovery API is officially deprecated by Google (but still works so far)

  2. whenever there is no python-idiomatic API or it is not enough, discovery API is used (for example: https://github.com/apache/airflow/blob/master/airflow/contrib/hooks/gcp_compute_hook.py). Each such hook is implemented separately (and it's usually not a lot of code per hook - especially the query method is usually few lines of code)

  3. In both discovery and idiomatic APIs we use this base hook (it provides authentication that is usable by both types of APIs + exception handling + default project_id handling): https://github.com/apache/airflow/blob/master/airflow/contrib/hooks/gcp_api_base_hook.py

The only part which is not in the gcp_api_base_hook.py (from this proposal) is some common way of handling pagination.

@feluelle -> are you aware of the implementation ? Do you see any problem with following the approach of other GCP operators that you need to solve? Maybe the pagination could be implemented in the gcp_api_base_hook instead if you think it's worth to implement some common code?

@feluelle
Copy link
Member Author

feluelle commented Jun 3, 2019

@feluelle -> are you aware of the implementation ? Do you see any problem with following the approach of other GCP operators that you need to solve? Maybe the pagination could be implemented in the gcp_api_base_hook instead if you think it's worth to implement some common code?

Yes, I am.
So far, it worked great for YouTube Data, YouTube Analytics & Reporting, Google Analytics and Google Sheets - and thats how we call them.
We are not using the GCP at work so I never tried to access those services.

The pagination implementation is only compatible to the google-api-python-client library I think.
I haven't checked out the python-idiomatic API implementations, yet. So I don't know if there is a general approach to do that.

@potiuk
Copy link
Member

potiuk commented Jun 9, 2019

@feluelle - do you want to follow up on that ? My only concern with that one is that it introduces a new base GCP hook where we already have one.

Maybe you could simply start with rewriting your code to use the GCP API base hook (including the credentials retrieval that is already there) and then you could move some of the methods to the base hook (like the pagination).

Then you could also contribute the whole Google API to S3 operator. It seems useful to have such generic solution.

@louisguitton
Copy link
Contributor

I've made the changes you suggested, but I keep them for now on my fork https://github.com/louisguitton/airflow/pull/1

@potiuk
Copy link
Member

potiuk commented Jun 18, 2019

OK. I will be happy to collaborate on it (but starting next week likely :))

@mik-laj mik-laj added the provider:google Google (including GCP) related issues label Jul 22, 2019
@feluelle feluelle force-pushed the feature/AIRFLOW-4588-google-api-client-hook branch 3 times, most recently from 719524e to 4d96c6e Compare August 5, 2019 10:36
@feluelle feluelle changed the title [AIRFLOW-4588] Add GoogleApiClientHook [AIRFLOW-4588] Add GoogleDiscoveryApiHook and GoogleApiToS3Transfer Aug 5, 2019
@feluelle
Copy link
Member Author

feluelle commented Aug 5, 2019

PTAL @potiuk @louisguitton :)

@potiuk potiuk requested review from kaxil, mik-laj and potiuk August 6, 2019 06:27
@potiuk
Copy link
Member

potiuk commented Aug 6, 2019

I added @mik-laj and @kaxil as I am going for holidays soon, and it's a very interesting approach - different to what we've been using so far, but pretty interesting one as it provides one implementation to handle many google APIS - pretty useful in a number of cases (like the API to S3 transfer already implemented in this PR). I'd love your (@kaxil and @mik-laj ) opinion on that approach.

Also @feluelle and @louisguitton -> for sure if we go that direction we should integration.rst update where we would describe this as an alternative method to access Google APIs in more "generic" way than the rest of the operators. But let's hold-off with adding the docs until @kaxil and @mik-laj have their say.

@mik-laj
Copy link
Member

mik-laj commented Aug 6, 2019

Google provides two types of API libraries:

  • The Google APIs Python Client is a client library that are compatibilble with all Google APIs.
  • The google-cloud-* libraries are built specifically for the Google Cloud Platform and provide the recommended way to integrate Google Cloud APIs. This provides much better experience for programmers.

It is worth noting that the Discovery library supports all Google services also those less popular among our clients. The google-cloud-* libraries support only Google Cloud Platform services. I do not know other libraries that integrate with Google API. Both libraries have identical authorization mechanisms, but they differ in implementation e.g. The Google APIs Python Client use HTTPS only. In most cases the google-cloud-* libraries use protobuf. In rare cases, they also use HTTPs.

In my opinion, it is worth introducing a separate hooks for both types of libraries to improve developer experience. Currently, it is necessary to create long descriptions to warn users that a given function can be used only with a given type of library. Despite warnings, even experienced developers face these problems and try to use functions designed for another type of library with the second type of library.
An example from the last 6 hours (CC: @potiuk)
#5335 (comment)

CC: @nuclearpinguin

@mik-laj
Copy link
Member

mik-laj commented Aug 6, 2019

As for the operator, I like the idea. The generic operator is well suited to the generic API Client. However, I have concerns about its performance. This can be used by many services, including those that generate a lot of data. I think that it is worth thinking about the memory limitation. What do you think about sending data in chunk in JSON lines format? This will provide much better performance.
Currently, resource constraints are not often exists on production, but Apache Airflow is developing towards serverless and Kubernetes solutions. In this case memory limitations gain in importance.

@kaxil
Copy link
Member

kaxil commented Aug 6, 2019

I like the approach and the idea of this kind of operator. We should, however, think about how we want to limit the response as it would probably be passed to xcom.

@feluelle
Copy link
Member Author

feluelle commented Aug 7, 2019

@kaxil You are totally right. We should limit it. I only used it for some APIs where their response was quite small.

I could also implement a way to pass a jsonpath as arg to only push specific response data to xcom for example like this 'foo[*].baz' (see https://github.com/kennknowles/python-jsonpath-rw) What do you think of this? (besides that it would add another package. Maybe I can implement it without using another external package.)

@feluelle
Copy link
Member Author

feluelle commented Aug 7, 2019

@potiuk Agree note about it in the integration.rst would be nice.

When accessing Google's Services you can choose out of two integrated options for doing operations.

The first and recommended option is to use the dedicated Hooks and Operators that are only integrating one specific service - but do that very stable.
You can identify them on the **gcp** prefix of the file name.

The second option to use Google Services is to use the **GoogleDiscoveryApiHook** to request data from any Google Service registered in the [Google API Explorer](https://developers.google.com/apis-explorer/#p/) - deprecated way but more generic to not having to write custom service logic.

For more Information which libraries these two options use see below:

plus "side note" (what @mik-laj just wrote)

Google provides two types of API libraries:

  • The Google APIs Python Client is a client library that are compatibilble with all Google APIs.

  • The google-cloud-* libraries are built specifically for the Google Cloud Platform and provide the recommended way to integrate Google Cloud APIs. This provides much better experience for programmers.

It is worth noting that the Discovery library supports all Google services also those less popular among our clients. The google-cloud-* libraries support only Google Cloud Platform services. I do not know other libraries that integrate with Google API. Both libraries have identical authorization mechanisms, but they differ in implementation e.g. The Google APIs Python Client use HTTPS only. In most cases the google-cloud-* libraries use protobuf. In rare cases, they also use HTTPs.

@feluelle feluelle force-pushed the feature/AIRFLOW-4588-google-api-client-hook branch from 4d96c6e to 9ada8da Compare August 19, 2019 15:56
@feluelle
Copy link
Member Author

feluelle commented Aug 19, 2019

For the memory limitation @mik-laj @kaxil :

  • Should we just have a memory_limit arg?
  • Should one be able to specify a google_api_response_field_via_xcom to a json path and only if set push to xcom?

I also updated it to include a documentation about the integration.

Copy link
Member

@kaxil kaxil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the memory limitation @mik-laj @kaxil :

  • Should we just have a memory_limit arg?
  • Should one be able to specify a google_api_response_field_via_xcom to a json path and only if set push to xcom?

I also updated it to include a documentation about the integration.

We can have both. We can re-use

MAX_XCOM_SIZE = 49344
and have it as a limit. We use something similar is GCS Download Operator:

if self.store_to_xcom_key:
if sys.getsizeof(file_bytes) < MAX_XCOM_SIZE:
context['ti'].xcom_push(key=self.store_to_xcom_key, value=file_bytes)
else:
raise RuntimeError(
'The size of the downloaded file is too large to push to XCom!'
)

google_api_response_field_via_xcom sounds like a good idea too.

@feluelle
Copy link
Member Author

@kaxil That sounds good! 👍

@feluelle feluelle force-pushed the feature/AIRFLOW-4588-google-api-client-hook branch from 9ada8da to 875375c Compare August 21, 2019 14:56
@feluelle
Copy link
Member Author

feluelle commented Aug 21, 2019

I think I would like to open another PR regarding:

Should one be able to specify a google_api_response_field_via_xcom to a json path and only if set push to xcom?

Because I think it is not required.

But I added the MAX_XCOM_SIZE limitation when pushing the response to xcom.

WDYT now @kaxil ?

@feluelle feluelle force-pushed the feature/AIRFLOW-4588-google-api-client-hook branch from 875375c to 7568957 Compare August 21, 2019 15:41
@feluelle
Copy link
Member Author

@kaxil there are two pipelines running for this PR. Can you cancel the first one https://travis-ci.org/apache/airflow/builds/574902955 ?

@kaxil
Copy link
Member

kaxil commented Aug 21, 2019

@kaxil there are two pipelines running for this PR. Can you cancel the first one https://travis-ci.org/apache/airflow/builds/574902955 ?

Done

@turbaszek
Copy link
Member

I like this approach because it's really generic. On the other hand, I also like service-specific hooks where user can see how the discovery request is build:

self.get_conn().projects().locations().functions().get().execute()

especially when I think about "Explicit is better than implicit.".

I would opt for using this generic hook in operators based on discovery API. Meaning that we will not require operator + hook pair for every service. The main advantage of this is having code in one place. Moreover, it would be really easy to create new operators using this hook.

However, I would suggest to incorporate proposed functionalities into already present GoogleCloudBaseHook to keep all gcp logic in one class.

WDYT @potiuk @mik-laj @feluelle

@feluelle
Copy link
Member Author

feluelle commented Aug 29, 2019

However, I would suggest to incorporate proposed functionalities into already present GoogleCloudBaseHook to keep all gcp logic in one class.

I wouldn't mix the implementations which are using Discovery API and gcp libraries. But I could imagine putting both into one module / file - not into a single class.

(But I think I would prefer leaving it like it is, in separate modules.)

@kaxil
Copy link
Member

kaxil commented Aug 29, 2019

@nuclearpinguin

I think having it as a separate module is good like @feluelle mentioned. One main reason is for all the other GCP services we are moving towards a Client-based API instead of Discovery based.

@feluelle
Copy link
Member Author

feluelle commented Aug 30, 2019

@kaxil @nuclearpinguin @mik-laj @potiuk

What do you guys think of a deprecation warning if this hook is called with services where we already have a dedicated hook? So that users can then change to the dedicated hook if it has the functionality he needs.
OR
What if it dynamically changes to the dedicated hook's function if exists.
So for example:
You call GoogleDiscoveryApiHook.query(endpoint='bigtableadmin.projects.instances.get') and it actually calls BigtableHook.get_instance. This would require a mapping between api endpoint and gcp hook function that we then probably need to manually update every time we add a new gcp (dedicated) hook.

The second one is probably too complex.

and in the end we are about to mix the client-based and discovery based api - what we probably don't want. So...

Copy link
Member

@kaxil kaxil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with the current scope of this PR. I don't want to map or mix it :)

Good job @feluelle

@mik-laj
Copy link
Member

mik-laj commented Sep 5, 2019

Hi.

I made a change in the base class - GoogleCloudBaseHook. Your PR may need to be changed. Could you do rebase?

Cheers

Refenence:
#5907

- add documentation to integration.rst
The hook provides:
- a get_conn function to authenticate to the Google API via an airflow connection
- a query function to dynamically query all data available for a specific endpoint and given parameters. (You are able to either retrieve one page of data or all data)
The transfer operator provides:
- basic transfer between google api and s3
- passing an xcom variable to dynamically set the endpoint params for a request
- exposing the response data to xcom, but raises exception when it exceeds MAX_XCOM_SIZE

Co-authored-by: louisguitton <louisguitton@users.noreply.github.com>
@feluelle feluelle force-pushed the feature/AIRFLOW-4588-google-api-client-hook branch from 7568957 to b4d75c9 Compare September 5, 2019 15:09
@codecov-io
Copy link

Codecov Report

Merging #5335 into master will decrease coverage by 70.46%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #5335       +/-   ##
==========================================
- Coverage   80.02%   9.56%   -70.47%     
==========================================
  Files         594     596        +2     
  Lines       34769   34871      +102     
==========================================
- Hits        27824    3334    -24490     
- Misses       6945   31537    +24592
Impacted Files Coverage Δ
...low/contrib/operators/google_api_to_s3_transfer.py 0% <0%> (ø)
airflow/contrib/hooks/google_discovery_api_hook.py 0% <0%> (ø)
...low/contrib/operators/wasb_delete_blob_operator.py 0% <0%> (-100%) ⬇️
airflow/example_dags/subdags/subdag.py 0% <0%> (-100%) ⬇️
airflow/gcp/sensors/bigquery_dts.py 0% <0%> (-100%) ⬇️
airflow/gcp/operators/text_to_speech.py 0% <0%> (-100%) ⬇️
airflow/contrib/sensors/emr_base_sensor.py 0% <0%> (-100%) ⬇️
airflow/example_dags/example_subdag_operator.py 0% <0%> (-100%) ⬇️
airflow/contrib/operators/file_to_gcs.py 0% <0%> (-100%) ⬇️
airflow/gcp/hooks/kms.py 0% <0%> (-100%) ⬇️
... and 489 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0f548b8...b4d75c9. Read the comment docs.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job indeed. Let's merge it :). One of the first commits you can merge yourself @feluelle :)

@feluelle
Copy link
Member Author

Thanks 😁

Thanks to @louisguitton for helping out on this :)

@feluelle feluelle merged commit 25c53b0 into apache:master Sep 11, 2019
@louisguitton
Copy link
Contributor

congrats on the merge, and thanks for including me on this @feluelle ; I really did not mucho

@feluelle feluelle deleted the feature/AIRFLOW-4588-google-api-client-hook branch April 30, 2020 07:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

provider:google Google (including GCP) related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants