Add DataflowStartSQLQuery operator #8553
Conversation
@jaketf Can I ask you for a review? I know that you are also interested in the integration with Dataflow.
Who is going to create this dataset? Can we use a public dataset so that the examples work for anyone?
This bucket is created in system tests. https://github.com/apache/airflow/pull/8553/files#
Unfortunately, Dataflow SQL is not compatible with the public datasets I know.
I got the following error when I referred to the public dataset.
Caused by: java.lang.UnsupportedOperationException: Field type 'NUMERIC' is not supported (field 'value')
@ibzib - is there a public dataset that could be used?
Unfortunately, Dataflow SQL is not compatible with the public datasets I know.
Yeah, the table will have to have a schema that is compatible with DF SQL. A canonical public BQ dataset seems like something we should definitely have in the docs, but I couldn't find one.
Yeah Dataflow SQL doesn't support GEOGRAPHY or NUMERIC, but I'm sure there are many public datasets that don't use those types. chicago_taxi_trips.taxi_trips looks like it will work.
Just a note: I can review the Dataflow aspects, but I am not very familiar with Airflow. For example, I am not sure what do_xcom_push is. It would be good to get an Airflow review as well.
I still asked @jaketf to review. Before this change is merged, it will also be reviewed by at least one Apache Airflow committer.
JOB_STATE_STOPPED is not a failed state. (See: https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs#jobstate)
I agree with @aaltay. It looks like the proper place for this status is in AWAITING_STATES.
This is modifying user provided input. Is this your intention?
It might be better to check that labels adhere to these regexes (from the API docs):
Keys must conform to regexp: [\p{Ll}\p{Lo}][\p{Ll}\p{Lo}\p{N}_-]{0,62}
Values must conform to regexp: [\p{Ll}\p{Lo}\p{N}_-]{0,63}
and raise an exception otherwise.
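For illustration, a minimal sketch of such validation, assuming the third-party `regex` package (the stdlib `re` module does not support `\p{Ll}`-style Unicode property classes):

```python
# Minimal sketch of label validation against the API regexes quoted above.
# Assumes the third-party `regex` package (pip install regex).
import regex

LABEL_KEY_PATTERN = regex.compile(r"[\p{Ll}\p{Lo}][\p{Ll}\p{Lo}\p{N}_-]{0,62}")
LABEL_VALUE_PATTERN = regex.compile(r"[\p{Ll}\p{Lo}\p{N}_-]{0,63}")


def validate_labels(labels: dict) -> None:
    """Raise ValueError if any label key or value violates the documented regexes."""
    for key, value in labels.items():
        if not LABEL_KEY_PATTERN.fullmatch(key):
            raise ValueError(f"Invalid label key: {key!r}")
        if not LABEL_VALUE_PATTERN.fullmatch(str(value)):
            raise ValueError(f"Invalid label value for key {key!r}: {value!r}")


validate_labels({"airflow-version": "v2-0-0"})  # passes silently
```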
@jaketf do you know whether these regexes are used for labels by all Google APIs?
I see that Google provides comprehensive information when labels do not comply with these regexes (the regex is included in the error message). However, we could validate labels for the whole Google provider and warn the user before the job fails at runtime.
WDYT @jaketf @mik-laj
On the other hand, I don't think it is a good idea to add unnecessary complexity here and limit the user. Google's logs already provide extensive information about problems with labels when they occur.
This is modifying user provided input. Is this your intention?
@aaltay the reason for this replace was to avoid spaces in the JSON; changing the user input was a side effect.
I changed it to variables['labels'] = json.dumps(variables['labels'], separators=(',', ':')) so the JSON is compact and the labels provided by the user are not touched.
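For reference, the difference in output looks like this (the label names and values below are made up):

```python
import json

labels = {"airflow-version": "v2-0-0", "owner": "data-eng"}

# Default separators insert a space after ':' and ',':
print(json.dumps(labels))
# {"airflow-version": "v2-0-0", "owner": "data-eng"}

# Compact separators remove the spaces without modifying the user's labels:
print(json.dumps(labels, separators=(",", ":")))
# {"airflow-version":"v2-0-0","owner":"data-eng"}
```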
Why is this formatter different from the one at L521?
Could we move all label formatting to the place where the Dataflow job is triggered?
It depends on the SDK that is used. These two SDKs require different argument formats.
We have three related methods.
- start_python_dataflow
- start_java_dataflow
- _start_dataflow
The first two methods are public and SDK-dependent; they are responsible for SDK-specific actions, e.g. environment preparation.
_start_dataflow is an internal method; it starts the system process and supervises its execution. I have the impression that this separation is helpful.
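A rough, hypothetical skeleton of that separation (simplified names only, not the hook's actual code):

```python
import subprocess


class DataflowHookSketch:
    """Illustrative only: public SDK-specific methods delegate to one internal runner."""

    def start_python_dataflow(self, dataflow: str, variables: dict, py_options: list) -> None:
        # SDK-specific preparation for Python pipelines (interpreter, extra flags, ...)
        command = ["python3", dataflow, *py_options, *self._options_to_args(variables)]
        self._start_dataflow(command)

    def start_java_dataflow(self, jar: str, variables: dict) -> None:
        # SDK-specific preparation for Java pipelines (jar invocation, argument format, ...)
        command = ["java", "-jar", jar, *self._options_to_args(variables)]
        self._start_dataflow(command)

    def _start_dataflow(self, command: list) -> None:
        # SDK-agnostic part: start the system process and supervise its execution
        subprocess.run(command, check=True)

    @staticmethod
    def _options_to_args(variables: dict) -> list:
        return [f"--{name}={value}" for name, value in variables.items()]
```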
OK. Thank you for the explanation.
Is the beta still required, do you know?
It is not required.
Thanks. I deleted it.
What does shlex.quote() do?
It adds escape characters if needed.
Example:
If you want to display the contents of the /tmp/ directory, you can use the command ls /tmp/.
If you want to display the contents of the /tmp/i love pizza directory, you have to use the command ls '/tmp/i love pizza'; ls /tmp/i love pizza is an incorrect command. shlex.quote makes the decision about quotation characters. It also supports other cases required by sh, e.g. a quote character inside an argument.
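A quick demonstration:

```python
import shlex

print(shlex.quote("/tmp/"))              # /tmp/  (no unsafe characters, returned as-is)
print(shlex.quote("/tmp/i love pizza"))  # '/tmp/i love pizza'  (spaces force quoting)

# Joining quoted arguments produces a command line that is safe to copy into a shell:
cmd = ["ls", "/tmp/i love pizza"]
print(" ".join(shlex.quote(arg) for arg in cmd))  # ls '/tmp/i love pizza'
```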
but this is only for logging? Do users normally copy paste these commands out of the logs?
This is only for logs. I used it to test this operator. A normal user will not copy it, but it may be helpful for debugging.
These logs are available in Airflow Web UI, so a normal user can easily access them.
Is this really required if it is only for logs? subprocess.run does not need to escape them anyway.
IMHO it is not particularly required but nice to have it :)
log.error for stderr maybe?
stderr often contains developer information; it does not contain only errors. I will change it to log.warning.
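A minimal sketch of what that looks like (the command here is just a placeholder, not the hook's real launch command):

```python
import logging
import subprocess

log = logging.getLogger(__name__)

# Placeholder command; in the hook this would be the Dataflow launch command.
proc = subprocess.run(["echo", "hello"], capture_output=True, text=True)
if proc.stdout:
    log.info(proc.stdout)
if proc.stderr:
    # stderr frequently carries progress/diagnostic output rather than real errors,
    # so log.warning avoids flagging healthy runs as failed in the task log.
    log.warning(proc.stderr)
```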
I changed it :)
What is gcp_conn_id?
Airflow saves all credentials (MySQL, GCP, AWS, and others) in one table in the database, called Connection. gcp_conn_id is the ID of an entry in this table.
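For illustration, connections can also be defined via environment variables instead of the metadata database; the connection ID below is made up:

```python
import os

# Airflow resolves AIRFLOW_CONN_<CONN_ID> environment variables as Connection entries;
# "my_gcp_connection" is a made-up ID, and this minimal URI relies on Application
# Default Credentials.
os.environ["AIRFLOW_CONN_MY_GCP_CONNECTION"] = "google-cloud-platform://"

# An operator constructed with gcp_conn_id="my_gcp_connection" would then look up
# this entry whenever it needs GCP credentials.
```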
Do you want to call this even if job is cancelled/stopped/finished?
Good point. I will skip jobs in the terminal state.
I fixed it in the hook.
/cc @kennknowles
Are these lists used anywhere else? I think that the success/fail distinction is artificial. You cannot really say if CANCELED is a failure or not. Probably the same with DRAINED and UPDATED. Whatever is looking at the job status probably wants the full details.
Airflow does not have the ability to display full information about the status of the job in an external system. We only have two states - SUCCESS/FAILED. What are you proposing then? Can the user specify expected end-states?
I do not know airflow that well. I just tried to see how these variables were used. I missed the place where they actually affect the Airflow result. It is a good idea to let the user say what they expect, and then a failure can be anything else.
Example: we have had real use cases where we deliberately cancel jobs we do not need anymore, and that can be success for streaming jobs.
I think the only certain failed state is JOB_STATE_FAILED.
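For illustration, a hedged sketch of that idea (names are illustrative, not the PR's actual API): let the caller declare which terminal states count as success and treat any other terminal state as a failure.

```python
import time

# Terminal Dataflow job states, per the JobState API reference linked above.
TERMINAL_STATES = {
    "JOB_STATE_DONE",
    "JOB_STATE_FAILED",
    "JOB_STATE_CANCELLED",
    "JOB_STATE_DRAINED",
    "JOB_STATE_UPDATED",
}


def wait_for_terminal_state(get_state, expected_states=frozenset({"JOB_STATE_DONE"}), poll_sleep=10):
    """Poll get_state() until the job is terminal; raise if the state is not expected."""
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            if state not in expected_states:
                raise RuntimeError(f"Dataflow job finished in unexpected state: {state}")
            return state
        time.sleep(poll_sleep)
```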
I think these are reasonable defaults, but I agree it would be nice to let the users set as parameters.
I could even see DRAINING as a failed state (if airflow never expects a human to make manual intervention)
I could see wanting to fail earlier on CANCELLING (rather than waiting til CANCELLED)
I agree with you. I created an issue for it: #11721 and I will work on it in a separate PR.
@ibzib would be a good reviewer here
nit: for var names, is the JOB_STATE_ prefix really necessary on all of these?
IMO it is slightly more readable to drop it, and it causes "stutter": DataflowJobStatus.JOB_STATE_xxx.
A forward-looking thought (though not backwards compatible, so not an immediate suggestion): in Python 3.8+ this set could be more concise with the walrus operator, e.g.
FAILED_END_STATES = {
(FAILED := "JOB_STATE_FAILED"),
(CANCELLED := "JOB_STATE_CANCELLED"),
(STOPPED := "JOB_STATE_STOPPED")
}
I created a separate issue for the stutter: #11205.
This makes me think (larger scope than just the SQL operator): should we have Beam operators that support other runners?
For example, for some users Dataflow does not make sense for smaller/shorter batch jobs, because of the overhead of waiting for workers to come up; for a job shorter than 30 minutes, worker spin-up time can be a 10% performance hit. But they may still want to use Apache Beam (on, say, the Spark runner) and submit to a non-ephemeral cluster (Dataproc, EMR, Spark on k8s, on-prem infra, etc.).
Would this be easy enough to achieve with the Dataproc / EMR / Spark operators?
It will be quite a huge task to write Apache Beam hooks and operators, but it is worth keeping in mind for the future.
In the current quarter, we want to start working on operators for Apache Beam.
nit: this if and the next elif take the same action and could be combined, e.g. replace
if value is None:
with
if value is None or (isinstance(value, bool) and value):
Done :)
Unless you have a good reason to rename this location, I would use region because it is more specific and consistent with Beam/Dataflow usage.
The main reason behind it is to keep consistency across the whole Google provider, which uses the location parameter.
I'm not sure which other GCP products you are referring to, but in Dataflow it's usually --region.
https://cloud.google.com/dataflow/docs/concepts/regional-endpoints
This is an essential feature of Airflow. In Airflow, you can define default arguments that will be applied to all operators, but the parameter name must be consistent across all operators.
default_args = {
'dataflow_default_options': {
'tempLocation': GCS_TMP,
'stagingLocation': GCS_STAGING,
},
'location': 'europe-west3'
}
with models.DAG(
"example_gcp_dataflow_native_java",
schedule_interval=None, # Override to match your needs
start_date=days_ago(1),
tags=['example'],
) as dag_native_java:
start_java_job = DataflowCreateJavaJobOperator(
task_id="start-java-job",
jar=GCS_JAR,
job_name='{{task.task_id}}',
options={
'output': GCS_OUTPUT,
},
poll_sleep=10,
job_class='org.apache.beam.examples.WordCount',
check_if_running=CheckJobRunning.IgnoreJob,
location='europe-west3',
)
# [START howto_operator_bigquery_create_table]
create_table = BigQueryCreateEmptyTableOperator(
task_id="create_table",
dataset_id=DATASET_NAME,
table_id="test_table",
schema_fields=[
{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
{"name": "salary", "type": "INTEGER", "mode": "NULLABLE"},
],
)
# [END howto_operator_bigquery_create_table]
In the above example, the tasks create_table and start-java-job are executed in one location - europe-west3.
Dataflow also uses the word "location" in its API to denote this field.

https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.jobs/get
In Airflow, you can define default arguments that will be applied to all operators, but the parameter name must be consistent across all operators.
Makes sense, thanks for the explanation.
Dataflow has deliberately been trying to move away from using a default location, because many users may not realize that their job is running in us-central1 even if that is not intended.
@ibzib I think we will have to change this for all operators in the future. To keep consistency across all Dataflow operators, I would like to keep it for now.
Nit: parameters itself is one of the arguments that can be passed here (see https://cloud.google.com/dataflow/docs/guides/sql/parameterized-queries). Maybe use "arguments" instead.
This is related to what @mik-laj said here: #8553 (comment)
I added parametrization to the example DAG so it would be a nice hint for users who are unsure how to use it.
Sounds good.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I rebased on the latest master.
The Build Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks$,^Build docs$,^Spell check docs$,^Backport packages$,^Checks: Helm tests$,^Test OpenAPI*.
I replied to all comments, made the requested changes, rebased on master, and made CI happy. IMHO the PR is ready for final review and hopefully to be merged :)
The PR should be OK to be merged with just subset of tests as it does not modify Core of Airflow. The committers might merge it or can add a label 'full tests needed' and re-run it to run all tests if they see it is needed!
Rebased on the latest master.
The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

Make sure to mark the boxes below before creating PR: [x]
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.