Conversation

@kacpermuda
Contributor

@kacpermuda kacpermuda commented Nov 29, 2024

This PR introduces a completely new feature to the OpenLineage integration. It will NOT impact users who are not using OpenLineage or have not explicitly enabled this feature (it is disabled by default).

TL;DR

When explicitly enabled by the user for supported operators, we will automatically inject parent job information into the Spark job properties. For example, when submitting a Spark job using the DataprocSubmitJobOperator, we will include details about the Airflow task that triggered it, so that the OpenLineage Spark integration can include them in the parentRunFacet.

Why?

To enable full pipeline visibility and track dependencies between jobs in OpenLineage, we use the parentRunFacet. This facet stores the identifier of the parent job that triggered the current job. This approach works across various integrations; for example, you can pass Airflow's job identifier to a Spark application if it was triggered by an Airflow operator. Currently, this process requires manual configuration by the user, such as leveraging macros:

DataprocSubmitJobOperator(
    task_id="my_task",
    # ...
    job={
        # ...
        "spark_job": {
            # ... Spark properties belong under the job type's "properties" field
            "properties": {
                "spark.openlineage.parentJobNamespace": "{{ macros.OpenLineageProviderPlugin.lineage_job_namespace() }}",
                "spark.openlineage.parentJobName": "{{ macros.OpenLineageProviderPlugin.lineage_job_name(task_instance) }}",
                "spark.openlineage.parentRunId": "{{ macros.OpenLineageProviderPlugin.lineage_run_id(task_instance) }}",
            },
        },
    },
)

Understanding how various Airflow operators configure Spark allows us to automatically inject parent job information.

Controlling the Behavior

We provide users with a flexible control mechanism to manage this injection, combining per-operator enablement with a global fallback configuration. This design is inspired by the deferrable argument in Airflow.

ol_inject_parent_job_info: bool = conf.getboolean(
    "openlineage", "spark_inject_parent_job_info", fallback=False
)

Each supported operator will include an argument like ol_inject_parent_job_info, which defaults to the global configuration value of openlineage.spark_inject_parent_job_info. This approach allows users to:

  1. Control behavior on a per-job basis by explicitly setting the argument.
  2. Rely on a consistent default configuration for all jobs if the argument is not set.

This design ensures both flexibility and ease of use, enabling users to fine-tune their workflows while minimizing repetitive configuration. I am aware that adding an OpenLineage-related argument to the operator will affect all users, even those not using OpenLineage, but since it defaults to False and can be ignored, I hope this will not pose any issues.
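
For illustration, here is a minimal usage sketch. It assumes the argument name ol_inject_parent_job_info used in this description and purely illustrative job values; the final merged argument name may differ:

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# Illustrative Dataproc job definition.
spark_job = {
    "reference": {"project_id": "my-project"},
    "placement": {"cluster_name": "my-cluster"},
    "spark_job": {
        "main_class": "org.example.MyJob",
        "jar_file_uris": ["gs://my-bucket/my-job.jar"],
    },
}

# The global default comes from [openlineage] spark_inject_parent_job_info,
# e.g. AIRFLOW__OPENLINEAGE__SPARK_INJECT_PARENT_JOB_INFO=true.
# The per-task argument overrides that default either way:
DataprocSubmitJobOperator(
    task_id="my_task",
    region="europe-west1",
    project_id="my-project",
    job=spark_job,
    ol_inject_parent_job_info=True,  # inject parent job info for this task only
)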

How?

The implementation is divided into three parts for better organization and clarity:

  1. Operator's Code (including the execute method):
    Contains minimal logic to avoid overwhelming users who are not actively working with OpenLineage.

  2. Google's Provider OpenLineage Utils File:
    Handles the logic for accessing Spark properties specific to a given operator or job.

  3. OpenLineage Provider's Utils:
    Responsible for creating/extracting all necessary information in a format compatible with the OpenLineage Spark integration. The actual modification of the Spark properties also happens here.

For some operators, parts 1 and 2 may both live in the operator's code. In general, the specific operator/provider knows how to access the Spark properties, while the OpenLineage provider knows what to inject and performs the injection itself.
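
As a rough sketch of how parts 2 and 3 might fit together (the helper name and exact module layout here are assumptions, not the provider's actual API; the imported macro functions are the ones backing the OpenLineageProviderPlugin macros):

from __future__ import annotations

from airflow.providers.openlineage.plugins.macros import (
    lineage_job_name,
    lineage_job_namespace,
    lineage_run_id,
)


def inject_parent_job_info(properties: dict[str, str], task_instance) -> dict[str, str]:
    """Return a copy of the Spark properties with parent job info added (hypothetical helper)."""
    # Respect any parent information the user has already configured manually.
    if any(key.startswith("spark.openlineage.parent") for key in properties):
        return properties
    return {
        **properties,
        "spark.openlineage.parentJobNamespace": lineage_job_namespace(),
        "spark.openlineage.parentJobName": lineage_job_name(task_instance),
        "spark.openlineage.parentRunId": lineage_run_id(task_instance),
    }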

Next steps

  1. Expand Operator Coverage:
    Increase support for additional operators by extending the parent job information injection to cover more cases.

  2. Automate Transport Configuration:
    Implement similar automation for transport configurations, starting with HTTP, to streamline the integration process.
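
For context on step 2: the OpenLineage Spark integration reads its transport settings from Spark properties, so an automated HTTP transport injection would presumably set keys like the ones below (a hypothetical illustration of future work, not something this PR implements; the endpoint is made up):

# Transport configuration keys as documented by the OpenLineage Spark integration.
transport_properties = {
    "spark.openlineage.transport.type": "http",
    "spark.openlineage.transport.url": "https://my-openlineage-backend:5000",
}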



@kacpermuda kacpermuda force-pushed the feat-ol-inject-parent-info-dataproc branch from ee38e20 to 7ed0872 on November 29, 2024 14:14
@kacpermuda kacpermuda force-pushed the feat-ol-inject-parent-info-dataproc branch 2 times, most recently from e96ac0e to ac60867 on December 2, 2024 13:30
@mobuchowski mobuchowski added the "full tests needed" label Dec 2, 2024
@kacpermuda kacpermuda force-pushed the feat-ol-inject-parent-info-dataproc branch from ac60867 to 6b1a2a0 on December 2, 2024 16:49
@kacpermuda kacpermuda force-pushed the feat-ol-inject-parent-info-dataproc branch from 6b1a2a0 to 6eed40c on December 3, 2024 10:54
@kacpermuda kacpermuda marked this pull request as ready for review December 3, 2024 10:55
@kacpermuda kacpermuda force-pushed the feat-ol-inject-parent-info-dataproc branch from 6eed40c to c63e132 on December 3, 2024 11:10
@kacpermuda kacpermuda force-pushed the feat-ol-inject-parent-info-dataproc branch from c63e132 to 82abaee on December 3, 2024 15:11
Contributor

@MaksYermak MaksYermak left a comment

LGTM

@ahidalgob
Contributor

Hi, are we planning on also emitting the lineage events from Airflow itself? I think we have other services that emit lineage (for example, BigQuery) where we also still emit this lineage from Airflow. For example, in Composer, we generate the events based on the SQL query of Hive, SparkSQL, Presto and Trino jobs.

@kacpermuda
Contributor Author

Hey @ahidalgob, just to confirm I understand you correctly: are you asking if we plan to emit the lineage from the child job (in this case, Spark) directly from Airflow? As far as I'm aware, there are no plans for that at the moment. In my opinion, it's a bit more complex to implement compared to a SQL-based approach, where we can parse the SQL on the Airflow side and occasionally patch it with API calls to BigQuery or similar solutions. Extracting lineage from a Spark jar, which can do virtually anything, is more challenging. For now, I'm focusing on making it easier for users to configure the Spark integration, without changing the entity responsible for emitting the events.

@ahidalgob
Contributor

Hi @kacpermuda, what I meant was exactly what you describe: parsing the SQL query on the Airflow side and generating the inputs/outputs. Right now, as you confirmed, this PR only configures how Spark generates the lineage events but doesn't generate them from the Airflow side, right?

@kacpermuda
Contributor Author

Correct, this feature is only about automatically passing some OpenLineage information from Airflow to Spark to automate the process of configuring the OpenLineage/Spark integration.

@ahidalgob
Contributor

Thanks @kacpermuda, we would like to contribute the logic we used in Composer to generate the events from the SQL queries in other DataprocSubmitJob types. I think this PR and what we want to contribute are not incompatible; does that sound good to you? (also @mobuchowski)

@mobuchowski
Contributor

@ahidalgob I don't think that's right, since you can submit a JAR with arbitrary code rather than just SQL. Also, even for SQL jobs, rather than using a parser (which is a best-effort solution), we can use the Spark integration, which actually understands the submitted jobs. Airflow events here can contribute the proper hierarchy.

@kacpermuda
Contributor Author

kacpermuda commented Dec 10, 2024

The failing Trino test comes from changes made in #44717. Waiting for the fix PR, which is on the way from the author.

@kacpermuda kacpermuda force-pushed the feat-ol-inject-parent-info-dataproc branch from 18bc511 to 9585316 on December 11, 2024 09:19
@michalmodras

To make sure I understand the target state:

  • With Kacper's changes, some additional metadata about the parent job (the Airflow DAG / task in this case) will be passed to the Spark job and emitted in an OpenLineage event to an OpenLineage-supporting catalog by the Spark job itself.
  • Regardless of that, an OpenLineage event at the Airflow level can/will be emitted.

I think it's fair for each layer of orchestration to emit the metadata it has access to: for example, depending on the Spark job type/implementation, the low-level information about the Spark execution, or, in the case of Airflow, information about the DAG/task/Airflow deployment.

For Airflow itself to construct such a lineage event, Airflow needs to be aware of the input/output assets (as long as we cannot link lineage events by process identifier alone). SQL parsing can be one way to get this information for SQL-like jobs; for other types of jobs (not necessarily Spark jobs) we can, for example, query the service the operator is integrated with (e.g. for BigQuery jobs, we could query the BigQuery API to get that information and emit an event linking input/output assets with the BigQuery job id and the DAG/task/Airflow deployment id).

@kacpermuda
Contributor Author

> To make sure I understand the target state:
>
> With Kacper's changes, some additional metadata about the parent job (the Airflow DAG / task in this case) will be passed to the Spark job and emitted in an OpenLineage event to an OpenLineage-supporting catalog by the Spark job itself.
> Regardless of that, an OpenLineage event at the Airflow level can/will be emitted.

Correct, that is the target state in my opinion; Airflow events will still be emitted without any changes. For now, we are simply automating the transfer of some additional information to the Spark integration (people have been doing this manually until now with the OL-provided macros).

@kacpermuda kacpermuda force-pushed the feat-ol-inject-parent-info-dataproc branch 3 times, most recently from a6433b8 to bfb553b on December 18, 2024 15:05
…bOperator

Signed-off-by: Kacper Muda <mudakacper@gmail.com>
@kacpermuda kacpermuda force-pushed the feat-ol-inject-parent-info-dataproc branch from bfb553b to d525026 on December 18, 2024 17:15
@mobuchowski mobuchowski merged commit 04ccef9 into apache:main Dec 19, 2024
137 checks passed
@kacpermuda kacpermuda deleted the feat-ol-inject-parent-info-dataproc branch December 19, 2024 13:01
got686-yandex pushed a commit to got686-yandex/airflow that referenced this pull request Jan 30, 2025
…bOperator (apache#44477)

Signed-off-by: Kacper Muda <mudakacper@gmail.com>

Labels

area:providers, full tests needed, provider:common-compat, provider:google, provider:openlineage, AIP-53
