Skip to content

Conversation

@dolfinus
Copy link
Contributor

@dolfinus dolfinus commented Jul 18, 2024

closes: #40798

  • Add AirflowDagRunFacet which copies AirflowRunFacet, but without fields task, taskInstance, taskUuid.
  • Add the same facet to DAG run events with eventType=START.

This allows to collect data like dag tags, dag schedule interval, dagRun id and type on OpenLineage consumer side. Before that, this information was available only for task events.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@dolfinus dolfinus force-pushed the feature/openlineage-dag-run-facet branch 2 times, most recently from b07fccd to 5f30877 Compare July 18, 2024 12:21
@dolfinus dolfinus marked this pull request as ready for review July 18, 2024 12:24
@dolfinus dolfinus requested a review from mobuchowski as a code owner July 18, 2024 12:24
@mobuchowski
Copy link
Contributor

@dolfinus I think this should be a different facet or added to AirflowJobFacet that we attach to DagRun events.

@dolfinus
Copy link
Contributor Author

dolfinus commented Jul 18, 2024

Added to AirflowJobFacet that we attach to DagRun events.

DagRun information is a part of RunFacet, as it has fields like start_date, run_id, which are not JobFacet.

I think this should be a different facet

Honestly, I'd rather split these fields to different facets, like:

job:
  facets:
    airflow_dag: AirflowDagJobFacet(...)  # previously airflow:dag
    airflow_task: AirflowTaskJobFacet(...)  # previously airflow:task
run:
  facets:
    airflow_dagRun: AirflowDagRunFacet(...)  # previously airflow:dagRun
    airflow_taskInstance: AirflowTaskInstanceFacet(...)  # previously airflow:taskInstance
    # throw away taskUuuid, it just ruplicates runId

But this is not backwards compatible.

@mobuchowski
Copy link
Contributor

Why not attach it to some new facet as AirflowDagRunFacet? Basically, AirflowRunFacet is tightly coupled to what we have during task instance execution, and don't think making it just a bag of possible things makes sense.

I'd rather have the same DagRun information attached to different facets in TaskInstance level events and DagRun events.

@dolfinus
Copy link
Contributor Author

dolfinus commented Jul 18, 2024

Why not attach it to some new facet as AirflowDagRunFacet?

What should be the key of this facet? "airflow" or something else? For OL consumer, having runEvents with facets stored with the same key, but having different fields for Task and DAG runs, is not very comfortable.

Should AirflowRunFacet be renamed to AirflowTaskRunFacet, for consistency?

@kacpermuda
Copy link
Contributor

Why not attach it to some new facet as AirflowDagRunFacet?

What should be the key of this facet? "airflow" or something else? For OL consumer, having runEvents with facets stored with the same key, but having different fields for Task and DAG runs, is not very comfortable.

Should AirflowRunFacet be renamed to AirflowTaskRunFacet, for consistency?

We can think of a new name, that would not be confusing. I don't think we even could go with another run facet with airflow as key, as from spec perspective it would be a duplicate, there is no distinction between DAG and Task run facet from the spec perspective.

I think the renaming you propose would not be a big issue, as it would not change the spec, only the import, so we would have to keep old name for compatibility. Personally, I would avoid doing any changes to AirflowRunFacet if possible. I think renaming will not be necessary if we come up with a better name for the new one 😄

@dolfinus
Copy link
Contributor Author

dolfinus commented Jul 18, 2024

My current plan is:

  • Mark AirflowRunFacet as deprecated, but keep it for backward compatibility (al least for several next releases).
  • Add new facet airflow_task: AirflowTaskRunFacet which would just copy AirflowRunFacet, but with a new name and key.
  • Add new facet airflow_dag: AirflowDagRunFacet which would implement changes for OpenLineage: Add AirflowRunFacet to DAG run events #40798.

What do you think?

@kacpermuda
Copy link
Contributor

My current plan is:

  • Mark AirflowRunFacet as deprecated, but keep it for backward compatibility (al least for several next releases).
  • Add new facet airflow_task: AirflowTaskRunFacet which would just copy AirflowRunFacet, but with a new name and key.
  • Add new facet airflow_dag: AirflowDagRunFacet which would implement changes for OpenLineage: Add AirflowRunFacet to DAG run events #40798.

What do you think?

I think i would proceed with the last point for now in this PR, and create a discussion for the first two so that we can decide if / how we want to make it less confusing for users. The AirflowRunFacet is the primary facet that users rely on, so we need to handle it with care. Before proceeding with any actions, we should have an in-depth discussion, as deprecating it without such a discussion would not be advisable. 😄

We can also discuss here what would be the best key for this new facet: airflowDag, airflowRun, airflowDagRun?

@dolfinus dolfinus marked this pull request as draft July 18, 2024 15:48
@dolfinus dolfinus force-pushed the feature/openlineage-dag-run-facet branch 3 times, most recently from b705273 to 2b5fc66 Compare July 19, 2024 09:18
@dolfinus dolfinus marked this pull request as ready for review July 19, 2024 09:59
@dolfinus
Copy link
Contributor Author

dolfinus commented Jul 19, 2024

Updated implementation - reverted all the changes to AirflowRunFacet, added new facet airflow_dagRun: AirflowDagRunFacet. Using key airflow_dagRun instead of airflowDagRun as recommended by specification.

Regarding changing AirflowRunFacet name, key and probably content I've created discussion #40886.

Copy link
Contributor

@kacpermuda kacpermuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, left some comments

@dolfinus dolfinus force-pushed the feature/openlineage-dag-run-facet branch 3 times, most recently from 39bfdcf to d1afb37 Compare July 19, 2024 12:55
@dolfinus dolfinus force-pushed the feature/openlineage-dag-run-facet branch 2 times, most recently from e753f78 to 18d9815 Compare July 19, 2024 13:16
@dolfinus dolfinus changed the title openlineage: Add AirflowRunFacet for dag runEvents openlineage: Add AirflowDagRunFacet to dag runEvents Jul 19, 2024
Copy link
Contributor

@kacpermuda kacpermuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, left some comments

@dolfinus dolfinus force-pushed the feature/openlineage-dag-run-facet branch from 18d9815 to e942dad Compare July 22, 2024 09:17
@dolfinus dolfinus force-pushed the feature/openlineage-dag-run-facet branch from e942dad to 6529dbf Compare July 22, 2024 09:18
@mobuchowski mobuchowski merged commit 9ec9eb7 into apache:main Jul 23, 2024
@dolfinus dolfinus deleted the feature/openlineage-dag-run-facet branch July 23, 2024 12:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OpenLineage: Add AirflowRunFacet to DAG run events

3 participants