AIP-72: Port Registering of Asset Changes to Task SDK on task completion #45924
Conversation
Working on fixing the tests.

With the new changes, tested both legacy and Task SDK DAGs (cc @ashb). Task SDK results: (screenshots omitted)

Looks like the changes in this PR break some test cases for inlet_events. Looking into it.

Legacy without my changes and legacy with my changes produce the same results, so it is likely a change is needed in the test cases.

OK, I think I figured out the reason for the failure; working on a fix.
ashb left a comment:
Looking better. A few tidy-ups to do, but definitely on the right track.
I will resolve the conversations with relevant replies, then rebase & merge this.


closes: #45752
Why?
When a task completes (moves into the success state), asset-related events for that task should be stored in the metadata DB. Currently this is done in airflow/airflow/models/taskinstance.py.
This does a few things: it writes to the asset, asset_alias, asset_alias_asset, asset_alias_asset_event and asset_event tables. This behaviour needs to be ported into the Task SDK.
Approach
The idea is to implement this logic in the patch TI state endpoint (ti_update_state) in the execution API server. The reasoning is that we needn't make an additional API call: when "finishing" a task from the task runner, we can send the relevant details (the task outlets and the outlet events) and take care of the rest in the ti_update_state endpoint.
Interface changes
- TISuccessStatePayload: a new payload to mark a state as success from the task runner in the execution API. We do not want to overload TITerminalStatePayload with additional information, slowing down the API request for no reason.
- task_outlets: translates to ti.task.outlets at execution time, sent from the task runner.
- outlet_events: the events for an outlet object. For example, for Assets it translates to context["outlet_events"][&lt;Asset object&gt;].
- asset_type: the type (class name) of the object the execution API has to deal with. This avoids any extra work on the server side to figure out what kind of object we are dealing with once it has been serialised.
Server side
- ti_update_state: added an additional branch for the success state where we call register_asset_changes, a function similar to the one in airflow/airflow/models/taskinstance.py (line 359 at 051e617).
- For each task_outlet, it registers the events for the different types: Asset events are keyed by tuple[asset uri, extra], and events are generated for those by handling a few cases. Docs: https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/datasets.html#how-to-use-datasetalias
Task Runner
The task runner runs the task as usual and, before finishing, checks for defined task outlets. If task outlets are defined, it populates task_outlets, outlet_events and asset_type for the outlets present:
- For Asset: task_outlets, and outlet_events as events[obj], where events = context["outlet_events"].
- For AssetNameRef and AssetUriRef: task_outlets as needed, and all possible events are populated, since we cannot access the DB to get the model being referenced.
- For AssetAlias: we don't care about the task_outlets, only about the asset_alias_events, so those are populated in outlet_events.
Once this is done, a SucceedTask message is sent to the supervisor.
Supervisor
The supervisor now treats the success state as one of STATES_SENT_DIRECTLY. Once it receives a SucceedTask message from the task runner, it calls the new succeed method on the HTTP client.
HTTP client in task sdk
Introduced a new method, succeed, which calls the ti_patch_state API with TISuccessStatePayload.
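The execution-side flow above can be sketched roughly as follows. The Asset and AssetAlias classes and the payload dict are illustrative stand-ins for the Task SDK's real types, and the name-ref/uri-ref branch is omitted for brevity.

```python
# Sketch of the task runner's outlet handling before it sends SucceedTask:
# classify each outlet by type and populate the success-payload fields.
from dataclasses import dataclass
from typing import Any

# frozen=True so the stand-in outlets are hashable and usable as dict keys,
# mirroring how context["outlet_events"] is keyed by the outlet object.
@dataclass(frozen=True)
class Asset:
    name: str
    uri: str


@dataclass(frozen=True)
class AssetAlias:
    name: str


def build_success_payload(outlets: list[Any], outlet_events: dict[Any, dict]) -> dict:
    task_outlets: list[dict] = []
    events: list[dict] = []
    asset_type = type(outlets[0]).__name__ if outlets else None
    for obj in outlets:
        if isinstance(obj, Asset):
            # Plain assets: serialise the outlet and attach its recorded events.
            task_outlets.append({"name": obj.name, "uri": obj.uri})
            events.append(outlet_events.get(obj, {}))
        elif isinstance(obj, AssetAlias):
            # Aliases contribute only their asset_alias_events; no task_outlets.
            events.append(outlet_events.get(obj, {}))
    return {
        "state": "success",
        "task_outlets": task_outlets,
        "outlet_events": events,
        "asset_type": asset_type,
    }
```

The supervisor would then hand this payload to the client's succeed method, which issues the ti_patch_state request.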
Testing
Using the DAG: https://github.com/apache/airflow/blob/main/airflow/example_dags/example_asset_alias.py
DAGs:

Non-aliased assets
Event: (screenshot omitted)
Consumer DAG also gets triggered: (screenshot omitted)
Aliased assets
Event: (screenshot omitted)
This now triggers two DAGs, asset_alias_example_alias_consumer and asset_s3_bucket_consumer, since the aliased asset was updated.
Fails due to inlet_events still not being ported: (screenshot omitted)
DB level checks
Asset created: (screenshot omitted)
Asset alias: (screenshot omitted)
Asset alias / asset mapping: (screenshot omitted)
Asset event (first one triggered by the asset, second by the alias): (screenshot omitted)
Alias / event mapping: (screenshot omitted)
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.