@luoyuliuyin luoyuliuyin commented Sep 30, 2024

closes: #42581

Problem Description

The trigger_rule of task_one_success is one_success. task_one_success is skipped even though one of its upstream tasks has not run yet. Under one_success semantics, task_one_success should still be able to run.

In this scenario, the schedule_after_task_execution option is enabled, which means that after an upstream task finishes, the worker that ran it tries to schedule the downstream tasks itself.

This problem may occur when task_1 finishes before task_run. More specifically, it occurs when task_1 finishes and its worker successfully schedules the downstream tasks.

Related Code

Below is the code in question.
When task_1 finishes, it tries to schedule its downstream tasks. First, a partial DAG is generated:

partial_dag = task.dag.partial_subset(
    task.downstream_task_ids,
    include_downstream=True,
    include_upstream=False,
    include_direct_upstream=True,
)

task => "task_1"
task.downstream_task_ids => "task_2"

include_downstream=True => ["task_2"]
include_upstream=False => ["task_2"]
include_direct_upstream=True => ["task_2", "task_skip", "task_one_success", "task_1"]

So the final partial_dag is ["task_2", "task_skip", "task_one_success", "task_1"]

This partial_dag is incomplete because task_run, the other upstream task of task_one_success, is not in it. Specifically, include_upstream should not be False.
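The selection above can be reproduced with a small standalone model. This is a minimal sketch, not Airflow's partial_subset: the edge list is a hypothetical reconstruction chosen so that the computed sets match the lists quoted in this description (the real DAG is only shown in the screenshots), and toy_partial_subset, direct, and flat_relatives are illustrative names.

```python
from collections import deque

# Hypothetical (upstream, downstream) edges. This is an assumption chosen so
# the results match the task lists quoted above; it is not the real DAG file.
EDGES = [
    ("branch", "task_run"),
    ("branch", "task_skip"),
    ("task_run", "task_one_success"),
    ("task_skip", "task_one_success"),
    ("task_skip", "task_2"),
    ("task_one_success", "task_2"),
    ("task_1", "task_2"),
]


def direct(task_id, upstream):
    """Return the direct upstream (or downstream) neighbours of task_id."""
    if upstream:
        return {u for u, d in EDGES if d == task_id}
    return {d for u, d in EDGES if u == task_id}


def flat_relatives(task_id, upstream):
    """Return all transitive upstream (or downstream) relatives of task_id."""
    seen, queue = set(), deque([task_id])
    while queue:
        for rel in direct(queue.popleft(), upstream):
            if rel not in seen:
                seen.add(rel)
                queue.append(rel)
    return seen


def toy_partial_subset(task_ids, include_downstream, include_upstream,
                       include_direct_upstream):
    """Simplified model of the subset selection described above."""
    selected = set(task_ids)
    for task_id in task_ids:
        if include_downstream:
            selected |= flat_relatives(task_id, upstream=False)
        if include_upstream:
            selected |= flat_relatives(task_id, upstream=True)
    if include_direct_upstream:
        # One level only: direct upstreams of everything selected so far.
        for task_id in list(selected):
            selected |= direct(task_id, upstream=True)
    return selected


# task_1 finished; its downstream_task_ids is ["task_2"].
subset = toy_partial_subset(
    ["task_2"],
    include_downstream=True,
    include_upstream=False,
    include_direct_upstream=True,
)
# Upstreams of task_one_success that are still visible inside the subset.
visible_upstreams = direct("task_one_success", upstream=True) & subset
print(sorted(subset))
print(sorted(visible_upstreams))
```

In this model task_run is absent from the subset, so task_one_success appears to have task_skip as its only upstream.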

Solution

The subgraph should instead be built with include_upstream=True:

partial_dag = task.dag.partial_subset(
    task.downstream_task_ids,
    include_downstream=True,
    include_upstream=True,
    include_direct_upstream=True,
)

task => "task_1"
task.downstream_task_ids => "task_2"

include_downstream=True => ["task_2"]
include_upstream=True => ["task_2", "task_skip", "task_one_success", "task_1", "task_run", "branch"]
include_direct_upstream=True => ["task_2", "task_skip", "task_one_success", "task_1", "task_run", "branch"]

So the final partial_dag is ["task_2", "task_skip", "task_one_success", "task_1", "task_run", "branch"]


Subgraph pruning is only performed when schedule_after_task_execution is enabled. Scheduling done by the regular scheduler does not have this problem.

@romsharon98
Contributor

Can you add a test that prevents regression?

@luoyuliuyin
Contributor Author

luoyuliuyin commented Sep 30, 2024

Can you add a test that prevents regression?

Thank you for the suggestion. I've added a test to prevent regression. Please check the latest commit.

@jscheffl jscheffl added this to the Airflow 2.10.3 milestone Sep 30, 2024
@jscheffl jscheffl added area:Scheduler including HA (high availability) scheduler type:bug-fix Changelog: Bug Fixes area:core labels Sep 30, 2024
@luoyuliuyin
Contributor Author

Adding include_upstream=True is still unsafe. Here is a bad case where task_one_success is ignored again. Since partial_subset contributes little to performance, I recommend removing it completely.

task => "task_0"
task.downstream_task_ids => "task_1"

include_downstream=True => ["task_1", "task_2"]
include_upstream=True => ["task_0", "task_1", "task_2"]
include_direct_upstream=True => ["task_0", "task_1", "task_2", "task_one_success", "task_skip"]

So the final partial_dag is ["task_0", "task_1", "task_2", "task_one_success", "task_skip"]
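Why a hidden upstream matters can be sketched with a simplified model of how a one_success task is evaluated. This is an illustration under assumptions, not Airflow's trigger-rule code: toy_one_success and the state strings are hypothetical, and the example assumes task_one_success has task_skip and task_run as upstreams, with task_skip already skipped and task_run still running.

```python
# States treated as "finished" in this toy model (an assumption).
FINISHED = {"success", "failed", "skipped", "upstream_failed"}


def toy_one_success(upstream_states):
    """Decide what a one_success task should do, given its upstream states.

    Returns "run", "wait", "skipped", or "upstream_failed".
    """
    states = list(upstream_states.values())
    if "success" in states:
        # At least one upstream succeeded: the task may run.
        return "run"
    if all(state in FINISHED for state in states):
        # Every upstream is done and none succeeded.
        if all(state == "skipped" for state in states):
            return "skipped"
        return "upstream_failed"
    # Some upstream may still succeed later.
    return "wait"


# What the full DAG sees versus what a pruned subgraph without task_run sees.
full_view = {"task_skip": "skipped", "task_run": "running"}
pruned_view = {k: v for k, v in full_view.items() if k != "task_run"}

full_decision = toy_one_success(full_view)
pruned_decision = toy_one_success(pruned_view)
print(full_decision, pruned_decision)
```

With the full view the task keeps waiting for task_run; with the pruned view every visible upstream is skipped, so the task is skipped as well. Any pruning that can hide an unfinished upstream has this risk, which is the argument for evaluating the whole DAG.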

@shahar1 shahar1 self-requested a review October 11, 2024 06:17
@shahar1
Contributor

shahar1 commented Oct 11, 2024

Also, DB tests currently fail

@shahar1 shahar1 left a comment


LGTM - I'm ok with merging it after resolving my last nitpick.
@potiuk / @uranusjr / @ephraimbuddy - any objections?

@potiuk
Member

potiuk commented Oct 14, 2024

I have absolutely no problem with it. The use of partial_subset has already caused serialization problems for a number of users, so removing it seems like a good idea, and performance-wise it should not be problematic.

@ashb @uranusjr ? Do you have any objections?

@potiuk potiuk merged commit 3fceaa6 into apache:main Oct 23, 2024
potiuk pushed a commit to potiuk/airflow that referenced this pull request Oct 23, 2024
* fix schedule_downstream_tasks bug

* remove partial_subset

* Update comment

---------

Co-authored-by: 维湘 <jiazhao.ljz@alibaba-inc.com>
(cherry picked from commit 3fceaa6)
potiuk added a commit that referenced this pull request Oct 23, 2024
* fix schedule_downstream_tasks bug

* remove partial_subset

* Update comment

---------

Co-authored-by: 维湘 <jiazhao.ljz@alibaba-inc.com>
(cherry picked from commit 3fceaa6)

Co-authored-by: luoyuliuyin <luoyuliuyin@gmail.com>
harjeevanmaan pushed a commit to harjeevanmaan/airflow that referenced this pull request Oct 23, 2024
PaulKobow7536 pushed a commit to PaulKobow7536/airflow that referenced this pull request Oct 24, 2024
utkarsharma2 pushed a commit that referenced this pull request Oct 24, 2024
ellisms pushed a commit to ellisms/airflow that referenced this pull request Nov 13, 2024
@coding-dragon520

coding-dragon520 commented Jan 10, 2025

The log shows "0 downstream tasks scheduled from follow-on schedule check", but my downstream tasks can actually be scheduled. This is a production problem; could someone please help take a look? It occurs occasionally: the upstream task succeeds, but the downstream task is not scheduled.

@shahar1
Contributor

shahar1 commented Jan 10, 2025

The log shows "0 downstream tasks scheduled from follow-on schedule check", but my downstream tasks can actually be scheduled. This is a production problem; could someone please help take a look? It occurs occasionally: the upstream task succeeds, but the downstream task is not scheduled.

Thanks for reporting! Could you please create a GitHub issue with a minimal example to reproduce it (considering the latest Airflow version)?


Labels

area:core area:Scheduler including HA (high availability) scheduler type:bug-fix Changelog: Bug Fixes


Development

Successfully merging this pull request may close these issues.

one_success trigger_rule scheduling exception

6 participants