
DagRuns are marked as failed as soon as one task fails #7939

@dimberman

Description

Apache Airflow version: 1.7.1.2

What happened:

#1514 added a `verify_integrity` function that greedily creates `TaskInstance` objects for all tasks in a DAG.

This does not interact well with the assumptions in the new `update_state` function. The guard `if len(tis) == len(dag.active_tasks)` is no longer effective: in the old world of lazily created TaskInstances, that branch only ran once all the tasks in the DAG had run. Now it runs on every scheduler pass, so as soon as one task in a run fails, the whole DagRun is marked failed. This is bad, since the scheduler stops processing the DagRun after that.
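To make the failure mode concrete, here is a minimal sketch (illustrative only, not the actual Airflow source: the helper name `update_state_buggy` and the plain string states stand in for the real `DagRun.update_state` and `State` constants):

```python
def update_state_buggy(run_state, ti_states, n_active_tasks):
    # Because verify_integrity now pre-creates a TaskInstance for every
    # task, this guard is satisfied on every scheduler pass, not only
    # after all tasks have actually run.
    if len(ti_states) == n_active_tasks:
        if 'failed' in ti_states:
            # A single failed task fails the whole run even though the
            # remaining tasks are still queued/scheduled/up_for_retry.
            return 'failed'
    return run_state

# Three TaskInstances exist up front; only one has failed, yet the
# run is marked failed and the scheduler stops processing it.
print(update_state_buggy('running', ['failed', 'queued', 'queued'], 3))  # -> failed
```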

In retrospect, the old code was also buggy: if your DAG ends with a bunch of queued tasks, the DagRun could be marked as failed prematurely.

I suspect the fix is to update the guard to look only at tasks whose state is `success` or `failed`; otherwise we are evaluating, and failing, the DagRun based on `up_for_retry`/`queued`/`scheduled` tasks.
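Under the same simplified model, the suggested guard change might look like this (again a sketch of the proposal, not the real Airflow code):

```python
FINISHED = {'success', 'failed'}

def update_state_fixed(run_state, ti_states, n_active_tasks):
    # Only terminal states count toward the guard; queued, scheduled,
    # and up_for_retry TaskInstances no longer trigger a verdict.
    finished = [s for s in ti_states if s in FINISHED]
    if len(finished) == n_active_tasks:
        return 'failed' if 'failed' in finished else 'success'
    return run_state  # some tasks are still pending; keep the run alive

print(update_state_fixed('running', ['failed', 'queued', 'queued'], 3))    # -> running
print(update_state_fixed('running', ['failed', 'success', 'success'], 3))  # -> failed
```

This would also fix the older premature-failure case above: a run that ends with queued tasks stays running until those tasks actually finish.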

Anything else we need to know:

Moved here from https://issues.apache.org/jira/browse/AIRFLOW-441
