
Conversation

@huozhanfeng (Contributor) commented Jul 29, 2021

A backfill job fails to run when tasks run into the rescheduling state.
The error log is as follows (from issue #13322):

Traceback (most recent call last):
  File "/opt/conda/bin/airflow", line 8, in 
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/airflow/__main__.py", line 40, in main
    args.func(args)
  File "/opt/conda/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 48, in command
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/airflow/utils/cli.py", line 89, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/airflow/cli/commands/dag_command.py", line 103, in dag_backfill
    dag.run(
  File "/opt/conda/lib/python3.8/site-packages/airflow/models/dag.py", line 1701, in run
    job.run()
  File "/opt/conda/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 237, in run
    self._execute()
  File "/opt/conda/lib/python3.8/site-packages/airflow/utils/session.py", line 65, in wrapper
    return func(*args, session=session, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/airflow/jobs/backfill_job.py", line 799, in _execute
    self._execute_for_run_dates(
  File "/opt/conda/lib/python3.8/site-packages/airflow/utils/session.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/airflow/jobs/backfill_job.py", line 722, in _execute_for_run_dates
    processed_dag_run_dates = self._process_backfill_task_instances(
  File "/opt/conda/lib/python3.8/site-packages/airflow/utils/session.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/airflow/jobs/backfill_job.py", line 620, in _process_backfill_task_instances
    self._update_counters(ti_status=ti_status)
  File "/opt/conda/lib/python3.8/site-packages/airflow/utils/session.py", line 65, in wrapper
    return func(*args, session=session, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/airflow/jobs/backfill_job.py", line 211, in _update_counters
    ti_status.running.pop(key)
KeyError: TaskInstanceKey(dag_id='dag_id', task_id='task_name', execution_date=datetime.datetime(2020, 12, 15, 0, 0, tzinfo=Timezone('UTC')), try_number=2)

The root cause is that the try_number field does not increase when a task runs into the rescheduling state, but a reduce operation is still performed on try_number, which leads to the KeyError above.

Currently I can't think of a good unit test for it, so I'm only posting the code here to help anyone who is affected by this problem.
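To make the root cause above concrete, here is a minimal illustrative sketch (not the actual diff of this PR) of defensive key handling. The pop_running helper name is hypothetical; the only assumption is that the key stored in ti_status.running can differ from the task instance's key by one try_number when the task was rescheduled.

from airflow.models.taskinstance import TaskInstanceKey


def pop_running(running: dict, key: TaskInstanceKey):
    # Hypothetical helper, for illustration only: remove `key` from the dict of
    # running task instances without raising KeyError when the stored key has
    # an adjacent try_number (a rescheduled task never incremented it).
    if key in running:
        return running.pop(key)
    reduced_key = key._replace(try_number=max(key.try_number - 1, 1))
    return running.pop(reduced_key, None)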

@boring-cyborg bot added the area:Scheduler label Jul 29, 2021
@huozhanfeng (Contributor, Author) commented Jul 30, 2021

Anyone who is familiar with this code block is welcome to help improve the unit test. The change has been running for two weeks in our production environment and its stability is OK.

@uranusjr (Member)

Could you rebase this to the latest main? We had some disruptions in CI that caused the tests to fail in this PR.

@huozhanfeng force-pushed the backfill_rescheduler_bug branch from 2ad74ef to 1d7062a on July 31, 2021 08:24
@huozhanfeng (Contributor, Author)

Could you rebase this to the latest main? We had some disruptions in CI that caused the tests to fail in this PR.

Sure, I will rebase it. Thanks for your kind reminder.

@uranusjr (Member) commented Aug 2, 2021

CI seems to pass (the failure is due to insufficient resources and is not related to this PR). I have no idea whether this is the right change, so let's see what people more familiar with this part of the code have to say…

@huozhanfeng (Contributor, Author)

CI seems to pass (the failure is due to insufficient resources and is not related to this PR). I have no idea whether this is the right change, so let's see what people more familiar with this part of the code have to say…

Sure, thank you a lot.

@galuszkak (Contributor) commented Sep 2, 2021

Hey,

It looks like we hit the same problem described in #13322. We would be really grateful if any of the reviewers could look into this patch. (cc @ashb @kaxil @XD-DENG)

Basically, the backfill feature is heavily used in our production deployment of Airflow, and this issue is blocking us from moving to Airflow 2.x.

Thanks to all the contributors who are working really hard on this project.

@ashb (Member) commented Sep 2, 2021

Since this is a bug fix, please could you add a unit test to prevent future regressions of this?

@huozhanfeng (Contributor, Author)

Since this is a bug fix, please could you add a unit test to prevent future regressions of this?

Ehh... I have tried, but it's difficult to write a unit test for it.

@pgodek commented Sep 6, 2021

Hi,
we have tested this fix and it does not seem to solve the issue.

@huozhanfeng (Contributor, Author) commented Sep 6, 2021

Hi,
we have tested this fix and it does not seem to solve the issue.

Maybe there are other cases that cause the issue... You can print the contents of ti.key and ti_status.running in the _update_counters function to get the details.
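A minimal sketch of that debugging suggestion, assuming it is called with the task instance and ti_status objects available inside _update_counters; the dump_running_state name is made up for illustration.

import logging

log = logging.getLogger(__name__)


def dump_running_state(ti, ti_status):
    # Log the current task instance key and the keys tracked as running so a
    # try_number mismatch shows up directly in the backfill logs.
    log.info("ti.key=%s", ti.key)
    log.info("ti_status.running keys=%s", list(ti_status.running.keys()))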

@pgodek commented Sep 15, 2021

Maybe there are other cases that cause the issue... You can print the contents of ti.key and ti_status.running in the _update_counters function to get the details.

Hi, I was not clear earlier. The fix from this PR resolves the KeyError: TaskInstanceKey ... issue,
but the task keeps looping until the timeout.
The attempt counter is not updated; the task logs contain a sequence of entries like the ones below:


--------------------------------------------------------------------------------
[2021-09-13 11:16:50,930] {taskinstance.py:1088} INFO - Starting attempt 1 of 4
[2021-09-13 11:16:50,930] {taskinstance.py:1089} INFO - 
--------------------------------------------------------------------------------
...
--------------------------------------------------------------------------------
[2021-09-15 10:33:37,692] {taskinstance.py:1088} INFO - Starting attempt 1 of 4
[2021-09-15 10:33:37,692] {taskinstance.py:1089} INFO - 
--------------------------------------------------------------------------------
...

until the timeout is reached.
It is expected that after reaching the attempts threshold it would fail.

@huozhanfeng (Contributor, Author) commented Sep 15, 2021

Maybe there are other cases that cause the issue... You can print the contents of ti.key and ti_status.running in the _update_counters function to get the details.

Hi, I was not clear earlier. The fix from this PR resolves the KeyError: TaskInstanceKey ... issue,
but the task keeps looping until the timeout.
The attempt counter is not updated; the task logs contain a sequence of entries like the ones below:


--------------------------------------------------------------------------------
[2021-09-13 11:16:50,930] {taskinstance.py:1088} INFO - Starting attempt 1 of 4
[2021-09-13 11:16:50,930] {taskinstance.py:1089} INFO - 
--------------------------------------------------------------------------------
...
--------------------------------------------------------------------------------
[2021-09-15 10:33:37,692] {taskinstance.py:1088} INFO - Starting attempt 1 of 4
[2021-09-15 10:33:37,692] {taskinstance.py:1089} INFO - 
--------------------------------------------------------------------------------
...

until the timeout is reached.
It is expected that after reaching the attempts threshold it would fail.

Is it in a rescheduling state? If so, this is reasonable, because the attempt number is only updated when the task runs into a failure. In the rescheduling state it only means that the upstream condition is not met yet, and the task will try again until the upstream is done.
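For context, a minimal made-up example of a task that keeps re-entering the rescheduling state: a sensor in reschedule mode. The DAG id, file path, and intervals below are illustrative only; the point is that every poke logs "Starting attempt 1 of 4" again because rescheduling does not consume a retry.

from datetime import datetime

from airflow import DAG
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="reschedule_example",          # hypothetical DAG id
    start_date=datetime(2020, 12, 15),
    schedule_interval=None,
) as dag:
    wait_for_upstream = FileSensor(
        task_id="wait_for_upstream",
        filepath="/tmp/upstream_done",    # hypothetical marker file
        mode="reschedule",                # go to up_for_reschedule between pokes
        poke_interval=60,
        timeout=60 * 60,                  # sensor eventually times out if never met
        retries=3,                        # "attempt 1 of 4" comes from retries + 1
    )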

@pgodek commented Sep 15, 2021

Is it in a rescheduling state? If so, this is reasonable, because the attempt number is only updated when the task runs into a failure. In the rescheduling state it only means that the upstream condition is not met yet, and the task will try again until the upstream is done.

You are right, I mixed up some concepts.
So I confirm that this PR looks like the solution.

@github-actions bot added the full tests needed label Oct 7, 2021
@github-actions bot commented Oct 7, 2021

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@kaxil merged commit 9a5ac14 into apache:main Oct 7, 2021
@kaxil (Member) commented Oct 7, 2021

I am merging this to get it into 2.2.0rc1 at the end of today, but we should write a test for it.
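A hedged sketch of the kind of regression test being asked for here (not the test that was added later); it only reproduces the key bookkeeping implied by the traceback above, using the TaskInstanceKey field layout shown there, and checks that a defensive pop does not raise KeyError.

from datetime import datetime, timezone

from airflow.models.taskinstance import TaskInstanceKey


def test_running_pop_tolerates_rescheduled_try_number():
    # Key stored while the task instance was running (before try_number moved on).
    stored_key = TaskInstanceKey(
        dag_id="dag_id",
        task_id="task_name",
        execution_date=datetime(2020, 12, 15, tzinfo=timezone.utc),
        try_number=1,
    )
    running = {stored_key: object()}

    # Later the backfill sees the same task instance with try_number bumped.
    seen_key = stored_key._replace(try_number=2)

    popped = running.pop(seen_key, None)
    if popped is None:
        # Fall back to the adjacent try_number instead of raising KeyError.
        popped = running.pop(seen_key._replace(try_number=seen_key.try_number - 1), None)

    assert popped is not None
    assert not running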

ephraimbuddy added a commit to astronomer/airflow that referenced this pull request Oct 7, 2021
This PR adds a unittest for the PR apache#17305
uranusjr pushed a commit that referenced this pull request Oct 8, 2021