[AIRFLOW-3136] Add retry_number to TaskInstance Key property to avoid race condition #3994
Conversation
airflow/models.py (outdated diff)
Right now the PK columns of the TI table are (self.dag_id, self.task_id, self.execution_date), and try_number is adjusted in place.
This change might be a good idea but I think it is more complex than just this -- for instance it probably makes sense to store each individual try in the task instance table, but that is a much bigger change.
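For readers following along, here is a minimal sketch of the point being discussed. This is not the actual airflow/models.py code; the class name and the key_with_try_number property are made up for illustration, but it shows why a key of (dag_id, task_id, execution_date) cannot tell two tries of the same task apart, while adding try_number can.

```python
# Illustrative sketch only, not the real TaskInstance model.
class TaskInstanceSketch:
    def __init__(self, dag_id, task_id, execution_date, try_number):
        self.dag_id = dag_id
        self.task_id = task_id
        self.execution_date = execution_date
        self.try_number = try_number  # adjusted in place between retries

    @property
    def key(self):
        # the key discussed above: identical for every try of the same TI
        return self.dag_id, self.task_id, self.execution_date

    @property
    def key_with_try_number(self):
        # the shape of key this PR proposes: unique per try
        return self.dag_id, self.task_id, self.execution_date, self.try_number
```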
Thanks @ashb for your comment.
Storing individual try information in the task_instance table would actually connect more dots. Currently we get per-try information by joining with the task_fail table; it does not explicitly record which try failed, but by playing with the timestamps one can infer the retry number.
On the other hand, I guess it's okay to keep the schema as is, because it adds no extra load on the task_instance table and keeps delegating that load to task_fail, which has comparatively few or almost no reads. Adding retry_number to the task_fail table would, however, add value.
IMHO, having a key property only on the TaskInstance class seems to be special behaviour; it was probably meant only for the scheduler to convey eligible task information to the executor's queued_tasks. If that's not the case, we could also introduce executor_key as another property and point the existing references at it. Thoughts?
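Building on the sketch above, a hypothetical version of that executor_key alternative might look like the following. This is only the commenter's suggestion made concrete, not what the PR ultimately implements.

```python
# Hypothetical sketch of the executor_key idea: TaskInstance.key stays a
# 3-tuple and only the scheduler -> executor hand-off uses a try-aware key.
class TaskInstanceAltSketch(TaskInstanceSketch):  # TaskInstanceSketch from the sketch above
    @property
    def executor_key(self):
        # would be passed to the executor's queued_tasks instead of self.key
        return self.dag_id, self.task_id, self.execution_date, self.try_number
```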
@ashb Looking forward to your feedback!
This collision arises because of the following steps performed periodically by the scheduler:
- In the _execute_helper method, the heartbeat method is called, which in turn calls execute_async, putting the key (dag_id, task_id, execution_date) and the command onto the executor queue; from there, depending on the executor type, the task starts executing asynchronously.
- The sync method runs after execute_async and pushes keys from result_queue into event_buffer.
- The _process_executor_events method is called after the heartbeat method; it treats a task as externally killed if TI.state is queued while the key's status in event_buffer is success/failed.
Another possible change would be to take the execute_async call out of the heartbeat method and call it after the _process_executor_events call in the _execute_helper method in jobs.py.
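To make that ordering concrete, here is a heavily simplified, self-contained sketch of the loop described in the steps above. The names mirror jobs.py (heartbeat, execute_async, sync, event_buffer, _process_executor_events), but the code is illustrative pseudologic, not the actual Airflow scheduler.

```python
# Toy illustration of why a key without a try number lets a stale result
# from try N be matched against the re-queued try N+1. Not real jobs.py code.
from collections import deque

QUEUED, SUCCESS, FAILED = 'queued', 'success', 'failed'


class ToyExecutor:
    def __init__(self):
        self.queued_tasks = {}       # key -> command, filled by the scheduler
        self.result_queue = deque()  # (key, state) reported by workers
        self.event_buffer = {}       # key -> final state, drained by the scheduler

    def heartbeat(self):
        # execute_async: hand queued work to the workers
        for key, _command in list(self.queued_tasks.items()):
            del self.queued_tasks[key]
            # pretend the worker already finished and failed this try
            self.result_queue.append((key, FAILED))
        # sync: move finished results into event_buffer
        while self.result_queue:
            key, state = self.result_queue.popleft()
            self.event_buffer[key] = state


def process_executor_events(executor, ti_state):
    # mirrors the check described above: a TI that is still QUEUED while the
    # executor reports SUCCESS/FAILED looks "externally killed"
    for key, state in list(executor.event_buffer.items()):
        del executor.event_buffer[key]
        if state in (SUCCESS, FAILED) and ti_state.get(key) == QUEUED:
            print('executor reports %s finished (%s) although the TI is queued' % (key, state))


if __name__ == '__main__':
    key = ('example_dag', 'always_fails', '2018-01-01')  # no try number in the key
    executor = ToyExecutor()

    executor.queued_tasks[key] = 'airflow run ...'
    executor.heartbeat()                 # try 1 runs and fails; event_buffer[key] = FAILED
    ti_state = {key: QUEUED}             # scheduler has already queued try 2 under the SAME key
    process_executor_events(executor, ti_state)  # the stale try-1 event is matched against try 2
```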
…tting race condition Changes wrt key change Added fix in failed test case
…tting race condition Changes in Test case Made change to anticipate new member of key Renamed retry to try
Codecov Report
@@ Coverage Diff @@
## master #3994 +/- ##
==========================================
- Coverage 75.49% 75.48% -0.02%
==========================================
Files 199 199
Lines 15947 15947
==========================================
- Hits 12039 12037 -2
- Misses 3908 3910 +2
Continue to review full report at Codecov.
Is there an easy way I can reproduce this locally to see the bug for myself?
@ashb For us it is easy to reproduce: run a DAG that has a task bound to fail, with retries set > 0 (preferably 4 or 5) and retry_delay set to 1 second. Let me know if you're able to repro; otherwise I'll share a sample DAG file. Also, email_on_retry and email_on_failure are set to true for us, so once the race condition occurs we get an email like the one below.
EDIT-1: We also did a test on our side: we could not reproduce the issue when the dag_folder contained only a few files, but as soon as it reached 40+ DAGs we could reproduce it; presumably more DAGs in the dag_folder lengthen one cycle of DAG processing. EDIT-2: The DAG used to reproduce this issue can be found here; it is essentially the originally shipped example DAG with the few changes mentioned here.
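Based on that description, a reproduction DAG might look roughly like the following. The DAG id, task id, dates and owner are made up for illustration; the DAG file linked above is the authoritative one.

```python
# Hypothetical reproduction DAG assembled from the description above:
# a task that always fails, several fast retries, and retry/failure emails.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.10-style import

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 1, 1),
    'retries': 4,                          # retries > 0, as described above
    'retry_delay': timedelta(seconds=1),   # very short retry delay
    'email_on_retry': True,
    'email_on_failure': True,
}

dag = DAG(
    dag_id='race_condition_repro',
    default_args=default_args,
    schedule_interval='@once',
)

always_fails = BashOperator(
    task_id='always_fails',
    bash_command='exit 1',  # bound to fail so every run goes through the retry path
    dag=dag,
)
```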
Thanks @vardancse, that is useful. I will find time to look at this in detail (hopefully on Monday?)
@ashb let me know if by any chance you have had time to look at the issue?
Thanks, your reproduction steps were enough for me to reproduce this. The fix seemed to make the error go away, and things still get scheduled :D
…dition (#3994) We were seeing an intermittent issue where the executor reports a task instance as finished while the task itself says it is in the queued state. It was caused by a race condition between the scheduler, which clears event_buffer in the _process_executor_events method in jobs.py, and the executor, which was about to mark the next retry of a task (whose previous try had failed) as running. So we added retry_number as a member of the TaskInstance key property.
  labels['dag_id'], labels['task_id'],
- self._label_safe_datestring_to_datetime(labels['execution_date']))
+ self._label_safe_datestring_to_datetime(labels['execution_date']),
+ labels['try_number'])
The labels dict doesn't contain a try_number key, as it's not set in WorkerConfiguration make_pod. That's the reason why pods are not deleted.
https://github.com/apache/incubator-airflow/blob/v1-10-stable/airflow/contrib/kubernetes/worker_configuration.py#L197
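A minimal sketch of the kind of follow-up being discussed, assuming the worker pod labels are built as a plain dict; this is hypothetical (the function name below is made up) and not the actual WorkerConfiguration.make_pod code.

```python
# Hypothetical sketch: the labels a worker pod would need so the executor can
# rebuild the (dag_id, task_id, execution_date, try_number) key shown in the
# diff above. try_number is the label reported missing in the comment above.
def build_pod_labels(dag_id, task_id, execution_date_label, try_number):
    return {
        'dag_id': dag_id,
        'task_id': task_id,
        'execution_date': execution_date_label,  # assumed to be a label-safe date string
        'try_number': str(try_number),
    }
```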
D'oh.
Would someone (@vardancse? @aliaksandr-d?) be able to fix this, or at least add a test case that covers this?
@ashb I'm looking into it
@aliaksandr-d @ashb Is there a bug logged for this so that I can raise a PR for it?
@aliaksandr-d can you please apply the patch and confirm the resolution?
@aliaksandr-d #4163 has already been created for this by @wyndhblb
Make sure you have checked all steps below.
Jira
Description
We were seeing an intermittent issue where the executor reports a task instance as finished while the task itself says it is in the queued state. It was caused by a race condition between the scheduler, which clears event_buffer in the _process_executor_events method in jobs.py, and the executor, which was about to mark the next retry of a task (whose previous try had failed) as running. So we added retry_number as a member of the TaskInstance key property.
Tests
Commits
Documentation
Code Quality
git diff upstream/master -u -- "*.py" | flake8 --diff