
Conversation

Contributor

@vardancse vardancse commented Oct 3, 2018

Make sure you have checked all steps below.

Jira

  • My PR addresses the following AIRFLOW-3136 issues and references them in the PR title.

Description

  • Here are some details about my PR.
    We were seeing an intermittent issue where the executor reports a task instance as finished while the task says it is still in the queued state. It was due to a race condition: the scheduler was clearing the event_buffer in the _process_executor_events method in jobs.py while the executor was about to mark as running the next retry of a task that had failed in the previous try. So we thought to add retry_number as a member of the TaskInstance key property (a minimal sketch of the idea follows the checklist below).

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • When adding new operators/hooks/sensors, the autoclass documentation generation needs to be added.

Code Quality

  • Passes git diff upstream/master -u -- "*.py" | flake8 --diff
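
A minimal, self-contained sketch of the idea (illustrative only, not the actual diff; per a later commit the field ended up being called try_number rather than retry_number):

```python
from collections import namedtuple
from datetime import datetime

# Illustrative namedtuple only (the real change extends the TaskInstance.key
# property): adding the try number to the key means an event buffered for a
# failed try can no longer be matched against the next, freshly queued try.
TaskInstanceKey = namedtuple(
    "TaskInstanceKey", ["dag_id", "task_id", "execution_date", "try_number"]
)

key_try_1 = TaskInstanceKey("example_dag", "failing_task", datetime(2018, 10, 3), 1)
key_try_2 = TaskInstanceKey("example_dag", "failing_task", datetime(2018, 10, 3), 2)

# Before the change the key was only (dag_id, task_id, execution_date), so
# these two tries would have shared a key in the executor's event_buffer.
assert key_try_1 != key_try_2
```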

Member


Right now the PK columns of the TI table are (self.dag_id, self.task_id, self.execution_date) and try_number is adjusted in place.

This change might be a good idea but I think it is more complex than just this -- for instance it probably makes sense to store each individual try in the task instance table, but that is a much bigger change.

Contributor Author


Thanks @ashb for your comment.

Storing individual try information in the task_instance table would actually connect more dots. Currently we get per-try information by joining with the task_fail table; it is not explicitly recorded which try a job failed on, but by playing with timestamps one can infer the retry number.

On the other side, I guess it's okay to keep the schema as is, because it would not add additional load on the task_instance table; it would keep delegating load to task_fail, which has comparatively few (almost no) reads. However, adding retry_number to the task_fail table would add value.

IMHO, keeping the key property only on the TaskInstance class seems to be special behaviour; it was probably meant only for the scheduler to convey eligible task information to the executor's queued_tasks. If that's not true, we could also introduce executor_key as another property and point the existing references to it. Thoughts?

Contributor Author


@ashb Looking forward to your feedback!

Contributor Author


This collision arises from the following steps, which the scheduler performs periodically:

  1. In the _execute_helper method, the heartbeat method is called, which in turn calls execute_async, putting the key (dag_id, task_id, execution_date) and the command onto the executor queue; from there, depending on the executor type, the task starts executing asynchronously.
  2. The sync method runs after execute_async and pushes keys from the result_queue into the event_buffer.
  3. The _process_executor_events method is called after the heartbeat method; it detects external killing of a task when TI.state is queued but the event_buffer key's status is success/failed.

Another possible change would be to take the execute_async call out of the heartbeat method and call it after the _process_executor_events call in _execute_helper in jobs.py.
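
To make the collision easier to see, here is a rough, simplified sketch of the matching that _process_executor_events performs (function and variable names are illustrative, not the real jobs.py code):

```python
from datetime import datetime

def process_executor_events(event_buffer, ti_states):
    """Simplified stand-in for _process_executor_events: event_buffer maps
    TI keys to terminal states reported by the executor, ti_states maps TI
    keys to the state the scheduler currently has recorded."""
    for key, reported_state in list(event_buffer.items()):
        current_state = ti_states.get(key)
        if reported_state in ("success", "failed") and current_state == "queued":
            # Without try_number in the key, a stale event from try N matches
            # the freshly queued try N + 1 and triggers this warning.
            print("Executor reports task instance %s finished (%s) although the "
                  "task says its %s. Was the task killed externally?"
                  % (key, reported_state, current_state))
        event_buffer.pop(key)

# With try_number as part of the key, the stale event (try 1) and the newly
# queued try 2 no longer collide, so the warning above is not raised.
stale_event_key = ("example_dag", "failing_task", datetime(2018, 10, 3), 1)
queued_try_key = ("example_dag", "failing_task", datetime(2018, 10, 3), 2)
process_executor_events({stale_event_key: "failed"}, {queued_try_key: "queued"})
```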

Vardan Gupta added 2 commits October 3, 2018 19:24
…tting race condition

Changes wrt key change

Added fix in failed test case
…tting race condition

Changes in Test case
Made change to anticipate new member of key

Renamed retry to try

codecov-io commented Oct 4, 2018

Codecov Report

Merging #3994 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3994      +/-   ##
==========================================
- Coverage   75.49%   75.48%   -0.02%     
==========================================
  Files         199      199              
  Lines       15947    15947              
==========================================
- Hits        12039    12037       -2     
- Misses       3908     3910       +2
Impacted Files                        Coverage Δ
airflow/models.py                     91.67% <100%> (ø) ⬆️
airflow/executors/base_executor.py    93.75% <100%> (ø) ⬆️
airflow/jobs.py                       81.95% <100%> (-0.18%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fb5ffd1...4eebf95. Read the comment docs.

Member

ashb commented Oct 9, 2018

Is there an easy way I can reproduce this locally to see the bug for myself?

Contributor Author

vardancse commented Oct 10, 2018

@ashb For us, it's easy to reproduce by running a DAG that has a task bound to fail, with retries set > 0 (preferably 4 or 5) and retry_delay set to 1 second. Let me know if you're able to repro, otherwise I'll share a sample DAG file. Also, we have email_on_retry and email_on_failure set to true, so whenever we hit this race condition we get an email like the one below:

Try 2 out of 4
Exception:
Executor reports task instance %s finished (%s) although the task says its %s. Was the task killed externally?
Log: Link
Host: 72a8e9235db3
Log file: /usr/local/airflow/logs/1fac4a1d57e551320a49403c_ufr09_qwpbfy/ACPComputeGateway_0_UFR09Task/2018-10-03T05:45:05.000221.log
Mark success: Link

EDIT-1: We have also done some testing on our side: we were not able to reproduce the issue when the dag_folder contained a small number of files, but as soon as it reached 40+ DAGs we could reproduce it. Increasing the number of DAGs in the dag_folder probably lengthens one cycle of DAG processing.

EDIT-2: The DAG used to reproduce this issue can be found here; it is essentially the originally shipped example DAG with the few changes mentioned here.
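
A minimal DAG along these lines (dag_id and task names are made up; any task that always fails will do):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Hypothetical repro DAG: a task guaranteed to fail, with several quick
# retries so the scheduler queues the next try while the executor is still
# reporting the previous one.
default_args = {
    "owner": "airflow",
    "start_date": datetime(2018, 10, 1),
    "retries": 4,
    "retry_delay": timedelta(seconds=1),
    "email_on_retry": True,
    "email_on_failure": True,
}

dag = DAG(
    dag_id="race_condition_repro",
    default_args=default_args,
    schedule_interval="@once",
)

always_fails = BashOperator(
    task_id="always_fails",
    bash_command="exit 1",  # fails every try, exercising the retry path
    dag=dag,
)
```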

Member

ashb commented Oct 12, 2018

Thanks @vardancse, that is useful. I will find time to look at this in detail (hopefully on Monday?)

@vardancse
Contributor Author

@ashb let me know if by any chance you've had time to look at the issue?

Member

ashb commented Oct 31, 2018

Thanks, your reproduction steps were enough for me to reproduce this. The fix seemed to make the error go away, and things still get scheduled :D

@ashb ashb merged commit 8f15661 into apache:master Oct 31, 2018
ashb pushed a commit that referenced this pull request Nov 6, 2018
…dition (#3994)

wyndhblb pushed a commit to asappinc/incubator-airflow that referenced this pull request Nov 9, 2018
…dition (apache#3994)

galak75 pushed a commit to VilledeMontreal/incubator-airflow that referenced this pull request Nov 23, 2018
…dition (apache#3994)

tmiller-msft pushed a commit to cse-airflow/incubator-airflow that referenced this pull request Nov 27, 2018
…dition (apache#3994)

             labels['dag_id'], labels['task_id'],
-            self._label_safe_datestring_to_datetime(labels['execution_date']))
+            self._label_safe_datestring_to_datetime(labels['execution_date']),
+            labels['try_number'])


The labels dict doesn't contain a try_number key, as it's not set in WorkerConfiguration.make_pod. That's why pods are not deleted.
https://github.com/apache/incubator-airflow/blob/v1-10-stable/airflow/contrib/kubernetes/worker_configuration.py#L197
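
A small self-contained sketch of the shape of a possible fix (the helper names here are hypothetical stand-ins for WorkerConfiguration.make_pod and KubernetesExecutor._labels_to_key, not the actual patch):

```python
def make_pod_labels(dag_id, task_id, execution_date_label, try_number):
    # Hypothetical stand-in for the labels set in WorkerConfiguration.make_pod;
    # the missing piece is the try_number label (k8s label values are strings).
    return {
        'dag_id': dag_id,
        'task_id': task_id,
        'execution_date': execution_date_label,
        'try_number': str(try_number),
    }

def labels_to_key(labels):
    # Hypothetical counterpart of _labels_to_key: rebuild the 4-part TI key
    # from the pod labels so finished pods can be matched and cleaned up.
    return (labels['dag_id'], labels['task_id'],
            labels['execution_date'], int(labels['try_number']))

labels = make_pod_labels('example_dag', 'always_fails', '2018-10-03T05_45_05', 1)
assert labels_to_key(labels) == (
    'example_dag', 'always_fails', '2018-10-03T05_45_05', 1)
```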

Member


D'oh.

Would someone (@vardancse? @aliaksandr-d?) be able to fix this, or at least add a test case that covers this?

Contributor Author


@ashb I'm looking into it

Contributor Author


@aliaksandr-d @ashb Is there a bug logged for this so that I can raise a PR for it?

Contributor Author


@aliaksandr-d can you please apply the patch and confirm the resolution?

Contributor Author


@aliaksandr-d #4163 has already been created for this by @wyndhblb

elizabethhalper pushed a commit to cse-airflow/incubator-airflow that referenced this pull request Dec 7, 2018
…dition (apache#3994)

aliceabe pushed a commit to aliceabe/incubator-airflow that referenced this pull request Jan 3, 2019
…dition (apache#3994)

cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Jan 23, 2019
…3994)

wmorris75 pushed a commit to modmed-external/incubator-airflow that referenced this pull request Jul 29, 2019
…dition (apache#3994)
