Conversation

@gopidesupavan
Member

@gopidesupavan gopidesupavan commented Jun 2, 2025

The triggerer deadlocks when sync functions are used with sync_to_async. To avoid that, we have a couple of solutions discussed in #50185.

This PR uses a ThreadPoolExecutor to read trigger workloads; the resulting future object is then used to wait in get_message. This way we can avoid collisions, as described in #50185 (comment).
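A minimal sketch of this pattern: the blocking stdin read is submitted to a ThreadPoolExecutor and the returned Future is awaited with asyncio.wrap_future, so the event loop is never blocked. An io.StringIO stands in for sys.stdin so the sketch is runnable; the names here are illustrative, not the actual Airflow identifiers.

```python
import asyncio
import io
from concurrent.futures import ThreadPoolExecutor

# StringIO stands in for sys.stdin so this sketch is self-contained.
stdin = io.StringIO('{"ok": true}\n')
executor = ThreadPoolExecutor(max_workers=1)

def _read_stdin_line() -> str:
    # Blocking read, performed off the event loop in a worker thread.
    return stdin.readline()

async def get_message() -> str:
    future = executor.submit(_read_stdin_line)
    # wrap_future bridges the concurrent.futures.Future into asyncio,
    # so we can await the result without blocking the loop.
    line = await asyncio.wrap_future(future)
    return line.strip()

print(asyncio.run(get_message()))  # {"ok": true}
```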



@gopidesupavan
Member Author

gopidesupavan commented Jun 2, 2025

Need to add some tests. (Update: added.)

@gopidesupavan
Member Author

[screenshot attached]

@x42005e1f

The future pass itself looks right; however, you still need to use the lock type I described in the linked comment. The approach without synchronization is appropriate only when futures are used fully - when each send_request() + get_message() pair is executed in a worker thread.

I can write a separate lock implementation later if you would like.

@gopidesupavan
Member Author

gopidesupavan commented Jun 2, 2025

> The future pass itself looks right; however, you still need to use the lock type I described in the linked comment. The approach without synchronization is appropriate only when futures are used fully - when each send_request() + get_message() pair is executed in a worker thread.
>
> I can write a separate lock implementation later if you would like.

Ah, you mean a lock here, before the thread pool? https://github.com/apache/airflow/pull/51279/files#diff-e4cc497f1c786d142ce4c930f43e33b0bb4b53d375d274278fa82f4d5567608aR963

```python
async with SUPERVISOR_COMMS.lock:
    self.requests_sock.write(msg.model_dump_json(exclude_none=True).encode() + b"\n")

    TRIGGERER_SUPERVISOR_COMMS_FUTURE = self._stdin_threadpool_executor.submit(
        SUPERVISOR_COMMS._read_stdin_line
    )

    line = await asyncio.wrap_future(TRIGGERER_SUPERVISOR_COMMS_FUTURE)
    TRIGGERER_SUPERVISOR_COMMS_FUTURE = None  # type: ignore[assignment]
```


@x42005e1f

> Ah, you mean a lock here, before the thread pool? https://github.com/apache/airflow/pull/51279/files#diff-e4cc497f1c786d142ce4c930f43e33b0bb4b53d375d274278fa82f4d5567608aR963

Yes, and in get_ti_count() too, since it can be used in separate threads.

The thread-level lock approach is special in that all uses of the lock remain, but the lock itself changes: special handling for the async -> sync case is added. Synchronization is still needed to eliminate collisions.

@gopidesupavan
Member Author

> Ah, you mean a lock here, before the thread pool? https://github.com/apache/airflow/pull/51279/files#diff-e4cc497f1c786d142ce4c930f43e33b0bb4b53d375d274278fa82f4d5567608aR963
>
> Yes, and in get_ti_count() too, since it can be used in separate threads.
>
> The thread-level lock approach is special in that all uses of the lock remain, but the lock itself changes: special handling for the async -> sync case is added. Synchronization is still needed to eliminate collisions.

Yeah, you're correct. I just ran multiple DAGs and could see the difference without the lock :)

@x42005e1f

> Yeah, you're correct. I just ran multiple DAGs and could see the difference without the lock :)

Multithreaded issues are usually hard to reproduce - it is often much easier to take a formal approach to them. This is why I would advise not to trust tests, at least not specialized ones - they can lie.

@gopidesupavan
Member Author

> Yeah, you're correct. I just ran multiple DAGs and could see the difference without the lock :)
>
> Multithreaded issues are usually hard to reproduce - it is often much easier to take a formal approach to them. This is why I would advise not to trust tests, at least not specialized ones - they can lie.

Yeah agree :)

@gopidesupavan gopidesupavan force-pushed the fix-triggerer-comms-deadlock branch 2 times, most recently from 7e108a1 to 8a9433c Compare June 2, 2025 12:30
Comment on changed code:

```python
    yield TriggerEvent({"count": dag_run_states_count, "dag_run_state": dag_run_state})


@pytest.mark.xfail(
```
Member Author


Tests are passing now; xfail is not required.

@gopidesupavan gopidesupavan added the backport-to-v3-1-test label Jun 2, 2025
@gopidesupavan gopidesupavan force-pushed the fix-triggerer-comms-deadlock branch from 471110a to 34d1ba3 Compare June 2, 2025 23:45
@gopidesupavan gopidesupavan force-pushed the fix-triggerer-comms-deadlock branch from 34d1ba3 to 3701d04 Compare June 3, 2025 09:16
Comment on lines -808 to +812

```python
async def connect_stdin() -> asyncio.StreamReader:
    reader = asyncio.StreamReader()
    protocol = asyncio.StreamReaderProtocol(reader)
    await loop.connect_read_pipe(lambda: protocol, sys.stdin)
    return reader

self.response_sock = await connect_stdin()

line = await self.response_sock.readline()
msg = comms_decoder.get_message()
```

Member


Why was this changed?

Member Author


sys.stdin is already configured for comms here: https://github.com/apache/airflow/pull/51279/files#diff-e4cc497f1c786d142ce4c930f43e33b0bb4b53d375d274278fa82f4d5567608aR805.

I think it's fine to read from get_message?

```python
global TRIGGERER_SUPERVISOR_COMMS_FUTURE
line = None

if TRIGGERER_SUPERVISOR_COMMS_FUTURE is not None:
```

Member


I really don't love this being in here. It feels like a massive abstraction leak. I think we should instead subclass CommsDecoder into a new class defined/living somewhere with the triggerer code.

Member Author


Agree, happy to do it with a subclass.

@ashb
Member

ashb commented Jun 3, 2025

@gopidesupavan Can you explain your reason/thinking for switching to a thread pool? Generally I don't love the use of a threadpool in an async context, especially when we are just making requests (i.e. something asyncio should be really good at), so I'd really like to understand more about why this change was needed.

@x42005e1f

> @gopidesupavan Can you explain your reason/thinking for switching to a thread pool? Generally I don't love the use of a threadpool in an async context, especially when we are just making requests (i.e. something asyncio should be really good at), so I'd really like to understand more about why this change was needed.

Let me try to explain, since I was the initiator of this change.

The problem is that synchronous and asynchronous lock calls can coexist in an asynchronous context. When an asynchronous task, holding the lock asynchronously, switches contexts, another task may try to acquire the lock synchronously (for some other request). The result is a deadlock: the attempt to acquire the lock synchronously cannot complete until the asynchronous task completes, and the asynchronous task cannot complete because the event loop is blocked by the synchronous call. ThreadPoolExecutor lets us delegate the first (asynchronous) call to a worker thread, where it can complete without switching back to the asynchronous task, which bypasses the deadlock. Calling future.result() in the same thread for a future object created by an asynchronous task is necessary to ensure no collisions.
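The deadlock shape described here can be reproduced in a minimal, hypothetical sketch (not the Airflow code): an async task holds a threading.Lock across an await, then code on the event loop thread tries to take the same lock synchronously. The sync acquire blocks the loop, so the async task can never resume and release the lock.

```python
import asyncio
import threading

async def holder(lock: threading.Lock) -> None:
    lock.acquire()          # async task takes the lock...
    await asyncio.sleep(1)  # ...then suspends while still holding it
    lock.release()

async def main() -> bool:
    lock = threading.Lock()
    task = asyncio.create_task(holder(lock))
    await asyncio.sleep(0)  # let holder() run up to its await point
    # A true deadlock would be lock.acquire() with no timeout; the timeout
    # only makes this sketch terminate so it can report the failure.
    acquired = lock.acquire(timeout=0.2)
    if acquired:
        lock.release()
    task.cancel()
    return acquired

print(asyncio.run(main()))  # False: the sync acquire starves the event loop
```

While the loop thread sits in the synchronous acquire, holder() can never be scheduled to release the lock, so the acquire can only fail.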

There are two cleaner solutions. The first is to use a ThreadPoolExecutor for each send_request() + get_message() pair - in that case we can get rid of the lock altogether. The second is to make an asynchronous version of each method that calls send_request() + get_message(), but this is harder to implement and may not always be possible.
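The first option can be sketched like this: the entire synchronous request/response exchange runs in a single worker thread, so no lock is needed on the event loop side. send_request_and_get_message is an assumed name standing in for the real comms calls, not the actual Airflow API.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# One worker thread serializes all exchanges, replacing the lock.
executor = ThreadPoolExecutor(max_workers=1)

def send_request_and_get_message(payload: str) -> str:
    # In the real code this would write to the supervisor socket and then
    # block reading the reply; here we echo to keep the sketch runnable.
    return f"reply:{payload}"

async def request(payload: str) -> str:
    loop = asyncio.get_running_loop()
    # The blocking exchange never runs on the event loop thread.
    return await loop.run_in_executor(executor, send_request_and_get_message, payload)

print(asyncio.run(request("ping")))  # reply:ping
```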

I will also clarify that this PR is incomplete without a more specific type of lock that allows the async -> sync case (synchronous acquisition after an asynchronous one in the same thread).

@x42005e1f

Also note that it is possible to use sync_to_async() (where the lock will be acquired in the worker thread) instead of an explicit ThreadPoolExecutor. This method is less flexible, because communication must then use only the synchronous send_request() + get_message(), but it does not require storing a future object or using a special type of lock (moreover, the lock can be downgraded to threading.Lock). The method used in this PR can be combined with asyncio tools, but to do so you need to access what's under their hood.
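A sketch of this alternative, approximated here with the stdlib asyncio.to_thread (asgiref's sync_to_async similarly runs the callable in a worker thread). Because the lock is acquired and released entirely inside the worker thread, a plain threading.Lock suffices; send_and_receive is an assumed name, not the actual Airflow API.

```python
import asyncio
import threading

# A plain threading.Lock works: it is only ever touched from worker threads.
lock = threading.Lock()

def send_and_receive(payload: str) -> str:
    with lock:
        # Real code would do synchronous send_request() + get_message() here;
        # we echo to keep the sketch runnable.
        return f"reply:{payload}"

async def request_via_thread(payload: str) -> str:
    # to_thread keeps the event loop free while the sync exchange blocks.
    return await asyncio.to_thread(send_and_receive, payload)

print(asyncio.run(request_via_thread("ping")))  # reply:ping
```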

@x42005e1f

In general, it is impossible to solve the synchronization problem between synchronous and asynchronous code in the same thread when the synchronous code relies on threading, due to how cooperative multitasking is implemented. Synchronous calls will always block the event loop, and the blocked event loop will prevent the asynchronous tasks from executing that could have completed those synchronous calls. Turning synchronous calls into implicitly asynchronous ones (the eventlet and gevent approach) violates coroutine safety. So the solutions are either to reduce this type of synchronization or to delegate execution to a worker thread.

@gopidesupavan
Member Author

@x42005e1f Thanks for the response :).

@ashb does that look fine? Do you have any suggestions or alternatives for this, please?


Labels

area:task-sdk, area:Triggerer, backport-to-v3-1-test


Successfully merging this pull request may close these issues.

Trigger runner process locked with multiple Workflow triggers

3 participants