Simplify the handle stuck in queued interface #43647

dstandish · 2024-11-04T18:30:21Z

Proposed changes to #43520

This changes the signature of cleanup_stuck_queued_tasks such that it returns Iterable[TaskInstance] instead of List[str] (where the str is a repr).

What are the implications....

Old provider with new executor:

we check what kind of object the executor is returning. if it is a string, then we know it's old executor version. in that case we just log to stdout of scheduler and move on.

New provider with old executor:

New provider would be yielding TIs instead of strings. The executor would be expecting list of repr strings. The repr IN check would not find matches. Thus nothing would be logged. Nothing breaks but you don't have the log messages about stuck TIs. So this combination should be avoided and we might want to bump min version of provider. Or we could add backcompat code in the providers to yield the reprs in that case. That should work ok too.

dstandish · 2024-11-04T18:31:33Z

airflow/jobs/scheduler_job_runner.py

 DM = DagModel

-RESCHEDULE_STUCK_IN_QUEUED_EVENT = "rescheduling stuck in queued"
+STUCK_IN_QUEUED_EVENT = "stuck in queued"


one reason i removed the "rescheduling" part is because at the point where you log this, you don't know that it's reschedulable -- you only know that further down.

dstandish · 2024-11-04T18:31:54Z

airflow/jobs/scheduler_job_runner.py

        As a compromise between always failing a stuck task and always rescheduling a stuck task (which could
        lead to tasks being stuck in queued forever without informing the user), we have creating the config
-        `[core] num_stuck_reschedules`. With this new configuration, an airflow admin can decide how
+        ``[scheduler] num_stuck_in_queued_retries``. With this new configuration, an airflow admin can decide how


it's a scheduler setting not core, and more a retry than a reschedule

dstandish · 2024-11-04T18:36:49Z

airflow/jobs/scheduler_job_runner.py

-                            executor.fail(ti.key)
+            if not hasattr(executor, "cleanup_stuck_queued_tasks"):
+                continue
+            for ti in executor.cleanup_stuck_queued_tasks(tis=stuck_tis):


the thing that still bothers me @dimberman is, it doesn't feel right that we defer to the executor and only conditionally log if it "cleans up" the ti. we have already observed that it is stuck in queued so why not log that?

i guess the problem is we are logging the wrong event. the event is not that it is "stuck in queued" (which is an unconditional observation) but rather that it was requeued. that's the thing that conditionally happens.

dstandish requested review from XD-DENG, ashb, hussein-awala, jedcunningham, o-nikolas and pierrejeambrun as code owners November 4, 2024 18:30

boring-cyborg bot added area:Executors-core LocalExecutor & SequentialExecutor area:providers area:Scheduler including HA (high availability) scheduler provider:celery provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Nov 4, 2024

dstandish commented Nov 4, 2024

View reviewed changes

dstandish requested a review from dimberman November 4, 2024 19:07

dstandish force-pushed the handle-stuck-in-queued-proposed-revisions branch from 8c71f3f to d6d1caa Compare November 5, 2024 20:54

Simplify the handle stuck in queued interface

4021186

dstandish force-pushed the handle-stuck-in-queued-proposed-revisions branch from d6d1caa to 4021186 Compare November 6, 2024 22:47

dimberman merged commit 116d130 into apache:handle-stuck-in-queue Nov 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify the handle stuck in queued interface #43647

Simplify the handle stuck in queued interface #43647

Uh oh!

dstandish commented Nov 4, 2024 •

edited

Loading

Uh oh!

dstandish Nov 4, 2024

Uh oh!

dstandish Nov 4, 2024 •

edited

Loading

Uh oh!

dstandish Nov 4, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Simplify the handle stuck in queued interface #43647

Simplify the handle stuck in queued interface #43647

Uh oh!

Conversation

dstandish commented Nov 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dstandish Nov 4, 2024

Choose a reason for hiding this comment

Uh oh!

dstandish Nov 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dstandish Nov 4, 2024

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dstandish commented Nov 4, 2024 •

edited

Loading

dstandish Nov 4, 2024 •

edited

Loading