Skip to content

Conversation

@dstandish
Copy link
Contributor

@dstandish dstandish commented Nov 4, 2024

Proposed changes to #43520

This changes the signature of cleanup_stuck_queued_tasks such that it returns Iterable[TaskInstance] instead of List[str] (where the str is a repr).

What are the implications....

Old provider with new executor:

  • we check what kind of object the executor is returning. if it is a string, then we know it's old executor version. in that case we just log to stdout of scheduler and move on.

New provider with old executor:

  • New provider would be yielding TIs instead of strings. The executor would be expecting list of repr strings. The repr IN check would not find matches. Thus nothing would be logged. Nothing breaks but you don't have the log messages about stuck TIs. So this combination should be avoided and we might want to bump min version of provider. Or we could add backcompat code in the providers to yield the reprs in that case. That should work ok too.

@boring-cyborg boring-cyborg bot added area:Executors-core LocalExecutor & SequentialExecutor area:providers area:Scheduler including HA (high availability) scheduler provider:celery provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Nov 4, 2024
DM = DagModel

RESCHEDULE_STUCK_IN_QUEUED_EVENT = "rescheduling stuck in queued"
STUCK_IN_QUEUED_EVENT = "stuck in queued"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one reason i removed the "rescheduling" part is because at the point where you log this, you don't know that it's reschedulable -- you only know that further down.

As a compromise between always failing a stuck task and always rescheduling a stuck task (which could
lead to tasks being stuck in queued forever without informing the user), we have creating the config
`[core] num_stuck_reschedules`. With this new configuration, an airflow admin can decide how
``[scheduler] num_stuck_in_queued_retries``. With this new configuration, an airflow admin can decide how
Copy link
Contributor Author

@dstandish dstandish Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a scheduler setting not core, and more a retry than a reschedule

executor.fail(ti.key)
if not hasattr(executor, "cleanup_stuck_queued_tasks"):
continue
for ti in executor.cleanup_stuck_queued_tasks(tis=stuck_tis):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the thing that still bothers me @dimberman is, it doesn't feel right that we defer to the executor and only conditionally log if it "cleans up" the ti. we have already observed that it is stuck in queued so why not log that?

i guess the problem is we are logging the wrong event. the event is not that it is "stuck in queued" (which is an unconditional observation) but rather that it was requeued. that's the thing that conditionally happens.

@dstandish dstandish requested a review from dimberman November 4, 2024 19:07
@dstandish dstandish force-pushed the handle-stuck-in-queued-proposed-revisions branch from 8c71f3f to d6d1caa Compare November 5, 2024 20:54
@dstandish dstandish force-pushed the handle-stuck-in-queued-proposed-revisions branch from d6d1caa to 4021186 Compare November 6, 2024 22:47
@dimberman dimberman merged commit 116d130 into apache:handle-stuck-in-queue Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Executors-core LocalExecutor & SequentialExecutor area:providers area:Scheduler including HA (high availability) scheduler provider:celery provider:cncf-kubernetes Kubernetes (k8s) provider related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants