Kubernetes logging errors - attempting to adopt taskinstance which was not specified by database #28071

@hterik

Apache Airflow version

2.4.3

What happened

Using the following config:

executor = CeleryKubernetesExecutor
delete_worker_pods = False

  1. Start a few DAGs running in Kubernetes and wait for them to complete.
  2. Restart the scheduler.
  3. The logs are flooded with hundreds of errors like: ERROR - attempting to adopt taskinstance which was not specified by database: TaskInstanceKey(dag_id='xxx', task_id='yyy', run_id='zzz', try_number=1, map_index=-1)

This is problematic because:

  • Our installation has thousands of DAGs and pods, so this becomes very noisy, and the adoption process adds excessive startup time to the scheduler, sometimes up to a minute.
  • It hides actual errors with resetting orphaned tasks, which also happens for inexplicable reasons on scheduler restart, with the following log: Reset the following 6 orphaned TaskInstances. This makes those resets much harder to debug, because their cause cannot easily be correlated with the task instances that were not specified by database.

These logs are caused by the Kubernetes executor loading all pods on startup (try_adopt_task_instances) and then cross-referencing them with all RUNNING TaskInstances loaded via scheduler_job.adopt_or_reset_orphaned_tasks.
For every pod where a running TI cannot be found, it logs the error above. But for TIs that have already completed this is not an error, and the pods should not have to be loaded at all.
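
To make the cross-referencing concrete, here is a minimal, self-contained model of the behaviour described above. It is only a sketch, not the actual executor code: the TaskInstanceKey class is a simplified stand-in for Airflow's, and cross_reference is a hypothetical helper name.

```python
from typing import NamedTuple


class TaskInstanceKey(NamedTuple):
    """Simplified stand-in for Airflow's TaskInstanceKey (assumption)."""
    dag_id: str
    task_id: str
    run_id: str
    try_number: int = 1
    map_index: int = -1


def cross_reference(pod_keys, running_keys):
    """Split pod keys into those with a RUNNING TI and those without one."""
    running = set(running_keys)
    adopted, unmatched = [], []
    for key in pod_keys:
        (adopted if key in running else unmatched).append(key)
    return adopted, unmatched


# Three pods still exist because delete_worker_pods = False,
# but only one of them belongs to a task that is still RUNNING.
pods = [
    TaskInstanceKey("dag_a", "t1", "run_1"),
    TaskInstanceKey("dag_a", "t2", "run_1"),
    TaskInstanceKey("dag_b", "t1", "run_7"),
]
running_tis = [TaskInstanceKey("dag_b", "t1", "run_7")]

adopted, unmatched = cross_reference(pods, running_tis)
for key in unmatched:
    # Completed pods end up here, even though nothing is actually wrong.
    print(f"ERROR - attempting to adopt taskinstance which was not specified by database: {key}")
```

In this model, every completed pod that is still around produces one error line, which is why the noise grows with the number of retained pods.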

I have an idea of adding some code in the kubernetes_executor that patches in something like a completion-acknowledged label whenever a pod is completed (unless delete_worker_pods is set). Then on startup, all pods with this label can be excluded. Is this a good idea, or do you see other potential solutions?
Another potential solution is for try_adopt_task_instances to fetch only the exact pod ID specified in each task instance, instead of listing all pods and cross-referencing them later.
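
A rough sketch of both ideas, using plain dicts in place of real pod objects so it runs without a cluster. Everything here is hypothetical: the label key, the helper names, and the example base selector are assumptions for illustration, not existing Airflow code. The only real-world detail relied on is that in Kubernetes label-selector syntax `!key` matches objects that do not carry that label.

```python
ACK_LABEL = "completion-acknowledged"  # hypothetical label key, not an existing Airflow label


def acknowledge(pod):
    """Return a copy of the pod with the acknowledgement label set."""
    labels = dict(pod["labels"])
    labels[ACK_LABEL] = "true"
    return {**pod, "labels": labels}


def selector_excluding_acknowledged(base_selector):
    """Build a label selector that skips pods carrying ACK_LABEL."""
    return f"{base_selector},!{ACK_LABEL}"


def list_unacknowledged(pods):
    """Idea 1: local equivalent of listing pods with the selector above."""
    return [p for p in pods if ACK_LABEL not in p["labels"]]


def fetch_by_name(pods_by_name, pod_names):
    """Idea 2: look up only the pods referenced by RUNNING task instances."""
    return [pods_by_name[n] for n in pod_names if n in pods_by_name]


pods = [
    {"name": "pod-done-1", "labels": {}},
    {"name": "pod-done-2", "labels": {}},
    {"name": "pod-running", "labels": {}},
]
# Completed pods get the label patched in when their completion is handled.
pods = [acknowledge(p) if p["name"].startswith("pod-done") else p for p in pods]

selector = selector_excluding_acknowledged("example-label=example-value")
remaining = [p["name"] for p in list_unacknowledged(pods)]
direct = [p["name"] for p in fetch_by_name({p["name"]: p for p in pods}, ["pod-running"])]
print(selector)
print(remaining)
print(direct)
```

With the first idea the filtering would happen server-side through the selector, so completed pods are never loaded. The second avoids listing altogether but needs one lookup per running task instance, and assumes each task instance records the name of its pod.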

What you think should happen instead

No response

How to reproduce

No response

Operating System

Ubuntu 22.04

Versions of Apache Airflow Providers

No response

Deployment

Other Docker-based deployment

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
