KubernetesExecutor observability Improvements #35579
potiuk
left a comment
Good idea.
However, can you also add the metrics to the documentation? https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html#metric-descriptions
potiuk
left a comment
Change is good but needs documentation update.
Metrics documentation updated.
Description
The scheduler runs several housekeeping tasks (adopt_or_reset_orphaned_tasks, check_trigger_timeouts, _emit_pool_metrics, _find_zombies, clear_not_launched_queued_tasks and _check_worker_pods_pending_timeout) at a certain frequency. Right now, we don't have any latency metrics for this housekeeping work, even though it can impact the scheduler heartbeat. It's a good idea to capture these latency metrics so that we can identify slow tasks and tune the Airflow configuration.
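As a rough illustration of the idea, the sketch below wraps one housekeeping call in a timing context manager and records its duration. It is not the code from this PR: the `timed` helper, the in-memory `timings_ms` sink, the metric name, and the stand-in function body are all hypothetical, and a real change would presumably report through Airflow's own stats client instead of a dictionary.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory sink standing in for a real metrics backend.
timings_ms = defaultdict(list)


@contextmanager
def timed(metric_name):
    """Record the wall-clock duration of the wrapped block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the wrapped block raises, so slow failures are visible too.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        timings_ms[metric_name].append(elapsed_ms)


def adopt_or_reset_orphaned_tasks():
    """Stand-in for the real housekeeping function; it only sleeps briefly."""
    time.sleep(0.01)


# Hypothetical metric name, chosen for illustration only.
METRIC = "scheduler.housekeeping.adopt_or_reset_orphaned_tasks"

with timed(METRIC):
    adopt_or_reset_orphaned_tasks()

print(METRIC, timings_ms[METRIC])
```

Timing each housekeeping function separately, rather than the scheduler loop as a whole, is what makes it possible to tell which task is delaying the heartbeat.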
Use case/motivation
As we run Airflow at a large scale, we have found that the adopt_or_reset_orphaned_tasks and clear_not_launched_queued_tasks functions could take a few minutes prior to the bug fix (#34877). This delays the scheduler heartbeat and leads to the scheduler instance being restarted or killed. To detect these latency issues, we need metrics that capture these latencies.
closes: #31957