[AIRFLOW-6040] Fix KubernetesJobWatcher Read time out error #6643
maxirus wants to merge 13 commits into apache:master from
Conversation
dimberman
left a comment
This LGTM thank you for catching this! Please fix the flake8 issues and once tests pass I'll gladly merge :)
Codecov Report
```diff
@@            Coverage Diff            @@
##             master    #6643   +/-   ##
=========================================
  Coverage          ?   84.31%
=========================================
  Files             ?      676
  Lines             ?    38353
  Branches          ?        0
=========================================
  Hits              ?    32338
  Misses            ?     6015
  Partials          ?        0
```
Continue to review full report at Codecov.
Perhaps setting this as a configuration value, and falling back to a constant in case it doesn't exist?
```diff
- kwargs = {'label_selector': 'airflow-worker={}'.format(worker_uuid)}
+ kwargs = {'label_selector': 'airflow-worker={}'.format(worker_uuid),
+           'timeout_seconds': 50}
```
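As a rough illustration of that suggestion, a minimal sketch — the `watch_timeout_seconds` key, its `[kubernetes]` section placement, and the 50-second default are all hypothetical, and it assumes `conf.getint` supports a `fallback`, as in recent Airflow versions:

```python
from airflow.configuration import conf

# Hypothetical fallback constant, used when the config key is absent.
DEFAULT_WATCH_TIMEOUT_SECONDS = 50

# Hypothetical [kubernetes] watch_timeout_seconds option; getint
# returns the fallback when the key is not set in airflow.cfg.
watch_timeout = conf.getint(
    'kubernetes', 'watch_timeout_seconds',
    fallback=DEFAULT_WATCH_TIMEOUT_SECONDS,
)

# worker_uuid as in KubernetesJobWatcher._run.
kwargs = {
    'label_selector': 'airflow-worker={}'.format(worker_uuid),
    'timeout_seconds': watch_timeout,
}
```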
Yeah, this should be a config variable at the least.
Also it would be good if […]:
airflow/airflow/config_templates/default_airflow.cfg, lines 782 to 787 in df35957
Wait, we already use the config option two lines down.
Someone on Slack mentioned that this won't work, but I don't see why from the code.
If `kube_client_request_args` is used, the Kubernetes executor fails to kick off tasks and the scheduler throws this exception:
```
[2019-11-25 18:02:53,397] {scheduler_job.py:1352} ERROR - Exception when executing execute_helper
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 1350, in _execute
    self._execute_helper()
  File "/usr/local/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 1439, in _execute_helper
    self.executor.heartbeat()
  File "/usr/local/lib/python3.7/site-packages/airflow/executors/base_executor.py", line 136, in heartbeat
    self.sync()
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 801, in sync
    self.kube_scheduler.run_next(task)
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 456, in run_next
    self.launcher.run_pod_async(pod, **self.kube_config.kube_client_request_args)
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/kubernetes/pod_launcher.py", line 62, in run_pod_async
    resp = self._client.create_namespaced_pod(body=req, namespace=pod.namespace, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 6115, in create_namespaced_pod
    (data) = self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 6148, in create_namespaced_pod_with_http_info
    " to method create_namespaced_pod" % key
TypeError: Got an unexpected keyword argument 'timeout_seconds' to method create_namespaced_pod
```
I believe the correct argument name here is `_request_timeout`. I can't link the generated Python API file as it is too large for GitHub, but it's on line 6141 of https://github.com/kubernetes-client/python/blob/master/kubernetes/client/api/core_v1_api.py.
The doc link for `kube_client_request_args` is dead. It also states:
> List of supported params in **kwargs are similar for all core_v1_apis, hence a single config variable for all apis
I feel like this is the wrong approach, as these settings should be configurable on a per-request basis, but that's another matter and much more complex. For example, this `label_selector` argument would fail if passed to the `create_namespaced_pod` function.
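For reference, a small sketch with the standard Kubernetes Python client showing the distinction (the namespace and timeout values are illustrative): the generated methods validate their keyword arguments, so `_request_timeout` is accepted everywhere while `timeout_seconds` exists only on list/watch calls.

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# `_request_timeout` is honoured by every generated method: an int is a
# total timeout, a (connect, read) tuple sets both urllib3 timeouts.
pods = v1.list_namespaced_pod('default', _request_timeout=(10, 60))

# `timeout_seconds` is a server-side list/watch parameter. Because the
# generated create_namespaced_pod validates its kwargs, passing it there
# raises "TypeError: Got an unexpected keyword argument 'timeout_seconds'
# to method create_namespaced_pod", as in the traceback above:
# v1.create_namespaced_pod(namespace='default', body=pod_manifest,
#                          timeout_seconds=50)
```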
My approach assumes that […]. If we didn't want to "hard-code" this value here, the other approaches I see are: […]
P.S.: The Worker UUID seems to not persist and is created at runtime. If I follow correctly, this gets generated each time the scheduler runs. How is this tracked across restarts?
There's an `airflow.models.kubernetes.KubeWorkerIdentifier` table with a singleton row where it should be stored.
@maxirus I would favour either 1. or 4. for simplicity.
Leaving it hard-coded is going to break it if someone changes the default `_request_timeout` from 60s.
@ashb is this […]
Same! I'm using this environment variable as a work-around for now: […]
I can implement it like this, but do we really want the timeout to always be […]?
🤦‍♂️ No, not at all. Broken logic.
ashb
left a comment
If possible it would be nice to add some unit tests too.
```python
for key, value in kube_config.kube_client_request_args.iteritems():
    kwargs[key] = value
conn_timeout = kube_config.kube_client_request_args.get('_request_timeout', [60, 60])[0]
kwargs['timeout_seconds'] = conn_timeout - 1 if conn_timeout - 1 > 0 else 1
```
```diff
- kwargs['timeout_seconds'] = conn_timeout - 1 if conn_timeout - 1 > 0 else 1
+ kwargs['timeout_seconds'] = max(conn_timeout - 1, 1)
```
(Assuming they are integers.)
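To drop that integer assumption, the derivation could normalise both forms the client accepts first. A sketch with a hypothetical helper name; note the PR reads the first tuple element (the connect timeout), whereas for a long-lived watch the read timeout in the second element is arguably the bound that matters:

```python
def derive_watch_timeout(request_args, default=60):
    """Pick a server-side watch timeout just under the client-side read
    timeout, accepting `_request_timeout` as a number or a
    (connect, read) pair."""
    timeout = request_args.get('_request_timeout', default)
    if isinstance(timeout, (list, tuple)):
        # The read timeout is the one a long-lived watch stream hits.
        timeout = timeout[1]
    return max(int(timeout) - 1, 1)
```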
Will try to get to this later this week.
@maxirus Any progress?
@mbelang No. Between the holidays & work I have not had time. Hoping to have some time this weekend.
[…] this mitigated the problem at least :)
What is the default timeout currently?
Syncing upstream
Sooo the test framework has a high barrier to entry, and there doesn't seem to be any existing tests for the KubernetesJobWatcher.
If `_request_timeout` is neither an int nor a 2-tuple, it is swallowed without further notice, which is rather unfortunate because the level at which developers would have to look for this issue is pretty deep. This already leads to confusion, see apache/airflow#6643 (comment). While it would break backwards compatibility to raise an exception, we should at least warn the developer.
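A sketch of the warning that commit message argues for, assuming the client accepts a number or a 2-tuple of numbers (the helper name is illustrative):

```python
import logging

log = logging.getLogger(__name__)


def check_request_timeout(value):
    """Warn when a `_request_timeout` value would be silently ignored."""
    if isinstance(value, (int, float)):
        return value
    if (isinstance(value, (list, tuple)) and len(value) == 2
            and all(isinstance(v, (int, float)) for v in value)):
        return tuple(value)
    log.warning(
        "_request_timeout=%r is neither a number nor a 2-tuple of "
        "numbers and will be silently ignored by the client", value,
    )
    return value
```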
@mbelang Unfortunately, your mitigation disables the timeout completely; see kubernetes-client/python#1069. Passing a string there makes the timeout disappear and lets the scheduler process wait forever. I have not checked deeply enough to find out whether that is a real problem or not, though.
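The failure mode is easy to demonstrate from the configuration side; a small illustration of why quoting the number in the JSON value breaks the timeout:

```python
import json

# The work-around as often written, with the timeout quoted:
# AIRFLOW__KUBERNETES__KUBE_CLIENT_REQUEST_ARGS='{"_request_timeout": "60"}'
args = json.loads('{"_request_timeout": "60"}')
# "60" is a str, matching neither the int nor the 2-tuple form the
# client checks for, so the timeout is dropped and the watch can
# block indefinitely.
assert isinstance(args['_request_timeout'], str)

# A JSON number (or a 2-element array) keeps the timeout in effect:
args = json.loads('{"_request_timeout": 60}')
assert isinstance(args['_request_timeout'], int)
```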
The kube tests in particular are the hardest to test, yes :(
Was this resolved? Setting `AIRFLOW__KUBERNETES__KUBE_CLIENT_REQUEST_ARGS: '{ "_request_timeout": "50" }'` did not resolve our issue with KubernetesJobWatcher.
Same here.
Hi, has there been any follow-up on this? This is really a blocker, as it seems the KubernetesExecutor is completely broken, with no workaround.
I just tested the fix presented here and I confirm it does work. Was this PR closed only because of the lack of a unit test? |
|
I found this guide very useful for those setting up Airflow on Kubernetes executor for the first time https://github.com/stwind/airflow-on-kubernetes |
|
@pvcnt Please check out my last comment: the fix does not work as intended. The KubernetesJobWatcher is a process that solely loops over the Kubernetes API, waiting for changes to Pods. That's all it does. Whether it crashes with a timeout or gets restarted virtually does not matter. What you are seeing is a false alarm. The fix should rather be that the timeout is gracefully handled; I will provide a patch when I find some time to do it. Again: the issue here does not impair the operation of the KubernetesExecutor.
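One shape the graceful handling described here could take; a sketch only, not the actual patch (the generator name is hypothetical): treat a client-side read timeout on the watch as an expected event and resume the stream.

```python
from urllib3.exceptions import ReadTimeoutError
from kubernetes import watch


def stream_pod_events(kube_client, namespace, **kwargs):
    """Yield watch events, resuming quietly after read timeouts."""
    watcher = watch.Watch()
    while True:
        try:
            for event in watcher.stream(kube_client.list_namespaced_pod,
                                        namespace, **kwargs):
                # Track the resource version so the resumed watch does
                # not replay events it has already delivered.
                kwargs['resource_version'] = \
                    event['object'].metadata.resource_version
                yield event
        except ReadTimeoutError:
            # Nothing changed within the timeout window; this is not an
            # error for a watch, so just reconnect.
            continue
```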
@sbrandtb From what I understood, it is exactly the purpose to set a server-side timeout and handle it gracefully, instead of relying on a client-side timeout that triggers an exception, isn't it? What other approach do you propose? I had several issues going on at the same time in my cluster, so maybe what I was observing was caused by another issue (solved since). But still, in the current state there is log pollution that makes it much more difficult to identify a real problem.
@sbrandtb You shouldn't catch (and subsequently ignore) a connection/response timeout error. Setting […] @ashb @dimberman I would suggest taking another look at PR #7616.
@maxirus Sorry, my bad. I did not see, in fact, that you were setting […]. However, I still disagree with you setting the […]. Either: […]
Because, if the request timeout from settings is something else than […]. But in general I agree that setting the […]
@sbrandtb I think you should take another look at the PR and read the comments in this thread again. My PR doesn't change the […]
Where am I setting this?
Nope... It's been configurable for a number of releases now and I didn't set this default value.
Yep.
Again, read the comments please. That is not how the maintainers wanted to handle it (see here).
Where's the double default?
I wanna know how to fix it right now. From all the above viewpoints, we know that we need to pass a […] to:

```python
def _run(self, kube_client, resource_version, worker_uuid, kube_config):
    self.log.info(
        'Event: and now my watch begins starting at resource_version: %s',
        resource_version
    )
    watcher = watch.Watch()
    kwargs = {'label_selector': 'airflow-worker={}'.format(worker_uuid)}
    if resource_version:
        kwargs['resource_version'] = resource_version
    if kube_config.kube_client_request_args:
        for key, value in kube_config.kube_client_request_args.items():
            kwargs[key] = value
    last_resource_version = None
    for event in watcher.stream(kube_client.list_namespaced_pod, self.namespace,
                                **kwargs):
        task = event['object']
        self.log.info(
            'Event: %s had an event of type %s',
            task.metadata.name, event['type']
        )
        if event['type'] == 'ERROR':
            return self.process_error(event)
        self.process_status(
            task.metadata.name, task.status.phase, task.metadata.labels,
            task.metadata.resource_version
        )
        last_resource_version = task.metadata.resource_version
    return last_resource_version
```

I guess we'd change […]. However, what I wonder is whether this problem arises from apache-airflow […]
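Pulling the thread together, a sketch of the change the commenter seems to be after, in the same `_run` context as above (the 50s and (10, 60) values are illustrative): the server-side `timeout_seconds` ends the watch cleanly before the client-side `_request_timeout` backstop can raise.

```python
kwargs = {'label_selector': 'airflow-worker={}'.format(worker_uuid)}
if resource_version:
    kwargs['resource_version'] = resource_version

# Server side: the API server closes the watch after 50s, so the
# stream ends normally instead of tripping the client read timeout.
kwargs['timeout_seconds'] = 50

# Client side: urllib3 (connect, read) timeouts kept as a backstop.
kwargs['_request_timeout'] = (10, 60)

for event in watcher.stream(kube_client.list_namespaced_pod,
                            self.namespace, **kwargs):
    ...
```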
Would it be possible to re-open this PR and consider applying this fix (or a similar one)? This issue is still present in the latest release of Airflow (scheduler logs are polluted with `ReadTimeoutError`), and setting `timeout_seconds` is a fix that works.
Jira
Description
[…] will cause a warning instead of an exception when a `worker_uuid` does not exist. `timeout_seconds` targets the `list_namespaced_pod` method, as opposed to the underlying urllib3 library, which throws an exception.
Tests
Commits
Documentation