Create Kubernetes peon lifecycle task log persist timeout #18444
capistrant merged 4 commits into apache:master
Conversation
```java
protected void saveLogs()
{
  ExecutorService executor = Executors.newSingleThreadExecutor();
```
Does it make sense to expose the timeouts here
https://github.com/fabric8io/kubernetes-client?tab=readme-ov-file#configuring-the-client
rather than doing a timeout via an exec service?
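For reference, the fabric8 README linked above documents client timeouts that can be supplied as JVM system properties; a minimal sketch of that configuration route (property names per the fabric8 docs, and this assumes the client builds its default `Config`, which reads these properties):

```java
public class Fabric8TimeoutProps
{
  public static void main(String[] args)
  {
    // Fabric8 picks these up when it builds its default client Config;
    // values are in milliseconds. See the "Configuring the client"
    // section of the fabric8 README for the full list.
    System.setProperty("kubernetes.connection.timeout", "10000");
    System.setProperty("kubernetes.request.timeout", "60000");
    System.out.println(System.getProperty("kubernetes.request.timeout"));
  }
}
```

As the discussion below notes, though, request-level timeouts do not help if the hang is inside LogWatch stream processing itself.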
fabric8io/kubernetes-client#7163
From talking with @kgyrtkirk, I was under the impression that the reason this log read can hang is the above bug. I don't think any existing timeouts can fix that. But we could also expose all the timeouts that fabric8 supplies as part of this PR?
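The executor-based timeout being discussed can be sketched as below. This is a simplified illustration, not the PR's actual code: `readLogsBlocking` is a hypothetical stand-in for the LogWatch copy that can hang, and the timeout and return handling are reduced to a boolean.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class LogSaveWithTimeout
{
  // Hypothetical stand-in for the blocking LogWatch read that can hang.
  static void readLogsBlocking() throws InterruptedException
  {
    Thread.sleep(10_000); // simulates a read that never completes in time
  }

  // Returns true if the log save finished within the timeout, false otherwise.
  static boolean saveLogsWithTimeout(long timeoutMs)
  {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<?> future = executor.submit(() -> {
        try {
          readLogsBlocking();
        }
        catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      // Bound the wait; the blocking read itself has no such bound.
      future.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    }
    catch (TimeoutException e) {
      return false; // give up on the persist rather than hang forever
    }
    catch (Exception e) {
      return false;
    }
    finally {
      executor.shutdownNow(); // interrupt the hung read so the thread can exit
    }
  }

  public static void main(String[] args)
  {
    System.out.println(saveLogsWithTimeout(200)); // prints false: the read outlives the 200 ms budget
  }
}
```

The key point is that `Future.get(timeout, unit)` bounds the caller's wait even when the underlying read never returns, which is exactly what a client-level request timeout cannot guarantee under the linked bug.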
I see, thanks for clarifying.
This seems to create a new executor for every call?
That looks a bit odd.
I don't know how many additional patches will be needed to make this fabric8+vertx stuff work correctly.
Would it be hard to add a knob which could restore the okhttp usage?
If we are about to handle the timeouts ourselves, is fabric8 the right choice of library?
I was taking a quick glance at https://github.com/kubernetes-client/java and it seemed more straightforward...
@kgyrtkirk you raise good points. The only reason I'm proposing this is that the fabric8 fix is not available in any release. I'd like to remove this workaround as soon as fabric8 ships a release or patch release with the fix. Maybe we need to take a step back and consider making the client pluggable, allowing operators to migrate between clients. vertx has solved a lot of problems for some use cases but caused other problems.
> this seems to create a new executor for every call? that looks a bit odd

It's a new executor per task: `saveLogs` is called once, when the task completes.
cryptoe left a comment:
Minor comments. LGTM otherwise.
```java
protected void saveLogs()
{
  ExecutorService executor = Executors.newSingleThreadExecutor();
  // ...
      log.warn("saveLogs() timed out after %d ms for task [%s]", logSaveTimeoutMs, taskId.getOriginalTaskId());
    }
    catch (Exception e) {
      log.error(e, "saveLogs() failed for task [%s]", taskId.getOriginalTaskId());
```
Since task logs would not be visible, this has to be phrased in a manner that the cluster admin can understand. Something like:
"Unable to save logs for the task[s] on location[details]. This does not have any impact on the work done by the task. If this continues to happen, check the Kubernetes server logs for potential errors."
Description

Prevents `saveLogs()` from hanging indefinitely when there are fabric8 issues. Read fabric8io/kubernetes-client#7163 for extra context on how LogWatch processing can block indefinitely. Situations like this are suspects in issues where overlord graceful shutdowns fail to complete. The upstream kubernetes-client fix associated with the linked issue is preferable to my implementation, but since that fix is unreleased, it is not clear when we could integrate it into Druid.

Release note
Introduce task log saving timeouts for the `kubernetes-overlord-extensions` mm-less ingestion framework. Persisting task logs will no longer have the potential to block indefinitely. Instead, there is a time limit for persisting logs that, if breached, results in giving up on the persist. The default timeout is 5 minutes, but it can be configured by overriding `druid.indexer.runner.logSaveTimeout` with a valid Duration (e.g. `PT60S`).

Key changed/added classes in this PR
`KubernetesPeonLifecycle.java`
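The release note above gives the timeout as an ISO-8601 duration. A quick sanity check of that format using plain `java.time` (for illustration only; Druid's own config layer does the actual parsing of `druid.indexer.runner.logSaveTimeout`):

```java
import java.time.Duration;

public class LogSaveTimeoutFormat
{
  public static void main(String[] args)
  {
    // The 5-minute default and the PT60S override example from the release note.
    Duration defaultTimeout = Duration.parse("PT5M");
    Duration custom = Duration.parse("PT60S");
    System.out.println(defaultTimeout.toMillis()); // 300000
    System.out.println(custom.toMillis());         // 60000
  }
}
```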