Expand use of Druid side timeouts for fabric8 kubernetesclient timeouts #18587
capistrant merged 17 commits into apache:master from
Conversation
| this.logSaveTimeoutMs = logSaveTimeoutMs; | ||
| this.logWatchInitializationTimeoutMs = logWatchInitializationTimeoutMs; | ||
| this.logWatchCopyLogTimeoutMs = logWatchCopyLogTimeoutMs; | ||
| this.logWatchExecutor = Executors.newSingleThreadExecutor( |
I feel like the original approach of creating an ExecutorService on the fly when needed made more sense.
We don't really need this executor until we need to watch/copy over the logs, so there is no reason to have it take up unnecessary memory upfront.
ack. I don't like either approach 😨 but it makes sense to not eagerly consume memory if the log watch stuff is involved in a small chunk of the overall peon lifecycle
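The on-demand pattern discussed above could be sketched roughly as follows. This is a minimal illustration, not the merged Druid code; the class and method names are made up for the example. The point is that the single-thread executor is created only when a log operation actually runs and is torn down immediately after, so an idle peon holds no executor thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: instead of an eagerly-created executor field,
// create a fresh single-thread executor per log operation and always
// shut it down, so memory is only consumed while the operation runs.
class LazyLogExecutor
{
  static <T> T runWithFreshExecutor(Callable<T> task, long timeoutMillis)
  {
    final ExecutorService exec = Executors.newSingleThreadExecutor();
    try {
      // Bound the operation by the configured timeout
      return exec.submit(task).get(timeoutMillis, TimeUnit.MILLISECONDS);
    }
    catch (Exception e) {
      throw new RuntimeException("log operation failed or timed out", e);
    }
    finally {
      // Release the executor thread even on timeout or failure
      exec.shutdownNow();
    }
  }
}
```

The trade-off is a small thread-creation cost per operation in exchange for no upfront memory, which fits here since log watching occupies only a small slice of the peon lifecycle.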
| | `druid.indexer.runner.capacity` | `Integer` | Number of concurrent jobs that can be sent to Kubernetes. | `2147483647` | No | | ||
| | `druid.indexer.runner.cpuCoreInMicro` | `Integer` | Number of CPU micro cores for the task. | `1000` | No | | ||
| | `druid.indexer.runner.logSaveTimeout` | `Duration` | How long to wait for task logs to be saved before giving up. | `PT300S` | No | | ||
| | `druid.indexer.runner.logWatchInitializationTimeout` | `Duration` | How long to wait when initializing a log watch for a peon pod before giving up. | `PT30S` | No | |
Let's avoid this new config and just use the original one for both the purposes.
A default timeout of 5 minutes is good enough for both initializing the logWatch field as well as
downloading the logs.
ok, I generalized the single config name a bit then
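The `Duration`-typed configs above take ISO-8601 duration strings, so the default `PT300S` resolves to five minutes. As a quick illustration of how such a value resolves to milliseconds (shown here with `java.time`; Druid itself uses Joda-Time `Period` for these configs):

```java
import java.time.Duration;

// Resolve an ISO-8601 duration string (e.g. the PT300S default) to millis.
// Helper name is illustrative, not part of the Druid codebase.
class DurationConfigExample
{
  static long millisFor(String isoDuration)
  {
    return Duration.parse(isoDuration).toMillis();
  }
}
```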
…hem on demand. The idea is that these will only exist for a short time per task, so creating them eagerly can be wasteful and cause memory pressure
kfaraz
left a comment
Thanks for addressing the feedback, @capistrant ! Left some final suggestions.
| | `druid.indexer.runner.capacity` | `Integer` | Number of concurrent jobs that can be sent to Kubernetes. | `2147483647` | No | | ||
| | `druid.indexer.runner.cpuCoreInMicro` | `Integer` | Number of CPU micro cores for the task. | `1000` | No | | ||
| | `druid.indexer.runner.logSaveTimeout` | `Duration` | How long to wait for task logs to be saved before giving up. | `PT300S` | No | | ||
| | `druid.indexer.runner.podLogOperationTimeout` | `Duration` | Timeout for async operations that interact with `k8s` pod logs. | `PT300S` | No | |
logOperationTimeout seems more ambiguous than logSaveTimeout.
Do you prefer this name since the same timeout is used for multiple operations like copy and log watch initialization?
I think it is fine to continue calling the config logSaveTimeout since both are related to the saving of logs.
Although, please update the description to call out exactly which operations use this timeout and what is the result of that timeout (i.e. will logs for that task ever be accessible).
maybe I was overthinking it. I was concerned about covering the init under the same config as the save, but I suppose you are right that it is probably ok to stick with it. I reverted and updated the doc with more info on what it is for.
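For reference, a sketch of how the retained config would look in a `runtime.properties` file (the value shown is the documented default; this is illustrative, not copied from the PR):

```properties
# Timeout for pod-log operations (log watch initialization and log saving),
# as an ISO-8601 duration; if exceeded, the task's logs may never be saved.
druid.indexer.runner.logSaveTimeout=PT300S
```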
| @NotNull | ||
| // how long to wait for log saving operations to complete | ||
| private Period logSaveTimeout = new Period("PT300S"); | ||
| private Period podLogOperationTimeout = new Period("PT300S"); |
If we continue with the same config name, we can just revert the changes made to this and some other files.
| logWatchOperationTimeoutMs, taskId.getOriginalTaskId()); | ||
| } | ||
| catch (InterruptedException e) { | ||
| Thread.currentThread().interrupt(); |
Do we need to mark the thread as interrupted?
Is this status read by the KubernetesTaskRunner?
hmmm. I guess this may just be force of habit to be defensive and do this whenever I catch an InterruptedException. Are there negatives to it even if the status isn't read?
| } | ||
| private void doSaveLogs() | ||
| protected void saveLogs() |
I feel we might benefit from adding a common method which can be used both for logWatch init as well as saveLogs:
private <T> T executeWithTimeout(Callable<T> callable, long timeoutMillis, String operationName)
{
// Create the executor
// Start it
// Handle the exceptions
// Finally shutdown the executor
}
good point, I took a shot at this
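Filling in the suggested skeleton, the shared helper might look something like this. This is a hedged sketch following the outline above, not the merged implementation; the exception handling and messages are assumptions:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeoutRunner
{
  // Run one log-related operation (logWatch init or saveLogs) on a
  // throwaway single-thread executor, bounded by timeoutMillis.
  static <T> T executeWithTimeout(Callable<T> callable, long timeoutMillis, String operationName)
  {
    // Create the executor
    final ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      // Start it and wait at most timeoutMillis for completion
      return executor.submit(callable).get(timeoutMillis, TimeUnit.MILLISECONDS);
    }
    // Handle the exceptions
    catch (TimeoutException e) {
      throw new RuntimeException(
          "Operation [" + operationName + "] timed out after " + timeoutMillis + " ms", e);
    }
    catch (InterruptedException | ExecutionException e) {
      throw new RuntimeException("Operation [" + operationName + "] failed", e);
    }
    // Finally shutdown the executor
    finally {
      executor.shutdownNow();
    }
  }
}
```

Centralizing the executor creation, timeout, and cleanup this way keeps `saveLogs` and the logWatch initialization from duplicating the same boilerplate.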
kfaraz
left a comment
Minor suggestions on the log lines etc.
Approach makes sense. 👍🏻
| Map<String, String> annotations, | ||
| Integer capacity, | ||
| Period taskJoinTimeout | ||
| Period taskJoinTimeout, |
Please revert the changes to this file since they are not needed anymore.
We can stick to the original order of the constructor args.
| private final TaskStateListener stateListener; | ||
| private final SettableFuture<Boolean> taskStartedSuccessfullyFuture; | ||
| private final long logSaveTimeoutMs; | ||
| private final long logSaveTimeout; |
Nit: Please continue using logSaveTimeoutMs or logSaveTimeoutMillis as it removes any ambiguity regarding the time unit.
Alternatively, you may pass in a Duration object and use the name logSaveTimeout,
but I don't think that is needed here.
| log.warn("Operation [%s] timed out after %d ms for task [%s]. %s", operationName, timeoutMillis, taskId.getOriginalTaskId(), errorMessage); | ||
| } | ||
| catch (InterruptedException e) { | ||
| Thread.currentThread().interrupt(); |
@capistrant , regarding the previous discussion, I think we should avoid marking this thread as interrupted here since we are not reading/clearing this interrupted status anywhere which might have unintended side effects.
We can add it in the future if we need it.
…Adds back explicit context
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
…stion without realizing it wouldn't actually refactor the code magically
Description
Expand on #18444 to cover more surface area. While putting saveLogs on a strict timer helped prevent some issues with LogWatch interaction hanging, I have still seen issues in the path where `KubernetesWorkItem` calls shutdown, which starts a LogWatch. Even the initialization of a LogWatch object can hang indefinitely.
Release note
New config for users of the `kubernetes-overlord-extensions` for the K8s TaskRunner.
Key changed/added classes in this PR
`KubernetesPeonLifecycle`
This PR has: