Add limit to task payload size #16512
Conversation
@georgew5656 , thanks for the changes! It makes sense to fail early on large payloads with a clear error message, but I have some concerns.
One compromise could be to keep the check and just warn about large task payloads for the time being, with a log message and maybe also an alert event. Let me know what you think!
The direct problem seems to be that the error message thrown when the write to the DB fails is a little TOO descriptive in this case (it tries to log the whole payload, which can easily cause memory issues). I don't think the config totally loses its purpose with a higher default, since you can still choose to lower it to limit the blast radius of large payloads. Maybe it would make sense to just truncate that error message and log a warning in OverlordResource when a task payload is very large? I personally don't think a task payload above a certain size makes sense. For MSQ, would it really try to generate that large of a payload? 60 MB is huge for metadata.
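The truncation idea above could look something like the following sketch. This is illustrative only: the class name, helper, and the 1024-character cap are hypothetical, not Druid's actual code.

```java
// Illustrative helper: cap how much of a task payload ends up in a log or
// error message, so a failed metadata-store write cannot log the entire
// (possibly tens-of-MB) payload and exhaust Overlord memory.
public class LogTruncation
{
  // Hypothetical cap; real code would likely make this configurable.
  static final int MAX_LOGGED_PAYLOAD_CHARS = 1024;

  static String truncateForLog(String payload)
  {
    if (payload.length() <= MAX_LOGGED_PAYLOAD_CHARS) {
      return payload;
    }
    // Keep a prefix plus the original length so the log stays diagnosable.
    return payload.substring(0, MAX_LOGGED_PAYLOAD_CHARS)
           + "... (truncated, total length " + payload.length() + ")";
  }
}
```

The key design point is that the payload size no longer controls the log-message size, so a pathological task cannot OOM the Overlord through its own error path.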
Yes, we should definitely do these two items even if we decide not to do the rest.
True, but I don't think most cluster operators would think of updating this config when they encounter a problem. They would be more likely to increase the
I agree that 60MiB is large enough, but I do recall some cases where users had to resort to increasing the

In conclusion, we could do the following:
```java
log.warn("Received a task payload > [%d] with id [%s]. and datasource [%s]" +
         " There may be downstream issues caused by managing this large payload." +
         "Increase druid.indexer.queue.maxTaskPayloadSize to ignore this warning.",
         config.getMaxTaskPayloadSize(),
         task.getId(),
         task.getDataSource()
);
```
Suggested change:

```java
log.warn(
    "Task[%s] of datasource[%s] has payload size[%d] larger than the recommended maximum[%d]." +
    " Large task payloads may cause stability issues in the Overlord and may fail while persisting to the metadata store." +
    " Use smaller task payloads or increase 'druid.indexer.queue.maxTaskPayloadSize' to suppress this warning.",
    task.getId(), task.getDataSource(), config.getMaxTaskPayloadSize()
);
```
…ord/TaskQueue.java Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
```java
} else if (payload.length() > TASK_SIZE_WARNING_THRESHOLD) {
  log.warn(
      "Task[%s] of datasource[%s] has payload size[%d] larger than the recommended maximum[%d]. " +
      "Large task payloads may cause stability issues in the Overlord and may fail while persisting to the metadata store." +
```
Typo: missing space
Suggested change:

```diff
- "Large task payloads may cause stability issues in the Overlord and may fail while persisting to the metadata store." +
+ "Large task payloads may cause stability issues in the Overlord and may fail while persisting to the metadata store. " +
```
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
…ord/TaskQueue.java Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
The IntelliJ inspection failures look unrelated to me.
Currently it's possible to submit a really large task payload and OOM the Overlord if the request fails and the whole payload is logged in SQLMetadataStorageActionHandler.insertEntryWithHandle.
If the payload happens to be larger than max_allowed_packet for a MySQL metadata store, the write will always fail. Since very large task payloads seem to cause Overlord instability in general, it makes sense to limit the size of task payloads at the task queue level.
Description
Add a new config that sets a limit on task payload size, and throw an exception if the limit is exceeded.
The default limit of 60 MB is based on the 64 MB default value of max_allowed_packet in MySQL 8+.
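The check described here can be sketched as below. This is a minimal, hypothetical illustration of the approach, not Druid's actual TaskQueue code: the class name `TaskPayloadGuard` and its methods are invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a payload-size guard applied at enqueue time,
// before the serialized task ever reaches the metadata store.
public class TaskPayloadGuard
{
  // 60 MB default, leaving headroom under MySQL's 64 MB max_allowed_packet.
  static final long DEFAULT_MAX_TASK_PAYLOAD_SIZE = 60L * 1024 * 1024;

  private final long maxTaskPayloadSize;

  public TaskPayloadGuard(long maxTaskPayloadSize)
  {
    this.maxTaskPayloadSize = maxTaskPayloadSize;
  }

  /** Throws if the serialized payload exceeds the configured limit. */
  public void check(String taskId, String serializedPayload)
  {
    long size = serializedPayload.getBytes(StandardCharsets.UTF_8).length;
    if (size > maxTaskPayloadSize) {
      throw new IllegalArgumentException(
          "Task[" + taskId + "] payload size[" + size
          + "] exceeds maxTaskPayloadSize[" + maxTaskPayloadSize + "]"
      );
    }
  }
}
```

Checking at the queue level (rather than only at the HTTP layer) matters because, as noted below, supervisors can enqueue tasks without going through the HTTP API.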
Ideally I would use the HTTP request's Content-Length header to calculate the size of the task payload rather than re-serializing it in memory, but we also call taskQueue.add directly from supervisors, so that would bypass the check. If others think that is acceptable and it would be better to check Content-Length in OverlordResource, I am fine with changing this logic.
Release note
Key changed/added classes in this PR
TaskQueue

This PR has: