Do not retry SQL operation in case of max_allowed_packet exception #14271
kfaraz merged 14 commits into apache:master
Conversation
gianm left a comment
👍 on the general idea, question about detection.
if (e == null) {
  return false;
}
if (e.getMessage() != null && e.getMessage().contains(MAX_ALLOWED_PACKET_ERROR)) {
Is there a more robust way to detect this? Perhaps a particular exception type, or SQLState value, or vendorCode?
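For reference, the fields in question can be read directly off any java.sql.SQLException. A minimal sketch (the helper class and method names here are made up for illustration):

```java
import java.sql.SQLException;

public class SqlStateInspect
{
  // Hypothetical helper: summarizes the fields a more robust detection
  // could key on, instead of matching on the message text.
  public static String describe(SQLException e)
  {
    return "state=" + e.getSQLState()
           + " vendorCode=" + e.getErrorCode()
           + " type=" + e.getClass().getName();
  }
}
```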
For MySQL, the SQLState seems to be S1000, error code is 0, and the root cause is PacketTooBigException.
org.skife.jdbi.v2.exceptions.CallbackFailedException:
org.skife.jdbi.v2.exceptions.UnableToExecuteStatementException:
com.mysql.jdbc.PacketTooBigException:
Packet for query is too large (9952 > 1024).
You can change this value on the server by setting the 'max_allowed_packet' variable.
The solution for this case is simply to check if the cause of UnableToExecuteStatementException is transient, which we should always do anyway because UnableToExecuteStatementException is not implicitly transient by itself. Since PacketTooBigException is not transient by any of our other rules, it would be categorized correctly.
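That rule could be sketched roughly as follows. This is a simplified stand-in, not the actual SQLMetadataConnector code, and the jdbi class is stubbed so the sketch compiles on its own:

```java
import java.sql.SQLRecoverableException;
import java.sql.SQLTransientException;

public class TransienceCheck
{
  // Stand-in for org.skife.jdbi.v2.exceptions.UnableToExecuteStatementException,
  // stubbed here to avoid a jdbi dependency.
  public static class UnableToExecuteStatementException extends RuntimeException
  {
    public UnableToExecuteStatementException(Throwable cause)
    {
      super(cause);
    }
  }

  public static boolean isTransientException(Throwable e)
  {
    if (e == null) {
      return false;
    }
    // The wrapper is not implicitly transient by itself:
    // it is only as transient as its cause.
    if (e instanceof UnableToExecuteStatementException) {
      return isTransientException(e.getCause());
    }
    // Illustrative transience rules; the real connector has more.
    return e instanceof SQLTransientException || e instanceof SQLRecoverableException;
  }
}
```

Under this rule, a wrapped PacketTooBigException matches no transience rule and is therefore categorized as non-transient.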
Unfortunately, this is not the exception we had seen in prod which was something like:
org.skife.jdbi.v2.exceptions.CallbackFailedException:
org.skife.jdbi.v2.exceptions.UnableToExecuteStatementException:
java.sql.SQLTransientConnectionException:
(conn=10271) Could not send query: query size is >= to max_allowed_packet
I tried reproducing this by keeping a high value of max_allowed_packet on the MySQL server and a low value in the connection URL, but that again resulted in the same PacketTooBigException.
I tried using a MariaDB driver instead and this is what I got:
org.skife.jdbi.v2.exceptions.CallbackFailedException:
org.skife.jdbi.v2.exceptions.UnableToExecuteStatementException:
java.sql.SQLTransientConnectionException:
org.mariadb.jdbc.internal.util.exceptions.MariaDbSqlException
java.sql.SQLNonTransientConnectionException
org.mariadb.jdbc.internal.util.exceptions.MaxAllowedPacketException
The weird thing is that the SQLTransientConnectionException is eventually caused by a SQLNonTransientConnectionException.
The SQLState in this case is HY, which is pretty generic (per the MariaDB docs), and the error code (vendor code) is again 0.
So there are two things we can do here:
- Always look at the cause of UnableToExecuteStatementException to decide if it is transient. (This takes care of the MySQL driver case.)
- For SQLTransientConnectionException, check the exception chain. If any of the causes is a SQLNonTransientException, the overall exception is also non-transient.

Ideally, the MariaDB driver should not have qualified this exception as transient in the first place.
@gianm, let me know what you think.
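The second proposal could look something like this sketch (names and rules here are illustrative, not a final implementation):

```java
import java.sql.SQLNonTransientException;
import java.sql.SQLTransientConnectionException;

public class ChainCheck
{
  // Sketch of the proposed rule: a SQLTransientConnectionException is
  // treated as non-transient if any exception in its cause chain is a
  // SQLNonTransientException (as with the MariaDB driver case above).
  public static boolean isActuallyTransient(SQLTransientConnectionException e)
  {
    for (Throwable cause = e.getCause(); cause != null; cause = cause.getCause()) {
      if (cause instanceof SQLNonTransientException) {
        return false;
      }
    }
    return true;
  }
}
```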
> Always look at the cause of UnableToExecuteStatementException to decide if it is transient. (This takes care of the MySQL driver case.)
Sounds good to me.
> For SQLTransientConnectionException, check the exception chain. If any of the causes is a SQLNonTransientException, the overall exception is also non-transient.
I'm a little worried about collateral damage from the strategy of looking for SQLNonTransientException under SQLTransientConnectionException. How about a more targeted change where we look for org.mariadb.jdbc.internal.util.exceptions.MaxAllowedPacketException specifically? I'd be OK with that.
Done.
- Added a new method connectorIsNonTransientException to allow connector implementations to classify an exception as definitely non-transient.
- Added a new method isRootCausePacketTooBigException which checks for the specific MariaDB and MySQL exception classes in the MysqlStorageActionHandler.
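A sketch of what such a check might look like, matching the driver classes by name so the example compiles without a MySQL/MariaDB dependency (the actual implementation may differ):

```java
public class PacketTooBigDetector
{
  // Driver-specific exception classes seen in the stack traces above.
  private static final String MYSQL_PACKET_EXCEPTION =
      "com.mysql.jdbc.PacketTooBigException";
  private static final String MARIADB_PACKET_EXCEPTION =
      "org.mariadb.jdbc.internal.util.exceptions.MaxAllowedPacketException";

  public static boolean isRootCausePacketTooBigException(Throwable t)
  {
    // Walk the cause chain looking for either driver's packet exception.
    for (Throwable cause = t; cause != null; cause = cause.getCause()) {
      String name = cause.getClass().getName();
      if (MYSQL_PACKET_EXCEPTION.equals(name) || MARIADB_PACKET_EXCEPTION.equals(name)) {
        return true;
      }
    }
    return false;
  }
}
```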
The current fix in this PR is only half the solution: when this packet error is encountered, the HTTP API returns a 500 response, which causes the client to retry indefinitely. I am looking at cleaning up the exception and returning a more meaningful response so that the client knows not to retry. I will try to include those changes in this same PR.
@gianm, I have modified this PR to ensure the following:
I have also updated the description accordingly. Please let me know what you think of the approach.
Description
This PR intends to correctly classify "packet too big" exceptions, improve the error messages, and prevent indefinite retries.
Please refer to the section labeled "Motivation" for details on the requirement of this change.
Changes
- Added DruidException, which contains a user-facing error message, an HTTP response code, and a cause
- EntryExistsException extends DruidException and is now an unchecked exception
- A DruidException with response code 400 (bad request) is thrown if the metadata store max_allowed_packet limit is violated. This is accomplished by adding SQLMetadataConnector.isRootCausePacketTooBigException, which can be implemented by specific connectors (e.g. MySQL, Postgres, etc.)

This work is more of a temporary band-aid and it should later tie in to #14004, which is a much more formalized treatment of Druid errors in general.
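As a rough illustration of the first bullet, such a class might look like the following (field and accessor names are assumptions, not the actual Druid API):

```java
// Minimal sketch: an unchecked exception carrying a user-facing message,
// an HTTP response code, and a cause, as described above.
public class DruidException extends RuntimeException
{
  private final int responseCode;

  public DruidException(String message, int responseCode, Throwable cause)
  {
    super(message, cause);
    this.responseCode = responseCode;
  }

  public int getResponseCode()
  {
    return responseCode;
  }
}
```

An EntryExistsException could then extend this class and pick an appropriate response code.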
Motivation
If the packet size of a given statement exceeds the max_allowed_packet limit, an exception of the following form is encountered:

org.skife.jdbi.v2.exceptions.CallbackFailedException:
org.skife.jdbi.v2.exceptions.UnableToExecuteStatementException:
java.sql.SQLTransientConnectionException:
(conn=10271) Could not send query: query size is >= to max_allowed_packet

The root cause here is a SQLTransientConnectionException, which gets (correctly) categorized as a transient exception by SQLMetadataConnector.isTransientException() and is thus retried.

In a typical situation where an index_parallel task or a Coordinator-issued compact task tries to create sub-tasks with a very large payload, the following tends to happen:
- The sub-task payload exceeds the max_allowed_packet limit
- The request fails with a ReadTimeoutException response
- ServiceClientImpl on the supervisor task side interprets this as a network/IO exception and retries it an unlimited number of times (see StandardRetryPolicy.unlimited() below)
- The index_parallel or compact task makes no progress and doesn't fail either

druid/indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/TaskMonitor.java
Lines 109 to 118 in 9eebeea
Solution
In truth, the violation of max_allowed_packet is not really a transient exception, as it doesn't go away until the DB admin increases the configured limit.
Testing
After the changes, an index_parallel task with an oversized merge task payload fails as follows:
Release note
If the Overlord fails to insert a task into the metadata store because of a payload that exceeds the max_allowed_packet limit, the HTTP code in the response is now 400 (bad request) instead of 500 (internal server error). This prevents an index_parallel task from retrying the insertion of a bad sub-task indefinitely and causes it to fail immediately.