Do not retry SQL operation in case of max_allowed_packet exception #14271

Merged
kfaraz merged 14 commits into apache:master from kfaraz:do_not_retry_max_packet_exception
Jun 10, 2023
Conversation

Contributor

@kfaraz kfaraz commented May 12, 2023

Description

This PR intends to correctly classify "packet too big" exceptions, improve the related error messages, and prevent indefinite retries.

Please refer to the section labeled "Motivation" for details on the requirement of this change.

Changes

  • Add a DruidException which contains a user-facing error message, HTTP response code and a cause
  • EntryExistsException extends DruidException and is now an unchecked exception
  • Throw DruidException with response code 400 (Bad Request) when the metadata store's max_allowed_packet limit is violated. This is done by adding SQLMetadataConnector.isRootCausePacketTooBigException, which specific connectors (e.g. MySQL, Postgres) can implement.
  • Returning 400 instead of 500 ensures that the client does not retry the request.

This work is more of a temporary band-aid and it should later tie in to #14004 which is a much more formalized treatment of Druid errors in general.
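The first change can be sketched roughly as follows. This is a minimal illustration only: the class and method names here (DruidExceptionSketch, badRequest) are hypothetical, not the exact API added by this PR.

```java
// Sketch of an exception that carries a user-facing message, an HTTP
// response code, and a cause. Names are illustrative, not the PR's API.
class DruidExceptionSketch extends RuntimeException
{
  private final int responseCode;

  DruidExceptionSketch(String message, int responseCode, Throwable cause)
  {
    super(message, cause);
    this.responseCode = responseCode;
  }

  int getResponseCode()
  {
    return responseCode;
  }

  // 400 (Bad Request) tells a well-behaved client not to retry,
  // whereas 500 (Internal Server Error) invites retries.
  static DruidExceptionSketch badRequest(String message, Throwable cause)
  {
    return new DruidExceptionSketch(message, 400, cause);
  }
}
```

A servlet layer that catches such an exception can then map getResponseCode() directly onto the HTTP response instead of defaulting to 500.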

Motivation

If the packet size of a given statement exceeds the max_allowed_packet, an exception of the following form is encountered:

org.skife.jdbi.v2.exceptions.CallbackFailedException:
  org.skife.jdbi.v2.exceptions.UnableToExecuteStatementException:
  java.sql.SQLTransientConnectionException:
    (conn=10271) Could not send query: query size is >= to max_allowed_packet

The root cause here is a SQLTransientConnectionException which gets (correctly) categorized as a transient exception by SQLMetadataConnector.isTransientException() and is thus retried.

In a typical situation where an index_parallel task or a Coordinator-issued compact task tries to create sub-tasks with a very large payload, the following tends to happen:

  • Supervisor task requests the Overlord to insert the new sub-task in the DB
  • Insertion fails with violation of max_allowed_packet
  • Being a transient exception, the insertion is retried (typically up to 9 times with exponential backoff, taking about 3 minutes in total)
  • While the Overlord is busy retrying insertion in the DB, the supervisor gets a ReadTimeoutException response
  • ServiceClientImpl on the supervisor task side interprets this as a network/IO exception and retries it an unlimited number of times (see StandardRetryPolicy.unlimited() below)
  • The parent index_parallel or compact task makes no progress but doesn't fail either

TaskMonitor(OverlordClient overlordClient, int maxRetry, int estimatedNumSucceededTasks)
{
  // Unlimited retries for Overlord APIs: if it goes away, we'll wait indefinitely for it to come back.
  this.overlordClient = Preconditions.checkNotNull(overlordClient, "overlordClient")
                                     .withRetryPolicy(StandardRetryPolicy.unlimited());
  this.maxRetry = maxRetry;
  this.estimatedNumSucceededTasks = estimatedNumSucceededTasks;
  log.info("TaskMonitor is initialized with estimatedNumSucceededTasks[%d]", estimatedNumSucceededTasks);
}

Solution

In truth, a violation of max_allowed_packet is not really a transient exception: it does not go away until the DB admin increases the configured limit.
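One way a connector could classify this error as non-transient is sketched below. This is an illustration under assumed names (the class, method names, and the exact class-name list are hypothetical; the PR's actual hooks are isRootCausePacketTooBigException and isTransientException):

```java
import java.sql.SQLTransientException;

class TransientClassifierSketch
{
  // Known "packet too big" root-cause classes thrown by the MySQL and
  // MariaDB drivers. Comparing by class name avoids a compile-time
  // dependency on either driver.
  private static final String[] PACKET_TOO_BIG_CLASSES = {
      "com.mysql.jdbc.PacketTooBigException",
      "org.mariadb.jdbc.internal.util.exceptions.MaxAllowedPacketException"
  };

  static boolean isPacketTooBigException(Throwable e)
  {
    for (Throwable t = e; t != null; t = t.getCause()) {
      String className = t.getClass().getName();
      for (String known : PACKET_TOO_BIG_CLASSES) {
        if (className.equals(known)) {
          return true;
        }
      }
      // Fallback: match on the error message, since some drivers wrap the
      // failure in a generic SQLTransientConnectionException.
      if (t.getMessage() != null && t.getMessage().contains("max_allowed_packet")) {
        return true;
      }
    }
    return false;
  }

  // A max_allowed_packet violation is never transient: it persists until
  // the configured limit is raised, so retrying is pointless.
  static boolean isTransient(Throwable e)
  {
    if (isPacketTooBigException(e)) {
      return false;
    }
    for (Throwable t = e; t != null; t = t.getCause()) {
      if (t instanceof SQLTransientException) {
        return true;
      }
    }
    return false;
  }
}
```

The key point is that the packet-too-big check runs before the generic transient check, so a SQLTransientConnectionException wrapping this error is still treated as permanent.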

Testing

After the changes, an index_parallel task with an oversized merge task payload fails as follows:

Caused by: org.apache.druid.rpc.HttpResponseException: Server error [400 Bad Request]; body: {"error":"Payload for task [partial_index_generic_merge_wikipedia_acflahlo_2023-06-02T13:43:42.233Z] exceeds the packet limit. Update the max_allowed_packet on your metadata store server or in the connection properties."}
	at org.apache.druid.rpc.ServiceClientImpl$1.onSuccess(ServiceClientImpl.java:201) ~[druid-server-27.0.0-SNAPSHOT.jar:27.0.0-SNAPSHOT]
	at org.apache.druid.rpc.ServiceClientImpl$1.onSuccess(ServiceClientImpl.java:183) ~[druid-server-27.0.0-SNAPSHOT.jar:27.0.0-SNAPSHOT]
	at com.google.common.util.concurrent.Futures$4.run(Futures.java:1181) ~[guava-16.0.1.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_282]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_282]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_282]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_282]
	... 3 more
2023-06-02T13:43:42,265 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_parallel_wikipedia_mcokcggf_2023-06-02T13:42:22.105Z",
  "status" : "FAILED",
  "duration" : 56130,
  "errorMsg" : "java.lang.RuntimeException: org.apache.druid.rpc.HttpResponseException: Server error [400 Bad Reques...",
  "location" : {
    "host" : null,
    "port" : -1,
    "tlsPort" : -1
  }
}
[Two screenshots of the failed task (2023-06-02) omitted]

Release note

If the Overlord fails to insert a task into the metadata because of a payload that exceeds the max_allowed_packet limit, the HTTP code in the response is now 400 (bad request) instead of 500 (internal server error). This prevents an index_parallel task from retrying the insertion of a bad sub-task indefinitely and causes it to fail immediately.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Contributor

@gianm gianm left a comment

👍 on the general idea, question about detection.

if (e == null) {
  return false;
}
if (e.getMessage() != null && e.getMessage().contains(MAX_ALLOWED_PACKET_ERROR)) {
Contributor


Is there a more robust way to detect this? Perhaps a particular exception type, or SQLState value, or vendorCode?

Contributor Author

@kfaraz kfaraz May 17, 2023


For MySQL, the SQLState seems to be S1000, the error code is 0, and the root cause is PacketTooBigException.

org.skife.jdbi.v2.exceptions.CallbackFailedException:
  org.skife.jdbi.v2.exceptions.UnableToExecuteStatementException:
    com.mysql.jdbc.PacketTooBigException:
      Packet for query is too large (9952 > 1024).
      You can change this value on the server by setting the 'max_allowed_packet' variable.

The solution for this case is simply to check if the cause of UnableToExecuteStatementException is transient, which we should always do anyway because UnableToExecuteStatementException is not implicitly transient by itself. Since PacketTooBigException is not transient by any of our other rules, it would be categorized correctly.

Contributor Author

@kfaraz kfaraz May 17, 2023


Unfortunately, this is not the exception we had seen in prod which was something like:

org.skife.jdbi.v2.exceptions.CallbackFailedException:
  org.skife.jdbi.v2.exceptions.UnableToExecuteStatementException:
    java.sql.SQLTransientConnectionException:
      (conn=10271) Could not send query: query size is >= to max_allowed_packet

I tried reproducing this by keeping a high value of max_allowed_packet on the MySQL server and a low value in the connection URL, but that again resulted in the same PacketTooBigException.

Contributor Author


I tried using a MariaDB driver instead and this is what I got:

org.skife.jdbi.v2.exceptions.CallbackFailedException:
  org.skife.jdbi.v2.exceptions.UnableToExecuteStatementException:
    java.sql.SQLTransientConnectionException:
      org.mariadb.jdbc.internal.util.exceptions.MariaDbSqlException:
        java.sql.SQLNonTransientConnectionException:
          org.mariadb.jdbc.internal.util.exceptions.MaxAllowedPacketException

The weird thing is that the SQLTransientConnectionException is eventually caused by a SQLNonTransientConnectionException.

The SQLState in this case is HY, which is pretty generic (see the MariaDB docs), and the error code (vendor code) is again 0.

Contributor Author


So there are two things we can do here:

  • Always look at the cause of UnableToExecuteStatementException to decide if it is transient. (this takes care of the mysql driver case)
  • For SQLTransientConnectionException, check the exception chain. If any of the causes is a SQLNonTransientException, the overall exception is also non-transient.

Ideally, the MariaDB driver should not have qualified this exception as transient in the first place.
@gianm , let me know what you think.
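The second option could look something like this (an illustrative sketch under a hypothetical class name, not the code that ended up in the PR):

```java
import java.sql.SQLNonTransientException;

class ChainInspectorSketch
{
  // If any exception in the cause chain is a SQLNonTransientException,
  // treat the whole chain as non-transient, even when the top-level
  // exception (e.g. SQLTransientConnectionException) claims otherwise.
  static boolean hasNonTransientCause(Throwable e)
  {
    for (Throwable t = e; t != null; t = t.getCause()) {
      if (t instanceof SQLNonTransientException) {
        return true;
      }
    }
    return false;
  }
}
```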

Contributor


Always look at the cause of UnableToExecuteStatementException to decide if it is transient. (this takes care of the mysql driver case)

Sounds good to me.

For SQLTransientConnectionException, check the exception chain. If any of the causes is a SQLNonTransientException exception, the overall exception is also non-transient.

I'm a little worried about collateral damage from the strategy of looking for SQLNonTransientException under SQLTransientConnectionException. How about a more targeted change where we look for org.mariadb.jdbc.internal.util.exceptions.MaxAllowedPacketException specifically? I'd be OK with that.

Contributor Author

@kfaraz kfaraz May 23, 2023


Done.
Added a new method connectorIsNonTransientException to allow connector implementations to classify an exception as definitely non-transient.

Added a new method isRootCausePacketTooBigException which checks for the specific MariaDB and MySQL exception classes in the MysqlStorageActionHandler.

Contributor Author

kfaraz commented May 23, 2023

The current fix in this PR is only half the solution: when this packet error is encountered, the HTTP API returns a 500 response, which causes the client to retry indefinitely.

I am looking at cleaning up the exception and returning a more meaningful response so that the client knows not to retry. I will try to include the changes in this same PR.

Contributor Author

kfaraz commented May 30, 2023

@gianm , I have modified this PR to ensure the following:

  • the packet exception is not categorized as transient by the Overlord so that it does not retry the metadata operation
  • the client gets back a 400 response code instead of 500 so that it knows not to retry

I have also updated the description accordingly. Please let me know what you think of the approach.

@kfaraz kfaraz merged commit 6e15870 into apache:master Jun 10, 2023
@kfaraz kfaraz deleted the do_not_retry_max_packet_exception branch June 10, 2023 06:45
@abhishekagarwal87 abhishekagarwal87 added this to the 27.0 milestone Jul 19, 2023