Do not retry SQL operation in case of max_allowed_packet exception #14271

Merged
kfaraz merged 14 commits into apache:master from kfaraz:do_not_retry_max_packet_exception
Jun 10, 2023
Conversation

Contributor

@kfaraz kfaraz commented May 12, 2023

Description

This PR intends to correctly classify "packet too big" exceptions, improve the related error messages, and prevent indefinite retries.

Please refer to the section labeled "Motivation" for details on the requirement of this change.

Changes

  • Add a DruidException which contains a user-facing error message, HTTP response code and a cause
  • EntryExistsException extends DruidException and is now an unchecked exception
  • Throw DruidException with response code 400 (Bad Request) when the metadata store's max_allowed_packet limit is violated. This is done by adding SQLMetadataConnector.isRootCausePacketTooBigException, which specific connectors (e.g. MySQL, Postgres) can implement.
  • Returning 400 instead of 500 ensures that the client does not retry the request.

This work is more of a temporary band-aid and it should later tie in to #14004 which is a much more formalized treatment of Druid errors in general.
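The first change can be sketched roughly as follows. This is a minimal illustration only: the class and method names here (DruidExceptionSketch, badRequest) are hypothetical, not the exact API added by this PR.

```java
// Sketch of an exception that carries a user-facing message, an HTTP
// response code, and a cause. Names are illustrative, not the PR's API.
class DruidExceptionSketch extends RuntimeException
{
  private final int responseCode;

  DruidExceptionSketch(String message, int responseCode, Throwable cause)
  {
    super(message, cause);
    this.responseCode = responseCode;
  }

  int getResponseCode()
  {
    return responseCode;
  }

  // 400 (Bad Request) tells a well-behaved client not to retry,
  // whereas 500 (Internal Server Error) invites retries.
  static DruidExceptionSketch badRequest(String message, Throwable cause)
  {
    return new DruidExceptionSketch(message, 400, cause);
  }
}
```

A servlet layer that catches such an exception can then map getResponseCode() directly onto the HTTP response instead of defaulting to 500.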

Motivation

If the packet size of a given statement exceeds the max_allowed_packet, an exception of the following form is encountered:

org.skife.jdbi.v2.exceptions.CallbackFailedException:
  org.skife.jdbi.v2.exceptions.UnableToExecuteStatementException:
  java.sql.SQLTransientConnectionException:
    (conn=10271) Could not send query: query size is >= to max_allowed_packet

The root cause here is a SQLTransientConnectionException which gets (correctly) categorized as a transient exception by SQLMetadataConnector.isTransientException() and is thus retried.

In a typical situation where an index_parallel task or a Coordinator-issued compact task tries to create sub-tasks with a very large payload, the following tends to happen:

  • Supervisor task requests the Overlord to insert the new sub-task in the DB
  • Insertion fails with violation of max_allowed_packet
  • Being a transient exception, the insertion is retried (typically up to 9 times with exponential backoff, taking about 3 minutes in total)
  • While the Overlord is busy retrying insertion in the DB, the supervisor gets a ReadTimeoutException response
  • ServiceClientImpl on the supervisor task side interprets this as a network/IO exception and retries it an unlimited number of times (see StandardRetryPolicy.unlimited() below)
  • The parent index_parallel or compact task makes no progress but doesn't fail either

TaskMonitor(OverlordClient overlordClient, int maxRetry, int estimatedNumSucceededTasks)
{
  // Unlimited retries for Overlord APIs: if it goes away, we'll wait indefinitely for it to come back.
  this.overlordClient = Preconditions.checkNotNull(overlordClient, "overlordClient")
                                     .withRetryPolicy(StandardRetryPolicy.unlimited());
  this.maxRetry = maxRetry;
  this.estimatedNumSucceededTasks = estimatedNumSucceededTasks;
  log.info("TaskMonitor is initialized with estimatedNumSucceededTasks[%d]", estimatedNumSucceededTasks);
}

Solution

In truth, a violation of max_allowed_packet is not really a transient exception: it does not go away until the DB admin increases the configured limit.
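One way a connector could classify this error as non-transient is sketched below. This is an illustration under assumed names (the class, method names, and the exact class-name list are hypothetical; the PR's actual hooks are isRootCausePacketTooBigException and isTransientException):

```java
import java.sql.SQLTransientException;

class TransientClassifierSketch
{
  // Known "packet too big" root-cause classes thrown by the MySQL and
  // MariaDB drivers. Comparing by class name avoids a compile-time
  // dependency on either driver.
  private static final String[] PACKET_TOO_BIG_CLASSES = {
      "com.mysql.jdbc.PacketTooBigException",
      "org.mariadb.jdbc.internal.util.exceptions.MaxAllowedPacketException"
  };

  static boolean isPacketTooBigException(Throwable e)
  {
    for (Throwable t = e; t != null; t = t.getCause()) {
      String className = t.getClass().getName();
      for (String known : PACKET_TOO_BIG_CLASSES) {
        if (className.equals(known)) {
          return true;
        }
      }
      // Fallback: match on the error message, since some drivers wrap the
      // failure in a generic SQLTransientConnectionException.
      if (t.getMessage() != null && t.getMessage().contains("max_allowed_packet")) {
        return true;
      }
    }
    return false;
  }

  // A max_allowed_packet violation is never transient: it persists until
  // the configured limit is raised, so retrying is pointless.
  static boolean isTransient(Throwable e)
  {
    if (isPacketTooBigException(e)) {
      return false;
    }
    for (Throwable t = e; t != null; t = t.getCause()) {
      if (t instanceof SQLTransientException) {
        return true;
      }
    }
    return false;
  }
}
```

The key point is that the packet-too-big check runs before the generic transient check, so a SQLTransientConnectionException wrapping this error is still treated as permanent.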

Testing

After the changes, an index_parallel task with an oversized merge task payload fails as follows:

Caused by: org.apache.druid.rpc.HttpResponseException: Server error [400 Bad Request]; body: {"error":"Payload for task [partial_index_generic_merge_wikipedia_acflahlo_2023-06-02T13:43:42.233Z] exceeds the packet limit. Update the max_allowed_packet on your metadata store server or in the connection properties."}
	at org.apache.druid.rpc.ServiceClientImpl$1.onSuccess(ServiceClientImpl.java:201) ~[druid-server-27.0.0-SNAPSHOT.jar:27.0.0-SNAPSHOT]
	at org.apache.druid.rpc.ServiceClientImpl$1.onSuccess(ServiceClientImpl.java:183) ~[druid-server-27.0.0-SNAPSHOT.jar:27.0.0-SNAPSHOT]
	at com.google.common.util.concurrent.Futures$4.run(Futures.java:1181) ~[guava-16.0.1.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_282]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_282]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_282]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_282]
	... 3 more
2023-06-02T13:43:42,265 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_parallel_wikipedia_mcokcggf_2023-06-02T13:42:22.105Z",
  "status" : "FAILED",
  "duration" : 56130,
  "errorMsg" : "java.lang.RuntimeException: org.apache.druid.rpc.HttpResponseException: Server error [400 Bad Reques...",
  "location" : {
    "host" : null,
    "port" : -1,
    "tlsPort" : -1
  }
}
[Two screenshots of the failed task (2023-06-02) omitted]

Release note

If the Overlord fails to insert a task into the metadata because of a payload that exceeds the max_allowed_packet limit, the HTTP code in the response is now 400 (bad request) instead of 500 (internal server error). This prevents an index_parallel task from retrying the insertion of a bad sub-task indefinitely and causes it to fail immediately.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Contributor

@gianm gianm left a comment

👍 on the general idea, question about detection.

if (e == null) {
  return false;
}
if (e.getMessage() != null && e.getMessage().contains(MAX_ALLOWED_PACKET_ERROR)) {
Contributor


Is there a more robust way to detect this? Perhaps a particular exception type, or SQLState value, or vendorCode?

Contributor Author

@kfaraz kfaraz May 17, 2023


For MySQL, the SQLState seems to be S1000, the error code is 0, and the root cause is PacketTooBigException.

org.skife.jdbi.v2.exceptions.CallbackFailedException:
  org.skife.jdbi.v2.exceptions.UnableToExecuteStatementException:
    com.mysql.jdbc.PacketTooBigException:
      Packet for query is too large (9952 > 1024).
      You can change this value on the server by setting the 'max_allowed_packet' variable.

The solution for this case is simply to check if the cause of UnableToExecuteStatementException is transient, which we should always do anyway because UnableToExecuteStatementException is not implicitly transient by itself. Since PacketTooBigException is not transient by any of our other rules, it would be categorized correctly.

Contributor Author

@kfaraz kfaraz May 17, 2023


Unfortunately, this is not the exception we had seen in prod which was something like:

org.skife.jdbi.v2.exceptions.CallbackFailedException:
  org.skife.jdbi.v2.exceptions.UnableToExecuteStatementException:
    java.sql.SQLTransientConnectionException:
      (conn=10271) Could not send query: query size is >= to max_allowed_packet

I tried reproducing this by keeping a high value of max_allowed_packet on the MySQL server and a low value in the connection URL, but that again resulted in the same PacketTooBigException.

Contributor Author


I tried using a MariaDB driver instead and this is what I got:

org.skife.jdbi.v2.exceptions.CallbackFailedException:
  org.skife.jdbi.v2.exceptions.UnableToExecuteStatementException:
    java.sql.SQLTransientConnectionException:
      org.mariadb.jdbc.internal.util.exceptions.MariaDbSqlException:
        java.sql.SQLNonTransientConnectionException:
          org.mariadb.jdbc.internal.util.exceptions.MaxAllowedPacketException

The weird thing is that the SQLTransientConnectionException is eventually caused by a SQLNonTransientConnectionException.

The SQLState in this case is HY, which is pretty generic (see the MariaDB docs), and the error code (vendor code) is again 0.

Contributor Author


So there are two things we can do here:

  • Always look at the cause of UnableToExecuteStatementException to decide if it is transient. (this takes care of the mysql driver case)
  • For SQLTransientConnectionException, check the exception chain. If any of the causes is a SQLNonTransientException, the overall exception is also non-transient.

Ideally, the MariaDB driver should not have qualified this exception as transient in the first place.
@gianm , let me know what you think.
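The second option could look something like this (an illustrative sketch under a hypothetical class name, not the code that ended up in the PR):

```java
import java.sql.SQLNonTransientException;

class ChainInspectorSketch
{
  // If any exception in the cause chain is a SQLNonTransientException,
  // treat the whole chain as non-transient, even when the top-level
  // exception (e.g. SQLTransientConnectionException) claims otherwise.
  static boolean hasNonTransientCause(Throwable e)
  {
    for (Throwable t = e; t != null; t = t.getCause()) {
      if (t instanceof SQLNonTransientException) {
        return true;
      }
    }
    return false;
  }
}
```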

Contributor


Always look at the cause of UnableToExecuteStatementException to decide if it is transient. (this takes care of the mysql driver case)

Sounds good to me.

For SQLTransientConnectionException, check the exception chain. If any of the causes is a SQLNonTransientException exception, the overall exception is also non-transient.

I'm a little worried about collateral damage from the strategy of looking for SQLNonTransientException under SQLTransientConnectionException. How about a more targeted change where we look for org.mariadb.jdbc.internal.util.exceptions.MaxAllowedPacketException specifically? I'd be OK with that.

Contributor Author

@kfaraz kfaraz May 23, 2023


Done.
Added a new method connectorIsNonTransientException to allow connector implementations to classify an exception as definitely non-transient.

Added a new method isRootCausePacketTooBigException which checks for the specific MariaDB and MySQL exception classes in the MysqlStorageActionHandler.

Contributor Author

kfaraz commented May 23, 2023

The current fix in this PR is only half the solution: when this packet error is encountered, the HTTP API returns a 500 response, which causes the client to retry indefinitely.

I am looking at cleaning up the exception and returning a more meaningful response so that the client knows not to retry. I will try to include the changes in this same PR.

Contributor Author

kfaraz commented May 30, 2023

@gianm , I have modified this PR to ensure the following:

  • the packet exception is not categorized as transient by the Overlord so that it does not retry the metadata operation
  • the client gets back a 400 response code instead of 500 so that it knows not to retry

I have also updated the description accordingly. Please let me know what you think of the approach.

@kfaraz kfaraz merged commit 6e15870 into apache:master Jun 10, 2023
@kfaraz kfaraz deleted the do_not_retry_max_packet_exception branch June 10, 2023 06:45
@abhishekagarwal87 abhishekagarwal87 added this to the 27.0 milestone Jul 19, 2023