Tidy up HTTP status codes for query errors#10746

Merged
jihoonson merged 5 commits intoapache:masterfrom
jihoonson:query-errors
Jan 14, 2021

Conversation

@jihoonson (Contributor)

Description

Druid's querying system returns an HTTP status code on query failure. Currently, most query failures return an internal server error (500), which can be misleading in some cases. This PR tidies this up so that a proper status code is returned. Here is a list of query errors, their current status codes, and the proposed changes.

Query planning errors

| Exception | Description | Current code | Proposed new code |
| --- | --- | --- | --- |
| `SqlParseException` and `ValidationException` from Calcite | Query planning failed | 500 | 400 |
| `CannotPlanException` from Calcite | Failed to find the best plan. Likely an internal error | 500 | |
| `RelConversionException` from Calcite | Conversion from SQL to Druid failed | 500 | |
| `CannotBuildQueryException` | Failed to create a Druid native query from `DruidRel`. Likely an internal error | 500 | |
| Others | Unknown errors | 500 | |

Query execution errors

| Exception | Description | Current code | Proposed new code |
| --- | --- | --- | --- |
| `QueryTimeoutException` | Query execution didn't finish within the timeout | 504 | |
| Unauthorized exception | Authentication failed for the query | 401 | |
| `ForbiddenException` | Authorization failed for the query | 403 | |
| `QueryCapacityExceededException` | Query could not be scheduled because of a lack of resources available when it was submitted | 429 | |
| `ResourceLimitExceededException` | Query requested more resources than the configured threshold | 500 | 400 |
| `InsufficientResourceException` | Query could not be scheduled because of a lack of merge buffers available when it was submitted | 500 | 429, merged into `QueryCapacityExceededException` |
| `QueryUnsupportedException` | Unsupported functionality | 400 | 501 |
| Others | Unknown exceptions | 500 | |

BadQueryException is added to represent all kinds of query exceptions that should return 400.
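The relationship between BadQueryException and the 400 mapping can be sketched roughly as follows. This is a minimal, self-contained sketch, not the actual Druid classes: Druid's real QueryException carries an error code, error class, and host, so the constructors and method names here are simplified and hypothetical.

```java
// Hypothetical sketch of the exception-to-status-code mapping introduced in this PR.
public class StatusCodeSketch
{
  // Simplified stand-in for Druid's QueryException; defaults to 500.
  static class QueryException extends RuntimeException
  {
    QueryException(String message) { super(message); }
    int getStatusCode() { return 500; }  // default: internal server error
  }

  // Base class for client-side mistakes that should map to 400.
  static abstract class BadQueryException extends QueryException
  {
    static final int STATUS_CODE = 400;
    BadQueryException(String message) { super(message); }
    @Override int getStatusCode() { return STATUS_CODE; }
  }

  // Per the table above, this now maps to 400 instead of 500.
  static class ResourceLimitExceededException extends BadQueryException
  {
    ResourceLimitExceededException(String message) { super(message); }
  }

  public static void main(String[] args)
  {
    QueryException e = new ResourceLimitExceededException("query requested too many resources");
    System.out.println(e.getStatusCode()); // prints 400
  }
}
```

The point of the abstract base class is that the HTTP resource layer only needs to catch one type to produce a 400 response, rather than enumerating every client-error exception.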


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Key changed/added classes in this PR
  • BadQueryException
  • ResourceLimitExceededException

@jihoonson (Contributor, Author)

Will update docs as a follow-up.

@a2l007 (Contributor) left a comment

Thanks for the PR! This will help improve debugging and categorizing failed queries.


```java
/**
 * This exception is thrown when the requested operation cannot be completed due to a lack of available resources.
 * An abstract class for all query exceptions that should return a bad request status code (404).
```
Contributor
Shouldn't it be 400 here?

Contributor Author
Oops, yes it should be.

```java
 * error code.
 * This is a {@link BadQueryException} because it will likely a user's misbehavior when this exception is thrown.
 * Druid cluster operators set some set of resource limitations for some good reason. When a user query requires
 * too many resources, it will likely mean that the user query should be modified to not use excessive resources.
```
Contributor

Based on the description, shouldn't this exception also be categorized as ResourceLimitExceeded ?

Contributor Author

Good catch 🙂 It should be, but I wasn't 100% sure whether it could cause any side effects if I did it. I left a comment about it here. I will also make an issue to track it.

Contributor

If we change this, the exception type and code changes from RE to ResourceLimitExceeded for queries exceeding maxScatterGatherBytes. What are the side effects that could happen? Sorry if I'm missing something :)

Contributor

Suggest rewording to something like:

"This is a {@link BadQueryException} because it likely indicates a user's misbehavior when this exception is thrown. The resource limitations set by Druid cluster operators are typically less flexible than the parameters of a user query, so when a user query requires too many resources, the likely remedy is that the user query should be modified to use fewer resources, or to reduce query volume."

Contributor Author

@a2l007 oh, you are correct about the exception type in checkTotalBytesLimit(). My point was that this exception doesn't seem to be propagated to JsonParserIterator anyway, because NettyHttpClient doesn't do it. I think NettyHttpClient should call retVal.setException() instead of retVal.set(null), but I haven't studied the surrounding code enough to figure out why we are doing this. Anyway, it doesn't look harmful to change the exception type in this PR, so I changed it 🙂
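The difference between completing the client future with null and completing it with the exception can be illustrated with plain-JDK futures. This is an analogy only, not the actual NettyHttpClient code, which uses Druid's own future types:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class FutureCompletionSketch
{
  public static void main(String[] args) throws Exception
  {
    RuntimeException serverError = new RuntimeException("Resource limit exceeded");

    // Analogous to retVal.set(null): the failure is swallowed and the caller
    // just sees a null result, losing the exception type entirely.
    CompletableFuture<Object> swallowed = new CompletableFuture<>();
    swallowed.complete(null);
    System.out.println(swallowed.get()); // prints null -- no hint of the failure

    // Analogous to retVal.setException(e): the caller observes the original
    // exception and can react to its concrete type.
    CompletableFuture<Object> propagated = new CompletableFuture<>();
    propagated.completeExceptionally(serverError);
    try {
      propagated.get();
    }
    catch (ExecutionException e) {
      System.out.println(e.getCause().getMessage()); // prints "Resource limit exceeded"
    }
  }
}
```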

@jon-wei 👍 Updated the javadoc.

Contributor

@jihoonson Makes sense. Thanks for clarifying!


```java
Response gotBadQuery(BadQueryException e) throws IOException
{
  return buildNonOkResponse(BadQueryException.STATUS_CODE, e);
}
```
Contributor

I was curious whether this would catch ResourceLimitExceeded exceptions wrapped under QueryInterruptedException, so I ran the PR on my test cluster. It looks like such queries are still giving us 500 instead of 400. Related stack trace:

```
HTTP/1.1 500 Internal Server Error
< Content-Type: application/json
< Content-Length: 372
{
  "error" : "Resource limit exceeded",
  "trace" :
org.apache.druid.query.QueryInterruptedException: Not enough aggregation buffer space to execute this query. Try increasing druid.processing.buffer.sizeBytes or enable disk spilling by setting druid.query.groupBy.maxOnDiskStorage to a positive number.
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_231]
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_231]
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_231]
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_231]
	at com.fasterxml.jackson.databind.introspect.AnnotatedConstructor.call(AnnotatedConstructor.java:124) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
	at com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.createFromObjectWith(StdValueInstantiator.java:283) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
	at com.fasterxml.jackson.databind.deser.ValueInstantiator.createFromObjectWith(ValueInstantiator.java:229) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
	at com.fasterxml.jackson.databind.deser.impl.PropertyBasedCreator.build(PropertyBasedCreator.java:198) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
	at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:422) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
	at com.fasterxml.jackson.databind.deser.std.ThrowableDeserializer.deserializeFromObject(ThrowableDeserializer.java:65) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
	at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:159) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
	at com.fasterxml.jackson.databind.ObjectMapper._readValue(ObjectMapper.java:4189) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
	at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2476) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
	at org.apache.druid.client.JsonParserIterator.init(JsonParserIterator.java:182) [classes/:?]
	at org.apache.druid.client.JsonParserIterator.hasNext(JsonParserIterator.java:90) [classes/:?]
```

Maybe as a follow-up we could check the wrapped error code while handling QueryInterruptedException to determine the correct exception type.

Contributor Author

Thanks for testing! I thought I had fixed all cases where exceptions are wrapped in QueryInterruptedException, but I missed JsonParserIterator. My intention was to fix this in this PR, so I went ahead and fixed it. It turns out there is an issue where the type of the query exception is lost after JSON deserialization. I ended up with this ugly switch clause to restore the exception type. A better approach would be modifying QueryException to store the type using Jackson, but I wasn't 100% sure what the side effects would be there either. The ugly switch clause seems at least safer than modifying QueryException. I think we can rework the serde system of QueryException after the release. What do you think?
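The "ugly switch clause" idea can be sketched like this. This is a hypothetical, self-contained version: the error-code strings, exception classes, and constructors in Druid's actual JsonParserIterator differ, and the real exceptions carry more context than a message.

```java
// Hypothetical sketch: after JSON deserialization only the error-code string
// survives, so the concrete exception type is rebuilt by switching on it.
public class RestoreExceptionType
{
  static RuntimeException restore(String errorCode, String message)
  {
    // Codes here are illustrative, not necessarily the exact strings Druid uses.
    switch (errorCode) {
      case "Query timeout":
        return new QueryTimeoutException(message);
      case "Query capacity exceeded":
        return new QueryCapacityExceededException(message);
      case "Resource limit exceeded":
        return new ResourceLimitExceededException(message);
      case "Unsupported operation":
        return new QueryUnsupportedException(message);
      default:
        return new RuntimeException(message); // unknown: fall back to 500 behavior
    }
  }

  static class QueryTimeoutException extends RuntimeException { QueryTimeoutException(String m) { super(m); } }
  static class QueryCapacityExceededException extends RuntimeException { QueryCapacityExceededException(String m) { super(m); } }
  static class ResourceLimitExceededException extends RuntimeException { ResourceLimitExceededException(String m) { super(m); } }
  static class QueryUnsupportedException extends RuntimeException { QueryUnsupportedException(String m) { super(m); } }

  public static void main(String[] args)
  {
    RuntimeException e = restore("Resource limit exceeded", "too many rows");
    System.out.println(e.getClass().getSimpleName()); // prints ResourceLimitExceededException
  }
}
```

The trade-off discussed above is that a switch like this must be kept in sync with the error-code strings by hand, whereas serializing the type with Jackson would make the round trip automatic at the cost of changing QueryException's serde behavior.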

Contributor

Thanks, it works fine now. I agree that we can avoid deeper tinkering with QueryException before the release; it can be picked up post-release.

Contributor Author

Thanks!

```java
 *
 * As a {@link QueryException} it is expected to be serialied to a json response, but will be mapped to
 * {@link #STATUS_CODE} instead of the default HTTP 500 status.
 * As a {@link QueryException} it is expected to be serialied to a json response with a proper HTTP error code
```
Contributor

serialied -> serialized

Contributor Author

Oops, thanks.


@jon-wei (Contributor) left a comment

Had some comments on the javadocs but otherwise LGTM

@a2l007 (Contributor)

a2l007 commented Jan 13, 2021

LGTM after CI.

@suneet-s (Contributor)

> Will update docs as a follow-up.

As part of this follow-up PR, can you also add test cases to the query integration tests? I think the scenarios you outlined in the PR description are great and should all have an integration test associated with them.

@jihoonson (Contributor, Author)

> Will update docs as a follow-up.

> As part of this follow-up PR can you also add test cases to the query IT. I think the scenarios you outlined in the PR description are great and should all have an IT associated with it.

Good idea. We need integration tests for this.

@jihoonson (Contributor, Author)

```
------------------------------------------------------------------------------
|     lines      |    branches    |   functions    |   path
------------------------------------------------------------------------------
| 100% (17/17)   | 100% (0/0)     |  92% (12/13)   | org/apache/druid/server/QueryResource.java
|  32% (11/34)   |  44% (4/9)     |  71% (27/38)   | org/apache/druid/client/JsonParserIterator.java
|   0% (0/1)     | 100% (0/0)     | 100% (1/1)     | org/apache/druid/client/DirectDruidClient.java
| 100% (3/3)     | 100% (0/0)     |  50% (2/4)     | org/apache/druid/server/security/SecuritySanityCheckFilter.java
| 100% (1/1)     | 100% (0/0)     |  66% (6/9)     | org/apache/druid/server/security/ForbiddenException.java
------------------------------------------------------------------------------
Total diff coverage:
 - lines: 57% (32/56)
 - branches: 44% (4/9)
 - functions: 73% (48/65)
ERROR: Insufficient branch coverage of 44% (4/9). Required 50%.
```

The test coverage check failed. I agree those branches in JsonParserIterator should be covered, but since this is blocking a branch cut, I will add them as a follow-up. Other integration test failures seem flaky. I just restarted them.

@jihoonson (Contributor, Author)

All tests passed except for the test coverage check mentioned above. I'm going to go ahead and merge this to cut the branch. @jon-wei @a2l007 any thoughts?

@a2l007 (Contributor)

a2l007 commented Jan 14, 2021

SGTM. I've gone ahead and created an issue, #10756, so that we can keep track of this.

@jihoonson (Contributor, Author)

@a2l007 thanks!

@jihoonson jihoonson merged commit 149306c into apache:master Jan 14, 2021
JulianJaffePinterest pushed a commit to JulianJaffePinterest/druid that referenced this pull request Jan 22, 2021
* Tidy up query error codes

* fix tests

* Restore query exception type in JsonParserIterator

* address review comments; add a comment explaining the ugly switch

* fix test
