Tidy up HTTP status codes for query errors #10746
Conversation
Will update docs as a follow-up.
a2l007 left a comment
Thanks for the PR! This will help improve debugging and categorizing failed queries.
/**
 * This exception is thrown when the requested operation cannot be completed due to a lack of available resources.
 * An abstract class for all query exceptions that should return a bad request status code (404).
Oops, yes it should be 400, not 404.
 * error code.
 * This is a {@link BadQueryException} because it will likely a user's misbehavior when this exception is thrown.
 * Druid cluster operators set some set of resource limitations for some good reason. When a user query requires
 * too many resources, it will likely mean that the user query should be modified to not use excessive resources.
Based on the description, shouldn't this exception also be categorized as ResourceLimitExceeded?
Good catch 🙂 It should be, but I wasn't 100% sure whether it could cause any side effects if I did it. I left a comment about it here. I will also make an issue to track it.
If we change this, the exception type and code changes from RE to ResourceLimitExceeded for queries exceeding maxScatterGatherBytes. What are the side effects that could happen? Sorry if I'm missing something :)
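For context, here is a minimal sketch of the kind of scatter-gather byte-limit check this thread is discussing, in the spirit of the `checkTotalBytesLimit()` mentioned below. The class name, fields, and the exception thrown are all illustrative assumptions, not Druid's actual code; the point is only that the exception type chosen at this throw site is what decides whether the failure maps to 500 or 400 upstream.

```java
// Hypothetical sketch of a maxScatterGatherBytes guard; names and the
// exception type are illustrative, not Druid's actual implementation.
public class ByteLimitGuard
{
  private final long maxScatterGatherBytes;
  private long totalBytes = 0;

  public ByteLimitGuard(long maxScatterGatherBytes)
  {
    this.maxScatterGatherBytes = maxScatterGatherBytes;
  }

  // Accumulates bytes received from data servers and throws once the
  // configured limit is exceeded. Throwing a generic RuntimeException here
  // versus a "resource limit exceeded" type is exactly the choice that
  // determines the HTTP status the user eventually sees.
  public void add(long nBytes)
  {
    totalBytes += nBytes;
    if (totalBytes > maxScatterGatherBytes) {
      throw new IllegalStateException(
          "Query exceeded maxScatterGatherBytes limit [" + maxScatterGatherBytes + "]");
    }
  }

  public long getTotalBytes()
  {
    return totalBytes;
  }
}
```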
Suggest rewording to something like:
"This is a {@link BadQueryException} because it likely indicates a user's misbehavior when this exception is thrown. The resource limitations set by Druid cluster operators are typically less flexible than the parameters of a user query, so when a user query requires too many resources, the likely remedy is that the user query should be modified to use fewer resources, or to reduce query volume."
@a2l007 oh, you are correct about the exception type in checkTotalBytesLimit(). My point was that this exception doesn't seem to be propagated to JsonParserIterator anyway because NettyHttpClient doesn't do it. I think NettyHttpClient should call retVal.setException() instead of retVal.set(null), but I haven't studied the code around it enough to figure out why we are doing this. Anyway, it doesn't look harmful to change the exception type in this PR, so I changed it 🙂
@jon-wei 👍 Updated the javadoc.
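To illustrate the `retVal.setException()` vs `retVal.set(null)` distinction raised above, here is a self-contained sketch using the JDK's `CompletableFuture` in place of whatever future type `NettyHttpClient` actually uses (an assumption for illustration only). Completing with null silently swallows the failure; completing exceptionally lets the consumer observe the original cause.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

// Sketch only: demonstrates the general future-completion pattern discussed
// in the review thread, not Druid's actual NettyHttpClient code.
public class FuturePropagation
{
  public static String describe(boolean propagateException)
  {
    CompletableFuture<String> retVal = new CompletableFuture<>();
    RuntimeException failure = new RuntimeException("Resource limit exceeded");
    if (propagateException) {
      retVal.completeExceptionally(failure); // consumer sees the real cause
    } else {
      retVal.complete(null); // consumer only sees a null result
    }
    try {
      String result = retVal.get();
      return result == null ? "lost: future completed with null" : result;
    }
    catch (ExecutionException e) {
      return "propagated: " + e.getCause().getMessage();
    }
    catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }
}
```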
Response gotBadQuery(BadQueryException e) throws IOException
{
  return buildNonOkResponse(BadQueryException.STATUS_CODE, e);
I was curious whether this would catch ResourceLimitExceeded exceptions wrapped under QueryInterruptedException and so I ran the PR on my test cluster. It looks like such tests are still giving us 500 instead of 400. Related stack trace:
HTTP/1.1 500 Internal Server Error
< Content-Type: application/json
< Content-Length: 372
{
"error" : "Resource limit exceeded",
"trace":
org.apache.druid.query.QueryInterruptedException: Not enough aggregation buffer space to execute this query. Try increasing druid.processing.buffer.sizeBytes or enable disk spilling by setting druid.query.groupBy.maxOnDiskStorage to a positive number.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_231]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_231]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_231]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_231]
at com.fasterxml.jackson.databind.introspect.AnnotatedConstructor.call(AnnotatedConstructor.java:124) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
at com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.createFromObjectWith(StdValueInstantiator.java:283) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
at com.fasterxml.jackson.databind.deser.ValueInstantiator.createFromObjectWith(ValueInstantiator.java:229) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
at com.fasterxml.jackson.databind.deser.impl.PropertyBasedCreator.build(PropertyBasedCreator.java:198) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:422) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
at com.fasterxml.jackson.databind.deser.std.ThrowableDeserializer.deserializeFromObject(ThrowableDeserializer.java:65) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:159) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
at com.fasterxml.jackson.databind.ObjectMapper._readValue(ObjectMapper.java:4189) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2476) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
at org.apache.druid.client.JsonParserIterator.init(JsonParserIterator.java:182) [classes/:?]
at org.apache.druid.client.JsonParserIterator.hasNext(JsonParserIterator.java:90) [classes/:?]
Maybe as a follow-up we could check the wrapped error code while handling QueryInterruptedException to determine the correct exception type.
Thanks for testing! I thought I had fixed all cases where exceptions are wrapped in QueryInterruptedException, but I missed JsonParserIterator. My intention was to fix it in this PR, so I went ahead and fixed it. It turns out there is an issue where the type of the query exception is lost after JSON deserialization. I ended up with this ugly switch clause to restore the exception type. A better approach would be modifying QueryException to store the type using Jackson, but I wasn't 100% sure what the side effects would be there either. The ugly switch clause seems at least safer than modifying QueryException. I think we can rework the serde system of QueryException after the release. What do you think?
Thanks, works fine now. I agree that we can avoid deeper tinkering with QueryException before the release; it can be picked up post-release.
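The "ugly switch" approach can be sketched roughly as follows: map the error code that survives JSON round-tripping back to the HTTP status the original exception type implied. This is a hedged illustration, not the actual Druid code; of the error-code strings below, only "Resource limit exceeded" appears in this thread (in the curl output above), and the others are placeholders.

```java
// Illustrative mapping from a deserialized error code back to an HTTP status;
// Druid's real error codes and exception hierarchy may differ.
public class QueryErrorStatus
{
  static int statusFor(String errorCode)
  {
    if (errorCode == null) {
      return 500; // no code survived deserialization: fall back to internal error
    }
    switch (errorCode) {
      case "Resource limit exceeded": // seen in the trace above; user-fixable
      case "Unsupported operation":   // placeholder for other bad-query codes
        return 400;                   // i.e. BadQueryException.STATUS_CODE
      default:
        return 500;                   // unknown codes remain internal errors
    }
  }
}
```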
 *
 * As a {@link QueryException} it is expected to be serialied to a json response, but will be mapped to
 * {@link #STATUS_CODE} instead of the default HTTP 500 status.
 * As a {@link QueryException} it is expected to be serialied to a json response with a proper HTTP error code
 * error code.
 * This is a {@link BadQueryException} because it will likely a user's misbehavior when this exception is thrown.
 * Druid cluster operators set some set of resource limitations for some good reason. When a user query requires
 * too many resources, it will likely mean that the user query should be modified to not use excessive resources.
jon-wei left a comment
Had some comments on the javadocs but otherwise LGTM
LGTM after CI.
As part of the follow-up PR, can you also add test cases to the query IT? I think the scenarios you outlined in the PR description are great, and each should have an IT associated with it.
Good idea. We need integration tests for this.
------------------------------------------------------------------------------
| lines | branches | functions | path
------------------------------------------------------------------------------
| 100% (17/17) | 100% (0/0) | 92% (12/13) | org/apache/druid/server/QueryResource.java
| 32% (11/34) | 44% (4/9) | 71% (27/38) | org/apache/druid/client/JsonParserIterator.java
| 0% (0/1) | 100% (0/0) | 100% (1/1) | org/apache/druid/client/DirectDruidClient.java
| 100% (3/3) | 100% (0/0) | 50% (2/4) | org/apache/druid/server/security/SecuritySanityCheckFilter.java
| 100% (1/1) | 100% (0/0) | 66% (6/9) | org/apache/druid/server/security/ForbiddenException.java
------------------------------------------------------------------------------
Total diff coverage:
- lines: 57% (32/56)
- branches: 44% (4/9)
- functions: 73% (48/65)
ERROR: Insufficient branch coverage of 44% (4/9). Required 50%.

The test coverage check failed. I agree those branches in
SGTM. I've gone ahead and created an issue #10756 for this so that we can keep track.
@a2l007 thanks!
* Tidy up query error codes
* fix tests
* Restore query exception type in JsonParserIterator
* address review comments; add a comment explaining the ugly switch
* fix test
Description
The Druid querying system returns an HTTP status code on query failures. Currently, most query failures return an internal server error (500), which can be misleading in some cases. This PR tidies that up so that a proper status code is returned. Here is a list of query errors, their current status codes, and the proposed changes.
Query planning errors
Query execution errors
BadQueryException is added to represent all kinds of query exceptions that should return 400.

This PR has:
Key changed/added classes in this PR
BadQueryException
ResourceLimitExceededException
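From a client's perspective, the point of this change is that status codes become actionable: 400 means the query itself should be changed, while 5xx means a server-side problem. Here is a hedged sketch of client-side triage; the class name and messages are illustrative, not part of this PR.

```java
// Illustrative client-side handling of the statuses this PR establishes.
public class QueryFailureTriage
{
  static String triage(int statusCode)
  {
    if (statusCode == 400) {
      return "bad query: modify the request (e.g. use fewer resources)";
    }
    if (statusCode >= 500) {
      return "server error: the query may be fine; investigate the cluster";
    }
    return "unhandled status " + statusCode;
  }
}
```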