Tidy up HTTP status codes for query errors#10746

Merged
jihoonson merged 5 commits intoapache:masterfrom
jihoonson:query-errors
Jan 14, 2021

Conversation

@jihoonson (Contributor)

Description

Druid's querying system returns an HTTP status code on query failure. Currently, most query failures return an internal server error (500), which can be misleading in some cases. This PR tidies this up so that a proper status code is returned. Here is a list of query errors, their current status codes, and the proposed changes.

Query planning errors

| Exception | Description | Current code | Proposed new code |
| --- | --- | --- | --- |
| `SqlParseException` and `ValidationException` from Calcite | Query planning failed | 500 | 400 |
| `CannotPlanException` from Calcite | Failed to find the best plan. Likely an internal error | 500 | |
| `RelConversionException` from Calcite | Conversion from SQL to Druid failed | 500 | |
| `CannotBuildQueryException` | Failed to create a Druid native query from `DruidRel`. Likely an internal error | 500 | |
| Others | Unknown errors | 500 | |

Query execution errors

| Exception | Description | Current code | Proposed new code |
| --- | --- | --- | --- |
| `QueryTimeoutException` | Query execution didn't finish within the timeout | 504 | |
| Unauthorized exception | Authentication failed for the query | 401 | |
| `ForbiddenException` | Authorization failed for the query | 403 | |
| `QueryCapacityExceededException` | Query could not be scheduled because of a lack of resources available when it was submitted | 429 | |
| `ResourceLimitExceededException` | Query requested more resources than the configured threshold | 500 | 400 |
| `InsufficientResourceException` | Query could not be scheduled because of a lack of merge buffers available when it was submitted | 500 | 429, merged into `QueryCapacityExceededException` |
| `QueryUnsupportedException` | Unsupported functionality | 400 | 501 |
| Others | Unknown exceptions | 500 | |

BadQueryException is added to represent all kinds of query exceptions that should return 400.
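The relationship between BadQueryException and the 400 mapping can be sketched roughly as follows. This is a minimal, self-contained sketch, not the actual Druid classes: Druid's real QueryException carries an error code, error class, and host, so the constructors and method names here are simplified and hypothetical.

```java
// Hypothetical sketch of the exception-to-status-code mapping introduced in this PR.
public class StatusCodeSketch
{
  // Simplified stand-in for Druid's QueryException; defaults to 500.
  static class QueryException extends RuntimeException
  {
    QueryException(String message) { super(message); }
    int getStatusCode() { return 500; }  // default: internal server error
  }

  // Base class for client-side mistakes that should map to 400.
  static abstract class BadQueryException extends QueryException
  {
    static final int STATUS_CODE = 400;
    BadQueryException(String message) { super(message); }
    @Override int getStatusCode() { return STATUS_CODE; }
  }

  // Per the table above, this now maps to 400 instead of 500.
  static class ResourceLimitExceededException extends BadQueryException
  {
    ResourceLimitExceededException(String message) { super(message); }
  }

  public static void main(String[] args)
  {
    QueryException e = new ResourceLimitExceededException("query requested too many resources");
    System.out.println(e.getStatusCode()); // prints 400
  }
}
```

The point of the abstract base class is that the HTTP resource layer only needs to catch one type to produce a 400 response, rather than enumerating every client-error exception.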


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Key changed/added classes in this PR
  • BadQueryException
  • ResourceLimitExceededException

@jihoonson (Contributor, Author)

Will update docs as a follow-up.

@a2l007 (Contributor) left a comment

Thanks for the PR! This will help improve debugging and categorizing failed queries.


```java
/**
 * This exception is thrown when the requested operation cannot be completed due to a lack of available resources.
 * An abstract class for all query exceptions that should return a bad request status code (404).
```
Contributor
Shouldn't it be 400 here?

Contributor Author
Oops, yes it should be.

```java
 * error code.
 * This is a {@link BadQueryException} because it will likely a user's misbehavior when this exception is thrown.
 * Druid cluster operators set some set of resource limitations for some good reason. When a user query requires
 * too many resources, it will likely mean that the user query should be modified to not use excessive resources.
```
Contributor

Based on the description, shouldn't this exception also be categorized as ResourceLimitExceeded ?

Contributor Author

Good catch 🙂 It should be, but I wasn't 100% sure whether it could cause any side effects if I did it. I left a comment about it here. I will also make an issue to track it.

Contributor

If we change this, the exception type and code changes from RE to ResourceLimitExceeded for queries exceeding maxScatterGatherBytes. What are the side effects that could happen? Sorry if I'm missing something :)

Contributor

Suggest rewording to something like:

"This is a {@link BadQueryException} because it likely indicates a user's misbehavior when this exception is thrown. The resource limitations set by Druid cluster operators are typically less flexible than the parameters of a user query, so when a user query requires too many resources, the likely remedy is that the user query should be modified to use fewer resources, or to reduce query volume."

Contributor Author

@a2l007 oh, you are correct about the exception type in checkTotalBytesLimit(). My point was that this exception doesn't seem to be propagated to JsonParserIterator anyway, because NettyHttpClient doesn't do it. I think NettyHttpClient should call retVal.setException() instead of retVal.set(null), but I haven't studied the surrounding code enough to figure out why we are doing this. Anyway, it doesn't look harmful to change the exception type in this PR, so I changed it 🙂
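The difference between completing the client future with null and completing it with the exception can be illustrated with plain-JDK futures. This is an analogy only, not the actual NettyHttpClient code, which uses Druid's own future types:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class FutureCompletionSketch
{
  public static void main(String[] args) throws Exception
  {
    RuntimeException serverError = new RuntimeException("Resource limit exceeded");

    // Analogous to retVal.set(null): the failure is swallowed and the caller
    // just sees a null result, losing the exception type entirely.
    CompletableFuture<Object> swallowed = new CompletableFuture<>();
    swallowed.complete(null);
    System.out.println(swallowed.get()); // prints null -- no hint of the failure

    // Analogous to retVal.setException(e): the caller observes the original
    // exception and can react to its concrete type.
    CompletableFuture<Object> propagated = new CompletableFuture<>();
    propagated.completeExceptionally(serverError);
    try {
      propagated.get();
    }
    catch (ExecutionException e) {
      System.out.println(e.getCause().getMessage()); // prints "Resource limit exceeded"
    }
  }
}
```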

@jon-wei 👍 Updated the javadoc.

Contributor

@jihoonson Makes sense. Thanks for clarifying!


```java
Response gotBadQuery(BadQueryException e) throws IOException
{
  return buildNonOkResponse(BadQueryException.STATUS_CODE, e);
}
```
Contributor

I was curious whether this would catch ResourceLimitExceeded exceptions wrapped under QueryInterruptedException, so I ran the PR on my test cluster. It looks like such queries are still giving us 500 instead of 400. Related stack trace:

```
HTTP/1.1 500 Internal Server Error
< Content-Type: application/json
< Content-Length: 372
{
  "error" : "Resource limit exceeded",
  "trace" :
org.apache.druid.query.QueryInterruptedException: Not enough aggregation buffer space to execute this query. Try increasing druid.processing.buffer.sizeBytes or enable disk spilling by setting druid.query.groupBy.maxOnDiskStorage to a positive number.
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_231]
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_231]
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_231]
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_231]
	at com.fasterxml.jackson.databind.introspect.AnnotatedConstructor.call(AnnotatedConstructor.java:124) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
	at com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.createFromObjectWith(StdValueInstantiator.java:283) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
	at com.fasterxml.jackson.databind.deser.ValueInstantiator.createFromObjectWith(ValueInstantiator.java:229) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
	at com.fasterxml.jackson.databind.deser.impl.PropertyBasedCreator.build(PropertyBasedCreator.java:198) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
	at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:422) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
	at com.fasterxml.jackson.databind.deser.std.ThrowableDeserializer.deserializeFromObject(ThrowableDeserializer.java:65) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
	at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:159) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
	at com.fasterxml.jackson.databind.ObjectMapper._readValue(ObjectMapper.java:4189) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
	at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2476) ~[jackson-databind-2.10.5.1.jar:2.10.5.1]
	at org.apache.druid.client.JsonParserIterator.init(JsonParserIterator.java:182) [classes/:?]
	at org.apache.druid.client.JsonParserIterator.hasNext(JsonParserIterator.java:90) [classes/:?]
```

Maybe as a follow-up we could check the wrapped error code while handling QueryInterruptedException to determine the correct exception type.

Contributor Author

Thanks for testing! I thought I had fixed all cases where exceptions are wrapped in QueryInterruptedException, but I missed JsonParserIterator. My intention was to fix this in this PR, so I went ahead and fixed it. It turns out there is an issue where the type of the query exception is lost after JSON deserialization. I ended up with this ugly switch clause to restore the exception type. A better approach would be modifying QueryException to store the type using Jackson, but I wasn't 100% sure what the side effects would be there either. The ugly switch clause seems at least safer than modifying QueryException. I think we can rework the serde system of QueryException after the release. What do you think?
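The "ugly switch clause" idea can be sketched like this. This is a hypothetical, self-contained version: the error-code strings, exception classes, and constructors in Druid's actual JsonParserIterator differ, and the real exceptions carry more context than a message.

```java
// Hypothetical sketch: after JSON deserialization only the error-code string
// survives, so the concrete exception type is rebuilt by switching on it.
public class RestoreExceptionType
{
  static RuntimeException restore(String errorCode, String message)
  {
    // Codes here are illustrative, not necessarily the exact strings Druid uses.
    switch (errorCode) {
      case "Query timeout":
        return new QueryTimeoutException(message);
      case "Query capacity exceeded":
        return new QueryCapacityExceededException(message);
      case "Resource limit exceeded":
        return new ResourceLimitExceededException(message);
      case "Unsupported operation":
        return new QueryUnsupportedException(message);
      default:
        return new RuntimeException(message); // unknown: fall back to 500 behavior
    }
  }

  static class QueryTimeoutException extends RuntimeException { QueryTimeoutException(String m) { super(m); } }
  static class QueryCapacityExceededException extends RuntimeException { QueryCapacityExceededException(String m) { super(m); } }
  static class ResourceLimitExceededException extends RuntimeException { ResourceLimitExceededException(String m) { super(m); } }
  static class QueryUnsupportedException extends RuntimeException { QueryUnsupportedException(String m) { super(m); } }

  public static void main(String[] args)
  {
    RuntimeException e = restore("Resource limit exceeded", "too many rows");
    System.out.println(e.getClass().getSimpleName()); // prints ResourceLimitExceededException
  }
}
```

The trade-off discussed above is that a switch like this must be kept in sync with the error-code strings by hand, whereas serializing the type with Jackson would make the round trip automatic at the cost of changing QueryException's serde behavior.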

Contributor

Thanks, it works fine now. I agree that we can avoid deeper tinkering with QueryException before the release; it can be picked up post-release.

Contributor Author

Thanks!

```java
 *
 * As a {@link QueryException} it is expected to be serialied to a json response, but will be mapped to
 * {@link #STATUS_CODE} instead of the default HTTP 500 status.
 * As a {@link QueryException} it is expected to be serialied to a json response with a proper HTTP error code
```
Contributor

serialied -> serialized

Contributor Author

Oops, thanks.


@jon-wei (Contributor) left a comment

Had some comments on the javadocs but otherwise LGTM

@a2l007 (Contributor)

a2l007 commented Jan 13, 2021

LGTM after CI.

@suneet-s (Contributor)

> Will update docs as a follow-up.

As part of this follow-up PR, can you also add test cases to the query integration tests? I think the scenarios you outlined in the PR description are great and should all have an integration test associated with them.

@jihoonson (Contributor, Author)

> Will update docs as a follow-up.

> As part of this follow-up PR can you also add test cases to the query IT. I think the scenarios you outlined in the PR description are great and should all have an IT associated with it.

Good idea. We need integration tests for this.

@jihoonson (Contributor, Author)

```
------------------------------------------------------------------------------
|     lines      |    branches    |   functions    |   path
------------------------------------------------------------------------------
| 100% (17/17)   | 100% (0/0)     |  92% (12/13)   | org/apache/druid/server/QueryResource.java
|  32% (11/34)   |  44% (4/9)     |  71% (27/38)   | org/apache/druid/client/JsonParserIterator.java
|   0% (0/1)     | 100% (0/0)     | 100% (1/1)     | org/apache/druid/client/DirectDruidClient.java
| 100% (3/3)     | 100% (0/0)     |  50% (2/4)     | org/apache/druid/server/security/SecuritySanityCheckFilter.java
| 100% (1/1)     | 100% (0/0)     |  66% (6/9)     | org/apache/druid/server/security/ForbiddenException.java
------------------------------------------------------------------------------
Total diff coverage:
 - lines: 57% (32/56)
 - branches: 44% (4/9)
 - functions: 73% (48/65)
ERROR: Insufficient branch coverage of 44% (4/9). Required 50%.
```

The test coverage check failed. I agree those branches in JsonParserIterator should be covered, but since this is blocking a branch cut, I will add them as a follow-up. Other integration test failures seem flaky. I just restarted them.

@jihoonson (Contributor, Author)

All tests passed except for the test coverage check mentioned above. I'm going to go ahead and merge this to cut the branch. @jon-wei @a2l007 any thoughts?

@a2l007 (Contributor)

a2l007 commented Jan 14, 2021

SGTM. I've gone ahead and created an issue, #10756, so that we can keep track of this.

@jihoonson (Contributor, Author)

@a2l007 thanks!

@jihoonson jihoonson merged commit 149306c into apache:master Jan 14, 2021
JulianJaffePinterest pushed a commit to JulianJaffePinterest/druid that referenced this pull request Jan 22, 2021
* Tidy up query error codes

* fix tests

* Restore query exception type in JsonParserIterator

* address review comments; add a comment explaining the ugly switch

* fix test
