500s while querying longish ranges #266

@jml

I was just trying to get a graph of our frontend's QPS over the last week using the data we have in Cortex. I could get as far as 2 days, but even then, Grafana would sometimes render an error page. On 7 days, it consistently errors.

Looking at the Cortex logs for the querier service, I see:

time="2017-02-06T14:09:14Z" level=warning msg="Error fetching from cache: read tcp 10.244.254.146:52292->10.244.229.94:11211: i/o timeout" source="chunk_store.go:469"
time="2017-02-06T14:09:23Z" level=error msg="Error in MergeQuerier.QueryRange: InternalError: We encountered an internal error. Please try again.\n\tstatus code: 500, request id: 8CED988DD7ED850A" source="querier.go:130"

and

time="2017-02-06T14:07:08Z" level=warning msg="Error fetching from cache: read tcp 10.244.228.139:42174->10.244.253.92:11211: i/o timeout" source="chunk_store.go:469"
time="2017-02-06T14:07:28Z" level=error msg="Error in MergeQuerier.QueryRange: RequestError: send request failed\ncaused by: Get https://weaveworks-prod-chunks.s3.amazonaws.com/2/15428021661599280118%3A1486162592195%3A1486170002195: http: server closed idle connection" source="querier.go:130"
time="2017-02-06T14:07:28Z" level=error msg="Error in MergeQuerier.QueryRange: RequestError: send request failed\ncaused by: Get https://weaveworks-prod-chunks.s3.amazonaws.com/2/12619149369118128877%3A1486247414537%3A1486262234537: EOF" source="querier.go:130"

The error Grafana sees looks like:

{
  "status": "error",
  "errorType": "execution",
  "error": "InternalError: We encountered an internal error. Please try again.\n\tstatus code: 500, request id: 0BA1F8D00B3A2DDC",
  "message": "InternalError: We encountered an internal error. Please try again.\n\tstatus code: 500, request id: 0BA1F8D00B3A2DDC"
}

I guess there are two possible implications of this behaviour:

  • maybe we should handle errors on reads more robustly (e.g. with retries)
  • maybe we need special handling for longer range queries, which touch more data, so that they perform reasonably
