Optimize activation list query handling for CosmosDB #4684

@chetanmeh

In our setup we are seeing high RU (request unit) consumption for the list activations query. This query is issued by the wsk activation poll command, which translates to

/api/v1/namespaces/_/activations?docs=true&limit=0&since=1542422386538&skip=0
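For reference, the parameters of that request can be pulled apart with a few lines of Python. This is only an illustrative sketch: the helper name describe_poll_request is hypothetical, and it assumes since is an epoch timestamp in milliseconds.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit, parse_qs

# The request path quoted above.
POLL_PATH = "/api/v1/namespaces/_/activations?docs=true&limit=0&since=1542422386538&skip=0"


def describe_poll_request(path):
    """Split the poll request into its query parameters (hypothetical helper)."""
    parts = urlsplit(path)
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    since_ms = int(params["since"])  # assumption: epoch milliseconds
    return {
        "path": parts.path,
        "docs": params["docs"] == "true",
        "limit": int(params["limit"]),
        "skip": int(params["skip"]),
        "since": datetime.fromtimestamp(since_ms / 1000, tz=timezone.utc),
    }


if __name__ == "__main__":
    print(describe_poll_request(POLL_PATH))
```

The point to notice is that the request carries since but no upto, so the query is only bounded on the older side.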

We mitigated the cost to some extent by reducing the fetched document count (#4157). However, the poll query still consumes a large share of the provisioned RU. One reason for this high usage is that the query is cross-partition.

CosmosDB has a limit of 10GB per partition. To avoid hitting that limit we store the activations using the id as the partition key. Hence a query listing the "recent" activations has to be executed across all partitions (fan-out), with the results then merged by the SDK on the client side. Looking at how the list query is used, the following aspects stand out:

  1. All queries are performed in descending order. Currently the user cannot change the sort order from the client side.
  2. Most list calls specify since but not upto (upto can only be specified via the list command, which is not used much). In that case the top results are simply the most recent ones.
  3. Skip is mostly 0.
  4. It is mostly used by developers who are actively working and want to see "recent" activations. So the result should mostly contain activations from the recent past, not very old ones.
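To make the fan-out cost concrete, here is a simplified in-memory model of a cross-partition list query. It is a sketch under stated assumptions, not how the CosmosDB SDK is implemented: partitions are plain Python lists, and the field names namespace and start are placeholders for the activation's namespace and start time.

```python
import heapq
from itertools import islice


def fan_out_query(partitions, namespace, since, limit):
    """Query every partition, then merge the results newest-first on the client.

    `partitions` maps a partition key to the list of documents stored under it.
    Returns the merged page and the number of partitions that were touched.
    """
    per_partition = []
    for docs in partitions.values():
        hits = [d for d in docs if d["namespace"] == namespace and d["start"] >= since]
        hits.sort(key=lambda d: d["start"], reverse=True)
        # Each partition can contribute at most `limit` documents to the page.
        per_partition.append(hits[:limit])
    # Client-side merge of the already-sorted partial results, newest first.
    merged = heapq.merge(*per_partition, key=lambda d: d["start"], reverse=True)
    return list(islice(merged, limit)), len(partitions)


if __name__ == "__main__":
    # With the id as partition key, every activation sits in its own partition.
    partitions = {
        "a1": [{"id": "a1", "namespace": "ns", "start": 100}],
        "a2": [{"id": "a2", "namespace": "ns", "start": 300}],
        "a3": [{"id": "a3", "namespace": "other", "start": 400}],
        "a4": [{"id": "a4", "namespace": "ns", "start": 200}],
    }
    page, touched = fan_out_query(partitions, "ns", since=150, limit=2)
    print([d["id"] for d in page], "partitions touched:", touched)
```

In this model the work grows with the number of partitions rather than with the size of the requested page, which is the cost the proposal below tries to avoid.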

Materialized View

Given the above aspects, one way to optimize the query handling is the Materialized View pattern. For this we would have a new collection, activations_query, which would:

  1. Use the namespace as the partition key
  2. Have a much shorter TTL, say 1 hr (see also below)
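As a rough illustration, a document destined for activations_query could be derived from the activation as below. The helper name to_view_document and the field names in the example are assumptions; the ttl property is CosmosDB's per-item time-to-live in seconds, which only takes effect when TTL is enabled on the collection.

```python
VIEW_TTL_SECONDS = 60 * 60  # 1 hr, the example value from the proposal


def to_view_document(activation, ttl_seconds=VIEW_TTL_SECONDS):
    """Copy an activation and stamp it with a short per-item TTL (hypothetical helper).

    The collection itself would be created with /namespace as its partition key
    path, so the `namespace` field decides where the copy is stored.
    """
    if "namespace" not in activation:
        raise ValueError("activation needs a namespace to be partitioned by it")
    doc = dict(activation)  # full copy, since the poll query asks for docs=true
    doc["ttl"] = ttl_seconds
    return doc


if __name__ == "__main__":
    activation = {"id": "abc123", "namespace": "guest", "name": "hello", "start": 1542422386538}
    print(to_view_document(activation))
```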

TTL for Materialized View

To avoid hitting the partition limits we would need to keep a very low TTL for activations_query. To determine which TTL to use, we would need to collect metrics on the poll flow to see how old the returned activations are compared to the current time (#4688).

In general it is seen that, due to the descending sort, the poll query fetches only the very latest data. Once we have metrics around this we can confirm this hypothesis and use the measured value as the TTL.
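One possible way to turn such metrics into a TTL is sketched below. It assumes each poll response is reduced to a single sample, the age of the oldest activation it returned, and that a high percentile of those samples is an acceptable TTL; both the function names and the choice of percentile are assumptions, not something #4688 prescribes.

```python
import math


def oldest_age_ms(now_ms, activation_starts_ms):
    """Age of the oldest activation in one poll response (needs a non-empty list)."""
    return now_ms - min(activation_starts_ms)


def suggest_ttl_seconds(ages_ms, percentile=0.99):
    """Smallest TTL (in whole seconds) that covers `percentile` of the observed ages."""
    ordered = sorted(ages_ms)
    # Nearest-rank percentile; raises IndexError if no samples were collected.
    rank = max(1, math.ceil(percentile * len(ordered)))
    return math.ceil(ordered[rank - 1] / 1000)


if __name__ == "__main__":
    samples = [1_000, 2_000, 3_000, 600_000]  # made-up ages in milliseconds
    print(suggest_ttl_seconds(samples, percentile=0.75))
    print(suggest_ttl_seconds(samples, percentile=0.99))
```

Polls whose results are older than the chosen TTL would not be served from activations_query alone, so the percentile directly trades partition size against how often the slower path is taken.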

Write Flow

To add activations to it we would run 2 copies of the Activation Persister Service (#4632):

  1. The first service would write to the existing activations collection, which has a longer TTL, say 7 days
  2. The second service would write to the new activations_query collection
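The write flow can be pictured with the following toy model. How the persister service actually receives activations is defined in #4632 and not shown here; the InMemoryStore, Persister and publish names are hypothetical stand-ins, and the TTL values are the example figures from this issue.

```python
class InMemoryStore:
    """Stand-in for a CosmosDB collection with a collection-wide TTL."""

    def __init__(self, name, ttl_seconds):
        self.name = name
        self.ttl_seconds = ttl_seconds
        self.docs = {}

    def put(self, doc):
        # Store a copy so each collection holds its own document.
        self.docs[doc["id"]] = {**doc, "ttl": self.ttl_seconds}


class Persister:
    """One copy of the persister service, bound to exactly one collection."""

    def __init__(self, store):
        self.store = store

    def handle(self, activation):
        self.store.put(activation)


def publish(activation, persisters):
    """Deliver the same activation to every persister copy (hypothetical transport)."""
    for persister in persisters:
        persister.handle(activation)


if __name__ == "__main__":
    activations = InMemoryStore("activations", ttl_seconds=7 * 24 * 60 * 60)
    activations_query = InMemoryStore("activations_query", ttl_seconds=60 * 60)
    persisters = [Persister(activations), Persister(activations_query)]
    publish({"id": "abc123", "namespace": "guest", "start": 1542422386538}, persisters)
    print(activations.docs["abc123"]["ttl"], activations_query.docs["abc123"]["ttl"])
```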

Query Flow

For the query part we would need a MultiplexingArtifactStore implementation where a query call would:

  1. Query activations_query first and check whether it can return results up to the limit
  2. If the result count reaches the limit, return that result
  3. Otherwise, query activations and fetch the remaining results

For most cases I expect the code path would make only the first call, taking the fast path and resulting in lower overall RU usage.
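A minimal sketch of that query flow, written in Python for brevity rather than as the actual MultiplexingArtifactStore, could look like this. The ListStore and MultiplexingStore classes are hypothetical, the stores are in-memory lists, a positive limit is assumed, and filling the remainder by de-duplicating on id is one possible way to "fetch the remaining results".

```python
class ListStore:
    """In-memory stand-in for a collection; counts how often it is queried."""

    def __init__(self, docs):
        self.docs = docs
        self.calls = 0

    def query(self, namespace, since, limit):
        self.calls += 1
        hits = [d for d in self.docs if d["namespace"] == namespace and d["start"] >= since]
        hits.sort(key=lambda d: d["start"], reverse=True)
        return hits[:limit]


class MultiplexingStore:
    """Try the short-TTL view first, fall back to the full collection if needed."""

    def __init__(self, view, main):
        self.view = view
        self.main = main

    def query(self, namespace, since, limit):
        # Assumes limit > 0.
        recent = self.view.query(namespace, since, limit)
        if len(recent) >= limit:
            return recent  # fast path: single-partition query only
        # Slow path: fill the rest of the page from the full collection,
        # skipping documents the view already returned.
        seen = {d["id"] for d in recent}
        older = [d for d in self.main.query(namespace, since, limit) if d["id"] not in seen]
        return recent + older[: limit - len(recent)]


if __name__ == "__main__":
    docs = [{"id": f"a{i}", "namespace": "ns", "start": i * 100} for i in range(1, 6)]
    view, main = ListStore(docs[3:]), ListStore(docs)
    store = MultiplexingStore(view, main)
    print([d["id"] for d in store.query("ns", since=0, limit=2)], main.calls)
    print([d["id"] for d in store.query("ns", since=0, limit=4)], main.calls)
```

The fallback relies on the view holding the newest activations, so anything it lacks is older than what it returned and the combined page stays in descending order.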
