Description
In our setup we are seeing high RU (request unit) consumption for the list activations query. This query is issued by the `wsk activation poll` command, which translates to
/api/v1/namespaces/_/activations?docs=true&limit=0&since=1542422386538&skip=0
We mitigated the cost to some extent by reducing the fetched document count (#4157). However, the poll query still consumes a large share of the provisioned RUs. One reason for this high usage is that the query is cross-partition.
CosmosDB has a limit of 10GB per partition. To avoid hitting that limit we store the activations using the id as the partition key. Hence a query listing the "recent" activations has to be executed across all partitions (fan-out), with the results then merged by the SDK on the client side. Looking at how the list query is used, the following aspects stand out:
- All queries are performed in descending order. Currently the user cannot change the sort order from the client side
- Most list calls specify `since` but not `upto` (`upto` can only be specified via the `list` command, which is not used much). So in that case the top results are only the most recent results
- Skip is mostly 0
- It is mostly used by developers who are actively working and want to see "recent" activations. So the result should mostly contain activations from the recent past, not very old ones
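To make the fan-out cost described above concrete, here is a minimal in-memory sketch. It is not the Cosmos DB SDK; `fan_out_query`, the document fields `namespace` and `start`, and the "at or after `since`" filter are assumptions made for illustration. Because activations are partitioned by id, every partition has to be visited, and the per-partition results are merged in descending order on the client.

```python
import heapq
from itertools import islice


def fan_out_query(partitions, namespace, since, limit):
    """Visit every partition, then merge the sorted partial results client side."""
    per_partition = []
    for docs in partitions:
        # The partition key (the id) says nothing about namespace or start time,
        # so no partition can be skipped.
        hits = [d for d in docs if d["namespace"] == namespace and d["start"] >= since]
        hits.sort(key=lambda d: d["start"], reverse=True)
        per_partition.append(hits)
    # Merge the already-sorted partial results, newest first.
    merged = heapq.merge(*per_partition, key=lambda d: d["start"], reverse=True)
    return list(islice(merged, limit)), len(partitions)


# Hypothetical data: one activation per partition, as happens with id as partition key.
partitions = [
    [{"id": "a1", "namespace": "guest", "start": 100}],
    [{"id": "a2", "namespace": "guest", "start": 300}],
    [{"id": "a3", "namespace": "other", "start": 250}],
    [{"id": "a4", "namespace": "guest", "start": 200}],
]
results, partitions_touched = fan_out_query(partitions, "guest", since=150, limit=10)
print([d["id"] for d in results], partitions_touched)
```

Only two documents match, yet all four partitions are touched, which is the pattern that drives the RU cost.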
Materialized View
Given the above aspects, one way to optimize query handling is the Materialized View pattern. For this we would have a new collection, `activations_query`:
- It would use the `namespace` as partition key
- It would have a much shorter TTL, say 1 hr (see also below)
TTL for Materialized View
To avoid hitting the partition limits we would need to keep a very low TTL for `activations_query`. To determine which TTL to use, we would need to collect metrics on the poll flow to see how old the returned activations are relative to the current time (#4688).
In general, because of the descending sort, the poll query appears to fetch only the very latest data. Once we have metrics around this we can confirm the hypothesis and use the observed value as the TTL.
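As a rough sketch of how such metrics could be turned into a TTL, the snippet below picks a nearest-rank percentile over observed ages. The function name `suggest_ttl_seconds`, the sample values, and the choice of percentile are all hypothetical; the actual metric collected under #4688 may look different.

```python
import math


def suggest_ttl_seconds(ages_seconds, percentile=99):
    """Return the age (nearest-rank percentile) that covers most poll results."""
    if not ages_seconds:
        raise ValueError("no samples collected")
    ordered = sorted(ages_seconds)
    # Nearest-rank method: smallest value with at least `percentile` % of samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical samples: age in seconds of the oldest activation in each poll response.
samples = [2, 5, 8, 12, 30, 45, 60, 90, 120, 600]
ttl = suggest_ttl_seconds(samples, percentile=90)
print(ttl)
```

With real data, a TTL somewhat above the chosen percentile would leave headroom while still keeping each namespace partition small.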
Write Flow
To add activations to it we would run 2 copies of the Activation Persister Service (#4632).
- The first service would write to the existing `activations` collection, which has a longer TTL, say 7 days
- The second service would write to the new `activations_query` collection
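The write flow can be sketched with toy in-memory collections. `Collection` and `Persister` are hypothetical stand-ins, not the real Activation Persister Service or a Cosmos DB client; the TTLs are the example values from the text (7 days and 1 hr).

```python
class Collection:
    """Toy stand-in for a collection with a partition key and a TTL."""

    def __init__(self, partition_key, ttl_seconds):
        self.partition_key = partition_key
        self.ttl_seconds = ttl_seconds
        self.partitions = {}  # partition key value -> list of (expires_at, doc)

    def put(self, doc, now):
        key = doc[self.partition_key]
        self.partitions.setdefault(key, []).append((now + self.ttl_seconds, doc))

    def live_docs(self, now):
        """Return the documents whose TTL has not yet elapsed."""
        return [doc for docs in self.partitions.values()
                for expires_at, doc in docs if expires_at > now]


class Persister:
    """One persister copy; each copy writes to exactly one collection."""

    def __init__(self, target):
        self.target = target

    def handle(self, activation, now):
        self.target.put(activation, now)


activations = Collection(partition_key="id", ttl_seconds=7 * 24 * 3600)
activations_query = Collection(partition_key="namespace", ttl_seconds=3600)
persisters = [Persister(activations), Persister(activations_query)]

incoming = [
    {"id": "a1", "namespace": "guest", "start": 0},
    {"id": "a2", "namespace": "guest", "start": 10},
]
for activation in incoming:
    for persister in persisters:  # both copies see every activation
        persister.handle(activation, now=activation["start"])

two_hours_later = 2 * 3600
long_lived = activations.live_docs(two_hours_later)
short_lived = activations_query.live_docs(two_hours_later)
print(len(long_lived), len(short_lived))
```

After two hours the long-lived collection still holds both activations (spread over two partitions), while the query collection, which kept them in a single `guest` partition, has expired them.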
Query Flow
For the query part we would need a MultiplexingArtifactStore implementation in which a query call would:
- Query `activations_query` first and check whether it can fetch results up to the `limit`
- If the result count reaches the limit, return that
- ELSE query `activations` and fetch the remaining results
For most cases I expect the code path would make only the first call, taking the fast path and thus resulting in lower overall RU usage.
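A minimal sketch of the query flow described above, assuming a positive `limit` and documents that carry `id`, `namespace` and `start` fields. `query_with_fallback` and `make_query` are hypothetical helpers, not the MultiplexingArtifactStore API; the fallback here simply re-queries the full collection and drops duplicates, which is one of several possible ways to fetch the remaining results.

```python
def query_with_fallback(query_recent, query_full, namespace, since, limit):
    """Try the short-TTL view first; fall back to the full collection if it is short."""
    recent = query_recent(namespace, since, limit)
    if len(recent) >= limit:
        return recent[:limit], "fast"
    # Not enough results in the view: fetch from the full collection and de-duplicate.
    seen = {doc["id"] for doc in recent}
    remaining = [doc for doc in query_full(namespace, since, limit) if doc["id"] not in seen]
    combined = sorted(recent + remaining, key=lambda doc: doc["start"], reverse=True)
    return combined[:limit], "fallback"


def make_query(docs):
    """Build a toy query function over an in-memory list of activations."""
    def query(namespace, since, limit):
        hits = [d for d in docs if d["namespace"] == namespace and d["start"] >= since]
        hits.sort(key=lambda d: d["start"], reverse=True)
        return hits[:limit]
    return query


all_docs = [{"id": f"a{i}", "namespace": "guest", "start": i * 100} for i in range(1, 6)]
recent_docs = all_docs[-2:]  # only the two newest are still in the short-TTL view

query_recent = make_query(recent_docs)
query_full = make_query(all_docs)

fast_result, fast_path = query_with_fallback(query_recent, query_full, "guest", 0, 2)
slow_result, slow_path = query_with_fallback(query_recent, query_full, "guest", 0, 4)
print(fast_path, [d["id"] for d in fast_result])
print(slow_path, [d["id"] for d in slow_result])
```

The first call is served entirely from the view; the second asks for more than the view holds and pays for the cross-partition query as well.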