SQL result cache and partition cache #3994

marising · 2020-07-01T07:59:33Z

Solutions

This cache gives priority to data consistency. On that basis, it refines the cache granularity and improves the hit rate. It has the following characteristics:

Users don't need to worry about data consistency. Cache invalidation is controlled by version, so cached data is always consistent with the data queried from BE.
No additional components or costs are required. Cached results are stored in BE memory, and users can adjust the cache memory size to fit their needs.
Two caching strategies are implemented, SQL cache and partition cache, which give finer cache granularity.
The cache algorithm in BE is an improved LRU.

Two cache modes

SQLCache

SQL cache stores and fetches cached results by the SQL signature, the partition ID of the queried table, and the latest version of that partition.

Together, these three values identify a cached dataset. If any of them changes, for example the SQL changes (different query fields or conditions) or the version changes after a data update, the cache is not hit.

If multiple tables are joined, the latest partition ID and the latest version number are used. If any one of the tables is updated, the partition ID or version number differs, and the cache is not hit.

SQL cache is best suited to T+1 update scenarios. Once the data is updated in the morning, the first query fetches its results from BE and puts them into the cache, and subsequent identical queries are served from the cache. It can also be used with data updated in real time, but the hit rate may be low; for that case, see PartitionCache below.
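The lookup rule described above can be illustrated with a minimal Python sketch. It is not the code in this PR: `PartitionState`, `sql_cache_key`, and `SqlCache` are hypothetical names, MD5 stands in for whatever SQL signature the FE computes, and for joins the sketch simply keys on every table's latest partition ID and version, which is a conservative simplification of the rule stated above.

```python
import hashlib
from typing import Dict, List, NamedTuple, Optional, Sequence, Tuple


class PartitionState(NamedTuple):
    """Latest partition of one queried table (hypothetical type)."""
    partition_id: int
    version: int


CacheKey = Tuple[str, Tuple[PartitionState, ...]]


def sql_cache_key(sql: str, tables: Sequence[PartitionState]) -> CacheKey:
    # Assumption: MD5 of the SQL text stands in for the real SQL signature.
    signature = hashlib.md5(sql.encode("utf-8")).hexdigest()
    # Any change to the SQL, a partition ID, or a version yields a new key.
    return (signature, tuple(tables))


class SqlCache:
    """In-memory result cache keyed by (signature, partition ID, version)."""

    def __init__(self) -> None:
        self._entries: Dict[CacheKey, List[tuple]] = {}

    def get(self, sql: str, tables: Sequence[PartitionState]) -> Optional[List[tuple]]:
        return self._entries.get(sql_cache_key(sql, tables))

    def put(self, sql: str, tables: Sequence[PartitionState], rows: List[tuple]) -> None:
        self._entries[sql_cache_key(sql, tables)] = rows
```

Because the version is part of the key, a data update never returns stale rows: the old entry is simply never looked up again. This sketch leaves such stale entries in place; evicting them is a separate concern.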

PartitionCache

Consider querying the number of users per day over the last 7 days. If the table is partitioned by date, data is written only to the current day's partition, and the data in all other partitions is fixed. For the same query SQL, the metrics computed from a partition that is no longer updated are therefore fixed too. In the example below, a query on 2020-03-09 asks for the number of users in the preceding 7 days: the data for 2020-03-03 through 2020-03-07 comes from the cache; for 2020-03-08, the first query reads from the partition and subsequent queries come from the cache; and 2020-03-09 is read from the partition because data is still being written that day.

Therefore, when a query covers N days of data and only the most recent D days are being updated, each day's queries differ only slightly in date range, so only D partitions need to be queried and everything else comes from the cache. This effectively reduces cluster load and query time.

MySQL [(none)]> SELECT eventdate,count(userid) FROM testdb.appevent WHERE eventdate>="2020-03-03" AND eventdate<="2020-03-09" GROUP BY eventdate ORDER BY eventdate;
+------------+-----------------+
| eventdate  | count(`userid`) |
+------------+-----------------+
| 2020-03-03 |              15 | //From cache
| 2020-03-04 |              20 | ...
| 2020-03-05 |              25 |
| 2020-03-06 |              30 |
| 2020-03-07 |              35 |
| 2020-03-08 |              40 | //From cache
| 2020-03-09 |              25 | //From disk
+------------+-----------------+
7 rows in set (0.02 sec)
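To make the example above concrete, here is a minimal Python sketch of how a date-range query could be split into partitions served from the cache and partitions that still have to be scanned. The function name `split_by_cache`, the per-day dictionaries, and the version numbers are all assumptions made for illustration; the actual logic in this PR may differ, for instance in how it handles non-contiguous ranges.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple

# Hypothetical cache layout: partition day -> (version when cached, cached count)
CachedPartitions = Dict[date, Tuple[int, int]]


def split_by_cache(
    start: date,
    end: date,
    current_versions: Dict[date, int],
    cached: CachedPartitions,
) -> Tuple[Dict[date, int], List[date]]:
    """Return (cache hits per day, days that must be read from the partition)."""
    hits: Dict[date, int] = {}
    misses: List[date] = []
    day = start
    while day <= end:
        entry = cached.get(day)
        # A cached partition is usable only if its version is still current.
        if entry is not None and entry[0] == current_versions[day]:
            hits[day] = entry[1]
        else:
            misses.append(day)
        day += timedelta(days=1)
    return hits, misses


# Usage mirroring the query above, after 2020-03-08 has already been cached once.
days = [date(2020, 3, 3) + timedelta(days=i) for i in range(7)]
counts = [15, 20, 25, 30, 35, 40, 25]
versions = {d: 1 for d in days}
versions[date(2020, 3, 9)] = 5  # still being written, so the version keeps moving
cached = {d: (1, c) for d, c in zip(days[:6], counts[:6])}
cached[date(2020, 3, 9)] = (4, 20)  # stale entry from an earlier query

hits, misses = split_by_cache(days[0], days[-1], versions, cached)
```

With this input, only 2020-03-09 ends up in `misses`, which corresponds to the D = 1 case described above.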

Reference

For more information, please read partition_cache.md

BabySid · 2020-07-01T10:48:34Z

be/src/common/config.h

+    // Set the max size of the query result cache; the unit is MB
+    CONF_Int32(cache_max_size, "256");
+
+    // Cache memory is pruned when it reaches cache_max_size + cache_elasticity_size


Will the cache memory be shrunk?
I do not understand what this means.

To avoid frequent cache cleaning and keep a high hit rate, there are two config items, cache_max_size and cache_elasticity_size. With the default values, for example, when the cache reaches 256 MB + 128 MB, it is pruned back to 256 MB.
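The two-threshold behaviour described in this reply can be sketched in Python as follows. This is only an illustration under assumptions: `ElasticLruCache` is a hypothetical class, it uses a plain LRU rather than the improved LRU mentioned in the PR description, and sizes are abstract units rather than megabytes.

```python
from collections import OrderedDict
from typing import Any, Optional, Tuple


class ElasticLruCache:
    """LRU cache that prunes back to max_size only after reaching
    max_size + elasticity_size, so pruning runs in infrequent batches."""

    def __init__(self, max_size: int = 256, elasticity_size: int = 128) -> None:
        self.max_size = max_size
        self.elasticity_size = elasticity_size
        self.used = 0
        # Oldest entry first; each value is (payload, size).
        self._entries: "OrderedDict[str, Tuple[Any, int]]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]

    def put(self, key: str, value: Any, size: int) -> None:
        if key in self._entries:
            self.used -= self._entries.pop(key)[1]
        self._entries[key] = (value, size)
        self.used += size
        # Prune only once the elastic upper bound is reached.
        if self.used >= self.max_size + self.elasticity_size:
            self._prune()

    def _prune(self) -> None:
        # Evict least recently used entries until usage is back within max_size.
        while self.used > self.max_size and self._entries:
            _, (_, size) = self._entries.popitem(last=False)
            self.used -= size
```

The gap between the two thresholds is what avoids pruning on every insert: with a single limit, a full cache would evict on each `put`, whereas here eviction happens in batches of roughly `elasticity_size`.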

wutiangan · 2020-07-01T11:21:44Z

docs/zh-CN/administrator-guide/partition_cache.md

+```
+MySQL [(none)]> set [global] enable_sql_cache=true;
+```
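+Note: globa makes it a global variable; without it, the current session variable is set


Suggested change

Note: globa makes it a global variable; without it, the current session variable is set

Note: global makes it a global variable; without it, the current session variable is set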

BabySid · 2020-07-01T11:00:07Z

be/src/runtime/cache/result_cache.h

+
+typedef std::unordered_map<UniqueId, ResultNode*> ResultNodeMap;
+
+// a doubly linked list class


Why not use std::list?

BabySid · 2020-07-01T11:55:36Z

fe/src/main/java/org/apache/doris/qe/cache/CacheAnalyzer.java

+     * SELECT xxx FROM app_event INNER JOIN user_Profile ON app_event.user_id = user_profile.user_id xxx
+     * SELECT xxx FROM app_event INNER JOIN user_profile ON xxx INNER JOIN site_channel ON xxx
+     */
+    public void checkCacheMode(long now) {


Could EXPLAIN show whether a SQL statement hits the cache?
Otherwise we have to check the log, which is inconvenient.

marising · 2020-07-10T01:26:53Z

I split the PR and submitted the BE part first:
#4005

1. Cache SQL results for T+1 tables
2. Cache partition results for partitioned tables updated in real time
3. Config and session variables for the cache

BabySid reviewed Jul 1, 2020

wutiangan reviewed Jul 1, 2020

BabySid reviewed Jul 1, 2020

morningman added the area/sql/execution and kind/improvement labels Jul 1, 2020

marising force-pushed the partition_cache_0.3 branch 2 times, most recently from 47c06ff to c28775a Compare July 31, 2020 11:01

marising force-pushed the partition_cache_0.3 branch 4 times, most recently from 4a05946 to 885d9ce Compare August 10, 2020 10:09

[Feture][Cache] SQL result cache and partition cache

bc75361

marising force-pushed the partition_cache_0.3 branch from 885d9ce to bc75361 Compare August 10, 2020 11:29

HangyuanLiu closed this Oct 19, 2020
