@marising marising commented Jul 1, 2020

#2581

Solutions

This cache gives priority to data consistency. On that basis, it refines the cache granularity to improve the hit rate. It has the following characteristics:

  • Users don't need to worry about data consistency. Cache invalidation is controlled by version, so the cached data is consistent with the data queried from BE
  • No additional components or costs are required. The cached results are stored in BE memory, and users can adjust the cache memory size according to their needs
  • Two caching strategies are implemented, SQL cache and partition cache, which provide finer granularity
  • The cache algorithm in BE is an improved LRU

Two cache modes

SQLCache

SQL cache stores and fetches results according to the SQL signature, the partition ID of the queried table, and the latest version of that partition.

The combination of the three identifies a cached dataset. If any one of them changes, for example the SQL text differs (different query fields or conditions) or the version changes after a data update, the cache is not hit.

If multiple tables are joined, the latest partition ID and the latest version number are used. If any of the tables is updated, the partition ID or version number changes, and the cache is not hit.
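As a rough illustration of the three-part key described above, here is a minimal C++ sketch. It is not the PR's implementation: `SqlCacheKey`, `make_key`, and `SqlCache` are hypothetical names, `std::hash` stands in for the real SQL signature, and how the latest partition is chosen across joined tables is left to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical key: SQL signature + partition ID + partition version.
struct SqlCacheKey {
    std::size_t sql_signature;  // assumption: a plain hash of the SQL text
    std::int64_t partition_id;  // latest partition ID of the queried table(s)
    std::int64_t version;       // latest version of that partition

    bool operator==(const SqlCacheKey& o) const {
        return sql_signature == o.sql_signature &&
               partition_id == o.partition_id && version == o.version;
    }
};

struct SqlCacheKeyHash {
    std::size_t operator()(const SqlCacheKey& k) const {
        std::size_t h = k.sql_signature;
        h = h * 31 + std::hash<std::int64_t>{}(k.partition_id);
        h = h * 31 + std::hash<std::int64_t>{}(k.version);
        return h;
    }
};

// Builds a key from the SQL text and the latest partition ID/version.
SqlCacheKey make_key(const std::string& sql, std::int64_t partition_id,
                     std::int64_t version) {
    assert(!sql.empty());  // an empty statement has no meaningful signature
    return SqlCacheKey{std::hash<std::string>{}(sql), partition_id, version};
}

// Toy in-memory cache; the result is kept as an opaque string here.
class SqlCache {
public:
    void put(const SqlCacheKey& key, std::string result) {
        entries_[key] = std::move(result);
    }

    // Returns nullptr on a miss, i.e. when any of the three parts differs.
    const std::string* get(const SqlCacheKey& key) const {
        auto it = entries_.find(key);
        return it == entries_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<SqlCacheKey, std::string, SqlCacheKeyHash> entries_;
};
```

With this shape, a data update that bumps the partition version produces a different key, so the stale entry is simply never looked up again rather than being explicitly invalidated.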

SQL cache is best suited to T+1 update scenarios. After the data is updated in the morning, the first query fetches its results from BE and puts them into the cache, and subsequent identical queries are served from the cache. It can also be used with data that is updated in real time, but the hit rate may be low; for that case, see PartitionCache below.

PartitionCache

Consider a query for the number of users per day over the last 7 days. If the table is partitioned by date and data is written only to the current day's partition, the data in all other partitions is fixed, so for the same SQL the results computed from partitions that are no longer updated are fixed as well. In the example below, the query runs on 2020-03-09 and covers the previous 7 days: the data for 2020-03-03 through 2020-03-07 comes from the cache; for 2020-03-08, the first query reads from the partition and subsequent queries come from the cache; and 2020-03-09 is read from the partition because it is still being written to that day.

Therefore, when querying N days of data of which only the latest D days are being updated, repeated queries over similar date ranges need to read only D partitions. Everything else comes from the cache, which reduces both the cluster load and the query time.

MySQL [(none)]> SELECT eventdate,count(userid) FROM testdb.appevent WHERE eventdate>="2020-03-03" AND eventdate<="2020-03-09" GROUP BY eventdate ORDER BY eventdate;
+------------+-----------------+
| eventdate  | count(`userid`) |
+------------+-----------------+
| 2020-03-03 |              15 | //From cache
| 2020-03-04 |              20 | ...
| 2020-03-05 |              25 |
| 2020-03-06 |              30 |
| 2020-03-07 |              35 |
| 2020-03-08 |              40 | //From cache
| 2020-03-09 |              25 | //From disk
+------------+-----------------+
7 rows in set (0.02 sec)
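The per-partition decision in the example above can be sketched in C++ as follows. This is a simplified illustration under stated assumptions, not the logic in this PR: `PartitionState`, `QueryPlan`, `plan_query`, and `mark_cached` are hypothetical names, partitions are identified by their date string, and a partition counts as cached only if the cached version equals its current version.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical description of one partition touched by the query.
struct PartitionState {
    std::string date;      // partition key, e.g. "2020-03-08"
    std::int64_t version;  // current version of the partition
};

// Which partitions can be served from cache and which must be scanned.
struct QueryPlan {
    std::vector<std::string> from_cache;
    std::vector<std::string> from_partition;
};

// Maps partition date -> version at which its result was cached.
using CachedVersions = std::unordered_map<std::string, std::int64_t>;

// A partition is served from cache only if it was cached at its current version.
QueryPlan plan_query(const std::vector<PartitionState>& partitions,
                     const CachedVersions& cached) {
    QueryPlan plan;
    for (const auto& p : partitions) {
        auto it = cached.find(p.date);
        if (it != cached.end() && it->second == p.version) {
            plan.from_cache.push_back(p.date);
        } else {
            plan.from_partition.push_back(p.date);
        }
    }
    return plan;
}

// Records that a freshly scanned partition result is now cached.
void mark_cached(CachedVersions& cached, const PartitionState& p) {
    assert(!p.date.empty());
    cached[p.date] = p.version;
}
```

In this sketch, a partition that is still receiving writes keeps getting a new version, so its cached entry never matches and it is scanned on every query, which mirrors the 2020-03-09 row in the example.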

Reference

For more information, please read partition_cache.md

// Maximum size of the query result cache, in MB
CONF_Int32(cache_max_size, "256");

// Cache memory is pruned when it reaches cache_max_size + cache_elasticity_size
Contributor

Will the cache memory be shrunk?
I don't understand what this means.

Contributor Author

To avoid frequent cache cleaning and keep the hit rate high, there are two config items, cache_max_size and cache_elasticity_size. With the default values, for example, when cache memory reaches 256M + 128M, it is pruned back to 256M.
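A minimal C++ sketch of this two-threshold behavior, for illustration only: it uses a plain LRU built on `std::list` rather than the improved LRU in this PR, and `ElasticLruCache` and its members are hypothetical names. Sizes are abstract units (think MB).

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Plain LRU with an elastic upper bound: pruning starts only when usage
// reaches max_size + elasticity_size, and then shrinks back to max_size.
class ElasticLruCache {
public:
    ElasticLruCache(std::size_t max_size, std::size_t elasticity_size)
        : max_size_(max_size), elasticity_size_(elasticity_size) {
        assert(max_size > 0);
    }

    // Inserts or replaces an entry of the given size, then prunes if needed.
    void put(const std::string& key, std::size_t bytes) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            used_ -= it->second->bytes;
            lru_.erase(it->second);
            index_.erase(it);
        }
        lru_.push_front(Entry{key, bytes});
        index_[key] = lru_.begin();
        used_ += bytes;
        if (used_ >= max_size_ + elasticity_size_) prune();
    }

    // Marks an entry as most recently used; returns false on a miss.
    bool touch(const std::string& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return false;
        lru_.splice(lru_.begin(), lru_, it->second);
        return true;
    }

    bool contains(const std::string& key) const { return index_.count(key) > 0; }
    std::size_t used() const { return used_; }

private:
    struct Entry {
        std::string key;
        std::size_t bytes;
    };

    // Evicts least recently used entries until usage is back at max_size.
    void prune() {
        while (used_ > max_size_ && !lru_.empty()) {
            const Entry& victim = lru_.back();
            used_ -= victim.bytes;
            index_.erase(victim.key);
            lru_.pop_back();
        }
    }

    std::size_t max_size_;
    std::size_t elasticity_size_;
    std::size_t used_ = 0;
    std::list<Entry> lru_;  // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

The gap between the two thresholds means one pruning pass frees a batch of entries at once, instead of evicting on every insert once the cache is full.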

```
MySQL [(none)]> set [global] enable_sql_cache=true;
```
Note: globa sets the global variable; without it, the variable is set for the current session only
Contributor

Suggested change
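- Note: globa sets the global variable; without it, the variable is set for the current session only
+ Note: global sets the global variable; without it, the variable is set for the current session only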


typedef std::unordered_map<UniqueId, ResultNode*> ResultNodeMap;

// a doubly linked list class
Contributor

why not use std::list?

* SELECT xxx FROM app_event INNER JOIN user_profile ON app_event.user_id = user_profile.user_id xxx
* SELECT xxx FROM app_event INNER JOIN user_profile ON xxx INNER JOIN site_channel ON xxx
*/
public void checkCacheMode(long now) {
Contributor

Could EXPLAIN show whether a SQL statement hits the cache?
Otherwise we have to check the log, which is inconvenient.

@morningman morningman added area/sql/execution Issues or PRs related to the execution engine kind/improvement labels Jul 1, 2020
@marising
Contributor Author

I split the PR and submitted the BE part first:
#4005

@marising marising force-pushed the partition_cache_0.3 branch 2 times, most recently from 47c06ff to c28775a Compare July 31, 2020 11:01
@marising marising force-pushed the partition_cache_0.3 branch 4 times, most recently from 4a05946 to 885d9ce Compare August 10, 2020 10:09
1. Cache SQL results for T+1 tables
2. Cache partition results for partitioned tables that are updated in real time
3. Config and session variables for the cache
@marising marising force-pushed the partition_cache_0.3 branch from 885d9ce to bc75361 Compare August 10, 2020 11:29