Description
Background
Doris currently uses the REPLACE aggregation type to update data, but the replacement order is not guaranteed for rows imported in the same batch. To get a deterministic result, the user must ensure that no two rows in the same batch share the same key columns, which is very inconvenient. To solve this problem, we can use a version column to specify the replacement order.
Goal
The user specifies a version column when creating the table, and Doris relies on this column when updating REPLACE-type columns. A row with a larger version value replaces the data of a row with a smaller version value, while a row with a smaller version value never replaces the data of a row with a larger one.
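The intended semantics can be illustrated with a minimal Python sketch. It is not Doris code: the function name `replace_by_version` and the row layout `(key, version, value)` are hypothetical, and the tie-breaking rule for equal versions is an assumption, since the proposal does not specify it.

```python
from itertools import permutations


def replace_by_version(rows):
    """Aggregate (key, version, value) rows.

    For each key, keep the value that arrived with the largest version.
    """
    result = {}
    for key, version, value in rows:
        current = result.get(key)
        # Strictly greater: on equal versions the first row seen wins.
        # This tie-breaking rule is an assumption, not part of the proposal.
        if current is None or version > current[0]:
            result[key] = (version, value)
    return result


# One batch with three rows for the same key (id, date, group_id).
batch = [
    ((1, "2020-01-01", 10), 1, "shoes"),
    ((1, "2020-01-01", 10), 3, "boots"),
    ((1, "2020-01-01", 10), 2, "sandals"),
]

# Whatever order the rows arrive in, the row with version 3 wins.
results = [replace_by_version(order) for order in permutations(batch)]
all_orders_agree = all(r == results[0] for r in results)
```

As long as versions are distinct per key, the result no longer depends on the order of rows within the batch, which is the property plain REPLACE lacks.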
Create Table Interface
CREATE TABLE `test` (
    `id` bigint(20) NOT NULL,
    `date` date NOT NULL,
    `group_id` bigint(20) NOT NULL,
    `version` int MAX NOT NULL,
    `keyword` varchar(128) REPLACE NOT NULL,
    `clicks` bigint(20) SUM NULL DEFAULT "0",
    `cost` bigint(20) SUM NULL DEFAULT "0"
) ENGINE=OLAP
AGGREGATE KEY(`id`, `date`, `group_id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 16
PROPERTIES (
    "replace_version_column" = "version"
);
When creating a table, the user simply adds the replace_version_column property in PROPERTIES to identify the version column. The version column must use the MAX aggregation type, so that only the largest version value is retained for rows with the same key columns.
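The constraint on the version column could be checked at table creation time. The following Python sketch shows one possible form of that check; the function `validate_replace_version_column` and the dictionary-based schema are hypothetical stand-ins and do not correspond to existing FE code.

```python
def validate_replace_version_column(aggregation, key_columns, properties):
    """Return the version column name, or None if the property is not set.

    aggregation: maps each value column name to its aggregation type.
    key_columns: names listed in AGGREGATE KEY.
    properties:  the table PROPERTIES as a dict.
    """
    name = properties.get("replace_version_column")
    if name is None:
        return None  # feature not used, original REPLACE behaviour applies
    if name in key_columns:
        raise ValueError(f"version column '{name}' must not be a key column")
    if name not in aggregation:
        raise ValueError(f"version column '{name}' does not exist")
    if aggregation[name] != "MAX":
        raise ValueError(f"version column '{name}' must use MAX aggregation")
    return name


# Schema of the `test` table from the example above.
key_columns = ["id", "date", "group_id"]
aggregation = {
    "version": "MAX",
    "keyword": "REPLACE",
    "clicks": "SUM",
    "cost": "SUM",
}
properties = {"replace_version_column": "version"}

version_column = validate_replace_version_column(
    aggregation, key_columns, properties
)
```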
Query
When a user's query does not contain a REPLACE column, the original logic applies. When a user's query contains REPLACE columns, BE needs to additionally read the version column on which the REPLACE columns depend, and compare the version values when rows are aggregated. This can be done by extending the Reader's return columns. In FE, isPreAggregation is OFF because a REPLACE column is a value column in the StorageEngine, which means the storage engine must aggregate the data before returning it to the scan node, so we can guarantee that rows with the same key columns are aggregated in Reader.
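Extending the return columns can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the BE implementation: `extend_return_columns` and the column dictionaries are hypothetical names.

```python
def extend_return_columns(query_columns, aggregation, version_column):
    """Add the version column when the query reads any REPLACE column.

    query_columns:  columns requested by the query, in order.
    aggregation:    maps value column names to their aggregation type.
    version_column: name configured via replace_version_column.
    """
    needs_version = any(
        aggregation.get(column) == "REPLACE" for column in query_columns
    )
    if needs_version and version_column not in query_columns:
        # The extra column is only used for comparison during aggregation.
        return list(query_columns) + [version_column]
    return list(query_columns)


aggregation = {
    "version": "MAX",
    "keyword": "REPLACE",
    "clicks": "SUM",
    "cost": "SUM",
}

with_replace = extend_return_columns(["id", "keyword"], aggregation, "version")
without_replace = extend_return_columns(["id", "clicks"], aggregation, "version")
```

Queries that touch only SUM or key columns keep their original column list, so they pay no extra read cost in this sketch.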
Compaction
Base and Cumulative Compaction use Reader to aggregate data, and they use all tablet columns as return columns. So, as in query processing, Reader can perform the replacement based on the version column.
Load
Within a single load batch, Doris uses one or more MemTables. Within one MemTable, we need to ensure that, for rows with the same key columns, REPLACE-type columns are replaced according to the version column. The replacement order across different MemTables does not need to be guaranteed during load, because Query and Compaction guarantee it.
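The reason ordering across MemTables does not matter can be shown with a small Python sketch. It assumes a simplified row with one MAX column (`version`), one REPLACE column (`keyword`), and one SUM column (`clicks`); `merge_row` and `aggregate` are hypothetical helpers, and versions are assumed distinct per key.

```python
def merge_row(a, b):
    """Merge two rows that share the same key."""
    # The REPLACE column follows the row with the larger version.
    winner = a if a["version"] >= b["version"] else b
    return {
        "version": winner["version"],          # MAX
        "keyword": winner["keyword"],          # REPLACE, ordered by version
        "clicks": a["clicks"] + b["clicks"],   # SUM
    }


def aggregate(rows):
    """Aggregate (key, row) pairs the way one MemTable or Reader pass would."""
    out = {}
    for key, row in rows:
        out[key] = merge_row(out[key], row) if key in out else dict(row)
    return out


rows = [
    (1, {"version": 2, "keyword": "b", "clicks": 5}),
    (1, {"version": 1, "keyword": "a", "clicks": 1}),
    (1, {"version": 3, "keyword": "c", "clicks": 2}),
    (2, {"version": 1, "keyword": "x", "clicks": 7}),
]

# The batch is split across two MemTables, each aggregated on its own.
memtable_1 = aggregate(rows[:2])
memtable_2 = aggregate(rows[2:])

# A later Reader pass (query or compaction) merges them in either order.
read_12 = aggregate(list(memtable_1.items()) + list(memtable_2.items()))
read_21 = aggregate(list(memtable_2.items()) + list(memtable_1.items()))
single_pass = aggregate(rows)
```

Because each MemTable keeps the version alongside the REPLACE value, the later merge has enough information to pick the right row regardless of flush order.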
RollUp
If a rollup contains a REPLACE-type column, it must also contain the version column: either the user adds the replace version column to the rollup explicitly, or Doris extends the rollup with it automatically.