Merged
2 changes: 1 addition & 1 deletion be/src/olap/olap_server.cpp
@@ -174,7 +174,7 @@ void StorageEngine::_garbage_sweeper_thread_callback() {
curr_interval = std::min(curr_interval, max_interval);

// start clean trash and update usage.
- OLAPStatus res = _start_trash_sweep(&usage);
+ OLAPStatus res = start_trash_sweep(&usage);
if (res != OLAP_SUCCESS) {
OLAP_LOG_WARNING(
"one or more errors occur when sweep trash."
15 changes: 12 additions & 3 deletions be/src/olap/storage_engine.cpp
@@ -635,15 +635,24 @@ void StorageEngine::_start_clean_fd_cache() {
VLOG_TRACE << "end clean file descritpor cache";
}

- OLAPStatus StorageEngine::_start_trash_sweep(double* usage) {
+ OLAPStatus StorageEngine::start_trash_sweep(double* usage, bool ignore_guard) {
OLAPStatus res = OLAP_SUCCESS;

+ std::unique_lock<std::mutex> l(_trash_sweep_lock,std::defer_lock);
+ if(!l.try_lock()) {
+ LOG(INFO) << "trash and snapshot sweep is running.";
+ return res;
+ }

LOG(INFO) << "start trash and snapshot sweep.";

const int32_t snapshot_expire = config::snapshot_expire_time_sec;
const int32_t trash_expire = config::trash_file_expire_time_sec;
// the guard space should be lower than storage_flood_stage_usage_percent,
// so here we multiply 0.9
- const double guard_space = config::storage_flood_stage_usage_percent / 100.0 * 0.9;
+ // if ignore_guard is true, set guard_space to 0.
+ const double guard_space =
+ ignore_guard ? 0 : config::storage_flood_stage_usage_percent / 100.0 * 0.9;
std::vector<DataDirInfo> data_dir_infos;
RETURN_NOT_OK_LOG(get_all_data_dir_info(&data_dir_infos, false),
"failed to get root path stat info when sweep trash.")
@@ -687,7 +696,7 @@ OLAPStatus StorageEngine::_start_trash_sweep(double* usage) {
}

if (usage != nullptr) {
- *usage = tmp_usage;
+ *usage = tmp_usage; // update usage
}

// clear expire incremental rowset, move deleted tablet to trash
7 changes: 5 additions & 2 deletions be/src/olap/storage_engine.h
@@ -170,6 +170,10 @@ class StorageEngine {
// start all background threads. This should be call after env is ready.
Status start_bg_threads();

+ // clear trash and snapshot file
+ // option: update disk usage after sweep
+ OLAPStatus start_trash_sweep(double* usage, bool ignore_guard = false);
+
void stop();

void create_cumulative_compaction(TabletSharedPtr best_tablet,
@@ -238,8 +242,6 @@ class StorageEngine {

void _start_clean_fd_cache();

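- // Clean up trash and snapshot files; returns the disk usage after cleanup
- OLAPStatus _start_trash_sweep(double* usage);
// Disk status monitoring. Monitors the unused flag of the root_path that corresponds to the unused_flag path.
// When an unused flag is detected, the corresponding table information is removed from memory and the data on disk is left untouched.
// When the disk status is unavailable but no unused flag has been detected, it is necessary to, from root_path,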
@@ -291,6 +293,7 @@ class StorageEngine {

EngineOptions _options;
std::mutex _store_lock;
+ std::mutex _trash_sweep_lock;
std::map<std::string, DataDir*> _store_map;
uint32_t _available_storage_medium_type_count;

3 changes: 3 additions & 0 deletions be/src/service/backend_service.cpp
@@ -373,4 +373,7 @@ void BackendService::get_stream_load_record(TStreamLoadRecordResult& result,
}
}

+ void BackendService::clean_trash() {
+ StorageEngine::instance()->start_trash_sweep(nullptr, true);
Review comment (Contributor):

It may take a very long time to clean the trash, so I suggest using an async call.

Reply (Contributor Author, @BiteTheDDDDt, Aug 5, 2021):

> It may take a very long time to clean the trash, so I suggest using an async call.

I think this is already async, because I use `oneway` to define the function in the thrift file.
gensrc/thrift/BackendService.thrift
oneway void clean_trash();

+ }
} // namespace doris
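
Because `clean_trash()` can arrive while the periodic garbage sweeper is already running, the change above guards `start_trash_sweep` with `_trash_sweep_lock` and a non-blocking `try_lock`. The following is a minimal standalone sketch of that pattern, not the Doris code itself: the `Sweeper` class and its members are hypothetical names used only for illustration.

```cpp
#include <atomic>
#include <mutex>

// Hypothetical stand-in for StorageEngine, reduced to the locking pattern.
class Sweeper {
public:
    // Returns true if this call performed a sweep, and false if another
    // sweep was already in progress and this call backed off.
    bool sweep() {
        // defer_lock: construct the guard without locking, then try once.
        std::unique_lock<std::mutex> l(_lock, std::defer_lock);
        if (!l.try_lock()) {
            return false;  // a sweep is running elsewhere; do not wait for it
        }
        ++_runs;  // placeholder for the real sweep work
        return true;
    }  // the lock is released here when `l` goes out of scope

    int runs() const { return _runs.load(); }

private:
    std::mutex _lock;
    std::atomic<int> _runs{0};
};
```

With this shape, a second caller returns immediately instead of queuing behind a long sweep, which matches the "trash and snapshot sweep is running." log line in the diff.
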
2 changes: 2 additions & 0 deletions be/src/service/backend_service.h
@@ -148,6 +148,8 @@ class BackendService : public BackendServiceIf {

virtual void get_stream_load_record(TStreamLoadRecordResult& result,
const int64_t last_stream_record_time) override;

+ virtual void clean_trash() override;

private:
Status start_plan_fragment_execution(const TExecPlanFragmentParams& exec_params);
2 changes: 2 additions & 0 deletions docs/.vuepress/sidebar/en.js
@@ -154,6 +154,7 @@ module.exports = [
directoryPath: "operation/",
children: [
"doris-error-code",
+ "disk-capacity",
"metadata-operation",
"monitor-alert",
"multi-tenant",
@@ -436,6 +437,7 @@ module.exports = [
directoryPath: "Administration/",
children: [
"ADMIN CANCEL REPAIR",
+ "ADMIN CLEAN TRASH",
"ADMIN CHECK TABLET",
"ADMIN REPAIR",
"ADMIN SET CONFIG",
1 change: 1 addition & 0 deletions docs/.vuepress/sidebar/zh-CN.js
@@ -441,6 +441,7 @@ module.exports = [
directoryPath: "Administration/",
children: [
"ADMIN CANCEL REPAIR",
+ "ADMIN CLEAN TRASH",
"ADMIN CHECK TABLET",
"ADMIN REPAIR",
"ADMIN SET CONFIG",
169 changes: 169 additions & 0 deletions docs/en/administrator-guide/operation/disk-capacity.md
@@ -0,0 +1,169 @@
---
{
"title": "Disk Capacity Management",
"language": "en"
}
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Disk Capacity Management

This document mainly introduces system parameters and processing strategies related to disk storage capacity.

If the capacity of Doris data disks is left unchecked, a full disk can cause the process to hang. Doris therefore monitors disk usage and remaining capacity, and restricts various operations at different warning levels to keep disks from filling up.

## Glossary

* FE: Doris Frontend Node. Responsible for metadata management and request access.
* BE: Doris Backend Node. Responsible for query execution and data storage.
* Data Dir: Data directory. Each data directory is specified in `storage_root_path` in the BE configuration file `be.conf`. A data directory usually corresponds to one disk, so **disk** below also refers to a data directory.

## Basic Principles

BE will report disk usage to FE on a regular basis (every minute). FE records these statistical values and restricts various operation requests based on these statistical values.

Two thresholds, **High Watermark** and **Flood Stage**, are set in FE. Flood Stage is higher than High Watermark. When disk usage exceeds High Watermark, Doris restricts certain operations (such as replica balancing). When it exceeds Flood Stage, certain operations (such as data loading) are prohibited.

A **Flood Stage** is also set on the BE. Because FE cannot always detect disk usage on BE in time and cannot control certain BE operations (such as Compaction), the BE uses its own Flood Stage to actively refuse or stop certain operations and protect itself.

## FE Parameter

**High Watermark:**

```
storage_high_watermark_usage_percent: default value is 85 (85%).
storage_min_left_capacity_bytes: default value is 2GB.
```

When disk usage is **more than** `storage_high_watermark_usage_percent`, **or** free disk capacity is **less than** `storage_min_left_capacity_bytes`, the disk will no longer be used as the destination path for the following operations:

* Tablet Balance
* Colocation Relocation
* Decommission

**Flood Stage:**

```
storage_flood_stage_usage_percent: default value is 95 (95%).
storage_flood_stage_left_capacity_bytes: default value is 1GB.
```

When disk usage is **more than** `storage_flood_stage_usage_percent`, **or** free disk capacity is **less than** `storage_flood_stage_left_capacity_bytes`, the disk will no longer be used as the destination path for the following operations:

* Tablet Balance
* Colocation Relocation
* Replica make up
* Restore
* Load/Insert
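
The two FE checks above can be sketched as follows. This is only an illustration of the "percentage **or** free bytes" rule, not Doris's FE implementation: the `DiskStat` struct, the constants, and the function names are hypothetical, and the threshold values are the defaults quoted in this section.

```cpp
#include <cstdint>

// Hypothetical snapshot of one data directory as reported by a BE.
struct DiskStat {
    int64_t total_bytes;      // total capacity of the disk
    int64_t available_bytes;  // free capacity of the disk
};

// Default values quoted in this document.
constexpr double kHighWatermarkPercent = 85.0;        // storage_high_watermark_usage_percent
constexpr int64_t kMinLeftBytes = 2LL << 30;          // storage_min_left_capacity_bytes (2GB)
constexpr double kFloodStagePercent = 95.0;           // storage_flood_stage_usage_percent
constexpr int64_t kFloodStageLeftBytes = 1LL << 30;   // storage_flood_stage_left_capacity_bytes (1GB)

double used_percent(const DiskStat& d) {
    return 100.0 * static_cast<double>(d.total_bytes - d.available_bytes) /
           static_cast<double>(d.total_bytes);
}

// Either condition alone is enough to exclude the disk as a destination.
bool exceeds_high_watermark(const DiskStat& d) {
    return used_percent(d) > kHighWatermarkPercent || d.available_bytes < kMinLeftBytes;
}

bool exceeds_flood_stage(const DiskStat& d) {
    return used_percent(d) > kFloodStagePercent || d.available_bytes < kFloodStageLeftBytes;
}
```

Under this rule a small disk can cross High Watermark on free bytes alone: a 10GB disk with 1.5GB free is below 2GB of free space even though its usage is only about 85%.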

## BE Parameter

**Flood Stage:**

```
capacity_used_percent_flood_stage: default value is 95 (95%).
capacity_min_left_bytes_flood_stage: default value is 1GB.
```

When disk usage is **more than** `storage_flood_stage_usage_percent`, **and** free disk capacity is **less than** `storage_flood_stage_left_capacity_bytes`, the following operations on this disk will be prohibited:

* Base/Cumulative Compaction
* Data load
* Clone Task (Usually occurs when the replica is repaired or balanced.)
* Push Task (Occurs during the Loading phase of a Hadoop import, when the file is downloaded.)
* Alter Task (Schema Change or Rollup Task.)
* Download Task (The Downloading phase of the recovery operation.)
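
Unlike the FE checks, the BE rule as stated above requires both conditions to hold. A minimal sketch of that rule, assuming the default values quoted in this section (the function and constant names are hypothetical and this is not the BE's actual code):

```cpp
#include <cstdint>

// Default values quoted in this document for the BE Flood Stage.
constexpr double kBeFloodStagePercent = 95.0;
constexpr int64_t kBeFloodStageLeftBytes = 1LL << 30;  // 1GB

// Returns true when the operations listed above would be refused on this disk:
// usage is above the percentage threshold AND free space is below the byte threshold.
bool be_reaches_flood_stage(int64_t total_bytes, int64_t available_bytes) {
    const double used_percent =
            100.0 * static_cast<double>(total_bytes - available_bytes) /
            static_cast<double>(total_bytes);
    return used_percent > kBeFloodStagePercent && available_bytes < kBeFloodStageLeftBytes;
}
```

One consequence of the **and** is that a large disk is not blocked on percentage alone: a 10TB disk at 96% usage still has about 400GB free, so this check returns false for it.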

## Disk Capacity Release

When disk usage exceeds High Watermark or even Flood Stage, many operations are prohibited. You can then reduce disk usage and restore the system in the following ways.

* Delete table or partition

By deleting tables or partitions, you can quickly reduce the disk space usage and restore the cluster.
**Note: Only the `DROP` operation quickly reduces disk space usage. The `DELETE` operation does not.**

```
DROP TABLE tbl;
ALTER TABLE tbl DROP PARTITION p1;
```

* BE expansion

After backend expansion, data tablets will be automatically balanced to BE nodes with lower disk usage. The expansion operation will make the cluster reach a balanced state in a few hours or days depending on the amount of data and the number of nodes.

* Modify replica of a table or partition

You can reduce the number of replicas of a table or partition. For example, the default 3 replicas can be reduced to 2. Although this reduces the reliability of the data, it quickly lowers disk usage and restores the cluster to normal.
This method is usually used for emergency recovery. After the cluster has recovered and you have lowered disk usage by expanding the cluster or deleting data, restore the number of replicas to 3.
Modifying the replica count takes effect immediately, and the backends automatically and asynchronously delete the redundant replicas.

```
ALTER TABLE tbl MODIFY PARTITION p1 SET("replication_num" = "2");
```

* Delete unnecessary files

When the BE has crashed because the disk is full and cannot be started (this can happen if FE or BE does not detect the condition in time), you need to delete some temporary files in the data directory so that the BE process can start.
Files in the following directories can be deleted directly:

* log/: Log files in the log directory.
* snapshot/: Snapshot files in the snapshot directory.
* trash/: Trash files in the trash directory.

**This operation will affect [Restore data from BE Recycle Bin](./tablet-restore-tool.md).**

If the BE can still be started, you can use `ADMIN CLEAN TRASH ON(BackendHost:BackendHeartBeatPort);` to actively clean up temporary files. **All trash files** and expired snapshot files will be cleaned up. **This will affect the operation of restoring data from the trash bin**.


If you do not manually execute `ADMIN CLEAN TRASH`, the system will still run the cleanup automatically within a few minutes to tens of minutes. There are two cases:
* If the disk usage has not reached 90% of the **Flood Stage**, only expired trash files and expired snapshot files are cleaned up. Some recent files are retained, so data recovery is not affected.
* If the disk usage has reached 90% of the **Flood Stage**, **all trash files** and expired snapshot files are cleaned up. **This will affect the operation of restoring data from the trash bin**.

The time interval for automatic execution can be changed with the configuration items `max_garbage_sweep_interval` and `min_garbage_sweep_interval`.

When the recovery fails due to lack of trash files, the following results may be returned:

```
{"status": "Fail","msg": "can find tablet path in trash"}
```

* Delete data file (dangerous!!!)

When none of the above operations can free up capacity, you need to delete data files to free up space. The data file is in the `data/` directory of the specified data directory. To delete a tablet, you must first ensure that at least one replica of the tablet is normal, otherwise **deleting the only replica will result in data loss**.

Suppose we want to delete the tablet with id 12345:

* Find the directory corresponding to the tablet, usually under `data/shard_id/tablet_id/`. For example:

```data/0/12345/```

* Record the tablet id and schema hash. The schema hash is the name of the next-level directory under the directory from the previous step. In the example below it is 352781111:

```data/0/12345/352781111```

* Delete the data directory:

```rm -rf data/0/12345/```

* Delete tablet metadata (refer to [Tablet metadata management tool](./tablet-meta-tool.md))

```./lib/meta_tool --operation=delete_header --root_path=/path/to/root_path --tablet_id=12345 --schema_hash=352781111```
@@ -0,0 +1,47 @@
---
{
"title": "ADMIN CLEAN TRASH",
"language": "en"
}
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# ADMIN CLEAN TRASH
## description
This statement is used to clean up the trash data on the backends.
Syntax:
ADMIN CLEAN TRASH [ON ("BackendHost1:BackendHeartBeatPort1", "BackendHost2:BackendHeartBeatPort2", ...)];

Explanation:
Use BackendHost:BackendHeartBeatPort to specify the backends to clean up. If the ON clause is omitted, all backends are cleaned up.

## example

1. Clean up the trash data of all BE nodes.

ADMIN CLEAN TRASH;

2. Clean up the trash data of '192.168.0.1:9050' and '192.168.0.2:9050'.

ADMIN CLEAN TRASH ON ("192.168.0.1:9050","192.168.0.2:9050");

## keyword
ADMIN, CLEAN, TRASH
15 changes: 15 additions & 0 deletions docs/zh-CN/administrator-guide/operation/disk-capacity.md
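@@ -127,6 +127,21 @@ capacity_min_left_bytes_flood_stage default 1GB.
* snapshot/: Snapshot files in the snapshot directory.
* trash/: Files in the trash directory.

**This operation will affect [Restore data from BE Recycle Bin](./tablet-restore-tool.md).**

If the BE can still be started, you can use `ADMIN CLEAN TRASH ON(BackendHost:BackendHeartBeatPort);` to actively clean up temporary files. **All** trash files and expired snapshot files will be cleaned up. **This will affect the operation of restoring data from the trash bin**.

If you do not manually execute `ADMIN CLEAN TRASH`, the system will still run the cleanup automatically within a few minutes to tens of minutes. There are two cases:
* If the disk usage has not reached 90% of the **Flood Stage**, only expired trash files and expired snapshot files are cleaned up. Some recent files are retained, so data recovery is not affected.
* If the disk usage has reached 90% of the **Flood Stage**, **all** trash files and expired snapshot files are cleaned up. **This will affect the operation of restoring data from the trash bin**.
The time interval for automatic execution can be changed with the configuration items `max_garbage_sweep_interval` and `min_garbage_sweep_interval`.

When recovery fails because trash files are missing, the following result may be returned: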

```
{"status": "Fail","msg": "can find tablet path in trash"}
```

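* Delete data file (dangerous!!!)

When none of the above operations can free up space, you need to delete data files to free up space. Data files are in the `data/` directory of the specified data directory. Before deleting a tablet, you must make sure that at least one replica of the tablet is healthy; otherwise **deleting the only replica will result in data loss**. Suppose we want to delete the tablet with id 12345: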