
[Routine Load][BE] Too many VersionCount caused by wrong compaction failure time  #4551

@caoyang10

Description


I started a routine load job with the following settings in my BE config:

cumulative_compaction_num_threads_per_disk=10
base_compaction_num_threads_per_disk=5
push_write_mbytes_per_sec=100
cumulative_compaction_check_interval_seconds=1
cumulative_compaction_skip_window_seconds=5

The job properties are set to:
(
"desired_concurrent_number"="25",
"max_batch_interval" = "60",
"max_batch_rows" = "1000000000",
"max_batch_size" = "1073741824",
"strict_mode" = "false",
"format" = "json",
"strip_outer_array" = "false"
)

The partitions are divided by an "HOUR" time unit.

I have 25 BE nodes and the throughput is almost 300K messages per second (data size: 700 MB/s).
Then I find that the performance of queries over recent time (or now) is terrible.
For example, now is 2020-09-07 14:30:
a query like timestamp >= "2020-09-07 14:00" and timestamp < "2020-09-07 15:00" costs 4 sec,
while a query like timestamp >= "2020-09-07 13:00" and timestamp < "2020-09-07 14:00" costs 0.4 sec.

I notice that the VersionCount is almost 300+, while it is usually 2-30. It means some recent data has not been compacted. I increased cumulative_compaction_num_threads_per_disk, but it does not help.

I reviewed the compaction code (src/olap/tablet_manager.cpp:715):

            int64_t last_failure_ms = tablet_ptr->last_cumu_compaction_failure_time();
            if (compaction_type == CompactionType::BASE_COMPACTION) {
                last_failure_ms = tablet_ptr->last_base_compaction_failure_time();
            }
            if (now_ms - last_failure_ms <= config::min_compaction_failure_interval_sec * 1000) {
                VLOG(1) << "Too often to check compaction, skip it."
                        << "compaction_type=" << compaction_type_str
                        << ", last_failure_time_ms=" << last_failure_ms
                        << ", tablet_id=" << tablet_ptr->tablet_id();
                continue;
            } 

It means a tablet is not compacted while its last failure time is too close to now.
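In isolation, that skip rule behaves like the sketch below (hypothetical helper name, constant hard-coded only for illustration): a tablet whose last recorded failure is within min_compaction_failure_interval_sec of now is skipped, while a failure time of 0 always lets it through.

#include <cstdint>

// Default value of min_compaction_failure_interval_sec, hard-coded here for illustration.
static constexpr int64_t kMinCompactionFailureIntervalSec = 600;

// Hypothetical helper mirroring the check in tablet_manager.cpp: returns true when
// the tablet should be skipped because its last failure is too recent.
bool should_skip_compaction(int64_t now_ms, int64_t last_failure_ms) {
    // last_failure_ms == 0 (set after a successful compaction) makes the difference
    // huge, so the tablet is always eligible again.
    return now_ms - last_failure_ms <= kMinCompactionFailureIntervalSec * 1000;
}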
The last failure time is set in src/olap/storage_engine.cpp (line 557 _perform_cumulative_compaction, line 593 _perform_base_compaction):

OLAPStatus res = cumulative_compaction.compact();
if (res != OLAP_SUCCESS) {
    best_tablet->set_last_cumu_compaction_failure_time(UnixMillis());
    if (res != OLAP_ERR_CUMULATIVE_NO_SUITABLE_VERSIONS) {
        DorisMetrics::instance()->cumulative_compaction_request_failed.increment(1);
        LOG(WARNING) << "failed to do cumulative compaction. res=" << res
                    << ", table=" << best_tablet->full_name();
    }
    return;
}
best_tablet->set_last_cumu_compaction_failure_time(0);

It means that when compaction succeeds, the last failure time is set to 0; when compaction fails, it is set to the current time instead.
I don't know what error actually occurs while compacting, or whether it is truly a failure. Either way, the last failure time is set to now and the tablet will not be compacted during the next min_compaction_failure_interval_sec seconds (default value 600). So more and more routine load data piles up on the BE and queries become slow.
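To make the effect concrete, here is a hedged back-of-envelope sketch (the one-commit-every-2-seconds rate per tablet is my assumption, not a measured number): a single failed compaction blocks the tablet for the full 600-second window while routine load keeps adding rowset versions, which is enough to explain a VersionCount in the 300+ range.

#include <cstdint>
#include <cstdio>

int main() {
    const int64_t kIntervalMs = 600 * 1000;   // default min_compaction_failure_interval_sec * 1000
    const int64_t kCommitEveryMs = 2 * 1000;  // assumed routine-load commit rate for one tablet

    const int64_t failure_ms = 1000;  // one failed compaction at t = 1s stamps the failure time
    int64_t version_count = 30;       // a "normal" VersionCount, per the report above

    int64_t now_ms = failure_ms;
    // Same check as in tablet_manager.cpp: the tablet is skipped while inside the window.
    while (now_ms - failure_ms <= kIntervalMs) {
        ++version_count;           // routine load keeps committing rowsets, nothing gets merged
        now_ms += kCommitEveryMs;  // next time the scheduler (and the load) comes around
    }
    // ~330 versions by the time the tablet becomes eligible for compaction again.
    std::printf("VersionCount when compaction resumes: %lld\n", (long long)version_count);
    return 0;
}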

A simple way to solve the problem is to distinguish between the compaction result codes and set the correct last failure time.
Just change the code as follows:

OLAPStatus res = cumulative_compaction.compact();
if (res != OLAP_SUCCESS) {
    if (res == OLAP_ERR_BE_TRY_BE_LOCK_ERROR) {
        best_tablet->set_last_cumu_compaction_failure_time(UnixMillis());
    } else {
        best_tablet->set_last_cumu_compaction_failure_time(0);
    }
    if (res != OLAP_ERR_CUMULATIVE_NO_SUITABLE_VERSIONS) {
        DorisMetrics::instance()->cumulative_compaction_request_failed.increment(1);
        LOG(WARNING) << "failed to do cumulative compaction. res=" << res
                    << ", table=" << best_tablet->full_name();
    }
    return;
}
best_tablet->set_last_cumu_compaction_failure_time(0);

I think that whatever the status of the thread that owns the compaction lock is, the last compaction failure time needs to be set to 0 so that the tablet can be scheduled next time.
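Expressed as a pure function over the result code (hypothetical names, only to make the policy explicit; the real code works on OLAPStatus and the Tablet object), the proposal and the open question look like this:

#include <cstdint>

// Hypothetical result enum standing in for the relevant OLAPStatus values.
enum class CompactResult { kSuccess, kTryLockError, kNoSuitableVersions, kOtherError };

// Value to store as last_cumu_compaction_failure_time: 0 keeps the tablet
// schedulable on the next round, now_ms blocks it for min_compaction_failure_interval_sec.
int64_t failure_time_to_record(CompactResult res, int64_t now_ms) {
    switch (res) {
        case CompactResult::kTryLockError:
            // The proposed patch records now_ms here; the open question above is
            // whether this case should also return 0.
            return now_ms;
        default:
            return 0;  // success and all other errors: do not block the tablet
    }
}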

Any suggestions?
