Optimize the load performance for large file #1798

morningman · 2019-09-12T08:38:09Z

morningman · 2019-09-12T08:39:26Z

Separate TabletsChannel from TabletWriterMgr to avoid conflicting definitions between gutil and brpc

kangkaisen · 2019-09-12T08:52:09Z

be/src/common/config.h

+    // the size of queue for saving immutable memtables.
+    // set this size larger may reduce the time of waiting memtable flush,
+    // but will increase memory usage of loading.
+    // CONF_Int32(memtable_queue_size, "1");


unused config?

I will remove it

kangkaisen · 2019-09-12T08:56:52Z

be/src/olap/delta_writer.h

    bool _delta_written_success;
+    std::atomic<OLAPStatus> _flush_status;
+    int64_t _flush_cost_ns;
+    int64_t _flush_time;


rename _flush_cost_ns to _flush_time_ns?
rename _flush_time to _flush_count?

kangkaisen · 2019-09-12T09:02:43Z

be/src/runtime/tablets_channel.h

+    void _flush_memtable();
+
+private:
+    // id of this load channel, just for 


kangkaisen · 2019-09-12T09:18:22Z

be/src/runtime/tablets_channel.h

+    // the size of flush queue equals to the number of tablets.
+    // so that each tablet has at least one rotational memtable.
+    // and the over all mem usage is at most 2 times of total memtable's size
+    BlockingQueue<std::shared_ptr<MemTable>> _flush_queue;


Data is not uniformly distributed across tablets. An extreme example: there are 10 tablets and tablet A holds ninety percent of the data, so tablet A's data will flush very slowly. So I think there may be two or even more of tablet A's MemTables in flush_queue at the same time.

So "the over all mem usage is at most 2 times of total memtable's size" will not always be right.

First of all, all memtables have the same max size (default is 100MB).
Suppose there are 5 tablets. Even if tablet A holds all the data, there are at most 5 memtables (all of them tablet A's) in the queue. Adding the memtables that receive incoming data, there are at most 10 memtables.

OK, you are right. I forgot that all memtables have the same max size (default is 100MB).
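
The bound agreed on in this exchange can be checked with a little arithmetic. The sketch below assumes, as the reviewed comment states, that the flush queue capacity equals the number of tablets and that each tablet keeps one active memtable for incoming data. The function names are hypothetical and do not appear in the PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: upper bound on live memtables for one load channel.
// Assumes the flush queue holds at most `num_tablets` entries and that each
// tablet owns exactly one active memtable for incoming rows.
inline int64_t max_live_memtables(int64_t num_tablets) {
    assert(num_tablets >= 0);
    const int64_t queued = num_tablets;  // queue is full, whichever tablets own the entries
    const int64_t active = num_tablets;  // one memtable per tablet for incoming data
    return queued + active;
}

// Upper bound on memory in bytes, given a common per-memtable size cap.
inline int64_t max_load_memory_bytes(int64_t num_tablets, int64_t memtable_cap_bytes) {
    return max_live_memtables(num_tablets) * memtable_cap_bytes;
}
```

With 5 tablets this gives 10 memtables, matching the figure in the reply, regardless of how skewed the data is.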

imay · 2019-09-12T15:43:20Z

be/src/olap/delta_writer.cpp

+      _delta_written_success(false), _flush_status(OLAP_SUCCESS),
+      _flush_queue(flush_queue) {
+
+    _mem_table.reset();


imay · 2019-09-12T15:44:55Z

be/src/olap/delta_writer.cpp

+    // and create a new memtable for incoming data
    if (_mem_table->memory_usage() >= config::write_buffer_size) {
-        RETURN_NOT_OK(_mem_table->flush(_rowset_writer.get()));
+        if (_flush_status.load() != OLAP_SUCCESS) {


RETURN_NOT_OK?

imay · 2019-09-12T15:59:16Z

be/src/olap/delta_writer.h

+    int64_t _flush_count;
+
+    // queue for saving immable mem tables
+    BlockingQueue<std::shared_ptr<MemTable>>* _flush_queue;


I think it's better to have an executor, which can be system-wide or load-wide. When DeltaWriter submits a flush request to the executor, the executor returns a future to it. If you do it that way, you don't need to modify TabletsChannel, and TabletsChannel doesn't call MemTable flush. The MemTable flush operation is part of the StorageEngine implementation, so it is a little weird to put it outside of StorageEngine.

Let me think about it
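
A minimal sketch of the executor-returns-a-future idea, for illustration only. It assumes a single worker thread and uses an `int` as a stand-in for OLAPStatus; `SketchFlushExecutor` and `submit` are hypothetical names, not the API that was eventually merged.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical stand-in for OLAPStatus; 0 means success.
using Status = int;

// One worker thread drains a task queue; submit() hands back a future.
class SketchFlushExecutor {
public:
    SketchFlushExecutor() : _worker([this] { _run(); }) {}

    ~SketchFlushExecutor() {
        {
            std::lock_guard<std::mutex> l(_mutex);
            _stopped = true;
        }
        _cv.notify_one();
        _worker.join();  // remaining tasks are drained before the worker exits
    }

    // Enqueue a flush operation and return a future for its status.
    std::future<Status> submit(std::function<Status()> flush_fn) {
        std::packaged_task<Status()> task(std::move(flush_fn));
        std::future<Status> result = task.get_future();
        {
            std::lock_guard<std::mutex> l(_mutex);
            assert(!_stopped);
            _tasks.push(std::move(task));
        }
        _cv.notify_one();
        return result;
    }

private:
    void _run() {
        for (;;) {
            std::packaged_task<Status()> task;
            {
                std::unique_lock<std::mutex> l(_mutex);
                _cv.wait(l, [this] { return _stopped || !_tasks.empty(); });
                if (_tasks.empty()) return;  // stopped and nothing left to do
                task = std::move(_tasks.front());
                _tasks.pop();
            }
            task();  // runs outside the lock; fulfils the future
        }
    }

    std::mutex _mutex;
    std::condition_variable _cv;
    std::queue<std::packaged_task<Status()>> _tasks;
    bool _stopped = false;
    std::thread _worker;  // declared last so the other members exist before it starts
};
```

In this shape the caller only holds futures, so the flush logic itself can stay inside the storage layer.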

imay · 2019-09-18T01:55:00Z

be/src/runtime/exec_env_init.cpp

    _stream_load_executor = new StreamLoadExecutor(this);
    _routine_load_task_executor = new RoutineLoadTaskExecutor(this);
    _small_file_mgr = new SmallFileMgr(this, config::small_file_dir);
+    _memtable_flush_executor = new MemTableFlushExecutor(this);


This flush executor should be included in StorageEngine rather than here.

imay · 2019-09-18T02:08:34Z

be/src/olap/delta_writer.cpp

-                              _req.tuple_desc, _tablet->keys_type());
+    _mem_table = std::make_shared<MemTable>(_tablet->tablet_id(), _schema, _tablet_schema, _req.slots, _req.tuple_desc, _tablet->keys_type());
+
+    _flush_queue_idx = _flush_executor->get_queue_idx(_tablet->data_dir()->path_hash());


Why are you exposing this queue to the outside?

Because, for now, each DeltaWriter must always push its memtables to the same queue.

imay · 2019-09-18T02:25:43Z

be/src/runtime/memtable_flush_executor.h

+struct MemTableFlushContext {
+    std::shared_ptr<MemTable> memtable;
+    DeltaWriter* delta_writer;
+    std::atomic<OLAPStatus>* flush_status;


DeltaWriter will call MemTableFlushExecutor, but you put DeltaWriter in the context, which creates a circular dependency.

imay · 2019-09-18T02:32:01Z

be/src/runtime/tablet_writer_mgr.cpp

 namespace doris {

-// channel that process all data for this load
-class TabletsChannel {


If it is not necessary to modify them, please don't change these files, to avoid introducing new problems.

imay · 2019-09-19T09:41:47Z

be/src/common/config.h

    // The min bytes that should be left of a data dir
    CONF_Int64(storage_flood_stage_left_capacity_bytes, "1073741824")   // 1GB
+    // number of thread for flushing memtable per data dir
+    CONF_Int32(flush_thread_num_per_dir, "2");


Suggested change

CONF_Int32(flush_thread_num_per_dir, "2");

CONF_Int32(flush_thread_num_per_store, "2");

imay · 2019-09-19T09:42:16Z

be/src/olap/delta_writer.cpp

+      _delta_written_success(false), _flush_status(OLAP_SUCCESS),
+      _flush_queue(flush_queue) {
+
+    _mem_table.reset();


imay · 2019-09-19T09:47:04Z

be/src/olap/delta_writer.h

+    // if the last memtable is flushed, all previous memtables should already be flushed.
+    // so we only need to wait and block on the last memtable's flush future.
+    std::future<OLAPStatus> _flush_future;
+#endif


remove this?

imay · 2019-09-19T09:59:47Z

be/src/runtime/tablets_channel.h

+private:
+    // id of this load channel
+    TabletsChannelKey _key;
+    MemTableFlushExecutor* _flush_executor;


Better not to store this in TabletsChannel; it is internal to the storage engine.

imay · 2019-09-19T10:30:05Z

be/src/runtime/memtable_flush_executor.h

+    int32_t get_queue_idx(size_t path_hash);
+
+    // push the memtable to specified flush queue, and return a future
+    std::future<OLAPStatus> push_memtable(int32_t queue_idx, const MemTableFlushContext& ctx);


Better to assign the queue inside this class, not outside it.

imay · 2019-09-19T10:34:17Z

be/src/runtime/memtable_flush_executor.cpp

@@ -0,0 +1,128 @@
+// Licensed to the Apache Software Foundation (ASF) under one


Do we have two executors?

imay · 2019-09-21T05:39:37Z

be/src/olap/delta_writer.cpp

-        SAFE_DELETE(_mem_table);
-        _mem_table = new MemTable(_schema, _tablet_schema, _req.slots,
-                                  _req.tuple_desc, _tablet->keys_type());
+OLAPStatus DeltaWriter::_check_flush_futures(bool block) {


Why not put this logic in _flush_handler? DeltaWriter can then check the status of _flush_handler.
The executor can update the flush handler's status through a callback. That way, there is no need to poll the status of each submitted MemTable.

In your implementation, you poll the status at a 1-microsecond interval. When waiting for the last MemTable, block is true, so you will loop up to 10^6 times per second, which is a waste of CPU.

I think the following interface is enough for the current usage.

class FlusherClient {
public:
    // submit a memtable to flush. return error if some previous submitted MemTable has failed
    Status submit();
    // wait for all submitted memtable finished.
    Status wait();
    // Get flush operations' statistics
    Statistics get_stats();
    // called when a memtable is finished by executor.
    void on_flush_finished();
};
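
One way the interface suggested above could avoid polling is sketched here. It is only an illustration under stated assumptions: `Status` is an `int` stand-in, `on_flush_finished` takes the flush result as a parameter (the suggestion shows no parameter), `get_stats()` is omitted, and handing the memtable to an executor is left out so the blocking behaviour stays visible.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the Status type; 0 means success.
using Status = int;

class SketchFlusherClient {
public:
    // Register one more in-flight flush. Fails fast if an earlier flush failed.
    Status submit() {
        std::lock_guard<std::mutex> l(_mutex);
        if (_first_error != 0) return _first_error;
        ++_pending;
        return 0;
    }

    // Called by the executor when one flush finishes (assumed signature).
    void on_flush_finished(Status st) {
        {
            std::lock_guard<std::mutex> l(_mutex);
            assert(_pending > 0);
            if (st != 0 && _first_error == 0) _first_error = st;
            --_pending;
        }
        _cv.notify_all();
    }

    // Block until every submitted flush has finished. The thread sleeps on
    // the condition variable instead of re-checking on a timer.
    Status wait() {
        std::unique_lock<std::mutex> l(_mutex);
        _cv.wait(l, [this] { return _pending == 0; });
        return _first_error;
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    int _pending = 0;
    Status _first_error = 0;
};
```

The waiter wakes only when a callback fires, so the cost of waiting no longer depends on a polling interval.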

imay · 2019-09-23T11:20:40Z

be/src/olap/memtable_flush_executor.h

+//      ...
+//      FlushHandler* flush_handler;
+//      memTableFlushExecutor.create_flush_handler(path_hash, &flush_handler);
+//      std::shared_ptr<FlushHandler> shared_handler(flush_handler);


If the client is expected to wrap the returned pointer in a shared_ptr, why not declare the argument as std::shared_ptr<>*?
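
A small sketch of what that suggestion could look like. The types here are hypothetical placeholders (`SketchFlushHandler`, `SketchHandlerFactory`, an `int` status); only the out-parameter shape is the point.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-ins; 0 means success.
using Status = int;
struct SketchFlushHandler {
    std::size_t path_hash;
};

class SketchHandlerFactory {
public:
    // The out-parameter is already a shared_ptr, so ownership is explicit
    // and the caller never touches a raw pointer.
    Status create_flush_handler(std::size_t path_hash,
                                std::shared_ptr<SketchFlushHandler>* handler) {
        assert(handler != nullptr);
        *handler = std::make_shared<SketchFlushHandler>(SketchFlushHandler{path_hash});
        return 0;
    }
};
```

Compared with returning a raw pointer, the caller cannot forget the wrapping step or wrap the same pointer twice.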

imay · 2019-09-23T11:24:18Z

be/src/olap/memtable_flush_executor.h

+    // wait for all submitted memtable finished.
+    OLAPStatus wait();
+    // get flush operations' statistics
+    const FlushStatistic& get_stats();


Suggested change

const FlushStatistic& get_stats();

const FlushStatistic& get_stats() const;

imay · 2019-09-23T11:42:35Z

be/src/olap/memtable_flush_executor.cpp

+    MemTableFlushContext ctx;
+    ctx.memtable = memtable;
+    ctx.flush_handler = this->shared_from_this();
+    _flush_futures.push(_flush_executor->_push_memtable(_flush_queue_idx, ctx));


Either a future or a callback is enough. There is no need for both, which only makes it more complex.
In this case, what you need is a CountDownLatch, updated in the callback.

imay · 2019-09-23T11:52:11Z

be/src/olap/memtable_flush_executor.cpp

+        }
+
+        // if last flush of this tablet already failed, just skip
+        if (ctx.flush_handler->last_flush_status() != OLAP_SUCCESS) {


If you call last_flush_status() here, it means the executor knows details of the flush handler.
There are other options:

The executor just runs the task and invokes the callback, and the flusher adds more tasks in the callback function.

Make the context an interface with an abstract function get_state() that tells whether it should be scheduled.

imay · 2019-09-23T11:52:47Z

be/src/olap/memtable_flush_executor.cpp

+            std::lock_guard<SpinLock> l(_lock);
+            _flush_promises[ctx.flush_id].set_value(res.flush_status);
+            _flush_promises.erase(ctx.flush_id);
+        }


I think you can remove this block

It is just there for easier reading.

imay · 2019-09-24T11:19:00Z

be/src/util/counter_cond_variable.hpp

+// waiter:
+//      one or more waiter call xxx_wait() to wait until all or at least one tasks are finished.
+class CounterCondVariable {
+    public:


imay · 2019-09-24T11:19:29Z

be/src/util/semaphore.hpp

            std::unique_lock<std::mutex> lock(_mutex);
-            ++count_;
-            cv_.notify_one();
+            ++_count;


imay · 2019-09-24T11:20:11Z

be/src/util/counter_cond_variable.hpp

+// worker:
+//      one or more workers do the task and call dec_count() after finishing the task
+// waiter:
+//      one or more waiter call xxx_wait() to wait until all or at least one tasks are finished.


Better to give a typical usage example.
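
A typical usage might look like the following sketch. It is not the class from the PR: `dec_count()` and `block_wait()` follow the names quoted in the diff, while `inc_count()`, the constructor, and `run_tasks()` are assumptions made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Simplified counter guarded by a mutex and a condition variable.
class SketchCounterCondVariable {
public:
    explicit SketchCounterCondVariable(int init = 0) : _count(init) {}

    // Register n outstanding tasks (assumed method name).
    void inc_count(int n = 1) {
        std::lock_guard<std::mutex> l(_lock);
        _count += n;
    }

    // Worker side: report n finished tasks and wake any waiter.
    void dec_count(int n = 1) {
        std::lock_guard<std::mutex> l(_lock);
        _count -= n;
        _cv.notify_all();
    }

    // Waiter side: sleep until the count has dropped to zero.
    void block_wait() {
        std::unique_lock<std::mutex> l(_lock);
        _cv.wait(l, [this] { return _count <= 0; });
    }

private:
    std::mutex _lock;
    std::condition_variable _cv;
    int _count;
};

// Usage: register the tasks, let workers finish them, wait for all of them.
inline int run_tasks(int num_tasks) {
    assert(num_tasks >= 0);
    SketchCounterCondVariable pending;
    std::atomic<int> done{0};
    std::vector<std::thread> workers;
    pending.inc_count(num_tasks);  // count is set before any worker starts
    for (int i = 0; i < num_tasks; ++i) {
        workers.emplace_back([&pending, &done] {
            done.fetch_add(1);    // the "task"
            pending.dec_count();  // report completion
        });
    }
    pending.block_wait();  // returns once every worker has reported
    const int finished = done.load();
    for (auto& t : workers) t.join();
    return finished;
}
```

The count is raised before the workers start, so the waiter cannot observe zero before all tasks have been registered.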

imay · 2019-09-24T11:21:22Z

be/src/util/counter_cond_variable.hpp

+        // wait until count down to zero
+        void block_wait() {
+            std::unique_lock<std::mutex> lock(_lock);
+            _cv.wait(lock, [=] { return _count <= 0; });


Suggested change

_cv.wait(lock, [=] { return _count <= 0; });

_cv.wait(lock, [this] { return _count <= 0; });

only capture what you need

imay · 2019-09-24T11:26:10Z

be/src/util/counter_cond_variable.hpp

+        // wait if count larger than 0
+        // and after being notified, return true if count down zo zero,
+        // or return false other wise.
+        bool check_wait() {


I think a dec_to_zero function is better than this interface in this case.
And dec_to_zero is more general.

imay · 2019-09-24T11:39:11Z

be/src/olap/delta_writer.cpp

    _delta_written_success = true;
+
+    const FlushStatistic& stat = _flush_handler->get_stats();
+    LOG(INFO) << "close delta writer for tablet: " << _tablet->tablet_id()


Do we need this INFO log? It would create many entries in our info log files.

This log is already running in production and does not write too much.
It's better to keep INFO in this version for checking. I will remove it in the next version if it becomes a problem.

imay · 2019-09-24T14:03:16Z

be/src/olap/memtable.h

 #ifndef DORIS_BE_SRC_OLAP_MEMTABLE_H
 #define DORIS_BE_SRC_OLAP_MEMTABLE_H

+#include <future>


imay · 2019-09-24T14:06:32Z

be/src/olap/memtable_flush_executor.cpp

+
+        // if last flush of this tablet already failed, just skip
+        if (ctx.flush_handler->is_cancelled()) {
+            continue;


imay · 2019-09-24T14:10:36Z

be/src/olap/memtable_flush_executor.cpp

+
+namespace doris {
+
+OLAPStatus FlushHandler::submit(std::shared_ptr<MemTable> memtable) {


Do we need to limit the number of memtables submitted to the executor?
I'm afraid that one big ingest will block other ingests.

Too many memtables in the flush queue would take too much memory.
And the size of a memtable is no more than 100MB, so there will not be a big ingest.

imay

LGTM

kangkaisen reviewed Sep 12, 2019

imay requested changes Sep 12, 2019

morningman force-pushed the optimize_tablet_channel branch from bb0dbc7 to cd4ffd0 Compare September 16, 2019 00:56

imay reviewed Sep 18, 2019

morningman closed this Sep 18, 2019

morningman reopened this Sep 18, 2019

imay reviewed Sep 19, 2019

imay reviewed Sep 21, 2019

morningman force-pushed the optimize_tablet_channel branch from 4348e78 to f3c4967 Compare September 23, 2019 01:05

imay reviewed Sep 23, 2019

morningman and others added 5 commits September 23, 2019 22:54

Optimize the load performance for large file

513d3b4

fix review by kks

8d9ec65

Add memtable flush executor

0293162

fix by review

ffb3730

fix by zc review 2

0d97379

morningman force-pushed the optimize_tablet_channel branch from 25aff96 to 0d97379 Compare September 23, 2019 14:55

add counter cond variable

1235497

imay reviewed Sep 24, 2019

fix by review

a732ffd

imay reviewed Sep 24, 2019

add on_flush_cancelled callback

91b3ac7

imay approved these changes Sep 25, 2019

morningman merged commit c643cbd into apache:master Sep 25, 2019

imay mentioned this pull request Sep 26, 2019

Release Notes 0.11.0 #1891

Closed

swjtu-zhanglei pushed a commit to swjtu-zhanglei/incubator-doris that referenced this pull request Jul 25, 2023

(selectdb-cloud) Fix cluster and instance regression case failed (apache#1798)

5fe7870
