Fix bugs of Broker load #1546

morningman · 2019-07-25T03:35:33Z

Use the same UUID as the query ID and the load ID of a load execution plan.
Each load execution plan has a load ID and, being a plan, also a query ID.
Using the same UUID for both makes the load process easier to trace.
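A minimal C++ sketch of the idea, assuming a hypothetical `UniqueId` stand-in for the Thrift `TUniqueId` (two 64-bit halves) and a hypothetical `LoadPlanParams` holder. The real ID assignment happens in the Java FE; this only illustrates generating one random 128-bit value and using it for both IDs.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical stand-in for the Thrift TUniqueId: a 128-bit id stored as two 64-bit halves.
struct UniqueId {
    int64_t hi = 0;
    int64_t lo = 0;
    bool operator==(const UniqueId& other) const { return hi == other.hi && lo == other.lo; }
    bool operator!=(const UniqueId& other) const { return !(*this == other); }
};

// Draws a random 128-bit value, similar in spirit to the random bits of a UUID.
UniqueId generate_unique_id() {
    static std::mt19937_64 rng{std::random_device{}()};
    UniqueId id;
    id.hi = static_cast<int64_t>(rng());
    id.lo = static_cast<int64_t>(rng());
    return id;
}

// Hypothetical parameters of one load execution plan.
struct LoadPlanParams {
    UniqueId query_id;
    UniqueId load_id;
};

// One id serves as both the query id and the load id, so a single value
// can be searched for across FE and BE logs.
LoadPlanParams make_load_plan_params() {
    const UniqueId id = generate_unique_id();
    LoadPlanParams params;
    params.query_id = id;
    params.load_id = id;
    return params;
}
```

With one shared value, grepping BE logs for the query ID also finds every tablet-writer log line keyed by the load ID.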
Change the load ID when retrying a load execution plan.
When a load execution plan is retried, its load ID must change; otherwise the BE cannot
distinguish the old load requests from the new ones.
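To show why reusing the load ID is a problem, here is a sketch of a BE-side registry that keeps one channel per load ID. `TabletsChannel` and `TabletsChannelMgr` are hypothetical simplifications, not the actual BE classes: a retry that reuses the failed attempt's ID is handed the old channel and its stale state, while a fresh ID gets a clean one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>

// Hypothetical 128-bit load id, ordered so it can be a std::map key.
struct UniqueId {
    int64_t hi;
    int64_t lo;
    bool operator<(const UniqueId& other) const {
        return hi != other.hi ? hi < other.hi : lo < other.lo;
    }
};

// Hypothetical per-load state on a BE, e.g. open tablet writers and progress.
struct TabletsChannel {
    int64_t rows_written = 0;
};

// Hypothetical manager holding one channel per load id.
class TabletsChannelMgr {
public:
    // Returns the existing channel for load_id, or creates a new one.
    std::shared_ptr<TabletsChannel> open(const UniqueId& load_id) {
        auto it = channels_.find(load_id);
        if (it != channels_.end()) {
            return it->second;  // same id: the caller gets the old attempt's state
        }
        auto channel = std::make_shared<TabletsChannel>();
        channels_.emplace(load_id, channel);
        return channel;
    }
    std::size_t size() const { return channels_.size(); }

private:
    std::map<UniqueId, std::shared_ptr<TabletsChannel>> channels_;
};
```

Usage: after `mgr.open({1, 1})->rows_written = 100;`, a retry that calls `mgr.open({1, 1})` again still sees 100 rows, whereas `mgr.open({1, 2})` starts from zero.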
Cancel the running loading task when cancelling the broker load.
When a user cancels a broker load, the running loading task should also be cancelled; otherwise
it may occupy a worker thread for a long time.
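One common way to make a long-running task give its worker thread back promptly is cooperative cancellation: the task checks an atomic flag between units of work. The sketch below assumes a hypothetical `LoadingTask` that processes batches. It illustrates the technique only and is not the FE's actual task class, which is written in Java.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical loading task that checks a cancel flag between batches,
// so cancelling the job frees the worker thread quickly.
class LoadingTask {
public:
    // Processes up to total_batches batches and returns how many were processed.
    int run(int total_batches) {
        int done = 0;
        for (; done < total_batches; ++done) {
            if (cancelled_.load(std::memory_order_acquire)) {
                break;  // stop early: the job was cancelled
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(1));  // simulated batch work
        }
        return done;
    }

    void cancel() { cancelled_.store(true, std::memory_order_release); }
    bool is_cancelled() const { return cancelled_.load(std::memory_order_acquire); }

private:
    std::atomic<bool> cancelled_{false};
};
```

Usage: run `task.run(n)` on a worker thread and call `task.cancel()` from the thread handling the user's cancel request; the worker returns after at most one more batch.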
Remove unnecessary query reports when executing a load execution plan.
Only the last query report is needed.
Add a new BE config tablet_writer_rpc_timeout_sec.
It is the timeout of the tablet sink's RPC. The default is 600 seconds, which is long enough to flush
about 6 GB of data. A longer timeout reduces the chance of hitting the "fail to send batch" error during loading.
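Taken together, the 600-second default and the "about 6 GB" figure imply an assumed flush throughput of roughly 10 MB/s. This rate is an inference from those two numbers, not a measured value:

```latex
r \approx \frac{6\ \text{GB}}{600\ \text{s}} = \frac{6 \times 1024\ \text{MB}}{600\ \text{s}} \approx 10\ \text{MB/s}
```

If a BE flushes more slowly than this, the same arithmetic suggests raising tablet_writer_rpc_timeout_sec proportionally.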
Use streaming_load_max_mb instead of mini_load_max_mb in the BE config.
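A minimal sketch of a body-size check driven by a single MB-denominated limit. The global below is a hypothetical stand-in for the BE config value, its value of 100 is illustrative only (the real default is not assumed), and `body_size_allowed` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the BE config streaming_load_max_mb; illustrative value only.
static int64_t streaming_load_max_mb = 100;

// Returns true if a load body of body_bytes fits within the configured limit.
bool body_size_allowed(int64_t body_bytes) {
    const int64_t max_bytes = streaming_load_max_mb * 1024 * 1024;  // MB -> bytes
    return body_bytes <= max_bytes;
}
```

Keeping one limit means every load path that performs this check rejects or accepts the same body size.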
Add more logs to make a broker load process easier to trace.

When a broker load task fails and is retried, the load ID should be changed; otherwise the retry task uses the same tablets channel on the BE, which may cause problems. Furthermore, I use the same TUniqueId for the query ID and load ID of a load plan, to make the load process easier to trace. I also added a BE config 'tablet_writer_rpc_timeout_sec' to configure the timeout of the add_batch RPC in the load process.

imay · 2019-07-25T04:34:06Z

be/src/exec/tablet_sink.h

    // BE id -> execution time of add batch in us
    std::unordered_map<int64_t, int64_t> _node_add_batch_time_map;
+    std::unordered_map<int64_t, int64_t> _node_add_batch_wait_time_map;
+    std::unordered_map<int64_t, int64_t> _node_add_batch_num_map;


use a struct to enclose these
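The reviewer's suggestion can be sketched as one struct per BE instead of three parallel maps. `AddBatchCounter`, its field names, and `record_add_batch` are hypothetical names chosen for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Per-BE add_batch statistics, replacing the three parallel maps
// _node_add_batch_time_map, _node_add_batch_wait_time_map and _node_add_batch_num_map.
struct AddBatchCounter {
    int64_t add_batch_execution_time_us = 0;  // time spent executing add_batch, in us
    int64_t add_batch_wait_time_us = 0;       // time spent waiting, in us
    int64_t add_batch_num = 0;                // number of add_batch calls
};

// BE id -> accumulated add_batch statistics
using NodeAddBatchCounterMap = std::unordered_map<int64_t, AddBatchCounter>;

// Records one add_batch call for be_id; a missing entry is created zeroed.
void record_add_batch(NodeAddBatchCounterMap& counters, int64_t be_id,
                      int64_t execution_time_us, int64_t wait_time_us) {
    AddBatchCounter& c = counters[be_id];
    c.add_batch_execution_time_us += execution_time_us;
    c.add_batch_wait_time_us += wait_time_us;
    c.add_batch_num += 1;
}
```

A single lookup updates all three values together, so the statistics for a BE cannot drift out of sync the way separate maps can.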

be/src/runtime/tablet_writer_mgr.cpp

imay · 2019-07-25T04:40:12Z

fe/src/main/java/org/apache/doris/load/loadv2/LoadJob.java

            executeAfterAborted(txnState);
            // cancel load job
-            executeCancel(new FailMsg(FailMsg.CancelType.LOAD_RUN_FAIL, txnStatusChangeReason), false);
+            unprotectedExecuteCancel(new FailMsg(FailMsg.CancelType.LOAD_RUN_FAIL, txnStatusChangeReason), false);


clear idToTask
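A C++ sketch of what "cancel, then clear idToTask" amounts to, using hypothetical `Task` and `LoadJobSketch` types. The real `LoadJob` is Java, so this only mirrors the bookkeeping: every tracked task is cancelled, then the map is emptied so no stale task can be found afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <utility>

// Hypothetical task handle with a cancel flag.
struct Task {
    bool cancelled = false;
    void cancel() { cancelled = true; }
};

// Hypothetical job tracking its running tasks by id (mirroring idToTask).
class LoadJobSketch {
public:
    void add_task(int64_t id, std::shared_ptr<Task> task) { id_to_task_[id] = std::move(task); }

    // Cancel every running task, then drop the references so that
    // late callbacks cannot find tasks belonging to the cancelled job.
    void cancel_all() {
        for (auto& entry : id_to_task_) {
            entry.second->cancel();
        }
        id_to_task_.clear();
    }

    std::size_t task_count() const { return id_to_task_.size(); }

private:
    std::unordered_map<int64_t, std::shared_ptr<Task>> id_to_task_;
};
```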

imay · 2019-07-25T04:42:39Z

fe/src/main/java/org/apache/doris/load/loadv2/LoadingTaskPlanner.java

+            olapTableSink.updateLoadId(loadId);
+        }
+
+        LOG.info("update olap table sink's load id to {}, job: {}", DebugUtil.printId(loadId), loadJobId);


Delete this debug log

Linking loadId and jobId is necessary.


No, this is not a debug log. Without this log we cannot find the new load ID.
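The log line matters because it is the only place that ties a freshly generated load ID to its job. A small sketch of building that line follows. The `hi-lo` hexadecimal format of the hypothetical `print_id` is an assumption made for illustration and is not claimed to match `DebugUtil.printId` exactly.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical 128-bit id.
struct UniqueId {
    int64_t hi;
    int64_t lo;
};

// Formats an id as "<hi>-<lo>" in lowercase hex (format assumed for this sketch).
std::string print_id(const UniqueId& id) {
    std::ostringstream out;
    out << std::hex << static_cast<uint64_t>(id.hi) << "-" << static_cast<uint64_t>(id.lo);
    return out.str();
}

// Builds the log line linking the new load id to its job id (job id in decimal).
std::string load_id_log(const UniqueId& load_id, int64_t job_id) {
    std::ostringstream out;
    out << "update olap table sink's load id to " << print_id(load_id) << ", job: " << job_id;
    return out.str();
}
```

Searching FE logs for `job: <id>` then yields every load ID used by that job, including the ones created by retries.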

EmmyMiao87 · 2019-07-25T06:27:18Z

fe/src/main/java/org/apache/doris/load/loadv2/BrokerLoadPendingTask.java

+                totalFileNum += fileStatuses.size();
+                LOG.info("get {} files to in file group {}. size: {}. job: {}",
+                        fileStatuses.size(), groupNum, groupFileSize, callback.getCallbackId());
+                groupNum++;


The group number is not meaningful. The table ID would probably be better.

I will add a table id here

EmmyMiao87 · 2019-07-25T07:33:14Z

fe/src/main/java/org/apache/doris/load/routineload/KafkaTaskInfo.java

        return gson.toJson(partitionIdToOffset);
    }

    private TExecPlanFragmentParams updateTExecPlanFragmentParams(RoutineLoadJob routineLoadJob) throws UserException {


A better name for this method is replan.

change it to rePlan()

EmmyMiao87 · 2019-07-25T07:41:16Z

fe/src/main/java/org/apache/doris/load/loadv2/LoadingTaskPlanner.java

+            olapTableSink.updateLoadId(loadId);
+        }
+
+        LOG.info("update olap table sink's load id to {}, job: {}", DebugUtil.printId(loadId), loadJobId);


Linking loadId and jobId is necessary.

imay · 2019-07-26T01:28:52Z

be/test/exec/tablet_sink_test.cpp

+    if (!doris::config::init(conffile.c_str(), false)) {
+        fprintf(stderr, "error read config file. \n");
+        return -1;
+    }


You can set the config values in SetUp(); otherwise this test can only be run via run-ut.sh.
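A minimal sketch of the reviewer's suggestion. The `config` namespace here is a hypothetical stand-in for `doris::config`, and the plain class only mimics a GoogleTest fixture; in the real test it would derive from `testing::Test` and override `SetUp()`. The value 10 is an arbitrary test value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for doris::config; real values normally come from be.conf.
namespace config {
int32_t tablet_writer_rpc_timeout_sec = 600;
}

// Mimics a gtest fixture: SetUp() assigns the config values the test needs
// directly, so no conf file (and no run-ut.sh) is required.
class TabletSinkTestSketch {
public:
    void SetUp() {
        config::tablet_writer_rpc_timeout_sec = 10;  // arbitrary short timeout for tests
    }
};
```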


morningman added 7 commits July 24, 2019 17:13

cancel load first commit

ddbd1f0

remove useless report

29c0e3a

fix compile bug

6b3c8eb

add lock time

7e81ad4

fix bug

badaac2

remove mini load max bytes

dc523ee

imay requested changes Jul 25, 2019


EmmyMiao87 reviewed Jul 25, 2019


morningman added 3 commits July 25, 2019 16:33

fix by review 1

565141c

fix ut

c91ebae

fix compile bug

32cfe11

imay reviewed Jul 26, 2019


fix ut

0738b37

morningman closed this Jul 26, 2019

morningman reopened this Jul 26, 2019

fix ut 2

972bbd9

imay approved these changes Jul 27, 2019


imay merged commit 0694b6a into apache:master Jul 27, 2019

imay mentioned this pull request Sep 26, 2019

Release Notes 0.11.0 #1891

Closed

swjtu-zhanglei pushed a commit to swjtu-zhanglei/incubator-doris that referenced this pull request Jul 25, 2023

[fix](feut) Fix some FE ut (apache#1546)

f93c57b

Co-authored-by: morningman <morningman@163.com>


morningman commented Jul 25, 2019


4 participants