[Enhancement](Load) Express the parameters of Stream Load using SQL #16940

Cai-Yao · 2023-02-20T08:55:22Z

Proposed changes

Issue Number: close #xxx

Problem summary

In Stream Load, add a 'sql' parameter to the request header. It replaces the previous 'column_separator', 'line_delimiter', 'where', and 'columns' parameters, so a load can be expressed as a single SQL statement.

curl --location-trusted -u user:passwd [-H "sql: ${load_sql}"...] -T data.file -XPUT http://fe_host:http_port/api/_stream_load_with_sql


# -- load_sql
# insert into db.table (col, ...) select stream_col, ... from stream("property1"="value1");

# stream
# (
#     "column_separator" = ",",
#     "format" = "CSV",
#     ...
# )

Examples:

curl --location-trusted -u root: -T test.csv -H "sql:insert into demo.example_tbl_1(user_id, age, cost) select c1, c4, c7 * 2 from stream(\"format\" = \"CSV\", \"column_separator\" = \",\") where age >= 30" http://127.0.0.1:28030/api/_stream_load_with_sql

Checklist(Required)

Does it affect the original behavior
Have unit tests been added
Has documentation been added or modified
Does it need to update dependencies
Does this PR support rollback (If NO, please explain WHY)

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

github-actions

clang-tidy made some suggestions

be/src/http/action/stream_load_with_sql.cpp

github-actions · 2023-02-20T09:00:07Z

be/src/http/action/stream_load_with_sql.h

+    Status _process_put_with_load_sql(HttpRequest* http_req, StreamLoadContext* ctx);
+    void _save_stream_load_record(StreamLoadContext* ctx, const std::string& str);
+
+private:


warning: redundant access specifier has the same accessibility as the previous access specifier [readability-redundant-access-specifiers]

Suggested change

private:

be/src/http/action/stream_load_with_sql.h:46: previously declared here

private:
^

yiguolei · 2023-02-21T06:32:36Z

regression-test/suites/load_p0/stream_load_with_sql/test_stream_load_with_sql.groovy

+        streamLoad {
+            set 'version', '1'
+            set 'sql', """
+                    insert into ${db}.${tableName3} select c1, c2, year(c14), month(c14), day(c14) from stream("format"="csv")


Please add test cases for the line delimiter and the column separator.

yiguolei · 2023-02-21T06:50:38Z

be/src/http/action/stream_load_with_sql.cpp

+    ctx->load_type = TLoadType::MANUL_LOAD;
+    ctx->load_src_type = TLoadSourceType::RAW;
+
+    ctx->db = req->param(HTTP_DB_KEY);


Do not use the db and table from the URL; use the db and table in the SQL to check privileges.

yiguolei · 2023-02-21T06:55:02Z

be/src/http/action/stream_load_with_sql.cpp

+    // default csv
+    ctx->format = TFileFormatType::FORMAT_CSV_PLAIN;
+
+    if (ctx->format == TFileFormatType::FORMAT_UNKNOWN) {


Is this line useful?

yiguolei · 2023-02-21T06:55:53Z

be/src/http/action/stream_load_with_sql.cpp

+
+Status StreamLoadWithSqlAction::_on_header(HttpRequest* http_req, StreamLoadContext* ctx) {
+    // auth information
+    if (!parse_basic_auth(*http_req, &ctx->auth)) {


This is useless. Privileges should be checked using the db and table in the SQL.

yiguolei · 2023-02-21T06:58:51Z

be/src/http/action/stream_load_with_sql.cpp

+    int64_t begin_txn_start_time = MonotonicNanos();
+    // RETURN_IF_ERROR(_exec_env->stream_load_executor()->begin_txn(ctx));
+    ctx->begin_txn_cost_nanos = MonotonicNanos() - begin_txn_start_time;
+


Some code was lost here:

    // check content length
    ctx->body_bytes = 0;
    size_t csv_max_body_bytes = config::streaming_load_max_mb * 1024 * 1024;
    size_t json_max_body_bytes = config::streaming_load_json_max_mb * 1024 * 1024;
    bool read_json_by_line = false;
    if (!http_req->header(HTTP_READ_JSON_BY_LINE).empty()) {
        if (iequal(http_req->header(HTTP_READ_JSON_BY_LINE), "true")) {
            read_json_by_line = true;
        }
    }
    if (!http_req->header(HttpHeaders::CONTENT_LENGTH).empty()) {
        ctx->body_bytes = std::stol(http_req->header(HttpHeaders::CONTENT_LENGTH));
        // json max body size
        if ((ctx->format == TFileFormatType::FORMAT_JSON) &&
            (ctx->body_bytes > json_max_body_bytes) && !read_json_by_line) {
            return Status::InternalError(
                    "The size of this batch exceed the max size [{}] of json type data "
                    " data [ {} ]. Split the file, or use 'read_json_by_line'",
                    json_max_body_bytes, ctx->body_bytes);
        }
        // csv max body size
        else if (ctx->body_bytes > csv_max_body_bytes) {
            LOG(WARNING) << "body exceed max size." << ctx->brief();
            return Status::InternalError("body exceed max size: {}, data: {}", csv_max_body_bytes,
                                         ctx->body_bytes);
        }
    } else {
    #ifndef BE_TEST
        evhttp_connection_set_max_body_size(
                evhttp_request_get_connection(http_req->get_evhttp_request()), csv_max_body_bytes);
    #endif
    }

    if (!http_req->header(HTTP_TIMEOUT).empty()) {
        try {
            ctx->timeout_second = std::stoi(http_req->header(HTTP_TIMEOUT));
        } catch (const std::invalid_argument& e) {
            return Status::InvalidArgument("Invalid timeout format");
        }
    }
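A side note on the header parsing in the snippet above: `std::stoi` and `std::stol` throw `std::out_of_range` as well as `std::invalid_argument`, and the `Content-Length` conversion is not wrapped in a `try` at all. One possible way to handle both headers uniformly is a small non-throwing helper. This is only a sketch; `parse_header_int` is a hypothetical name and not part of the Doris code base.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: parses a non-negative decimal integer from a header
// value. Returns std::nullopt for empty input, trailing garbage, negative
// numbers, and values that do not fit in long long.
std::optional<long long> parse_header_int(std::string_view text) {
    long long value = 0;
    const char* first = text.data();
    const char* last = first + text.size();
    auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last || value < 0) {
        return std::nullopt;
    }
    return value;
}
```

The caller could then map `std::nullopt` to `Status::InvalidArgument` for both the timeout and the content length, instead of relying on exception handling.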

yiguolei · 2023-02-21T07:01:13Z

be/src/http/action/stream_load_with_sql.cpp

+
+    // begin transaction
+    int64_t begin_txn_start_time = MonotonicNanos();
+    // RETURN_IF_ERROR(_exec_env->stream_load_executor()->begin_txn(ctx));


begin_txn is not called? How is txn_id set?

begin_txn is called on the FE.

yiguolei · 2023-02-21T07:05:17Z

be/src/http/action/stream_load_with_sql.cpp

+DEFINE_GAUGE_METRIC_PROTOTYPE_2ARG(streaming_load_with_sql_current_processing,
+                                   MetricUnit::REQUESTS);
+
+#ifdef BE_TEST


Delete this code.

yiguolei · 2023-02-21T07:07:24Z

be/src/http/action/stream_load_with_sql.cpp

+    int64_t start_read_data_time = MonotonicNanos();
+    const size_t buffer_max_size = 1 * 1024 * 1024;
+    size_t buffer_size = 0;
+    char* buffer = new char[buffer_max_size];


Add a comment for the buffer variable.

yiguolei · 2023-02-21T07:08:27Z

be/src/http/action/stream_load_with_sql.cpp

+    } else {
+        LOG(WARNING) << "_exec_env->master_info not set backend_id";
+    }
+    request.__set_backend_id(10046);


yiguolei · 2023-02-21T07:13:23Z

be/src/http/action/stream_load_with_sql.cpp

+
+Status StreamLoadWithSqlAction::_process_put(HttpRequest* http_req, StreamLoadContext* ctx) {
+    // Now we use stream
+    ctx->use_streaming = is_format_support_streaming(ctx->format);


If the format does not support streaming, is there any other code path?

yiguolei · 2023-02-22T02:54:54Z

be/src/http/action/stream_load_with_sql.cpp

+    // ctx->future.wait_for(std::chrono::seconds(config::max_fragment_start_wait_time_seconds));
+    // if (!ctx->future.valid()) {
+    //     return Status::TimedOut("data receive timeout");
+    // }


Do not wait indefinitely. For example, if the FE crashes, it will never set the promise, and the HTTP thread will hang. Wait for 1 second, then call the FE to check the load (txn) status using the load ID; if the load is not found, just let the load fail.

OK, I get it.
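A minimal sketch of the bounded wait suggested in this thread, assuming the completion signal is a `std::future<void>` and that the FE status check can be modelled as a callback. `wait_for_load`, `WaitResult`, and `load_still_known` are hypothetical names for illustration; the real code would issue an RPC to the FE instead of calling a callback. Note that `std::future::valid()` only reports whether the future has shared state, so it cannot detect a timeout; the status returned by `wait_for` does.

```cpp
#include <chrono>
#include <functional>
#include <future>

enum class WaitResult { kFinished, kLoadNotFound, kTimedOut };

// Waits for `done` in short slices instead of blocking indefinitely.
// After each slice that expires, asks `load_still_known` (a stand-in for
// an FE status RPC) whether the load still exists; if not, gives up.
WaitResult wait_for_load(std::future<void>& done,
                         const std::function<bool()>& load_still_known,
                         std::chrono::milliseconds slice,
                         int max_slices) {
    for (int i = 0; i < max_slices; ++i) {
        if (done.wait_for(slice) == std::future_status::ready) {
            return WaitResult::kFinished;
        }
        if (!load_still_known()) {
            return WaitResult::kLoadNotFound;
        }
    }
    return WaitResult::kTimedOut;
}
```

With a slice of one second, the HTTP thread is released within roughly one second of the FE losing track of the load, and `max_slices` caps the total wait even when the FE keeps reporting the load as known.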

yiguolei · 2023-02-22T02:55:24Z

be/src/http/action/stream_load_with_sql.cpp

+        // RETURN_IF_ERROR(_exec_env->stream_load_executor()->commit_txn(ctx));
+        ctx->commit_and_publish_txn_cost_nanos = MonotonicNanos() - commit_and_publish_start_time;
+    }
+    while (!ctx->is_stream_load_put_success) {


What does this code mean?

This code may be useless. I wanted to make sure that put_process is executed before the handle.

yiguolei · 2023-02-22T02:57:52Z

be/src/http/action/stream_load_with_sql.cpp

+    }
+}
+
+static bool is_format_support_streaming(TFileFormatType::type format) {


Do not copy is_format_support_streaming and parse_format; too much duplicated code is hard to maintain. Could you use StreamLoadAction::is_format_support_streaming or StreamLoadAction::parse_format?

yiguolei · 2023-02-22T03:00:26Z

fe/fe-core/src/main/java/org/apache/doris/backup/BlobStorage.java


    public static BlobStorage create(String name, StorageBackend.StorageType type, Map<String, String> properties) {
-        if (type == StorageBackend.StorageType.S3) {
+        if (type == StorageBackend.StorageType.S3 || type == StorageBackend.StorageType.STREAM) {


Why add code here?

yiguolei · 2023-02-22T03:01:11Z

fe/fe-core/src/main/java/org/apache/doris/planner/external/QueryScanProvider.java

+            for (Backend be : Env.getCurrentSystemInfo().getIdToBackend().values()) {
+                long streamLoadBackendId = ctx.getBackendId();
+                if (be.getId() == streamLoadBackendId) {
+                    LOG.info("cwk newLocations");


Bad log message.

yiguolei · 2023-02-22T03:02:34Z

fe/fe-core/src/main/java/org/apache/doris/qe/QeProcessorImpl.java

        }
        try {
            info.getCoord().updateFragmentExecStatus(params);
+            info.getCoord().setIsReportExecStatus(true);


Why add this code?

The purpose is to record whether the FE has updated the BE execution status, so that streamload_action can query the execution status.

yiguolei · 2023-02-22T03:05:08Z

fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java

    }

+    private void streamLoadPutWithSqlImpl(TStreamLoadPutRequest request) throws UserException {
+        String loadSql = request.getLoadSql();


Add a log here to indicate that we receive a load request.

OK

yiguolei · 2023-02-22T03:06:29Z

fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java

+        SqlScanner input = new SqlScanner(new StringReader(loadSql), ctx.getSessionVariable().getSqlMode());
+        SqlParser parser = new SqlParser(input);
+        try {
+            StatementBase parsedStmt = SqlParserUtils.getFirstStmt(parser);


Do not call coord.exec() in the RPC thread context. It will use up all RPC threads.

Instead, add the coord to a map<load id, coord> and use a thread pool to check the coord status and send the status to the related BE.
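The shape of that suggestion can be sketched as follows. The FE code is Java, so this C++ version only illustrates the pattern: the RPC handler registers the load and returns at once, while a separate worker periodically collects finished loads. `LoadRegistry`, `register_load`, and `collect_finished` are hypothetical names, and the `IsFinished` callback stands in for querying the coordinator.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

class LoadRegistry {
public:
    using IsFinished = std::function<bool()>;

    // Called from the RPC handler: bookkeeping only, no execution.
    void register_load(const std::string& load_id, IsFinished is_finished) {
        std::lock_guard<std::mutex> lock(mu_);
        loads_[load_id] = std::move(is_finished);
    }

    // Called from a worker thread, never from the RPC handler.
    // Returns the IDs of finished loads and removes them from the map.
    std::vector<std::string> collect_finished() {
        std::lock_guard<std::mutex> lock(mu_);
        std::vector<std::string> finished;
        for (auto it = loads_.begin(); it != loads_.end();) {
            if (it->second()) {
                finished.push_back(it->first);
                it = loads_.erase(it);
            } else {
                ++it;
            }
        }
        return finished;
    }

    std::size_t pending() const {
        std::lock_guard<std::mutex> lock(mu_);
        return loads_.size();
    }

private:
    mutable std::mutex mu_;
    std::map<std::string, IsFinished> loads_;
};
```

For simplicity the callbacks run while the lock is held; a real implementation would probably copy the entries out first so that a slow status check does not block new registrations.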

yiguolei · 2023-02-22T03:08:35Z

gensrc/thrift/FrontendService.thrift

    TWaitingTxnStatusResult waitingTxnStatus(1: TWaitingTxnStatusRequest request)

    TStreamLoadPutResult streamLoadPut(1: TStreamLoadPutRequest request)
+    TStreamLoadWithLoadStatusResult StreamLoadWithLoadStatus(1: TStreamLoadWithLoadStatusRequest request)


StreamLoadWithLoadStatus --> streamLoadWithLoadStatus

yiguolei · 2023-02-22T03:15:37Z

be/src/runtime/fragment_mgr.cpp

    DCHECK(req.status.ok() || req.done); // if !status.ok() => done
    Status exec_status = req.update_fn(req.status);
-
+    if (_exec_env->new_load_stream_mgr()->have_promise(req.query_id)) {


Do not depend on this to set the promise. coordinator_callback only means that the fragment has finished on the BE, but there is still some state on the FE.
Add an RPC method in the BE; when the FE finds that the coord has finished, it calls this RPC service to indicate that the load has finished.

yiguolei · 2023-02-22T03:18:01Z

fe/fe-core/src/main/java/org/apache/doris/qe/Coordinator.java

    // Once this is set to true, errors from remote fragments are ignored.
    private boolean returnedAllResults;

+    private boolean isReportExecStatus;


Why add this variable?

yiguolei · 2023-02-22T03:18:51Z

be/src/http/action/stream_load_with_sql.cpp

+    }
+    request.__set_execMemLimit(2 * 1024 * 1024 * 1024L);
+    request.fileType = TFileType::FILE_STREAM;
+    request.__set_thrift_rpc_timeout_ms(20000);


Is there any config for thrift rpc timeout?

yiguolei · 2023-02-22T03:19:11Z

be/src/http/action/stream_load_with_sql.cpp

+    } else {
+        LOG(WARNING) << "_exec_env->master_info not set backend_id";
+    }
+    request.__set_execMemLimit(2 * 1024 * 1024 * 1024L);


Add a config in config.h for this variable.

weizuo93 · 2023-02-24T02:21:34Z

be/src/common/config.h

 // time interval to clean expired stream load records
 CONF_mInt64(clean_stream_load_record_interval_secs, "1800");
+// use memory in stream load default
+CONF_Int64(stream_load_exec_mem_limit, "214748364"); // 2G


CONF_mInt64(stream_load_exec_mem_limit, "214748364");
Maybe it would be better to make this parameter dynamically adjustable?
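Separately from the mutability question, the default value looks worth double-checking against its `// 2G` comment. The arithmetic below is a small compile-time sketch; the constant and function names are made up for illustration. It suggests that `214748364` is about 204 MiB, and appears to be 2 GiB with the final digit dropped.

```cpp
#include <cstdint>

// 2 GiB in bytes, computed in 64-bit arithmetic so no intermediate overflows.
constexpr std::int64_t kTwoGiB = std::int64_t{2} * 1024 * 1024 * 1024;

// The default string from the diff, written as a number.
constexpr std::int64_t kDefaultInDiff = 214748364;

// Whole mebibytes in a byte count (rounds down).
constexpr std::int64_t whole_mib(std::int64_t bytes) {
    return bytes / (1024 * 1024);
}

static_assert(kTwoGiB == 2147483648LL, "2 GiB is 2147483648 bytes");
static_assert(kDefaultInDiff * 10 + 8 == kTwoGiB,
              "appending an 8 to the diff value yields exactly 2 GiB");
static_assert(whole_mib(kTwoGiB) == 2048, "2 GiB is 2048 MiB");
static_assert(whole_mib(kDefaultInDiff) == 204, "the diff value is about 204 MiB");
```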

yiguolei · 2023-03-13T08:41:14Z

fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java

        result.setStatus(status);
        try {
-            result.setParams(streamLoadPutImpl(request));
+            if (request.getVersion() == 1) {


Use an enum type to indicate the stream load type.

yiguolei · 2023-03-13T08:42:07Z

fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java

        result.setStatus(status);
        try {
-            result.setParams(streamLoadPutImpl(request));
+            if (request.getVersion() == 1) {


The version variable may not be set for non-SQL loads. You should check whether it is set.
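The two comments on this hunk (use an enum, and check that the field is set before reading it) could be combined roughly as below. The FE code is Java, so this C++ version is only an illustration of the logic; `PutRequestSketch`, `has_version`, `StreamLoadKind`, and `classify` are hypothetical stand-ins for the Thrift-generated request and its "is set" flag.

```cpp
#include <cstdint>

// Hypothetical enum naming the two load paths instead of a bare integer.
enum class StreamLoadKind { kClassic, kWithSql };

// Hypothetical stand-in for a request with an optional `version` field.
struct PutRequestSketch {
    bool has_version = false;   // mirrors a generated "is set" flag
    std::int32_t version = 0;   // meaningful only when has_version is true
};

// Requests that never set `version` take the classic path; only an
// explicitly set version of 1 selects the SQL path.
StreamLoadKind classify(const PutRequestSketch& request) {
    if (request.has_version && request.version == 1) {
        return StreamLoadKind::kWithSql;
    }
    return StreamLoadKind::kClassic;
}
```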

yiguolei · 2023-03-13T08:43:54Z

fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java

        return result;
    }

+    public class ReportStreamLoadWorker implements Runnable {


Add a separate class file in the load package, not in FrontendServiceImpl.

yiguolei · 2023-03-13T08:44:49Z

fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java

+        ctx.setBackendId(request.getBackendId());
+        StreamLoadTask streamLoadTask = StreamLoadTask.fromTStreamLoadPutRequest(request);
+        ctx.setStreamLoadInfo(streamLoadTask);
+        ctx.setLoadId(request.getLoadId());


On line 1277 you already set the query ID to the load ID, so the load ID is useless now.

yiguolei · 2023-03-13T08:48:28Z

fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java

+            QeProcessorImpl.INSTANCE.registerQuery(request.getLoadId(), coord);
+            coord.exec();
+        } catch (UserException e) {
+            LOG.warn("exec sql error {}", e.getMessage());


LOG.warn("exec sql error {}", e);

yiguolei · 2023-03-13T08:49:01Z

fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java

+            coord.setQueryType(TQueryType.LOAD);
+            QeProcessorImpl.INSTANCE.registerQuery(request.getLoadId(), coord);
+            coord.exec();
+        } catch (UserException e) {


This line is useless, since you already catch Throwable at line 1304.

github-actions · 2023-07-29T15:35:35Z

clang-tidy review says "All clean, LGTM! 👍"

stream load rebase

github-actions · 2023-08-01T06:35:13Z

clang-tidy review says "All clean, LGTM! 👍"

@Cai-Yao

This PR was originally #16940, but the original author @Cai-Yao has not updated it for a long time. For now, we will merge part of the code into master first. Thanks @Cai-Yao @yiguolei

github-actions bot added the area/planner, kind/docs, and kind/test labels Feb 20, 2023

github-actions bot reviewed Feb 20, 2023

yiguolei reviewed Feb 21, 2023

Cai-Yao force-pushed the stream_load branch from 129b8d8 to babbaad on February 21, 2023 09:32

yiguolei reviewed Feb 22, 2023

github-actions bot added the area/vectorization label Feb 24, 2023

weizuo93 reviewed Feb 24, 2023

yiguolei reviewed Mar 13, 2023

yiguolei mentioned this pull request Jul 9, 2023

[Enhancement](load) http load using SQL #21621

Closed

Cai-Yao and others added 14 commits July 29, 2023 22:58

init

c2c3f14

fix context is null

9bab338

add setFileAttributes and clean unused code

c6f64e8

add stream load result and some regression test

b5d7f3f

fix some bugs

a47ab61

add stream load regression test

1bec888

add docs

65e6dec

fix bug and remove fe check privileges

1579dd1

add line delimiter and column separator test

773e04f

add reportStreamLoadStatus thread in FE and fix

bce76f4

Update be/src/http/action/stream_load_with_sql.cpp

a4586d7

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

add stream load with local file

83c8e79

rebase

0929bae

Update be/src/http/action/stream_load_with_sql.cpp

db9573e

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Cai-Yao force-pushed the stream_load branch from f9f3054 to db9573e on July 29, 2023 15:27

zzzzzzzs and others added 3 commits July 31, 2023 10:41

Merge branch 'master' into stream_load

c083718

rebase

7c6043c

Merge pull request #1 from zzzzzzzs/zs_stream_load

69a24b5

stream load rebase

zzzzzzzs mentioned this pull request Aug 2, 2023

[Enhancement](Load) Stream Load using SQL #22509

Merged

dataroaring closed this Sep 2, 2023
