-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Closed
Labels
kind/fixCategorizes issue or PR as related to a bug.Categorizes issue or PR as related to a bug.
Milestone
Description
Describe the bug
Every time after restarting BE, the stream load job will fail:
Status: Fail
Message: failed to send batch
NumberLoadedRows: 0
NumberFilteredRows: 0
LoadBytes: 18708516
LoadTimeMs: 100144
ErrorURL: null
Status: Fail
Message: failed to open tablet writer
To Reproduce
Steps to reproduce the behavior:
- running many stream load job
- kill -9 restart all BE in one cluster
- you should see the above error
Additional context
Usually,this error will keep several minutes. So I have added a retry strategy and this issue not a big issue. But last night, this error kept more than an hour.
The following is some related log in BE:
W0429 21:47:00.231511 85103 olap_table_sink.cpp:203] failed to send batch, error=RPC call is timed out, error_text=[E1008]Reached timeout=50000ms @10.26.82.152:8060
W0429 21:47:00.272903 85103 olap_table_sink.cpp:355] close_wait node channel failed, load_id=604297B000A3DE93:3DF96F1E3A3EB389, node=10.26.82.152:8060, errmsg=failed to send batch
W0429 21:47:00.277072 85301 stream_load.cpp:296] commit transaction failed, id=604297B000A3DE93:3DF96F1E3A3EB389, errmsg=transaction commit failed
W0429 21:47:00.279239 85301 stream_load.cpp:225] handle streaming load failed, id=604297B000A3DE93:3DF96F1E3A3EB389, errmsg=transaction commit failed
W0429 21:58:34.429118 44529 olap_table_sink.cpp:116] failed to open tablet writer, error=RPC call is timed out, error_text=[E1008]Reached timeout=50000ms @10.26.82.149:8060
W0429 21:58:34.457186 44529 olap_table_sink.cpp:289] tablet open failed, load_id=FD435A82BACA5422:53D0158F40EFC49A, node=10.26.82.149:8060, errmsg=failed to open tablet writer
W0429 21:59:24.487653 44529 olap_table_sink.cpp:203] failed to send batch, error=RPC call is timed out, error_text=[E1008]Reached timeout=50000ms @10.26.82.145:8060
W0429 21:59:24.518293 44529 olap_table_sink.cpp:355] close_wait node channel failed, load_id=FD435A82BACA5422:53D0158F40EFC49A, node=10.26.82.145:8060, errmsg=failed to send batch
W0429 21:59:24.518378 44529 olap_table_sink.cpp:360] close_wait failed, load_id=FD435A82BACA5422:53D0158F40EFC49A
W0429 21:59:24.518388 44529 olap_table_sink.cpp:586] close channel failed, load_id=FD435A82BACA5422:53D0158F40EFC49A, txn_id=192482768
W0429 21:59:24.519050 44529 stream_load.cpp:543] fragment execute failed, load_id=FD435A82BACA5422:53D0158F40EFC49A, txn_id=192482768, query_id=E86270C7C4BC4259:A8E4D34ECEE0977D, errmsg=failed to send batch
W0429 21:59:24.519096 44720 stream_load.cpp:225] handle streaming load failed, id=FD435A82BACA5422:53D0158F40EFC49A, errmsg=failed to send batch
W0429 22:36:52.198593 85136 olap_table_sink.cpp:203] failed to send batch, error=RPC call is timed out, error_text=[E1008]Reached
timeout=50000ms @10.26.70.178:8060
I0429 22:36:52.198784 85136 status.cpp:54] failed to send batch
@ 0xb96ba9 doris::Status::Status()
@ 0x11dd838 doris::stream_load::NodeChannel::_wait_in_flight_packet()
@ 0x11dd903 doris::stream_load::NodeChannel::close_wait()
@ 0x11dee6f doris::stream_load::IndexChannel::close()
@ 0x11df2fe doris::stream_load::OlapTableSink::close()
@ 0xe405d5 doris::PlanFragmentExecutor::open_internal()
@ 0xe40af3 doris::PlanFragmentExecutor::open()
@ 0xde0077 doris::FragmentExecState::execute()
@ 0xde1a4c doris::FragmentMgr::exec_actual()
@ 0xde5034 boost::detail::function::void_function_obj_invoker0<>::invoke()
@ 0xda7ee8 doris::ThreadPool::work_thread()
@ 0x157ffbd thread_proxy
@ 0x7f4362940dd5 start_thread
@ 0x7f4362c52ead __clone
I0429 22:10:18.703799 81833 status.cpp:54] failed to open tablet writer
@ 0xb96ba9 doris::Status::Status()
@ 0x11dd105 doris::stream_load::NodeChannel::open_wait()
@ 0x11dd241 doris::stream_load::IndexChannel::open()
@ 0x11dd5ac doris::stream_load::OlapTableSink::open()
@ 0xe40127 doris::PlanFragmentExecutor::open_internal()
@ 0xe40af3 doris::PlanFragmentExecutor::open()
@ 0xde0077 doris::FragmentExecState::execute()
@ 0xde1a4c doris::FragmentMgr::exec_actual()
@ 0xde5034 boost::detail::function::void_function_obj_invoker0<>::invoke()
@ 0xda7ee8 doris::ThreadPool::work_thread()
@ 0x157ffbd thread_proxy
@ 0x7fc2bf49fdd5 start_thread
@ 0x7fc2bf7b1ead __clone
W0429 22:10:18.738099 81833 olap_table_sink.cpp:289] tablet open failed, load_id=5E4F37F7807C69E1:B9025008E1BE25A8, node=10.26.82.152:8060, errmsg=failed to open tablet writer
My kafka to doris program also get the following error:
rpc failed. backend id: 87716913
2019-04-29 21:58:34,167 ERROR [pool-2-thread-1] kafka.KafkaConsumerTask:248 : java.util.concurrent.TimeoutException: Read timeout
to gha-data-palo-query08.gh.sankuai.com/10.26.82.142:8040 after 60000 ms
java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Read timeout to xxxx.com/xxxx:8040 after 60000 ms
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
at org.asynchttpclient.netty.NettyResponseFuture.get(NettyResponseFuture.java:202)
at com.meituan.doris.doris.DorisHttpClient.sendDoris(DorisHttpClient.java:37)
I found there are a lot of delete log and the disk.io.util is very high after restarting BE:
I0429 22:20:03.368141 81909 olap_table.cpp:716] delete pending data from tablet. [table=112515529.1907219336 transaction_id=192478116]
I0429 22:20:03.368486 81909 olap_table.cpp:716] delete pending data from tablet. [table=112515537.1907219336 transaction_id=192478116]
I0429 22:20:03.368813 81909 olap_table.cpp:716] delete pending data from tablet. [table=112515588.1907219336 transaction_id=192478116]
I0429 22:20:03.369206 81909 olap_table.cpp:716] delete pending data from tablet. [table=112515595.1907219336 transaction_id=192478116]
I0429 22:20:03.369566 81909 olap_table.cpp:716] delete pending data from tablet. [table=112515641.1907219336 transaction_id=192478116]
I0429 22:18:31.064805 81986 utils.cpp:1568] delete unused file. [file='/data2/olap/data/191/112665167/1509942097/112665167.hdr']
I0429 22:18:31.166507 81981 utils.cpp:1568] delete unused file. [file='/data8/olap/data/138/112651719/1509942097/112651719.hdr']
I0429 22:18:34.225603 81906 utils.cpp:1568] delete unused file. [file='/data11/olap/data/451/112729201/1509942097/112729201.hdr']
The following are some abnormal warn log:
I0429 22:58:02.467939 57825 olap_table.cpp:970] version_hash is same when publish version, publish success. [table=112752871.15099
42097 transaction_id=184325307]
W0429 22:58:02.467954 57825 olap_engine.cpp:961] can't publish version on tablet since data existed. [table=112752871.1509942097 t
ransaction_id=184325307 version=104]
I0429 22:58:02.469532 57825 olap_table.cpp:716] delete pending data from tablet. [table=112752871.1509942097 transaction_id=184325307]
I0429 22:58:02.469588 57825 olap_table.cpp:970] version_hash is same when publish version, publish success. [table=112752887.15099
42097 transaction_id=184325307]
W0429 22:58:02.469593 57825 olap_engine.cpp:961] can't publish version on tablet since data existed. [table=112752887.1509942097 t
ransaction_id=184325307 version=104]
I0429 22:58:02.470337 57825 olap_table.cpp:716] delete pending data from tablet. [table=112752887.1509942097 transaction_id=184325307]
Metadata
Metadata
Assignees
Labels
kind/fixCategorizes issue or PR as related to a bug.Categorizes issue or PR as related to a bug.