Skip to content

Restarting BE lead to All Stream load Job failed #1080

@kangkaisen

Description

@kangkaisen

Describe the bug

Every time after restarting BE, the stream load job will fail:

Status: Fail
Message: failed to send batch
NumberLoadedRows: 0
NumberFilteredRows: 0
LoadBytes: 18708516
LoadTimeMs: 100144
ErrorURL: null

Status: Fail
Message: failed to open tablet writer

To Reproduce

Steps to reproduce the behavior:

  1. running many stream load job
  2. kill -9 restart all BE in one cluster
  3. you should see the above error

Additional context

Usually,this error will keep several minutes. So I have added a retry strategy and this issue not a big issue. But last night, this error kept more than an hour.

The following is some related log in BE:

W0429 21:47:00.231511 85103 olap_table_sink.cpp:203] failed to send batch, error=RPC call is timed out, error_text=[E1008]Reached timeout=50000ms @10.26.82.152:8060
W0429 21:47:00.272903 85103 olap_table_sink.cpp:355] close_wait node channel failed, load_id=604297B000A3DE93:3DF96F1E3A3EB389, node=10.26.82.152:8060, errmsg=failed to send batch
W0429 21:47:00.277072 85301 stream_load.cpp:296] commit transaction failed, id=604297B000A3DE93:3DF96F1E3A3EB389, errmsg=transaction commit failed
W0429 21:47:00.279239 85301 stream_load.cpp:225] handle streaming load failed, id=604297B000A3DE93:3DF96F1E3A3EB389, errmsg=transaction commit failed


W0429 21:58:34.429118 44529 olap_table_sink.cpp:116] failed to open tablet writer, error=RPC call is timed out, error_text=[E1008]Reached timeout=50000ms @10.26.82.149:8060
W0429 21:58:34.457186 44529 olap_table_sink.cpp:289] tablet open failed, load_id=FD435A82BACA5422:53D0158F40EFC49A, node=10.26.82.149:8060, errmsg=failed to open tablet writer
W0429 21:59:24.487653 44529 olap_table_sink.cpp:203] failed to send batch, error=RPC call is timed out, error_text=[E1008]Reached timeout=50000ms @10.26.82.145:8060
W0429 21:59:24.518293 44529 olap_table_sink.cpp:355] close_wait node channel failed, load_id=FD435A82BACA5422:53D0158F40EFC49A, node=10.26.82.145:8060, errmsg=failed to send batch
W0429 21:59:24.518378 44529 olap_table_sink.cpp:360] close_wait failed, load_id=FD435A82BACA5422:53D0158F40EFC49A
W0429 21:59:24.518388 44529 olap_table_sink.cpp:586] close channel failed, load_id=FD435A82BACA5422:53D0158F40EFC49A, txn_id=192482768
W0429 21:59:24.519050 44529 stream_load.cpp:543] fragment execute failed, load_id=FD435A82BACA5422:53D0158F40EFC49A, txn_id=192482768, query_id=E86270C7C4BC4259:A8E4D34ECEE0977D, errmsg=failed to send batch
W0429 21:59:24.519096 44720 stream_load.cpp:225] handle streaming load failed, id=FD435A82BACA5422:53D0158F40EFC49A, errmsg=failed to send batch

W0429 22:36:52.198593 85136 olap_table_sink.cpp:203] failed to send batch, error=RPC call is timed out, error_text=[E1008]Reached
timeout=50000ms @10.26.70.178:8060
I0429 22:36:52.198784 85136 status.cpp:54] failed to send batch
    @           0xb96ba9  doris::Status::Status()
    @          0x11dd838  doris::stream_load::NodeChannel::_wait_in_flight_packet()
    @          0x11dd903  doris::stream_load::NodeChannel::close_wait()
    @          0x11dee6f  doris::stream_load::IndexChannel::close()
    @          0x11df2fe  doris::stream_load::OlapTableSink::close()
    @           0xe405d5  doris::PlanFragmentExecutor::open_internal()
    @           0xe40af3  doris::PlanFragmentExecutor::open()
    @           0xde0077  doris::FragmentExecState::execute()
    @           0xde1a4c  doris::FragmentMgr::exec_actual()
    @           0xde5034  boost::detail::function::void_function_obj_invoker0<>::invoke()
    @           0xda7ee8  doris::ThreadPool::work_thread()
    @          0x157ffbd  thread_proxy
    @     0x7f4362940dd5  start_thread
    @     0x7f4362c52ead  __clone


I0429 22:10:18.703799 81833 status.cpp:54] failed to open tablet writer
    @           0xb96ba9  doris::Status::Status()
    @          0x11dd105  doris::stream_load::NodeChannel::open_wait()
    @          0x11dd241  doris::stream_load::IndexChannel::open()
    @          0x11dd5ac  doris::stream_load::OlapTableSink::open()
    @           0xe40127  doris::PlanFragmentExecutor::open_internal()
    @           0xe40af3  doris::PlanFragmentExecutor::open()
    @           0xde0077  doris::FragmentExecState::execute()
    @           0xde1a4c  doris::FragmentMgr::exec_actual()
    @           0xde5034  boost::detail::function::void_function_obj_invoker0<>::invoke()
    @           0xda7ee8  doris::ThreadPool::work_thread()
    @          0x157ffbd  thread_proxy
    @     0x7fc2bf49fdd5  start_thread
    @     0x7fc2bf7b1ead  __clone
W0429 22:10:18.738099 81833 olap_table_sink.cpp:289] tablet open failed, load_id=5E4F37F7807C69E1:B9025008E1BE25A8, node=10.26.82.152:8060, errmsg=failed to open tablet writer

My kafka to doris program also get the following error:

rpc failed. backend id: 87716913


2019-04-29 21:58:34,167 ERROR [pool-2-thread-1] kafka.KafkaConsumerTask:248 : java.util.concurrent.TimeoutException: Read timeout
to gha-data-palo-query08.gh.sankuai.com/10.26.82.142:8040 after 60000 ms
java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Read timeout to xxxx.com/xxxx:8040 after 60000 ms
        at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
        at org.asynchttpclient.netty.NettyResponseFuture.get(NettyResponseFuture.java:202)
        at com.meituan.doris.doris.DorisHttpClient.sendDoris(DorisHttpClient.java:37)

I found there are a lot of delete log and the disk.io.util is very high after restarting BE:

I0429 22:20:03.368141 81909 olap_table.cpp:716] delete pending data from tablet. [table=112515529.1907219336 transaction_id=192478116]
I0429 22:20:03.368486 81909 olap_table.cpp:716] delete pending data from tablet. [table=112515537.1907219336 transaction_id=192478116]
I0429 22:20:03.368813 81909 olap_table.cpp:716] delete pending data from tablet. [table=112515588.1907219336 transaction_id=192478116]
I0429 22:20:03.369206 81909 olap_table.cpp:716] delete pending data from tablet. [table=112515595.1907219336 transaction_id=192478116]
I0429 22:20:03.369566 81909 olap_table.cpp:716] delete pending data from tablet. [table=112515641.1907219336 transaction_id=192478116]


I0429 22:18:31.064805 81986 utils.cpp:1568] delete unused file. [file='/data2/olap/data/191/112665167/1509942097/112665167.hdr']
I0429 22:18:31.166507 81981 utils.cpp:1568] delete unused file. [file='/data8/olap/data/138/112651719/1509942097/112651719.hdr']
I0429 22:18:34.225603 81906 utils.cpp:1568] delete unused file. [file='/data11/olap/data/451/112729201/1509942097/112729201.hdr']

The following are some abnormal warn log:

I0429 22:58:02.467939 57825 olap_table.cpp:970] version_hash is same when publish version, publish success. [table=112752871.15099
42097 transaction_id=184325307]
W0429 22:58:02.467954 57825 olap_engine.cpp:961] can't publish version on tablet since data existed. [table=112752871.1509942097 t
ransaction_id=184325307 version=104]
I0429 22:58:02.469532 57825 olap_table.cpp:716] delete pending data from tablet. [table=112752871.1509942097 transaction_id=184325307]
I0429 22:58:02.469588 57825 olap_table.cpp:970] version_hash is same when publish version, publish success. [table=112752887.15099
42097 transaction_id=184325307]
W0429 22:58:02.469593 57825 olap_engine.cpp:961] can't publish version on tablet since data existed. [table=112752887.1509942097 t
ransaction_id=184325307 version=104]
I0429 22:58:02.470337 57825 olap_table.cpp:716] delete pending data from tablet. [table=112752887.1509942097 transaction_id=184325307]

Metadata

Metadata

Assignees

No one assigned

    Labels

    kind/fixCategorizes issue or PR as related to a bug.

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions