Closed
Describe the bug
I found a core dump. The backtrace looks like this:
Program terminated with signal 6, Aborted.
#0 0x00007fca7abcb1d7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7_3.1.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0 0x00007fca7abcb1d7 in raise () from /lib64/libc.so.6
#1 0x00007fca7abcc8c8 in abort () from /lib64/libc.so.6
#2 0x0000000001b13376 in google::DumpStackTraceAndExit () at src/utilities.cc:147
#3 0x0000000001b0a67d in google::LogMessage::Fail () at src/logging.cc:1599
#4 0x0000000001b0c504 in google::LogMessage::SendToLog (this=0x7fca74df2770) at src/logging.cc:1553
#5 0x0000000001b0a1a4 in google::LogMessage::Flush (this=0x7fca74df2770) at src/logging.cc:1422
#6 0x0000000001b0cf39 in google::LogMessageFatal::~LogMessageFatal (this=<optimized out>, __in_chrg=<optimized out>) at src/logging.cc:2125
#7 0x0000000000e26694 in doris::DataDir::load (this=0x4d74f00) at /builds/olap/doris/be/src/olap/data_dir.cpp:705
#8 0x0000000000e09dd9 in operator() (__closure=0x5349558) at /builds/olap/doris/be/src/olap/storage_engine.cpp:149
#9 __invoke_impl<void, doris::StorageEngine::load_data_dirs(const std::vector<doris::DataDir*>&)::<lambda()> > (__f=...) at /usr/include/c++/7.3.0/bits/invoke.h:60
#10 __invoke<doris::StorageEngine::load_data_dirs(const std::vector<doris::DataDir*>&)::<lambda()> > (__fn=...) at /usr/include/c++/7.3.0/bits/invoke.h:95
#11 _M_invoke<0> (this=0x5349558) at /usr/include/c++/7.3.0/thread:234
#12 operator() (this=0x5349558) at /usr/include/c++/7.3.0/thread:243
#13 std::thread::_State_impl<std::thread::_Invoker<std::tuple<doris::StorageEngine::load_data_dirs(const std::vector<doris::DataDir*>&)::<lambda()> > > >::_M_run(void) (this=0x5349550) at /usr/include/c++/7.3.0/thread:186
#14 0x00000000026b642f in std::execute_native_thread_routine (__p=0x5349550) at ../../../.././libstdc++-v3/src/c++11/thread.cc:83
#15 0x00007fca7a981dc5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007fca7ac8d73d in clone () from /lib64/libc.so.6
I checked the related log:
W1201 14:52:13.408074 183882 tablet_manager.cpp:155] add duplicated tablet. force=0, res=-500, tablet_id=5164922, schema_hash=502924845, old_version=2, new_version=2, old_time=1606138765, new_time=1599296476, old_tablet_path=/home/work/app/doris/c3prc-hadoop-test/be/ssd1/data/325/5164922/502924845, new_tablet_path=/home/work/app/doris/c3prc-hadoop-test/be/ssd2/data/64/5164922/502924845
W1201 14:52:13.408120 183882 tablet_manager.cpp:843] fail to add tablet. tablet=5164922.502924845.1848811be2b4e08b-4abe001b0545fcb3[res=-500]
W1201 14:52:13.408583 183882 data_dir.cpp:690] load tablet from header failed. status:-500, tablet=5164922.502924845 // !!!critical log
W1201 14:52:13.409047 183882 alpha_rowset.cpp:327] tablet: 5164930 expect zone map size is 253, actual num is 4. If this is not the first start after upgrade, please pay attention!
W1201 14:52:13.409586 183882 alpha_rowset.cpp:327] tablet: 5164990 expect zone map size is 253, actual num is 4. If this is not the first start after upgrade, please pay attention!
W1201 14:52:13.410159 183882 alpha_rowset.cpp:327] tablet: 5165054 expect zone map size is 253, actual num is 4. If this is not the first start after upgrade, please pay attention!
W1201 14:52:13.410725 183882 alpha_rowset.cpp:327] tablet: 5165078 expect zone map size is 253, actual num is 4. If this is not the first start after upgrade, please pay attention!
I1201 14:52:13.410773 183882 tablet_manager.cpp:461] begin drop tablet. tablet_id=5165078, schema_hash=502924845
I1201 14:52:13.410786 183882 tablet_manager.cpp:1387] set tablet to shutdown state and remove it from memory. tablet_id=5165078, schema_hash=502924845, tablet_path=/home/work/app/doris/c3prc-hadoop-test/be/ssd1/data/162/5165078/502924845
I1201 14:52:13.411496 183882 tablet_meta_manager.cpp:115] save tablet meta , key:tabletmeta_5165078_502924845 meta_size=93382
W1201 14:52:13.411962 183882 tablet_manager.cpp:155] add duplicated tablet. force=0, res=0, tablet_id=5165078, schema_hash=502924845, old_version=2, new_version=2, old_time=1599296540, new_time=1606161418, old_tablet_path=/home/work/app/doris/c3prc-hadoop-test/be/ssd1/data/162/5165078/502924845, new_tablet_path=/home/work/app/doris/c3prc-hadoop-test/be/ssd2/data/506/5165078/502924845
W1201 14:52:13.412612 183882 alpha_rowset.cpp:327] tablet: 5165122 expect zone map size is 253, actual num is 4. If this is not the first start after upgrade, please pay attention!
W1201 14:52:13.413225 183882 alpha_rowset.cpp:327] tablet: 5165158 expect zone map size is 253, actual num is 4. If this is not the first start after upgrade, please pay attention!
W1201 14:52:13.413820 183882 alpha_rowset.cpp:327] tablet: 5165170 expect zone map size is 253, actual num is 4. If this is not the first start after upgrade, please pay attention!
W1201 14:52:15.418694 183882 data_dir.cpp:700] load tablets from header failed, loaded tablet: 45330, error tablet: 1, path: /home/work/app/doris/c3prc-hadoop-test/be/ssd2
F1201 14:52:15.418807 183882 data_dir.cpp:705] load tablets encounter failure. stop BE process. path: /home/work/app/doris/c3prc-hadoop-test/be/ssd2
The log shows that when a tablet is loaded from another data dir with the same tablet id as one already loaded, the load can fail (res=-500), and the BE then exits.
After reading the code:
https://github.com/apache/incubator-doris/blob/df1f06e60b1339ef6e2756d0c4cb492cb64986c7/be/src/olap/tablet_manager.cpp#L130-L151
I suspect there is a bug here. Data dirs are loaded in parallel by multiple threads, so a tablet that is loaded later may be older than one loaded previously. We should not assume that a later-loaded tablet must be newer (judged by version and creation time).
Expected behavior
When a tablet older than the one already loaded is found, just skip it instead of failing.
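The expected behavior could be sketched roughly as below. This is only an illustration, not the actual Doris code: `TabletInfo`, `DuplicateAction`, and `resolve_duplicate` are hypothetical names, and the comparison order (version first, then creation time, with ties keeping the already loaded tablet) is an assumption based on the fields printed in the "add duplicated tablet" log line.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Hypothetical, simplified view of the fields printed in the
// "add duplicated tablet" log line. Not the real Doris Tablet class.
struct TabletInfo {
    int64_t tablet_id;
    int64_t version;        // old_version / new_version in the log
    int64_t creation_time;  // old_time / new_time in the log (epoch seconds)
};

enum class DuplicateAction {
    kKeepLoaded,           // incoming tablet is older or equal: skip it, no error
    kReplaceWithIncoming,  // incoming tablet is newer: drop the loaded one
};

// Decide what to do when `incoming` has the same tablet id as `loaded`.
// The result does not depend on which data dir happened to load first,
// so parallel loading order cannot turn an older duplicate into a failure.
inline DuplicateAction resolve_duplicate(const TabletInfo& loaded,
                                         const TabletInfo& incoming) {
    assert(loaded.tablet_id == incoming.tablet_id);
    // Assumed ordering: compare version first, then creation time.
    const bool incoming_is_newer =
        std::tie(incoming.version, incoming.creation_time) >
        std::tie(loaded.version, loaded.creation_time);
    return incoming_is_newer ? DuplicateAction::kReplaceWithIncoming
                             : DuplicateAction::kKeepLoaded;
}
```

With the values from the first warning in the log above (tablet 5164922, both versions 2, old_time=1606138765, new_time=1599296476), this sketch returns `kKeepLoaded`, so the older copy on ssd2 would be skipped rather than counted as an error tablet. With the values for tablet 5165078 (old_time=1599296540, new_time=1606161418) it returns `kReplaceWithIncoming`, matching the drop-and-replace that the log already shows.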