All OLAP scanners hang on ShardedLRUCache's lock #1824

@imay

Today we found that none of the backends were responding.
When I logged in and checked the stacks, I found that all scanner threads were waiting on a lock in ShardedLRUCache. The stack looks like the following:

#0  0x00007f31aef69e24 in __lll_lock_wait () from /opt/compiler/gcc-4.8.2/lib64/libpthread.so.0
#1  0x00007f31aef656d9 in _L_lock_535 () from /opt/compiler/gcc-4.8.2/lib64/libpthread.so.0
#2  0x00007f31aef65500 in pthread_mutex_lock () from /opt/compiler/gcc-4.8.2/lib64/libpthread.so.0
#3  0x0000000002192a52 in pthread_mutex_lock ()
#4  0x0000000000d58630 in doris::Mutex::lock() ()
#5  0x0000000000d950e3 in doris::ShardedLRUCache::lookup(doris::CacheKey const&) ()
#6  0x0000000000d87d9c in doris::FileHandler::open_with_cache(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) ()
#7  0x0000000000da1c76 in doris::SegmentReader::_load_segment_file() ()
#8  0x0000000000da926a in doris::SegmentReader::init(bool) ()
#9  0x0000000000d68cca in doris::ColumnData::_seek_to_block(doris::RowBlockPosition const&, bool) ()
#10 0x0000000000d69f42 in doris::ColumnData::_get_block ()
#11 0x0000000000d6c25b in doris::ColumnData::_seek_to_row(doris::RowCursor const&, bool, bool) ()
#12 0x0000000000d6c69c in doris::ColumnData::prepare_block_read(doris::RowCursor const*, bool, doris::RowCursor const*, bool, doris::RowBlock**) ()
#13 0x0000000000d20326 in doris::Reader::_attach_data_to_merge_set(bool, bool*) ()
#14 0x0000000000d23476 in doris::Reader::init(doris::ReaderParams const&) ()
#15 0x000000000127d54a in doris::OlapScanner::open() ()
#16 0x0000000001254f94 in doris::OlapScanNode::scanner_thread(doris::OlapScanner*) ()
#17 0x0000000000dfea96 in doris::PriorityThreadPool::work_thread(int) ()
#18 0x000000000171cb5d in thread_proxy ()
#19 0x00007f31aef631c3 in start_thread () from /opt/compiler/gcc-4.8.2/lib64/libpthread.so.0
#20 0x00007f31af26012d in clone () from /opt/compiler/gcc-4.8.2/lib64/libc.so.6

I also found that this lock is held by the thread that prunes the FdCache. Its stack looks like this:

#0  0x00007f31af25264d in close () from /opt/compiler/gcc-4.8.2/lib64/libc.so.6
#1  0x0000000000d8a8b8 in doris::FileHandler::_delete_cache_file_descriptor(doris::CacheKey const&, void*) ()
#2  0x0000000000d95b2a in doris::LRUCache::_unref(doris::LRUHandle*) ()
#3  0x0000000000d95d3f in doris::ShardedLRUCache::prune() ()
#4  0x0000000000cba0f8 in doris::OLAPEngine::start_clean_fd_cache() ()
#5  0x0000000000cf125f in doris::OLAPEngine::_fd_cache_clean_callback(void*) ()
#6  0x0000000000cf130f in _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5doris10OLAPEngine16_start_bg_workerEvEUlvE4_EEEEE6_M_runEv ()
#7  0x00000000026f72ef in execute_native_thread_routine ()
#8  0x00007f31aef631c3 in start_thread () from /opt/compiler/gcc-4.8.2/lib64/libpthread.so.0
#9  0x00007f31af26012d in clone () from /opt/compiler/gcc-4.8.2/lib64/libc.so.6

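The prune thread calls close() on cached file descriptors while holding the cache lock, so every scanner that calls lookup() blocks until pruning finishes. We should improve the fd cache to avoid this.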
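One possible direction, as a minimal sketch rather than a patch against the real code: detach the evicted entries while holding the lock, then run the deleter (the close() call) after the lock is released. `SketchFdCache` and everything inside it are hypothetical names for illustration. It is a single shard with no reference counting, so unlike the real LRUCache it does not protect an fd that a reader is still using.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified fd cache (one shard, no ref counting).
// It only illustrates running the deleter outside the mutex.
class SketchFdCache {
public:
    using Entry = std::pair<std::string, int>;  // key -> file descriptor
    using Deleter = std::function<void(const std::string& key, int fd)>;

    explicit SketchFdCache(Deleter deleter) : _deleter(std::move(deleter)) {}

    void insert(const std::string& key, int fd) {
        std::list<Entry> replaced;
        {
            std::lock_guard<std::mutex> guard(_mutex);
            auto it = _index.find(key);
            if (it != _index.end()) {
                // Detach the old entry; it is destroyed after unlocking.
                replaced.splice(replaced.begin(), _lru, it->second);
            }
            _lru.emplace_front(key, fd);
            _index[key] = _lru.begin();
        }
        for (const Entry& e : replaced) _deleter(e.first, e.second);
    }

    bool lookup(const std::string& key, int* fd) {
        std::lock_guard<std::mutex> guard(_mutex);
        auto it = _index.find(key);
        if (it == _index.end()) return false;
        // Move the entry to the front (most recently used).
        _lru.splice(_lru.begin(), _lru, it->second);
        *fd = it->second->second;
        return true;
    }

    // Removes every entry and returns how many were removed.
    std::size_t prune() {
        std::list<Entry> victims;
        {
            // Critical section is only a pointer swap and a map clear.
            std::lock_guard<std::mutex> guard(_mutex);
            victims.swap(_lru);
            _index.clear();
        }
        // The potentially slow deleter (e.g. close()) runs with no lock
        // held, so concurrent lookup() calls are not blocked by it.
        for (const Entry& e : victims) _deleter(e.first, e.second);
        return victims.size();
    }

private:
    std::mutex _mutex;
    std::list<Entry> _lru;  // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> _index;
    Deleter _deleter;
};
```

In real use the deleter would call close(fd). This does not make close() faster; it only keeps a slow close() from stalling lookups on the same shard.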
