Closed
Description
Describe the bug
This picture shows the mem.memfree.percent monitor for a Doris BE cluster that is running stream load.
This week, I found that a few BEs went down without a stack trace or core dump. Running dmesg revealed the reason:
[26421178.578265] Out of memory: Kill process 11543 (palo_be) score 941 or sacrifice child
[26421178.578842] Killed process 11543 (palo_be) total-vm:129318072kB, anon-rss:123827392kB, file-rss:0kB
[26421178.604663] palo_be: page allocation failure: order:0, mode:0x201da
[26421178.604666] CPU: 35 PID: 11543 Comm: palo_be Tainted: GF U O-------------- 3.10.0-229.el7.x86_64 #1
So I think Doris has a memory leak.
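To put the dmesg figures in more readable units, here is a minimal Python sketch that parses the "Killed process" line above and converts the kB values to GiB. The function name `parse_oom_kill` and the regular expression are illustrative assumptions, not part of Doris or any kernel tooling.

```python
import re

# Matches the "Killed process" line that the kernel OOM killer writes to dmesg.
# The pattern is an assumption based on the line format shown in this report.
OOM_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


def parse_oom_kill(line):
    """Return pid, command name, and memory sizes in GiB, or None if no match."""
    m = OOM_RE.search(line)
    if m is None:
        return None
    kib_per_gib = 1024 * 1024  # dmesg reports sizes in kB (KiB)
    return {
        "pid": int(m["pid"]),
        "comm": m["comm"],
        "total_vm_gib": int(m["total_vm"]) / kib_per_gib,
        "anon_rss_gib": int(m["anon_rss"]) / kib_per_gib,
    }


line = (
    "[26421178.578842] Killed process 11543 (palo_be) "
    "total-vm:129318072kB, anon-rss:123827392kB, file-rss:0kB"
)
info = parse_oom_kill(line)
print(f"{info['comm']} pid={info['pid']} anon-rss={info['anon_rss_gib']:.1f} GiB")
```

For the line in this report, the anonymous RSS of palo_be at kill time works out to roughly 118 GiB.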
This picture shows the mem.memfree.percent monitor for two Doris BE clusters without stream load.
So I suspect the memory leak is related to stream load.
Additional context
Today I tried debugging this issue with valgrind.
I added the following command to the start_be.sh script and restarted the BE:
nohup $LIMIT /usr/local/bin/valgrind --tool=memcheck --leak-check=full --log-file=mem-log --error-limit=no ${DORIS_HOME}/lib/palo_be
But after running for only several minutes, the BE crashed with a core dump:
==24637== Process terminating with default action of signal 11 (SIGSEGV)
==24637== at 0x242E191: tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**) (central_freelist.cc:298)
==24637== by 0x242E47B: tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) (central_freelist.cc:282)
==24637== by 0x242E576: tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) (central_freelist.cc:264)
==24637== by 0x243AA42: tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long)) (thread_cache.cc:126)
==24637== by 0x258C104: tcmalloc::allocate_full_cpp_throw_oom(unsigned long) (in /opt/meituan/palo/be/lib/palo_be)
==24637== by 0xE6EF5B: doris::BRpcService::BRpcService(doris::ExecEnv*) (brpc_service.cpp:30)
==24637== by 0xA45B7D: main (doris_main.cpp:162)
==24637== Invalid free() / delete / delete[] / realloc()
==24637== at 0x4E3EF7D: free (vg_replace_malloc.c:530)
==24637== by 0x5ADEB7B: __libc_freeres (in /usr/lib64/libc-2.17.so)
==24637== by 0x4C38749: _vgnU_freeres (vg_preloaded.c:77)
==24637== by 0x2363: ???
==24637== by 0x242E47B: tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) (central_freelist.cc:282)
==24637== by 0x242E576: tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) (central_freelist.cc:264)
==24637== by 0x243AA42: tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long)) (thread_cache.cc:126)
==24637== by 0x258C104: tcmalloc::allocate_full_cpp_throw_oom(unsigned long) (in /opt/meituan/palo/be/lib/palo_be)
==24637== by 0xE6EF5B: doris::BRpcService::BRpcService(doris::ExecEnv*) (brpc_service.cpp:30)
The following is partial valgrind output:
==5145== 164,080 bytes in 5 blocks are definitely lost in loss record 4,148 of 4,200
==5145== at 0x4E3DE83: malloc (vg_replace_malloc.c:299)
==5145== by 0x5A2F280: __alloc_dir (in /usr/lib64/libc-2.17.so)
==5145== by 0xC87171: doris::OlapStore::_check_path_exist() (store.cpp:104)
==5145== by 0xC8A273: doris::OlapStore::load() (store.cpp:86)
==5145== by 0xC62021: doris::OLAPEngine::open() (olap_engine.cpp:290)
==5145== by 0xC635CC: doris::OLAPEngine::open(doris::EngineOptions const&, doris::OLAPEngine**) (olap_engine.cpp:95)
==5145== by 0xA45ADA: main (doris_main.cpp:138)
==5145== 128 bytes in 16 blocks are definitely lost in loss record 2,604 of 4,200
==5145== at 0x4E3EB48: operator new[](unsigned long) (vg_replace_malloc.c:423)
==5145== by 0x137AC7A: rocksdb::VersionStorageInfo::VersionStorageInfo(rocksdb::InternalKeyComparator const*, rocksdb::Comparator const*, int, rocksdb::CompactionStyle, rocksdb::VersionStorageInfo*, bool) (version_set.cc:1091)
==5145== by 0x137B378: rocksdb::Version::Version(rocksdb::ColumnFamilyData*, rocksdb::VersionSet*, rocksdb::EnvOptions const&, unsigned long) (version_set.cc:1149)
==5145== by 0x137B470: rocksdb::VersionSet::CreateColumnFamily(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit*) (version_set.cc:4152)
==5145== by 0x137EC64: rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool) (version_set.cc:3175)
==5145== by 0x132AF2C: rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool) (db_impl_open.cc:379)
==5145== by 0x132BC9B: rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool) (db_impl_open.cc:1067)
==5145== by 0x132D552: rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**) (db_impl_open.cc:1008)
==5145== by 0xCFC9FB: doris::OlapMeta::init() (olap_meta.cpp:81)
==5145== by 0xC89D3C: doris::OlapStore::_init_meta() (store.cpp:284)
==5145== by 0xC8A470: doris::OlapStore::load() (store.cpp:97)
==5145== by 0xC62021: doris::OLAPEngine::open() (olap_engine.cpp:290)
==5145== 63,521,561 bytes in 13,309 blocks are possibly lost in loss record 4,191 of 4,200
==5145== at 0x4E3EB48: operator new[](unsigned long) (vg_replace_malloc.c:423)
==5145== by 0x13E4ADB: rocksdb::UncompressBlockContentsForCompressionType(char const*, unsigned long, rocksdb::BlockContents*, unsigned int, rocksdb::Slice const&, rocksdb::CompressionType, rocksdb::ImmutableCFOptions const&) (format.cc:286)
==5145== by 0x13E4EA8: rocksdb::UncompressBlockContents(char const*, unsigned long, rocksdb::BlockContents*, unsigned int, rocksdb::Slice const&, rocksdb::ImmutableCFOptions const&) (format.cc:393)
==5145== by 0x13E0FD8: rocksdb::BlockFetcher::ReadBlockContents() (block_fetcher.cc:228)
==5145== by 0x13CF46B: rocksdb::(anonymous namespace)::ReadBlockFromFile(rocksdb::RandomAccessFileReader*, rocksdb::FilePrefetchBuffer*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, std::unique_ptr<rocksdb::Block, std::default_delete<rocksdb::Block> >*, rocksdb::ImmutableCFOptions const&, bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&, unsigned long, unsigned long) (block_based_table_reader.cc:87)
==5145== by 0x13D1BD0: rocksdb::BlockBasedTable::MaybeLoadDataBlockToCache(rocksdb::FilePrefetchBuffer*, rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::Slice, rocksdb::BlockBasedTable::CachableEntry<rocksdb::Block>*, bool, rocksdb::GetContext*) (block_based_table_reader.cc:1614)
==5145== by 0x13D1EAB: rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::BlockIter*, bool, rocksdb::GetContext*, rocksdb::Status) (block_based_table_reader.cc:1494)
==5145== by 0x13D29BD: rocksdb::BlockBasedTableIterator::InitDataBlock() (block_based_table_reader.cc:1910)
==5145== by 0x13D2C79: rocksdb::BlockBasedTableIterator::FindKeyForward() (block_based_table_reader.cc:1933)
==5145== by 0x136F778: Next (iterator_wrapper.h:61)
==5145== by 0x136F778: rocksdb::(anonymous namespace)::LevelIterator::Next() (version_set.cc:613)
==5145== by 0x13F3F34: Next (iterator_wrapper.h:61)
==5145== by 0x13F3F34: rocksdb::MergingIterator::Next() (merging_iterator.cc:202)
==5145== by 0x14B68CE: rocksdb::DBIter::Next() (db_iter.cc:375)
==5145== LEAK SUMMARY:
==5145== definitely lost: 164,518 bytes in 30 blocks
==5145== indirectly lost: 0 bytes in 0 blocks
==5145== possibly lost: 69,780,519 bytes in 41,474 blocks
==5145== still reachable: 2,611,746,489 bytes in 32,280,429 blocks
==5145== of which reachable via heuristic:
==5145== newarray : 7,304 bytes in 25 blocks
==5145== multipleinheritance: 20,328 bytes in 3 blocks
==5145== suppressed: 0 bytes in 0 blocks
==5145== Reachable blocks (those to which a pointer was found) are not shown.
==5145== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==5145==
==5145== For counts of detected and suppressed errors, rerun with: -v
==5145== Use --track-origins=yes to see where uninitialised values come from
==5145== ERROR SUMMARY: 6163542 errors from 1204 contexts (suppressed: 0 from 0)
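To compare the leak categories at a glance, the LEAK SUMMARY above can be tabulated with a short script. This is only a sketch: the helper name `parse_leak_summary` and the regular expression are assumptions written against the output format shown here, and the sample text is copied from the summary lines in this report.

```python
import re

# Matches the per-category lines of a Memcheck LEAK SUMMARY block.
# The trailing colon keeps individual "loss record" lines from matching.
LEAK_RE = re.compile(
    r"(?P<kind>definitely lost|indirectly lost|possibly lost|still reachable):\s+"
    r"(?P<bytes>[\d,]+) bytes in (?P<blocks>[\d,]+) blocks"
)


def parse_leak_summary(text):
    """Map each leak category to its byte count."""
    summary = {}
    for m in LEAK_RE.finditer(text):
        summary[m["kind"]] = int(m["bytes"].replace(",", ""))
    return summary


log = """\
==5145== definitely lost: 164,518 bytes in 30 blocks
==5145== indirectly lost: 0 bytes in 0 blocks
==5145== possibly lost: 69,780,519 bytes in 41,474 blocks
==5145== still reachable: 2,611,746,489 bytes in 32,280,429 blocks
"""
summary = parse_leak_summary(log)
for kind, nbytes in summary.items():
    print(f"{kind:>16}: {nbytes / 2**20:10.1f} MiB")
```

In this run, "still reachable" (about 2.4 GiB) is far larger than "definitely lost" (under 1 MiB). That may mean the growth sits in memory that is still referenced, which valgrind only lists when rerun with --show-leak-kinds=all as the output suggests.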
I am now trying the HEAPCHECK tool.

