There should be memory leak in Doris #690

@kangkaisen

Describe the bug
This graph shows the mem.memfree.percent monitor for a Doris BE cluster that runs stream load:
[Image: doris-memory-leak]

This week, I found that a few BEs had gone down without a stack trace or core dump. Using the dmesg command, I found the reason:

[26421178.578265] Out of memory: Kill process 11543 (palo_be) score 941 or sacrifice child
[26421178.578842] Killed process 11543 (palo_be) total-vm:129318072kB, anon-rss:123827392kB, file-rss:0kB
[26421178.604663] palo_be: page allocation failure: order:0, mode:0x201da
[26421178.604666] CPU: 35 PID: 11543 Comm: palo_be Tainted: GF    U     O--------------   3.10.0-229.el7.x86_64 #1
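
To put the kill message in perspective, the sizes it reports can be converted from kB to GiB. The following is a minimal sketch that assumes the "Killed process" line format shown above; the function name `parse_oom_kill` is hypothetical and not part of Doris.

```python
import re

# Pattern for the "Killed process" line as it appears in the dmesg output above.
# Assumption: sizes are reported in kB, as this kernel's message does.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, "
    r"anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB"
)

KIB_PER_GIB = 1024 * 1024


def parse_oom_kill(line):
    """Return pid, command name and sizes in GiB, or None if the line does not match."""
    m = KILLED_RE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m.group("pid")),
        "comm": m.group("comm"),
        "total_vm_gib": int(m.group("total_vm")) / KIB_PER_GIB,
        "anon_rss_gib": int(m.group("anon_rss")) / KIB_PER_GIB,
        "file_rss_gib": int(m.group("file_rss")) / KIB_PER_GIB,
    }


line = (
    "[26421178.578842] Killed process 11543 (palo_be) "
    "total-vm:129318072kB, anon-rss:123827392kB, file-rss:0kB"
)
info = parse_oom_kill(line)
print(f"{info['comm']} (pid {info['pid']}): anon-rss {info['anon_rss_gib']:.1f} GiB")
```

For the line above this reports roughly 118 GiB of anonymous resident memory held by palo_be when it was killed.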

So I think Doris has a memory leak.

For comparison, this graph shows the mem.memfree.percent monitor for two Doris BE clusters without stream load:
[Image: doris-memory]

So I guess the memory leak is related to stream load.

Additional context

Today I tried debugging this issue with valgrind.

I added the following command to the start_be.sh script and restarted the BE:

nohup $LIMIT /usr/local/bin/valgrind --tool=memcheck --leak-check=full --log-file=mem-log --error-limit=no ${DORIS_HOME}/lib/palo_be

But after running for only a few minutes, the BE crashed with a core dump:

==24637== Process terminating with default action of signal 11 (SIGSEGV)
==24637==    at 0x242E191: tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**) (central_freelist.cc:298)
==24637==    by 0x242E47B: tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) (central_freelist.cc:282)
==24637==    by 0x242E576: tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) (central_freelist.cc:264)
==24637==    by 0x243AA42: tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long)) (thread_cache.cc:126)
==24637==    by 0x258C104: tcmalloc::allocate_full_cpp_throw_oom(unsigned long) (in /opt/meituan/palo/be/lib/palo_be)
==24637==    by 0xE6EF5B: doris::BRpcService::BRpcService(doris::ExecEnv*) (brpc_service.cpp:30)
==24637==    by 0xA45B7D: main (doris_main.cpp:162)

==24637== Invalid free() / delete / delete[] / realloc()
==24637==    at 0x4E3EF7D: free (vg_replace_malloc.c:530)
==24637==    by 0x5ADEB7B: __libc_freeres (in /usr/lib64/libc-2.17.so)
==24637==    by 0x4C38749: _vgnU_freeres (vg_preloaded.c:77)
==24637==    by 0x2363: ???
==24637==    by 0x242E47B: tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) (central_freelist.cc:282)
==24637==    by 0x242E576: tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) (central_freelist.cc:264)
==24637==    by 0x243AA42: tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long)) (thread_cache.cc:126)
==24637==    by 0x258C104: tcmalloc::allocate_full_cpp_throw_oom(unsigned long) (in /opt/meituan/palo/be/lib/palo_be)
==24637==    by 0xE6EF5B: doris::BRpcService::BRpcService(doris::ExecEnv*) (brpc_service.cpp:30)

The following is partial output from valgrind:

==5145== 164,080 bytes in 5 blocks are definitely lost in loss record 4,148 of 4,200
==5145==    at 0x4E3DE83: malloc (vg_replace_malloc.c:299)
==5145==    by 0x5A2F280: __alloc_dir (in /usr/lib64/libc-2.17.so)
==5145==    by 0xC87171: doris::OlapStore::_check_path_exist() (store.cpp:104)
==5145==    by 0xC8A273: doris::OlapStore::load() (store.cpp:86)
==5145==    by 0xC62021: doris::OLAPEngine::open() (olap_engine.cpp:290)
==5145==    by 0xC635CC: doris::OLAPEngine::open(doris::EngineOptions const&, doris::OLAPEngine**) (olap_engine.cpp:95)
==5145==    by 0xA45ADA: main (doris_main.cpp:138)



==5145== 128 bytes in 16 blocks are definitely lost in loss record 2,604 of 4,200
==5145==    at 0x4E3EB48: operator new[](unsigned long) (vg_replace_malloc.c:423)
==5145==    by 0x137AC7A: rocksdb::VersionStorageInfo::VersionStorageInfo(rocksdb::InternalKeyComparator const*, rocksdb::Comparator const*, int, rocksdb::CompactionStyle, rocksdb::VersionStorageInfo*, bool) (version_set.cc:1091)
==5145==    by 0x137B378: rocksdb::Version::Version(rocksdb::ColumnFamilyData*, rocksdb::VersionSet*, rocksdb::EnvOptions const&, unsigned long) (version_set.cc:1149)
==5145==    by 0x137B470: rocksdb::VersionSet::CreateColumnFamily(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit*) (version_set.cc:4152)
==5145==    by 0x137EC64: rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool) (version_set.cc:3175)
==5145==    by 0x132AF2C: rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool) (db_impl_open.cc:379)
==5145==    by 0x132BC9B: rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool) (db_impl_open.cc:1067)
==5145==    by 0x132D552: rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**) (db_impl_open.cc:1008)
==5145==    by 0xCFC9FB: doris::OlapMeta::init() (olap_meta.cpp:81)
==5145==    by 0xC89D3C: doris::OlapStore::_init_meta() (store.cpp:284)
==5145==    by 0xC8A470: doris::OlapStore::load() (store.cpp:97)
==5145==    by 0xC62021: doris::OLAPEngine::open() (olap_engine.cpp:290)



==5145== 63,521,561 bytes in 13,309 blocks are possibly lost in loss record 4,191 of 4,200
==5145==    at 0x4E3EB48: operator new[](unsigned long) (vg_replace_malloc.c:423)
==5145==    by 0x13E4ADB: rocksdb::UncompressBlockContentsForCompressionType(char const*, unsigned long, rocksdb::BlockContents*, unsigned int, rocksdb::Slice const&, rocksdb::CompressionType, rocksdb::ImmutableCFOptions const&) (format.cc:286)
==5145==    by 0x13E4EA8: rocksdb::UncompressBlockContents(char const*, unsigned long, rocksdb::BlockContents*, unsigned int, rocksdb::Slice const&, rocksdb::ImmutableCFOptions const&) (format.cc:393)
==5145==    by 0x13E0FD8: rocksdb::BlockFetcher::ReadBlockContents() (block_fetcher.cc:228)
==5145==    by 0x13CF46B: rocksdb::(anonymous namespace)::ReadBlockFromFile(rocksdb::RandomAccessFileReader*, rocksdb::FilePrefetchBuffer*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, std::unique_ptr<rocksdb::Block, std::default_delete<rocksdb::Block> >*, rocksdb::ImmutableCFOptions const&, bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&, unsigned long, unsigned long) (block_based_table_reader.cc:87)
==5145==    by 0x13D1BD0: rocksdb::BlockBasedTable::MaybeLoadDataBlockToCache(rocksdb::FilePrefetchBuffer*, rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::Slice, rocksdb::BlockBasedTable::CachableEntry<rocksdb::Block>*, bool, rocksdb::GetContext*) (block_based_table_reader.cc:1614)
==5145==    by 0x13D1EAB: rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::BlockIter*, bool, rocksdb::GetContext*, rocksdb::Status) (block_based_table_reader.cc:1494)
==5145==    by 0x13D29BD: rocksdb::BlockBasedTableIterator::InitDataBlock() (block_based_table_reader.cc:1910)
==5145==    by 0x13D2C79: rocksdb::BlockBasedTableIterator::FindKeyForward() (block_based_table_reader.cc:1933)
==5145==    by 0x136F778: Next (iterator_wrapper.h:61)
==5145==    by 0x136F778: rocksdb::(anonymous namespace)::LevelIterator::Next() (version_set.cc:613)
==5145==    by 0x13F3F34: Next (iterator_wrapper.h:61)
==5145==    by 0x13F3F34: rocksdb::MergingIterator::Next() (merging_iterator.cc:202)
==5145==    by 0x14B68CE: rocksdb::DBIter::Next() (db_iter.cc:375)


==5145== LEAK SUMMARY:
==5145==    definitely lost: 164,518 bytes in 30 blocks
==5145==    indirectly lost: 0 bytes in 0 blocks
==5145==      possibly lost: 69,780,519 bytes in 41,474 blocks
==5145==    still reachable: 2,611,746,489 bytes in 32,280,429 blocks
==5145==                       of which reachable via heuristic:
==5145==                         newarray           : 7,304 bytes in 25 blocks
==5145==                         multipleinheritance: 20,328 bytes in 3 blocks
==5145==         suppressed: 0 bytes in 0 blocks
==5145== Reachable blocks (those to which a pointer was found) are not shown.
==5145== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==5145==
==5145== For counts of detected and suppressed errors, rerun with: -v
==5145== Use --track-origins=yes to see where uninitialised values come from
==5145== ERROR SUMMARY: 6163542 errors from 1204 contexts (suppressed: 0 from 0)
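
To compare the leak categories at a glance, the LEAK SUMMARY can be parsed and ranked by size. This is a minimal sketch that assumes the summary line format shown above; `parse_leak_summary` is a hypothetical helper, and the embedded text is copied from the output above (the heuristic sub-lines are left out).

```python
import re

# Summary lines copied from the valgrind output above.
SUMMARY = """\
==5145== LEAK SUMMARY:
==5145==    definitely lost: 164,518 bytes in 30 blocks
==5145==    indirectly lost: 0 bytes in 0 blocks
==5145==      possibly lost: 69,780,519 bytes in 41,474 blocks
==5145==    still reachable: 2,611,746,489 bytes in 32,280,429 blocks
==5145==         suppressed: 0 bytes in 0 blocks
"""

# Assumption: each category line looks like "<kind>: N bytes in M blocks".
LINE_RE = re.compile(
    r"^==\d+==\s+"
    r"(definitely lost|indirectly lost|possibly lost|still reachable|suppressed):"
    r"\s+([\d,]+) bytes in ([\d,]+) blocks",
    re.MULTILINE,
)


def parse_leak_summary(text):
    """Map each leak kind to a (bytes, blocks) tuple."""
    result = {}
    for kind, nbytes, nblocks in LINE_RE.findall(text):
        result[kind] = (int(nbytes.replace(",", "")), int(nblocks.replace(",", "")))
    return result


summary = parse_leak_summary(SUMMARY)
total = sum(nbytes for nbytes, _ in summary.values())

# Print categories from largest to smallest, with their share of the total.
for kind, (nbytes, nblocks) in sorted(
    summary.items(), key=lambda kv: kv[1][0], reverse=True
):
    share = 100 * nbytes / total
    print(f"{kind:>16}: {nbytes / 2**20:10.1f} MiB in {nblocks:>10,} blocks ({share:5.1f}%)")
```

In this run "still reachable" accounts for about 97% of the reported bytes, while "definitely lost" is only about 160 KiB. Memory that keeps growing but is still referenced would show up in the still-reachable category, so rerunning with --show-leak-kinds=all, as the output suggests, may be worth trying.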

I am now trying the HEAPCHECK tool.
