
Conversation

@rdhabalia (Contributor) commented Feb 6, 2019

Motivation

Due to #1937, bookies may not run GC, which can lead to a large number of entry-log files (10K+) on disk. After a bookie restart, GC is triggered and tries to extract and load all entry-log metadata into memory by adding it to entryLogMetaMap, which eventually causes OOM in the bookie, and the bookie keeps restarting. This issue can be easily reproduced by adding a large number of EntryLogMetadata objects to the map; the JVM will go OOM.

Changes

The bookie should extract entry-log metadata in chunks to avoid memory pressure and OOM issues.
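
Below is a rough, illustrative sketch of the chunked approach (class, method, and field names are placeholders, not the actual GarbageCollectorThread code): instead of extracting metadata for every entry log in a single pass, the scan handles at most a configured number of entry logs per iteration, so only one chunk's metadata is resident in memory at a time.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not the actual BookKeeper implementation.
class ChunkedGcSketch {
    // Chunk size; the option proposed in this PR is maxEntryLoggersPerScan.
    private final int maxEntryLoggersPerScan = 2000;

    void runGc(List<Long> allEntryLogIds) {
        for (int start = 0; start < allEntryLogIds.size(); start += maxEntryLoggersPerScan) {
            int end = Math.min(start + maxEntryLoggersPerScan, allEntryLogIds.size());
            List<Long> chunk = allEntryLogIds.subList(start, end);

            // Only this chunk's metadata is held in memory at a time.
            Map<Long, EntryLogMetadata> metaChunk = new HashMap<>();
            for (long logId : chunk) {
                metaChunk.put(logId, extractMetaFromEntryLog(logId)); // hypothetical helper
            }

            gcEntryLogs(metaChunk); // hypothetical helper: remove/compact logs in this chunk
        }
    }

    EntryLogMetadata extractMetaFromEntryLog(long logId) { return new EntryLogMetadata(); }
    void gcEntryLogs(Map<Long, EntryLogMetadata> metaChunk) { /* no-op in this sketch */ }

    // Stand-in for org.apache.bookkeeper.bookie.EntryLogMetadata.
    static class EntryLogMetadata { }
}
```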

@merlimat (Contributor) commented Feb 6, 2019

I think a better long-term fix here is to remove the entryLogMetadata map from the bookie's memory and instead store it on disk (e.g. in RocksDB).

That will ensure that the amount of memory required by a bookie is completely independent of the amount of data stored on it.
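
A minimal sketch of that direction using the RocksDB Java API (the class name, key layout, and serialization are assumptions for illustration, not the eventual BookKeeper design): metadata is keyed by entry-log id on disk, so GC can look entries up or stream through them without pinning the whole map in the heap.

```java
import java.nio.ByteBuffer;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Illustrative sketch of a disk-backed entry-log metadata store.
class DiskBackedEntryLogMetadataStore implements AutoCloseable {
    static { RocksDB.loadLibrary(); }

    private final Options options;
    private final RocksDB db;

    DiskBackedEntryLogMetadataStore(String path) throws RocksDBException {
        this.options = new Options().setCreateIfMissing(true);
        this.db = RocksDB.open(options, path);
    }

    // Key: entry-log id as 8 bytes; value: serialized metadata (format left open here).
    private static byte[] key(long entryLogId) {
        return ByteBuffer.allocate(Long.BYTES).putLong(entryLogId).array();
    }

    void put(long entryLogId, byte[] serializedMetadata) throws RocksDBException {
        db.put(key(entryLogId), serializedMetadata);
    }

    byte[] get(long entryLogId) throws RocksDBException {
        return db.get(key(entryLogId)); // null if absent
    }

    void remove(long entryLogId) throws RocksDBException {
        db.delete(key(entryLogId));
    }

    @Override
    public void close() {
        db.close();
        options.close();
    }
}
```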

@rdhabalia (Contributor, Author) commented

> I think a better long-term fix here is to remove the entryLogMetadata map from the bookie's memory and instead store it on disk (e.g. in RocksDB).

That could be a good idea. I will make the change.

@reddycharan (Contributor) commented

@rdhabalia To begin with, this is a major change touching one of the core components: GC. Functional tests validating correctness and tests proving there won't be a regression would be helpful.

@reddycharan (Contributor) commented

@rdhabalia correct me if I'm wrong with the following math.

First of all, I'm surprised that you have 10K entry-log files in a bookie. What is your 'logSizeLimit'? The default value is 1 GB. If it is the default, then 10K entry-log files would take 10 TB.

Let's assume you have 10K entry-log files; for each entry-log file there will be an EntryLogMetadata. Each EntryLogMetadata mainly contains a Map of <long ledgerId, long ledgerSize>. Let's assume that, on average, each entry-log file contains entries of 100 ledgers. That would be 100 * (8+8) = 1.6 KB; including other overhead, let's make it 4 KB. So 10K * 4 KB = 40 MB. Even if we assume there are entries of 1000 ledgers in an entry log, including overhead the memory size of an EntryLogMetadata should be 40 KB, so 10K * 40 KB = 400 MB.

So I'm not sure how the memory needed for entryLogMetaMap could become a bottleneck.
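
For what it's worth, a small back-of-the-envelope check of the estimate above (the with-overhead figures are the assumed values from the comment, not measurements):

```java
// Recomputes the memory estimate for entryLogMetaMap sketched above.
public class EntryLogMetaMemoryEstimate {
    public static void main(String[] args) {
        long entryLogFiles = 10_000;

        // ~100 ledgers per entry log: raw longs vs. assumed ~4 KB with map overhead.
        long rawPerLog = 100 * (8 + 8);          // 1,600 bytes
        long withOverheadPerLog = 4_000;         // assumed
        System.out.println("100 ledgers/log, raw:  " + entryLogFiles * rawPerLog / 1_000_000 + " MB");
        System.out.println("100 ledgers/log, est.: " + entryLogFiles * withOverheadPerLog / 1_000_000 + " MB");

        // ~1000 ledgers per entry log: assumed ~40 KB per EntryLogMetadata with overhead.
        long withOverheadPerLargeLog = 40_000;   // assumed
        System.out.println("1000 ledgers/log, est.: " + entryLogFiles * withOverheadPerLargeLog / 1_000_000 + " MB");
    }
}
```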

entryLogMetaMap = extractMetaFromEntryLogs(entryLogMetaMap, isIteration);

// gc inactive/deleted ledgers
doGcLedgers();
Contributor commented:
This is a very expensive operation. It iterates over all the ledger ranges in the metadata server / all the ledgers in the bookie; it shouldn't be called multiple times.

Contributor (Author) replied:

Yes, that's correct. That's why I have provided the option maxEntryLoggersPerScan=2000. In the normal case a bookie will have < 500 entry-log files, so it will not require multiple iterations, but it targets the issue where a bookie is not able to recover with a large number of entry-log files. Also, @merlimat suggested persisting the entryLogMetadataMap into RocksDB, so I have created PR #1949.
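
For reference, a minimal sketch of how the option mentioned above might look in the bookie server configuration; the file name and formatting are assumptions, only the option name comes from this PR:

```
# Assumed bk_server.conf excerpt: cap the number of entry-log files whose
# metadata is extracted per GC scan pass (option proposed in this PR).
maxEntryLoggersPerScan=2000
```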

doGcLedgers();

// gc entry logs
doGcEntryLogs();
Contributor commented:
This method iterates over all the entries in entryLogMetaMap; again, calling this multiple times is not sensible.

@rdhabalia (Contributor, Author) commented

@reddycharan
The bookie reached 10K entry-log files due to the #1937 bug, and based on bookie stats it had > 20M ledgers across all entry-log files. So, as soon as GC triggered, the bookie was dying due to OOM, and the heap dump gave the root cause: the entryLogMetadata map data structure, which holds a map of entryLogId -> ledgerId -> ledgerSize, was causing the issue and ended up taking over 14 GB in total on the bookie node. I ran this patch on the node with maxEntryLoggersPerScan=2K, which prevented the bookie from going OOM, and GC/compaction successfully cleaned up all 10K entry-log files.

@reddycharan (Contributor) commented

If you are planning to do #1949, then is this change needed?

@eolivelli (Contributor) commented

@merlimat @reddycharan what's the status of this patch?

zymap pushed a commit that referenced this pull request Dec 22, 2021
…1949)

### Motivation

It addresses #1938 by providing an interface for `entryLogMetadataMap`, so that `entryLogMetadata` can be cached in RocksDB instead of main memory.

### Modification
- Introduced a RocksDB-based entryLogMetadataMap to cache entry-log metadata.
- It can fall back to the in-memory implementation via the `gcPersistentEntrylogMetadataMapEnabled` flag; the flag can be removed in a future release, keeping only the RocksDB implementation (see the config sketch after this commit message).

### Result
- It helps avoid storing entryLogMetadata in main memory.
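
Based on the commit message above, switching to the RocksDB-backed metadata map would presumably be a single configuration toggle; the file name and value below are assumptions, only the flag name comes from the commit:

```
# Assumed bk_server.conf excerpt: use the RocksDB-backed EntryLogMetadataMap
# for GC instead of the in-memory map (flag named in the commit above).
gcPersistentEntrylogMetadataMapEnabled=true
```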
Ghatage pushed a commit to sijie/bookkeeper that referenced this pull request Jul 12, 2024
…pache#1949)
