Flink 2.0: Replace Caffeine maxSize cache with LRUCache by aiborodin · Pull Request #13382 · apache/iceberg

aiborodin · 2025-06-25T08:45:27Z

We recently discovered that LRUCache, based on LinkedHashMap, has a throughput almost two times as high as the Caffeine Cache with the maximum size configured. Please, see JMH benchmark results here.

Let's use LRUCache in TableMetadataCache to improve the cache performance of the DynamicIcebergSink.

ben-manes · 2025-06-26T03:13:30Z

Is TableMetadataCache single threaded like your benchmark? If so then this is what LinkedHashMap is optimal for since the LRU updates are simple pointer swaps. If not, then you will have to synchronize access because it is not thread-safe, where concurrent usage will cause corruption and instability. This is required for every read (per the javadoc) since every access mutates the LRU order. You also have to be thoughtful about the cache hit rate because a cache miss will be far more expensive than the in-memory operations, e.g. it will have to perform I/O. If the workload is recency-biased then LRU is perfect, but if it has frequency or scans then it can be quite poor.

Caffeine is a multi-threaded cache with an adaptive eviction policy that maximizes the hit rates based on the observed workload. This does incur additional overhead but can greatly improve the overall system performance.

I adjusted Caffeine's benchmark to run as a single threaded and with 16 threads on my 14-core M3 MAX laptop using OpenJDK 24. This uses a Zipfian distribution to simulate hot/cold items with a 100% hit rate. This was not a clean system as I was on a conference call while writing code, which only hurt Caffeine since it will utilize all of the cores. In a cloud environment you will likely observe worse throughput due to virtualization, numa effects, noisy neighbors, older hardware, etc. In general you want to drive application performance decision by profiling to resolve hotspots, as small optimizations can backfire when your benchmark does not fit your real workload.

aiborodin · 2025-06-26T07:34:47Z

Thank you for a detailed reply.

TableMetadataCache is single-threaded, so it's safe to use LRUCache, based on LinkedHashMap.

It was unclear from the caffeine project description that the cache is specifically optimised for high concurrency. I also didn't find any single-threaded benchmarks online, so I wrote my own here which gave these results:

Benchmark                                     Mode  Cnt     Score     Error  Units
CacheBenchmark.testCaffeineCacheMaxSize_Get  thrpt    5   929.031 ±  84.510  ops/s
CacheBenchmark.testCaffeineCacheMaxSize_Put  thrpt    5   548.677 ±  12.191  ops/s
CacheBenchmark.testLRUCache_Get              thrpt    5  1657.313 ±  71.981  ops/s
CacheBenchmark.testLRUCache_Put              thrpt    5  1206.151 ± 112.609  ops/s

I guess it makes sense that caffeine performs worse in a single-threaded scenario due to thread synchronisation. It might be worth clarifying this on the project page so it's more visible for users.

ben-manes · 2025-06-26T07:47:41Z

Yep, yours are reasonable but you don’t need to have the loop and can let jmh handle it for better clarity of the results. Those are 100k calls per unit.

It’s not that much worse as 44M vs 76M reads/s is far faster than typically needed. Usually those who care need primitive collections and are very specialized. The much better hit rate than LRU more than compensates because milliseconds for an extra miss outweighs saving a few nanoseconds per hit. LHM as an LRU is really great, but LRU isn’t as good as people assume.

pvary · 2025-06-26T19:44:29Z

Thanks for the PR @aiborodin, and @ben-manes for the nice detailed test and explanation. We are debating sharing the TableCache on JVM level. Your highlights about the concurrent cache access will be a very useful data point in that discussion.

Currently the cache access is single threaded, so we can get away with the suggested LHM solution.

aiborodin · 2025-06-27T07:25:59Z

Thanks for your comment @pvary. I rebased this change on top of the merged #13340, so this should be ready for review.

mxm

That's a clever find! As Peter said, we want to eventually share this cache across all components of DynamicSink. We might want to re-evaluate then, but this is good for now.

What about the other instances of Caffeine in the Flink module? E.g. TableSerializerCache, HashKeyGenerator, DynamicWriteResultAggregator, DynamicWriter (they all use Caffeine).

pvary · 2025-06-27T15:42:29Z

@aiborodin: Do you want to check the other instances of the cache mentioned by @mxm in this PR, or you want to do them in another PR?

aiborodin · 2025-06-30T06:43:00Z

I replaced all Caffeine maximumSize caches with LRUCache in this PR. The only one left is DynamicWriteResultAggregator, which uses expireAfterWrite semantics:

this.specs =
    Caffeine.newBuilder().expireAfterWrite(CACHE_EXPIRATION_DURATION).softValues().build();
this.outputFileFactories =
    Caffeine.newBuilder().expireAfterWrite(CACHE_EXPIRATION_DURATION).softValues().build();

Should we replace that with LRUCache as well?

mxm · 2025-06-30T07:50:12Z

We can probably replace that one as well. There is nothing inherently different about that cache.

pvary · 2025-06-30T08:35:51Z

I replaced all Caffeine maximumSize caches with LRUCache in this PR. The only one left is DynamicWriteResultAggregator, which uses expireAfterWrite semantics:
this.specs =
    Caffeine.newBuilder().expireAfterWrite(CACHE_EXPIRATION_DURATION).softValues().build();
this.outputFileFactories =
    Caffeine.newBuilder().expireAfterWrite(CACHE_EXPIRATION_DURATION).softValues().build();
Should we replace that with LRUCache as well?

Before we move forward, let's take a step back and consider the size of the cache, and the cost of a cache miss. How many items do we need in the cache for the optimal operation:

Is this something which is static and we can configure?
Is this something which should be configured by the user and relatively stable for a single job?
Is this something which is dependent on the workload, and increases as the number of tables accessed by the job increases?

This one looks like number 3 to me. Also the cache access is only several times during a checkpoint, so the performance is less important. So for me this seems like better handled by caffeine, but I could be convinced.

Also please revisit the pervious decisions on the cache sizes, and reconsider if appropriate.

Thanks,
Peter

mxm · 2025-06-30T10:44:54Z

You are right Peter, the cache size should probably be proportional to the "active" tables, but using a time-based eviction policy, we have seen worse performance. That was the reason we switched to max size / LRU. So there should probably be an LRU eviction policy not across ALL tables, but per-table. We could extend LRUCache to use a LinkedList per table. We probably have to then compare it again to something like Caffeine to decide if the custom cache implementation is still worth it.

pvary · 2025-06-30T18:05:29Z

using a time-based eviction policy, we have seen worse performance

IIUC, we access this cache only a few times every checkpoint. Sum of (table x parallelism for the table). Not very few, but not like for every record. Probably doesn't worth the extra complexity.

We recently discovered that LRUCache, based on LinkedHashMap, performs almost twice as fast as the Caffeine max size cache. Let's replace the caffeine cache to optimise the performance.

aiborodin · 2025-07-01T07:21:55Z

I agree that the caching/eviction policy makes more sense on a per-table basis. We can keep this PR scoped to replacing Caffeine.maximumSize implementations with LRUCache for simplicity and revisit the expireAfterWrite cache in the future.

pvary

+1 pending tests

@mxm: Any other comments?

mxm · 2025-07-01T10:09:07Z

LGTM

pvary · 2025-07-01T14:19:48Z

Merged to main.
Thanks for the PR @aiborodin, @ben-manes for the detailed help and @mxm for the review!

aiborodin · 2025-07-02T01:59:05Z

Thank you for merging @pvary! Thank you, @ben-manes, for the benchmarking analysis and @mxm, for the review.
I raised a PR to backport changes to Flink 1.19/1.20: #13441.

backports #13382

github-actions Bot added the flink label Jun 25, 2025

aiborodin mentioned this pull request Jun 25, 2025

Flink: Dynamic Iceberg Sink: Optimise RowData evolution #13340

Merged

manuzhang changed the title ~~Use LRUCache for Iceberg table metadata~~ Flink 2.0: Use LRUCache for Iceberg table metadata Jun 25, 2025

aiborodin force-pushed the use-lru-cache-for-table-metadata branch from 715be14 to 0689560 Compare June 25, 2025 09:54

aiborodin force-pushed the use-lru-cache-for-table-metadata branch from 0689560 to d430d54 Compare June 27, 2025 07:22

mxm reviewed Jun 27, 2025

View reviewed changes

pvary reviewed Jun 27, 2025

View reviewed changes

Comment thread flink/v2.0/flink/src/main/java/org/apache/iceberg/flink/sink/dynamic/TableMetadataCache.java Outdated

aiborodin force-pushed the use-lru-cache-for-table-metadata branch from d430d54 to c7afaa2 Compare June 30, 2025 06:35

aiborodin changed the title ~~Flink 2.0: Use LRUCache for Iceberg table metadata~~ Flink 2.0: Replace Caffeine maxSize cache with LRUCache Jun 30, 2025

pvary reviewed Jun 30, 2025

View reviewed changes

Comment thread ...k/v2.0/flink/src/test/java/org/apache/iceberg/flink/sink/dynamic/TestTableMetadataCache.java Outdated

pvary reviewed Jun 30, 2025

View reviewed changes

Comment thread flink/v2.0/flink/src/test/java/org/apache/iceberg/flink/sink/dynamic/TestHashKeyGenerator.java Outdated

pvary reviewed Jun 30, 2025

View reviewed changes

Comment thread flink/v2.0/flink/src/test/java/org/apache/iceberg/flink/sink/dynamic/TestTableUpdater.java Outdated

pvary reviewed Jun 30, 2025

View reviewed changes

Comment thread flink/v2.0/flink/src/test/java/org/apache/iceberg/flink/sink/dynamic/TestTableUpdater.java Outdated

aiborodin force-pushed the use-lru-cache-for-table-metadata branch from c7afaa2 to ac76a15 Compare July 1, 2025 07:08

Replace Caffeine maxSize cache with LRUCache

7ba9960

We recently discovered that LRUCache, based on LinkedHashMap, performs almost twice as fast as the Caffeine max size cache. Let's replace the caffeine cache to optimise the performance.

aiborodin force-pushed the use-lru-cache-for-table-metadata branch from ac76a15 to 7ba9960 Compare July 1, 2025 07:11

pvary approved these changes Jul 1, 2025

View reviewed changes

mxm approved these changes Jul 1, 2025

View reviewed changes

pvary merged commit e733443 into apache:main Jul 1, 2025
23 of 27 checks passed

aiborodin deleted the use-lru-cache-for-table-metadata branch July 2, 2025 01:47

aiborodin mentioned this pull request Jul 2, 2025

Flink: Backport LRUCache usage to Flink 1.19/1.20 #13441

Merged

pvary pushed a commit that referenced this pull request Jul 2, 2025

Flink: Backport replace Caffeine maxSize cache with LRUCache (#13441)

97a1f3b

backports #13382

Conversation

aiborodin commented Jun 25, 2025

Uh oh!

ben-manes commented Jun 26, 2025

Uh oh!

aiborodin commented Jun 26, 2025

Uh oh!

ben-manes commented Jun 26, 2025

Uh oh!

pvary commented Jun 26, 2025

Uh oh!

aiborodin commented Jun 27, 2025

Uh oh!

mxm left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pvary commented Jun 27, 2025

Uh oh!

aiborodin commented Jun 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mxm commented Jun 30, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pvary commented Jun 30, 2025

Uh oh!

mxm commented Jun 30, 2025

Uh oh!

pvary commented Jun 30, 2025

Uh oh!

aiborodin commented Jul 1, 2025

Uh oh!

pvary left a comment

Choose a reason for hiding this comment

Uh oh!

mxm commented Jul 1, 2025

Uh oh!

Uh oh!

pvary commented Jul 1, 2025

Uh oh!

aiborodin commented Jul 2, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

aiborodin commented Jun 30, 2025 •

edited

Loading