Add a memory bound FileStatisticsCache for the Listing Table #20047

Open
mkleen wants to merge 66 commits into apache:main from mkleen:file-stats-cache

Conversation


@mkleen mkleen commented Jan 28, 2026

Which issue does this PR close?

This change introduces a default FileStatisticsCache implementation for the ListingTable with a size limit, following the steps outlined in #19052 (comment):

Rationale for this change

See above.

What changes are included in this PR?

See above.

Are these changes tested?

Yes.

Are there any user-facing changes?

A new runtime setting datafusion.runtime.file_statistics.cache_limit

@github-actions github-actions bot added the documentation, core, sqllogictest, catalog, common, and execution labels on Jan 28, 2026
@github-actions github-actions bot removed the documentation label on Jan 28, 2026
@mkleen mkleen force-pushed the file-stats-cache branch 2 times, most recently from e273afc to b297378 on January 28, 2026 at 14:40
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Jan 28, 2026
@mkleen mkleen marked this pull request as ready for review January 28, 2026 16:23
@mkleen mkleen changed the title from "Add a default FileStatisticsCache implementation for the ListingTable" to "Add a default FileStatisticsCache with a size limit" on Jan 28, 2026
@mkleen mkleen changed the title from "Add a default FileStatisticsCache with a size limit" to "Add a FileStatisticsCache with a size limit" on Jan 28, 2026
@mkleen mkleen changed the title from "Add a FileStatisticsCache with a size limit" to "Add FileStatisticsCache with a size limit" on Jan 28, 2026
@mkleen mkleen changed the title from "Add FileStatisticsCache with a size limit" to "Add a memory bound FileStatisticsCache with a size limit" on Jan 29, 2026
@mkleen mkleen changed the title from "Add a memory bound FileStatisticsCache with a size limit" to "Add a memory bound FileStatisticsCache for the Listing Table" on Jan 31, 2026

@kosiew kosiew left a comment

@mkleen

Thanks for working on this.

Comment thread datafusion/execution/src/cache/cache_unit.rs Outdated
Comment thread datafusion/common/src/heap_size.rs
@mkleen

mkleen commented Feb 4, 2026

@kosiew Thank you for the feedback!

@mkleen mkleen requested a review from kosiew February 4, 2026 12:10

@kosiew kosiew left a comment

LGTM


mkleen commented Feb 10, 2026

@kosiew Anything else needed to get this merged? Another approval maybe?

Comment thread datafusion/common/src/heap_size.rs Outdated
impl<T: DFHeapSize> DFHeapSize for Arc<T> {
    fn heap_size(&self) -> usize {
        // Arc stores the weak and strong counts on the heap alongside an instance of T
        2 * size_of::<usize>() + size_of::<T>() + self.as_ref().heap_size()
    }
}
Member

This won't be accurate.

let a1 = Arc::new(vec![1, 2, 3]);
let a2 = a1.clone();
let a3 = a1.clone();
let a4 = a3.clone();

// All `a`s point to the same allocation, so together they should account for
// the heap usage of a single `Vec`. The current implementation cannot detect
// the sharing and counts each clone in full, so this (wrong) equality holds:
assert_eq!(4 * a1.heap_size(), a1.heap_size() + a2.heap_size() + a3.heap_size() + a4.heap_size());

The only solution I can imagine is for the caller to keep track of the pointer addresses that have already been "sized" and to ignore any Arc that points to an address "sized" earlier.

Contributor Author

Good catch! I took this implementation from https://github.com/apache/arrow-rs/blob/main/parquet/src/file/metadata/memory.rs#L97-L102 . I would suggest handling this in a follow-up, since we are planning to restructure the whole heap size estimation anyway.

Comment thread datafusion/execution/src/cache/cache_unit.rs Outdated
Comment thread datafusion/execution/src/cache/cache_manager.rs
Comment thread datafusion/core/src/execution/context/mod.rs Outdated
Comment thread datafusion/execution/src/cache/cache_unit.rs Outdated
Comment thread datafusion/execution/src/cache/file_statistics_cache.rs
Comment thread datafusion/execution/src/cache/cache_unit.rs Outdated
Comment thread datafusion/execution/src/cache/file_statistics_cache.rs
Comment thread datafusion/sqllogictest/test_files/set_variable.slt Outdated
Comment thread datafusion/execution/src/cache/cache_manager.rs Outdated

mkleen commented Feb 10, 2026

@martin-g Thanks for this great review! I am on it.

@mkleen mkleen force-pushed the file-stats-cache branch from dee9bad to f98889f on April 23, 2026 at 10:33

@kosiew kosiew left a comment

@mkleen

Thanks for addressing the earlier feedback.

Most of the issues look resolved and the changes are in a much better place now. There are just a couple of remaining concerns that would be good to fix before merging.

}
}

impl<T: DFHeapSize> DFHeapSize for Arc<T> {
Contributor

It looks like DFHeapSize for Arc<T> is still double counting shared allocations. Each clone of the same Arc ends up reporting the full size of the underlying data.

This was mentioned earlier, but it is still an issue. Any structure holding multiple clones of the same Arc will overestimate memory usage.

Since this trait is now used for cache accounting, this could lead to premature evictions and confusing metrics. It would be great to adjust this so shared allocations are only counted once.

fsc.update_cache_limit(config.file_statistics_cache_limit);
Some(Arc::clone(fsc))
}
None if config.file_statistics_cache_limit > 0 => {
Contributor

This change subtly alters the behavior of with_file_statistics_cache(None). It no longer disables the cache if file_statistics_cache_limit > 0, because try_new will create a default cache in that case.

Previously, passing None was enough to disable the cache. Now callers also need to know to set the limit to 0, which is not obvious.

This feels like a silent behavior change and it also contradicts the current builder documentation. It would be good to either restore the original behavior or make the contract explicit in both code and docs.


Labels

catalog (Related to the catalog crate), common (Related to common crate), core (Core DataFusion crate), datasource (Changes to the datasource crate), documentation (Improvements or additions to documentation), execution (Related to the execution crate), sqllogictest (SQL Logic Tests (.slt))


Development

Successfully merging this pull request may close these issues.

Add a default FileStatisticsCache implementation for the ListingTable
Add limit to DefaultFileStatisticsCache

8 participants