Skip to content

Conversation

@Ted-Jiang
Copy link
Member

Which issue does this PR close?

Closes #7556 .

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Sep 15, 2023

#[tokio::test]
async fn load_table_stats_with_session_level_cache() -> Result<()> {
let testdata = crate::test_util::parquet_test_data();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add test to check cache share in session level

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice -- thank you. Since this is an end to end test, I recommend moving it to somewhere in core_integration: datafusion/core/tests/core_integration.rs perhaps

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Ted-Jiang -- this is looking really nice. I left some various clean up comments but I think this PR looks very nice and is well commented and structured 🏆

// The cache accessor, users usually working on this interface while manipulating caches
pub trait CacheAccessor<K, V>: Send + Sync {
// Extra info but not part of the cache key or cache value.
type Extra: Clone;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain what the usecase for Extra is? Specifically I wonder why such information could not be added as a field to the Value

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like in default FileStatisticsCache, get func need check last_modified from ObjectMeta which not impl Hash so can not be part of the key, we need put this info in Extra.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see -- that makes sense -- it might help to document the rationale in statistics

}

/// Get `Statistics` for file location. Returns None if file has changed or not found.
fn get_with_extra(&self, k: &Path, e: &Self::Extra) -> Option<Statistics> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as written this is going to copy the statistics (though I realize that is what this PR did previously) -- maybe we could use something like Arc<Statistics> to store the statistics.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes , thanks for point this out


#[tokio::test]
async fn load_table_stats_with_session_level_cache() -> Result<()> {
let testdata = crate::test_util::parquet_test_data();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice -- thank you. Since this is an end to end test, I recommend moving it to somewhere in core_integration: datafusion/core/tests/core_integration.rs perhaps

}

impl CacheManagerConfig {
pub fn enable_table_files_statistics_cache(mut self, cache: FileStaticCache) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we possibly have the field names match -- here it is called table_files_statistics_cache but on the CacheManager it is called file_statistics_cache -- I think they should be the same in both places (I like file_statistics_cache best as it matches the type name)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ops, this make sense

pub mod cache_unit;

// The cache accessor, users usually working on this interface while manipulating caches
pub trait CacheAccessor<K, V>: Send + Sync {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another classic API to add here would be "clear()" to clear all the values

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need this

.with_schema(resolved_schema);
let table =
ListingTable::try_new(config)?.with_definition(cmd.definition.clone());
let provider;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you implement the with_cache API, this can look like

let provider = ListingTable::try_new(config)?
  .with_cache(state.runtime_env().cache_manager.get_file_statistic_cache())

Copy link
Member Author

@Ted-Jiang Ted-Jiang Sep 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More clear way ! 👍

Ted-Jiang and others added 4 commits September 16, 2023 09:08
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@Ted-Jiang
Copy link
Member Author

@alamb PTAL

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Ted-Jiang -- I think this looks great -- I had a suggestion to improve the comments but I think we can do that as a follow on PR as well. Nice work!

use tempfile::NamedTempFile;

mod custom_reader;
mod file_statistics;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb
Copy link
Contributor

alamb commented Sep 17, 2023

cc @Dandandan / @thinkharderdev @liukun4515 and @mateuszkj who appears to have added this feature originally in 85c11c1 / #3649

Copy link
Contributor

@liukun4515 liukun4515 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for file statistics trait api

Ted-Jiang and others added 2 commits September 18, 2023 10:33
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@Ted-Jiang Ted-Jiang merged commit 678d27a into apache:main Sep 18, 2023
@Ted-Jiang
Copy link
Member Author

@alamb @liukun4515 Thanks for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make StatisticsCache share in session level

3 participants