Extract common parquet testing code to parquet-test-util crate#4042
Extract common parquet testing code to parquet-test-util crate#4042alamb merged 2 commits intoapache:masterfrom
parquet-test-util crate#4042Conversation
ce9d131 to
b159106
Compare
| scan_options: ParquetScanOptions, | ||
| debug: bool, | ||
| ) -> Result<usize> { | ||
| let ParquetScanOptions { |
There was a problem hiding this comment.
All this code is simply moved into parquet-test-utils and the TestParquetFile struct
| use parquet::file::properties::WriterProperties; | ||
|
|
||
| /// a ParquetFile that has been created for testing. | ||
| pub struct TestParquetFile { |
There was a problem hiding this comment.
The point of this PR is to refactor common code from the benchmark into TestParquetFile for use in testing
| host_idx: usize, | ||
|
|
||
| /// optional number of rows produced | ||
| row_limit: Option<usize>, |
There was a problem hiding this comment.
Limits to control the maximum parquet size
| [workspace] | ||
| exclude = ["datafusion-cli"] | ||
| members = ["datafusion/common", "datafusion/core", "datafusion/expr", "datafusion/jit", "datafusion/optimizer", "datafusion/physical-expr", "datafusion/proto", "datafusion/row", "datafusion/sql", "datafusion-examples", "benchmarks", | ||
| members = [ |
|
Thanks for the review @xudong963 |
tustvold
left a comment
There was a problem hiding this comment.
Mainly just some minor nits. Good to see extra mileage from this generator 👍
| /// | ||
| /// Recursively searches for ParquetExec and returns the metrics | ||
| /// on the first one it finds | ||
| pub fn parquet_metrics(&self, plan: Arc<dyn ExecutionPlan>) -> Option<MetricsSet> { |
There was a problem hiding this comment.
Does this need to take &self?
There was a problem hiding this comment.
No -- I am still in the object oriented frame of mind -- will remove
| impl TestParquetFile { | ||
| /// Creates a new parquet file at the specified location | ||
| pub fn try_new( | ||
| path: PathBuf, |
There was a problem hiding this comment.
Taking a File instead of a Path would allow passing in a file generated by tempfile::tempfile. Mainly just thinking that these are potentially fairly substantial files which we therefore should be careful to cleanup automatically
There was a problem hiding this comment.
This is a good idea - thank you for the suggestion -- I tried to do this, but it seems as if the object store interface requires the length and canonical path which are currently computed from the path
let size = std::fs::metadata(&path)?.len() as usize;
let canonical_path = path.canonicalize()?;
Given this code is no worse than today I would like to leave it in this PR -- it does in fact leave large files around which I will attempt to clean up in #3976
|
Merging this PR in as I have another that is built on it and will address comments in follow on PRs |
|
Benchmark runs are scheduled for baseline = ea31da9 and contender = 41467ab. 41467ab is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
…che#4042) * Extract common parquet testing code to `parquet-test-util` crate * fix doc tests
Which issue does this PR close?
Part of #3463
Rationale for this change
To create parquet integration testing (#3976) , I want to reuse code from a pre-existing benchmark
What changes are included in this PR?
Are there any user-facing changes?
No, this is just moving code around