Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 11 additions & 1 deletion datafusion/core/src/test_util/parquet.rs
Original file line number Diff line number Diff line change
Expand Up @@ -103,7 +103,17 @@ impl TestParquetFile {

let size = std::fs::metadata(&path)?.len() as usize;

let canonical_path = path.canonicalize()?;
let mut canonical_path = path.canonicalize()?;

if cfg!(target_os = "windows") {
canonical_path = canonical_path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this test exercised on the CI?

windows:
name: cargo test (win64)
runs-on: windows-latest
steps:
- uses: actions/checkout@v4
with:
submodules: true
- name: Setup Rust toolchain
uses: ./.github/actions/setup-windows-builder
- name: Run tests (excluding doctests)
shell: bash
run: |
export PATH=$PATH:$HOME/d/protoc/bin
cargo test --lib --tests --bins --features avro,json,backtrace

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it, I also suspect it in PR #11575. However, this case run ok on my windows machine, and when I remove the code referenced above, this case can't be ok (got below error).

thread 'parquet::filter_pushdown::single_file' panicked at datafusion\core\tests\parquet\filter_pushdown.rs:516:66:
called `Result::unwrap()` on an `Err` value: ParquetError(External(Generic { store: "LocalFileSystem", source: InvalidUrl { url: Url { scheme: "file", cannot_be_a_base: false, username: "", password: None, host: None, port: None, path: "/%5C%5C%3F%5CD:%5Cworkspace%5Crust%5Cgithub.com%5Czhuliquan%5Cdatafusion%5Cdatafusion%5Ccore%5C.tmpoyfZYi%5Cdata.parquet", query: None, fragment: None } } }))
stack backtrace:
   0: rust_begin_unwind
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14\library/std\src/panicking.rs:662:5
   1: core::panicking::panic_fmt
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14\library/core\src/panicking.rs:74:14
   2: core::result::unwrap_failed
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14\library/core\src/result.rs:1677:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14\library\core\src/result.rs:1102:23
   4: parquet_exec::parquet::filter_pushdown::TestCase::read_with_options::{{closure}}
             at .\tests\parquet\filter_pushdown.rs:516:22
   5: parquet_exec::parquet::filter_pushdown::TestCase::run_with_filter::{{closure}}
             at .\tests\parquet\filter_pushdown.rs:448:14
   6: parquet_exec::parquet::filter_pushdown::TestCase::run::{{closure}}
             at .\tests\parquet\filter_pushdown.rs:433:43
   7: parquet_exec::parquet::filter_pushdown::single_file::{{closure}}
             at .\tests\parquet\filter_pushdown.rs:85:16
   8: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14\library\core\src\future/future.rs:123:9
   9: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14\library\core\src\future/future.rs:123:9
  10: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}::{{closure}}
             at C:\Users\zlq16\.cargo\registry\src\rsproxy.cn-0dccff568467c15b\tokio-1.40.0\src\runtime\scheduler\current_thread\mod.rs:696:57
  11: tokio::runtime::coop::with_budget
             at C:\Users\zlq16\.cargo\registry\src\rsproxy.cn-0dccff568467c15b\tokio-1.40.0\src\runtime\coop.rs:107:5
  12: tokio::runtime::coop::budget
             at C:\Users\zlq16\.cargo\registry\src\rsproxy.cn-0dccff568467c15b\tokio-1.40.0\src\runtime\coop.rs:73:5
  13: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}
             at C:\Users\zlq16\.cargo\registry\src\rsproxy.cn-0dccff568467c15b\tokio-1.40.0\src\runtime\scheduler\current_thread\mod.rs:696:25
  14: tokio::runtime::scheduler::current_thread::Context::enter
             at C:\Users\zlq16\.cargo\registry\src\rsproxy.cn-0dccff568467c15b\tokio-1.40.0\src\runtime\scheduler\current_thread\mod.rs:423:19
  15: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}
             at C:\Users\zlq16\.cargo\registry\src\rsproxy.cn-0dccff568467c15b\tokio-1.40.0\src\runtime\scheduler\current_thread\mod.rs:695:36
  16: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
             at C:\Users\zlq16\.cargo\registry\src\rsproxy.cn-0dccff568467c15b\tokio-1.40.0\src\runtime\scheduler\current_thread\mod.rs:774:68
  17: tokio::runtime::context::scoped::Scoped<T>::set
             at C:\Users\zlq16\.cargo\registry\src\rsproxy.cn-0dccff568467c15b\tokio-1.40.0\src\runtime\context\scoped.rs:40:9
  18: tokio::runtime::context::set_scheduler::{{closure}}
             at C:\Users\zlq16\.cargo\registry\src\rsproxy.cn-0dccff568467c15b\tokio-1.40.0\src\runtime\context.rs:180:26
  19: std::thread::local::LocalKey<T>::try_with
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14\library\std\src\thread/local.rs:283:12
  20: std::thread::local::LocalKey<T>::with
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14\library\std\src\thread/local.rs:260:9
  21: tokio::runtime::context::set_scheduler
             at C:\Users\zlq16\.cargo\registry\src\rsproxy.cn-0dccff568467c15b\tokio-1.40.0\src\runtime\context.rs:180:9
  22: tokio::runtime::scheduler::current_thread::CoreGuard::enter
             at C:\Users\zlq16\.cargo\registry\src\rsproxy.cn-0dccff568467c15b\tokio-1.40.0\src\runtime\scheduler\current_thread\mod.rs:774:27
  23: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
             at C:\Users\zlq16\.cargo\registry\src\rsproxy.cn-0dccff568467c15b\tokio-1.40.0\src\runtime\scheduler\current_thread\mod.rs:683:19
  24: tokio::runtime::scheduler::current_thread::CurrentThread::block_on::{{closure}}
             at C:\Users\zlq16\.cargo\registry\src\rsproxy.cn-0dccff568467c15b\tokio-1.40.0\src\runtime\scheduler\current_thread\mod.rs:191:28
  25: tokio::runtime::context::runtime::enter_runtime
             at C:\Users\zlq16\.cargo\registry\src\rsproxy.cn-0dccff568467c15b\tokio-1.40.0\src\runtime\context\runtime.rs:65:16
  26: tokio::runtime::scheduler::current_thread::CurrentThread::block_on
             at C:\Users\zlq16\.cargo\registry\src\rsproxy.cn-0dccff568467c15b\tokio-1.40.0\src\runtime\scheduler\current_thread\mod.rs:179:9
  27: tokio::runtime::runtime::Runtime::block_on_inner
             at C:\Users\zlq16\.cargo\registry\src\rsproxy.cn-0dccff568467c15b\tokio-1.40.0\src\runtime\runtime.rs:361:47
  28: tokio::runtime::runtime::Runtime::block_on
             at C:\Users\zlq16\.cargo\registry\src\rsproxy.cn-0dccff568467c15b\tokio-1.40.0\src\runtime\runtime.rs:335:13
  29: parquet_exec::parquet::filter_pushdown::single_file
             at .\tests\parquet\filter_pushdown.rs:223:5
  30: parquet_exec::parquet::filter_pushdown::single_file::{{closure}}
             at .\tests\parquet\filter_pushdown.rs:69:23
  31: core::ops::function::FnOnce::call_once
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14\library\core\src\ops/function.rs:250:5
  32: core::ops::function::FnOnce::call_once
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14\library/core\src\ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
error: test failed, to rerun pass `--test parquet_exec`

.to_str()
.unwrap()
.replace("\\", "/")
.strip_prefix("//?/")
.unwrap()
.into();
};

let object_store_url =
ListingTableUrl::parse(canonical_path.to_str().unwrap_or_default())?
Expand Down
6 changes: 4 additions & 2 deletions datafusion/core/tests/parquet/filter_pushdown.rs
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,8 @@
//! select * from data limit 10;
//! ```

use std::path::Path;

use arrow::compute::concat_batches;
use arrow::record_batch::RecordBatch;
use datafusion::physical_plan::collect;
Expand Down Expand Up @@ -67,7 +69,7 @@ fn generate_file(tempdir: &TempDir, props: WriterProperties) -> TestParquetFile
async fn single_file() {
// Only create the parquet file once as it is fairly large

let tempdir = TempDir::new().unwrap();
let tempdir = TempDir::new_in(Path::new(".")).unwrap();
// Set row group size smaller so can test with fewer rows
let props = WriterProperties::builder()
.set_max_row_group_size(1024)
Expand Down Expand Up @@ -223,7 +225,7 @@ async fn single_file() {

#[tokio::test]
async fn single_file_small_data_pages() {
let tempdir = TempDir::new().unwrap();
let tempdir = TempDir::new_in(Path::new(".")).unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this end up litterring in the local git checkout when running the tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you perhaps try to undo this change? (same above)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this end up litterring in the local git checkout when running the tests?

Emm, path which hold by tempdir will be delete after test case compeleted, Because struct TempDir implements Drop trait.

impl Drop for TempDir {
    fn drop(&mut self) {
        if !self.keep {
            let _ = remove_dir_all(self.path());
        }
    }
}

Copy link
Contributor Author

@zhuliquan zhuliquan Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you perhaps try to undo this change? (same above)

Hi @findepi. It's a bummer, but we still have to make this change. Because, I found out from the previous PR #13531 that the temp directory (i.g. "C:\Users\RUNNER~1\AppData\Local\Temp") that contains the special character ~ in CI, which will be replace to "%7E" when loading file, so that the file cannot be find. I notice that some parquet cases are working fine on windows machine (i.g. parquet::file_statistics::check_stats_precision_with_filter_pushdown). This cases read parquet files in "../../parquet-testing/data", which implies that the current folder definitely ok and does not have special characters (i.g. "~", "\").

pub fn parquet_test_data() -> String {
match get_data_dir("PARQUET_TEST_DATA", "../../parquet-testing/data") {
Ok(pb) => pb.display().to_string(),
Err(err) => panic!("failed to get parquet data dir: {err}"),
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @zhuliquan -- using new_in makes sense for me and I agree that the temporary file will be deleted when the TempFile goes out of scope

Copy link
Contributor Author

@zhuliquan zhuliquan Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @zhuliquan -- using new_in makes sense for me and I agree that the temporary file will be deleted when the TempFile goes out of scope
Thanks @alamb reviews. Emm,using new_in is aim to make temp parquet file in current directory instead of system temp directory. It can avoid some special char (i.g ~). The feature which deleting file without scope is owe to Drop trait which is implemented by TempDir

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the temporary file will be deleted when the TempFile goes out of scope

will this work too if a test panics?
tests creating files -- or sometimes creating files -- in the git repo is something we shoulda void

Copy link
Contributor Author

@zhuliquan zhuliquan Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the temporary file will be deleted when the TempFile goes out of scope

will this work too if a test panics? tests creating files -- or sometimes creating files -- in the git repo is something we shoulda void

Actually, drop will be invoked when test panic. If test happend panic, Rust will begin unwinding the stack, which means it will try to drop all variables that are still in scope (and therefore their drop methods will be called). I've done experiments to confirm this conclusion (Wherever I add panic! to the test code, the temporary directory seems to be removed). Even in extreme cases (i.g. panic happened in drop function), there is an ignore file to exclude the use of parquet files for this test.

datafusion/core/*.parquet


// Set low row count limit to improve page filtering
let props = WriterProperties::builder()
Expand Down
1 change: 0 additions & 1 deletion datafusion/core/tests/parquet/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,6 @@ use tempfile::NamedTempFile;
mod custom_reader;
mod external_access_plan;
mod file_statistics;
#[cfg(not(target_family = "windows"))]
mod filter_pushdown;
mod page_pruning;
mod row_group_pruning;
Expand Down