Closed
Labels
bug: Something isn't working
Description
Describe the bug
Reading nested parquet files results in an index out of bounds error, as seen below:
thread 'main' panicked at 'index out of bounds: the len is 8 but the index is 8', /Users/xxxx/.cargo/registry/
src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/file_format/parquet.rs:272:13
To Reproduce
- Download attached zipped parquet file and unzip it: nested_schema_1row.parquet.zip
- Place it in a ./data folder
- Execute the following code:
let mut ctx = ExecutionContext::new();
let df = ctx.read_parquet("./data/nested_schema_1row.parquet").await?;
df.show().await?;
- The result is an index out of bounds panic:
thread 'main' panicked at 'index out of bounds: the len is 8 but the index is 8', /Users/xxxx/.cargo/registry/
src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/file_format/parquet.rs:272:13
stack backtrace:
0: rust_begin_unwind
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/panicking.rs:498:5
1: core::panicking::panic_fmt
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/panicking.rs:107:14
2: core::panicking::panic_bounds_check
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/panicking.rs:75:5
3: <usize as core::slice::index::SliceIndex<[T]>>::index_mut
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/slice/index.rs:190:14
4: core::slice::index::<impl core::ops::index::IndexMut<I> for [T]>::index_mut
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/slice/index.rs:26:9
5: <alloc::vec::Vec<T,A> as core::ops::index::IndexMut<I>>::index_mut
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/alloc/src/vec/mod.rs:2540:9
6: datafusion::datasource::file_format::parquet::fetch_metadata
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/file_format/parquet.rs:272:13
7: <datafusion::datasource::file_format::parquet::ParquetFormat as datafusion::datasource::file_format::FileFormat>::infer_schema::{{closure}}
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/file_format/parquet.rs:96:27
8: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
9: <core::pin::Pin<P> as core::future::future::Future>::poll
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/future.rs:119:9
10: datafusion::datasource::listing::table::ListingOptions::infer_schema::{{closure}}
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/listing/table.rs:99:27
11: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
12: datafusion::logical_plan::builder::LogicalPlanBuilder::scan_parquet_with_name::{{closure}}
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/logical_plan/builder.rs:287:31
13: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
14: datafusion::logical_plan::builder::LogicalPlanBuilder::scan_parquet::{{closure}}
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/logical_plan/builder.rs:255:9
15: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
16: datafusion::execution::context::ExecutionContext::read_parquet::{{closure}}
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/execution/context.rs:403:13
17: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
18: read_parquet::main::{{closure}}
at ./src/main.rs:79:14
19: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
20: tokio::park::thread::CachedParkThread::block_on::{{closure}}
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/park/thread.rs:263:54
21: tokio::coop::with_budget::{{closure}}
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/coop.rs:106:9
22: std::thread::local::LocalKey<T>::try_with
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/thread/local.rs:399:16
23: std::thread::local::LocalKey<T>::with
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/thread/local.rs:375:9
24: tokio::coop::with_budget
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/coop.rs:99:5
25: tokio::coop::budget
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/coop.rs:76:5
26: tokio::park::thread::CachedParkThread::block_on
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/park/thread.rs:263:31
27: tokio::runtime::enter::Enter::block_on
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/enter.rs:151:13
28: tokio::runtime::thread_pool::ThreadPool::block_on
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/thread_pool/mod.rs:77:9
29: tokio::runtime::Runtime::block_on
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/mod.rs:463:43
30: read_parquet::main
at ./src/main.rs:80:5
31: core::ops::function::FnOnce::call_once
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/ops/function.rs:227:5
Expected behavior
The parquet file should be read properly, without a panic.
Additional context
After debugging the issue a bit, the error happens in the fetch_statistics function. More precisely, schema.fields().len() (datasource/file_format/parquet.rs#L261) returns only the top-level fields, while row_group_meta.columns() (datasource/file_format/parquet.rs#L276-L277) returns all leaves.
In the context of the given parquet file, there are 8 top level fields and about 262 leaves.
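To illustrate the mismatch, here is a minimal std-only sketch. The `Field` enum and the `total_leaves` and `accumulate_null_counts` helpers are hypothetical stand-ins, not DataFusion or parquet types. It assumes the statistics buffer is sized by the number of top-level fields and then indexed by leaf column position, which is what the panic suggests. Using `get_mut` only avoids the panic here; it is not meant as the actual fix, since leaf indices still would not map onto top-level fields.

```rust
// Hypothetical, simplified model of a nested schema (not DataFusion's types).
enum Field {
    Leaf,
    Struct(Vec<Field>),
}

// Counts the primitive (leaf) columns below a single field.
fn leaf_count(field: &Field) -> usize {
    match field {
        Field::Leaf => 1,
        Field::Struct(children) => children.iter().map(leaf_count).sum(),
    }
}

// Counts the leaf columns of the whole schema.
fn total_leaves(schema: &[Field]) -> usize {
    schema.iter().map(leaf_count).sum()
}

// Adds per-leaf null counts into a buffer sized by the number of top-level
// fields. Out-of-range leaf indices are counted as skipped instead of
// panicking the way `null_counts[i] += n` would.
fn accumulate_null_counts(top_level_len: usize, leaf_nulls: &[usize]) -> (Vec<usize>, usize) {
    let mut null_counts = vec![0; top_level_len];
    let mut skipped = 0;
    for (i, n) in leaf_nulls.iter().enumerate() {
        match null_counts.get_mut(i) {
            Some(slot) => *slot += *n,
            None => skipped += 1,
        }
    }
    (null_counts, skipped)
}

fn main() {
    // Two top-level fields: one primitive, one struct holding three leaves.
    let schema = vec![
        Field::Leaf,
        Field::Struct(vec![
            Field::Leaf,
            Field::Struct(vec![Field::Leaf, Field::Leaf]),
        ]),
    ];
    let leaves = total_leaves(&schema);
    println!("top-level fields: {}, leaves: {}", schema.len(), leaves);

    // One null count per leaf column, as row group metadata would report.
    let leaf_nulls = vec![1; leaves];
    let (null_counts, skipped) = accumulate_null_counts(schema.len(), &leaf_nulls);
    println!("null_counts: {:?}, skipped leaves: {}", null_counts, skipped);
}
```

With the attached file the same shape applies at a larger scale: the buffer would have 8 slots, and the first leaf index past the end is 8, matching the panic message.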
DataFusion is 6.0
Rust is 1.58.0-nightly (65c55bf93 2021-11-23)
Cargo is 1.58.0-nightly (e1fb17631 2021-11-22)