Reading nested parquet files results in index out of bounds #1383

@andrei-ionescu

Describe the bug

Reading a nested parquet file results in an index out of bounds panic, as seen below:

thread 'main' panicked at 'index out of bounds: the len is 8 but the index is 8', /Users/xxxx/.cargo/registry/
    src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/file_format/parquet.rs:272:13

To Reproduce

  1. Download the attached zipped parquet file and unzip it: nested_schema_1row.parquet.zip
  2. Place it in a ./data folder
  3. Execute the following code:
     let mut ctx = ExecutionContext::new();
     let df = ctx.read_parquet("./data/nested_schema_1row.parquet").await?;
     df.show().await
  4. The result is an index out of bounds panic:
thread 'main' panicked at 'index out of bounds: the len is 8 but the index is 8', /Users/xxxx/.cargo/registry/
    src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/file_format/parquet.rs:272:13
stack backtrace:
   0: rust_begin_unwind
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/panicking.rs:498:5
   1: core::panicking::panic_fmt
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/panicking.rs:107:14
   2: core::panicking::panic_bounds_check
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/panicking.rs:75:5
   3: <usize as core::slice::index::SliceIndex<[T]>>::index_mut
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/slice/index.rs:190:14
   4: core::slice::index::<impl core::ops::index::IndexMut<I> for [T]>::index_mut
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/slice/index.rs:26:9
   5: <alloc::vec::Vec<T,A> as core::ops::index::IndexMut<I>>::index_mut
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/alloc/src/vec/mod.rs:2540:9
   6: datafusion::datasource::file_format::parquet::fetch_metadata
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/file_format/parquet.rs:272:13
   7: <datafusion::datasource::file_format::parquet::ParquetFormat as datafusion::datasource::file_format::FileFormat>::infer_schema::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/file_format/parquet.rs:96:27
   8: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
   9: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/future.rs:119:9
  10: datafusion::datasource::listing::table::ListingOptions::infer_schema::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/listing/table.rs:99:27
  11: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
  12: datafusion::logical_plan::builder::LogicalPlanBuilder::scan_parquet_with_name::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/logical_plan/builder.rs:287:31
  13: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
  14: datafusion::logical_plan::builder::LogicalPlanBuilder::scan_parquet::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/logical_plan/builder.rs:255:9
  15: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
  16: datafusion::execution::context::ExecutionContext::read_parquet::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/execution/context.rs:403:13
  17: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
  18: read_parquet::main::{{closure}}
             at ./src/main.rs:79:14
  19: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
  20: tokio::park::thread::CachedParkThread::block_on::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/park/thread.rs:263:54
  21: tokio::coop::with_budget::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/coop.rs:106:9
  22: std::thread::local::LocalKey<T>::try_with
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/thread/local.rs:399:16
  23: std::thread::local::LocalKey<T>::with
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/thread/local.rs:375:9
  24: tokio::coop::with_budget
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/coop.rs:99:5
  25: tokio::coop::budget
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/coop.rs:76:5
  26: tokio::park::thread::CachedParkThread::block_on
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/park/thread.rs:263:31
  27: tokio::runtime::enter::Enter::block_on
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/enter.rs:151:13
  28: tokio::runtime::thread_pool::ThreadPool::block_on
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/thread_pool/mod.rs:77:9
  29: tokio::runtime::Runtime::block_on
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/mod.rs:463:43
  30: read_parquet::main
             at ./src/main.rs:80:5
  31: core::ops::function::FnOnce::call_once
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/ops/function.rs:227:5

Expected behavior

The parquet file should be read and displayed without a panic.

Additional context

After a bit of debugging, the error appears to happen in the fetch_statistics function. More precisely, schema.fields().len() (datasource/file_format/parquet.rs#L261) returns only the top-level fields, while row_group_meta.columns() (datasource/file_format/parquet.rs#L276-L277) returns all leaves.

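For the given parquet file, there are 8 top-level fields and about 262 leaves. Anything sized by the number of top-level fields has length 8, so indexing it by leaf column position fails once the index reaches 8, which matches the panic message.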
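
The mismatch can be shown without DataFusion at all. The following is only a minimal sketch under assumptions: the `Field` enum, `leaf_count`, and `accumulate` are hypothetical names made up for illustration and are not DataFusion or parquet APIs. It sizes a vector by the number of top-level fields, walks the leaf columns, and reports the first leaf index that does not fit instead of panicking.

```rust
// Hypothetical, simplified model of a nested schema.
// These are NOT the arrow/parquet/DataFusion types.
enum Field {
    Leaf,
    Struct(Vec<Field>),
}

/// Number of leaf (primitive) columns under a field.
fn leaf_count(field: &Field) -> usize {
    match field {
        Field::Leaf => 1,
        Field::Struct(children) => children.iter().map(leaf_count).sum(),
    }
}

/// Accumulates one value per leaf column into a vector that is sized by
/// the number of top-level fields. Returns `Err(i)` with the first leaf
/// index that does not fit, where a plain `counts[i] += v` would panic.
fn accumulate(top_level: usize, leaf_values: &[u64]) -> Result<Vec<u64>, usize> {
    let mut counts = vec![0u64; top_level];
    for (i, v) in leaf_values.iter().enumerate() {
        match counts.get_mut(i) {
            Some(slot) => *slot += v,
            None => return Err(i),
        }
    }
    Ok(counts)
}

fn main() {
    // Two top-level fields; the second is a struct with three leaves.
    let schema = vec![
        Field::Leaf,
        Field::Struct(vec![Field::Leaf, Field::Leaf, Field::Leaf]),
    ];
    let top_level = schema.len();
    let leaves: usize = schema.iter().map(leaf_count).sum();
    println!("top-level fields: {}, leaves: {}", top_level, leaves);

    // One statistic (for example a null count) per leaf column.
    let per_leaf = vec![0u64; leaves];
    match accumulate(top_level, &per_leaf) {
        Ok(_) => println!("all leaf indices fit"),
        Err(i) => println!("leaf index {} is out of bounds for len {}", i, top_level),
    }
}
```

With a flat schema the two counts are equal and the accumulation succeeds; with any nested field the leaf count is larger and the first extra leaf index equals the vector length, as in the report (len 8, index 8).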

DataFusion is 6.0
Rust is 1.58.0-nightly (65c55bf93 2021-11-23)
Cargo is 1.58.0-nightly (e1fb17631 2021-11-22)

Labels: bug (Something isn't working)
