
[Rust] [Parquet] ArrowReader fails on some timestamp types #24454

@asfimport

Description

I discovered this bug with the following query:

> SELECT tpep_pickup_datetime FROM taxi LIMIT 1;
General("InvalidArgumentError(\"column types must match schema types, expected Timestamp(Microsecond, None) but found UInt64 at column index 0\")") 

The parquet reader detects this schema when reading from the file:

Schema { 
  fields: [
    Field { name: "tpep_pickup_datetime", data_type: Timestamp(Microsecond, None), nullable: true, dict_id: 0, dict_is_ordered: false }
  ], 
  metadata: {} 
} 

The array read from the file contains:

PrimitiveArray<UInt64>
[
  1567318008000000,
  1567319357000000,
  1567320092000000,
  1567321151000000,
  ...
]

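As a sanity check on the data itself, here is a standalone sketch (not arrow code) showing that the raw `u64` values are plausible epoch timestamps in microseconds: dividing by 1,000,000 yields Unix seconds in early September 2019, consistent with NYC taxi data. So the stored values are fine; only the reported Arrow data type (UInt64 instead of Timestamp(Microsecond, None)) is wrong.

```rust
fn main() {
    // Raw values as read from the file, interpreted as microseconds
    // since the Unix epoch.
    let raw_micros: [u64; 4] = [
        1_567_318_008_000_000,
        1_567_319_357_000_000,
        1_567_320_092_000_000,
        1_567_321_151_000_000,
    ];
    for micros in raw_micros {
        let secs = micros / 1_000_000;
        println!("{micros} us = {secs} s since the Unix epoch");
    }
    // 1_567_318_008 s falls on 2019-09-01 UTC, so the payload itself
    // is a valid microsecond timestamp.
    assert_eq!(raw_micros[0] / 1_000_000, 1_567_318_008);
}
```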
When the Parquet Arrow reader creates the record batch, the following validation logic fails:

for i in 0..columns.len() {
    if columns[i].len() != len {
        return Err(ArrowError::InvalidArgumentError(
            "all columns in a record batch must have the same length".to_string(),
        ));
    }
    if columns[i].data_type() != schema.field(i).data_type() {
        return Err(ArrowError::InvalidArgumentError(format!(
            "column types must match schema types, expected {:?} but found {:?} at column index {}",
            schema.field(i).data_type(),
            columns[i].data_type(),
            i)));
    }
}
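The failing type check can be modeled with a minimal, self-contained sketch. The `DataType` enum and `validate` function below are hypothetical stand-ins for the arrow crate's types; they only reproduce the comparison that rejects the batch:

```rust
// Hypothetical stand-in for arrow's DataType, reduced to the two
// variants involved in this report.
#[derive(Debug, PartialEq)]
enum DataType {
    UInt64,
    TimestampMicrosecond,
}

// Mirrors the schema-vs-column type comparison in the quoted
// validation loop.
fn validate(schema_types: &[DataType], column_types: &[DataType]) -> Result<(), String> {
    for (i, (expected, found)) in schema_types.iter().zip(column_types).enumerate() {
        if expected != found {
            return Err(format!(
                "column types must match schema types, expected {expected:?} \
                 but found {found:?} at column index {i}"
            ));
        }
    }
    Ok(())
}

fn main() {
    // The schema promises a timestamp, but the reader produced a
    // UInt64 array, so validation fails exactly as in the report.
    let err = validate(&[DataType::TimestampMicrosecond], &[DataType::UInt64]).unwrap_err();
    assert!(err.contains("column index 0"));
    println!("{err}");
}
```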
 

Reporter: Andy Grove / @andygrove
Assignee: Renjie Liu / @liurenjie1024

Note: This issue was originally created as ARROW-8258. Please see the migration documentation for further details.
