
Conversation

@tustvold tustvold commented Jun 28, 2023

Which issue does this PR close?

Closes #4459 (fixes a bug introduced by #4389 in 42.0.0)

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 28, 2023

alamb commented Jun 28, 2023

cc @AdamGS


@alamb alamb left a comment


I reviewed the code and test in this PR carefully and it looks good to me

I will also attempt to verify that it fixes our actual data problem, using some internal test data


writer.close().unwrap();
}


I verified that without the code change in this PR, this test fails like this:


failures:

---- arrow::arrow_writer::tests::test_writer_all_null stdout ----
thread 'arrow::arrow_writer::tests::test_writer_all_null' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `1`', parquet/src/arrow/arrow_writer/mod.rs:2637:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:578:5
   1: core::panicking::panic_fmt
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:67:14
   2: core::panicking::assert_failed_inner
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:274:17
   3: core::panicking::assert_failed
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:228:5
   4: parquet::arrow::arrow_writer::tests::test_writer_all_null
             at ./src/arrow/arrow_writer/mod.rs:2637:9
   5: parquet::arrow::arrow_writer::tests::test_writer_all_null::{{closure}}
             at ./src/arrow/arrow_writer/mod.rs:2615:31
   6: core::ops::function::FnOnce::call_once
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/ops/function.rs:250:5
   7: core::ops::function::FnOnce::call_once
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


@alamb alamb left a comment


I have verified that when I write a parquet file of our internal data using master or 42.0.0, the file errors on read with Parquet error: StructArrayReader out of sync in read_records, expected 0 skipped, got 11:

$ datafusion-cli -c "SELECT xxx, time FROM '/tmp/foo.parquet' WHERE 1684850057953220316 <= time::bigint";
DataFusion CLI v27.0.0
Arrow error: External error: Arrow: Parquet argument error: Parquet error: StructArrayReader out of sync in read_records, expected 0 skipped, got 11

When I use the code in this PR (or 41.0.0) to write out the same data, I can query the resulting parquet file successfully:

$ datafusion-cli -c "SELECT xxx, time FROM '/tmp/foo.parquet' WHERE 1684850057953220316 <= time::bigint";
DataFusion CLI v27.0.0
+-------+---------------------+
| xxx   | time                |
+-------+---------------------+
(there is data here)

Thus I am quite confident that this is the root cause and fix.

Thank you for the quick turnaround @tustvold

Test Program

use std::fs::File;

use arrow_array::RecordBatchReader;
use parquet::arrow::{arrow_reader::ParquetRecordBatchReaderBuilder, ArrowWriter};

fn main() {
    // read data from parquet and write it back
    let path = "/path/to/input/data.parquet";
    println!("reading from {path}");

    let file = File::open(path).unwrap();
    let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();
    let mut reader = builder.build().unwrap();

    // rewrite the data to /tmp/foo.parquet
    let path = "/tmp/foo.parquet";
    println!("writing to {path}");
    let file = File::create(path).unwrap();
    let mut writer = ArrowWriter::try_new(file, reader.schema(), None).unwrap();

    while let Some(record_batch) = reader.next().transpose().unwrap() {
        println!("Read {} records.", record_batch.num_rows());
        writer.write(&record_batch).expect("Writing batch");
    }

    // writer must be closed to write the footer
    writer.close().unwrap();
}

@tustvold tustvold merged commit 554aebe into apache:master Jun 28, 2023
alamb pushed a commit to alamb/arrow-rs that referenced this pull request Jun 29, 2023

Labels

parquet Changes to the parquet crate


Development

Successfully merging this pull request may close these issues.

Regression in parquet 42.0.0: Bad parquet column indexes for All Null Columns, resulting in Parquet error: StructArrayReader out of sync on read
