Fix empty offset index for all null columns (#4459) #4460
Conversation
cc @AdamGS

alamb left a comment
I reviewed the code and test in this PR carefully and it looks good to me.
I will also attempt to verify that it fixes our actual data problem, using some internal test data.
    writer.close().unwrap();
}
I verified that without the code change in this PR, this test fails like this:
failures:
---- arrow::arrow_writer::tests::test_writer_all_null stdout ----
thread 'arrow::arrow_writer::tests::test_writer_all_null' panicked at 'assertion failed: `(left == right)`
left: `0`,
right: `1`', parquet/src/arrow/arrow_writer/mod.rs:2637:9
stack backtrace:
0: rust_begin_unwind
at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:578:5
1: core::panicking::panic_fmt
at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:67:14
2: core::panicking::assert_failed_inner
at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:274:17
3: core::panicking::assert_failed
at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:228:5
4: parquet::arrow::arrow_writer::tests::test_writer_all_null
at ./src/arrow/arrow_writer/mod.rs:2637:9
5: parquet::arrow::arrow_writer::tests::test_writer_all_null::{{closure}}
at ./src/arrow/arrow_writer/mod.rs:2615:31
6: core::ops::function::FnOnce::call_once
at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/ops/function.rs:250:5
7: core::ops::function::FnOnce::call_once
at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
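For context, here is a minimal standalone sketch of the failure mode. This is not the PR's exact test (that lives in parquet/src/arrow/arrow_writer/mod.rs), and it assumes parquet's page-index APIs (`ArrowReaderOptions::with_page_index` and `ParquetMetaData::offset_index`): write a single all-null column, re-open the file with the page index enabled, and assert that the offset index still records the data page.

use std::sync::Arc;

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
use parquet::arrow::ArrowWriter;

fn main() {
    // A single nullable Int32 column containing only nulls
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, true)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![None::<i32>; 10]))],
    )
    .unwrap();

    let path = "/tmp/all_null.parquet";
    let file = std::fs::File::create(path).unwrap();
    let mut writer = ArrowWriter::try_new(file, schema, None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();

    // Re-open the file with the page index enabled and inspect the offset index
    let file = std::fs::File::open(path).unwrap();
    let options = ArrowReaderOptions::new().with_page_index(true);
    let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options).unwrap();
    let metadata = builder.metadata();

    // Offset index layout: row group -> column chunk -> page locations
    let offset_index = metadata.offset_index().expect("offset index should be present");
    assert_eq!(offset_index.len(), 1); // one row group
    assert_eq!(offset_index[0].len(), 1); // one column chunk
    // Before this fix the writer emitted no page locations for the
    // all-null column, so an assertion like this failed with left: 0, right: 1
    assert_eq!(offset_index[0][0].len(), 1);
}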
I have verified that using master and 42.0.0, when I write a parquet file of our internal data, the file errors on read with `Parquet error: StructArrayReader out of sync in read_records, expected 0 skipped, got 11`:
$ datafusion-cli -c "SELECT xxx, time FROM '/tmp/foo.parquet' WHERE 1684850057953220316 <= time::bigint";
DataFusion CLI v27.0.0
Arrow error: External error: Arrow: Parquet argument error: Parquet error: StructArrayReader out of sync in read_records, expected 0 skipped, got 11
When I use the code in this PR or 41.0.0 to write out the same data, I can query the resulting parquet file successfully:
$ datafusion-cli -c "SELECT xxx, time FROM '/tmp/foo.parquet' WHERE 1684850057953220316 <= time::bigint";
DataFusion CLI v27.0.0
+-------+---------------------+
| xxx | time |
+-------+---------------------+
(there is data here)
Thus I am quite confident that this is the root cause and fix.
Thank you for the quick turnaround @tustvold
Test Program
use std::fs::File;

use arrow::record_batch::RecordBatchReader; // brings `reader.schema()` into scope
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

fn main() {
    // read data from parquet and write it back
    let path = "/path/to/input/data.parquet";
    println!("reading from {path}");
    let file = File::open(path).unwrap();
    let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();
    let mut reader = builder.build().unwrap();
    // rewrite the data to /tmp/foo.parquet
    let path = "/tmp/foo.parquet";
    println!("writing to {path}");
    let file = File::create(path).unwrap();
    let mut writer = ArrowWriter::try_new(file, reader.schema(), None).unwrap();
    while let Some(record_batch) = reader.next().transpose().unwrap() {
        println!("Read {} records.", record_batch.num_rows());
        writer.write(&record_batch).expect("Writing batch");
    }
    // writer must be closed to write footer
    writer.close().unwrap();
}
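This program assumes the `arrow` and `parquet` crates as dependencies; building it against parquet 42.0.0 reproduces the broken file, while 41.0.0 or this PR's branch produces a file that datafusion-cli can query as shown above.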
Which issue does this PR close?
Closes #4459 (fixes a bug introduced by #4389 in 42.0.0).
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?