fix: incorrect Parquet INT96 timestamp values from ArrowReader #2301
blackmwk merged 13 commits into apache:main
Conversation
Thanks for the feedback @emkornfield! You gave me a good idea on how I might be able to restructure this to address several of your comments. If you don't mind, I'll re-request a review probably later today.
Sounds good, please take comments with a grain of salt, I'm fairly new to the code base.
Rewrote using … This also eliminates the … Made …
```rust
impl ArrowSchemaVisitor for Int96CoercionVisitor<'_> {
    type T = Field;
```
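As a rough, self-contained illustration of the visitor pattern the snippet above implements (this is a hypothetical stand-in, not the crate's actual `ArrowSchemaVisitor` trait): a visitor walks the schema's fields and may rewrite each one.

```rust
// Hypothetical, simplified stand-in for the schema-visitor pattern used in
// this PR (not the crate's real API): walk a flat list of fields and let
// the visitor rewrite each one.
#[derive(Debug, PartialEq)]
enum FieldType {
    TimestampNs,
    TimestampUs,
    Other,
}

struct Field {
    name: String,
    ty: FieldType,
}

trait SchemaVisitor {
    fn visit_field(&mut self, field: Field) -> Field;
}

struct Int96Coercion;

impl SchemaVisitor for Int96Coercion {
    fn visit_field(&mut self, mut field: Field) -> Field {
        // Coerce nanosecond timestamps to microsecond; leave everything else.
        if field.ty == FieldType::TimestampNs {
            field.ty = FieldType::TimestampUs;
        }
        field
    }
}

fn visit_schema<V: SchemaVisitor>(fields: Vec<Field>, visitor: &mut V) -> Vec<Field> {
    fields.into_iter().map(|f| visitor.visit_field(f)).collect()
}
```

The real visitor additionally recurses into structs, lists, and maps; this sketch only shows the rewrite-on-visit shape.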
Most recent changes:
Thanks for the patience so far @blackmwk!
```rust
/// Files with embedded field IDs (branch 1 of schema resolution).
#[tokio::test]
async fn test_read_int96_timestamps_with_field_ids() {
```
As unit tests, I'd expect to see tests for the schema visitor only. I think it would be better to put these tests into the reader module?
Thanks for the review @blackmwk!

Moved integration tests from `int96.rs` to `reader.rs`
- The 5 tests that write Parquet files and read through `ArrowReader` now live in the `reader` test module where they belong.
- Converted `///` doc comments to `//` to match codebase test style.

Added 11 unit tests for the schema visitor in `int96.rs`
- `test_coerce_timestamp_ns_to_us` — `Timestamp` field ID → microsecond
- `test_coerce_timestamptz_ns_to_us` — `Timestamptz` field ID → microsecond, preserves timezone
- `test_no_coercion_when_iceberg_is_timestamp_ns` — `TimestampNs` → no change
- `test_no_coercion_when_iceberg_is_timestamptz_ns` — `TimestamptzNs` → no change
- `test_no_coercion_when_already_microsecond` — already microsecond → no change
- `test_defaults_to_us_without_field_ids` — no field ID metadata → falls back to microsecond (Iceberg Java behavior)
- `test_defaults_to_us_when_iceberg_type_is_not_timestamp` — field ID maps to a non-timestamp Iceberg type → falls back to microsecond
- `test_coerce_preserves_field_metadata` — field metadata survives coercion
- `test_coerce_timestamp_in_struct` — nested struct coercion
- `test_coerce_timestamp_in_list` — nested list coercion
- `test_coerce_timestamp_in_map_value` — nested map value coercion
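The rule the unit tests above exercise can be sketched with a small self-contained model. `IcebergTs` and `int96_target_unit` here are hypothetical stand-ins for illustration, not the crate's real types or functions.

```rust
// Self-contained sketch of the target-unit selection described above.
#[derive(Debug, PartialEq, Clone, Copy)]
enum TimeUnit {
    Microsecond,
    Nanosecond,
}

#[derive(Clone, Copy)]
enum IcebergTs {
    Timestamp,
    Timestamptz,
    TimestampNs,
    TimestamptzNs,
}

// Pick the Arrow time unit an INT96 column should be read at. `None` models
// the case where no field ID resolves (e.g. migrated files): fall back to
// microsecond, matching Iceberg Java's TimestampInt96Reader behavior.
fn int96_target_unit(iceberg_type: Option<IcebergTs>) -> TimeUnit {
    match iceberg_type {
        Some(IcebergTs::TimestampNs) | Some(IcebergTs::TimestamptzNs) => TimeUnit::Nanosecond,
        Some(IcebergTs::Timestamp) | Some(IcebergTs::Timestamptz) | None => TimeUnit::Microsecond,
    }
}
```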
# Conflicts:
#	crates/iceberg/src/arrow/reader.rs
blackmwk left a comment:
Thanks @mbutrovich for this PR!
## Which issue does this PR close?
- Closes #2299.

## What changes are included in this PR?
- Add `coerce_int96_timestamps()` to patch the Arrow schema before reading, using arrow-rs's schema hint mechanism (`ArrowReaderOptions::with_schema`) to read INT96 columns at the resolution specified by the Iceberg table schema
- `timestamp`/`timestamptz` → microsecond, `timestamp_ns`/`timestamptz_ns` → nanosecond, per the [Iceberg spec](https://iceberg.apache.org/spec/#primitive-types)
- Falls back to microsecond when no field ID is available (matching Iceberg Java's `TimestampInt96Reader` behavior)
- Applied after all three schema resolution branches (with field IDs, name mapping, positional fallback) so the fix covers both native and migrated tables
- Handles INT96 inside nested types (structs, lists, maps) via `ArrowSchemaVisitor` traversal
- Visitor and tests live in a standalone `arrow/int96.rs` module to keep `reader.rs` manageable
- Made `visit_schema` in `arrow/schema.rs` `pub(crate)` so the coercion visitor can reuse the existing traversal

## Are these changes tested?
- `test_read_int96_timestamps_with_field_ids` — files with embedded field IDs (branch 1)
- `test_read_int96_timestamps_without_field_ids` — migrated files without field IDs (branches 2/3)
- `test_read_int96_timestamps_in_struct` — INT96 inside a struct field
- `test_read_int96_timestamps_in_list` — INT96 inside a list field (3-level Parquet LIST encoding)
- `test_read_int96_timestamps_in_map` — INT96 as map values
- All tests use dates outside the i64 nanosecond range (~1677-2262) to confirm the overflow is avoided
- [Apache DataFusion Comet](https://github.com/apache/datafusion-comet) used the repro test in [apache/datafusion-comet#3856](apache/datafusion-comet#3856) and it passes with this change: apache/datafusion-comet#3857

(cherry picked from commit a2f067d)
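A back-of-the-envelope check of the overflow window the tests target: an i64 count of nanoseconds since the Unix epoch only reaches to roughly the year 2262, while i64 microseconds reach hundreds of millennia further. The function name below is illustrative, not part of the codebase.

```rust
// Approximate the latest representable year for an i64 epoch timestamp at a
// given resolution (nanos_per_unit = 1 for nanoseconds, 1000 for microseconds).
fn max_year_for_unit(nanos_per_unit: i64) -> i64 {
    // Units per (365-day) year: seconds per year times units per second.
    let units_per_year = 365 * 24 * 3600 * (1_000_000_000 / nanos_per_unit);
    1970 + i64::MAX / units_per_year
}
```

This is why reading INT96 values as nanoseconds overflows for the test dates, while microseconds do not.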