ARROW-11824: [Rust] [Parquet] Use logical types in Arrow schema conversion #9612
The clippy error seems unrelated to this PR: I also saw it on @Dandandan's PR, #9639.
Sorry, I did not mean to close this PR.
ARROW-11881: [Rust][DataFusion] Fix clippy lint

A linter error has appeared on master somehow:

```
error: unnecessary parentheses around `for` iterator expression
   --> datafusion/src/physical_plan/merge.rs:124:31
    |
124 |                 for part_i in (0..input_partitions) {
    |                               ^^^^^^^^^^^^^^^^^^^^^ help: remove these parentheses
    |
    = note: `-D unused-parens` implied by `-D warnings`
```

Seen on at least #9612 and #9639:
https://github.com/apache/arrow/pull/9612/checks?check_run_id=2042047472
https://github.com/apache/arrow/pull/9639/checks?check_run_id=2042649120

Closes #9642 from alamb/fix_clippy

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
Force-pushed: 5afcc44 to aecd501
@sunchao could you please have a look at this when you get a chance? Thanks :)
@nevi-me sorry, I missed this one. I will take a look today.
rust/parquet/src/arrow/schema.rs
Outdated
        TimeUnit::Nanosecond => ConvertedType::TIMESTAMP_MICROS,
    })
    .with_logical_type(Some(LogicalType::TIMESTAMP(TimestampType {
        is_adjusted_to_u_t_c: matches!(zone, Some(z) if z.as_str() == "UTC"),
Hmm, this means we'll lose the timezone info, right? As I understand it, is_adjusted_to_u_t_c means using local timezone.
Yeah, I think that my logic is faulty. Reading https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc again, I now see that is_adjusted_to_u_t_c = true applies when we actually adjust the timezone.
So, I think it's safer to always use false, as we don't adjust any timezones?
I've read the thread, and my interpretation is this:
- What I was initially doing:
  - no timezone: false
  - "UTC": true
  - other timezone: false
- What's done in C++:
  - no timezone: false
  - "UTC": true
  - other timezone: true
  - and normalize the timestamp value to UTC when converting to Parquet
Arrow timestamps are always in UTC, such that any non-UTC timezone is for display purposes only (e.g. if we want to print formatted timestamps).
So, we shouldn't need to normalise timezones as they'll always be adjusted to UTC.
My initial approach was to set is_adjusted_to_u_t_c = true whenever there's a timezone, but I second-guessed myself while working on this code. I had looked at the C++ implementation, but somehow interpreted the true value to only be set if timezone = UTC.
@sunchao are you fine with setting is_adjusted_to_u_t_c = true whenever there's a timezone?
I see. In that case we don't need to do the normalization part, right? Yes, +1 on setting is_adjusted_to_u_t_c = true whenever there's a timezone. We should also treat an empty timezone string the same way as a timezone that is not set.
Yes, we don't need the normalisation. I've modified the code to check that the timezone string is not empty.
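The rule agreed on in this thread can be sketched in a few lines. This is only an illustration: `is_adjusted_to_utc` is a hypothetical helper name, not a function from the parquet crate, and it assumes the timezone arrives as an optional string.

```rust
/// Hypothetical helper (not part of the parquet crate): derive the value of
/// `is_adjusted_to_u_t_c` from an Arrow timestamp's optional timezone.
fn is_adjusted_to_utc(timezone: Option<&str>) -> bool {
    // A missing timezone and an empty timezone string are treated the same:
    // both yield `false`. Any non-empty timezone yields `true`.
    matches!(timezone, Some(tz) if !tz.is_empty())
}
```

With this rule, `None` and `Some("")` both give `false`, while `Some("UTC")` and any other non-empty timezone give `true`.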
rust/parquet/src/arrow/schema.rs
Outdated
    "Unable to convert parquet INT32 logical type {}",
    other
    match (
    self.schema.get_basic_info().logical_type(),
nit: perhaps we can first "merge" the logical and converted types into a single logical type and then do the conversion, to avoid some of the code duplication. When the logical type is not present, we can always convert the converted type into a logical type, at the cost of losing some information.
We can do this as a follow-up though.
Yeah, I can do it as a follow-up once I've completed the overall 2.6.0 type support.
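A rough sketch of the "merge" idea suggested above, under stated assumptions: the enums here are simplified stand-ins defined only for this example (they are not the parquet crate's types), and `merge_types` is a hypothetical name. The fallback for legacy timestamp annotations assumes the compatibility rule described in Parquet's LogicalTypes.md, where they are read as UTC-normalized.

```rust
// Simplified stand-in types for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit {
    Millis,
    Micros,
    Nanos,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum ConvertedType {
    None,
    Date,
    TimestampMillis,
    TimestampMicros,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum LogicalType {
    Date,
    Timestamp { is_adjusted_to_utc: bool, unit: TimeUnit },
}

/// Hypothetical helper: prefer the logical type when present, otherwise
/// derive one from the converted type so that later code matches on a
/// single value.
fn merge_types(
    logical: Option<LogicalType>,
    converted: ConvertedType,
) -> Option<LogicalType> {
    match logical {
        Some(l) => Some(l),
        None => match converted {
            ConvertedType::None => None,
            ConvertedType::Date => Some(LogicalType::Date),
            // Assumption: legacy timestamp annotations are UTC-normalized.
            ConvertedType::TimestampMillis => Some(LogicalType::Timestamp {
                is_adjusted_to_utc: true,
                unit: TimeUnit::Millis,
            }),
            ConvertedType::TimestampMicros => Some(LogicalType::Timestamp {
                is_adjusted_to_utc: true,
                unit: TimeUnit::Micros,
            }),
        },
    }
}
```

The conversion to Arrow would then need only one `match` over the merged value instead of one over the `(logical, converted)` pair.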
rust/parquet/src/schema/types.rs
Outdated
    id: self.id,
    };
    // Populate the converted type if only the logical type is populated
We might need more tests for the case when the logical type is set.
I added a test for the group type (modified an existing one), but for primitive types, I need the schema printer + parser. So, I'll increase the test coverage as part of #9705
This makes it convenient for users to only specify the logical type, with the converted type being populated based on a 1:1 mapping.
(cherry picked from commit 7b4dda4d347994cc37b91110fd99ad1ab98080a2)
(cherry picked from commit 780e9662b0221e66becfbb669040f50e25b8778e)
(cherry picked from commit 115b9460966708178e56d9ce3417d94ad9661f59)
(cherry picked from commit 5afcc44124e51295c07245fd960dc5d461fdfe2e)
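As a rough illustration of populating the converted type from the logical type, here is a minimal sketch. The enums are simplified stand-ins defined only for this example, and `converted_from_logical` is a hypothetical name rather than the crate's API. Returning no converted type for nanosecond timestamps is a choice made for this sketch, since Parquet defines no legacy nanosecond timestamp annotation.

```rust
// Simplified stand-in types for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit {
    Millis,
    Micros,
    Nanos,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum LogicalType {
    String,
    Date,
    Timestamp { is_adjusted_to_utc: bool, unit: TimeUnit },
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum ConvertedType {
    None,
    Utf8,
    Date,
    TimestampMillis,
    TimestampMicros,
}

/// Hypothetical helper: fill in the legacy converted type when only the
/// logical type was specified by the user.
fn converted_from_logical(logical: Option<LogicalType>) -> ConvertedType {
    match logical {
        None => ConvertedType::None,
        Some(LogicalType::String) => ConvertedType::Utf8,
        Some(LogicalType::Date) => ConvertedType::Date,
        Some(LogicalType::Timestamp { unit, .. }) => match unit {
            TimeUnit::Millis => ConvertedType::TimestampMillis,
            TimeUnit::Micros => ConvertedType::TimestampMicros,
            // No legacy counterpart exists for nanoseconds.
            TimeUnit::Nanos => ConvertedType::None,
        },
    }
}
```

Note that the `is_adjusted_to_utc` flag has no place in the legacy annotation, so it is dropped in this direction.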
LGTM. Thanks @nevi-me. I think the clippy check failure is unrelated.
Populate LogicalType when converting from Arrow schema to Parquet schema.
This is on top of #9592