ARROW-10674: [Rust] Fix BigDecimal to be little endian; Add IPC Reader/Writer for Decimal type to allow integration tests #8784
Conversation
Thank you @sweb ! Looks good to me -- I can't argue with a new passing integration test!
@sweb -- it seems from the arrow definition that the endianness may be big or little endian. So I suggest at least validating the schema in the IPC message and erroring with an "unimplemented" type error if the schema is Big Endian. The endianness can be checked on the IPC schema message. Note that I still think this implementation (that now passes a new test) is ok to merge in as is (as it is more correct for one case), but gracefully handling an unimplemented endianness would be better.
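A minimal sketch of such a check, assuming the generated `ipc` flatbuffer bindings expose `Schema::endianness()` and an `Endianness` enum (verify the exact generated names):

```
// Sketch only: `fb` is the flatbuffer ipc::Schema read from the message.
// Bail out early when the schema declares big-endian data instead of
// silently mis-decoding Decimal buffers.
if fb.endianness() == ipc::Endianness::Big {
    unimplemented!("big-endian IPC data is not supported yet");
}
```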
Hey @alamb, thank you for the review! I will add an unimplemented path to indicate potential misuse - and thank you for your hint on how to check endianness; I was not aware that this was available. I am currently trying to add the required conversions to convert from Parquet (big-endian, fixed size) to Arrow (little-endian, 128-bit), but maybe this is something I will add in a separate PR.
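For reference, the conversion being described (Parquet stores decimals as big-endian, sign-extended, fixed-size byte arrays, while Arrow's `DecimalArray` wants a native `i128`) can be sketched in plain Rust. The helper below is hypothetical and not code from this PR:

```
// Convert a big-endian, sign-extended byte slice (as produced by Parquet's
// FIXED_LEN_BYTE_ARRAY decimals) into a native i128.
fn be_bytes_to_i128(bytes: &[u8]) -> i128 {
    assert!(bytes.len() <= 16, "decimal wider than 128 bits");
    // A negative two's-complement value has the top bit of its most
    // significant byte set; sign-extend with 0xFF, otherwise pad with 0x00.
    let fill = if bytes.first().map_or(false, |&b| b & 0x80 != 0) {
        0xFFu8
    } else {
        0x00u8
    };
    let mut buf = [fill; 16];
    buf[16 - bytes.len()..].copy_from_slice(bytes);
    i128::from_be_bytes(buf)
}

fn main() {
    assert_eq!(be_bytes_to_i128(&[0x01, 0x00]), 256);
    assert_eq!(be_bytes_to_i128(&[0xFF, 0x85]), -123);
}
```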
```
let len = c_fields.len();
for i in 0..len {
    let c_field: ipc::Field = c_fields.get(i);
    match c_field.type_type() {
```
@alamb could you take another look at my attempt to add the unimplemented path for big endian?
I am not happy with placing the check in `fb_to_schema` and would have preferred to put it in `get_data_type`, but I found no way to pass on the endianness from the schema.
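For context, a rough sketch of the shape such a check takes in `fb_to_schema`, following the excerpt above (generated flatbuffer names such as `ipc::Type::Decimal` and `ipc::Endianness::Big` are assumptions and may differ slightly):

```
// Sketch: the endianness lives on the flatbuffer Schema (`fb`), not on the
// individual Field, so it has to be consulted while the schema's fields are
// walked rather than inside get_data_type.
let c_fields = fb.fields().unwrap();
let len = c_fields.len();
for i in 0..len {
    let c_field: ipc::Field = c_fields.get(i);
    match c_field.type_type() {
        ipc::Type::Decimal if fb.endianness() == ipc::Endianness::Big => {
            unimplemented!("Big Endian is not supported for Decimal!")
        }
        _ => (),
    }
}
```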
I see the problem -- yes, since the endianness is on the schema object, not the field, and the field is all that is passed around there, there is no way to know the details of the schema.
I personally think this code is fine, if a bit un-ideal, and it could be cleaned up in the future. My only worry is that it would get lost / broken during such a cleanup.
What would you think about adding a test that triggers the error? Then we could be sure that any future cleanups will not break the check?
Thanks again @sweb
@alamb thank you for being so nice about it - I was just too lazy to add a test and should receive full scrutiny ;)
This is partly due to the fact that I am not very familiar with flatbuffers and still do not fully understand how to create the appropriate flatbuffer to test this. As a temporary solution, I have added two tests to `ipc::reader` that use the big-endian files in `arrow-ipc-stream/integration/1.0.0-bigendian`. The one for decimal fails; the others work. I hope this is okay for now, until I am able to construct the correct schema message to test this directly in `ipc::convert`.
While adding the big-endian test for the other types, I noticed that the contents are not equal to the JSON content, which is why the test does not contain an equality check. Thus, there may be problems with big endian for other types as well.
alamb left a comment
I think this is looking very nice -- any other thoughts, @jorgecarleitao (who reviewed the initial BigDecimal implementation) or @nevi-me, our resident Arrow IPC expert?
rust/arrow/src/ipc/reader.rs (Outdated)
```
#[test]
#[should_panic(expected = "Big Endian is not supported for Decimal!")]
fn read_decimal_file_be() {
```
👍
I just double checked, though, and the CI integration test is still failing here: https://github.com/apache/arrow/pull/8784/checks?check_run_id=1497340378, which seems maybe unrelated (an error fetching Go?). I restarted that test to see if the problem was some intermittent infrastructure error.
CI passed!
I'm on the road over the weekend, but I'll try to look at this, maybe on Sunday evening.
jorgecarleitao left a comment
Looks good, so merged it :)
This PR introduces capabilities to construct a `DecimalArray` from Parquet files.
This is done by adding a new `Converter<Vec<Option<ByteArray>>, DecimalArray>` in `parquet::arrow::converter`:
```
pub struct DecimalArrayConverter {
precision: i32,
scale: i32,
}
```
It is then used in `ArrayBuilderReader` using a match guard to differentiate it from regular fixed size binaries:
```
PhysicalType::FIXED_LEN_BYTE_ARRAY if cur_type.get_basic_info().logical_type() == LogicalType::DECIMAL =>
```
A test was added that uses a parquet file from `PARQUET_TEST_DATA` to check whether loading and casting to `DecimalArray` works. I did not find a corresponding JSON file with the correct values, but I used `pyarrow` to extract the correct values and hard-coded them in the test.
I thought that this PR would require #8784 to be merged, but they are independent. I used the same JIRA issue here - I hope that this is okay.
Closes #8880 from sweb/rust-parquet-decimal-arrow-reader
Authored-by: Florian Müller <florian@tomueller.de>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
This is a follow-up to #8640.
Currently, there is a first working IPC reader/writer test using data from `testing/arrow-ipc-stream/integration/0.14.1/generated_decimal.arrow_file`. However, this led me to discover that my first decimal type implementation is wrong, in that it uses big endian, whereas that is Parquet-specific and therefore should not be used in arrow/array and so on. I will try to address this in this PR as well.
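To illustrate the endianness point: the bytes stored in an Arrow decimal buffer should be the little-endian representation of the `i128` value, while Parquet's fixed-size decimal bytes are big-endian. A small standalone sketch, with an illustrative value:

```
fn main() {
    // 123.45 represented as an i128 with scale 2 (illustrative value).
    let value: i128 = 12345;

    // What an Arrow decimal buffer should hold (little-endian, the fix here).
    let le: [u8; 16] = value.to_le_bytes();
    // What Parquet's FIXED_LEN_BYTE_ARRAY decimals use (big-endian).
    let be: [u8; 16] = value.to_be_bytes();

    // The byte layouts differ, but both round-trip to the same value when
    // decoded with the matching byte order.
    assert_ne!(le, be);
    assert_eq!(i128::from_le_bytes(le), i128::from_be_bytes(be));
}
```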