
Conversation

@sweb (Contributor) commented Nov 27, 2020

This is a follow-up to #8640.

Currently, there is a first working IPC reader/writer test using data from testing/arrow-ipc-stream/integration/0.14.1/generated_decimal.arrow_file

However, this led me to discover that my first decimal type implementation is wrong, in that it uses BigEndian, whereas that is Parquet-specific and therefore should not be used in arrow/array and so on. I will try to address this in this PR as well.
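
For illustration (not the exact code in this PR): the 128-bit two's-complement value backing a decimal should land in the Arrow buffer in little-endian byte order, which in plain std terms is the difference between `to_le_bytes` and `to_be_bytes`:

```
fn main() {
    // Illustration only: Arrow's Rust implementation keeps decimal values as
    // 128-bit two's-complement integers in little-endian byte order.
    let value: i128 = -123_450; // e.g. the decimal -1234.50 at scale 2
    let little = value.to_le_bytes(); // byte order expected in the Arrow buffer
    let big = value.to_be_bytes();    // the order the first implementation used
    assert_ne!(little, big);
}
```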

@sweb sweb marked this pull request as draft November 27, 2020 16:02
@github-actions github-actions bot added and then removed the needs-rebase ("A PR that needs to be rebased by the author") label Nov 29, 2020
@alamb alamb changed the title WIP: ARROW-10674: [Rust] Add IPC Reader/Writer for Decimal type to allow integration tests WIP: ARROW-10674: [Rust] Fix BigDecimal to be little endian; Add IPC Reader/Writer for Decimal type to allow integration tests Dec 2, 2020
@alamb (Contributor) left a comment

Thank you @sweb! Looks good to me -- I can't argue with a new passing integration test!

@alamb (Contributor) commented Dec 2, 2020

@sweb -- it seems from the Arrow definition that the endianness may be big or little endian:
https://github.com/apache/arrow/blob/master/format/Schema.fbs#L175-L178

/// Exact decimal value represented as an integer value in two's
/// complement. Currently only 128-bit (16-byte) and 256-bit (32-byte) integers
/// are used. The representation uses the endianness indicated
/// in the Schema.

So I suggest at least validating the schema in the IPC message and erroring with an "unimplemented" type error if the schema is Big Endian.

The endianness can be checked:
https://docs.rs/arrow/2.0.0/arrow/ipc/gen/Schema/struct.Schema.html#method.endianness

Note that I still think this implementation (which now passes a new test) is OK to merge as is (it is more correct for one case), but gracefully handling an unimplemented endianness would be better.
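
A minimal sketch of that check, assuming the generated `ipc::Endianness` enum and the `ArrowError::IoError` variant (the exact error variant and wording here are placeholders, not the PR's code):

```
use arrow::error::{ArrowError, Result};
use arrow::ipc;

/// Sketch: reject IPC messages whose schema declares big-endian buffers
/// before any field-level conversion is attempted.
fn check_schema_endianness(fb: ipc::Schema) -> Result<()> {
    if fb.endianness() == ipc::Endianness::Big {
        return Err(ArrowError::IoError(
            "reading Big Endian IPC data is not yet implemented".to_string(),
        ));
    }
    Ok(())
}
```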

@sweb (Contributor, Author) commented Dec 2, 2020

Hey @alamb, thank you for the review!

I will add an unimplemented path to indicate potential misuse. Thank you for the hint on how to check endianness; I was not aware that this was available.

I am currently trying to add the required conversions to convert from parquet (big endian, fixed size) to arrow (little endian, 128-bit), but maybe this is something I will add in a separate PR.
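
A rough sketch of that conversion, assuming the Parquet value arrives as a big-endian two's-complement FIXED_LEN_BYTE_ARRAY of at most 16 bytes (the function name is made up for the example):

```
/// Sign-extend a big-endian two's-complement byte slice (1..=16 bytes, as
/// read from a Parquet FIXED_LEN_BYTE_ARRAY decimal) into an i128 that can
/// be stored in a little-endian Arrow decimal buffer.
fn be_decimal_bytes_to_i128(bytes: &[u8]) -> i128 {
    assert!((1..=16).contains(&bytes.len()), "unsupported decimal width");
    // Fill the unused high bytes with the sign bit so negatives stay negative.
    let fill = if bytes[0] & 0x80 != 0 { 0xFF } else { 0x00 };
    let mut buf = [fill; 16];
    buf[16 - bytes.len()..].copy_from_slice(bytes);
    i128::from_be_bytes(buf)
}

fn main() {
    assert_eq!(be_decimal_bytes_to_i128(&[0x04, 0xD2]), 1234);
    assert_eq!(be_decimal_bytes_to_i128(&[0xFF, 0x38]), -200);
}
```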

let len = c_fields.len();
for i in 0..len {
    let c_field: ipc::Field = c_fields.get(i);
    match c_field.type_type() {
@sweb (Contributor, Author) left a comment

@alamb could you take another look at my attempt to add the unimplemented path for big endian?

I am not happy with placing the check in fb_to_schema and would have preferred to put it in get_data_type, but I found no way to pass the endianness on from the schema.
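
For reference, a sketch of how such a guard could look (written as a standalone function here for readability; in the PR it sits inside the field loop shown above, and the panic message matches the one checked by the test added later in this PR):

```
use arrow::ipc;

/// Sketch: reject Decimal fields when the schema declares big-endian
/// buffers. `fb` is the flatbuffer schema and `field` one of its fields,
/// as iterated in fb_to_schema.
fn reject_big_endian_decimal(fb: ipc::Schema, field: ipc::Field) {
    if fb.endianness() == ipc::Endianness::Big
        && field.type_type() == ipc::Type::Decimal
    {
        panic!("Big Endian is not supported for Decimal!");
    }
}
```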

@alamb (Contributor) commented Dec 3, 2020

I see the problem -- yes, since the endianness is on the schema object, not the field, and the field is all that is passed around there, there is no way to know the details of the schema.

I personally think this code is fine, if a bit un-ideal, and it could be cleaned up in the future. My only worry is that it would get lost / broken during such a cleanup.

What would you think about adding a test that triggers the error? Then we could be sure that any future cleanups will not break the check.

Thanks again @sweb

@sweb (Contributor, Author) replied

@alamb thank you for being so nice about it - I was just too lazy to add a test and should receive full scrutiny ;)

This is partly due to the fact that I am not very familiar with flatbuffers and still do not fully understand how to create the appropriate flatbuffer to test this. As a temporary solution, I have added two tests to ipc::reader that use the BigEndian files in arrow-ipc-stream/integration/1.0.0-bigendian. The one for decimal fails, the others work. I hope this is okay for now, until I am able to construct the correct schema message to test this directly in ipc::convert.

While adding the big endian tests for the other types, I noticed that the contents are not equal to the JSON content. That is why the test does not contain an equality check. Thus, there may be problems with Big Endian for other types as well.

@sweb sweb changed the title WIP: ARROW-10674: [Rust] Fix BigDecimal to be little endian; Add IPC Reader/Writer for Decimal type to allow integration tests ARROW-10674: [Rust] Fix BigDecimal to be little endian; Add IPC Reader/Writer for Decimal type to allow integration tests Dec 3, 2020
@sweb sweb marked this pull request as ready for review December 3, 2020 07:37
@github-actions bot commented Dec 3, 2020

@alamb (Contributor) left a comment

I think this is looking very nice -- any other thoughts, @jorgecarleitao (who reviewed the initial BigDecimal implementation) or @nevi-me, our resident Arrow IPC expert?


#[test]
#[should_panic(expected = "Big Endian is not supported for Decimal!")]
fn read_decimal_file_be() {

👍

@alamb (Contributor) commented Dec 4, 2020

I just double-checked, though, and the CI integration test is still failing here:

https://github.com/apache/arrow/pull/8784/checks?check_run_id=1497340378

go: github.com/prometheus/client_model@v0.0.0-20190812154241-14fe0d1b01d4: git fetch -f origin refs/heads/*:refs/heads/* refs/tags/*:refs/tags/* in /go/pkg/mod/cache/vcs/2a98e665081184f4ca01f0af8738c882495d1fb131b7ed20ad844d3ba1bb6393: exit status 128:
	error: RPC failed; curl 18 transfer closed with outstanding read data remaining
	fatal: error reading section header 'shallow-info'
Fetching https://golang.org/x/exp?go-get=1

Which seems maybe unrelated (an error fetching Go dependencies?).

I restarted that test to see if the problem was some intermittent infrastructure error.

@alamb (Contributor) commented Dec 4, 2020

CI passed!

@nevi-me nevi-me self-requested a review December 4, 2020 23:13
@nevi-me (Contributor) commented Dec 4, 2020

I'm on the road over the weekend, but I'll try to look at this, maybe on Sunday evening.

@jorgecarleitao (Member) left a comment

Looks good, so merged it :)

andygrove pushed a commit that referenced this pull request Dec 16, 2020
This PR introduces capabilities to construct a `DecimalArray` from Parquet files.

This is done by adding a new `Converter<Vec<Option<ByteArray>>, DecimalArray>` in `parquet::arrow::converter`:

```
pub struct DecimalArrayConverter {
    precision: i32,
    scale: i32,
}
```

It is then used in `ArrayBuilderReader` using a match guard to differentiate it from regular fixed size binaries:

```
  PhysicalType::FIXED_LEN_BYTE_ARRAY if cur_type.get_basic_info().logical_type() == LogicalType::DECIMAL =>
```

A test was added that uses a parquet file from `PARQUET_TEST_DATA` to check whether loading and casting to `DecimalArray` works. I did not find a corresponding json file with the correct values, but I used `pyarrow` to extract the correct values and hard coded them in the test.

I thought that this PR would require #8784 to be merged, but they are independent. I used the same JIRA issue here - I hope that this is okay.

Closes #8880 from sweb/rust-parquet-decimal-arrow-reader

Authored-by: Florian Müller <florian@tomueller.de>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
jorgecarleitao pushed a commit that referenced this pull request Dec 18, 2020