
Conversation

@sweb (Contributor) commented Nov 27, 2020

This is a follow-up to #8640.

Currently, there is a first working IPC reader/writer test using data from testing/arrow-ipc-stream/integration/0.14.1/generated_decimal.arrow_file

However, this led me to discover that my first decimal type implementation is wrong, in that it uses BigEndian, whereas that is Parquet-specific and therefore should not be used in arrow/array and so on. I will try to address this in this PR as well.
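
For illustration (not the exact code in this PR): the 128-bit two's-complement value backing a decimal should land in the Arrow buffer in little-endian byte order, which in plain std terms is the difference between `to_le_bytes` and `to_be_bytes`:

```
fn main() {
    // Illustration only: Arrow's Rust implementation keeps decimal values as
    // 128-bit two's-complement integers in little-endian byte order.
    let value: i128 = -123_450; // e.g. the decimal -1234.50 at scale 2
    let little = value.to_le_bytes(); // byte order expected in the Arrow buffer
    let big = value.to_be_bytes();    // the order the first implementation used
    assert_ne!(little, big);
}
```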

@sweb sweb marked this pull request as draft November 27, 2020 16:02
@github-actions github-actions bot added and then removed the needs-rebase ("A PR that needs to be rebased by the author") label Nov 29, 2020
@alamb alamb changed the title WIP: ARROW-10674: [Rust] Add IPC Reader/Writer for Decimal type to allow integration tests WIP: ARROW-10674: [Rust] Fix BigDecimal to be little endian; Add IPC Reader/Writer for Decimal type to allow integration tests Dec 2, 2020
@alamb (Contributor) left a comment

Thank you @sweb! Looks good to me -- I can't argue with a new passing integration test!

@alamb (Contributor) commented Dec 2, 2020

@sweb -- it seems from the Arrow definition that the endianness may be big or little endian:
https://github.com/apache/arrow/blob/master/format/Schema.fbs#L175-L178

/// Exact decimal value represented as an integer value in two's
/// complement. Currently only 128-bit (16-byte) and 256-bit (32-byte) integers
/// are used. The representation uses the endianness indicated
/// in the Schema.

So I suggest at least validating the schema in the IPC message and erroring with an "unimplemented" type error if the schema is Big Endian.

The endianness can be checked:
https://docs.rs/arrow/2.0.0/arrow/ipc/gen/Schema/struct.Schema.html#method.endianness

Note that I still think this implementation (which now passes a new test) is OK to merge as is (it is more correct for one case), but gracefully handling an unimplemented endianness would be better.
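
A minimal sketch of that check, assuming the generated `ipc::Endianness` enum and the `ArrowError::IoError` variant (the exact error variant and wording here are placeholders, not the PR's code):

```
use arrow::error::{ArrowError, Result};
use arrow::ipc;

/// Sketch: reject IPC messages whose schema declares big-endian buffers
/// before any field-level conversion is attempted.
fn check_schema_endianness(fb: ipc::Schema) -> Result<()> {
    if fb.endianness() == ipc::Endianness::Big {
        return Err(ArrowError::IoError(
            "reading Big Endian IPC data is not yet implemented".to_string(),
        ));
    }
    Ok(())
}
```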

@sweb (Contributor, Author) commented Dec 2, 2020

Hey @alamb, thank you for the review!

I will add an unimplemented path to indicate potential misuse. Thank you for the hint on how to check endianness; I was not aware that this was available.

I am currently trying to add the required conversions to convert from parquet (big endian, fixed size) to arrow (little endian, 128-bit), but maybe this is something I will add in a separate PR.
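
A rough sketch of that conversion, assuming the Parquet value arrives as a big-endian two's-complement FIXED_LEN_BYTE_ARRAY of at most 16 bytes (the function name is made up for the example):

```
/// Sign-extend a big-endian two's-complement byte slice (1..=16 bytes, as
/// read from a Parquet FIXED_LEN_BYTE_ARRAY decimal) into an i128 that can
/// be stored in a little-endian Arrow decimal buffer.
fn be_decimal_bytes_to_i128(bytes: &[u8]) -> i128 {
    assert!((1..=16).contains(&bytes.len()), "unsupported decimal width");
    // Fill the unused high bytes with the sign bit so negatives stay negative.
    let fill = if bytes[0] & 0x80 != 0 { 0xFF } else { 0x00 };
    let mut buf = [fill; 16];
    buf[16 - bytes.len()..].copy_from_slice(bytes);
    i128::from_be_bytes(buf)
}

fn main() {
    assert_eq!(be_decimal_bytes_to_i128(&[0x04, 0xD2]), 1234);
    assert_eq!(be_decimal_bytes_to_i128(&[0xFF, 0x38]), -200);
}
```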

let len = c_fields.len();
for i in 0..len {
    let c_field: ipc::Field = c_fields.get(i);
    match c_field.type_type() {
@sweb (Contributor, Author) left a comment

@alamb could you take another look at my attempt to add the unimplemented path for big endian?

I am not happy with placing the check in fb_to_schema and would have preferred to put it in get_data_type, but I found no way to pass the endianness on from the schema.
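
For reference, a sketch of how such a guard could look (written as a standalone function here for readability; in the PR it sits inside the field loop shown above, and the panic message matches the one checked by the test added later in this PR):

```
use arrow::ipc;

/// Sketch: reject Decimal fields when the schema declares big-endian
/// buffers. `fb` is the flatbuffer schema and `field` one of its fields,
/// as iterated in fb_to_schema.
fn reject_big_endian_decimal(fb: ipc::Schema, field: ipc::Field) {
    if fb.endianness() == ipc::Endianness::Big
        && field.type_type() == ipc::Type::Decimal
    {
        panic!("Big Endian is not supported for Decimal!");
    }
}
```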

@alamb (Contributor) commented Dec 3, 2020

I see the problem -- yes, since the endianness is on the schema object, not the field, and the field is all that is passed around there, there is no way to know the details of the schema.

I personally think this code is fine, if a bit un-ideal, and it could be cleaned up in the future. My only worry is that it would get lost / broken during such a cleanup.

What would you think about adding a test that triggers the error? Then we could be sure that any future cleanups will not break the check.

Thanks again @sweb

@sweb (Contributor, Author) replied

@alamb thank you for being so nice about it - I was just too lazy to add a test and should receive full scrutiny ;)

This is partly due to the fact that I am not very familiar with flatbuffers and still do not fully understand how to create the appropriate flatbuffer to test this. As a temporary solution, I have added two tests to ipc::reader that use the BigEndian files in arrow-ipc-stream/integration/1.0.0-bigendian. The one for decimal fails, the others work. I hope this is okay for now, until I am able to construct the correct schema message to test this directly in ipc::convert.

While adding the big endian tests for the other types, I noticed that the contents are not equal to the JSON content. That is why the test does not contain an equality check. Thus, there may be problems with Big Endian for other types as well.

@sweb sweb changed the title WIP: ARROW-10674: [Rust] Fix BigDecimal to be little endian; Add IPC Reader/Writer for Decimal type to allow integration tests ARROW-10674: [Rust] Fix BigDecimal to be little endian; Add IPC Reader/Writer for Decimal type to allow integration tests Dec 3, 2020
@sweb sweb marked this pull request as ready for review December 3, 2020 07:37
@github-actions bot commented Dec 3, 2020

@alamb (Contributor) left a comment

I think this is looking very nice -- any other thoughts, @jorgecarleitao (who reviewed the initial BigDecimal implementation) or @nevi-me, our resident Arrow IPC expert?


#[test]
#[should_panic(expected = "Big Endian is not supported for Decimal!")]
fn read_decimal_file_be() {

👍

@alamb (Contributor) commented Dec 4, 2020

I just double-checked, though, and the CI integration test is still failing here:

https://github.com/apache/arrow/pull/8784/checks?check_run_id=1497340378

go: github.com/prometheus/client_model@v0.0.0-20190812154241-14fe0d1b01d4: git fetch -f origin refs/heads/*:refs/heads/* refs/tags/*:refs/tags/* in /go/pkg/mod/cache/vcs/2a98e665081184f4ca01f0af8738c882495d1fb131b7ed20ad844d3ba1bb6393: exit status 128:
	error: RPC failed; curl 18 transfer closed with outstanding read data remaining
	fatal: error reading section header 'shallow-info'
Fetching https://golang.org/x/exp?go-get=1

Which seems maybe unrelated (an error fetching Go dependencies?).

I restarted that test to see if the problem was some intermittent infrastructure error.

@alamb (Contributor) commented Dec 4, 2020

CI passed!

@nevi-me nevi-me self-requested a review December 4, 2020 23:13
@nevi-me (Contributor) commented Dec 4, 2020

I'm on the road over the weekend, but I'll try to look at this, maybe on Sunday evening.

@jorgecarleitao (Member) left a comment

Looks good, so merged it :)

andygrove pushed a commit that referenced this pull request Dec 16, 2020
This PR introduces capabilities to construct a `DecimalArray` from Parquet files.

This is done by adding a new `Converter<Vec<Option<ByteArray>>, DecimalArray>` in `parquet::arrow::converter`:

```
pub struct DecimalArrayConverter {
    precision: i32,
    scale: i32,
}
```

It is then used in `ArrayBuilderReader` using a match guard to differentiate it from regular fixed size binaries:

```
  PhysicalType::FIXED_LEN_BYTE_ARRAY if cur_type.get_basic_info().logical_type() == LogicalType::DECIMAL =>
```

A test was added that uses a parquet file from `PARQUET_TEST_DATA` to check whether loading and casting to `DecimalArray` works. I did not find a corresponding json file with the correct values, but I used `pyarrow` to extract the correct values and hard coded them in the test.

I thought that this PR would require #8784 to be merged, but they are independent. I used the same JIRA issue here - I hope that this is okay.

Closes #8880 from sweb/rust-parquet-decimal-arrow-reader

Authored-by: Florian Müller <florian@tomueller.de>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
jorgecarleitao pushed a commit that referenced this pull request Dec 18, 2020