Skip to content

Conversation

@sweb
Copy link
Contributor

@sweb sweb commented Nov 11, 2020

This PR adds support for the decimal data type by adding DataType::Decimal, array::DecimalArray and builder::DecimalBuilder.

The implementation is heavily based on FixedSizeBinaryArray in order to store values as fixed size binary arrays. However, values are returned as i128 and the builder expects i128 values for constructing a DecimalArray.

Some additional notes:

  • The byte size of values is calculated using the precision field. The necessary calculation is implemented as a static method in array::DecimalArray. This is probably the wrong place for it.
  • Values are always returned as i128, even though we could check the precision and potentially return a smaller integer.
  • I did not add a new dependency for a crate representing decimal values, like e.g. rust_decimal. This makes working with DecimalArray a bit more difficult but I did not want to introduce new dependencies in this change.
  • Decimal values from Json are expected as Strings, e.g. "5000". This is due to the fact that I am not sure whether Json runs into problems when representing integers beyond i64. For starting I assumed that this way is safer.

@sweb sweb marked this pull request as draft November 11, 2020 16:49
@github-actions
Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@sweb sweb changed the title WIP: ARROW-4193: [Rust] Add support for decimal data type ARROW-4193: [Rust] Add support for decimal data type Nov 15, 2020
@sweb sweb marked this pull request as ready for review November 15, 2020 17:00
@sweb
Copy link
Contributor Author

sweb commented Nov 15, 2020

@jorgecarleitao @nevi-me may I ask you for a review? From a functionality point of view I think this is more or less done, however I am very much at the beginning of writing stuff in Rust, so help is much appreciated!

@github-actions
Copy link

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wow! Impressive work, @sweb. Virtually flawless in my opinion.

You really grasped all the ideas on the crate and implemented this really well. Congratulations! 💯

P.S. Thank you very much for your first contribution!

@nevi-me or @alamb , could you take a look, particularly on the bit ops, that I am less confident about?

@alamb
Copy link
Contributor

alamb commented Nov 16, 2020

@jorgecarleitao I will put it on my queue for tomorrow. Hopefully the morning

@nevi-me
Copy link
Contributor

nevi-me commented Nov 17, 2020

Hi @sweb, I'm providing general comments, I'll look at this in detail over the days.

I see that you're using i128 as the backing type. One of the reasons that's prevented us from making progress on this, is not having a good BigDecimal impl in Rust to use.
The biggest hurdle will be converting 123.45f64 to decimals, as we'll need to do that for integration testing.

On the JIRA itself, the plan was to add the type, then also:

  • add support on the IPC reader and writer
  • enable decimal types in integration teting

I'm fine with us doing the above as follow-ups, as they'd be quite involved.

@sweb
Copy link
Contributor Author

sweb commented Nov 17, 2020

Hi @sweb, I'm providing general comments, I'll look at this in detail over the days.

I see that you're using i128 as the backing type. One of the reasons that's prevented us from making progress on this, is not having a good BigDecimal impl in Rust to use.
The biggest hurdle will be converting 123.45f64 to decimals, as we'll need to do that for integration testing.

On the JIRA itself, the plan was to add the type, then also:

* add support on the IPC reader and writer

* enable decimal types in integration teting

I'm fine with us doing the above as follow-ups, as they'd be quite involved.

Thank you very much for your first feedback!

I can try to create a second pull request for adding the IPC reader / writer after this one. I didn't think of it since I mainly use arrow as a processing layer after reading parquet files and never use the IPC parts.

I have some ideas concerning the conversion from f64 to decimal, but I will try to find a reference implementation in the Python sources - maybe we can borrow some ideas from there.

@alamb
Copy link
Contributor

alamb commented Nov 17, 2020

@jorgecarleitao I didn't have a chance to get to this PR today in my allotted time budget, but it sounds like @nevi-me may be planning on doing so, so I am going to lower the priority. I'll try and read it more carefully shortly

@jorgecarleitao
Copy link
Member

@jorgecarleitao I didn't have a chance to get to this PR today in my allotted time budget, but it sounds like @nevi-me may be planning on doing so, so I am going to lower the priority. I'll try and read it more carefully shortly

I am sorry, I did not mean to push for reviewing to neither of you. It was just a comment to prepare @sweb that there is a part of the code that I can't review well and thus I am relying on some help ❤️

@sweb
Copy link
Contributor Author

sweb commented Nov 17, 2020

@jorgecarleitao could you check the following strange thing I cannot explain to myself:

I added two additional asserts in test_decimal_offsets and test_fixed_size_binary_offsets (see latest commit, leading to 2 failing tests).

            let a = create_decimal_array(&[
            Some(8_887_000_000),
            None,
            None,
            Some(-8_887_000_000),
            None,
            None,
        ]);
        let b = create_decimal_array(&[
            Some(8_887_000_000),
            None,
            None,
            Some(15_887_000_000),
            None,
            None,
        ]);

        let a_slice = a.slice(3, 3);
        let b_slice = b.slice(3, 3);
        test_equal(&a_slice, &b_slice, false);

From my current understanding, these slices should not be the same, since one is [-8887, None, None] and the other one is [15887, None, None]. However the test states that they are equal. Same for the fixed_binary test. Do you have an idea what I am missing? It is very likely that I do not really understand how the offsets work and that these checks should in fact evaluate to true.

Thanks!

@jorgecarleitao
Copy link
Member

@sweb , it is a bug in the equality itself. I have a PR for it, hang on a sec.

@jorgecarleitao
Copy link
Member

PR for it: #8695

@sweb
Copy link
Contributor Author

sweb commented Nov 17, 2020

@jorgecarleitao Thanks for the quick feedback! I will remove the two asserts since this is already addressed in your PR.

@sweb
Copy link
Contributor Author

sweb commented Nov 21, 2020

@jorgecarleitao may I ask why you closed this PR? :)

@jorgecarleitao
Copy link
Member

I merged it. Wasn't it ready?

@sweb
Copy link
Contributor Author

sweb commented Nov 21, 2020

Ah sorry, I missed the merge and thought it was just closed. Thank you for merging!

@sweb sweb deleted the ARROW-4193-decimal-support branch November 21, 2020 11:18
@sweb
Copy link
Contributor Author

sweb commented Nov 21, 2020

@jorgecarleitao In closing remarks, I just wanted to say that this was the most pleasant contribution experience I had so far on an open source project. Thank you very much for this.

@nevi-me
Copy link
Contributor

nevi-me commented Nov 21, 2020

There's some follow-up to eventually support decimals in the integration testing (and Parquet reader and writer). Would you be interested in assisting us with that @sweb? You can open PRs on the Integration testing umbrella JIRA, or as individual/standalone ones. I'm fine with either appraoch. Thanks

@nevi-me
Copy link
Contributor

nevi-me commented Nov 21, 2020

@sweb
Copy link
Contributor Author

sweb commented Nov 21, 2020

@nevi-me I will gladly continue to work on supporting decimals. I will start with the IPC reader / writer.

jorgecarleitao pushed a commit that referenced this pull request Dec 11, 2020
…r/Writer for Decimal type to allow integration tests

This is a follow up to #8640

Currently, there is a first working IPC reader/writer test using data from `testing/arrow-ipc-stream/integration/0.14.1/generated_decimal.arrow_file`

However, this lead me to discover that my first decimal type implementation is wrong, in that it uses BigEndian, whereas this is parquet specific and therefore should not be used in arrow/array and so on. I will try to address this in this PR as well.

Closes #8784 from sweb/rust-decimal-ipc

Authored-by: Florian Müller <florian@tomueller.de>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants