Skip to content

Conversation

@javierluraschi
Copy link
Contributor

Fix for https://issues.apache.org/jira/browse/ARROW-3591, implemented following Python string to decimal conversion: https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/arrow_to_pandas.cc#L723.

Note: We could consider a more efficient conversion from decimal to R's integer or numeric types depending on the precision/scale from Arrows decimal; however, my guess is that this has been explored before for other languages and there is no straightforward memcpy() version of cross language decimal type conversions since the internal representation has been implemented as language specific.

Bonus: Added accessors in the schema wrapper for num_fields() and field() which can help diagnose schema issues.

Copy link
Contributor

@romainfrancois romainfrancois left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if there is a way to keep the information, rather than converting to a double vector.

Maybe inspired by the rational example in vctrs. https://vctrs.r-lib.org/articles/s3-vector.html#rational

@codecov-io
Copy link

Codecov Report

Merging #2819 into master will increase coverage by 15.19%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #2819       +/-   ##
===========================================
+ Coverage   73.41%   88.61%   +15.19%     
===========================================
  Files          62      345      +283     
  Lines        4010    58886    +54876     
===========================================
+ Hits         2944    52181    +49237     
- Misses        992     6705     +5713     
+ Partials       74        0       -74
Impacted Files Coverage Δ
rust/src/record_batch.rs
go/arrow/array/table.go
rust/src/array.rs
go/arrow/math/uint64_amd64.go
go/arrow/internal/testing/tools/bool.go
go/arrow/internal/bitutil/bitutil.go
go/arrow/memory/memory_avx2_amd64.go
rust/src/error.rs
go/arrow/datatype_nested.go
rust/src/lib.rs
... and 397 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2d0d3d0...efabe27. Read the comment docs.

@romainfrancois
Copy link
Contributor

🤔 this is kind of confusing from

/// Represents a signed 128-bit integer in two's complement.

/// Represents a signed 128-bit integer in two's complement.
/// Calculations wrap around and overflow is ignored.
///
/// For a discussion of the algorithms, look at Knuth's volume 2,
/// Semi-numerical Algorithms section 4.3.1.
///
/// Adapted from the Apache ORC C++ implementation
class ARROW_EXPORT Decimal128 {

but looking at the tests this is indeed a decimal

INSTANTIATE_TEST_CASE_P(Decimal128PrintingTest, Decimal128PrintingTest,

INSTANTIATE_TEST_CASE_P(Decimal128PrintingTest, Decimal128PrintingTest,
                        ::testing::Values(std::make_tuple(123, 1, "12.3"),
                                          std::make_tuple(123, 5, "0.00123"),
                                          std::make_tuple(123, 10, "1.23E-8"),
                                          std::make_tuple(123, -1, "1.23E+3"),
                                          std::make_tuple(-123, -1, "-1.23E+3"),
                                          std::make_tuple(123, -3, "1.23E+5"),
                                          std::make_tuple(-123, -3, "-1.23E+5"),
                                          std::make_tuple(12345, -3, "1.2345E+7")));

@hadley would this be a good example of a vctrs record implemented as a list of two integer vectors, or perhaps a better approach would be a list of one complex vector (which uses 128 bits, just like Decimal128) so that we could minimize copies.

Can I assume that if we do the "right" thing and implement this as a record, then the complex vector would only be an implementation detail that would be not so easy for the user to get to ?

@romainfrancois
Copy link
Contributor

Generally speaking, would things like this and int64 be safer if implemented as a vctrs record ?

@hadley
Copy link
Contributor

hadley commented Oct 24, 2018

Yes, definitely.

@romainfrancois
Copy link
Contributor

Great. @javierluraschi i’ll play around with a record based implementation for decimal128

@javierluraschi
Copy link
Contributor Author

@romainfrancois one thing to mention is that, while in some cases having a true decimal type might be needed, similarly to int64s, the use cases I've seen work much better by casting to a native data type in R, say numeric in this case and integer for int64s. So I believe we still need to let the users define how to map these types. For sparklyr, I'm planning to map to numeric/integer by default and let more advanced users opt-in to bit64:: and vctrs types.

So, I'm still proposing we merge this and then enhance with support for vctrs.

For sparklyr, I'm sure the right approach is to cast to integer/numeric by default, but I'm less convinced that's the right default for arrow. Regardless, letting package authors / users choose the defaults seems reasonable.

BTW. I'm thinking something around the lines of:

options(arrow.decimal = "numeric")
options(arrow.decimal = "vctrs")
options(arrow.int64 = "integer")
options(arrow.int64 = "bit64")

@romainfrancois
Copy link
Contributor

We could also do both:

  • have something requiring almost no copy (e.g use a record for decimal and a double for int64) hosted in a vctrs thing (so still manipulable by r)
  • have appropriate coercions, e.g. as.numeric, as.integer if this is what the user, package needs

@javierluraschi
Copy link
Contributor Author

javierluraschi commented Oct 25, 2018

@romainfrancois I'm hoping I don't need to cast each column with as.numeric(), we could have as.numeric() implemented in arrow and then cast columns when we read data frames, in which case, we might want to avoid having an intermediate representation with vectr and bit64.

The reasoning behind configuring this in the arrow package, is that we could document and teach users the tradeoffs between using bit64/vtrs vs native data types in a single place. Otherwise, I would have to document this behavior with custom settings in sparklyr and other packages would have to duplicate this behavior.

So net, net, I would prefer to use options() to let the users configure this behavior as proposed in this PR, thoughts?

@hadley
Copy link
Contributor

hadley commented Oct 25, 2018

For int64, we have to copy what the DBI package do in order to be consistent.

@romainfrancois
Copy link
Contributor

From dbFetch :

image

@romainfrancois
Copy link
Contributor

My understanding is that by default we should return an integer64, and then places that want to convert it to a double or integer vector can always use the converters, those are explicit. Can you confirm @krlmlr ?

@javierluraschi that does not mean that the user has to explicitly use the as.* methods but package code can.

We might have to consider changing the Array$as_vector() method in favor of something more flexible if we want to avoid the Array -> bit64::integer64 -> numeric roundtrip.

@javierluraschi
Copy link
Contributor Author

Fair enough, so for decimals, since odbc collects decimals as numeric not vctrs, can we call this PR a day and merge? Or is there something else to consider?

@romainfrancois
Copy link
Contributor

I would rather like something closer to the arrow layout initially (lossless) and the possibility to convert to doubles then.

@javierluraschi
Copy link
Contributor Author

That's fine but can we merge this one and then replace with something better?

@romainfrancois
Copy link
Contributor

That’s fine by me, as long as you realise this will likely change.

@javierluraschi
Copy link
Contributor Author

Yes, that's totally fine with me.

@krlmlr
Copy link

krlmlr commented Oct 25, 2018

@romainfrancois: The native DBI backends return an integer64, but the specification is kept vague on purpose so that the actual implementation can change later. Also, dbConnect() has a bigint argument that controls if conversion to numeric or integer happens immediately.

@javierluraschi
Copy link
Contributor Author

@wesm merge please? Looks like Romain and I are both 👍 to get this merged.

@kou kou closed this in d9ee70c Oct 26, 2018
@kou
Copy link
Member

kou commented Oct 26, 2018

I've merged!
Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants