-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-2145/ARROW-2153/ARROW-2157/ARROW-2160/ARROW-2177: [Python] Decimal conversion not working for NaN values #1651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Since we'll probably want to use libre2 for analytics, we should see at some point if we can replace the Boost regexen with libre2 |
373ef1d to
2adb26f
Compare
|
@kou Do you have any idea why in this build: https://travis-ci.org/apache/arrow/jobs/345443821 OS X isn't finding the correct symbol? Is there some installation step for Here's the error message: |
|
@wesm I'll open a JIRA for it. |
|
Ah, looks like it was added in ARROW-29. |
018213b to
3f9414d
Compare
cpp/src/arrow/util/decimal.cc
Outdated
| if (precision != NULLPTR) { | ||
| *precision = static_cast<int>(charp - numeric_string_start); | ||
| *precision = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is the correct behavior here. I need to look into what other systems do with a string of all zeros for precision and scale.
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look like too much fun, thanks for slogging through this! left some stylistic comments and other things
| DCHECK(status.ok()) << "Unable to import decimal module"; | ||
| status = ::arrow::py::internal::ImportFromModule(decimal_module, "Decimal", | ||
| &decimal_type_); | ||
| DCHECK(status.ok()) << "Unable to import decimal.Decimal"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should make some global state that is initialized when the library is loaded
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aren't these dchecks already done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep they are done in the Import* functions. I'll remove these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kept these DCHECKS since these functions are returning Status but I removed the messages.
ci/travis_build_parquet_cpp.sh
Outdated
| @@ -38,7 +38,7 @@ cmake \ | |||
| -GNinja \ | |||
| -DCMAKE_BUILD_TYPE=debug \ | |||
| -DCMAKE_INSTALL_PREFIX=$ARROW_PYTHON_PARQUET_HOME \ | |||
| -DPARQUET_BOOST_USE_SHARED=off \ | |||
| -DPARQUET_BOOST_USE_SHARED=on \ | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the rationale for this, the symbol linking issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import pyarrow.parquet was segfaulting, I assumed because we're statically linking boost in the parquet build and dynamically in the arrow build. This only shows up when using the regex library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, we should be consistent about which we do across the libraries. Part of why I wish we were building all these libraries in a monorepo setting
|
|
||
| bool IsNull(PyObject* obj) const { | ||
| return obj == Py_None || obj == numpy_nan || internal::PyFloat_isnan(obj) || | ||
| (internal::PyDecimal_Check(obj) && internal::PyDecimal_ISNAN(obj)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, Python, what did we do to deserve this? =)
cpp/src/arrow/python/helpers.cc
Outdated
| DCHECK(status.ok()) << "Error during import of the decimal module"; | ||
| status = ImportFromModule(decimal, "Decimal", &Decimal); | ||
| DCHECK(status.ok()) | ||
| << "Error during import of the Decimal object from the decimal module"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these dchecks are performed twice -- should this be just DCHECK_OK on each of these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I introduced a DCHECK_OK macro and used it here and in a few other places.
cpp/src/arrow/python/helpers.cc
Outdated
| Status DecimalMetadata::Update(PyObject* object) { | ||
| DCHECK(PyDecimal_Check(object)) << "Object is not a Python Decimal"; | ||
| DCHECK(!PyDecimal_ISNAN(object)) | ||
| << "Decimal object cannot be NAN when inferring precision and scale"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should never happen by design, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I was guarding against potential uses of it after the fact so that arrow crashes with a useful error message to the developer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose I could relax this and just do nothing if the value is nan.
cpp/src/arrow/util/decimal.cc
Outdated
| return Status::Invalid(ss.str()); | ||
| } | ||
|
|
||
| const std::string sign = results["SIGN"]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there some TMP magic that makes this abstraction zero-cost, or does this add overhead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, operator[](const std::string) returns a const_reference to a sub_match object, which has a cast to std::string operator defined. sub_match has first and second attributes which are bidirectional iterators which are used to construct a string like std::string(match.first, match.second). Alternatively we use results["SIGN"].str(). The main difference is that the first uses __builtin_memcpy and the second uses reserve then ultimately __builtin_memset N number of times. I suspect that one call to memcpy N bytes is cheaper than N calls to memset individual elements.
cpp/src/arrow/util/decimal.cc
Outdated
| if (s.empty()) { | ||
| return Status::Invalid("Empty string cannot be converted to decimal"); | ||
| } | ||
| static const boost::regex DECIMAL_REGEX( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reckon we'll want to replace this with libre2 at some point. it's also a lot faster than boost::regex http://lh3lh3.users.sourceforge.net/reb.shtml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I'll make a JIRA for it.
cpp/src/arrow/util/decimal.cc
Outdated
| const int32_t abs_scale = std::abs(*scale); | ||
| *out *= ScaleMultipliers[abs_scale]; | ||
|
|
||
| if (precision != NULLPTR) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW it's not necessary to use this NULLPTR macro outside headers I don't believe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool I'll fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| def test_decimal_array_with_none_and_nan(): | ||
| values = [decimal.Decimal('1.234'), None, np.nan, decimal.Decimal('nan')] | ||
| array = pa.array(values) | ||
| assert array.type == pa.decimal128(4, 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a test here with an explicit decimal type sufficient to accommodate the data?
| series = pd.Series(data) | ||
| array = pa.array(series) | ||
| assert array.to_pylist() == data | ||
| assert array.type == pa.decimal128(3, 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
| desired_scale = tmp_scale; | ||
| } | ||
| for (PyObject* object : objects) { | ||
| RETURN_NOT_OK(max_decimal_metadata.Update(object)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should ignore nans
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Update method now ignores nans
| } else if (PyObject_IsInstance(obj, decimal_type_.obj())) { | ||
| // Don't infer anything if we encounter a Decimal('nan') | ||
| if (!internal::PyDecimal_ISNAN(obj)) { | ||
| RETURN_NOT_OK(max_decimal_metadata_.Update(obj)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to change to ignore nans
|
Umm. I have never seen the error. I may not help you because I'don't have macOS. What are the outputs of the followings? % nm /Users/travis/build/apache/arrow/cpp-install/lib/libarrow.0.dylib | grep boost
% nm /usr/local/opt/boost/lib/libboost_regex-mt.dylib
% strings /Users/travis/build/apache/arrow/cpp-install/lib/libarrow.0.dylib | grep boost
% strings /usr/local/opt/boost/lib/libboost_regex-mt.dylib | grep boost
% otool -L /Users/travis/build/apache/arrow/cpp-install/lib/libarrow.0.dylib
% otool -L /usr/local/opt/boost/lib/libboost_regex-mt.dylib |
afd8eaf to
641f535
Compare
d02780e to
94fc582
Compare
ci/travis_install_osx.sh
Outdated
| # under the License. | ||
|
|
||
| brew update | ||
| brew bundle --file=$TRAVIS_BUILD_DIR/c_glib/Brewfile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't that be conditioned on ARROW_CI_C_GLIB_AFFECTED?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou This is already conditioned on in .travis.yml just before this script is called. Is it really necessary to condition on it again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, though given the filename it might be better to avoid further mistakes :-)
a59b0d8 to
0d45688
Compare
|
@wesm @pitrou this is passing on travis: https://travis-ci.org/cpcloud/arrow/builds/347872453 |
|
Sweet, here is the Appveyor build: https://ci.appveyor.com/project/cpcloud/arrow/build/1.0.587. Going to take a quick look through and then merge |
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thanks @cpcloud!
This PR closes the following JIRAs
ARROW-2145: [Python] Decimal conversion not working for NaN values
ARROW-2153: [C++/Python] Decimal conversion not working for exponential notation
ARROW-2157: [Python] Decimal arrays cannot be constructed from Python lists
ARROW-2160: [C++/Python] Fix decimal precision inference
ARROW-2177: [C++] Remove support for specifying negative scale values in DecimalType
I originally separated these fixes into a few smaller PRs, but it turned out
that the issues were all related, so I fixed them all in one PR.