ARROW-2145/ARROW-2153/ARROW-2157/ARROW-2160/ARROW-2177: [Python] Decimal conversion not working for NaN values #1651

cpcloud · 2018-02-23T17:50:03Z

This PR closes the following JIRAs

ARROW-2145: [Python] Decimal conversion not working for NaN values
ARROW-2153: [C++/Python] Decimal conversion not working for exponential notation
ARROW-2157: [Python] Decimal arrays cannot be constructed from Python lists
ARROW-2160: [C++/Python] Fix decimal precision inference
ARROW-2177: [C++] Remove support for specifying negative scale values in DecimalType

I originally separated these fixes into a few smaller PRs, but it turned out
that the issues were all related, so I fixed them all in one PR.

wesm · 2018-02-25T00:45:05Z

Since we'll probably want to use libre2 for analytics, we should see at some point if we can replace the Boost regexen with libre2

cpcloud · 2018-02-26T14:39:19Z

@kou Do you have any idea why in this build: https://travis-ci.org/apache/arrow/jobs/345443821 OS X isn't finding the correct symbol? Is there some installation step for brew that I need to add?

Here's the error message:

dyld: Symbol not found: __ZNK5boost16re_detail_10650131cpp_regex_traits_implementationIcE17transform_primaryEPKcS4_
  Referenced from: /Users/travis/build/apache/arrow/cpp-install/lib/libarrow.0.dylib
  Expected in: /usr/local/opt/boost/lib/libboost_regex-mt.dylib

cpcloud · 2018-02-26T14:42:10Z

@wesm I'll open a JIRA for it.

cpcloud · 2018-02-26T14:43:17Z

Ah, looks like it was added in ARROW-29.

cpcloud · 2018-02-26T21:42:31Z

cpp/src/arrow/util/decimal.cc

    if (precision != NULLPTR) {
-      *precision = static_cast<int>(charp - numeric_string_start);
+      *precision = 0;


I'm not sure if this is the correct behavior here. I need to look into what other systems do with a string of all zeros for precision and scale.
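For one data point on the all-zeros question, here is a small sketch of how Python's standard decimal module represents such strings. It only shows CPython's behavior and says nothing about what Arrow or other systems should do; the helper name digits_and_exponent is made up for illustration.

```python
import decimal


def digits_and_exponent(text):
    """Return (number of coefficient digits, exponent) as Python's decimal sees them."""
    parts = decimal.Decimal(text).as_tuple()
    return len(parts.digits), parts.exponent


# Leading zeros collapse to a single zero digit; trailing fractional zeros
# are kept in the exponent rather than in the coefficient.
for text in ("0", "000", "0.000", "1.234"):
    print(text, digits_and_exponent(text))
```

In Python, a string of all zeros keeps a single coefficient digit, so the digit count is 1 rather than 0, and any fractional zeros show up only as a negative exponent.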

wesm

This doesn't look like too much fun, thanks for slogging through this! left some stylistic comments and other things

wesm · 2018-02-25T00:47:12Z

cpp/src/arrow/python/builtin_convert.cc

+    DCHECK(status.ok()) << "Unable to import decimal module";
+    status = ::arrow::py::internal::ImportFromModule(decimal_module, "Decimal",
+                                                     &decimal_type_);
+    DCHECK(status.ok()) << "Unable to import decimal.Decimal";


I wonder if we should make some global state that is initialized when the library is loaded

aren't these dchecks already done?

Yep they are done in the Import* functions. I'll remove these.

I kept these DCHECKS since these functions are returning Status but I removed the messages.

wesm · 2018-02-26T21:12:22Z

ci/travis_build_parquet_cpp.sh

@@ -38,7 +38,7 @@ cmake \
    -GNinja \
    -DCMAKE_BUILD_TYPE=debug \
    -DCMAKE_INSTALL_PREFIX=$ARROW_PYTHON_PARQUET_HOME \
-    -DPARQUET_BOOST_USE_SHARED=off \
+    -DPARQUET_BOOST_USE_SHARED=on \


what's the rationale for this, the symbol linking issue?

import pyarrow.parquet was segfaulting, I assumed because we're statically linking boost in the parquet build and dynamically in the arrow build. This only shows up when using the regex library.

I see, we should be consistent about which we do across the libraries. Part of why I wish we were building all these libraries in a monorepo setting

wesm · 2018-02-26T21:14:41Z

cpp/src/arrow/python/builtin_convert.cc

+
+  bool IsNull(PyObject* obj) const {
+    return obj == Py_None || obj == numpy_nan || internal::PyFloat_isnan(obj) ||
+           (internal::PyDecimal_Check(obj) && internal::PyDecimal_ISNAN(obj));


Ugh, Python, what did we do to deserve this? =)
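For readers who want to see why so many cases are needed, here is a rough pure-Python equivalent of the check above. It is a sketch, not the code in this PR: the function name is_null is hypothetical, and it uses only the standard library.

```python
import decimal
import math


def is_null(obj):
    """Treat None, a float NaN, and a Decimal NaN as null."""
    if obj is None:
        return True
    if isinstance(obj, float):
        # np.nan is an ordinary Python float, so this branch covers it too.
        return math.isnan(obj)
    if isinstance(obj, decimal.Decimal):
        # Decimal('nan') is not a float, so it needs its own check.
        return obj.is_nan()
    return False


values = [decimal.Decimal("1.234"), None, float("nan"), decimal.Decimal("nan")]
print([is_null(v) for v in values])
```

A plain equality test cannot replace these branches, because NaN compares unequal to itself for both float and Decimal.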

wesm · 2018-02-26T21:16:50Z

cpp/src/arrow/python/helpers.cc

+  DCHECK(status.ok()) << "Error during import of the decimal module";
+  status = ImportFromModule(decimal, "Decimal", &Decimal);
+  DCHECK(status.ok())
+      << "Error during import of the Decimal object from the decimal module";


these dchecks are performed twice -- should this be just DCHECK_OK on each of these?

I introduced a DCHECK_OK macro and used it here and in a few other places.

wesm · 2018-02-26T21:18:57Z

cpp/src/arrow/python/helpers.cc

+Status DecimalMetadata::Update(PyObject* object) {
+  DCHECK(PyDecimal_Check(object)) << "Object is not a Python Decimal";
+  DCHECK(!PyDecimal_ISNAN(object))
+      << "Decimal object cannot be NAN when inferring precision and scale";


This should never happen by design, right?

Yep, I was guarding against potential uses of it after the fact so that arrow crashes with a useful error message to the developer.

I suppose I could relax this and just do nothing if the value is nan.

wesm · 2018-02-26T21:41:24Z

cpp/src/arrow/util/decimal.cc

+    return Status::Invalid(ss.str());
+  }
+
+  const std::string sign = results["SIGN"];


is there some TMP magic that makes this abstraction zero-cost, or does this add overhead?

So, operator[](const std::string) returns a const_reference to a sub_match object, which defines a conversion operator to std::string. sub_match has first and second members, which are bidirectional iterators used to construct the string as std::string(match.first, match.second). Alternatively, we could use results["SIGN"].str(). The main difference is that the first uses __builtin_memcpy, while the second calls reserve and then ultimately __builtin_memset N times. I suspect that one memcpy of N bytes is cheaper than N memset calls on individual elements.

wesm · 2018-02-26T21:43:05Z

cpp/src/arrow/util/decimal.cc

-  if (s.empty()) {
-    return Status::Invalid("Empty string cannot be converted to decimal");
-  }
+static const boost::regex DECIMAL_REGEX(


I reckon we'll want to replace this with libre2 at some point. it's also a lot faster than boost::regex http://lh3lh3.users.sourceforge.net/reb.shtml

Yep, I'll make a JIRA for it.
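To make the named-group approach concrete, here is a minimal sketch of regex-based decimal parsing, including exponential notation. The pattern and the parse_decimal helper are assumptions written for illustration; they are not Arrow's DECIMAL_REGEX or its C++ parsing code, and only the SIGN group name is taken from the snippet above.

```python
import re

# Hypothetical pattern, not the one used in decimal.cc.
DECIMAL_RE = re.compile(
    r"""
    (?P<SIGN>[+-])?                  # optional sign
    (?P<WHOLE>\d*)                   # digits before the decimal point
    (?:\.(?P<FRACTION>\d*))?         # optional fractional digits
    (?:[eE](?P<EXPONENT>[+-]?\d+))?  # optional exponent
    """,
    re.VERBOSE,
)


def parse_decimal(text):
    """Return (unscaled_integer, scale) such that value == unscaled / 10**scale."""
    match = DECIMAL_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"{text!r} is not a valid decimal string")
    whole = match.group("WHOLE") or ""
    fraction = match.group("FRACTION") or ""
    if not whole and not fraction:
        raise ValueError(f"{text!r} contains no digits")
    unscaled = int(whole + fraction)
    if match.group("SIGN") == "-":
        unscaled = -unscaled
    scale = len(fraction) - int(match.group("EXPONENT") or 0)
    if scale < 0:
        # Fold a negative scale into the value so the scale is never negative.
        unscaled *= 10 ** (-scale)
        scale = 0
    return unscaled, scale


print(parse_decimal("1.234"))    # (1234, 3)
print(parse_decimal("1.23E+3"))  # (1230, 0)
print(parse_decimal("-0.5e-2"))  # (-5, 3)
```

Folding a negative scale into the unscaled value is one possible way to stay consistent with dropping negative scale support (ARROW-2177); whether the C++ code does the same is not shown in this thread.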

wesm · 2018-02-26T21:44:12Z

cpp/src/arrow/util/decimal.cc

+      const int32_t abs_scale = std::abs(*scale);
+      *out *= ScaleMultipliers[abs_scale];
+
+      if (precision != NULLPTR) {


FWIW I don't believe it's necessary to use the NULLPTR macro outside of headers

Cool I'll fix

wesm · 2018-02-26T21:49:14Z

python/pyarrow/tests/test_convert_builtin.py

+def test_decimal_array_with_none_and_nan():
+    values = [decimal.Decimal('1.234'), None, np.nan, decimal.Decimal('nan')]
+    array = pa.array(values)
+    assert array.type == pa.decimal128(4, 3)


can you add a test here with an explicit decimal type sufficient to accommodate the data?

wesm · 2018-02-26T21:49:24Z

python/pyarrow/tests/test_convert_pandas.py

+        series = pd.Series(data)
+        array = pa.array(series)
+        assert array.to_pylist() == data
+        assert array.type == pa.decimal128(3, 3)


cpcloud · 2018-02-26T22:13:57Z

cpp/src/arrow/python/numpy_to_arrow.cc

-        desired_scale = tmp_scale;
-      }
+    for (PyObject* object : objects) {
+      RETURN_NOT_OK(max_decimal_metadata.Update(object));


This should ignore nans

The Update method now ignores nans
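As an illustration of inference that skips NaN values, here is a pure-Python sketch. It is not the DecimalMetadata::Update implementation: the function name infer_precision_and_scale and the rule for combining values (widest integer part plus widest scale) are assumptions made for this example.

```python
import decimal


def infer_precision_and_scale(values):
    """Return a (precision, scale) pair wide enough for every finite Decimal."""
    max_int_digits = 0
    max_scale = 0
    for value in values:
        # Skip None, float NaN, and Decimal NaN/Infinity: none of them carries
        # digits that should influence the inferred type.
        if not isinstance(value, decimal.Decimal) or not value.is_finite():
            continue
        _, digits, exponent = value.as_tuple()
        if exponent >= 0:
            # e.g. Decimal('1E+3'): no fractional digits, exponent adds zeros.
            scale = 0
            int_digits = len(digits) + exponent
        else:
            scale = -exponent
            int_digits = max(len(digits) - scale, 0)
        max_int_digits = max(max_int_digits, int_digits)
        max_scale = max(max_scale, scale)
    return max_int_digits + max_scale, max_scale


values = [decimal.Decimal("1.234"), None, float("nan"), decimal.Decimal("nan")]
print(infer_precision_and_scale(values))  # (4, 3)
```

With this rule the mixed list from test_decimal_array_with_none_and_nan gives (4, 3), matching the decimal128(4, 3) asserted there. An input with no finite Decimal values yields (0, 0), which a caller would have to handle separately.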

cpcloud · 2018-02-26T22:54:51Z

cpp/src/arrow/python/builtin_convert.cc

+    } else if (PyObject_IsInstance(obj, decimal_type_.obj())) {
+      // Don't infer anything if we encounter a Decimal('nan')
+      if (!internal::PyDecimal_ISNAN(obj)) {
+        RETURN_NOT_OK(max_decimal_metadata_.Update(obj));


I'm going to change this to ignore nans

kou · 2018-02-27T01:34:44Z

Umm. I have never seen this error. I may not be able to help you because I don't have macOS.

What are the outputs of the following commands?

% nm /Users/travis/build/apache/arrow/cpp-install/lib/libarrow.0.dylib | grep boost
% nm /usr/local/opt/boost/lib/libboost_regex-mt.dylib
% strings /Users/travis/build/apache/arrow/cpp-install/lib/libarrow.0.dylib | grep boost
% strings /usr/local/opt/boost/lib/libboost_regex-mt.dylib | grep boost
% otool -L /Users/travis/build/apache/arrow/cpp-install/lib/libarrow.0.dylib
% otool -L /usr/local/opt/boost/lib/libboost_regex-mt.dylib

pitrou · 2018-03-01T09:37:52Z

ci/travis_install_osx.sh

+# under the License.
+
+brew update
+brew bundle --file=$TRAVIS_BUILD_DIR/c_glib/Brewfile


Shouldn't that be conditioned on ARROW_CI_C_GLIB_AFFECTED?

@pitrou This is already conditioned on it in .travis.yml just before this script is called. Is it really necessary to condition on it again?

Not really, though given the filename it might be better to avoid further mistakes :-)

cpcloud · 2018-03-01T19:48:39Z

@wesm @pitrou this is passing on travis: https://travis-ci.org/cpcloud/arrow/builds/347872453

wesm · 2018-03-01T22:24:29Z

Sweet, here is the Appveyor build: https://ci.appveyor.com/project/cpcloud/arrow/build/1.0.587. Going to take a quick look through and then merge

wesm

+1, thanks @cpcloud!

This was referenced Feb 23, 2018

ARROW-2145/ARROW-2157: [Python] Decimal conversion not working for NaN values #1610

Closed

ARROW-2153/ARROW-2160: [C++/Python] Fix decimal precision inference #1618

Closed

cpcloud force-pushed the ARROW-2145-2153-2157-2160 branch from 373ef1d to 2adb26f Compare February 26, 2018 14:34

cpcloud force-pushed the ARROW-2145-2153-2157-2160 branch from 018213b to 3f9414d Compare February 26, 2018 19:54

cpcloud commented Feb 26, 2018

wesm reviewed Feb 26, 2018

cpcloud commented Feb 26, 2018

cpcloud force-pushed the ARROW-2145-2153-2157-2160 branch 2 times, most recently from afd8eaf to 641f535 Compare February 28, 2018 15:28

cpcloud mentioned this pull request Feb 28, 2018

ARROW-2135: [Python] Fix NaN conversion when casting from Numpy array #1681

Closed

cpcloud force-pushed the ARROW-2145-2153-2157-2160 branch 2 times, most recently from d02780e to 94fc582 Compare February 28, 2018 23:40

pitrou reviewed Mar 1, 2018

cpcloud added 11 commits March 1, 2018 10:30

ARROW-2145: [Python] Decimal conversion not working for NaN values

8e816ec

IWYU

f562378

Revert header change

8893a45

Revert test change

0665f6e

Install libboost-regex-dev on travis

e6ac864

Use shared boost on parquet CI build

50e35d6

Install boost with c++11 option

8be22a6

Show boost install

7c7270a

Install boost first

77a41ee

NULLPTR to nullptr

4c74c63

DCHECK_OK

d905202

cpcloud added 18 commits March 1, 2018 10:30

DCHECK_OK

281f798

DCHECK_OK

1df6923

DCHECK_Ok

db664f2

Fix order of operands

092a962

Check return value of PyList_SetItem

418754f

Add DecimalMetadata::Update test for ignoring NaN values

b24ff25

Ignore nans in decimal metadata update

3190b1a

Refactor import decimal and acquire the gil before importing

a05b316

Formatting

4e6db3c

boost osx debugging

29e1ebc

DCHECK_OK for release builds

b4bcfd9

More script debugging

78cbf51

Fix boost root

03ee999

Perms

ae5db5f

Silence cmake complaints about boost version

99505a9

Add tests to accommodate decimal values

00be578

Brewfile

ab3e4a5

Pass version as argument

0d45688

cpcloud force-pushed the ARROW-2145-2153-2157-2160 branch from a59b0d8 to 0d45688 Compare March 1, 2018 15:38

Args must be a ruby Hash

1fc2a96

Make sure we only install if glibc is affected

97fcb96

wesm approved these changes Mar 1, 2018

wesm closed this in bfac60d Mar 1, 2018

asfimport mentioned this pull request Mar 1, 2018

[Python] Decimal conversion not working for NaN values #18112

Closed
