Skip to content

Conversation

@cpcloud
Copy link
Contributor

@cpcloud cpcloud commented Feb 23, 2018

This PR closes the following JIRAs

ARROW-2145: [Python] Decimal conversion not working for NaN values
ARROW-2153: [C++/Python] Decimal conversion not working for exponential notation
ARROW-2157: [Python] Decimal arrays cannot be constructed from Python lists
ARROW-2160: [C++/Python] Fix decimal precision inference
ARROW-2177: [C++] Remove support for specifying negative scale values in DecimalType

I originally separated these fixes into a few smaller PRs, but it turned out
that the issues were all related, so I fixed them all in one PR.

@wesm
Copy link
Member

wesm commented Feb 25, 2018

Since we'll probably want to use libre2 for analytics, we should see at some point if we can replace the Boost regexen with libre2

@cpcloud cpcloud force-pushed the ARROW-2145-2153-2157-2160 branch from 373ef1d to 2adb26f Compare February 26, 2018 14:34
@cpcloud
Copy link
Contributor Author

cpcloud commented Feb 26, 2018

@kou Do you have any idea why in this build: https://travis-ci.org/apache/arrow/jobs/345443821 OS X isn't finding the correct symbol? Is there some installation step for brew that I need to add?

Here's the error message:

dyld: Symbol not found: __ZNK5boost16re_detail_10650131cpp_regex_traits_implementationIcE17transform_primaryEPKcS4_
  Referenced from: /Users/travis/build/apache/arrow/cpp-install/lib/libarrow.0.dylib
  Expected in: /usr/local/opt/boost/lib/libboost_regex-mt.dylib

@cpcloud
Copy link
Contributor Author

cpcloud commented Feb 26, 2018

@wesm I'll open a JIRA for it.

@cpcloud
Copy link
Contributor Author

cpcloud commented Feb 26, 2018

Ah, looks like it was added in ARROW-29.

@cpcloud cpcloud force-pushed the ARROW-2145-2153-2157-2160 branch from 018213b to 3f9414d Compare February 26, 2018 19:54
if (precision != NULLPTR) {
*precision = static_cast<int>(charp - numeric_string_start);
*precision = 0;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is the correct behavior here. I need to look into what other systems do with a string of all zeros for precision and scale.

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look like too much fun, thanks for slogging through this! left some stylistic comments and other things

DCHECK(status.ok()) << "Unable to import decimal module";
status = ::arrow::py::internal::ImportFromModule(decimal_module, "Decimal",
&decimal_type_);
DCHECK(status.ok()) << "Unable to import decimal.Decimal";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should make some global state that is initialized when the library is loaded

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aren't these dchecks already done?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep they are done in the Import* functions. I'll remove these.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept these DCHECKS since these functions are returning Status but I removed the messages.

@@ -38,7 +38,7 @@ cmake \
-GNinja \
-DCMAKE_BUILD_TYPE=debug \
-DCMAKE_INSTALL_PREFIX=$ARROW_PYTHON_PARQUET_HOME \
-DPARQUET_BOOST_USE_SHARED=off \
-DPARQUET_BOOST_USE_SHARED=on \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the rationale for this, the symbol linking issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import pyarrow.parquet was segfaulting, I assumed because we're statically linking boost in the parquet build and dynamically in the arrow build. This only shows up when using the regex library.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, we should be consistent about which we do across the libraries. Part of why I wish we were building all these libraries in a monorepo setting


bool IsNull(PyObject* obj) const {
return obj == Py_None || obj == numpy_nan || internal::PyFloat_isnan(obj) ||
(internal::PyDecimal_Check(obj) && internal::PyDecimal_ISNAN(obj));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, Python, what did we do to deserve this? =)

DCHECK(status.ok()) << "Error during import of the decimal module";
status = ImportFromModule(decimal, "Decimal", &Decimal);
DCHECK(status.ok())
<< "Error during import of the Decimal object from the decimal module";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these dchecks are performed twice -- should this be just DCHECK_OK on each of these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I introduced a DCHECK_OK macro and used it here and in a few other places.

Status DecimalMetadata::Update(PyObject* object) {
DCHECK(PyDecimal_Check(object)) << "Object is not a Python Decimal";
DCHECK(!PyDecimal_ISNAN(object))
<< "Decimal object cannot be NAN when inferring precision and scale";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should never happen by design, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I was guarding against potential uses of it after the fact so that arrow crashes with a useful error message to the developer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I could relax this and just do nothing if the value is nan.

return Status::Invalid(ss.str());
}

const std::string sign = results["SIGN"];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there some TMP magic that makes this abstraction zero-cost, or does this add overhead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, operator[](const std::string) returns a const_reference to a sub_match object, which has a cast to std::string operator defined. sub_match has first and second attributes which are bidirectional iterators which are used to construct a string like std::string(match.first, match.second). Alternatively we use results["SIGN"].str(). The main difference is that the first uses __builtin_memcpy and the second uses reserve then ultimately __builtin_memset N number of times. I suspect that one call to memcpy N bytes is cheaper than N calls to memset individual elements.

if (s.empty()) {
return Status::Invalid("Empty string cannot be converted to decimal");
}
static const boost::regex DECIMAL_REGEX(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reckon we'll want to replace this with libre2 at some point. it's also a lot faster than boost::regex http://lh3lh3.users.sourceforge.net/reb.shtml

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I'll make a JIRA for it.

const int32_t abs_scale = std::abs(*scale);
*out *= ScaleMultipliers[abs_scale];

if (precision != NULLPTR) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW it's not necessary to use this NULLPTR macro outside headers I don't believe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool I'll fix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

def test_decimal_array_with_none_and_nan():
values = [decimal.Decimal('1.234'), None, np.nan, decimal.Decimal('nan')]
array = pa.array(values)
assert array.type == pa.decimal128(4, 3)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test here with an explicit decimal type sufficient to accommodate the data?

series = pd.Series(data)
array = pa.array(series)
assert array.to_pylist() == data
assert array.type == pa.decimal128(3, 3)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

desired_scale = tmp_scale;
}
for (PyObject* object : objects) {
RETURN_NOT_OK(max_decimal_metadata.Update(object));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should ignore nans

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Update method now ignores nans

} else if (PyObject_IsInstance(obj, decimal_type_.obj())) {
// Don't infer anything if we encounter a Decimal('nan')
if (!internal::PyDecimal_ISNAN(obj)) {
RETURN_NOT_OK(max_decimal_metadata_.Update(obj));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to change to ignore nans

@kou
Copy link
Member

kou commented Feb 27, 2018

Umm. I have never seen the error. I may not help you because I'don't have macOS.

What are the outputs of the followings?

% nm /Users/travis/build/apache/arrow/cpp-install/lib/libarrow.0.dylib | grep boost
% nm /usr/local/opt/boost/lib/libboost_regex-mt.dylib
% strings /Users/travis/build/apache/arrow/cpp-install/lib/libarrow.0.dylib | grep boost
% strings /usr/local/opt/boost/lib/libboost_regex-mt.dylib | grep boost
% otool -L /Users/travis/build/apache/arrow/cpp-install/lib/libarrow.0.dylib
% otool -L /usr/local/opt/boost/lib/libboost_regex-mt.dylib

@cpcloud cpcloud force-pushed the ARROW-2145-2153-2157-2160 branch 2 times, most recently from afd8eaf to 641f535 Compare February 28, 2018 15:28
@cpcloud cpcloud force-pushed the ARROW-2145-2153-2157-2160 branch 2 times, most recently from d02780e to 94fc582 Compare February 28, 2018 23:40
# under the License.

brew update
brew bundle --file=$TRAVIS_BUILD_DIR/c_glib/Brewfile
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't that be conditioned on ARROW_CI_C_GLIB_AFFECTED?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou This is already conditioned on in .travis.yml just before this script is called. Is it really necessary to condition on it again?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, though given the filename it might be better to avoid further mistakes :-)

@cpcloud cpcloud force-pushed the ARROW-2145-2153-2157-2160 branch from a59b0d8 to 0d45688 Compare March 1, 2018 15:38
@cpcloud
Copy link
Contributor Author

cpcloud commented Mar 1, 2018

@wesm @pitrou this is passing on travis: https://travis-ci.org/cpcloud/arrow/builds/347872453

@wesm
Copy link
Member

wesm commented Mar 1, 2018

Sweet, here is the Appveyor build: https://ci.appveyor.com/project/cpcloud/arrow/build/1.0.587. Going to take a quick look through and then merge

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks @cpcloud!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants