ARROW-834: Python Support creating from iterables #602
Conversation
cc @BryanCutler this is the PR I mentioned earlier, if you have a chance to take a look I'd appreciate it (since this is my first dive into the C side of the Python Arrow API).
wesm
left a comment
Nice! Thanks for doing some refactoring -- I saw some error handling issues, if you don't mind doing a little extra scrubbing. I am not sure the inlining / CRTP issue is worth doing a lot of work over; we might open a JIRA to return to it and add some microbenchmarks so that we can demonstrate performance gains (if any).
while ((item = PyIter_Next(iter))) {
  RETURN_NOT_OK(VisitElem(item, level));
}
Py_DECREF(iter);
Put iter in an OwnedRef -- if this fails before iteration finishes, it will leak memory
*size += 1;
  Py_DECREF(item);
}
Py_DECREF(iter);
Same
OwnedRef ref(item);
  RETURN_NOT_OK(appendItem(ref));
}
Py_DECREF(iter);
Possible leak here as well
class BoolConverter : public TypedConverter<BooleanBuilder> {
template <typename BuilderType>
class TypedConverterVisitor : public TypedConverter<BuilderType> {
I see that the base class here SeqConverter does not have a virtual destructor (which can cause a memory leak), can you add one while we're touching this code?
virtual ~SeqConverter() {}
return Status::OK();
}
virtual Status appendItem(OwnedRef &item) = 0;
use AppendItem(const OwnedRef&) (also capital A) here for immutable reference
// No error checking
RETURN_NOT_OK(CheckPythonBytesAreFixedLength(bytes_obj, expected_length));
RETURN_NOT_OK(typed_builder_->Append(
    reinterpret_cast<const uint8_t*>(PyBytes_AS_STRING(bytes_obj))));
Return the result of Append
length = PyBytes_GET_SIZE(bytes_obj);
bytes = PyBytes_AS_STRING(bytes_obj);
RETURN_NOT_OK(typed_builder_->Append(bytes, static_cast<int32_t>(length)));
return Status::OK();
Return result of Append
static_cast<int64_t>(PySequence_Size(item_obj));
  RETURN_NOT_OK(value_converter_->AppendData(item_obj, list_size));
}
return Status::OK();
Return respective Append result
RETURN_NOT_OK(typed_builder_->AppendNull());
}
return Status::OK();
Same as above
python/pyarrow/_array.pyx
Outdated
def array(object sequence, DataType type=None, MemoryPool memory_pool=None):
def array(object sequence, DataType type=None, MemoryPool memory_pool=None, size=None):
line length
Awesome, thanks for the review @wesm, I'll try to update this on Friday :)
It seems like this repo has been hit by the weird GitHub bug, so I'll resolve the conflicts once the repo comes back (oy vey).
Good times. The ASF git repos are fine, so you can rebase against master in git://git.apache.org/arrow.git if you like
BryanCutler
left a comment
Thanks for doing this @holdenk, this will be great to have!
I'm just wondering if it's possible to not require size, and instead append to the buffers with an initial capacity and resize with some strategy as needed? If not, maybe it would be better to require the size to be passed in with the iterable, so that the user does an initial pass instead of it being done internally in Arrow? Just a thought.
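The resize strategy suggested above can be sketched in plain Python (a hypothetical builder for illustration, not Arrow's actual BooleanBuilder API): start with a small capacity and double it whenever an append would overflow, so the total size never has to be known up front.

```python
# Hypothetical sketch of the doubling strategy suggested above;
# not the actual Arrow builder API.
class GrowableBuilder:
    def __init__(self, initial_capacity=8):
        self._buf = [None] * initial_capacity
        self._length = 0

    def append(self, value):
        if self._length == len(self._buf):
            # Doubling the capacity gives amortized O(1) appends.
            self._buf.extend([None] * len(self._buf))
        self._buf[self._length] = value
        self._length += 1

    def finish(self):
        # Trim to the number of values actually appended.
        return self._buf[:self._length]

b = GrowableBuilder()
for i in range(100):
    b.append(i)
assert b.finish() == list(range(100))
```

The tradeoff is the re-allocations (and copies) versus requiring the caller to know or compute the size in advance.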
if (PySequence_Check(obj)) {
  *size = static_cast<int64_t>(PySequence_Size(obj));
} else if (PyObject_HasAttrString(obj, "__iter__")) {
  PyObject* iter = PyObject_GetIter(obj);
This assumes that iter(obj) returns a new iterator, but what if the object is already an iterator? I think it would just return itself, and then it can only be iterated over once, right?
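The concern is easy to verify in plain Python: calling iter() on an object that is already an iterator returns the same object, so a counting pass exhausts it before the conversion pass runs.

```python
# iter() on something that is already an iterator returns the same
# object, so two passes (e.g. one to count, one to convert) don't work.
it = iter([1, 2, 3])
assert iter(it) is it            # an iterator is its own iterator
assert list(it) == [1, 2, 3]     # first pass consumes it
assert list(it) == []            # second pass sees nothing

# A sequence, by contrast, hands back a fresh iterator each time.
lst = [1, 2, 3]
assert iter(lst) is not iter(lst)
```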
----------
sequence : sequence-like object of Python objects
sequence : sequence-like or iterable object of Python objects.
    If both type and size are specified may be a single use iterable.
Just wondering why type is required for a single use iterable, could you just infer from the first element?
It might be nice to have a maxsize argument instead of an exact size with the iterable (so we bail out in the event of infinitely long iterators)
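The maxsize idea can be illustrated with itertools.islice (a plain-Python analogy, not the pyarrow API): a bound lets conversion terminate even on an endless source, while shorter inputs simply stop early.

```python
from itertools import count, islice

# Hypothetical helper showing the maxsize semantics discussed above:
# consume at most max_size items, whatever the source's length.
def take_at_most(iterable, max_size):
    return list(islice(iterable, max_size))

assert take_at_most(count(), 5) == [0, 1, 2, 3, 4]  # infinite source, bounded
assert take_at_most([1, 2], 5) == [1, 2]            # shorter input: underrun is fine
```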
@BryanCutler I think that might not work so well with nulls; to be fair though, the type inference code is a bit difficult to trace in my head, so I could be wrong. If people would be OK with it, I'd like to take a stab (probably in another PR) at making the type inference a little easier to follow?
@wesm, so right now we use the size to allocate the buffer at the start of each append. If we wanted to allow for underrun we'd have to re-alloc, which maybe wouldn't be worth it?
I can take another look at this if it is close to ready to go? There are some cpplint warnings that need to be fixed (…)
@holdenk we are on the cusp of doing a 0.4.0 release, it would be great to get this in. Can you rebase and get the build passing? I can give this another review also
@holdenk would you be able to update this PR? thanks!
With some linting and a rebase, I think this is a good start to merge. I might want to make another pass on the API (see comment re maxsize) and type inference for sequences.
@holdenk @xhochy we need to rebase and merge this with some small fixes (e.g. adding a maxsize parameter for the iterable). I would like to do some refactoring of pandas_convert.h/cc since it's gotten so big, and we also want to add NumPy-based converters (versus NumPy-intended-for-pandas). Any takers? I can also try to work on this sometime this week
I'd be happy to do the rebase on this and either do the refactoring as part of it or in a separate PR. Sorry for my radio silence; the book I'm working on only recently wrapped up, so I've got bandwidth for not-directly-Spark projects again :)
…e.g. support underflow from iterator constructors).
So I've done the change for maxsize support in this PR. I'd be happy to look at the type inference again in another PR if that sounds good to people (trying to trace it in my head on a flight got a little too confusing, so I think we can probably simplify it, or at least comment it some more for the future).
Commented the type inference code a bit; it seems like we could probably simplify it, but I don't know what the future plans are around mixed types, so I'll just leave it as is for now.
python/pyarrow/includes/pyarrow.pxd
Outdated
PyOutputStream(object fo)

cdef cppclass PyBytesReader(CBufferReader):
  PyBytesReader(object fo)
This is a rebase artifact, can you move the relevant new code to libarrow.pxd and remove this file?
Thanks @holdenk, sorry for the delay with this -- it looks like the tests are only failing due to cpplint warnings -- I left a minor comment about a rebase issue with the
A minor remaining buglet: I'm surprised this didn't fail the gcc Linux build
Huh, yeah, I don't know why that wasn't caught in the Linux build. The function probably should have been pure virtual anyway, so I changed it to that.
wesm
left a comment
+1, thanks @holdenk!
Support creating arrow arrays from iterables.
Possible follow-up TODO (or possibly belongs in this issue): throw a clear exception when passed an iterator rather than an iterable.
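One way to implement that check is to exploit the iterator protocol: an iterator is its own `__iter__` result, while a container hands back a fresh iterator. The helper below is a hypothetical sketch of the TODO above, not part of pyarrow.

```python
# Hypothetical guard for the follow-up TODO: reject one-shot
# iterators early with a clear error instead of silently producing
# a short or empty array.
def require_reiterable(obj):
    if iter(obj) is obj:
        raise TypeError(
            "expected a re-iterable sequence, got a one-shot iterator; "
            "pass an explicit type and size to convert it")
    return obj

assert require_reiterable([1, 2, 3]) == [1, 2, 3]

raised = False
try:
    require_reiterable(iter([1, 2, 3]))
except TypeError:
    raised = True
assert raised
```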