ARROW-834: Python Support creating from iterables #602
Conversation
cc @BryanCutler this is the PR I mentioned earlier, if you have a chance to take a look I'd appreciate it (since this is my first dive into the C side of the Python Arrow API).
wesm
left a comment
Nice! Thanks for doing some refactoring -- I saw some error handling issues, if you don't mind doing a little extra scrubbing. I am not sure the inlining / CRTP issue is worth doing a lot of work over; we might open a JIRA to return to it and add some microbenchmarks so that we can demonstrate performance gains (if any).
while ((item = PyIter_Next(iter))) {
  RETURN_NOT_OK(VisitElem(item, level));
}
Py_DECREF(iter);
Put iter in an OwnedRef -- if this fails before iteration finishes, it will leak memory
*size += 1;
  Py_DECREF(item);
}
Py_DECREF(iter);
Same
OwnedRef ref(item);
  RETURN_NOT_OK(appendItem(ref));
}
Py_DECREF(iter);
Possible leak here as well
class BoolConverter : public TypedConverter<BooleanBuilder> {
template <typename BuilderType>
class TypedConverterVisitor : public TypedConverter<BuilderType> {
I see that the base class here SeqConverter does not have a virtual destructor (which can cause a memory leak), can you add one while we're touching this code?
virtual ~SeqConverter() {}
return Status::OK();
}
virtual Status appendItem(OwnedRef &item) = 0;
use AppendItem(const OwnedRef&) (also capital A) here for immutable reference
// No error checking
RETURN_NOT_OK(CheckPythonBytesAreFixedLength(bytes_obj, expected_length));
RETURN_NOT_OK(typed_builder_->Append(
    reinterpret_cast<const uint8_t*>(PyBytes_AS_STRING(bytes_obj))));
Return the result of Append
length = PyBytes_GET_SIZE(bytes_obj);
bytes = PyBytes_AS_STRING(bytes_obj);
RETURN_NOT_OK(typed_builder_->Append(bytes, static_cast<int32_t>(length)));
return Status::OK();
Return result of Append
static_cast<int64_t>(PySequence_Size(item_obj));
  RETURN_NOT_OK(value_converter_->AppendData(item_obj, list_size));
}
return Status::OK();
Return respective Append result
RETURN_NOT_OK(typed_builder_->AppendNull());
}
return Status::OK();
Same as above
python/pyarrow/_array.pyx
Outdated
def array(object sequence, DataType type=None, MemoryPool memory_pool=None):
def array(object sequence, DataType type=None, MemoryPool memory_pool=None, size=None):
line length
Awesome, thanks for the review @wesm, I'll try to update this on Friday :)
It seems like this repo has been hit by the weird GitHub bug, so I'll resolve the conflicts once the repo comes back (oy vey).
Good times. The ASF git repos are fine, so you can rebase against master in git://git.apache.org/arrow.git if you like
BryanCutler
left a comment
Thanks for doing this @holdenk, this will be great to have!
I'm just wondering if it's possible to not require size, and instead append to the buffers with an initial capacity and resize with some strategy as needed? If not, maybe it would be better to require the size to be passed in with the iterable, so that the user does an initial pass instead of it being done internally in Arrow? Just a thought.
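The resize strategy suggested above can be sketched in plain Python (a hypothetical builder for illustration, not Arrow's actual BooleanBuilder API): start with a small capacity and double it whenever an append would overflow, so the total size never has to be known up front.

```python
# Hypothetical sketch of the doubling strategy suggested above;
# not the actual Arrow builder API.
class GrowableBuilder:
    def __init__(self, initial_capacity=8):
        self._buf = [None] * initial_capacity
        self._length = 0

    def append(self, value):
        if self._length == len(self._buf):
            # Doubling the capacity gives amortized O(1) appends.
            self._buf.extend([None] * len(self._buf))
        self._buf[self._length] = value
        self._length += 1

    def finish(self):
        # Trim to the number of values actually appended.
        return self._buf[:self._length]

b = GrowableBuilder()
for i in range(100):
    b.append(i)
assert b.finish() == list(range(100))
```

The tradeoff is the re-allocations (and copies) versus requiring the caller to know or compute the size in advance.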
if (PySequence_Check(obj)) {
  *size = static_cast<int64_t>(PySequence_Size(obj));
} else if (PyObject_HasAttrString(obj, "__iter__")) {
  PyObject* iter = PyObject_GetIter(obj);
This assumes that iter(obj) returns a new iterator, but what if the object is already an iterator? I think it would just return itself, and then it can only be iterated over once, right?
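The concern is easy to verify in plain Python: calling iter() on an object that is already an iterator returns the same object, so a counting pass exhausts it before the conversion pass runs.

```python
# iter() on something that is already an iterator returns the same
# object, so two passes (e.g. one to count, one to convert) don't work.
it = iter([1, 2, 3])
assert iter(it) is it            # an iterator is its own iterator
assert list(it) == [1, 2, 3]     # first pass consumes it
assert list(it) == []            # second pass sees nothing

# A sequence, by contrast, hands back a fresh iterator each time.
lst = [1, 2, 3]
assert iter(lst) is not iter(lst)
```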
----------
sequence : sequence-like object of Python objects
sequence : sequence-like or iterable object of Python objects.
    If both type and size are specified may be a single use iterable.
Just wondering why type is required for a single use iterable, could you just infer from the first element?
It might be nice to have a maxsize argument instead of an exact size with the iterable (so we bail out in the event of infinitely long iterators)
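The maxsize idea can be illustrated with itertools.islice (a plain-Python analogy, not the pyarrow API): a bound lets conversion terminate even on an endless source, while shorter inputs simply stop early.

```python
from itertools import count, islice

# Hypothetical helper showing the maxsize semantics discussed above:
# consume at most max_size items, whatever the source's length.
def take_at_most(iterable, max_size):
    return list(islice(iterable, max_size))

assert take_at_most(count(), 5) == [0, 1, 2, 3, 4]  # infinite source, bounded
assert take_at_most([1, 2], 5) == [1, 2]            # shorter input: underrun is fine
```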
@BryanCutler I think that might not work so well with nulls; to be fair though, the type inference code is a bit difficult to trace in my head, so I could be wrong. If people would be OK with it, I'd like to take a stab (probably in another PR) at making the type inference a little easier to follow?
@wesm, so right now we use the size to allocate the buffer at the start of each append. If we wanted to allow for underrun we'd have to re-alloc, which maybe wouldn't be worth it?
I can take another look at this if it is close to ready to go? There are some cpplint warnings that need to be fixed (…)
@holdenk we are on the cusp of doing a 0.4.0 release, it would be great to get this in. Can you rebase and get the build passing? I can give this another review also
@holdenk would you be able to update this PR? thanks!
With some linting and a rebase, I think this is a good start to merge. I might want to make another pass on the API (see comment re maxsize) and type inference for sequences.
@holdenk @xhochy we need to rebase and merge this with some small fixes (e.g. adding a maxsize parameter for the iterable). I would like to do some refactoring of pandas_convert.h/cc since it's gotten so big, and we also want to add NumPy-based converters (versus NumPy-intended-for-pandas). Any takers? I can also try to work on this sometime this week
I'd be happy to do the rebase on this and either do the refactoring as part of it or in a separate PR. Sorry for my radio silence; the book I'm working on only recently wrapped up, so I've got bandwidth for not-directly-Spark projects again :)
…e.g. support underflow from iterator constructors).
So I've done the change for maxsize support in this PR. I'd be happy to look at the type inference again in another PR if that sounds good to people (trying to trace it in my head on a flight got a little too confusing, so I think we can probably simplify it, or at least comment it some more for the future).
Commented the type inference code a bit; it seems like we could probably simplify it, but I don't know what the future plans are around mixed types, so I'll just leave it as is for now.
python/pyarrow/includes/pyarrow.pxd
Outdated
PyOutputStream(object fo)

cdef cppclass PyBytesReader(CBufferReader):
  PyBytesReader(object fo)
This is a rebase artifact, can you move the relevant new code to libarrow.pxd and remove this file?
Thanks @holdenk, sorry for the delay with this -- it looks like the tests are only failing due to cpplint warnings -- I left a minor comment about a rebase issue with the
A minor remaining buglet: I'm surprised this didn't fail the gcc Linux build
Huh, yeah, I don't know why that wasn't caught in the Linux build. The function probably should have been pure virtual anyway, so I changed it to that.
wesm
left a comment
+1, thanks @holdenk!
Support creating arrow arrays from iterables.
Possible follow-up TODO (or possibly belongs in this issue): throw a clear exception when passed an iterator rather than an iterable.
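One way to implement that check is to exploit the iterator protocol: an iterator is its own `__iter__` result, while a container hands back a fresh iterator. The helper below is a hypothetical sketch of the TODO above, not part of pyarrow.

```python
# Hypothetical guard for the follow-up TODO: reject one-shot
# iterators early with a clear error instead of silently producing
# a short or empty array.
def require_reiterable(obj):
    if iter(obj) is obj:
        raise TypeError(
            "expected a re-iterable sequence, got a one-shot iterator; "
            "pass an explicit type and size to convert it")
    return obj

assert require_reiterable([1, 2, 3]) == [1, 2, 3]

raised = False
try:
    require_reiterable(iter([1, 2, 3]))
except TypeError:
    raised = True
assert raised
```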