Skip to content

Conversation

@holdenk
Copy link
Contributor

@holdenk holdenk commented Apr 26, 2017

Support creating arrow arrays from iterables.
Possible follow up TODO (or possibly belongs in this issue); throw a clear exception when passed an iterator rather than an iterable.

@holdenk holdenk changed the title [ARROW-834][Python] Support creating from iterables [WIP][ARROW-834][Python] Support creating from iterables Apr 26, 2017
@holdenk
Copy link
Contributor Author

holdenk commented Apr 26, 2017

cc @BryanCutler this is the PR I mentioned earlier, if you have a chance to take a look I'd appreciate it (since I this is my first dive into the C side of Python Arrow API).

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks for doing some refactoring -- I saw some error handling issues if you don't mind doing a little extra scrubbing. I am not sure the inlining / CRTP issue is worth doing a lot of work over, we might open a JIRA to return to it and add some microbenchmarks so that we can demonstrate performance gains (if any)

while ((item = PyIter_Next(iter))) {
RETURN_NOT_OK(VisitElem(item, level));
}
Py_DECREF(iter);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put iter in an OwnedRef -- this fails before iteration finishes, this will leak memory

*size += 1;
Py_DECREF(item);
}
Py_DECREF(iter);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

OwnedRef ref(item);
RETURN_NOT_OK(appendItem(ref));
}
Py_DECREF(iter);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly leak


class BoolConverter : public TypedConverter<BooleanBuilder> {
template <typename BuilderType>
class TypedConverterVisitor : public TypedConverter<BuilderType> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that the base class here SeqConverter does not have a virtual destructor (which can cause a memory leak), can you add one while we're touching this code?

virtual ~SeqConverter() {}

return Status::OK();
}

virtual Status appendItem(OwnedRef &item) = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use AppendItem(const OwnedRef&) (also capital A) here for immutable reference

// No error checking
RETURN_NOT_OK(CheckPythonBytesAreFixedLength(bytes_obj, expected_length));
RETURN_NOT_OK(typed_builder_->Append(
reinterpret_cast<const uint8_t*>(PyBytes_AS_STRING(bytes_obj))));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return result of Append

length = PyBytes_GET_SIZE(bytes_obj);
bytes = PyBytes_AS_STRING(bytes_obj);
RETURN_NOT_OK(typed_builder_->Append(bytes, static_cast<int32_t>(length)));
return Status::OK();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return result of Append

static_cast<int64_t>(PySequence_Size(item_obj));
RETURN_NOT_OK(value_converter_->AppendData(item_obj, list_size));
}
return Status::OK();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return respective Append result

RETURN_NOT_OK(typed_builder_->AppendNull());
}

return Status::OK();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above



def array(object sequence, DataType type=None, MemoryPool memory_pool=None):
def array(object sequence, DataType type=None, MemoryPool memory_pool=None, size=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line length

@holdenk
Copy link
Contributor Author

holdenk commented Apr 27, 2017

Awesome, thanks for the review @wesm, I'll try and update this on Friday :)

@holdenk
Copy link
Contributor Author

holdenk commented Apr 28, 2017

It seems like this repo has been hit by the weird github bug so I'll resolve the conflicts once the repo comes back. (oi vey).

@wesm
Copy link
Member

wesm commented Apr 28, 2017

Good times. The ASF git repos are fine, so you can rebase against master in git://git.apache.org/arrow.git if you like

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this @holdenk, this will be great to have!

I'm just wondering if it is possible to not require size and just append to the buffers with an initial capacity and resize with some strategy as needed? If not maybe it would be better to require the size passed in with the iterable, that way the user will need to do an initial pass instead of doing it internally in Arrow? Just a thought..

if (PySequence_Check(obj)) {
*size = static_cast<int64_t>(PySequence_Size(obj));
} else if (PyObject_HasAttrString(obj, "__iter__")) {
PyObject* iter = PyObject_GetIter(obj);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is assuming that iter(obj) would return a new iterator, but what if the object is already an iterator? I think it would just return itself and can only be iterated over once right?

----------
sequence : sequence-like object of Python objects
sequence : sequence-like or iterable object of Python objects.
If both type and size are specified may be a single use iterable.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering why type is required for a single use iterable, could you just infer from the first element?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to have a maxsize argument instead of an exact size with the interable (so we bail out in the event of infinitely long iterators)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BryanCutler I think that might not work so well with nulls, to be fair thought the type inference code is a bit difficult to trace in my head so I could be wrong. If people would be OK with it I'd like to take a stab (probably in another PR) at making the type inference a little easier to follow?

@wesm, so right now we use the size to allocate the buffer at the start of each append. If we wanted to allow for underrun we'd have to re-alloc which would maybe not be worth it?

@wesm
Copy link
Member

wesm commented May 9, 2017

I can take another look at this if it is close to ready to go? There are some cpplint warnings that need to be fixed (make lint after running cmake)

@wesm
Copy link
Member

wesm commented May 15, 2017

@holdenk we are on the cusp of doing a 0.4.0 release, it would be great to get this in. Can you rebase and get the build passing? I can give this another review also

@wesm
Copy link
Member

wesm commented May 26, 2017

@holdenk would you be able to update this PR? thanks!

@wesm
Copy link
Member

wesm commented Jun 9, 2017

With some linting and a rebase, I think this is a good start to merge. I might want to make another pass on the API (see comment re maxsize) and type inference for sequences.

@wesm
Copy link
Member

wesm commented Jun 19, 2017

@holdenk @xhochy we need to rebase and merge this with some small fixes (e.g. adding a maxsize parameter for the iterable). I would like to do some refactoring of pandas_convert.h/cc since it's gotten so big, and we also want to add NumPy-based converters (versus NumPy-intended-for-pandas). Any takers? I can also try to work on this sometime this week

@holdenk
Copy link
Contributor Author

holdenk commented Jun 19, 2017

I'd be happy to do the rebate on this and either do the refactoring as part of it or a separate PR. Sorry for my radio silence, the book I'm working on only recently wrapped up so I've got bandwidth for not directly Spark projects again :)

@holdenk
Copy link
Contributor Author

holdenk commented Jun 20, 2017

So I've done the change for maxsize support in this PR. I'd be happy to re look at the type inference in another PR if that sounds good to people (trying to trace it in my head on a flight got a little too confusing so I think we can probably simplify it, or at least comment it some more for the future).

@holdenk
Copy link
Contributor Author

holdenk commented Jun 20, 2017

Commented the type inference code a bit, it seems like we could probably simplify it, but I don't know what the future plans are around mixed types so I'll just leave it as is for now.

@holdenk holdenk changed the title [WIP][ARROW-834][Python] Support creating from iterables [ARROW-834][Python] Support creating from iterables Jun 20, 2017
PyOutputStream(object fo)

cdef cppclass PyBytesReader(CBufferReader):
PyBytesReader(object fo)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a rebase artifact, can you move the relevant new code to libarrow.pxd and remove this file?

@wesm
Copy link
Member

wesm commented Jun 26, 2017

Thanks @holdenk, sorry for the delay with this -- it looks like the tests are only failing due to cpplint warnings -- I left a minor comment about a rebase issue with the pyarrow.pxd file that was removed. Could you change the PR title to start with "ARROW-834:"? This file is probably due for some follow up refactoring to abstract out the sequence iteration and simplify the type inference, so we can do that later.

@holdenk holdenk changed the title [ARROW-834][Python] Support creating from iterables ARROW-834: Python Support creating from iterables Jun 27, 2017
@wesm
Copy link
Member

wesm commented Jun 28, 2017

A minor remaining buglet:

/Users/travis/build/apache/arrow/cpp/src/arrow/python/builtin_convert.cc:366:12: error: no viable conversion from 'arrow::Status (const std::string &)' to 'arrow::Status'
    return Status::NotImplemented;
           ^~~~~~~~~~~~~~~~~~~~~~
/Users/travis/build/apache/arrow/cpp/src/arrow/status.h:186:16: note: candidate constructor not viable: no known conversion from 'arrow::Status (const std::string &)' to 'const arrow::Status &' for 1st argument
inline Status::Status(const Status& s) {
               ^
1 error generated.

I'm surprised this didn't fail the gcc Linux build

@holdenk
Copy link
Contributor Author

holdenk commented Jun 28, 2017

huh yeah I don't know what that wasn't caught in the Linux build. Function probably should have been pure virtual anyways, so I changed it to that.

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks @holdenk!

@asfgit asfgit closed this in bddb219 Jun 28, 2017
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
…ache#602)

### Rationale for this change
Multiple threads can attempt to create the same llvm expression in
Gandiva. This isn't allowed with the new JIT compiler, so synchronizing
will prevent this scenario.

### What changes are included in this PR?
Synchronize some methods to avoid adding duplicate llvm expressions.

### Are these changes tested?
Yes, through unit tests in Gandiva.

### Are there any user-facing changes?
No.

Closes apacheGH-601
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants