Member

@bkietz bkietz commented Feb 25, 2019

Concatenate arrays into a single array

Member

@wesm wesm left a comment

This seems to be on the right track. The test coverage is a bit thin, though. It would be better to return NotImplemented in untested paths if tests don't get written for them.

Member

How about just writing:

CheckConcatenate(type, arrays_as_json, expected_json)

The ConcatenateParam seems unnecessarily elaborate to me.

It doesn't appear that the non-zero offset cases are being tested right now, so for those cases you will have to construct and slice the test arrays manually.

Member

This implementation needs a more appropriate home than array.cc. We could create a subdirectory of miscellaneous array algorithms. I don't have a strong opinion, @xhochy @pitrou @fsaintjacques @emkornfield thoughts?

Contributor

under util/ might make sense for now. We can see how many random algorithms get put here vs just used in kernels (my second choice would be under compute I suppose).

Member Author

Algorithms could be grouped under src/arrow/array/algorithm_<category>.{h,cc} with a single convenience header/source src/arrow/algorithm.{h,cc} similar to builders

Member

If this is all private, we should avoid convenience headers that balloon compile times.

Member

Add a comment explaining what this is?

Member

I think this should return Invalid if any type is unequal

Member

Seems like the null bitmap is getting concatenated twice here

@wesm wesm changed the title ARROW-549: [C++] concatenate function ARROW-549: [C++] Add arrow::Concatenate function to combine multiple arrays into a single Array Feb 25, 2019
Member

wesm commented Feb 25, 2019

I think you need to define 2^31 - 1 overflow behavior for offsets (lists and binary)
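
To make the overflow question concrete, here is a minimal sketch of one possible behavior: merge the int32 offsets of several inputs and report failure once the combined length would exceed 2^31 - 1. The helper name ConcatenateOffsetsSketch and the use of std::vector in place of Arrow buffers are assumptions for illustration only, not the code in this PR. Each input is rebased on its own first offset, so an input whose offsets do not start at 0 is handled as well.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper (not Arrow's API): merges the int32 offsets of several
// list/binary arrays into `out`. Returns false if the combined value length
// would exceed 2^31 - 1; `out` is then left partially filled.
bool ConcatenateOffsetsSketch(const std::vector<std::vector<int32_t>>& inputs,
                              std::vector<int32_t>* out) {
  out->clear();
  out->push_back(0);
  int64_t total = 0;  // accumulate in 64 bits so the check itself cannot overflow
  for (const auto& offsets : inputs) {
    assert(!offsets.empty());  // an offsets buffer holds length + 1 entries
    const int64_t first = offsets.front();  // rebase each input on its first offset
    for (std::size_t i = 1; i < offsets.size(); ++i) {
      const int64_t shifted = total + (offsets[i] - first);
      if (shifted > std::numeric_limits<int32_t>::max()) return false;
      out->push_back(static_cast<int32_t>(shifted));
    }
    total += offsets.back() - first;
  }
  return true;
}
```

Returning an error is only one option; the alternative would be to promote to a wider offset type, which this sketch does not attempt.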

@bkietz bkietz force-pushed the ARROW-549-concatenate-arrays branch from 573be68 to 4ef76e7 Compare February 27, 2019 17:53
Member Author

bkietz commented Feb 27, 2019

@fsaintjacques re property testing

Contributor

Do we want this at the top-level namespace, maybe static Status Buffer::Concatenate(...) or BuffersConcatenate()?

Member Author

The renaming seems unnecessary; the overload will not conflict with concatenate(arrays)

Contributor

I'm not concerned about conflict. I'm concerned about readability. buffer::Concatenate or Buffer::Concatenate speaks a lot more than just Concatenate.

Member Author

IMHO it's more readable to avoid manual name mangling; the argument types are part of what users will read

Contributor

Would it help to introduce an alias like:

using BufferVector = std::vector<std::shared_ptr<Buffer>>

I think this would help make the signature easier to read.

Contributor

Ditto, there's no need for this function to depend on internal state. In fact, I'd want to see this function explicitly unit tested.

Member Author

bkietz commented Mar 5, 2019

@wesm @fsaintjacques how's this look?

Contributor

I think you could use the ArrayVector alias in the first input argument.

Member Author

Great, that will neaten this up quite a bit. I see there's also BufferVector which should be useful too

Contributor

The Mersenne-Twister PRNG engine has been identified as a source of test slowness. See discussion here.

Member Author

Alright, I'll switch to std::default_random_engine
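
For illustration, a randomized slicing helper built on std::default_random_engine might look roughly like the sketch below. The name RandomCutPointsSketch and its parameters are hypothetical and not taken from this PR; the idea is to pick sorted cut points in [0, length] so a test could slice an array at them and check that concatenating the slices reproduces the original.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical test helper (not from the PR): returns n_slices + 1 sorted cut
// points covering [0, length]. A fixed seed keeps a failing case reproducible.
std::vector<int32_t> RandomCutPointsSketch(int32_t length, int n_slices,
                                           unsigned seed) {
  assert(length >= 0 && n_slices >= 1);
  std::default_random_engine gen(seed);
  std::uniform_int_distribution<int32_t> dist(0, length);
  std::vector<int32_t> cuts(static_cast<std::size_t>(n_slices) + 1);
  for (auto& c : cuts) c = dist(gen);
  std::sort(cuts.begin(), cuts.end());
  // Pin the ends so the slices cover the whole array; order is preserved
  // because every draw already lies in [0, length].
  cuts.front() = 0;
  cuts.back() = length;
  return cuts;
}
```

Note that the engine behind std::default_random_engine is implementation-defined, so the generated sequence for a given seed can differ between standard libraries.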

Contributor

Another place where the ArrayVector alias could be used.

Member

@pitrou pitrou left a comment

Thanks for doing this. Here are some comments below. In particular, I would welcome a bit more clarity in the implementation.

Member

Is it? What if I have sliced a StringArray?

Member Author

It's in the spec: the first offset in an offsets buffer is always 0.

Member

Same question here.

Member

I would rather have this in a separate test file. array-test.cc is already too large, and too slow to compile.

Member Author

Alright, I'll move it to arrow-concatenate-test.

Member

Please call this ConcatenateBuffers.

Member Author

Should I rename Concatenate(arrays) to ConcatenateArrays as well?

Member

Yes, you can, though disambiguating requires only one rename ;-)

Member

Can you add a comment explaining what this does?

Member

Can you add a comment explaining what this does?

bkietz added 3 commits March 20, 2019 11:00
- renaming
- use memcpy not std::copy in ConcatenateBuffers
- concatenating zero buffers does not terminate
- refactor for clarity
- add comments
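
The commit notes above mention memcpy and the zero-buffer case. As a rough sketch of what a memcpy-based buffer concatenation could look like, the block below uses a plain byte vector as a stand-in for arrow::Buffer; the names BytesSketch and ConcatenateBuffersSketch are assumptions for illustration, not the merged implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in for arrow::Buffer (assumption: a plain byte vector, illustration only).
using BytesSketch = std::vector<uint8_t>;

// Hypothetical helper: copies every input into one contiguous output.
// An empty input list yields an empty output.
BytesSketch ConcatenateBuffersSketch(const std::vector<BytesSketch>& buffers) {
  std::size_t total = 0;
  for (const auto& b : buffers) total += b.size();

  BytesSketch out(total);
  uint8_t* dst = out.data();
  for (const auto& b : buffers) {
    // Skip empty inputs: data() may be null, which memcpy does not accept
    // even when the size is zero.
    if (b.empty()) continue;
    std::memcpy(dst, b.data(), b.size());
    dst += b.size();
  }
  return out;
}
```

Sizing the output once up front means each input is copied exactly once, with no reallocation along the way.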
@bkietz bkietz force-pushed the ARROW-549-concatenate-arrays branch from 14a78f7 to c1600bc Compare March 20, 2019 17:41
Member Author

bkietz commented Mar 20, 2019

@pitrou does this address your concerns?

Member

pitrou commented Mar 20, 2019

Thanks @bkietz. Can you open a JIRA for concatenation of union arrays?

Member Author

bkietz commented Mar 20, 2019

@pitrou pitrou closed this in 43f2a31 Mar 20, 2019
@bkietz bkietz deleted the ARROW-549-concatenate-arrays branch March 20, 2019 19:56