Skip to content

Conversation

@danepitkin
Copy link
Member

@danepitkin danepitkin commented Jan 26, 2024

Rationale for this change

Add bindings to the ListView and LargeListView array formats.

What changes are included in this PR?

  • Add initial implementation for ListView and LargeListView
  • Add basic unit tests

Are these changes tested?

  • Basic unit tests only (follow up PRs will be needed to implement full functionality)

Are there any user-facing changes?

Yes, documentation is updated in this PR to include the new PyArrow objects.

@github-actions github-actions bot added the awaiting review Awaiting review label Jan 26, 2024
@github-actions
Copy link

⚠️ GitHub issue #39812 has been automatically assigned in GitHub to PR creator.

@danepitkin danepitkin changed the title GH-39812: [Python] Add ListView and LargeListView GH-39812: [Python] Add scaffolding for ListView and LargeListView Jan 30, 2024
@danepitkin
Copy link
Member Author

>>> import pyarrow as pa
>>>
>>> values = [1, 2, 3, None]
>>> offsets = [0, 1, 2]
>>> sizes = [2, 2, 2]
>>> 
>>> pa.ListViewArray.from_arrays(offsets, sizes, values)
<pyarrow.lib.ListViewArray object at 0x13ad6e200>
[
  [
    1,
    2
  ],
  [
    2,
    3
  ],
  [
    3,
    null
  ]
]
>>>
>>> pa.LargeListViewArray.from_arrays(offsets, sizes, values)
<pyarrow.lib.LargeListViewArray object at 0x13ad6e440>
[
  [
    1,
    2
  ],
  [
    2,
    3
  ],
  [
    3,
    null
  ]
]
>>>
>>> pa.list_view(pa.string())
ListViewType(list_view<item: string>)
>>> pa.large_list_view(pa.string())
LargeListViewType(large_list_view<item: string>)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here the additional disclaimer could be added that this also can return more values / values out of order if certain parts of the values array are not pointed to by the offsets.

In general, given the layout of the "views" list type, there is no guarantee about the content of the values here.

The ListArray.values version has a pointer to flatten(), which would be useful here as well, to explain the difference. But I assume Flatten is not yet implemented for ListView?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flatten is not yet implemented, I figured the generic List / Array APIs can be tested and fixed in follow up PRs. I haven't marked that as a TODO anywhere.. let me know if you think it's worth adding as a comment!

I will update the comment, great idea.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be a flat array?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right. I confused this with the logical value representation..

Comment on lines 2511 to 2528
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I hadn't expected this to work. So those get filled by ListViewArray::FromArrays? In which case this is not zero-copy?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if this is passed together with a mask? EDIT: I see that is checked and raises an error in the C++ code

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yeah good point, maybe I should leave this example out to deter users from accidentally creating copies.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update this to use a null mask instead of None

Comment on lines +2630 to +2660
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a left-over from copy pasting this from offsets. Or can you also pass a null for the sizes to FromArrays?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does technically work the same. If you use None in either the offsets or sizes arrays, the value will become null.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>> import pyarrow as pa
>>> offsets = [0]
>>> values = [1]
>>> sizes = [None]
>>> pa.ListViewArray.from_arrays(offsets, sizes, values)
<pyarrow.lib.ListViewArray object at 0x13509a1a0>
[
  null
]
>>> sizes = [0]
>>> pa.ListViewArray.from_arrays(offsets, sizes, values)
<pyarrow.lib.ListViewArray object at 0x13509a3e0>
[
  []
]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this to the unit tests in test_array.py

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
>>> array.offsets
>>> array.sizes

Will need to update the below output as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

Comment on lines 4635 to 4637
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment can probably be generalized, mentioning that the "view" type in general might not be supported by all Arrow implementations (regardless of the size?)

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting review Awaiting review awaiting changes Awaiting changes labels Feb 1, 2024
@danepitkin
Copy link
Member Author

The next PR will focus on converting python objects to ListView arrays. That will enable ListView to be added to the existing ListArray tests. For now, some temporary basic unit testing is included.

@jorisvandenbossche jorisvandenbossche changed the title GH-39812: [Python] Add scaffolding for ListView and LargeListView GH-39812: [Python] Add bindings for ListView and LargeListView Feb 7, 2024
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
values taking into consideration the array's offset.
values taking into consideration the array's order and offset.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I also updated LargeListViewArray.values() docstrings

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also show array itself first (so you can compare with that output)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe here show both array and array.values outputs as well so it's easier to compare the different ones?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def flatten(self, pool=None):
def flatten(self, memory_pool=None):

We are not very consistent about it, but at least in this file we use memory_pool more than pool .. (something to clean up at some point)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def flatten(self, pool=None):
def flatten(self, memory_pool=None):

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting change review Awaiting change review awaiting changes Awaiting changes labels Feb 7, 2024
@danepitkin danepitkin force-pushed the danepitkin/python-listview branch from 3155cea to 9bf1673 Compare February 7, 2024 21:39
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@jorisvandenbossche jorisvandenbossche merged commit 42e35f1 into apache:main Feb 8, 2024
@jorisvandenbossche jorisvandenbossche removed the awaiting change review Awaiting change review label Feb 8, 2024
@github-actions github-actions bot added the awaiting merge Awaiting merge label Feb 8, 2024
@conbench-apache-arrow

This comment was marked as resolved.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…pache#39813)

### Rationale for this change

Add bindings to the ListView and LargeListView array formats.

### What changes are included in this PR?

* Add initial implementation for ListView and LargeListView
* Add basic unit tests

### Are these changes tested?

* Basic unit tests only (follow up PRs will be needed to implement full functionality)

### Are there any user-facing changes?

Yes, documentation is updated in this PR to include the new PyArrow objects.
* Closes: apache#39812

Lead-authored-by: Dane Pitkin <dane@voltrondata.com>
Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Python] Add scaffolding for ListView and LargeListView array formats

2 participants