@felipecrv
Contributor

@felipecrv felipecrv commented Sep 26, 2023

Rationale for this change

More details in the draft implementations of this spec:

  • C++: apache#35345
  • Go: apache#37468

What changes are included in this PR?

  • Some unrelated fixes to the spec text (I can extract these to another PR if necessary)
  • Changes to the spec text
  • Additions to the Flatbuffers specifications of the Arrow format

Are these changes tested?

N/A.

Are there any user-facing changes?

Changes in documentation and backwards compatible additions to the format spec.

@felipecrv felipecrv requested review from bkietz and pitrou September 26, 2023 15:58
@felipecrv felipecrv marked this pull request as ready for review September 26, 2023 15:58
@github-actions

⚠️ GitHub issue #37876 has been automatically assigned in GitHub to PR creator.

@felipecrv felipecrv changed the title from "GH-37876: [Format] Add string-view to arrow format" to "GH-37876: [Format] Add list-view to arrow format" Sep 26, 2023
@felipecrv felipecrv changed the title from "GH-37876: [Format] Add list-view to arrow format" to "GH-37876: [Format] Add list-view specification to arrow format" Sep 26, 2023
@pitrou pitrou requested a review from wjones127 September 28, 2023 16:49
Member

@bkietz bkietz left a comment

Just a few nits in wording, otherwise looks good

@github-actions github-actions bot added the awaiting changes label and removed the awaiting review label Sep 29, 2023
Member

@mapleFU mapleFU left a comment

LGTM, thanks!

Co-authored-by: David Li <li.davidm96@gmail.com>
@github-actions github-actions bot added the awaiting change review label and removed the awaiting changes label Sep 30, 2023
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
@github-actions github-actions bot added the awaiting change review label and removed the awaiting changes label Oct 3, 2023
@felipecrv
Contributor Author

Would it be enough to require that sizes[i] == 0 when i is null to call it a "valid empty list-view"?

At least in Rust the rule is that a slice must have an end index less than or equal to the length of the data being sliced.

So in this case a slice would be valid iff sizes[i] + offsets[i] <= child_data[0].length().

It has been a while since I worked in C++, but if I recall correctly this is consistent with the way iterators work as well.

I will rewrite the text saying that non-empty nulls are allowed, then.

@felipecrv felipecrv requested review from pitrou and tustvold October 3, 2023 18:29
@zeroshade
Member

The vote on the mailing list has officially passed. @bkietz, you have an outstanding change request; can you take a look at the updates and update your review accordingly?

@pitrou @tustvold Any outstanding comments here or can we approve this?

@tustvold
Contributor

tustvold commented Oct 5, 2023

LGTM


Each logical data type has a well-defined physical layout. Here are
the different physical layouts defined by Arrow:
Each logical data type has one or more well-defined physical layouts. Here
Member

I would keep the singular. There is no disjunction in Arrow (unlike Parquet) between "logical" data type and physical layout. ListView and StringView are simply distinct types.

Contributor Author

I will change this back to singular, here and in all the other places I've changed it. But in the future, the "logical data type" terminology should probably be removed altogether because it's very confusing.

Member

I definitely agree with that. The spec was often confusing to me at the start.

@felipecrv felipecrv requested a review from pitrou October 5, 2023 14:05
Member

@pitrou pitrou left a comment

Thanks a lot @felipecrv !

@pitrou
Member

pitrou commented Oct 5, 2023

@bkietz Any other comment?

@github-actions github-actions bot added the awaiting merge label and removed the awaiting change review label Oct 5, 2023
@zeroshade zeroshade merged commit 6d551aa into apache:main Oct 5, 2023
@zeroshade zeroshade removed the awaiting merge label Oct 5, 2023
@felipecrv felipecrv deleted the format_list_view branch October 5, 2023 17:59
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 6d551aa.

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…pache#37877)

### Rationale for this change

More details in the draft implementations of this spec:

 - C++: apache#35345
 - Go: apache#37468

### What changes are included in this PR?

 - Some unrelated fixes to the spec text (I can extract these to another PR if necessary)
 - Changes to the spec text
 - Additions to the Flatbuffers specifications of the Arrow format

### Are these changes tested?

N/A.

### Are there any user-facing changes?

Changes in documentation and backwards compatible additions to the format spec.

* Closes: apache#37876

Lead-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…pache#37877)


We illustrate an example of ``ListView<Int8>`` with length 4 having values::

[[12, -7, 25], null, [0, -127, 127, 50], []]

It would be nice to have an example that shows what happens with duplicate lists and duplicate values:

[[12, -7, 25], null, [0, -127, 127, 12], [], [12, -7, 25]]

Contributor Author

@felipecrv felipecrv Mar 11, 2025

@adriangb anything can happen: they can be duplicated in the data or entries can point to the same data.

Compact representation:

buffers:
  offsets: [0, _, 3, _, 0]
  sizes:   [3, _, 4, 0, 3]

children:
  values: [12, -7, 25, 0, -127, 127, 12]

Common representation:

buffers:
  offsets: [0, _, 3, _, 7]
  sizes:   [3, _, 4, 0, 3]

children:
  values: [12, -7, 25, 0, -127, 127, 12, 12, -7, 25]

using _ to indicate that the value doesn't matter
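
To make the two layouts concrete, here is a minimal Python sketch over plain lists (not the Arrow API). The helper name `decode_list_view`, the explicit `validity` list, and the choice of 0 for every `_` slot are assumptions made for illustration only. It reads both representations above and checks that they describe the same logical value.

```python
def decode_list_view(offsets, sizes, validity, values):
    """Materialize logical lists from list-view buffers (illustrative helper)."""
    result = []
    for offset, size, is_valid in zip(offsets, sizes, validity):
        if not is_valid:
            # The offset and size of a null entry are never read.
            result.append(None)
        else:
            result.append(values[offset:offset + size])
    return result


ANY = 0  # stands in for "_": the stored value does not matter
validity = [True, False, True, True, True]

compact = decode_list_view(
    offsets=[0, ANY, 3, ANY, 0],
    sizes=[3, ANY, 4, 0, 3],
    validity=validity,
    values=[12, -7, 25, 0, -127, 127, 12],
)
common = decode_list_view(
    offsets=[0, ANY, 3, ANY, 7],
    sizes=[3, ANY, 4, 0, 3],
    validity=validity,
    values=[12, -7, 25, 0, -127, 127, 12, 12, -7, 25],
)
expected = [[12, -7, 25], None, [0, -127, 127, 12], [], [12, -7, 25]]
assert compact == common == expected
```

In the compact form the first and last entries share the same three child values; in the common form the last entry has its own copy at the end of the child array.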

Doing de-duplication is an expensive operation, but you can imagine some kernel, by construction, producing a compact list-view array. Imagine a function that generates an array of prefixes of another array given sizes -- every offset would be 0 and only the sizes would vary.
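
The prefixes idea can be sketched in a few lines of Python. This uses plain lists rather than Arrow buffers, and `prefixes_as_list_view` is a hypothetical name, not an existing kernel.

```python
def prefixes_as_list_view(values, sizes):
    """Return (offsets, sizes, values) where entry i is values[:sizes[i]]."""
    for size in sizes:
        if not 0 <= size <= len(values):
            raise ValueError(f"prefix size {size} is out of range")
    offsets = [0] * len(sizes)  # every prefix starts at the beginning
    # The child array is shared by all entries; nothing is copied.
    return offsets, list(sizes), values


offsets, sizes, child = prefixes_as_list_view([12, -7, 25, 0], [1, 3, 0, 4])
prefixes = [child[o:o + s] for o, s in zip(offsets, sizes)]
```

The four logical lists here total eight values, yet the child array holds only four, which is the kind of compactness by construction described above.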

The main practical consequence of the ListViewArray is that lists can be written to the array in any order. If you need to set array[i] to the logical value [a, b, c], all you have to do is append [a, b, c] to the child array and set offsets[i] and sizes[i] to the appropriate values. This is not possible with ListArray, since writing a list at an arbitrary position i forces all the following values of the child array to move further.
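
A toy builder shows the out-of-order write pattern. It is a sketch over Python lists; the class `ListViewBuilder` and its methods are hypothetical and do not correspond to any Arrow builder API.

```python
class ListViewBuilder:
    """Toy list-view builder over Python lists (hypothetical, not an Arrow class)."""

    def __init__(self, length):
        self.offsets = [0] * length
        self.sizes = [0] * length
        self.validity = [False] * length  # entries never set stay null
        self.values = []                  # child array, append-only

    def set(self, i, items):
        # Append the new list to the child array and point entry i at it.
        # No other entry's offset or size needs to change.
        self.offsets[i] = len(self.values)
        self.sizes[i] = len(items)
        self.validity[i] = True
        self.values.extend(items)

    def to_pylist(self):
        return [
            self.values[o:o + s] if v else None
            for o, s, v in zip(self.offsets, self.sizes, self.validity)
        ]


builder = ListViewBuilder(3)
builder.set(2, ["a", "b", "c"])  # last entry written first
builder.set(0, ["x"])
```

After these two calls the child array is `["a", "b", "c", "x"]`, so the offsets are no longer monotonic, which a ListArray would not allow. One design consequence of this sketch: calling `set` twice for the same `i` leaves the earlier values orphaned in the child array.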

Makes total sense, thanks!

* Offsets buffer (int32)

| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
|------------|-------------|-------------|-------------|-----------------------|
Member

Why is the value in Bytes 4-7 of the offsets buffer 7? Does that mean the value does not matter because the validity bit is 0?

Contributor Author

Yes. I tried to make it clear that you can't expect sizes or offsets to be zero on NULL lists. The only constraint is that offset + size <= children[0].length.
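
As a rough sketch of that constraint, using the `sizes[i] + offsets[i] <= child length` form stated earlier in this thread, the check below runs over every entry whether or not it is null. The function name and the buffer contents are assumptions chosen for illustration; the only detail taken from the discussion is that the null entry's offset is 7.

```python
def list_view_bounds_ok(offsets, sizes, child_length):
    """Check every entry, null or not, against the child array length."""
    return all(
        0 <= offset and 0 <= size and offset + size <= child_length
        for offset, size in zip(offsets, sizes)
    )


# Hypothetical buffers: entry 1 is null, yet its offset is 7 and its size is 0.
offsets = [4, 7, 0, 0]
sizes = [3, 0, 4, 0]
child_length = 7

ok = list_view_bounds_ok(offsets, sizes, child_length)
# A strict "<" would reject entries 0 and 1, which both end exactly at
# child_length, so the comparison has to be inclusive.
strict_rejects = any(o + s >= child_length for o, s in zip(offsets, sizes))
```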

Member

@felipecrv thanks for the reply!

Development

Successfully merging this pull request may close these issues.

[Format] Add ListView to FlatBuffers and specification text

10 participants