GH-37876: [Format] Add list-view specification to arrow format #37877
3fb1d4e to
f384285
Compare
bkietz
left a comment
Just a few nits in wording, otherwise looks good
mapleFU
left a comment
LGTM, thanks!
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
dd6ed5f to
d88e00a
Compare
I will rewrite the text saying that non-empty nulls are allowed, then.
LGTM
docs/source/format/Columnar.rst
(Outdated)
- Each logical data type has a well-defined physical layout. Here are
- the different physical layouts defined by Arrow:
+ Each logical data type has one or more well-defined physical layouts. Here
I would keep the singular. There is no disjunction in Arrow (unlike Parquet) between "logical" data type and physical layout. ListView and StringView are simply distinct types.
I will change this back to singular and all the other places I've changed it. But in the future, the "logical data type" terminology should probably be removed altogether because it's very confusing.
I definitely agree with that. The spec was often confusing to me at the start.
pitrou
left a comment
Thanks a lot @felipecrv !
@bkietz Any other comment?
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 6d551aa. There was 1 benchmark result indicating a performance regression:
The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.
…pache#37877)

### Rationale for this change

More details in the draft implementations of this spec:

- C++: apache#35345
- Go: apache#37468

### What changes are included in this PR?

- Some unrelated fixes to the spec text (I can extract these to another PR if necessary)
- Changes to the spec text
- Additions to the Flatbuffers specifications of the Arrow format

### Are these changes tested?

N/A.

### Are there any user-facing changes?

Changes in documentation and backwards compatible additions to the format spec.

* Closes: apache#37876

Lead-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
We illustrate an example of ``ListView<Int8>`` with length 4 having values::

    [[12, -7, 25], null, [0, -127, 127, 50], []]
It would be nice to have an example that shows what happens with duplicate lists and duplicate values:
[[12, -7, 25], null, [0, -127, 127, 12], [], [12, -7, 25]]
@adriangb Either can happen: duplicate lists can be stored twice in the child data, or several entries can point to the same child data.
Compact representation:

    buffers:
      offsets: [0, _, 3, _, 0]
      sizes:   [3, _, 4, 0, 3]
    children:
      values: [12, -7, 25, 0, -127, 127, 12]

Common representation:

    buffers:
      offsets: [0, _, 3, _, 7]
      sizes:   [3, _, 4, 0, 3]
    children:
      values: [12, -7, 25, 0, -127, 127, 12, 12, -7, 25]

(using _ to indicate that the value doesn't matter)
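To make the two layouts concrete, here is a minimal pure-Python sketch that reads both representations back into the same logical lists. It is not the Arrow API: `decode_list_view` and the explicit `validity` list are illustrative assumptions, and the `_` slots are filled with arbitrary in-bounds numbers to show that they do not affect the result.

```python
def decode_list_view(validity, offsets, sizes, values):
    """Materialize the logical lists of a list-view array (illustrative helper)."""
    out = []
    for valid, off, size in zip(validity, offsets, sizes):
        if not valid:
            out.append(None)  # offset and size of a null slot are ignored
        else:
            out.append(values[off:off + size])
    return out


validity = [True, False, True, True, True]

# Compact: the last list points at the same child values as the first one.
compact = decode_list_view(
    validity,
    offsets=[0, 2, 3, 0, 0],  # slots 1 and 3 hold arbitrary stand-ins for `_`
    sizes=[3, 1, 4, 0, 3],    # slot 1 holds an arbitrary stand-in for `_`
    values=[12, -7, 25, 0, -127, 127, 12],
)

# Common: the repeated list is stored a second time in the child array.
common = decode_list_view(
    validity,
    offsets=[0, 2, 3, 0, 7],
    sizes=[3, 1, 4, 0, 3],
    values=[12, -7, 25, 0, -127, 127, 12, 12, -7, 25],
)

expected = [[12, -7, 25], None, [0, -127, 127, 12], [], [12, -7, 25]]
assert compact == expected
assert common == expected
```

Both layouts decode to the same logical array; they differ only in how much child data is stored.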
De-duplication is an expensive operation, but you can imagine some kernel producing a compact list-view array by construction. Imagine a function that generates an array of prefixes of another array given sizes: every offset would be 0 and only the sizes would vary.
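The prefixes idea can be sketched in a few lines of plain Python. `prefixes_as_list_view` is a hypothetical name, not an existing Arrow kernel; the point is only that no child data is copied and every offset is 0.

```python
def prefixes_as_list_view(values, sizes):
    """Build (offsets, sizes, child) so that list i is the prefix values[:sizes[i]].

    Hypothetical helper for illustration; the child array is shared, not copied.
    """
    for size in sizes:
        if not 0 <= size <= len(values):
            raise ValueError(f"prefix size {size} out of range for {len(values)} values")
    offsets = [0] * len(sizes)  # every prefix starts at the beginning of the child
    return offsets, list(sizes), values


offsets, sizes, child = prefixes_as_list_view([10, 20, 30, 40], [0, 2, 4, 1])
logical = [child[off:off + size] for off, size in zip(offsets, sizes)]
assert offsets == [0, 0, 0, 0]
assert logical == [[], [10, 20], [10, 20, 30, 40], [10]]
```

An offsets-only list layout could not express this without repeating the child values, because consecutive lists there must occupy consecutive, non-overlapping ranges.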
The main practical consequence of ListViewArray is that lists can be written to the array in any order. If you need to set array[i] to the logical value [a, b, c], all you have to do is append [a, b, c] to the child array and set offsets[i] and sizes[i] to the appropriate values. This is not possible with ListArray, since writing a list at a random position i forces all the following values of the child array to move further along.
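A toy builder shows the out-of-order write described above. `ListViewBuilder` is a hypothetical class written for this sketch, not an Arrow builder; it assumes a fixed array length and treats unwritten slots as null.

```python
class ListViewBuilder:
    """Toy fixed-length list-view builder (hypothetical, for illustration only)."""

    def __init__(self, length):
        self.validity = [False] * length  # unwritten slots stay null
        self.offsets = [0] * length
        self.sizes = [0] * length
        self.values = []                  # child array, append-only

    def set(self, i, items):
        # Append to the child and point slot i at the new range;
        # no existing child values have to move.
        self.offsets[i] = len(self.values)
        self.sizes[i] = len(items)
        self.values.extend(items)
        self.validity[i] = True

    def to_pylist(self):
        return [
            self.values[off:off + size] if valid else None
            for valid, off, size in zip(self.validity, self.offsets, self.sizes)
        ]


builder = ListViewBuilder(3)
builder.set(2, ["a", "b", "c"])  # the last slot is written first
builder.set(0, ["x"])            # then the first slot; slot 1 stays null

assert builder.values == ["a", "b", "c", "x"]
assert builder.offsets == [3, 0, 0]
assert builder.sizes == [1, 0, 3]
assert builder.to_pylist() == [["x"], None, ["a", "b", "c"]]
```

The child values end up in write order rather than logical order, which is exactly what the offsets-plus-sizes layout permits.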
Makes total sense, thanks!
* Offsets buffer (int32)

  | Bytes 0-3  | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-63           |
  |------------|-------------|-------------|-------------|-----------------------|
Why is the value in Bytes 4-7 of the offsets buffer 7? Does that mean the value does not matter because the validity bit is 0?
Yes. I tried to make it clear that you can't expect sizes or offsets to be zero on NULL lists. The only constraint is that offset + size <= children[0].length.
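A bounds check along these lines captures that constraint. This is a sketch in plain Python with a hypothetical function name; it assumes the upper bound is inclusive (a range may end exactly at the end of the child array) and applies the check to every slot, null or not.

```python
def check_list_view_bounds(offsets, sizes, child_length):
    """Raise ValueError if any slot's range falls outside the child array.

    Hypothetical helper; null slots are checked like any other slot.
    """
    for i, (off, size) in enumerate(zip(offsets, sizes)):
        if off < 0 or size < 0 or off + size > child_length:
            raise ValueError(
                f"slot {i}: offset={off}, size={size} "
                f"is out of bounds for child length {child_length}"
            )


# A null slot may carry a non-zero offset such as 7, as long as it stays in bounds.
check_list_view_bounds(offsets=[0, 7, 3, 0], sizes=[3, 0, 4, 0], child_length=7)

# The same slot with size 1 would reach past the end of the child array.
try:
    check_list_view_bounds(offsets=[0, 7, 3, 0], sizes=[3, 1, 4, 0], child_length=7)
    rejected = False
except ValueError:
    rejected = True
assert rejected
```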
@felipecrv thanks for the reply!
Rationale for this change
More details in the draft implementations of this spec:
What changes are included in this PR?
Are these changes tested?
N/A.
Are there any user-facing changes?
Changes in documentation and backwards compatible additions to the format spec.