-
Notifications
You must be signed in to change notification settings - Fork 3k
Arrow: FIXED type support #3029
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
f4e3295 to
bc1b19c
Compare
bc1b19c to
d07341b
Compare
arrow/src/main/java/org/apache/iceberg/arrow/ArrowSchemaUtil.java
Outdated
Show resolved
Hide resolved
d07341b to
3fbddaf
Compare
| vectorizedColumnIterator.varWidthTypeBatchReader().nextBatch(vec, -1, nullabilityHolder); | ||
| break; | ||
| case FIXED_WIDTH_BINARY: | ||
| vectorizedColumnIterator.fixedWidthTypeBinaryBatchReader().nextBatch(vec, typeWidth, nullabilityHolder); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference between these two readers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the difference is in the way stuff is being read: FixedSizeBinary vs FixedWidthBinary. For the FIXED type we should essentially be creating/using a FixedSizeBinaryVector from Arrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know the basics here, so I'm confused why case Fixed with binary is read with Fixed Size Binary, also I don't understand the difference between fixed size and width
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FIXED_WIDTH_BINARY might have been misleading so I renamed it to FIXED_SIZE_BINARY. It seems that the FixedWidthBinary code path existed as a workaround for Spark as can be seen here. I checked TestParquetVectorizedReads and that seems to be testing the FIXED type with Spark+Vectorization
iceberg/spark/v3.0/spark3/src/test/java/org/apache/iceberg/spark/data/AvroDataTest.java
Line 60 in f3e6770
| required(112, "fixed", Types.FixedType.ofLength(7)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for clarification what is being added here? Fixed width or fixed size? Or are they the same?
Can you clarify your comment on spark as well: fix width already was (partially) handled and now its fully handled?
3fbddaf to
b0ff549
Compare
| vectorizedColumnIterator.varWidthTypeBatchReader().nextBatch(vec, -1, nullabilityHolder); | ||
| break; | ||
| case FIXED_WIDTH_BINARY: | ||
| vectorizedColumnIterator.fixedWidthTypeBinaryBatchReader().nextBatch(vec, typeWidth, nullabilityHolder); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for clarification what is being added here? Fixed width or fixed size? Or are they the same?
Can you clarify your comment on spark as well: fix width already was (partially) handled and now its fully handled?
| } | ||
| } | ||
|
|
||
| public class FixedWidthTypeBinaryBatchReader extends BatchReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I am daft but it looks like you removed fixed width readers but I don't see where you added any readers?
|
@nastra does this still need to be reviewed? Somebody mentioned on slack this week (on Tuesday) that they had issues writing a fixed item as a truncated partition column. So they used Binary. I’ve been out sick but I’ll gather the details into an issue. I doubt this will directly solve that but made me think of this PR. |
|
@kbendick yes this still needs to be reviewed and at this point TBH I'm uncertain the approach is correct or not because I'm not sure if this statement is still correct (https://github.com/apache/iceberg/pull/3029/files#diff-80bc724de9a4dd358c4544fcf00e00139145c6763a5d0280e0bd0793a0fd4003L366-L368):
FWIW the PR itself is not related to the issue that was reported on Slack. |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
No description provided.