Contributor

@nastra nastra commented Aug 26, 2021

No description provided.

@github-actions github-actions bot added the arrow label Aug 26, 2021
@nastra nastra force-pushed the arrow-support-fixed branch 2 times, most recently from f4e3295 to bc1b19c Compare August 27, 2021 07:24
@nastra nastra requested a review from rymurr August 27, 2021 12:50
@nastra nastra force-pushed the arrow-support-fixed branch from bc1b19c to d07341b Compare October 14, 2021 09:46
@nastra nastra changed the title Arrow: Add tests for FIXED type support Arrow: FIXED type support Oct 14, 2021
@nastra nastra requested a review from rdblue October 14, 2021 10:45
@nastra nastra closed this Oct 14, 2021
@nastra nastra reopened this Oct 14, 2021
@nastra nastra closed this Oct 14, 2021
@nastra nastra reopened this Oct 14, 2021
@nastra nastra closed this Oct 14, 2021
@nastra nastra reopened this Oct 14, 2021
@nastra nastra force-pushed the arrow-support-fixed branch from d07341b to 3fbddaf Compare October 20, 2021 16:04
    vectorizedColumnIterator.varWidthTypeBatchReader().nextBatch(vec, -1, nullabilityHolder);
    break;
  case FIXED_WIDTH_BINARY:
    vectorizedColumnIterator.fixedWidthTypeBinaryBatchReader().nextBatch(vec, typeWidth, nullabilityHolder);
Member

What is the difference between these two readers?

Contributor Author

@nastra nastra Oct 20, 2021

The difference is in how the data is read: FixedSizeBinary vs. FixedWidthBinary. For the FIXED type we should essentially be creating and using a FixedSizeBinaryVector from Arrow.
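
To make the fixed-size layout mentioned above concrete, here is a minimal sketch in plain Java. It assumes only what the Arrow columnar format describes for fixed-size binary data: every value has the same byte width, and the values sit back to back in one buffer. The class `FixedSizeBinarySketch` and its methods are hypothetical names for illustration; they are not Arrow's `FixedSizeBinaryVector` API and not code from this PR.

```java
import java.util.Arrays;

/** Hypothetical illustration of a fixed-size binary layout; not Arrow's or Iceberg's API. */
final class FixedSizeBinarySketch {
  private final int byteWidth;  // every value has exactly this many bytes
  private final byte[] values;  // values stored back to back, no offsets needed
  private int valueCount;

  FixedSizeBinarySketch(int byteWidth, int capacity) {
    this.byteWidth = byteWidth;
    this.values = new byte[byteWidth * capacity];
  }

  /** Appends one value; rejects anything that is not exactly byteWidth bytes long. */
  void append(byte[] value) {
    if (value.length != byteWidth) {
      throw new IllegalArgumentException("Expected " + byteWidth + " bytes, got " + value.length);
    }
    if ((valueCount + 1) * byteWidth > values.length) {
      throw new IllegalStateException("Capacity exceeded");
    }
    System.arraycopy(value, 0, values, valueCount * byteWidth, byteWidth);
    valueCount++;
  }

  /** Value i starts at i * byteWidth, so a lookup is pure arithmetic. */
  byte[] get(int index) {
    if (index < 0 || index >= valueCount) {
      throw new IndexOutOfBoundsException("Index " + index + " out of range");
    }
    int start = index * byteWidth;
    return Arrays.copyOfRange(values, start, start + byteWidth);
  }

  int valueCount() {
    return valueCount;
  }

  public static void main(String[] args) {
    // Width 7 mirrors a fixed(7) column; the byte values are made up.
    FixedSizeBinarySketch vector = new FixedSizeBinarySketch(7, 2);
    vector.append(new byte[] {1, 2, 3, 4, 5, 6, 7});
    vector.append(new byte[] {8, 9, 10, 11, 12, 13, 14});
    System.out.println(Arrays.toString(vector.get(1)));
  }
}
```

Because the width is part of the type, a value of any other length is rejected up front, which is the main practical difference from a variable-width container.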

Member

I don't know the basics here, so I'm confused about why the Fixed Width Binary case is read with Fixed Size Binary. I also don't understand the difference between fixed size and fixed width.

Contributor Author

FIXED_WIDTH_BINARY might have been misleading, so I renamed it to FIXED_SIZE_BINARY. It seems that the FixedWidthBinary code path existed as a workaround for Spark, as can be seen here. I checked TestParquetVectorizedReads, and it seems to test the FIXED type with Spark and vectorization.

required(112, "fixed", Types.FixedType.ofLength(7)),

Contributor

So, for clarification, what is being added here? Fixed width or fixed size? Or are they the same?

Can you clarify your comment on Spark as well: fixed width was already (partially) handled, and now it's fully handled?

@nastra nastra force-pushed the arrow-support-fixed branch from 3fbddaf to b0ff549 Compare October 21, 2021 07:23
}
}

public class FixedWidthTypeBinaryBatchReader extends BatchReader {
Contributor

Maybe I am daft, but it looks like you removed the fixed-width readers, and I don't see where you added any readers?

@kbendick
Contributor

@nastra does this still need to be reviewed?

Somebody mentioned on Slack this week (on Tuesday) that they had issues writing a fixed item as a truncated partition column, so they used Binary.

I’ve been out sick, but I’ll gather the details into an issue.

I doubt this will directly solve that, but it made me think of this PR.

@nastra
Contributor Author

nastra commented Dec 10, 2021

@kbendick yes, this still needs to be reviewed. At this point, TBH, I'm uncertain whether the approach is correct, because I'm not sure if this statement is still correct (https://github.com/apache/iceberg/pull/3029/files#diff-80bc724de9a4dd358c4544fcf00e00139145c6763a5d0280e0bd0793a0fd4003L366-L368):

Spark does not support fixed width binary data type. To work around this limitation, the data is read as fixed width binary from parquet and stored in a {@link VarBinaryVector} in Arrow.

FWIW the PR itself is not related to the issue that was reported on Slack.
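
The workaround quoted above (fixed-width bytes read from Parquet but stored in a `VarBinaryVector`) can be pictured with a small plain-Java sketch. It assumes the usual variable-width binary layout from the Arrow columnar format: an offsets array with one more entry than there are values, plus one concatenated data buffer. The class `VarWidthBinarySketch` and the method `fromFixedWidth` are hypothetical names for illustration only; this is not the reader code from the PR and not Arrow's API.

```java
import java.util.Arrays;

/** Hypothetical illustration of a variable-width binary layout; not Arrow's or Iceberg's API. */
final class VarWidthBinarySketch {
  private final int[] offsets;  // valueCount + 1 entries; value i spans [offsets[i], offsets[i + 1])
  private final byte[] data;    // all value bytes concatenated

  private VarWidthBinarySketch(int[] offsets, byte[] data) {
    this.offsets = offsets;
    this.data = data;
  }

  /** Wraps back-to-back fixed-width values by generating the offsets a variable-width layout needs. */
  static VarWidthBinarySketch fromFixedWidth(byte[] packed, int byteWidth) {
    if (byteWidth <= 0 || packed.length % byteWidth != 0) {
      throw new IllegalArgumentException("Buffer length must be a multiple of a positive width");
    }
    int count = packed.length / byteWidth;
    int[] offsets = new int[count + 1];
    for (int i = 0; i <= count; i++) {
      offsets[i] = i * byteWidth;  // offsets are evenly spaced, but the width is no longer part of the type
    }
    return new VarWidthBinarySketch(offsets, packed.clone());
  }

  byte[] get(int index) {
    if (index < 0 || index >= valueCount()) {
      throw new IndexOutOfBoundsException("Index " + index + " out of range");
    }
    return Arrays.copyOfRange(data, offsets[index], offsets[index + 1]);
  }

  int valueCount() {
    return offsets.length - 1;
  }

  int[] offsets() {
    return offsets.clone();
  }

  public static void main(String[] args) {
    // Two made-up values of width 3, packed back to back.
    VarWidthBinarySketch vector = fromFixedWidth(new byte[] {1, 2, 3, 4, 5, 6}, 3);
    System.out.println(Arrays.toString(vector.offsets()));
    System.out.println(Arrays.toString(vector.get(1)));
  }
}
```

The values survive the round trip, but a consumer of the variable-width form can no longer tell from the container alone that every value has the same length, which is the information a fixed-size vector would preserve.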

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jul 19, 2024
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
