Skip to content

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented May 4, 2020

This fixes a problem with Spark 3.0 CTAS queries that use tinyint or smallint types. When Iceberg converts a Dataset schema, it promotes both smaller integers to int. Normally, Spark will insert casts in the analyzer so that the values are ints, but during a CTAS query, the table is created and the values may be passed as short or byte in the rows passed to Iceberg.

The problem happens when Iceberg accesses values from InternalRow. Before this commit, Iceberg would use the table's type to fetch a value, causing unsafe rows to return a corrupted byte or short value because 4 bytes had been read instead of 1 or 2.

The fix is to keep track of the Dataset schema and use it when accessing fields. This required building visitors for Avro and Parquet that traverse a Spark schema with a file schema.

@rdblue rdblue requested a review from rdsr May 4, 2020 18:46
@rdsr
Copy link
Contributor

rdsr commented May 5, 2020

Thanks @rdblue . I'll look into it tomorrow!

@rdblue
Copy link
Contributor Author

rdblue commented May 5, 2020

Thanks, @rdsr! Good for you to review since we will need to do the same thing for the ORC writers.

@rdsr
Copy link
Contributor

rdsr commented May 6, 2020

LGTM. Minor comments

@rdblue
Copy link
Contributor Author

rdblue commented May 6, 2020

Thanks, @rdsr! I've fixed the things you pointed out.

Copy link
Contributor

@rdsr rdsr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Once the build goes through

@rdblue rdblue merged commit 699e68a into apache:master May 7, 2020
rodmeneses pushed a commit to rodmeneses/iceberg that referenced this pull request Feb 19, 2024
szehon-ho pushed a commit to szehon-ho/iceberg that referenced this pull request Sep 16, 2024
rodmeneses pushed a commit to rodmeneses/iceberg that referenced this pull request Jun 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants