Spark: Pass correct types to get data from InternalRow #999
Merged
Conversation
Contributor:
Thanks, @rdblue. I'll look into it tomorrow!

Contributor (Author):
Thanks, @rdsr! It's good for you to review this, since we will need to do the same thing for the ORC writers.
rdsr reviewed on May 6, 2020, leaving review comments (all resolved) on:
spark/src/main/java/org/apache/iceberg/spark/data/AvroWithSparkSchemaVisitor.java
spark/src/main/java/org/apache/iceberg/spark/data/ParquetWithSparkSchemaVisitor.java
Contributor:
LGTM. Minor comments.

Contributor (Author):
Thanks, @rdsr! I've fixed the things you pointed out.
rdsr approved these changes on May 6, 2020:
+1, once the build goes through.
rodmeneses pushed a commit to rodmeneses/iceberg that referenced this pull request on Feb 19, 2024.
szehon-ho pushed a commit to szehon-ho/iceberg that referenced this pull request on Sep 16, 2024.
rodmeneses pushed a commit to rodmeneses/iceberg that referenced this pull request on Jun 23, 2025.
Description
This fixes a problem with Spark 3.0 CTAS queries that use tinyint or smallint types. When Iceberg converts a Dataset schema, it promotes both of the smaller integer types to int. Normally, Spark inserts casts in the analyzer so that the values arrive as ints, but during a CTAS query the table is created first, and the rows passed to Iceberg may still carry the values as short or byte.

The problem happens when Iceberg accesses values from InternalRow. Before this commit, Iceberg used the table's type to fetch a value, causing unsafe rows to return a corrupted byte or short value because 4 bytes were read instead of 1 or 2.

The fix is to keep track of the Dataset schema and use it when accessing fields. This required building visitors for Avro and Parquet that traverse a Spark schema together with a file schema.
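To illustrate the access pattern (this is a minimal sketch, not the Iceberg writer code), the hypothetical helper below chooses the InternalRow accessor from the Spark (Dataset) type and widens the value to the promoted Iceberg type, int. The class and method names are made up for illustration.

```java
// Minimal sketch, not the Iceberg implementation: read a column that the table
// schema promotes to int, using the Dataset's type to pick the row accessor.
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;

public class PromotedIntReader {
  private PromotedIntReader() {
  }

  // The Iceberg table type is int, but during CTAS the row may still hold a
  // byte or short. Choosing the accessor from the Spark (Dataset) type reads
  // the correct number of bytes and then widens the value to int.
  public static int readAsInt(InternalRow row, int pos, DataType sparkType) {
    if (DataTypes.ByteType.equals(sparkType)) {
      return row.getByte(pos);   // reads 1 byte, widened to int
    } else if (DataTypes.ShortType.equals(sparkType)) {
      return row.getShort(pos);  // reads 2 bytes, widened to int
    } else {
      return row.getInt(pos);    // value is already a 4-byte int
    }
  }
}

// Before this change, the accessor was chosen from the table type instead:
// calling row.getInt(pos) on an unsafe row whose field was written as a byte
// or short returns a corrupted value, because 4 bytes are read where only
// 1 or 2 were set.
```

For example, a smallint column in the source Dataset would be read as readAsInt(row, pos, DataTypes.ShortType) even though the Iceberg table column is int.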