Error reading Avro files containing array of struct with 2 fields #605

@shardulm94

Stacktrace:

java.lang.NullPointerException
	at org.apache.iceberg.avro.AvroSchemaUtil.getFieldId(AvroSchemaUtil.java:286)
	at org.apache.iceberg.avro.PruneColumns.array(PruneColumns.java:126)
	at org.apache.iceberg.avro.PruneColumns.array(PruneColumns.java:34)
	at org.apache.iceberg.avro.AvroSchemaVisitor.visit(AvroSchemaVisitor.java:62)
	at org.apache.iceberg.avro.AvroSchemaVisitor.visitWithName(AvroSchemaVisitor.java:85)
	at org.apache.iceberg.avro.AvroSchemaVisitor.visit(AvroSchemaVisitor.java:44)
	at org.apache.iceberg.avro.AvroSchemaVisitor.visitWithName(AvroSchemaVisitor.java:85)
	at org.apache.iceberg.avro.AvroSchemaVisitor.visit(AvroSchemaVisitor.java:44)
	at org.apache.iceberg.avro.AvroSchemaVisitor.visitWithName(AvroSchemaVisitor.java:85)
	at org.apache.iceberg.avro.AvroSchemaVisitor.visit(AvroSchemaVisitor.java:64)
	at org.apache.iceberg.avro.AvroSchemaVisitor.visit(AvroSchemaVisitor.java:56)
	at org.apache.iceberg.avro.AvroSchemaVisitor.visitWithName(AvroSchemaVisitor.java:85)
	at org.apache.iceberg.avro.AvroSchemaVisitor.visit(AvroSchemaVisitor.java:44)
	at org.apache.iceberg.avro.PruneColumns.rootSchema(PruneColumns.java:46)
	at org.apache.iceberg.avro.AvroSchemaUtil.pruneColumns(AvroSchemaUtil.java:94)
	at org.apache.iceberg.avro.ProjectionDatumReader.setSchema(ProjectionDatumReader.java:59)
	at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:126)
	at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
	at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:59)
	at org.apache.iceberg.avro.AvroIterable.newFileReader(AvroIterable.java:94)
	at org.apache.iceberg.avro.AvroIterable.iterator(AvroIterable.java:77)
	at org.apache.iceberg.spark.source.Reader$TaskDataReader.open(Reader.java:470)
	at org.apache.iceberg.spark.source.Reader$TaskDataReader.open(Reader.java:422)
	at org.apache.iceberg.spark.source.Reader$TaskDataReader.<init>(Reader.java:356)
	at org.apache.iceberg.spark.source.Reader$ReadTask.createPartitionReader(Reader.java:305)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:42)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:384)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

The input dataset has a field of type array<struct<a: string, b: string>>. At https://github.com/apache/incubator-iceberg/blob/d705aa8ccb3aaca4c4eb0fef74658fed1cac3e83/core/src/main/java/org/apache/iceberg/avro/PruneColumns.java#L124 the OR in the condition makes PruneColumns treat the schema as a map when it is not. It then tries to dereference the key and value fields from the struct and fails with the NullPointerException above. So it seems this condition needs to be stricter. According to the spec, "Array storage must use logical type name map and must store elements that are 2-field records.", so it seems the condition should be an AND?
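To illustrate the difference between the two conditions, here is a minimal, self-contained sketch. It does not use the Avro or Iceberg classes: `ArraySchema`, `looseCheck`, and `strictCheck` are hypothetical stand-ins that model only the two facts the check depends on (whether the array carries the `map` logical type, and whether its element is a 2-field record). It assumes the existing check has the OR shape described above.

```java
import java.util.Arrays;
import java.util.List;

public class MapArrayCheck {
    // Hypothetical stand-in for an Avro array schema; not an Avro or Iceberg class.
    public static final class ArraySchema {
        final String logicalType;         // null when the array has no logical type
        final List<String> elementFields; // field names of the element record

        public ArraySchema(String logicalType, List<String> elementFields) {
            this.logicalType = logicalType;
            this.elementFields = elementFields;
        }
    }

    static boolean hasMapLogicalType(ArraySchema array) {
        return "map".equals(array.logicalType);
    }

    static boolean isTwoFieldRecord(ArraySchema array) {
        return array.elementFields.size() == 2;
    }

    // OR: any array whose element is a 2-field record is treated as a map.
    public static boolean looseCheck(ArraySchema array) {
        return hasMapLogicalType(array) || isTwoFieldRecord(array);
    }

    // AND: the array must carry the map logical type and have 2-field elements.
    public static boolean strictCheck(ArraySchema array) {
        return hasMapLogicalType(array) && isTwoFieldRecord(array);
    }

    public static void main(String[] args) {
        // array<struct<a: string, b: string>> with no logical type, as in this issue
        ArraySchema structArray = new ArraySchema(null, Arrays.asList("a", "b"));
        // a map stored as an array of key/value records with the map logical type
        ArraySchema mapAsArray = new ArraySchema("map", Arrays.asList("key", "value"));

        System.out.println("OR,  array of struct: " + looseCheck(structArray));  // true (misclassified)
        System.out.println("AND, array of struct: " + strictCheck(structArray)); // false
        System.out.println("OR,  map as array:    " + looseCheck(mapAsArray));   // true
        System.out.println("AND, map as array:    " + strictCheck(mapAsArray));  // true
    }
}
```

With the OR, the plain array of 2-field structs is classified as a map, which matches the failure in the stacktrace. With the AND, only the array that also carries the map logical type is. Whether any existing files depend on the looser behavior is not something this sketch addresses.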

cc: @rdsr Since you worked on this in #207
