Stacktrace:

```
java.lang.NullPointerException
at org.apache.iceberg.avro.AvroSchemaUtil.getFieldId(AvroSchemaUtil.java:286)
at org.apache.iceberg.avro.PruneColumns.array(PruneColumns.java:126)
at org.apache.iceberg.avro.PruneColumns.array(PruneColumns.java:34)
at org.apache.iceberg.avro.AvroSchemaVisitor.visit(AvroSchemaVisitor.java:62)
at org.apache.iceberg.avro.AvroSchemaVisitor.visitWithName(AvroSchemaVisitor.java:85)
at org.apache.iceberg.avro.AvroSchemaVisitor.visit(AvroSchemaVisitor.java:44)
at org.apache.iceberg.avro.AvroSchemaVisitor.visitWithName(AvroSchemaVisitor.java:85)
at org.apache.iceberg.avro.AvroSchemaVisitor.visit(AvroSchemaVisitor.java:44)
at org.apache.iceberg.avro.AvroSchemaVisitor.visitWithName(AvroSchemaVisitor.java:85)
at org.apache.iceberg.avro.AvroSchemaVisitor.visit(AvroSchemaVisitor.java:64)
at org.apache.iceberg.avro.AvroSchemaVisitor.visit(AvroSchemaVisitor.java:56)
at org.apache.iceberg.avro.AvroSchemaVisitor.visitWithName(AvroSchemaVisitor.java:85)
at org.apache.iceberg.avro.AvroSchemaVisitor.visit(AvroSchemaVisitor.java:44)
at org.apache.iceberg.avro.PruneColumns.rootSchema(PruneColumns.java:46)
at org.apache.iceberg.avro.AvroSchemaUtil.pruneColumns(AvroSchemaUtil.java:94)
at org.apache.iceberg.avro.ProjectionDatumReader.setSchema(ProjectionDatumReader.java:59)
at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:126)
at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:59)
at org.apache.iceberg.avro.AvroIterable.newFileReader(AvroIterable.java:94)
at org.apache.iceberg.avro.AvroIterable.iterator(AvroIterable.java:77)
at org.apache.iceberg.spark.source.Reader$TaskDataReader.open(Reader.java:470)
at org.apache.iceberg.spark.source.Reader$TaskDataReader.open(Reader.java:422)
at org.apache.iceberg.spark.source.Reader$TaskDataReader.<init>(Reader.java:356)
at org.apache.iceberg.spark.source.Reader$ReadTask.createPartitionReader(Reader.java:305)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:42)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:384)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
```
The input dataset has a field with type `array<struct<a: string, b: string>>`. At https://github.com/apache/incubator-iceberg/blob/d705aa8ccb3aaca4c4eb0fef74658fed1cac3e83/core/src/main/java/org/apache/iceberg/avro/PruneColumns.java#L124 the `OR` makes the code think that the schema is a map when it is not, and it eventually tries to dereference the fields `key` and `value` from the struct and fails. So it seems this condition needs to be stricter. According to the spec, "Array storage must use logical type name map and must store elements that are 2-field records.", so it seems the condition should be an `AND`?
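To illustrate the proposed fix outside the Iceberg codebase, here is a minimal, self-contained sketch. `Element` and `isMapProjection` are hypothetical stand-ins for the real Avro `Schema` and the check in `PruneColumns.array()`; the point is that per the spec both conditions must hold, so the `||` should become `&&`:

```java
import java.util.List;

public class ArrayAsMapCheck {

    // Hypothetical stand-in for an Avro array's element schema:
    // the array's logical type name (may be null) and the element
    // record's field names.
    record Element(String logicalType, List<String> fields) {}

    // Spec: "Array storage must use logical type name map and must
    // store elements that are 2-field records." Both conditions must
    // hold, so this is an AND, not the OR at PruneColumns.java#L124.
    static boolean isMapProjection(Element element) {
        boolean hasMapLogicalType = "map".equals(element.logicalType());
        boolean isKeyValueRecord = element.fields().size() == 2
                && element.fields().contains("key")
                && element.fields().contains("value");
        return hasMapLogicalType && isKeyValueRecord;
    }

    public static void main(String[] args) {
        // A plain array<struct<a, b>> has a 2-field element record but
        // no map logical type; with AND it is correctly NOT a map, so
        // the code never dereferences "key"/"value" and NPEs.
        Element plainStruct = new Element(null, List.of("a", "b"));
        System.out.println(isMapProjection(plainStruct));  // false

        // Only the spec-compliant shape qualifies as a map.
        Element logicalMap = new Element("map", List.of("key", "value"));
        System.out.println(isMapProjection(logicalMap));   // true
    }
}
```

With the original `OR`, `plainStruct` would be treated as a map because its element happens to be a 2-field record, which matches the NPE in the stack trace above.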
cc: @rdsr Since you worked on this in #207