After writing Iceberg dataset with nested partitions it cannot be read anymore #575

@andrei-ionescu

Description

After writing an Iceberg dataset with nested partitions, the following error is encountered at read time:

Illegal character in: nestedData.moreData
org.apache.avro.SchemaParseException: Illegal character in: nestedData.moreData
	at org.apache.avro.Schema.validateName(Schema.java:1151)
	at org.apache.avro.Schema.access$200(Schema.java:81)
	at org.apache.avro.Schema$Field.<init>(Schema.java:403)
	at org.apache.avro.Schema$Field.<init>(Schema.java:423)
	at org.apache.iceberg.avro.AvroSchemaUtil.copyField(AvroSchemaUtil.java:333)
	at org.apache.iceberg.avro.BuildAvroProjection.field(BuildAvroProjection.java:134)
	at org.apache.iceberg.avro.BuildAvroProjection.field(BuildAvroProjection.java:41)
	at org.apache.iceberg.avro.AvroCustomOrderSchemaVisitor$VisitFieldFuture.get(AvroCustomOrderSchemaVisitor.java:124)
	at com.google.common.collect.Iterators$6.transform(Iterators.java:783)
	at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
	at com.google.common.collect.Iterators.addAll(Iterators.java:356)
	at com.google.common.collect.Lists.newArrayList(Lists.java:143)
	at com.google.common.collect.Lists.newArrayList(Lists.java:130)
	at org.apache.iceberg.avro.BuildAvroProjection.record(BuildAvroProjection.java:60)
	at org.apache.iceberg.avro.BuildAvroProjection.record(BuildAvroProjection.java:41)
	at org.apache.iceberg.avro.AvroCustomOrderSchemaVisitor.visit(AvroCustomOrderSchemaVisitor.java:51)
	at org.apache.iceberg.avro.AvroCustomOrderSchemaVisitor$VisitFuture.get(AvroCustomOrderSchemaVisitor.java:109)
	at org.apache.iceberg.avro.BuildAvroProjection.field(BuildAvroProjection.java:130)
	at org.apache.iceberg.avro.BuildAvroProjection.field(BuildAvroProjection.java:41)
	at org.apache.iceberg.avro.AvroCustomOrderSchemaVisitor$VisitFieldFuture.get(AvroCustomOrderSchemaVisitor.java:124)
	...
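The root cause appears to be that the nested partition source name, `nestedData.moreData`, is copied verbatim into a projected Avro field name. Per the Avro specification, names must match `[A-Za-z_][A-Za-z0-9_]*`, so a dot is rejected by `Schema.validateName`. A minimal sketch of that rule (the `isValidAvroName` helper is hypothetical, not Avro API):

```java
import java.util.regex.Pattern;

public class AvroNameCheck {
    // Avro spec: a name must start with a letter or underscore and
    // contain only letters, digits, and underscores -- no dots.
    private static final Pattern AVRO_NAME =
        Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    static boolean isValidAvroName(String name) {
        return AVRO_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        // A plain field name passes, but the dotted partition
        // source name fails, matching the SchemaParseException above.
        System.out.println(isValidAvroName("moreData"));            // true
        System.out.println(isValidAvroName("nestedData.moreData")); // false
    }
}
```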

The steps are:

  1. Have a schema and its JSON data
  2. Add a simple partition and a nested partition (both identity)
  3. Write the data in Iceberg format using the Iceberg data frame writer
  4. Read the data back as a new Iceberg dataset

The last step fails with the error above.

Here is some code that reproduces the issue (extracted from this gist I put together).

Schema nestedSchema = new Schema(
    optional(1, "id", Types.IntegerType.get()),
    optional(2, "data", Types.StringType.get()),
    optional(3, "nestedData", Types.StructType.of(
        optional(4, "id", Types.IntegerType.get()),
        optional(5, "moreData", Types.StringType.get())))
);

File parent = temp.newFolder("parquet");
File location = new File(parent, "test");

HadoopTables tables = new HadoopTables(new Configuration());
PartitionSpec spec = PartitionSpec.builderFor(nestedSchema)
    .identity("id")
    .identity("nestedData.moreData")
    .build();
Table table = tables.create(nestedSchema, spec, location.toString());

List<String> jsons = Lists.newArrayList(
    "{ \"id\": 1, \"data\": \"a\", \"nestedData\": { \"id\": 100, \"moreData\": \"p1\"} }",
    "{ \"id\": 2, \"data\": \"b\", \"nestedData\": { \"id\": 200, \"moreData\": \"p1\"} }",
    "{ \"id\": 3, \"data\": \"c\", \"nestedData\": { \"id\": 300, \"moreData\": \"p2\"} }",
    "{ \"id\": 4, \"data\": \"d\", \"nestedData\": { \"id\": 400, \"moreData\": \"p2\"} }"
);
Dataset<Row> df = spark
    .read()
    .schema(SparkSchemaUtil.convert(nestedSchema))
    .json(spark.createDataset(jsons, Encoders.STRING()));

// TODO: incoming columns must be ordered according to the table's schema
df.select("id", "data", "nestedData").write()
    .format("iceberg")
    .mode("append")
    .save(location.toString());

table.refresh();

Dataset<Row> result = spark.read()
    .format("iceberg")
    .load(location.toString());

This seems related to issue #216.
