After writing Iceberg dataset with nested partitions it cannot be read anymore #575

@andrei-ionescu

Description

After writing an Iceberg dataset with nested partitions, the following error is encountered at read time:

Illegal character in: nestedData.moreData
org.apache.avro.SchemaParseException: Illegal character in: nestedData.moreData
	at org.apache.avro.Schema.validateName(Schema.java:1151)
	at org.apache.avro.Schema.access$200(Schema.java:81)
	at org.apache.avro.Schema$Field.<init>(Schema.java:403)
	at org.apache.avro.Schema$Field.<init>(Schema.java:423)
	at org.apache.iceberg.avro.AvroSchemaUtil.copyField(AvroSchemaUtil.java:333)
	at org.apache.iceberg.avro.BuildAvroProjection.field(BuildAvroProjection.java:134)
	at org.apache.iceberg.avro.BuildAvroProjection.field(BuildAvroProjection.java:41)
	at org.apache.iceberg.avro.AvroCustomOrderSchemaVisitor$VisitFieldFuture.get(AvroCustomOrderSchemaVisitor.java:124)
	at com.google.common.collect.Iterators$6.transform(Iterators.java:783)
	at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
	at com.google.common.collect.Iterators.addAll(Iterators.java:356)
	at com.google.common.collect.Lists.newArrayList(Lists.java:143)
	at com.google.common.collect.Lists.newArrayList(Lists.java:130)
	at org.apache.iceberg.avro.BuildAvroProjection.record(BuildAvroProjection.java:60)
	at org.apache.iceberg.avro.BuildAvroProjection.record(BuildAvroProjection.java:41)
	at org.apache.iceberg.avro.AvroCustomOrderSchemaVisitor.visit(AvroCustomOrderSchemaVisitor.java:51)
	at org.apache.iceberg.avro.AvroCustomOrderSchemaVisitor$VisitFuture.get(AvroCustomOrderSchemaVisitor.java:109)
	at org.apache.iceberg.avro.BuildAvroProjection.field(BuildAvroProjection.java:130)
	at org.apache.iceberg.avro.BuildAvroProjection.field(BuildAvroProjection.java:41)
	at org.apache.iceberg.avro.AvroCustomOrderSchemaVisitor$VisitFieldFuture.get(AvroCustomOrderSchemaVisitor.java:124)
	...
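The root cause appears to be that the nested partition source name, `nestedData.moreData`, is copied verbatim into a projected Avro field name. Per the Avro specification, names must match `[A-Za-z_][A-Za-z0-9_]*`, so a dot is rejected by `Schema.validateName`. A minimal sketch of that rule (the `isValidAvroName` helper is hypothetical, not Avro API):

```java
import java.util.regex.Pattern;

public class AvroNameCheck {
    // Avro spec: a name must start with a letter or underscore and
    // contain only letters, digits, and underscores -- no dots.
    private static final Pattern AVRO_NAME =
        Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    static boolean isValidAvroName(String name) {
        return AVRO_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        // A plain field name passes, but the dotted partition
        // source name fails, matching the SchemaParseException above.
        System.out.println(isValidAvroName("moreData"));            // true
        System.out.println(isValidAvroName("nestedData.moreData")); // false
    }
}
```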

The steps are:

  1. Have a schema and its JSON data
  2. Add a simple partition and a nested partition (both identity)
  3. Write the data in Iceberg format using the Iceberg data frame writer
  4. Read the data back as a new Iceberg dataset

The last step fails with the error above.

Here is some code that reproduces the issue (extracted from this gist I put together).

Schema nestedSchema = new Schema(
    optional(1, "id", Types.IntegerType.get()),
    optional(2, "data", Types.StringType.get()),
    optional(3, "nestedData", Types.StructType.of(
        optional(4, "id", Types.IntegerType.get()),
        optional(5, "moreData", Types.StringType.get())))
);

File parent = temp.newFolder("parquet");
File location = new File(parent, "test");

HadoopTables tables = new HadoopTables(new Configuration());
PartitionSpec spec = PartitionSpec.builderFor(nestedSchema)
    .identity("id")
    .identity("nestedData.moreData")
    .build();
Table table = tables.create(nestedSchema, spec, location.toString());

List<String> jsons = Lists.newArrayList(
    "{ \"id\": 1, \"data\": \"a\", \"nestedData\": { \"id\": 100, \"moreData\": \"p1\"} }",
    "{ \"id\": 2, \"data\": \"b\", \"nestedData\": { \"id\": 200, \"moreData\": \"p1\"} }",
    "{ \"id\": 3, \"data\": \"c\", \"nestedData\": { \"id\": 300, \"moreData\": \"p2\"} }",
    "{ \"id\": 4, \"data\": \"d\", \"nestedData\": { \"id\": 400, \"moreData\": \"p2\"} }"
);
Dataset<Row> df = spark
    .read()
    .schema(SparkSchemaUtil.convert(nestedSchema))
    .json(spark.createDataset(jsons, Encoders.STRING()));

// TODO: incoming columns must be ordered according to the table's schema
df.select("id", "data", "nestedData").write()
    .format("iceberg")
    .mode("append")
    .save(location.toString());

table.refresh();

Dataset<Row> result = spark.read()
    .format("iceberg")
    .load(location.toString());

This seems related to issue #216.
