Closed
Description
After writing an Iceberg dataset that is partitioned on a nested field, reading it back fails with the following error:
Illegal character in: nestedData.moreData
org.apache.avro.SchemaParseException: Illegal character in: nestedData.moreData
at org.apache.avro.Schema.validateName(Schema.java:1151)
at org.apache.avro.Schema.access$200(Schema.java:81)
at org.apache.avro.Schema$Field.<init>(Schema.java:403)
at org.apache.avro.Schema$Field.<init>(Schema.java:423)
at org.apache.iceberg.avro.AvroSchemaUtil.copyField(AvroSchemaUtil.java:333)
at org.apache.iceberg.avro.BuildAvroProjection.field(BuildAvroProjection.java:134)
at org.apache.iceberg.avro.BuildAvroProjection.field(BuildAvroProjection.java:41)
at org.apache.iceberg.avro.AvroCustomOrderSchemaVisitor$VisitFieldFuture.get(AvroCustomOrderSchemaVisitor.java:124)
at com.google.common.collect.Iterators$6.transform(Iterators.java:783)
at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
at com.google.common.collect.Iterators.addAll(Iterators.java:356)
at com.google.common.collect.Lists.newArrayList(Lists.java:143)
at com.google.common.collect.Lists.newArrayList(Lists.java:130)
at org.apache.iceberg.avro.BuildAvroProjection.record(BuildAvroProjection.java:60)
at org.apache.iceberg.avro.BuildAvroProjection.record(BuildAvroProjection.java:41)
at org.apache.iceberg.avro.AvroCustomOrderSchemaVisitor.visit(AvroCustomOrderSchemaVisitor.java:51)
at org.apache.iceberg.avro.AvroCustomOrderSchemaVisitor$VisitFuture.get(AvroCustomOrderSchemaVisitor.java:109)
at org.apache.iceberg.avro.BuildAvroProjection.field(BuildAvroProjection.java:130)
at org.apache.iceberg.avro.BuildAvroProjection.field(BuildAvroProjection.java:41)
at org.apache.iceberg.avro.AvroCustomOrderSchemaVisitor$VisitFieldFuture.get(AvroCustomOrderSchemaVisitor.java:124)
...
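The trace ends in `Schema.validateName`, which rejects the dotted name `nestedData.moreData`. As a rough illustration of why, here is a minimal sketch of the naming rule from the Avro specification (a name starts with `[A-Za-z_]` and continues with `[A-Za-z0-9_]`). The class `AvroNameCheck` and the method `isValidAvroName` are hypothetical names made up for this example. This is not Avro's own code, and Avro's actual check may differ in detail.

```java
import java.util.regex.Pattern;

public class AvroNameCheck {
    // Naming rule as stated in the Avro specification (assumption: this
    // approximates, but is not, the check done in Schema.validateName).
    private static final Pattern AVRO_NAME =
        Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Returns true when the whole string matches the naming rule.
    public static boolean isValidAvroName(String name) {
        return name != null && AVRO_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        // The last name is the nested partition source used in this report.
        String[] names = {"id", "nestedData", "nestedData.moreData"};
        for (String name : names) {
            System.out.println(name + " -> " + isValidAvroName(name));
        }
    }
}
```

Under this rule the `.` separator is what makes the nested partition name invalid, while `id` and `nestedData` on their own pass.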
The steps are:
- Have a schema and its JSON data
- Add a simple partition and a nested partition (both identity)
- Write the data in Iceberg format using the Iceberg data frame writer
- Read the data into a new Iceberg dataset
The last step fails with the error above.
Here is some code to explain the issue (extracted from this gist I put together).
Schema nestedSchema = new Schema(
    optional(1, "id", Types.IntegerType.get()),
    optional(2, "data", Types.StringType.get()),
    optional(3, "nestedData", Types.StructType.of(
        optional(4, "id", Types.IntegerType.get()),
        optional(5, "moreData", Types.StringType.get())))
);
File parent = temp.newFolder("parquet");
File location = new File(parent, "test");
HadoopTables tables = new HadoopTables(new Configuration());
PartitionSpec spec = PartitionSpec.builderFor(nestedSchema)
    .identity("id")
    .identity("nestedData.moreData")
    .build();
Table table = tables.create(nestedSchema, spec, location.toString());
List<String> jsons = Lists.newArrayList(
    "{ \"id\": 1, \"data\": \"a\", \"nestedData\": { \"id\": 100, \"moreData\": \"p1\"} }",
    "{ \"id\": 2, \"data\": \"b\", \"nestedData\": { \"id\": 200, \"moreData\": \"p1\"} }",
    "{ \"id\": 3, \"data\": \"c\", \"nestedData\": { \"id\": 300, \"moreData\": \"p2\"} }",
    "{ \"id\": 4, \"data\": \"d\", \"nestedData\": { \"id\": 400, \"moreData\": \"p2\"} }"
);
Dataset<Row> df = spark
    .read()
    .schema(SparkSchemaUtil.convert(nestedSchema))
    .json(spark.createDataset(jsons, Encoders.STRING()));
// TODO: incoming columns must be ordered according to the table's schema
df.select("id", "data", "nestedData").write()
    .format("iceberg")
    .mode("append")
    .save(location.toString());
table.refresh();
Dataset<Row> result = spark.read()
    .format("iceberg")
    .load(location.toString());
This seems related to issue #216.