
Conversation

@pvary pvary commented Feb 17, 2025

Here is what the PR does:

  • Created 3 interface classes which are implemented by the file formats:
    • ReadBuilder - Builder for reading data from data files
    • AppenderBuilder - Builder for writing data to data files
    • ObjectModel - Providing ReadBuilders, and AppenderBuilders for the specific data file format and object model pair
  • Updated the Parquet, Avro, and ORC implementations for these interfaces, and deprecated the old reader/writer APIs
  • Created interface classes which will be used by the actual readers/writers of the data files:
    • AppenderBuilder - Builder for writing a file
    • DataWriterBuilder - Builder for generating a data file
    • PositionDeleteWriterBuilder - Builder for generating a position delete file
    • EqualityDeleteWriterBuilder - Builder for generating an equality delete file
    • No ReadBuilder here - the file format reader builder is reused
  • Created a WriterBuilder class which implements the interfaces above (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) based on a provided file format specific AppenderBuilder
  • Created an ObjectModelRegistry which stores the available ObjectModels and from which engines and users can request readers (ReadBuilder) and writers (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder)
  • Created the appropriate ObjectModels:
    • GenericObjectModels - for reading and writing Iceberg Records
    • SparkObjectModels - for reading (vectorized and non-vectorized) and writing Spark InternalRow/ColumnarBatch objects
    • FlinkObjectModels - for reading and writing Flink RowData objects
    • An Arrow object model is also registered for vectorized reads of Parquet files into Arrow ColumnarBatch objects
  • Updated the production code where the reading and writing happens to use the ObjectModelRegistry and the new reader/writer interfaces to access data files (see the usage sketch after this list)
  • Kept the testing code intact to ensure that the new API/code is not breaking anything
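
To make the read path above concrete, here is a minimal usage sketch of the proposed registry-based API. It is an illustration only: the registry and builder method names (readBuilder, project, build) follow snippets quoted later in this thread and may not match the final code.

    import org.apache.iceberg.FileFormat;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.data.Record;
    import org.apache.iceberg.io.CloseableIterable;
    import org.apache.iceberg.io.InputFile;

    class ObjectModelRegistrySketch {
      // Illustrative only: ask the proposed registry for a ReadBuilder keyed by the
      // (file format, object model class) pair, then configure and build the reader.
      // ObjectModelRegistry is the registry proposed by this PR; its package and exact
      // signature are not final.
      static CloseableIterable<Record> readRecords(InputFile dataFile, Schema projection) {
        return ObjectModelRegistry.readBuilder(FileFormat.PARQUET, Record.class, dataFile)
            .project(projection)
            .build();
      }
    }

Reading Spark InternalRow or Flink RowData would look the same, only with a different object model class passed to the registry.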

@pvary pvary force-pushed the file_Format_api_without_base branch 2 times, most recently from c528a52 to 9975b4f on February 20, 2025 09:45
@pvary pvary changed the title from "WIP: Interface based FileFormat API" to "WIP: Interface based DataFile reader and writer API" on Feb 20, 2025

@liurenjie1024 liurenjie1024 left a comment

Thanks @pvary for this proposal, I left some comments.


pvary commented Feb 21, 2025

I will start to collect the differences between the writer types (appender/dataWriter/equalityDeleteWriter/positionalDeleteWriter) here for reference (a sketch illustrating the config difference follows this list):

  • The writer context is different between delete and data files. It contains TableProperties/configurations which can differ between delete and data files, for example for Parquet: RowGroupSize/PageSize/PageRowLimit/DictSize/Compression, etc. For ORC and Avro we have similar configs that differ
  • Specific writer functions for position deletes to write out the PositionDelete records
  • A positional delete PathTransformFunction to convert the writer data type of the file path to the file format data type
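
To make the first bullet concrete, here is a small sketch (my own illustration, not code from this PR) showing how the same Parquet knob is controlled by a different table property depending on the content being written; the two property constants are the ones mentioned later in this thread.

    import org.apache.iceberg.FileContent;
    import org.apache.iceberg.TableProperties;

    class WriterContextSketch {
      // Delete writers read the DELETE_* variants of the write properties, data writers the
      // plain ones, which is why the "writer context" differs between the two.
      static String rowGroupSizeProperty(FileContent content) {
        switch (content) {
          case POSITION_DELETES:
          case EQUALITY_DELETES:
            return TableProperties.DELETE_PARQUET_ROW_GROUP_SIZE_BYTES;
          default:
            return TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES;
        }
      }
    }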


rdblue commented Feb 22, 2025

While I think the goal here is a good one, the implementation looks too complex to be workable in its current form.

The primary issue that we currently have is adapting object models (like Iceberg's internal StructLike, Spark's InternalRow, or Flink's RowData) to file formats so that you can separately write the object-model-to-format glue code and have it work throughout the support for an engine. I think a diff from the InternalData PR demonstrates it pretty well:

-    switch (format) {
-      case AVRO:
-        AvroIterable<ManifestEntry<F>> reader =
-            Avro.read(file)
-                .project(ManifestEntry.wrapFileSchema(Types.StructType.of(fields)))
-                .createResolvingReader(this::newReader)
-                .reuseContainers()
-                .build();
+    CloseableIterable<ManifestEntry<F>> reader =
+        InternalData.read(format, file)
+            .project(ManifestEntry.wrapFileSchema(Types.StructType.of(fields)))
+            .reuseContainers()
+            .build();
 
-        addCloseable(reader);
+    addCloseable(reader);
 
-        return CloseableIterable.transform(reader, inheritableMetadata::apply);
+    return CloseableIterable.transform(reader, inheritableMetadata::apply);
-
-      default:
-        throw new UnsupportedOperationException("Invalid format for manifest file: " + format);
-    }

This shows:

  • Rather than a switch, the format is passed to create the builder
  • There is no longer a callback passed to create readers for the object model (createResolvingReader)

In this PR, there are a lot of other changes as well. I'm looking at one of the simpler Spark cases in the row reader.

The builder is initialized from DataFileServiceRegistry and now requires a format, class name, file, projection, and constant map:

    return DataFileServiceRegistry.readerBuilder(
            format, InternalRow.class.getName(), file, projection, idToConstant)

There are also new static classes in the file. Each creates a new service and each service creates the builder and object model:

  public static class AvroReaderService implements DataFileServiceRegistry.ReaderService {
    @Override
    public DataFileServiceRegistry.Key key() {
      return new DataFileServiceRegistry.Key(FileFormat.AVRO, InternalRow.class.getName());
    }

    @Override
    public ReaderBuilder builder(
        InputFile inputFile,
        Schema readSchema,
        Map<Integer, ?> idToConstant,
        DeleteFilter<?> deleteFilter) {
      return Avro.read(inputFile)
          .project(readSchema)
          .createResolvingReader(schema -> SparkPlannedAvroReader.create(schema, idToConstant));
    }

The createResolvingReader line is still there, just moved into its own service class instead of in branches of a switch statement.

In addition, there are now a lot more abstractions:

  • A builder for creating an appender for a file format
  • A builder for creating a data file writer for a file format
  • A builder for creating an equality delete writer for a file format
  • A builder for creating a position delete writer for a file format
  • A builder for creating a reader for a file format
  • A "service" registry (what is a service?)
  • A "key"
  • A writer service
  • A reader service

I think that the next steps are to focus on making this a lot simpler, and there are some good ways to do that:

  • Focus on removing boilerplate and hiding the internals. For instance, Key, if needed, should be an internal abstraction and not complexity that is exposed to callers
  • The format-specific data and delete file builders typically wrap an appender builder. Is there a way to handle just the reader builder and appender builder?
  • Is the extra "service" abstraction helpful?
  • Remove ServiceLoader and use a simpler solution. I think that formats could simply register themselves like we do for InternalData (a toy sketch of this pattern follows this list). I think it would be fine to have a trade-off that Iceberg ships with a list of known formats that can be loaded, and if you want to replace that list it's at your own risk.
  • Standardize more across the builders for FileFormat. How idToConstant is handled is a good example. That should be passed to the builder instead of making the whole API more complicated. Projection is the same.
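
A toy sketch of the self-registration idea from the ServiceLoader bullet, modeled loosely on how InternalData registers formats; all names here are made up for illustration and are not the PR's classes.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    // Toy registry keyed by (format, object model class): each format/object-model module
    // calls register(...) once from a well-known hook, and engines look up factories by the
    // same key. The Function here stands in for a reader/writer builder factory.
    final class ToyFormatRegistry {
      private static final Map<String, Function<String, String>> FACTORIES = new ConcurrentHashMap<>();

      static void register(String format, Class<?> modelClass, Function<String, String> factory) {
        FACTORIES.put(format + ":" + modelClass.getName(), factory);
      }

      static Function<String, String> factoryFor(String format, Class<?> modelClass) {
        return FACTORIES.get(format + ":" + modelClass.getName());
      }

      private ToyFormatRegistry() {}
    }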


pvary commented Feb 24, 2025

While I think the goal here is a good one, the implementation looks too complex to be workable in its current form.

I'm happy that we agree on the goals. I created this PR to start the conversation. If there are willing reviewers, we can introduce more invasive changes to achieve a better API. I'm all for it!

The primary issue that we currently have is adapting object models (like Iceberg's internal StructLike, Spark's InternalRow, or Flink's RowData) to file formats so that you can separately write the object-model-to-format glue code and have it work throughout the support for an engine.

I think we need to keep these direct transformations to prevent the performance loss that would be caused by multiple transformations (object model -> common model -> file format).

We have a matrix of transformations which we need to encode somewhere:

Source    Target
Parquet   StructLike
Parquet   InternalRow
Parquet   RowData
Parquet   Arrow
Avro      ...
ORC       ...

[..]

  • Rather than a switch, the format is passed to create the builder
  • There is no longer a callback passed to create readers for the object model (createResolvingReader)

The InternalData reader has one advantage over the data file readers/writers: its object model is fixed. For the DataFile readers/writers we have multiple object models to handle.

[..]
I think that the next steps are to focus on making this a lot simpler, and there are some good ways to do that:

  • Focus on removing boilerplate and hiding the internals. For instance, Key, if needed, should be an internal abstraction and not complexity that is exposed to callers

If we allow adding new builders for the file formats, we can remove a good chunk of the boilerplate code. Let me see how this would look.

  • The format-specific data and delete file builders typically wrap an appender builder. Is there a way to handle just the reader builder and appender builder?

We need to refactor the Avro positional delete writer for this, or add a positionalWriterFunc. We also need to consider the format-specific configurations, which are different for the appenders and the delete files (DELETE_PARQUET_ROW_GROUP_SIZE_BYTES vs. PARQUET_ROW_GROUP_SIZE_BYTES).

  • Is the extra "service" abstraction helpful?

If we are ok with having a new Builder for the readers/writers, then we don't need the service. It was needed to keep the current APIs and the new APIs compatible.

  • Remove ServiceLoader and use a simpler solution. I think that formats could simply register themselves like we do for InternalData. I think it would be fine to have a trade-off that Iceberg ships with a list of known formats that can be loaded, and if you want to replace that list it's at your own risk.

Will do

  • Standardize more across the builders for FileFormat. How idToConstant is handled is a good example. That should be passed to the builder instead of making the whole API more complicated. Projection is the same.

Will see what can be achieved

@pvary pvary force-pushed the file_Format_api_without_base branch 5 times, most recently from c488d32 to 71ec538 on February 25, 2025 16:53
@pvary pvary changed the title from "Core: Interface based DataFile reader and writer API - PoC" to "Core: Interface based DataFile reader and writer API" on Jan 23, 2026
@pvary pvary force-pushed the file_Format_api_without_base branch from cbe306b to 1489bd9 on January 29, 2026 10:47
  private static <D, S> FileWriterBuilder<EqualityDeleteWriter<D>, S> forEqualityDelete(
      WriteBuilder<D, S> writeBuilder, String location, FileFormat format, int[] equalityFieldIds) {
    return new FileWriterBuilderImpl<>(
        writeBuilder.content(FileContent.EQUALITY_DELETES),
Contributor

Why not add content to the builder implementation so that it has access to this and can pass it through?

Then you could also have a validation switch inside the builder class to run validations rather than using a callback that accesses fields directly from outside the class.
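
A sketch of this suggestion with made-up names, in case it helps: the builder keeps the content type as a field and validates it in a switch when building, so no callback needs to reach into the builder from outside.

    import org.apache.iceberg.FileContent;
    import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

    // Illustrative builder fragment: content-specific validation lives inside the builder.
    class ContentAwareBuilderSketch {
      private FileContent content = FileContent.DATA;
      private int[] equalityFieldIds = null;

      ContentAwareBuilderSketch content(FileContent newContent) {
        this.content = newContent;
        return this;
      }

      ContentAwareBuilderSketch equalityFieldIds(int[] ids) {
        this.equalityFieldIds = ids;
        return this;
      }

      void validate() {
        switch (content) {
          case EQUALITY_DELETES:
            Preconditions.checkState(
                equalityFieldIds != null && equalityFieldIds.length > 0,
                "Equality delete writers require equality field ids");
            break;
          default:
            // no content-specific checks in this sketch
            break;
        }
      }
    }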

Contributor Author

I moved the class to its own file, and moved the static creation methods with it, so there are no calls from outside of the class.

I prefer to keep the validation and the creation logic for the specific content types in one place, so I kept this as it is. Please leave a note if you still strongly prefer separating them out.


private FormatModelRegistry() {}

private static class FileWriterBuilderImpl<W extends FileWriter<?, ?>, D, S>
Contributor

I'd prefer to have this as a top-level class that is package-private. I don't think making this a nested class is helping much, and I think it is also allowing an odd pattern with the builderMethod callback that accesses private fields.

Contributor Author

Moved FileWriterBuilderImpl to its own file, and moved the builder methods with it, so the private calls still remain in the same file.

    } catch (NoSuchMethodException e) {
      // failing to register a factory is normal and does not require a stack trace
      LOG.info(
          "Skip registration of {}. Likely the jar is not in the classpath", classToRegister);
Contributor

I'd probably go with the same message that is used by InternalData:

      LOG.info("Unable to register model for (record-class) and (format): {}", e.getMessage());

It is sometimes nice to also include something to look for, but the library itself should not speculate about what went wrong. If you want to include this, then it should be "Check the classpath" or "Check that the Jar for %s is in the classpath". It should not state that some cause is "likely".

Also, a statement like "skip registration" or "skipping registration" should make it more clear that registration failed. It was not skipped due to a warning; there was an error.

Contributor Author

This is what I ended up with:

        LOG.info(
            "Unable to call register for ({}). Check for missing jars on the classpath: {}",
            classToRegister,
            e.getMessage());
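
For context, a self-contained sketch (not the PR's code) of the reflection-based registration pattern being discussed, where a missing module jar surfaces as a reflective lookup failure that is logged with the message above rather than rethrown:

    import java.lang.reflect.Method;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    final class RegistrationSketch {
      private static final Logger LOG = LoggerFactory.getLogger(RegistrationSketch.class);

      // Try to call a static no-arg register() method on the named class; if the class or
      // method is missing (for example, the jar is not on the classpath), log and move on.
      static void callRegister(String classToRegister) {
        try {
          Method register = Class.forName(classToRegister).getMethod("register");
          register.invoke(null);
        } catch (ReflectiveOperationException e) {
          LOG.info(
              "Unable to call register for ({}). Check for missing jars on the classpath: {}",
              classToRegister,
              e.getMessage());
        }
      }

      private RegistrationSketch() {}
    }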


private FormatModelRegistry() {}

private static class FileWriterBuilderImpl<W extends FileWriter<?, ?>, D, S>
Contributor

I don't think that the type params are quite right here. The row type of FileWriter should be D, right? That means that this should probably be FileWriterBuilderImpl<D, S, W extends FileWriter<D, ?>>, right? And it seems suspicious that we aren't correctly carrying through the R param of FileWriter, too. This could probably be parameterized by R since it is determined by the returned writer type.

Contributor Author

I left it like this, because it needs some ugly casting magic on the registry side:

    FormatModel<PositionDelete<D>, ?> model =
        (FormatModel<PositionDelete<D>, ?>) (FormatModel) modelFor(format, PositionDelete.class);

Updated the code based on your recommendation. Check if you like it this way better, or not.
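
For reference, a compilable sketch of the parameterization suggested above, using stand-in types rather than the PR's actual FileWriter and builder classes:

    // Stand-in for FileWriter<D, R>: writes rows of type D and produces a result of type R.
    interface ToyFileWriter<D, R> {
      void write(D row);
      R result();
    }

    // The builder carries the row type D, the schema type S, and the result type R, so the
    // writer it produces is typed as ToyFileWriter<D, R> instead of ToyFileWriter<?, ?>.
    interface ToyFileWriterBuilder<D, S, R, W extends ToyFileWriter<D, R>> {
      ToyFileWriterBuilder<D, S, R, W> schema(S schema);
      W build();
    }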

@pvary pvary force-pushed the file_Format_api_without_base branch from fc5a2f2 to 8a8a67e on January 31, 2026 10:50
  public CloseableIterable<D> build() {
    return internal
        .createResolvingReader(
            icebergSchema -> readerFunction.read(icebergSchema, null, engineSchema, idToConstant))
Contributor

Minor: Passing null here is okay because the Avro read path will pass it to the DatumReader<D> that the readerFunction constructs. I think it would be good to add that in a comment here for anyone that comes looking to find out why it's null.

Contributor Author

Done.

Does this comment cover what you expect?

      // The file schema is passed directly to the DatumReader by the Avro read path, so null is
      // passed here

// Spark eagerly consumes the batches. So the underlying memory allocated could be
// reused without worrying about subsequent reads clobbering over each other. This
// improves read performance as every batch read doesn't have to pay the cost of
// allocating memory.
Contributor

Nit: Did this need to be reformatted? It's less of a problem if there aren't substantive changes mixed together with reformatting.

Contributor Author

Reformatted to comply with line-length restrictions. The increased indentation required the comment to be reformatted.

    } else {
      readBuilder = FormatModelRegistry.readBuilder(format, ColumnarBatch.class, inputFile);
      if (orcConf != null) {
        readBuilder = readBuilder.recordsPerBatch(orcConf.batchSize());
Contributor

I think this is a bit awkward because it is using the separate configs to modify a format-agnostic API. I think it is more clear to move the builder configuration to the top level, like this:

  private boolean useComet() {
    return parquetConf != null && parquetConf.readerType() == ParquetReaderType.COMET;
  }

    Class<? extends ColumnarBatch> readType =
        useComet() ? VectorizedSparkParquetReaders.CometColumnarBatch.class : ColumnarBatch.class;

    ReadBuilder<ColumnarBatch, ?> readBuilder =
        FormatModelRegistry.readBuilder(format, readType, inputFile);

    int batchSize = parquetConf != null ? parquetConf.batchSize() : orcConf.batchSize();
    readBuilder.recordsPerBatch(batchSize);

    CloseableIterable<ColumnarBatch> iterable = ...

Contributor Author

Done.

Needed to add an if on the confs to cover the case when either of them is null.

The result is this:

    if (parquetConf != null) {
      readBuilder = readBuilder.recordsPerBatch(parquetConf.batchSize());
    } else if (orcConf != null) {
      readBuilder = readBuilder.recordsPerBatch(orcConf.batchSize());
    }

@pvary pvary force-pushed the file_Format_api_without_base branch from cecf8c3 to bec9b38 on February 3, 2026 10:54