Format v3: deletion vectors hardcoded to Puffin, bypassing FormatModel #15974

@satishkotha

Summary

In format-version 3, SparkWriteConf.deleteFileFormat() unconditionally returns FileFormat.PUFFIN for all non-metadata tables, bypassing the FormatModel API entirely for position deletes:

// SparkWriteConf.java
public FileFormat deleteFileFormat() {
  if (!(table instanceof BaseMetadataTable) && TableUtil.formatVersion(table) >= 3) {
    return FileFormat.PUFFIN;  // hardcoded, no table property override
  }
  // v2 path (elided here): reads write.delete.format.default → delegates to FormatModel
}

This means FormatModelRegistry.positionDeleteWriteBuilder() and any registered FormatModel<PositionDelete<?>, Void> implementations are dead code for v3 tables. SparkPositionDeltaWrite.newDeleteWriter() routes directly to PartitioningDVWriter → BaseDVFileWriter → Puffin, with no way to override.

Question

Was this intentional to simplify the v3 spec, or is there room for the deletion vector format to be pluggable via FormatModel (or a similar registry)?

Context

We're building a custom storage format that integrates with Iceberg via the FormatModel API. For v2, we registered a FormatModel<PositionDelete<?>, Void> that writes position deletes in our native format — this works correctly.

When we upgraded to v3 for row lineage, we discovered that our delete FormatModel is never called. The Puffin DV path itself works fine (Iceberg handles it internally), but the hardcoded format means:

  1. Custom formats cannot control the delete file representation in v3
  2. The FormatModel<PositionDelete<?>, Void> registration pattern that works for v2 silently becomes unused
  3. There's no way to opt into position delete files (via FormatModel) instead of DVs for v3 tables, even via table properties

Possible approaches

  • Status quo: DVs are always Puffin in v3. Document that FormatModel<PositionDelete<?>, Void> is v2-only.
  • Make it configurable: Allow write.delete.format.default to override the DV format in v3, similar to how it works in v2. Fall back to Puffin if no override is set.
  • Extend FormatModel for DVs: Add a DV-aware FormatModel variant that custom formats can implement.
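
To make the "Make it configurable" option concrete, here is a minimal sketch of the resolution order it implies: an explicit write.delete.format.default wins, otherwise v3 non-metadata tables fall back to Puffin, otherwise the v2 default applies. This is not Iceberg code. The class DeleteFormatSketch, its Format enum, and the resolve method with its parameters are hypothetical stand-ins; only the property name and the v3 condition are taken from the snippet above.

```java
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: hypothetical stand-in types, not Iceberg classes.
public class DeleteFormatSketch {

  // Reduced stand-in for Iceberg's FileFormat enum.
  public enum Format { PARQUET, AVRO, ORC, PUFFIN }

  // Property name as quoted in this issue.
  public static final String DELETE_FORMAT_PROP = "write.delete.format.default";

  /**
   * Resolves the delete file format under the proposed rule.
   *
   * @param formatVersion   table format version (2, 3, ...)
   * @param isMetadataTable whether the table is a metadata table
   * @param props           table properties (assumed to be a plain string map)
   * @param v2Default       format to use when no override applies outside the v3 DV path
   */
  public static Format resolve(
      int formatVersion, boolean isMetadataTable, Map<String, String> props, Format v2Default) {
    // 1. An explicit table property always wins, regardless of format version.
    String override = props.get(DELETE_FORMAT_PROP);
    if (override != null) {
      // Throws IllegalArgumentException for an unknown format name.
      return Format.valueOf(override.trim().toUpperCase(Locale.ROOT));
    }
    // 2. No override: keep today's behavior for v3 non-metadata tables.
    if (!isMetadataTable && formatVersion >= 3) {
      return Format.PUFFIN;
    }
    // 3. Otherwise use the caller-supplied default.
    return v2Default;
  }
}
```

With this ordering, a v3 table with no property set still resolves to PUFFIN, so existing tables would be unaffected, while a v3 table that sets write.delete.format.default to parquet would resolve to PARQUET and could be routed through the FormatModel path. Whether the spec permits non-DV position deletes in v3 at all is a separate question that this sketch does not address.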

Happy to hear the rationale for the current design.
