Initial work for file format writer API #3119
Conversation
CC: @kevinjqliu @Fokko @geruh for review
```python
    OutputFile,
    OutputStream,
)
from pyiceberg.io.fileformat import DataFileStatistics as DataFileStatistics
```
Suggested change:

```diff
-from pyiceberg.io.fileformat import DataFileStatistics as DataFileStatistics
+from pyiceberg.io.fileformat import DataFileStatistics
```
mypy wasn't happy about this previously: https://github.com/apache/iceberg-python/actions/runs/22681243975/job/65752048019
```python
_result: DataFileStatistics | None = None
```
```python
@abstractmethod
def write(self, table: pa.Table) -> None:
```
A table looks to be the logical starting point, but I think an iterator of RecordBatches would also make sense. WDYT @kevinjqliu
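For illustration, a stdlib-only sketch of what a batch-oriented variant could look like. The `Batch` alias and `CollectingWriter` here are hypothetical stand-ins (real code would use `pa.RecordBatch` and a concrete writer); the point is only the shape of the API, where a table-level `write` delegates to a streaming entry point:

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, List

# Hypothetical stand-in for pa.RecordBatch; real code would use pyarrow types.
Batch = List[dict]


class FileFormatWriter(ABC):
    """Sketch: accept an iterator of batches so callers can stream."""

    @abstractmethod
    def write_batches(self, batches: Iterator[Batch]) -> None: ...

    def write(self, table: Iterable[Batch]) -> None:
        # The table-level entry point delegates to the streaming one,
        # so both signatures exist without duplicating logic.
        self.write_batches(iter(table))


class CollectingWriter(FileFormatWriter):
    def __init__(self) -> None:
        self.rows_written = 0

    def write_batches(self, batches: Iterator[Batch]) -> None:
        for batch in batches:
            self.rows_written += len(batch)


w = CollectingWriter()
w.write([[{"id": 1}, {"id": 2}], [{"id": 3}]])
```

This keeps the single-table convenience while letting large writes stream batch by batch.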
```python
def partition(self, partition_spec: PartitionSpec, schema: Schema) -> Record:
    return Record(*[self._partition_value(field, schema) for field in partition_spec.fields])
```
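A toy illustration of the pattern in the snippet above, with stand-ins for pyiceberg's types (`PartitionField` and `Record` here are simplified, not the actual classes): one positional value per spec field, in spec order.

```python
from typing import Any, Dict, List, NamedTuple


# Hypothetical stand-ins for pyiceberg's PartitionField and Record types.
class PartitionField(NamedTuple):
    field_id: int
    name: str


class Record:
    def __init__(self, *values: Any) -> None:
        self._values = values

    def __getitem__(self, index: int) -> Any:
        return self._values[index]


def partition(fields: List[PartitionField], values_by_id: Dict[int, Any]) -> Record:
    # Same shape as the snippet above: one positional value per spec field,
    # preserving the field order of the partition spec.
    return Record(*[values_by_id[f.field_id] for f in fields])


record = partition(
    [PartitionField(1000, "event_date"), PartitionField(1001, "region")],
    {1000: "2024-01-01", 1001: "us-east-1"},
)
```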
```python
def to_serialized_dict(self) -> dict[str, Any]:
```
Might be nice to change this into a TypedDict as a return type
I moved it over from the original implementation. I can switch to a TypedDict in a follow-up when I wire it through, if that works?
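For reference, the follow-up could look roughly like this. The field names below are illustrative, not the actual keys `to_serialized_dict()` emits; the point is that a `TypedDict` gives the caller a typed view of the dict without changing the runtime value.

```python
from typing import Dict, TypedDict


# Hypothetical field names, loosely modeled on DataFileStatistics; the real
# keys would mirror whatever to_serialized_dict() produces today.
class SerializedDataFileStatistics(TypedDict):
    record_count: int
    value_counts: Dict[int, int]
    null_value_counts: Dict[int, int]


def to_serialized_dict() -> SerializedDataFileStatistics:
    # Still a plain dict at runtime; mypy checks the keys and value types.
    return {
        "record_count": 3,
        "value_counts": {1: 3},
        "null_value_counts": {1: 0},
    }


stats = to_serialized_dict()
```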
```python
def get(cls, file_format: FileFormat) -> FileFormatModel:
    if file_format not in cls._registry:
        raise ValueError(f"No writer registered for {file_format}. Available: {list(cls._registry.keys())}")
    return cls._registry[file_format]
```
I think PyIceberg diverges a bit from Java on this point. PyIceberg could have multiple implementations for Parquet, for example (Arrow/fsspec). Maybe we want something similar to the FileIO loading:

`pyiceberg/io/__init__.py`, line 303 at 82f6040
I implemented the FileFormatFactory as the Python equivalent of Java's FormatModelRegistry, keyed by FileFormat alone since Python only has Arrow (vs Java needing (FileFormat, Class<?>) for Spark/Flink/Generic). Let me know if you think it's worth adding a property-based override.
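To make the trade-off concrete, here is a minimal sketch of a registry keyed by `FileFormat` alone, with an optional dotted-path override in the spirit of FileIO's property-based loading. All names below (`FileFormat`, `FileFormatModel`, `FileFormatFactory`, the `override` parameter) are stand-ins for discussion, not the PR's actual API:

```python
import importlib
from enum import Enum
from typing import Dict, Optional, Type


# Stand-ins for pyiceberg's FileFormat and FileFormatModel.
class FileFormat(str, Enum):
    PARQUET = "parquet"
    ORC = "orc"


class FileFormatModel:
    pass


class ArrowParquetModel(FileFormatModel):
    pass


class FileFormatFactory:
    _registry: Dict[FileFormat, Type[FileFormatModel]] = {}

    @classmethod
    def register(cls, fmt: FileFormat, model: Type[FileFormatModel]) -> None:
        cls._registry[fmt] = model

    @classmethod
    def get(cls, fmt: FileFormat, override: Optional[str] = None) -> type:
        # A property-based override (in the spirit of FileIO's loading)
        # resolves a dotted class path; otherwise fall back to the registry.
        if override is not None:
            module_name, _, class_name = override.rpartition(".")
            return getattr(importlib.import_module(module_name), class_name)
        if fmt not in cls._registry:
            raise ValueError(f"No writer registered for {fmt}. Available: {list(cls._registry)}")
        return cls._registry[fmt]


FileFormatFactory.register(FileFormat.PARQUET, ArrowParquetModel)
model = FileFormatFactory.get(FileFormat.PARQUET)
```

A single-key registry stays simple for the Arrow-only case, and the override hook leaves room for alternative implementations later without changing the key type.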
@Fokko @kevinjqliu @geruh PTAL
@geruh @kevinjqliu PTAL
geruh left a comment
Sorry for the late review here @nssalian, and thanks for starting this. I think a single format key is the right call for our Python impl, since pyarrow is our universal data model (so far). I did a quick pass here, lmk what you think!
```python
value_counts: dict[int, int]
null_value_counts: dict[int, int]
nan_value_counts: dict[int, int]
column_aggregates: dict[int, StatsAggregator]
```
I still don't know how I feel about this. I think for now it's okay since we are mostly working with Parquet, but ORC would use the stripe metadata instead.

What we know is that the `_partition_value()` and `partition()` methods currently depend on `column_aggregates` to infer partition values from min/max. These could work from the serialized bounds instead, but if that refactoring is too much, we could alternatively keep `DataFileStatistics` in the pyarrow class and introduce the shared type in your next phase, as you mentioned, when the Parquet writer is actually extracted.
The rest of the class (to_serialized_dict(), counts, sizes) is already format-agnostic. It's just the column_aggregates that is the concern.
For this PR, it's a pure move with no behavioral change. When I'm adding ORC write support, I'll refactor _partition_value() to work from serialized bounds (or define a minimal protocol that both Parquet row group stats and ORC stripe stats can satisfy). That way the refactor happens alongside a concrete second format.
Let me know what you think.
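As a strawman for that future refactor, a minimal structural protocol that both Parquet row-group stats and ORC stripe stats could satisfy might look like this. Every name here (`ColumnBounds`, `RowGroupBounds`, `partition_value`) is hypothetical, assuming partition inference only needs per-field lower/upper bounds:

```python
from typing import Any, Dict, Protocol


class ColumnBounds(Protocol):
    """Hypothetical minimal surface _partition_value() would need:
    per-field lower/upper bounds, regardless of where they came from."""

    def lower_bound(self, field_id: int) -> Any: ...
    def upper_bound(self, field_id: int) -> Any: ...


class RowGroupBounds:
    """Parquet-style implementation backed by row-group column aggregates.
    An ORC stripe-backed implementation would expose the same two methods."""

    def __init__(self, mins: Dict[int, Any], maxs: Dict[int, Any]) -> None:
        self._mins, self._maxs = mins, maxs

    def lower_bound(self, field_id: int) -> Any:
        return self._mins[field_id]

    def upper_bound(self, field_id: int) -> Any:
        return self._maxs[field_id]


def partition_value(bounds: ColumnBounds, field_id: int) -> Any:
    # A partition value is only well-defined when the bounds agree,
    # i.e. the whole file holds a single value for that field.
    lower, upper = bounds.lower_bound(field_id), bounds.upper_bound(field_id)
    if lower != upper:
        raise ValueError(f"Cannot infer partition value for field {field_id}: bounds differ")
    return lower


b = RowGroupBounds({1: "2024-01-01"}, {1: "2024-01-01"})
```

That would let the shared `DataFileStatistics` drop the Parquet-specific `column_aggregates` without each format reimplementing partition inference.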
Initial work for #3100. Since this is a large change, doing it in parts similar to the `AuthManager` so it's easier to review and move the existing code around.

Rationale for this change
Introduces the pluggable file format writer API: `FileFormatWriter`, `FileFormatModel`, and `FileFormatFactory` in `pyiceberg/io/fileformat.py`. Moves `DataFileStatistics` from `pyarrow.py` with a re-export for backward compatibility. The move is more forward-looking; the idea is to keep the stats generic as we add additional formats.

This is the first part of the work for #3100. No behavioral changes; the write path remains hardcoded to Parquet.
Are these changes tested?
Yes.
`tests/io/test_fileformat.py` tests the backward-compatible import of `DataFileStatistics`.

Are there any user-facing changes?

No