Is your feature request related to a problem or challenge?
Hi icy crustaceans,
While working on adding the [ClusteredWriter](https://github.com/apache/iceberg-rust/pull/1646), I noticed that the existing FileWriterBuilder trait is quite limiting for partition-aware writers. The main issue is that the location generator is coupled with the low-level file writer (such as ParquetWriter), which effectively requires ParquetWriter itself to be partition-aware.
In addition, the current FileWriterBuilder trait cannot pass partition information from outer writers to inner writers, making it very difficult to implement partitioning writers:
fn build(self) -> impl Future<Output = Result<Self::R>> + Send;
A hacky workaround would be to have partitioning writers use the concrete type ParquetWriter as the inner writer, but I think this will lead to extensibility issues fairly quickly.
Describe the solution you'd like
I see a few aspects of this issue:
- Decouple
LocationGenerator and ParquetWriter by changing the FileWriterBuilder trait
By changing the FileWriterBuilder interface to the following, we can make ParquetWriter focus solely on writing a file without needing to know which partition it’s writing to:
fn build(self, output_file: OutputFile) -> impl Future<Output = Result<Self::R>> + Send;
We will also need to adjust IcebergWriterBuilder accordingly:
async fn build(self, output_file: OutputFile) -> Result<Self::R>;
- Make the functional writer the outermost layer
After modifying the builder traits, we still need to decide which layer of writers should hold the LocationGenerator. I propose moving functional writers such as RollingFileWriter and the upcoming ClusteredWriter to the outermost layer, and having them own the LocationGenerator.
Currently, the layering looks like this:
// DataFileWriter/IcebergWriter is the outermost layer
DataFileWriter(
RollingFileWriter( // currently implements FileWriter
ParquetWriter()
)
)
My proposal:
SomeFunctionalWriter( // no longer implements FileWriter
DataFileWriter(
ParquetWriter()
)
)
This new design provides several benefits:
-
Clear layering: Functional writers hold the LocationGenerator and generate new OutputFiles for the inner writers, while the inner writers focus only on writing a single file without worrying about path or location.
-
Flexibility: Functional writers no longer need to implement the FileWriter trait, making them easier to use and configure. They don’t necessarily need to implement any trait, though we could define something like:
FunctionalWriterBuilder {
async fn build(
self,
location_gen: LocationGenerator,
filename_gen: FileNameGenerator
) -> Result<Self::R>;
}
I’m still exploring and refining this idea, and would appreciate any feedback!
Willingness to contribute
I would be willing to contribute to this feature with guidance from the Iceberg Rust community
Is your feature request related to a problem or challenge?
Hi icy crustaceans,
While working on adding the
[ClusteredWriter](https://github.com/apache/iceberg-rust/pull/1646), I noticed that the existingFileWriterBuildertrait is quite limiting for partition-aware writers. The main issue is that the location generator is coupled with the low-level file writer (such asParquetWriter), which effectively requiresParquetWriteritself to be partition-aware.In addition, the current
FileWriterBuildertrait cannot pass partition information from outer writers to inner writers, making it very difficult to implement partitioning writers:A hacky workaround would be to have partitioning writers use the concrete type
ParquetWriteras the inner writer, but I think this will lead to extensibility issues fairly quickly.Describe the solution you'd like
I see a few aspects of this issue:
LocationGeneratorandParquetWriterby changing theFileWriterBuildertraitBy changing the
FileWriterBuilderinterface to the following, we can makeParquetWriterfocus solely on writing a file without needing to know which partition it’s writing to:We will also need to adjust
IcebergWriterBuilderaccordingly:After modifying the builder traits, we still need to decide which layer of writers should hold the
LocationGenerator. I propose moving functional writers such asRollingFileWriterand the upcomingClusteredWriterto the outermost layer, and having them own theLocationGenerator.Currently, the layering looks like this:
My proposal:
This new design provides several benefits:
Clear layering: Functional writers hold the
LocationGeneratorand generate newOutputFiles for the inner writers, while the inner writers focus only on writing a single file without worrying about path or location.Flexibility: Functional writers no longer need to implement the
FileWritertrait, making them easier to use and configure. They don’t necessarily need to implement any trait, though we could define something like:I’m still exploring and refining this idea, and would appreciate any feedback!
Willingness to contribute
I would be willing to contribute to this feature with guidance from the Iceberg Rust community