Skip to content

File I/O Submodule for TableOperations #12

@mccheah

Description

@mccheah

In Netflix/iceberg#107 it was discussed that InputFile and OutputFile instances should be pluggable. We discussed the fact that provision of InputFile and OutputFile instances should be handled by the TableOperations API. However, the Spark data source in particular only uses HadoopInputFile#fromPath for reading and HadoopOutputFile#fromPath for writing. Using TableOperations#newInputFile and TableOperations#newOutputFile, would also be difficult because calling these methods on the executors would require TableOperations instances to be Serializable.

We propose having the TableOperations API provide a FileIO module that handles the narrow role of reading, creating / writing, and deleting files. We propose the following:

interface FileIO extends Serializable {
  InputFile newInputFile(String path);
  OutputFile newOutputFile(String path);
  void deleteFile(String path);
}

Then the following method would be added to TableOperations, and we would remove TableOperations#newInputFile and TableOperations#newMetadataFile.

interface TableOperations {
  FileIO fileIo();
  String resolveNewMetadataPath(String metadataFilename);
}

The need for resolveNewMetadataPath is because the new FileIO abstraction considers all locations as full paths, but the old method TableOperations#newMetadataFile assumes the argument is a file name, not a full path. Therefore now callers that used to call TableOperations#newMetadataFile should first retrieve the full path and then pass that along to FileIO#newOutputFile. For convenience we could add a helper default method like so:

interface TableOperations {
  FileIO fileIo();
  String resolveNewMetadataPath(String metadataFilename);
  default OutputFile newMetadataFile(String fileName) {
    return fileIo().newOutputFile(resolveMetadataPath(fileName));
  }
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions