Circular dependency when FileIO depends on table properties #1908

@jackye1995


In Iceberg, FileIO is part of the Table and TableOperations interfaces and is used for both data and metadata. This works fine when data and metadata are stored using the same IO. However, if users want FileIO features for data that are customized based on metadata information, it creates a circular dependency. For example, in TableOperations:

  1. user calls TableOperations.io() to get a FileIO
  2. that calls TableOperations.current() to get table properties
  3. that calls TableOperations.refresh() to get latest metadata
  4. that calls TableOperations.io() to get a FileIO to read the metadata file
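The four steps above can be reproduced with a minimal sketch. The `FileIO` interface and `PropertyDrivenOps` class here are simplified, hypothetical stand-ins rather than Iceberg's real interfaces, and the `encryption.key` property name and the depth guard are assumptions added only so the loop fails fast instead of overflowing the stack.

```java
import java.util.HashMap;
import java.util.Map;

class CircularIoSketch {
  // Hypothetical, simplified stand-in for Iceberg's FileIO.
  interface FileIO {
    String read(String path);
  }

  // Hypothetical table operations whose io() is configured from table properties.
  static class PropertyDrivenOps {
    private Map<String, String> properties; // null until metadata has been loaded
    private int depth = 0;                  // guard so the demo fails fast

    FileIO io() {
      // Steps 1-2: building the FileIO requires the table properties.
      Map<String, String> props = current();
      return path -> "read " + path + " with key " + props.get("encryption.key");
    }

    Map<String, String> current() {
      if (properties == null) {
        refresh(); // Step 3: properties live in metadata, so load it first.
      }
      return properties;
    }

    void refresh() {
      if (++depth > 3) {
        throw new IllegalStateException(
            "circular dependency: io() -> current() -> refresh() -> io()");
      }
      // Step 4: reading the metadata file needs a FileIO, which re-enters io().
      io().read("metadata/v1.metadata.json");
      properties = new HashMap<>(); // never reached in this sketch
    }
  }

  // Returns the failure message produced by the cycle.
  static String demo() {
    try {
      new PropertyDrivenOps().io();
      return "ok";
    } catch (IllegalStateException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Without the depth guard the same call chain would recurse until a StackOverflowError, since `properties` is never populated before `io()` is needed again.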

When implementing the dynamic loading of FileIO (#1618), there was some discussion around this, and we basically decided to load FileIO through catalog properties and, by default, use the same FileIO for all table operations. Although users can have a customized FileIO for different tables if they want, the metadata and data paths are still not decoupled. So far, I have heard about multiple customer use cases around this, for example:

  1. use a different encryption mechanism for metadata and data, with the encryption key stored as a table property
  2. check permissions for read and write access based on an access control list stored in table properties

Today, users typically have FileIO check the file path internally to determine the right mechanism for reading a file, for example checking whether the keyword "metadata" appears in the path to decide whether the file is metadata. There may also be a layer of delegation that passes calls to multiple storage-specific FileIOs. I would consider this a hack, because it is basically reverse engineering who is calling FileIO.
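As a rough illustration of that workaround, a path-sniffing delegate might look like the sketch below. The `FileIO` interface, the `PathSniffingFileIO` class, and the example paths are all hypothetical and simplified; this is not code from Iceberg or from any specific user.

```java
class PathSniffingIoSketch {
  // Hypothetical, simplified stand-in for Iceberg's FileIO.
  interface FileIO {
    String describeRead(String path);
  }

  // Delegating FileIO that guesses the caller's intent from the path.
  static class PathSniffingFileIO implements FileIO {
    private final FileIO metadataIO;
    private final FileIO dataIO;

    PathSniffingFileIO(FileIO metadataIO, FileIO dataIO) {
      this.metadataIO = metadataIO;
      this.dataIO = dataIO;
    }

    @Override
    public String describeRead(String path) {
      // The "hack": infer metadata vs. data from a keyword in the path.
      FileIO delegate = path.contains("/metadata/") ? metadataIO : dataIO;
      return delegate.describeRead(path);
    }
  }

  // Builds a delegate whose two targets only tag the path they receive.
  static FileIO example() {
    return new PathSniffingFileIO(p -> "metadata-io:" + p, p -> "data-io:" + p);
  }

  public static void main(String[] args) {
    FileIO io = example();
    System.out.println(io.describeRead("s3://bucket/tbl/metadata/v1.metadata.json"));
    System.out.println(io.describeRead("s3://bucket/tbl/data/part-0.parquet"));
    // A data file under a location that happens to contain "/metadata/" is misrouted.
    System.out.println(io.describeRead("s3://bucket/metadata/data/part-0.parquet"));
  }
}
```

The last call shows why the heuristic is fragile: the routing depends on naming conventions in the path, not on what the caller is actually doing.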

I can imagine a few potential approaches (I have not thought through the details):

  1. use a different read and write mechanism for table properties, so that this circular dependency no longer exists.
  2. add a new method default FileIO metaIO() { return io(); } and use it for all metadata operations instead of io(), because the calling code always knows whether it is writing data or metadata.
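A minimal sketch of the second approach, assuming simplified stand-ins for the real interfaces: `metaIO()` defaults to `io()`, so existing implementations keep their behavior, while an implementation that overrides it can read metadata without first needing table properties. `SplitIoOps`, the `encryption.key` property, and the value `key-123` are hypothetical names used only for illustration.

```java
import java.util.Collections;
import java.util.Map;

class MetaIoSketch {
  // Hypothetical, simplified stand-in for Iceberg's FileIO.
  interface FileIO {
    String read(String path);
  }

  // Simplified stand-in for TableOperations with the proposed default method.
  interface TableOperations {
    FileIO io();

    // Proposed: metadata reads and writes go through metaIO(); defaults to io().
    default FileIO metaIO() {
      return io();
    }
  }

  // Implementation that separates metadata IO from property-driven data IO.
  static class SplitIoOps implements TableOperations {
    private final FileIO catalogIO = path -> "catalog-io read " + path;
    private Map<String, String> properties;

    @Override
    public FileIO metaIO() {
      return catalogIO; // configured from catalog properties only
    }

    @Override
    public FileIO io() {
      Map<String, String> props = current(); // safe: refresh() no longer calls io()
      return path -> "data-io[" + props.get("encryption.key") + "] read " + path;
    }

    Map<String, String> current() {
      if (properties == null) {
        refresh();
      }
      return properties;
    }

    void refresh() {
      // Metadata is read through metaIO(), which breaks the cycle.
      metaIO().read("metadata/v1.metadata.json");
      // Assumed property value standing in for parsed table metadata.
      properties = Collections.singletonMap("encryption.key", "key-123");
    }
  }

  public static void main(String[] args) {
    System.out.println(new SplitIoOps().io().read("data/part-0.parquet"));
  }
}
```

Because `refresh()` depends only on `metaIO()`, the data-side `io()` is free to consult table properties for things like encryption keys or access control lists.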

Has anyone thought about this problem before? Is this a situation that we think Iceberg should handle by design? Any suggestions would be appreciated, thanks!
