Description
In Iceberg, `FileIO` is part of the `Table` and `TableOperations` interfaces and is used for both data and metadata. This works fine when both data and metadata are stored using the same IO. However, if users want customized `FileIO` behavior for data that depends on metadata information, it creates a circular dependency. For example, in `TableOperations`:
- user calls `TableOperations.io()` to get a `FileIO`
- that calls `TableOperations.current()` to get table properties
- that calls `TableOperations.refresh()` to get the latest metadata
- that calls `TableOperations.io()` to get a `FileIO` to read the metadata file
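To make the cycle concrete, here is a minimal, self-contained sketch. `PropertyDrivenTableOperations`, `readFile`, and the `encryption.key-id` property are made up for illustration, and the `FileIO`/`TableMetadata` types below are simplified stand-ins rather than the real Iceberg interfaces:

```java
import java.util.Map;

// Simplified stand-ins, just enough structure to show the cycle.
interface FileIO {
  String readFile(String path);
}

class TableMetadata {
  final Map<String, String> properties;
  TableMetadata(Map<String, String> properties) {
    this.properties = properties;
  }
}

class PropertyDrivenTableOperations {
  private final String metadataLocation;
  private TableMetadata currentMetadata;

  PropertyDrivenTableOperations(String metadataLocation) {
    this.metadataLocation = metadataLocation;
  }

  // 1. io() wants to configure itself from a table property ...
  FileIO io() {
    String keyId = current().properties.get("encryption.key-id");
    return path -> "decrypt(" + keyId + ", " + path + ")";
  }

  // 2. ... so it asks current() for the latest metadata ...
  TableMetadata current() {
    return currentMetadata != null ? currentMetadata : refresh();
  }

  // 3. ... which calls refresh(), which needs a FileIO to read the metadata
  //    file -- and we are back at io(): the circular dependency.
  TableMetadata refresh() {
    String metadataJson = io().readFile(metadataLocation);
    currentMetadata = parseProperties(metadataJson);
    return currentMetadata;
  }

  private TableMetadata parseProperties(String json) {
    return new TableMetadata(Map.of()); // placeholder parser
  }
}
```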
When implementing the dynamic loading of FileIO (#1618 ), there was some discussion around this, and we basically decided to load `FileIO` through catalog properties and use the same `FileIO` for all table operations as the default behavior. Users can still configure a customized `FileIO` per table if they want, but the metadata and data aspects of it are still not decoupled. So far I have heard multiple customer use cases around this, for example:
- use a different encryption mechanism for metadata and data, with the encryption key stored as a table property
- check permissions for read and write access based on an access control list stored in table properties
Typically, users work around this today by letting `FileIO` internally inspect the file path to decide which mechanism to use, for example checking whether the keyword `metadata` appears in the path to tell metadata files from data files. There may also be a layer of delegation that routes calls to multiple storage-specific `FileIO`s. I would consider this a hack, because it is basically reverse engineering who is calling `FileIO`.
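For illustration, a rough sketch of what that workaround tends to look like; `PathSniffingFileIO` and the choice of wrapped IOs are hypothetical, and only the three abstract `FileIO` methods are shown:

```java
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;

// A delegating FileIO that guesses whether a path is metadata by inspecting it.
public class PathSniffingFileIO implements FileIO {
  private final FileIO metadataIO; // e.g. plain object-store IO for metadata
  private final FileIO dataIO;     // e.g. encrypting, property-driven IO for data

  public PathSniffingFileIO(FileIO metadataIO, FileIO dataIO) {
    this.metadataIO = metadataIO;
    this.dataIO = dataIO;
  }

  private FileIO pick(String path) {
    // Fragile heuristic: assumes the default table layout where metadata
    // files live under a ".../metadata/" directory.
    return path.contains("/metadata/") ? metadataIO : dataIO;
  }

  @Override
  public InputFile newInputFile(String path) {
    return pick(path).newInputFile(path);
  }

  @Override
  public OutputFile newOutputFile(String path) {
    return pick(path).newOutputFile(path);
  }

  @Override
  public void deleteFile(String path) {
    pick(path).deleteFile(path);
  }
}
```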
I imagine a few different potential approaches (I have not thought through the details):
- use a different read and write mechanism for table properties, so that this circular dependency does not exist anymore.
- a new method `default FileIO metaIO() { return io(); }` could potentially be added and used for all metadata operations instead of `io()`, because upstream we always know whether we are writing data or metadata (see the sketch after this list)
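A rough sketch of what the second idea could look like; this is a proposal, not existing Iceberg API, and only the relevant part of `TableOperations` is shown:

```java
import org.apache.iceberg.io.FileIO;

// Sketch only: a trimmed-down TableOperations with the proposed default method.
public interface TableOperations {
  FileIO io();

  // Proposed addition: defaults to io() so existing implementations keep
  // working, but can be overridden to give metadata its own FileIO that does
  // not depend on table properties, breaking the cycle described above.
  default FileIO metaIO() {
    return io();
  }

  // ... existing methods such as current(), refresh(), commit() elided ...
}
```

Core and catalog code that reads or writes metadata files would then call `metaIO()` instead of `io()`, while data paths keep using `io()`.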
Has anyone thought about this problem before? Is this something we think Iceberg should handle by design? Any suggestions would be appreciated, thanks!