Description
In Iceberg, `FileIO` is part of the `Table` and `TableOperations` interfaces and is used for both data and metadata. This works fine when both data and metadata are stored using the same IO. However, if users want customized `FileIO` behavior for data that depends on metadata information, it creates a circular dependency. For example, in `TableOperations`:
- user calls `TableOperations.io()` to get a `FileIO`
- that calls `TableOperations.current()` to get table properties
- that calls `TableOperations.refresh()` to get the latest metadata
- that calls `TableOperations.io()` to get a `FileIO` to read the metadata file
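To make the cycle concrete, here is a minimal, self-contained sketch. `PropertyDrivenTableOperations`, `readFile`, and the `encryption.key-id` property are made up for illustration, and the `FileIO`/`TableMetadata` types below are simplified stand-ins rather than the real Iceberg interfaces:

```java
import java.util.Map;

// Simplified stand-ins, just enough structure to show the cycle.
interface FileIO {
  String readFile(String path);
}

class TableMetadata {
  final Map<String, String> properties;
  TableMetadata(Map<String, String> properties) {
    this.properties = properties;
  }
}

class PropertyDrivenTableOperations {
  private final String metadataLocation;
  private TableMetadata currentMetadata;

  PropertyDrivenTableOperations(String metadataLocation) {
    this.metadataLocation = metadataLocation;
  }

  // 1. io() wants to configure itself from a table property ...
  FileIO io() {
    String keyId = current().properties.get("encryption.key-id");
    return path -> "decrypt(" + keyId + ", " + path + ")";
  }

  // 2. ... so it asks current() for the latest metadata ...
  TableMetadata current() {
    return currentMetadata != null ? currentMetadata : refresh();
  }

  // 3. ... which calls refresh(), which needs a FileIO to read the metadata
  //    file -- and we are back at io(): the circular dependency.
  TableMetadata refresh() {
    String metadataJson = io().readFile(metadataLocation);
    currentMetadata = parseProperties(metadataJson);
    return currentMetadata;
  }

  private TableMetadata parseProperties(String json) {
    return new TableMetadata(Map.of()); // placeholder parser
  }
}
```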
When implementing the dynamic loading of FileIO (#1618 ), there was some discussion around this, and we basically decided to load `FileIO` through catalog properties and use the same `FileIO` for all table operations as the default behavior. Users can still configure a customized `FileIO` per table if they want, but the metadata and data aspects of it are still not decoupled. So far I have heard multiple customer use cases around this, for example:
- use a different encryption mechanism for metadata and data, with the encryption key stored as a table property
- check permissions for read and write access based on an access control list stored in table properties
Typically, users work around this today by letting `FileIO` internally inspect the file path to decide which mechanism to use, for example checking whether the keyword `metadata` appears in the path to tell metadata files from data files. There may also be a layer of delegation that routes calls to multiple storage-specific `FileIO`s. I would consider this a hack, because it is basically reverse engineering who is calling `FileIO`.
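For illustration, a rough sketch of what that workaround tends to look like; `PathSniffingFileIO` and the choice of wrapped IOs are hypothetical, and only the three abstract `FileIO` methods are shown:

```java
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;

// A delegating FileIO that guesses whether a path is metadata by inspecting it.
public class PathSniffingFileIO implements FileIO {
  private final FileIO metadataIO; // e.g. plain object-store IO for metadata
  private final FileIO dataIO;     // e.g. encrypting, property-driven IO for data

  public PathSniffingFileIO(FileIO metadataIO, FileIO dataIO) {
    this.metadataIO = metadataIO;
    this.dataIO = dataIO;
  }

  private FileIO pick(String path) {
    // Fragile heuristic: assumes the default table layout where metadata
    // files live under a ".../metadata/" directory.
    return path.contains("/metadata/") ? metadataIO : dataIO;
  }

  @Override
  public InputFile newInputFile(String path) {
    return pick(path).newInputFile(path);
  }

  @Override
  public OutputFile newOutputFile(String path) {
    return pick(path).newOutputFile(path);
  }

  @Override
  public void deleteFile(String path) {
    pick(path).deleteFile(path);
  }
}
```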
I imagine a few different potential approaches (I have not thought through the details):
- use a different read and write mechanism for table properties, so that this circular dependency does not exist anymore.
- a new method `default FileIO metaIO() { return io(); }` could potentially be added and used for all metadata operations instead of `io()`, because upstream we always know whether we are writing data or metadata (see the sketch after this list)
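A rough sketch of what the second idea could look like; this is a proposal, not existing Iceberg API, and only the relevant part of `TableOperations` is shown:

```java
import org.apache.iceberg.io.FileIO;

// Sketch only: a trimmed-down TableOperations with the proposed default method.
public interface TableOperations {
  FileIO io();

  // Proposed addition: defaults to io() so existing implementations keep
  // working, but can be overridden to give metadata its own FileIO that does
  // not depend on table properties, breaking the cycle described above.
  default FileIO metaIO() {
    return io();
  }

  // ... existing methods such as current(), refresh(), commit() elided ...
}
```

Core and catalog code that reads or writes metadata files would then call `metaIO()` instead of `io()`, while data paths keep using `io()`.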
Has anyone thought about this problem before? Is this something we think Iceberg should handle by design? Any suggestions would be appreciated, thanks!