
fix: add iceberg_type column for SqlCatalog #3263

Open
rchowell wants to merge 1 commit into apache:main from rchowell:iceberg-type

Conversation

@rchowell
Contributor

Rationale for this change

PyIceberg currently does not include the iceberg_type column in the iceberg_tables table for SQL-based catalogs. This caused an error when reading a SQL-based catalog from iceberg-rust, which expects this column. I will also update iceberg-rust to be more defensive, like this PR.

The fix here mirrors exactly how iceberg-java handles this: after ensuring iceberg_tables exists, we run an idempotent ALTER TABLE iceberg_tables ADD COLUMN iceberg_type. We explicitly use a try/catch to make it idempotent, because older SQLite versions do not support IF NOT EXISTS for column creation. For newly created tables, the column definition added to the SQLAlchemy table creates the column; for existing tables, this backwards-compatible schema update applies.
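The migration described above can be sketched as follows with SQLAlchemy against an in-memory SQLite database. This is an illustrative sketch, not the PR's actual code: the helper name `add_iceberg_type_column` and the pre-migration DDL are assumptions based on the schemas shown later in this description.

```python
from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError


def add_iceberg_type_column(engine) -> None:
    """Idempotently add the iceberg_type column to iceberg_tables.

    Older SQLite versions lack ADD COLUMN IF NOT EXISTS, so idempotency
    comes from try/catch: a repeat run fails with "duplicate column
    name", which we swallow, making the migration safe to run again.
    """
    with engine.begin() as conn:
        try:
            conn.execute(text("ALTER TABLE iceberg_tables ADD COLUMN iceberg_type VARCHAR(5)"))
        except OperationalError:
            pass  # Column already exists: nothing to do.


engine = create_engine("sqlite:///:memory:")
with engine.begin() as conn:
    # Pre-migration schema, without iceberg_type (as in Case 2 below).
    conn.execute(text(
        "CREATE TABLE iceberg_tables ("
        "catalog_name VARCHAR(255) NOT NULL, "
        "table_namespace VARCHAR(255) NOT NULL, "
        "table_name VARCHAR(255) NOT NULL, "
        "metadata_location VARCHAR(1000), "
        "previous_metadata_location VARCHAR(1000), "
        "PRIMARY KEY (catalog_name, table_namespace, table_name))"
    ))

add_iceberg_type_column(engine)  # applies the migration
add_iceberg_type_column(engine)  # second run is a no-op

with engine.connect() as conn:
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    cols = [row[1] for row in conn.execute(text("PRAGMA table_info(iceberg_tables)"))]
```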

Are these changes tested?

Unit Testing

  1. Unit test for the migration case
  2. Unit test for new table creation and idempotency
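The two unit-test cases above might look roughly like this (a hedged sketch using only the sqlite3 standard library; the helper and test names are illustrative, not the PR's actual test code):

```python
import sqlite3


def _pre_migration_schema(conn: sqlite3.Connection) -> None:
    # Schema as created before this fix, i.e. without iceberg_type.
    conn.execute(
        "CREATE TABLE iceberg_tables ("
        "catalog_name VARCHAR(255) NOT NULL, "
        "table_namespace VARCHAR(255) NOT NULL, "
        "table_name VARCHAR(255) NOT NULL, "
        "metadata_location VARCHAR(1000), "
        "previous_metadata_location VARCHAR(1000), "
        "PRIMARY KEY (catalog_name, table_namespace, table_name))"
    )


def _migrate(conn: sqlite3.Connection) -> None:
    try:
        conn.execute("ALTER TABLE iceberg_tables ADD COLUMN iceberg_type VARCHAR(5)")
    except sqlite3.OperationalError:
        pass  # duplicate column name: already migrated


def test_migration_adds_column() -> None:
    conn = sqlite3.connect(":memory:")
    _pre_migration_schema(conn)
    _migrate(conn)
    cols = [row[1] for row in conn.execute("PRAGMA table_info(iceberg_tables)")]
    assert "iceberg_type" in cols


def test_migration_is_idempotent() -> None:
    conn = sqlite3.connect(":memory:")
    _pre_migration_schema(conn)
    _migrate(conn)
    _migrate(conn)  # second run must not raise
    cols = [row[1] for row in conn.execute("PRAGMA table_info(iceberg_tables)")]
    assert cols.count("iceberg_type") == 1
```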

End-to-end Testing

# Case 1. New Catalog

# Catalog definition
$ cat ~/.pyiceberg.yaml | yq .catalog.tmp
type: sql
uri: sqlite:////tmp/pyiceberg.db

# Trigger initialization of catalog tables
$ pyiceberg --catalog tmp list

# Verify Schema
sqlite> .schema iceberg_tables
CREATE TABLE iceberg_tables (
        catalog_name VARCHAR(255) NOT NULL, 
        table_namespace VARCHAR(255) NOT NULL, 
        table_name VARCHAR(255) NOT NULL, 
        metadata_location VARCHAR(1000), 
        previous_metadata_location VARCHAR(1000), 
 ---->  iceberg_type VARCHAR(5), 
        PRIMARY KEY (catalog_name, table_namespace, table_name)
);

# Case 2. Existing Catalog (before fix)

# This catalog was created without the fix
sqlite> .schema iceberg_tables
CREATE TABLE iceberg_tables (
        catalog_name VARCHAR(255) NOT NULL, 
        table_namespace VARCHAR(255) NOT NULL, 
        table_name VARCHAR(255) NOT NULL, 
        metadata_location VARCHAR(1000), 
        previous_metadata_location VARCHAR(1000), 
        PRIMARY KEY (catalog_name, table_namespace, table_name)
);

# Fix is now applied
# ...

# Trigger update by loading this catalog with these changes in pyiceberg
sqlite> .schema iceberg_tables
CREATE TABLE iceberg_tables (
        catalog_name VARCHAR(255) NOT NULL, 
        table_namespace VARCHAR(255) NOT NULL, 
        table_name VARCHAR(255) NOT NULL, 
        metadata_location VARCHAR(1000), 
        previous_metadata_location VARCHAR(1000),
 ---->  iceberg_type VARCHAR(5),
        PRIMARY KEY (catalog_name, table_namespace, table_name)
);

Are there any user-facing changes?

No

Member

@geruh geruh left a comment


Thanks for raising this @rchowell! This is a matter of completing the implementation of our SqlCatalog: here you are updating the table to have the column and handling the migration logic.

We will definitely need a follow-up here to filter against this column for table operations. Looks like the Java v1 implementation runs something like WHERE (iceberg_type = 'TABLE' OR iceberg_type IS NULL). Otherwise, a view can bleed into table operations.
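The follow-up filter described in that comment can be sketched like this (illustrative sqlite3 snippet with made-up rows; the schema is reduced to the two relevant columns). The NULL branch keeps pre-migration rows, which have no iceberg_type, visible to table operations, while rows marked VIEW are excluded:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iceberg_tables (table_name VARCHAR(255), iceberg_type VARCHAR(5))")
conn.executemany(
    "INSERT INTO iceberg_tables VALUES (?, ?)",
    [
        ("t_new", "TABLE"),   # created after the fix
        ("t_legacy", None),   # created before the fix: no iceberg_type
        ("v_view", "VIEW"),   # must not bleed into table operations
    ],
)

# The filter the Java v1 implementation reportedly uses:
rows = conn.execute(
    "SELECT table_name FROM iceberg_tables "
    "WHERE iceberg_type = 'TABLE' OR iceberg_type IS NULL"
).fetchall()
names = sorted(r[0] for r in rows)  # only t_legacy and t_new survive
```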

Comment thread tests/catalog/test_sql.py
catalog_sqlite.close()


def _create_pre_migration_schema_tables(engine: Engine) -> None:
Member


Doesn't look like you're calling this function, but the logic is used on line 302+ below?

