Cast from UUIDLiteral to other types? #522

@sebpretzer

Apache Iceberg version

0.6.0 (latest release)

Please describe the bug 🐞

Apologies for not fully finding the root cause. Tracing the issue was a bit difficult for me, and I am hoping others can point me in the right direction.

The Issue

When attempting to add a row_filter to filter on a UUIDType during DataScan.to_arrow():

import pyiceberg.table

table: pyiceberg.table.Table = ...  # an already-loaded table with a UUID column
df = table.scan(
    selected_fields=["uuid_col"],
    row_filter="uuid_col == '102cb62f-e6f8-4eb0-9973-d9b012ff0967'",
).to_arrow()

I get the following error:

.venv/lib/python3.11/site-packages/pyiceberg/table/__init__.py:1418: in to_arrow
    return project_table(
.venv/lib/python3.11/site-packages/pyiceberg/io/pyarrow.py:1114: in project_table
    tables = [f.result() for f in completed_futures if f.result()]
.venv/lib/python3.11/site-packages/pyiceberg/io/pyarrow.py:1114: in <listcomp>
    tables = [f.result() for f in completed_futures if f.result()]
/opt/homebrew/Caskroom/miniconda/base/lib/python3.11/concurrent/futures/_base.py:449: in result
    return self.__get_result()
/opt/homebrew/Caskroom/miniconda/base/lib/python3.11/concurrent/futures/_base.py:401: in __get_result
    raise self._exception
/opt/homebrew/Caskroom/miniconda/base/lib/python3.11/concurrent/futures/thread.py:58: in run
    result = self.fn(*self.args, **self.kwargs)
.venv/lib/python3.11/site-packages/pyiceberg/io/pyarrow.py:963: in _task_to_table
    bound_file_filter = bind(file_schema, translated_row_filter, case_sensitive=case_sensitive)
.venv/lib/python3.11/site-packages/pyiceberg/expressions/visitors.py:213: in bind
    return visit(expression, BindVisitor(schema, case_sensitive))
/opt/homebrew/Caskroom/miniconda/base/lib/python3.11/functools.py:909: in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
.venv/lib/python3.11/site-packages/pyiceberg/expressions/visitors.py:185: in _
    return visitor.visit_unbound_predicate(predicate=obj)
.venv/lib/python3.11/site-packages/pyiceberg/expressions/visitors.py:250: in visit_unbound_predicate
    return predicate.bind(self.schema, case_sensitive=self.case_sensitive)
.venv/lib/python3.11/site-packages/pyiceberg/expressions/__init__.py:672: in bind
    lit = self.literal.to(bound_term.ref().field.field_type)
/opt/homebrew/Caskroom/miniconda/base/lib/python3.11/functools.py:946: in _method
    return method.__get__(obj, cls)(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = UUIDLiteral(b"\xf0f\xa6\xe1'\x8dH\xe7\x82T\x1798'G>"), type_var = FixedType(length=16)

    @singledispatchmethod
    def to(self, type_var: IcebergType) -> Literal:  # type: ignore
>       raise TypeError(f"Cannot convert UUIDLiteral into {type_var}")
E       TypeError: Cannot convert UUIDLiteral into fixed[16]

.venv/lib/python3.11/site-packages/pyiceberg/expressions/literals.py:606: TypeError
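
The last frames of the traceback can be reproduced without pyiceberg. Below is a minimal sketch using only the standard library. `UUIDType`, `FixedType`, and `UUIDLiteral` here are simplified stand-ins that borrow the library's names; they are not the real classes. The sketch assumes, as the traceback suggests, that `to` has no overload registered for `FixedType`, so the call falls through to the base implementation and raises.

```python
from functools import singledispatchmethod


class UUIDType:
    """Stand-in for pyiceberg's UUIDType (assumption, not the real class)."""

    def __str__(self) -> str:
        return "uuid"


class FixedType:
    """Stand-in for pyiceberg's FixedType (assumption, not the real class)."""

    def __init__(self, length: int) -> None:
        self.length = length

    def __str__(self) -> str:
        return f"fixed[{self.length}]"


class UUIDLiteral:
    """Stand-in literal with a single registered conversion target."""

    def __init__(self, value: bytes) -> None:
        self.value = value

    @singledispatchmethod
    def to(self, type_var):
        # Fallback: reached for any type without a registered overload.
        raise TypeError(f"Cannot convert UUIDLiteral into {type_var}")

    @to.register(UUIDType)
    def _(self, _: UUIDType) -> "UUIDLiteral":
        return self


lit = UUIDLiteral(bytes(16))

# Dispatches to the UUIDType overload and returns the literal unchanged.
same = lit.to(UUIDType())

# No overload is registered for FixedType, so the fallback raises.
try:
    lit.to(FixedType(16))
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

Dispatch is on the runtime class of the first argument after `self`, which is why binding against a file schema that reports the column as `fixed[16]` ends up in the fallback.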

Notes

  1. I see there is already a test to capture this (test_unpartitioned_uuid_table), but I was unable to figure out how to set up integration tests myself and step through this specific test.
  2. This does not occur when the row_filter is not specified. When returning all data, the pyarrow table schema is specified as fixed_size_binary[16].
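
Regarding note 2: a UUID is a 128-bit value, so its binary form is exactly 16 bytes, which is consistent with the column surfacing as fixed_size_binary[16]. A quick standard-library check, using the UUID string from the filter above (this only illustrates the byte length, not how pyiceberg encodes the value internally):

```python
import uuid

# The UUID string used in the row_filter above.
parsed = uuid.UUID("102cb62f-e6f8-4eb0-9973-d9b012ff0967")

# 16-byte big-endian representation of the 128-bit value.
raw = parsed.bytes

print(len(raw))   # 16
print(raw.hex())  # 102cb62fe6f84eb09973d9b012ff0967
```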

Potential Solution?

I was able to fix this issue by patching UUIDLiteral like so:

class UUIDLiteral(Literal[bytes]):
    def __init__(self, value: bytes) -> None:
        super().__init__(value, bytes)

    @singledispatchmethod
    def to(self, type_var: IcebergType) -> Literal:  # type: ignore
        raise TypeError(f"Cannot convert UUIDLiteral into {type_var}")

    @to.register(UUIDType)
    def _(self, _: UUIDType) -> Literal[bytes]:
        return self

+    @to.register(FixedType)
+    def _(self, type_var: FixedType) -> Literal[bytes]:
+        if len(type_var) == UUID_BYTES_LENGTH:
+            return FixedLiteral(self.value)
+        else:
+            raise TypeError(
+                f"Cannot convert UUIDLiteral into {type_var}, different length: {len(type_var)} <> {UUID_BYTES_LENGTH}"
+            )
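
To make the intent of the added overload concrete, here is a standalone sketch that runs without pyiceberg. `FixedType`, `FixedLiteral`, and `UUIDLiteral` are hypothetical stand-ins reduced to what the overload touches, and `UUID_BYTES_LENGTH = 16` is an assumed value. It only shows the length check and the two outcomes; it says nothing about whether the real library should handle the conversion this way.

```python
import uuid
from functools import singledispatchmethod

UUID_BYTES_LENGTH = 16  # assumed value: a UUID is 128 bits


class FixedType:
    """Stand-in for pyiceberg's FixedType; len() gives the fixed length."""

    def __init__(self, length: int) -> None:
        self.length = length

    def __len__(self) -> int:
        return self.length

    def __str__(self) -> str:
        return f"fixed[{self.length}]"


class FixedLiteral:
    """Stand-in for a literal holding fixed-length bytes."""

    def __init__(self, value: bytes) -> None:
        self.value = value


class UUIDLiteral:
    """Stand-in literal carrying the proposed FixedType overload."""

    def __init__(self, value: bytes) -> None:
        self.value = value

    @singledispatchmethod
    def to(self, type_var):
        raise TypeError(f"Cannot convert UUIDLiteral into {type_var}")

    @to.register(FixedType)
    def _(self, type_var: FixedType) -> FixedLiteral:
        # Only a 16-byte fixed type can hold the UUID bytes unchanged.
        if len(type_var) == UUID_BYTES_LENGTH:
            return FixedLiteral(self.value)
        raise TypeError(
            f"Cannot convert UUIDLiteral into {type_var}, "
            f"different length: {len(type_var)} <> {UUID_BYTES_LENGTH}"
        )


raw = uuid.UUID("102cb62f-e6f8-4eb0-9973-d9b012ff0967").bytes

# Matching length: the bytes are carried over as-is.
converted = UUIDLiteral(raw).to(FixedType(16))

# Mismatched length: the overload rejects the conversion.
try:
    UUIDLiteral(raw).to(FixedType(8))
    mismatch_error = None
except TypeError as exc:
    mismatch_error = str(exc)
```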

With this patch applied, I was able to confirm that the data returned with the row_filter and without it is identical. I am not sure whether this is the best way to go about it, though, or whether it is a complete solution.

Please let me know if this is in fact a bug, or if I am missing something. Thanks!
