Update PyArrow dependency to support StringView type #31475

@phillipleblanc

Bug description

Currently, Superset requires pyarrow>=14.0.1,<15. That pin breaks Superset against databases that return StringView types (introduced in PyArrow 16), because older PyArrow versions cannot decode them.

I've tested Superset with PyArrow 18.1.0 and verified it works correctly in my (admittedly bare-bones) setup. This update would:

  1. Fix compatibility with databases returning StringView types
  2. Allow users to work with newer Arrow-based databases and tools
  3. Take advantage of performance improvements in newer PyArrow versions

Proposed change:
Update the pyarrow dependency in pyproject.toml from:
"pyarrow>=14.0.1, <15"
to:
"pyarrow>=14.0.1, <19"
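
To make the effect of the wider bound concrete, here is a rough sketch of which versions each specifier admits. The helpers `parse_version` and `in_range` are hypothetical names written for this illustration. They only handle plain dotted numeric versions and are not how pip or `packaging` resolve specifiers.

```python
def parse_version(text):
    """Turn a plain dotted version such as '18.1.0' into a tuple of ints.

    Assumption: no pre-release or local suffixes (e.g. '18.0.0rc1').
    """
    return tuple(int(part) for part in text.split("."))


def in_range(version, lower="14.0.1", upper="19"):
    """Return True if lower <= version < upper, mirroring '>=lower, <upper'."""
    v = parse_version(version)
    return parse_version(lower) <= v < parse_version(upper)


if __name__ == "__main__":
    # 18.1.0 is rejected by the current pin and accepted by the proposed one.
    print(in_range("18.1.0", upper="15"))
    print(in_range("18.1.0", upper="19"))
```

Under these assumptions, 18.1.0 falls outside the current `<15` bound and inside the proposed `<19` bound, while 14.0.1 stays valid for both.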

Screenshots/recordings

No response

Superset version

master / latest-dev

Python version

3.10

Node version

Not applicable

Browser

Not applicable

Additional context

I'm using the https://github.com/influxdata/flightsql-dbapi DB-API 2 layer to query a database that returns native Arrow arrays. It returns StringView types, which pyarrow 14 can't decode. After I force-upgraded to pyarrow 18.1, it started working. The traceback with pyarrow 14:

2024-12-16 12:48:10,731:ERROR:flask_appbuilder.api:Unrecognized type: 24
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/flask_appbuilder/api/__init__.py", line 110, in wraps
    return f(self, *args, **kwargs)
  File "/app/superset/views/base_api.py", line 127, in wraps
    raise ex
  File "/app/superset/views/base_api.py", line 121, in wraps
    duration, response = time_function(f, self, *args, **kwargs)
  File "/app/superset/utils/core.py", line 1470, in time_function
    response = func(*args, **kwargs)
  File "/app/superset/utils/log.py", line 255, in wrapper
    value = f(*args, **kwargs)
  File "/app/superset/databases/api.py", line 742, in table_metadata
    table_info = get_table_metadata(database, table_name, schema_name)
  File "/app/superset/databases/utils.py", line 67, in get_table_metadata
    columns = database.get_columns(table_name, schema_name)
  File "/app/superset/models/core.py", line 839, in get_columns
    return self.db_engine_spec.get_columns(
  File "/app/superset/db_engine_specs/base.py", line 1341, in get_columns
    cast(list[SQLAColumnType], inspector.get_columns(table_name, schema))
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/reflection.py", line 497, in get_columns
    col_defs = self.dialect.get_columns(
  File "<string>", line 2, in get_columns
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/reflection.py", line 55, in cache
    ret = fn(self, con, *args, **kw)
  File "/usr/local/lib/python3.10/site-packages/flightsql/sqlalchemy.py", line 87, in get_columns
    return connection.connection.flightsql_get_columns(table, schema)
  File "/usr/local/lib/python3.10/site-packages/flightsql/util.py", line 8, in g
    return f(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/flightsql/dbapi.py", line 173, in flightsql_get_columns
    reader = ipc.open_stream(table_schema)
  File "/usr/local/lib/python3.10/site-packages/pyarrow/ipc.py", line 190, in open_stream
    return RecordBatchStreamReader(source, options=options,
  File "/usr/local/lib/python3.10/site-packages/pyarrow/ipc.py", line 52, in __init__
    self._open(source, options=options, memory_pool=memory_pool)
  File "pyarrow/ipc.pxi", line 929, in pyarrow.lib._RecordBatchStreamReader._open
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unrecognized type: 24
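
The failure happens while `ipc.open_stream` reads the schema, so a small local check can help confirm whether a given environment is affected. The sketch below is not the flightsql-dbapi code path. It only writes an IPC stream whose schema has one StringView field and reads that schema back. The function names and the `name` column are made up for the example, and it assumes `pa.string_view()` is available only in newer PyArrow releases.

```python
try:
    import pyarrow as pa
    import pyarrow.ipc as ipc
except ImportError:  # pyarrow not installed in this environment
    pa = None
    ipc = None


def pyarrow_supports_string_view():
    """Return True if the installed pyarrow exposes the StringView type."""
    return pa is not None and hasattr(pa, "string_view")


def roundtrip_string_view_schema():
    """Write a schema-only IPC stream with a StringView field and read it back."""
    schema = pa.schema([pa.field("name", pa.string_view())])
    sink = pa.BufferOutputStream()
    writer = ipc.new_stream(sink, schema)
    writer.close()  # closing flushes the schema message and the end-of-stream marker
    reader = ipc.open_stream(sink.getvalue())
    return reader.schema


if __name__ == "__main__":
    if pyarrow_supports_string_view():
        print(roundtrip_string_view_schema())
    else:
        print("installed pyarrow has no string_view type (or pyarrow is missing)")
```

On an older pyarrow the check reports that the type is missing, which matches the situation where a stream produced elsewhere cannot be opened.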

Checklist

  • I have searched Superset docs and Slack and didn't find a solution to my problem.
  • I have searched the GitHub issue tracker and didn't find a similar bug report.
  • I have checked Superset's logs for errors and if I found a relevant Python stacktrace, I included it here as text in the "additional context" section.