Skip to content

Improve read performance #1004

@jamblejoe

Description

@jamblejoe

Is your feature request related to a problem? Please describe.
We are a slightly disappointed by the read performance from DB2 databases via this python package. We use ibm_db_dbi and pandas read_sql to read in data. This calls into

def _fetch_helper(self, fetch_size=-1):

where the result is processed in a python loop and every fetched tuple is appended to a list. This is suboptimal performance wise.

There is a faster solution since the end of last year due to this PR #971, fetchall implemented in

static PyObject *ibm_db_fetchall(PyObject *self, PyObject *args)

Here, the loop is implemented in C but there are a lot of python-checks for every fetched tuple. The performance is 4x faster than the ibm_db_dbi interface.

Other packages are providing a much better performance. E.g., the rust package https://github.com/pacman82/arrow-odbc wrapped in https://github.com/pacman82/arrow-odbc-py calls the IBM driver via the ODBC interface and bulk reads in rust into the arrow format. The python wrapper of the rust crate has a very easy syntax:

reader = arrow_odbc.read_arrow_batches_from_odbc(
        query=f"""
                SELECT col1, col2, ...
                FROM TEST_PULL """,
        connection_string=connection_string,
        batch_size=10_000,
    )

batches = []
for batch in reader:
    batches.append(batch)

df = pl.from_arrow(batches) # polars df

A little test gave us the following performance chart (lower is better)

Image
Note the logarithmic scale of the axis.

Summarized: arrow-odbc-py gives as ~15x performance boost over ibm_db_dbi interface and ~4x boost over fetchall from ibm_db.

Describe the solution you'd like
We would like to continue to use ibm_db_dbi, but the performance sacrifices compared to other packages are too large. Please enhance the performance of reading from DB2.

Describe alternatives you've considered
Other packages, like arrow-odbc-py, allow for significantly faster reads.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions