Description
Is your feature request related to a problem? Please describe.
We are slightly disappointed by the read performance from DB2 databases via this Python package. We use ibm_db_dbi and pandas read_sql to read in data. This calls into
Line 1764 in c3aaf02
| def _fetch_helper(self, fetch_size=-1): |
where the result is processed in a Python loop and every fetched tuple is appended to a list. This is suboptimal performance-wise.
Since the end of last year there has been a faster solution thanks to PR #971: fetchall, implemented in
Line 17042 in c3aaf02
| static PyObject *ibm_db_fetchall(PyObject *self, PyObject *args) |
Here, the loop is implemented in C, but there are still many Python-level checks for every fetched tuple. It is about 4x faster than the ibm_db_dbi interface.
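To make the difference between the two fetch patterns concrete, here is a minimal sketch. It is not the ibm_db_dbi or ibm_db code; it uses the standard-library sqlite3 module as a stand-in for DB2, and the function names and the test_pull table are made up for illustration. It only shows a per-row loop that appends each tuple in Python next to a single bulk fetchall call.

```python
import sqlite3
import timeit


def fetch_row_by_row(cursor):
    """Fetch one tuple per Python-level loop iteration and append it to a list."""
    rows = []
    while True:
        row = cursor.fetchone()
        if row is None:
            break
        rows.append(row)
    return rows


def fetch_bulk(cursor):
    """Let the driver build the whole result list in a single call."""
    return cursor.fetchall()


def make_connection(n_rows=10_000):
    # Assumption: an in-memory sqlite3 table stands in for a DB2 table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test_pull (col1 INTEGER, col2 TEXT)")
    conn.executemany(
        "INSERT INTO test_pull VALUES (?, ?)",
        ((i, f"value-{i}") for i in range(n_rows)),
    )
    return conn


def run(conn, fetch):
    # Execute the same query each time and hand the cursor to the fetch strategy.
    cursor = conn.execute("SELECT col1, col2 FROM test_pull ORDER BY col1")
    return fetch(cursor)


if __name__ == "__main__":
    conn = make_connection()
    for fetch in (fetch_row_by_row, fetch_bulk):
        seconds = timeit.timeit(lambda: run(conn, fetch), number=20)
        print(f"{fetch.__name__}: {seconds:.3f} s for 20 runs")
```

Both strategies return the same rows; only the place where the loop runs differs. Timings from sqlite3 say nothing about DB2 itself, so the numbers are only useful for seeing the shape of the overhead.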
Other packages provide much better performance. E.g., the Rust package https://github.com/pacman82/arrow-odbc, wrapped in https://github.com/pacman82/arrow-odbc-py, calls the IBM driver via the ODBC interface and bulk-reads into the Arrow format in Rust. The Python wrapper of the Rust crate has a very simple syntax:
import arrow_odbc
import polars as pl

reader = arrow_odbc.read_arrow_batches_from_odbc(
    query="""
        SELECT col1, col2, ...
        FROM TEST_PULL """,
    connection_string=connection_string,
    batch_size=10_000,
)
# Collect the Arrow record batches yielded by the reader
batches = []
for batch in reader:
    batches.append(batch)
df = pl.from_arrow(batches)  # polars df
A small test gave us the following performance chart (lower is better):

Note the logarithmic scale of the axis.
In summary: arrow-odbc-py gives us a ~15x performance boost over the ibm_db_dbi interface and a ~4x boost over fetchall from ibm_db.
Describe the solution you'd like
We would like to continue using ibm_db_dbi, but the performance penalty compared to other packages is too large. Please improve the read performance from DB2.
Describe alternatives you've considered
Other packages, like arrow-odbc-py, allow for significantly faster reads.