diff --git a/site/_data/contributors.yml b/site/_data/contributors.yml
index 7bed83d2182..2a8d95b77b2 100644
--- a/site/_data/contributors.yml
+++ b/site/_data/contributors.yml
@@ -29,4 +29,7 @@
   apacheId: julienledem
   githubId: julienledem
   role: PMC
+- name: Michael König
+  apacheId: MathMagique
+  githubId: MathMagique
 # End contributors.yml
diff --git a/site/_posts/2017-06-15-turbodbc-arrow.md b/site/_posts/2017-06-15-turbodbc-arrow.md
new file mode 100644
index 00000000000..505cbb8d468
--- /dev/null
+++ b/site/_posts/2017-06-15-turbodbc-arrow.md
@@ -0,0 +1,113 @@
+---
+layout: post
+title: "Connecting relational databases to the Apache Arrow world with turbodbc"
+date: "2017-06-15 13:00:00 +0200"
+author: MathMagique
+categories: [application]
+---
+
+The [Apache Arrow][1] project set out to become the universal data layer that
+connects data processing systems without incurring serialization costs or other
+performance compromises. While relational databases still lag behind in Apache
+Arrow adoption, the Python database module [turbodbc][2] brings Apache Arrow
+support to these databases using a much older, more specialized data exchange
+layer: [ODBC][3].
+
+ODBC is a database interface that offers developers the option to transfer data
+either in row-wise or column-wise fashion. Previous Python ODBC modules typically
+use the row-wise approach and often accept repeated database roundtrips in
+exchange for simplified buffer handling. This makes them less suited for
+data-intensive applications, particularly when interfacing with modern columnar
+analytical databases.
+
+In contrast, turbodbc was designed to leverage columnar data processing from day
+one. Naturally, this implies using the columnar portion of the ODBC API. Equally
+important, however, is finding new ways of providing columnar data to Python users
+that exceed the capabilities of the row-wise API mandated by Python's [PEP 249][4].
+Turbodbc has adopted Apache Arrow for this very task with the recently released
+version 2.0.0:
+
+```
+>>> from turbodbc import connect
+>>> connection = connect(dsn="My columnar database")
+>>> cursor = connection.cursor()
+>>> cursor.execute("SELECT some_integers, some_strings FROM my_table")
+>>> cursor.fetchallarrow()
+pyarrow.Table
+some_integers: int64
+some_strings: string
+```
+
+With this new addition, the data flow for the result set of a typical SELECT query
+looks like this:
+
+* The database prepares the result set and exposes it to the ODBC driver using
+  either row-wise or column-wise storage.
+* Turbodbc has the ODBC driver write chunks of the result set into columnar buffers.
+* These buffers are exposed to turbodbc's Apache Arrow frontend. This frontend
+  creates an Arrow table and fills in the buffered values.
+* The previous steps are repeated until the entire result set is retrieved.
+
+![Data flow from relational databases to Python with turbodbc and the Apache Arrow frontend]({{ site.url }}/img/turbodbc_arrow.png){:class="img-responsive"}
+
+In practice, it is possible to achieve the following ideal situation: a 64-bit
+integer column is stored as one contiguous block of memory in a columnar database.
+A huge chunk of 64-bit integers is transferred over the network and the ODBC driver
+directly writes it to a turbodbc buffer of 64-bit integers. The Arrow frontend
+accumulates these values by copying the entire 64-bit buffer into a free portion
+of an Arrow table's 64-bit integer column.
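+
+As a rough illustration of why such whole-buffer copies are possible, the sketch
+below (not part of turbodbc, and assuming a reasonably recent `pyarrow`; buffer
+allocations may include padding) shows that an Arrow 64-bit integer column is
+backed by one contiguous data buffer plus an optional validity bitmap:
+
+```
+>>> import pyarrow as pa
+>>> column = pa.array([1, 2, 3, 4], type=pa.int64())
+>>> validity, data = column.buffers()  # optional validity bitmap + contiguous data
+>>> data.size >= 4 * 8  # four 64-bit integers sit back to back in one block
+True
+```
+
+Because the values sit back to back in the data buffer, appending them to a
+matching Arrow column is a single block copy rather than a per-value conversion.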
+
+Moving data from the database to an Arrow table, and thus providing it to Python
+users, can be as simple as copying memory blocks around, megabytes (equivalent to
+hundreds of thousands of rows) at a time. The absence of serialization and
+conversion logic renders the process extremely efficient.
+
+Once the data is stored in an Arrow table, Python users can continue to do some
+actual work. They can convert it into a [Pandas dataframe][5] for data analysis
+(using a quick `table.to_pandas()`), pass it on to other data processing systems
+such as [Spark][6] or [Impala][7], or store it in the [Parquet][8] file format.
+This way, non-Python systems are efficiently connected with relational databases.
+
+In the future, turbodbc's Apache Arrow support will be extended to use more
+sophisticated features such as [dictionary-encoded][9] string fields. We also plan
+to pick [data types][10] smaller than 64 bits where possible. Last but not least,
+Arrow support will be extended to cover the reverse direction of data flow, so
+that Python users can quickly insert Arrow tables into relational databases.
+
+If you would like to learn more about turbodbc, check out the [GitHub project][2]
+and the [project documentation][11]. If you want to learn more about how turbodbc
+implements the nitty-gritty details, check out parts [one][12] and [two][13] of
+the ["Making of turbodbc"][12] series at [Blue Yonder's technology blog][14].
+
+[1]: https://arrow.apache.org/
+[2]: https://github.com/blue-yonder/turbodbc
+[3]: https://en.wikipedia.org/wiki/Open_Database_Connectivity
+[4]: https://www.python.org/dev/peps/pep-0249/
+[5]: https://arrow.apache.org/docs/python/pandas.html
+[6]: http://spark.apache.org/
+[7]: http://impala.apache.org/
+[8]: http://parquet.apache.org/
+[9]: https://arrow.apache.org/docs/memory_layout.html#dictionary-encoding
+[10]: https://arrow.apache.org/docs/metadata.html#integers
+[11]: http://turbodbc.readthedocs.io/
+[12]: https://tech.blue-yonder.com/making-of-turbodbc-part-1-wrestling-with-the-side-effects-of-a-c-api/
+[13]: https://tech.blue-yonder.com/making-of-turbodbc-part-2-c-to-python/
+[14]: https://tech.blue-yonder.com/
diff --git a/site/img/turbodbc_arrow.png b/site/img/turbodbc_arrow.png
new file mode 100644
index 00000000000..b534bf99a0e
Binary files /dev/null and b/site/img/turbodbc_arrow.png differ