From b5b4df525e1fc878cac6f12f12a67df10248bde9 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Tue, 1 Nov 2016 13:46:28 +0100 Subject: [PATCH 1/5] ARROW-356: Add documentation about reading Parquet Change-Id: I1810ccbb021a79f1da1474cc1b952ab98503f010 --- python/doc/index.rst | 14 ++++----- python/doc/parquet.rst | 65 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 72 insertions(+), 7 deletions(-) create mode 100644 python/doc/parquet.rst diff --git a/python/doc/index.rst b/python/doc/index.rst index 88725badc1e..0f29ce2acc4 100644 --- a/python/doc/index.rst +++ b/python/doc/index.rst @@ -31,14 +31,14 @@ additional functionality such as reading Apache Parquet files into Arrow structures. .. toctree:: - :maxdepth: 4 - :hidden: + :maxdepth: 2 + :caption: Getting Started Module Reference -Indices and tables -================== +.. toctree:: + :maxdepth: 2 + :caption: Additional Features + + Parquet format -* :ref:`genindex` -* :ref:`modindex` -* :ref:`search` diff --git a/python/doc/parquet.rst b/python/doc/parquet.rst new file mode 100644 index 00000000000..ea64c378038 --- /dev/null +++ b/python/doc/parquet.rst @@ -0,0 +1,65 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. 
+ +Reading/Writing Parquet files +============================= + +If you have built ``pyarrow`` with Parquet support, i.e. ``parquet-cpp`` was +found during the build, you can read files in the Parquet format to/from Arrow +memory structures. The Parquet support code is located in the +:mod:`pyarrow.parquet` module. + +Reading Parquet +--------------- + +To read a Parquet file into Arrow memory, you can use the following code +snippet. It will read the whole Parquet file into memory as an +:class:`pyarrow.table.Table`. + +.. code-block:: python + + import pyarrow + import pyarrow.parquet + + A = pyarrow + + table = A.parquet.read_table('') + +Writing Parquet +--------------- + +Given an instance of :class:`pyarrow.table.Table`, the simplest way to +persist it to Parquet is by using the :meth:`pyarrow.parquet.write_table` +method. + +.. code-block:: python + + import pyarrow + import pyarrow.parquet + + A = pyarrow + + table = A.Table(..) + A.parquet.write_table(table, '') + +By default this will write the Table as a single RowGroup using ``DICTIONARY`` +encoding. To increase the parallelism with which a query engine can process +a Parquet file, set the ``chunk_size`` to a fraction of the total number of rows. + +If you also want to compress the columns, you can select a compression +method using the ``compression`` argument. Typically, ``GZIP`` is the choice if +you want to minimize size and ``SNAPPY`` if you want performance. From 744202a61dbc5b0d43a448cf8d849bed1f84d56d Mon Sep 17 00:00:00 2001 From: "Uwe L. 
Korn" Date: Fri, 4 Nov 2016 09:54:19 +0100 Subject: [PATCH 2/5] Document Pandas<->Arrow conversion Change-Id: Icd19435eb0202a972817adfd880983b1c937bef1 --- python/doc/index.rst | 2 ++ python/doc/pandas.rst | 61 ++++++++++++++++++++++++++++++++++++++++ python/pyarrow/table.pyx | 15 ++++++++++ 3 files changed, 78 insertions(+) create mode 100644 python/doc/pandas.rst diff --git a/python/doc/index.rst b/python/doc/index.rst index 0f29ce2acc4..6725ae707d9 100644 --- a/python/doc/index.rst +++ b/python/doc/index.rst @@ -34,6 +34,8 @@ structures. :maxdepth: 2 :caption: Getting Started + Installing pyarrow + Pandas Module Reference .. toctree:: diff --git a/python/doc/pandas.rst b/python/doc/pandas.rst new file mode 100644 index 00000000000..2e5a0318185 --- /dev/null +++ b/python/doc/pandas.rst @@ -0,0 +1,61 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +Pandas Interface +================ + +To interface with Pandas, PyArrow provides various conversion routines to +consume Pandas structures and convert back to them. + +DataFrames +---------- + +The equivalent to a Pandas DataFrame in Arrow is a :class:`pyarrow.table.Table`. +Both consist of a set of named columns of equal length. 
While Pandas only +supports flat columnas, the Table also provides nested columns, thus it can +represent more data than a DataFrame, so a full conversion is not always possible. + +Conversion from a Table to a DataFrame is done by calling +:meth:`pyarrow.table.Table.to_pandas`. The inverse is then achieved by using +:meth:`pyarrow.from_pandas_dataframe`. This conversion routine provides the +convenience parameter ``timestamps_to_ms``. Although Arrow supports timestamps of +different resolutions, Pandas only supports nanosecond timestamps and most +other systems (e.g. Parquet) only work on millisecond timestamps. This parameter +can be used to perform the timestamp conversion during the Pandas to Arrow +conversion. + +.. code-block:: python + + import pyarrow as pa + import pandas as pd + + df = pd.DataFrame({"a": [1, 2, 3]}) + # Convert from Pandas to Arrow + table = pa.from_pandas_dataframe(df) + # Convert back to Pandas + df_new = table.to_pandas() + + +Series +------ + +In Arrow, the most similar structure to a Pandas Series is an Array. +It is a vector that contains data of the same type as linear memory. You can +convert a Pandas Series to an Arrow Array using :meth:`pyarrow.array.from_pandas_series`. +As Arrow Arrays are always nullable, you can supply an optional mask using +the ``mask`` parameter to mark all null-entries. + diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index 969571262ca..ec6683327d2 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -293,6 +293,8 @@ cdef class RecordBatch: cdef class Table: ''' + A collection of top-level named, equal length Arrow arrays. + Do not call this class's constructor directly. 
''' @@ -330,6 +332,19 @@ cdef class Table: @staticmethod def from_arrays(names, arrays, name=None): + """ + Construct a Table from Arrow Arrays + + Parameters + ---------- + + names: list of str + Names for the table columns + arrays: list of pyarrow.array.Array + Equal-length arrays that should form the table. + name: str + (optional) name for the Table + """ cdef: Array arr c_string c_name From 0467e0ead0118882083b86d60a33078e172e1d66 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Fri, 4 Nov 2016 10:39:49 +0100 Subject: [PATCH 3/5] Move installation instructions into Sphinx docs Change-Id: I49c9ad9587af5989c4e98d2d6de6d3ea756d8e4c --- python/doc/INSTALL.md | 101 ---------------------------- python/doc/install.rst | 147 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 147 insertions(+), 101 deletions(-) delete mode 100644 python/doc/INSTALL.md create mode 100644 python/doc/install.rst diff --git a/python/doc/INSTALL.md b/python/doc/INSTALL.md deleted file mode 100644 index 81eed565d91..00000000000 --- a/python/doc/INSTALL.md +++ /dev/null @@ -1,101 +0,0 @@ - - -## Building pyarrow (Apache Arrow Python library) - -First, clone the master git repository: - -```bash -git clone https://github.com/apache/arrow.git arrow -``` - -#### System requirements - -Building pyarrow requires: - -* A C++11 compiler - - * Linux: gcc >= 4.8 or clang >= 3.5 - * OS X: XCode 6.4 or higher preferred - -* [cmake][1] - -#### Python requirements - -You will need Python (CPython) 2.7, 3.4, or 3.5 installed. Earlier releases and -are not being targeted. - -> This library targets CPython only due to an emphasis on interoperability with -> pandas and NumPy, which are only available for CPython. - -The build requires NumPy, Cython, and a few other Python dependencies: - -```bash -pip install cython -cd arrow/python -pip install -r requirements.txt -``` - -#### Installing Arrow C++ library - -First, you should choose an installation location for Arrow C++. 
In the future -using the default system install location will work, but for now we are being -explicit: - -```bash -export ARROW_HOME=$HOME/local -``` - -Now, we build Arrow: - -```bash -cd arrow/cpp - -mkdir dev-build -cd dev-build - -cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME .. - -make - -# Use sudo here if $ARROW_HOME requires it -make install -``` - -#### Install `pyarrow` - -```bash -cd arrow/python - -python setup.py install -``` - -> On XCode 6 and prior there are some known OS X `@rpath` issues. If you are -> unable to import pyarrow, upgrading XCode may be the solution. - - -```python -In [1]: import pyarrow - -In [2]: pyarrow.from_pylist([1,2,3]) -Out[2]: - -[ - 1, - 2, - 3 -] -``` - -[1]: https://cmake.org/ diff --git a/python/doc/install.rst b/python/doc/install.rst new file mode 100644 index 00000000000..a2fcfc888fe --- /dev/null +++ b/python/doc/install.rst @@ -0,0 +1,147 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +Install PyArrow +=============== + +Conda +----- + +To install the latest version of PyArrow from conda-forge using conda: + +.. code-block:: bash + + conda install -c conda-forge pyarrow + +Pip +--- + +Install the latest version from PyPI: + +.. 
code-block:: bash + + pip install pyarrow + +.. note:: + Currently there are only binary artifacts available for Linux and MacOS. + On other platforms this will only pull the Python sources and assume an existing + installation of the C++ part of Arrow. + To retrieve the binary artifacts, you'll need a recent ``pip`` version that + supports features like the ``manylinux1`` tag. + +Building from source +-------------------- + +First, clone the master git repository: + +.. code-block:: bash + + git clone https://github.com/apache/arrow.git arrow + +System requirements +~~~~~~~~~~~~~~~~~~~ + +Building pyarrow requires: + +* A C++11 compiler + + * Linux: gcc >= 4.8 or clang >= 3.5 + * OS X: XCode 6.4 or higher preferred + +* `CMake <https://cmake.org/>`_ + +Python requirements +~~~~~~~~~~~~~~~~~~~ + +You will need Python (CPython) 2.7, 3.4, or 3.5 installed. Earlier releases +are not being targeted. + +.. note:: + This library targets CPython only due to an emphasis on interoperability with + pandas and NumPy, which are only available for CPython. + +The build requires NumPy, Cython, and a few other Python dependencies: + +.. code-block:: bash + + pip install cython + cd arrow/python + pip install -r requirements.txt + +Installing Arrow C++ library +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +First, you should choose an installation location for Arrow C++. In the future +using the default system install location will work, but for now we are being +explicit: + +.. code-block:: bash + + export ARROW_HOME=$HOME/local + +Now, we build Arrow: + +.. code-block:: bash + + cd arrow/cpp + + mkdir dev-build + cd dev-build + + cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME .. + + make + + # Use sudo here if $ARROW_HOME requires it + make install + +To get the optional Parquet support, you should also build and install +`parquet-cpp `_. + +Install `pyarrow` +~~~~~~~~~~~~~~~~~ + + +.. 
warning:: + On XCode 6 and prior there are some known OS X `@rpath` issues. If you are + unable to import pyarrow, upgrading XCode may be the solution. + +.. note:: + In development installations, you will also need to set a correct + ``LD_LIBRARY_PATH``. This is typically done with + ``export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH``. + + +.. code-block:: python + + In [1]: import pyarrow + + In [2]: pyarrow.from_pylist([1,2,3]) + Out[2]: + + [ + 1, + 2, + 3 + ] + From 06b2f9c98ad338d0633834e2748d1ba0dfe2d455 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Wed, 9 Nov 2016 22:49:26 +0100 Subject: [PATCH 4/5] Add tables describing dtype support Change-Id: If8a3b9703039d4ecc07feec7e90e2be8d52719d4 --- python/doc/pandas.rst | 57 +++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 55 insertions(+), 2 deletions(-) diff --git a/python/doc/pandas.rst b/python/doc/pandas.rst index 2e5a0318185..7c700748178 100644 --- a/python/doc/pandas.rst +++ b/python/doc/pandas.rst @@ -26,7 +26,7 @@ DataFrames The equivalent to a Pandas DataFrame in Arrow is a :class:`pyarrow.table.Table`. Both consist of a set of named columns of equal length. While Pandas only -supports flat columnas, the Table also provides nested columns, thus it can +supports flat columns, the Table also provides nested columns, thus it can represent more data than a DataFrame, so a full conversion is not always possible. Conversion from a Table to a DataFrame is done by calling @@ -53,9 +53,62 @@ conversion. Series ------ -In Arrow, the most similar structure to a Pandas Series is an Array. +In Arrow, the most similar structure to a Pandas Series is an Array. It is a vector that contains data of the same type as linear memory. You can convert a Pandas Series to an Arrow Array using :meth:`pyarrow.array.from_pandas_series`. As Arrow Arrays are always nullable, you can supply an optional mask using the ``mask`` parameter to mark all null-entries. 
+Type differences +---------------- + +With the current design of Pandas and Arrow, it is not possible to convert all +column types unmodified. One of the main issues here is that Pandas has no +support for nullable columns of arbitrary type. Also ``datetime64`` is currently +fixed to nanosecond resolution. On the other hand, Arrow might still be missing +support for some types. + +Pandas -> Arrow Conversion +~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++------------------------+--------------------------+ +| Source Type (Pandas) | Destination Type (Arrow) | ++========================+==========================+ +| ``bool`` | ``BOOL`` | ++------------------------+--------------------------+ +| ``(u)int{8,16,32,64}`` | ``(U)INT{8,16,32,64}`` | ++------------------------+--------------------------+ +| ``float32`` | ``FLOAT`` | ++------------------------+--------------------------+ +| ``float64`` | ``DOUBLE`` | ++------------------------+--------------------------+ +| ``str`` / ``unicode`` | ``STRING`` | ++------------------------+--------------------------+ +| ``pd.Timestamp`` | ``TIMESTAMP(unit=ns)`` | ++------------------------+--------------------------+ +| ``pd.Categorical`` | *not supported* | ++------------------------+--------------------------+ + +Arrow -> Pandas Conversion +~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++-------------------------------------+--------------------------------------------------------+ +| Source Type (Arrow) | Destination Type (Pandas) | ++=====================================+========================================================+ +| ``BOOL`` | ``bool`` | ++-------------------------------------+--------------------------------------------------------+ +| ``BOOL`` *with nulls* | ``object`` (with values ``True``, ``False``, ``None``) | ++-------------------------------------+--------------------------------------------------------+ +| ``(U)INT{8,16,32,64}`` | ``(u)int{8,16,32,64}`` | 
++-------------------------------------+--------------------------------------------------------+ +| ``(U)INT{8,16,32,64}`` *with nulls* | ``float64`` | ++-------------------------------------+--------------------------------------------------------+ +| ``FLOAT`` | ``float32`` | ++-------------------------------------+--------------------------------------------------------+ +| ``DOUBLE`` | ``float64`` | ++-------------------------------------+--------------------------------------------------------+ +| ``STRING`` | ``str`` | ++-------------------------------------+--------------------------------------------------------+ +| ``TIMESTAMP(unit=*)`` | ``pd.Timestamp`` (``np.datetime64[ns]``) | ++-------------------------------------+--------------------------------------------------------+ + From 530484fa30d9441d13d81ce67b4c77d8890b7bef Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Fri, 11 Nov 2016 13:20:04 +0100 Subject: [PATCH 5/5] Mention new setup instructions Change-Id: I60a263b30ece4651153f34f8bd43f83026b5714a --- python/doc/install.rst | 4 ++++ python/doc/parquet.rst | 3 ++- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/python/doc/install.rst b/python/doc/install.rst index a2fcfc888fe..1bab0173016 100644 --- a/python/doc/install.rst +++ b/python/doc/install.rst @@ -120,6 +120,10 @@ Install `pyarrow` cd arrow/python + # --with-parquet enables the Apache Parquet support in PyArrow + # --build-type=release disables debugging information and turns on + # compiler optimizations for native code + python setup.py build_ext --with-parquet --build-type=release install python setup.py install .. warning:: diff --git a/python/doc/parquet.rst b/python/doc/parquet.rst index ea64c378038..674ed80f27c 100644 --- a/python/doc/parquet.rst +++ b/python/doc/parquet.rst @@ -21,7 +21,8 @@ Reading/Writing Parquet files If you have built ``pyarrow`` with Parquet support, i.e. 
``parquet-cpp`` was found during the build, you can read files in the Parquet format to/from Arrow memory structures. The Parquet support code is located in the -:mod:`pyarrow.parquet` module. +:mod:`pyarrow.parquet` module and your package needs to be built with the +``--with-parquet`` flag for ``build_ext``. Reading Parquet ---------------
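
Reviewer note on the "Arrow -> Pandas Conversion" table added in PATCH 4/5: the null-handling rows are easy to misread in grid-table form, so here is a minimal sketch that encodes the same table as a plain lookup. It uses only the Python standard library and does not call pyarrow. The helper name ``pandas_dtype_for`` and the ``has_nulls`` flag are hypothetical and exist only for illustration, and the timestamp row is reduced to the ``datetime64[ns]`` dtype string that the table gives in parentheses.

```python
# Illustrative only: encodes the "Arrow -> Pandas Conversion" table from
# python/doc/pandas.rst as a lookup. `pandas_dtype_for` is a hypothetical
# helper, not part of pyarrow.

# INT8..INT64 and UINT8..UINT64, as abbreviated by (U)INT{8,16,32,64}.
_INT_TYPES = {
    f"{prefix}INT{bits}"
    for prefix in ("", "U")
    for bits in (8, 16, 32, 64)
}


def pandas_dtype_for(arrow_type: str, has_nulls: bool = False) -> str:
    """Return the Pandas dtype name the table lists for an Arrow type name."""
    if arrow_type == "BOOL":
        # The table lists ``object`` for BOOL columns that contain nulls.
        return "object" if has_nulls else "bool"
    if arrow_type in _INT_TYPES:
        # The table lists ``float64`` for integer columns that contain nulls.
        return "float64" if has_nulls else arrow_type.lower()
    if arrow_type == "FLOAT":
        return "float32"
    if arrow_type == "DOUBLE":
        return "float64"
    if arrow_type == "STRING":
        return "str"
    if arrow_type.startswith("TIMESTAMP"):
        # TIMESTAMP(unit=*) always maps to nanosecond resolution.
        return "datetime64[ns]"
    raise ValueError(f"no documented Pandas mapping for {arrow_type!r}")
```

For example, ``pandas_dtype_for("INT32")`` yields ``int32``, while the same call with ``has_nulls=True`` yields ``float64``, which is the row most likely to surprise users after a round trip through Arrow. The table has no separate rows for nulls in ``FLOAT``, ``DOUBLE``, ``STRING`` or ``TIMESTAMP`` columns, so the sketch ignores ``has_nulls`` for those types.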