From b5b4df525e1fc878cac6f12f12a67df10248bde9 Mon Sep 17 00:00:00 2001
From: "Uwe L. Korn" <uwelk@xhochy.com>
Date: Tue, 1 Nov 2016 13:46:28 +0100
Subject: [PATCH 1/5] ARROW-356: Add documentation about reading Parquet

Change-Id: I1810ccbb021a79f1da1474cc1b952ab98503f010
---
 python/doc/index.rst   | 14 ++++-----
 python/doc/parquet.rst | 65 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+), 7 deletions(-)
 create mode 100644 python/doc/parquet.rst
diff --git a/python/doc/index.rst b/python/doc/index.rst
index 88725badc1e..0f29ce2acc4 100644
--- a/python/doc/index.rst
+++ b/python/doc/index.rst
@@ -31,14 +31,14 @@ additional functionality such as reading Apache Parquet files into Arrow
 structures.
 
 .. toctree::
-   :maxdepth: 4
-   :hidden:
+   :maxdepth: 2
+   :caption: Getting Started
 
    Module Reference <modules.rst>
 
-Indices and tables
-==================
+.. toctree::
+   :maxdepth: 2
+   :caption: Additional Features
+
+   Parquet format <parquet.rst>
 
-* :ref:`genindex`
-* :ref:`modindex`
-* :ref:`search`
diff --git a/python/doc/parquet.rst b/python/doc/parquet.rst
new file mode 100644
index 00000000000..ea64c378038
--- /dev/null
+++ b/python/doc/parquet.rst
@@ -0,0 +1,65 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Reading/Writing Parquet files
+=============================
+
+If you have built ``pyarrow`` with Parquet support, i.e. ``parquet-cpp`` was
+found during the build, you can read files in the Parquet format to/from Arrow
+memory structures. The Parquet support code is located in the
+:mod:`pyarrow.parquet` module.
+
+Reading Parquet
+---------------
+
+To read a Parquet file into Arrow memory, you can use the following code
+snippet. It will read the whole Parquet file into memory as an
+:class:`pyarrow.table.Table`.
+
+.. code-block:: python
+
+    import pyarrow
+    import pyarrow.parquet
+
+    A = pyarrow
+
+    table = A.parquet.read_table('<filename>')
+
+Writing Parquet
+---------------
+
+Given an instance of :class:`pyarrow.table.Table`, the simplest way to
+persist it to Parquet is by using the :func:`pyarrow.parquet.write_table`
+function.
+
+.. code-block:: python
+
+    import pyarrow
+    import pyarrow.parquet
+
+    A = pyarrow
+
+    table = A.Table(..)
+    A.parquet.write_table(table, '<filename>')
+
+By default this will write the Table as a single RowGroup using ``DICTIONARY``
+encoding. To increase the parallelism with which a query engine can process
+a Parquet file, set ``chunk_size`` to a fraction of the total number of rows.
+
+If you also want to compress the columns, you can select a compression
+method using the ``compression`` argument. Typically, ``GZIP`` is the choice if
+you want to minimize size and ``SNAPPY`` for performance.
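Reviewer note on the ``chunk_size`` paragraph above (outside the hunk, not part of the patch): a minimal pure-Python sketch of how a row count would split into row groups, assuming each row group holds at most ``chunk_size`` rows. ``row_group_bounds`` is a hypothetical helper for illustration only, not a pyarrow function.

```python
def row_group_bounds(num_rows, chunk_size):
    """Return the (start, stop) row range covered by each row group."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # The last row group may be shorter than chunk_size.
    return [
        (start, min(start + chunk_size, num_rows))
        for start in range(0, num_rows, chunk_size)
    ]


# A 1000-row table written with chunk_size=250 yields four row groups
# that a query engine could, in principle, process independently.
bounds = row_group_bounds(1000, 250)
```

Under this assumption, a smaller ``chunk_size`` gives more row groups and therefore more units of work, at the cost of more per-group overhead.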

From 744202a61dbc5b0d43a448cf8d849bed1f84d56d Mon Sep 17 00:00:00 2001
From: "Uwe L. Korn" <uwelk@xhochy.com>
Date: Fri, 4 Nov 2016 09:54:19 +0100
Subject: [PATCH 2/5] Document Pandas<->Arrow conversion

Change-Id: Icd19435eb0202a972817adfd880983b1c937bef1
---
 python/doc/index.rst     |  2 ++
 python/doc/pandas.rst    | 61 ++++++++++++++++++++++++++++++++++++++++
 python/pyarrow/table.pyx | 15 ++++++++++
 3 files changed, 78 insertions(+)
 create mode 100644 python/doc/pandas.rst

diff --git a/python/doc/index.rst b/python/doc/index.rst
index 0f29ce2acc4..6725ae707d9 100644
--- a/python/doc/index.rst
+++ b/python/doc/index.rst
@@ -34,6 +34,8 @@ structures.
    :maxdepth: 2
    :caption: Getting Started
 
+   Installing pyarrow <install.rst>
+   Pandas <pandas.rst>
    Module Reference <modules.rst>
 
 .. toctree::
diff --git a/python/doc/pandas.rst b/python/doc/pandas.rst
new file mode 100644
index 00000000000..2e5a0318185
--- /dev/null
+++ b/python/doc/pandas.rst
@@ -0,0 +1,61 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Pandas Interface
+================
+
+To interface with Pandas, PyArrow provides various conversion routines to
+consume Pandas structures and convert back to them.
+
+DataFrames
+----------
+
+The equivalent to a Pandas DataFrame in Arrow is a :class:`pyarrow.table.Table`.
+Both consist of a set of named columns of equal length. While Pandas only
+supports flat columnas, the Table also provides nested columns, thus it can
+represent more data than a DataFrame, so a full conversion is not always possible.
+
+Conversion from a Table to a DataFrame is done by calling
+:meth:`pyarrow.table.Table.to_pandas`. The inverse is then achieved by using
+:meth:`pyarrow.from_pandas_dataframe`. This conversion routine provides the
+convenience parameter ``timestamps_to_ms``. Although Arrow supports timestamps of
+different resolutions, Pandas only supports nanosecond timestamps and most
+other systems (e.g. Parquet) only work on millisecond timestamps. This parameter
+performs the conversion to milliseconds as part of the Pandas to Arrow
+conversion.
+
+.. code-block:: python
+
+    import pyarrow as pa
+    import pandas as pd
+
+    df = pd.DataFrame({"a": [1, 2, 3]})
+    # Convert from Pandas to Arrow
+    table = pa.from_pandas_dataframe(df)
+    # Convert back to Pandas
+    df_new = table.to_pandas()
+
+
+Series
+------
+
+In Arrow, the most similar structure to a Pandas Series is an Array. 
+It is a vector that contains data of the same type as linear memory. You can
+convert a Pandas Series to an Arrow Array using :meth:`pyarrow.array.from_pandas_series`.
+As Arrow Arrays are always nullable, you can supply an optional mask using
+the ``mask`` parameter to mark all null-entries.
+
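Reviewer note on ``timestamps_to_ms`` (outside the hunk, not part of the patch): a minimal sketch of what coarsening nanosecond timestamps to milliseconds means for the raw integer values. ``ns_to_ms`` is a hypothetical helper written in plain Python; it is not how pyarrow implements the conversion, and the use of floor division is an assumption.

```python
NS_PER_MS = 1000000  # nanoseconds in one millisecond


def ns_to_ms(ns_values):
    """Convert integer nanosecond timestamps to milliseconds.

    None entries stand in for nulls and are passed through unchanged.
    Sub-millisecond precision is dropped by floor division.
    """
    return [None if v is None else v // NS_PER_MS for v in ns_values]


converted = ns_to_ms([1478000000123456789, None, 0])
```

The sub-millisecond digits are lost in this direction, so a round trip back to nanoseconds does not restore the original values.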
diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx
index 969571262ca..ec6683327d2 100644
--- a/python/pyarrow/table.pyx
+++ b/python/pyarrow/table.pyx
@@ -293,6 +293,8 @@ cdef class RecordBatch:
 
 cdef class Table:
     '''
+    A collection of top-level named, equal-length Arrow arrays.
+
     Do not call this class's constructor directly.
     '''
 
@@ -330,6 +332,19 @@ cdef class Table:
 
     @staticmethod
     def from_arrays(names, arrays, name=None):
+        """
+        Construct a Table from Arrow Arrays
+
+        Parameters
+        ----------
+
+        names : list of str
+            Names for the table columns
+        arrays : list of pyarrow.array.Array
+            Equal-length arrays that should form the table.
+        name : str, optional
+            Name for the Table
+        """
         cdef:
             Array arr
             c_string c_name

From 0467e0ead0118882083b86d60a33078e172e1d66 Mon Sep 17 00:00:00 2001
From: "Uwe L. Korn" <uwelk@xhochy.com>
Date: Fri, 4 Nov 2016 10:39:49 +0100
Subject: [PATCH 3/5] Move installation instructions into Sphinx docs

Change-Id: I49c9ad9587af5989c4e98d2d6de6d3ea756d8e4c
---
 python/doc/INSTALL.md  | 101 ----------------------------
 python/doc/install.rst | 147 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 147 insertions(+), 101 deletions(-)
 delete mode 100644 python/doc/INSTALL.md
 create mode 100644 python/doc/install.rst

diff --git a/python/doc/INSTALL.md b/python/doc/INSTALL.md
deleted file mode 100644
index 81eed565d91..00000000000
--- a/python/doc/INSTALL.md
+++ /dev/null
@@ -1,101 +0,0 @@
-<!---
-  Licensed under the Apache License, Version 2.0 (the "License");
-  you may not use this file except in compliance with the License.
-  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
-  Unless required by applicable law or agreed to in writing, software
-  distributed under the License is distributed on an "AS IS" BASIS,
-  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-  See the License for the specific language governing permissions and
-  limitations under the License. See accompanying LICENSE file.
--->
-
-## Building pyarrow (Apache Arrow Python library)
-
-First, clone the master git repository:
-
-```bash
-git clone https://github.com/apache/arrow.git arrow
-```
-
-#### System requirements
-
-Building pyarrow requires:
-
-* A C++11 compiler
-
-  * Linux: gcc >= 4.8 or clang >= 3.5
-  * OS X: XCode 6.4 or higher preferred
-
-* [cmake][1]
-
-#### Python requirements
-
-You will need Python (CPython) 2.7, 3.4, or 3.5 installed. Earlier releases and
-are not being targeted.
-
-> This library targets CPython only due to an emphasis on interoperability with
-> pandas and NumPy, which are only available for CPython.
-
-The build requires NumPy, Cython, and a few other Python dependencies:
-
-```bash
-pip install cython
-cd arrow/python
-pip install -r requirements.txt
-```
-
-#### Installing Arrow C++ library
-
-First, you should choose an installation location for Arrow C++. In the future
-using the default system install location will work, but for now we are being
-explicit:
-
-```bash
-export ARROW_HOME=$HOME/local
-```
-
-Now, we build Arrow:
-
-```bash
-cd arrow/cpp
-
-mkdir dev-build
-cd dev-build
-
-cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME ..
-
-make
-
-# Use sudo here if $ARROW_HOME requires it
-make install
-```
-
-#### Install `pyarrow`
-
-```bash
-cd arrow/python
-
-python setup.py install
-```
-
-> On XCode 6 and prior there are some known OS X `@rpath` issues. If you are
-> unable to import pyarrow, upgrading XCode may be the solution.
-
-
-```python
-In [1]: import pyarrow
-
-In [2]: pyarrow.from_pylist([1,2,3])
-Out[2]:
-<pyarrow.array.Int64Array object at 0x7f899f3e60e8>
-[
-  1,
-  2,
-  3
-]
-```
-
-[1]: https://cmake.org/
diff --git a/python/doc/install.rst b/python/doc/install.rst
new file mode 100644
index 00000000000..a2fcfc888fe
--- /dev/null
+++ b/python/doc/install.rst
@@ -0,0 +1,147 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Install PyArrow
+===============
+
+Conda
+-----
+
+To install the latest version of PyArrow from conda-forge using conda:
+
+.. code-block:: bash
+
+    conda install -c conda-forge pyarrow
+
+Pip
+---
+
+Install the latest version from PyPI:
+
+.. code-block:: bash
+
+    pip install pyarrow
+
+.. note::
+    Currently there are only binary artifacts available for Linux and macOS.
+    On other platforms this will only pull the Python sources and assumes an
+    existing installation of the C++ part of Arrow.
+    To retrieve the binary artifacts, you'll need a recent ``pip`` version that
+    supports features like the ``manylinux1`` tag.
+
+Building from source
+--------------------
+
+First, clone the master git repository:
+
+.. code-block:: bash
+
+    git clone https://github.com/apache/arrow.git arrow
+
+System requirements
+~~~~~~~~~~~~~~~~~~~
+
+Building pyarrow requires:
+
+* A C++11 compiler
+
+  * Linux: gcc >= 4.8 or clang >= 3.5
+  * OS X: XCode 6.4 or higher preferred
+
+* `CMake <https://cmake.org/>`_
+
+Python requirements
+~~~~~~~~~~~~~~~~~~~
+
+You will need Python (CPython) 2.7, 3.4, or 3.5 installed. Earlier releases
+are not being targeted.
+
+.. note::
+    This library targets CPython only due to an emphasis on interoperability with
+    pandas and NumPy, which are only available for CPython.
+
+The build requires NumPy, Cython, and a few other Python dependencies:
+
+.. code-block:: bash
+
+    pip install cython
+    cd arrow/python
+    pip install -r requirements.txt
+
+Installing Arrow C++ library
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First, you should choose an installation location for Arrow C++. In the future
+using the default system install location will work, but for now we are being
+explicit:
+
+.. code-block:: bash
+    
+    export ARROW_HOME=$HOME/local
+
+Now, we build Arrow:
+
+.. code-block:: bash
+
+    cd arrow/cpp
+    
+    mkdir dev-build
+    cd dev-build
+    
+    cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME ..
+    
+    make
+    
+    # Use sudo here if $ARROW_HOME requires it
+    make install
+
+To get the optional Parquet support, you should also build and install 
+`parquet-cpp <https://github.com/apache/parquet-cpp/blob/master/README.md>`_.
+
+Install `pyarrow`
+~~~~~~~~~~~~~~~~~
+
+
+.. code-block:: bash
+
+    cd arrow/python
+
+    python setup.py install
+
+.. warning::
+    On XCode 6 and prior there are some known OS X ``@rpath`` issues. If you are
+    unable to import pyarrow, upgrading XCode may be the solution.
+
+.. note::
+    In development installations, you will also need to set a correct
+    ``LD_LIBRARY_PATH``. This is most probably done with
+    ``export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH``.
+
+
+.. code-block:: python
+    
+    In [1]: import pyarrow
+
+    In [2]: pyarrow.from_pylist([1,2,3])
+    Out[2]:
+    <pyarrow.array.Int64Array object at 0x7f899f3e60e8>
+    [
+      1,
+      2,
+      3
+    ]
+

From 06b2f9c98ad338d0633834e2748d1ba0dfe2d455 Mon Sep 17 00:00:00 2001
From: "Uwe L. Korn" <uwelk@xhochy.com>
Date: Wed, 9 Nov 2016 22:49:26 +0100
Subject: [PATCH 4/5] Add tables describing dtype support

Change-Id: If8a3b9703039d4ecc07feec7e90e2be8d52719d4
---
 python/doc/pandas.rst | 57 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 55 insertions(+), 2 deletions(-)

diff --git a/python/doc/pandas.rst b/python/doc/pandas.rst
index 2e5a0318185..7c700748178 100644
--- a/python/doc/pandas.rst
+++ b/python/doc/pandas.rst
@@ -26,7 +26,7 @@ DataFrames
 
 The equivalent to a Pandas DataFrame in Arrow is a :class:`pyarrow.table.Table`.
 Both consist of a set of named columns of equal length. While Pandas only
-supports flat columnas, the Table also provides nested columns, thus it can
+supports flat columns, the Table also provides nested columns, thus it can
 represent more data than a DataFrame, so a full conversion is not always possible.
 
 Conversion from a Table to a DataFrame is done by calling
@@ -53,9 +53,62 @@ conversion.
 Series
 ------
 
-In Arrow, the most similar structure to a Pandas Series is an Array. 
+In Arrow, the most similar structure to a Pandas Series is an Array.
 It is a vector that contains data of the same type as linear memory. You can
 convert a Pandas Series to an Arrow Array using :meth:`pyarrow.array.from_pandas_series`.
 As Arrow Arrays are always nullable, you can supply an optional mask using
 the ``mask`` parameter to mark all null-entries.
 
+Type differences
+----------------
+
+With the current design of Pandas and Arrow, it is not possible to convert all
+column types unmodified. One of the main issues here is that Pandas has no
+support for nullable columns of arbitrary type. Also ``datetime64`` is currently
+fixed to nanosecond resolution. On the other hand, Arrow might still be missing
+support for some types.
+
+Pandas -> Arrow Conversion
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++------------------------+--------------------------+
+| Source Type (Pandas)   | Destination Type (Arrow) |
++========================+==========================+
+| ``bool``               | ``BOOL``                 |
++------------------------+--------------------------+
+| ``(u)int{8,16,32,64}`` | ``(U)INT{8,16,32,64}``   |
++------------------------+--------------------------+
+| ``float32``            | ``FLOAT``                |
++------------------------+--------------------------+
+| ``float64``            | ``DOUBLE``               |
++------------------------+--------------------------+
+| ``str`` / ``unicode``  | ``STRING``               |
++------------------------+--------------------------+
+| ``pd.Timestamp``       | ``TIMESTAMP(unit=ns)``   |
++------------------------+--------------------------+
+| ``pd.Categorical``     | *not supported*          |
++------------------------+--------------------------+
+
+Arrow -> Pandas Conversion
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++-------------------------------------+--------------------------------------------------------+
+| Source Type (Arrow)                 | Destination Type (Pandas)                              |
++=====================================+========================================================+
+| ``BOOL``                            | ``bool``                                               |
++-------------------------------------+--------------------------------------------------------+
+| ``BOOL`` *with nulls*               | ``object`` (with values ``True``, ``False``, ``None``) |
++-------------------------------------+--------------------------------------------------------+
+| ``(U)INT{8,16,32,64}``              | ``(u)int{8,16,32,64}``                                 |
++-------------------------------------+--------------------------------------------------------+
+| ``(U)INT{8,16,32,64}`` *with nulls* | ``float64``                                            |
++-------------------------------------+--------------------------------------------------------+
+| ``FLOAT``                           | ``float32``                                            |
++-------------------------------------+--------------------------------------------------------+
+| ``DOUBLE``                          | ``float64``                                            |
++-------------------------------------+--------------------------------------------------------+
+| ``STRING``                          | ``str``                                                |
++-------------------------------------+--------------------------------------------------------+
+| ``TIMESTAMP(unit=*)``               | ``pd.Timestamp`` (``np.datetime64[ns]``)               |
++-------------------------------------+--------------------------------------------------------+
+
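Reviewer note on the Arrow -> Pandas table (outside the hunk, not part of the patch): the rules in the table can be written as a small lookup, which makes the null-handling special cases explicit. This is a sketch that mirrors the documented table only; ``pandas_dtype`` and the type-name strings are hypothetical and do not correspond to any pyarrow API.

```python
# Integer types map one-to-one when no nulls are present.
_INT_TYPES = {
    "%sINT%d" % (prefix, bits): "%sint%d" % (prefix.lower(), bits)
    for prefix in ("", "U")
    for bits in (8, 16, 32, 64)
}

ARROW_TO_PANDAS = {
    "BOOL": "bool",
    "FLOAT": "float32",
    "DOUBLE": "float64",
    "STRING": "str",
}
ARROW_TO_PANDAS.update(_INT_TYPES)


def pandas_dtype(arrow_type, has_nulls=False):
    """Return the Pandas dtype name the table lists for an Arrow type."""
    if arrow_type.startswith("TIMESTAMP"):
        # Every timestamp unit ends up at nanosecond resolution.
        return "datetime64[ns]"
    if has_nulls and arrow_type == "BOOL":
        # True / False / None cannot be stored in a plain bool column.
        return "object"
    if has_nulls and arrow_type in _INT_TYPES:
        # Integers with nulls fall back to float64.
        return "float64"
    return ARROW_TO_PANDAS[arrow_type]
```

For example, ``pandas_dtype("INT32")`` gives ``"int32"``, while ``pandas_dtype("INT32", has_nulls=True)`` gives ``"float64"``, matching the *with nulls* rows of the table.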

From 530484fa30d9441d13d81ce67b4c77d8890b7bef Mon Sep 17 00:00:00 2001
From: "Uwe L. Korn" <uwelk@xhochy.com>
Date: Fri, 11 Nov 2016 13:20:04 +0100
Subject: [PATCH 5/5] Mention new setup instructions

Change-Id: I60a263b30ece4651153f34f8bd43f83026b5714a
---
 python/doc/install.rst | 4 ++++
 python/doc/parquet.rst | 3 ++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/python/doc/install.rst b/python/doc/install.rst
index a2fcfc888fe..1bab0173016 100644
--- a/python/doc/install.rst
+++ b/python/doc/install.rst
@@ -120,6 +120,10 @@ Install `pyarrow`
 
     cd arrow/python
 
+    # --with-parquet enables the Apache Parquet support in PyArrow
+    # --build-type=release disables debugging information and turns on
+    #       compiler optimizations for native code
+    python setup.py build_ext --with-parquet --build-type=release install
     python setup.py install
 
 .. warning::
diff --git a/python/doc/parquet.rst b/python/doc/parquet.rst
index ea64c378038..674ed80f27c 100644
--- a/python/doc/parquet.rst
+++ b/python/doc/parquet.rst
@@ -21,7 +21,8 @@ Reading/Writing Parquet files
 If you have built ``pyarrow`` with Parquet support, i.e. ``parquet-cpp`` was
 found during the build, you can read files in the Parquet format to/from Arrow
 memory structures. The Parquet support code is located in the
-:mod:`pyarrow.parquet` module.
+:mod:`pyarrow.parquet` module and your package needs to be built with the
+``--with-parquet`` flag for ``build_ext``.
 
 Reading Parquet
 ---------------