From ffdb3621c7edf4adf9bd37ded16c798ca9ba6ca8 Mon Sep 17 00:00:00 2001
From: HyukjinKwon
Date: Wed, 1 May 2019 22:03:04 +0900
Subject: [PATCH] Update documentation about Arrow version in PySpark as well

---
 docs/sql-pyspark-pandas-with-arrow.md | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/docs/sql-pyspark-pandas-with-arrow.md b/docs/sql-pyspark-pandas-with-arrow.md
index f40b33eb97a93..b35a5692d6131 100644
--- a/docs/sql-pyspark-pandas-with-arrow.md
+++ b/docs/sql-pyspark-pandas-with-arrow.md
@@ -22,7 +22,7 @@ license: |
 * Table of contents
 {:toc}
 
-## Apache Arrow in Spark
+## Apache Arrow in PySpark
 
 Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer
 data between JVM and Python processes. This currently is most beneficial to Python users that
@@ -35,7 +35,7 @@ working with Arrow-enabled data.
 
 If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the
 SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow
-is installed and available on all cluster nodes. The current supported version is 0.8.0.
+is installed and available on all cluster nodes. The current supported version is 0.12.1.
 You can install using pip or conda from the conda-forge channel. See PyArrow
 [installation](https://arrow.apache.org/docs/python/install.html) for details.
 
@@ -143,8 +143,7 @@ For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/p
 
 ### Supported SQL Types
 
 Currently, all Spark SQL data types are supported by Arrow-based conversion except `MapType`,
-`ArrayType` of `TimestampType`, and nested `StructType`. `BinaryType` is supported only when
-installed PyArrow is equal to or higher than 0.10.0.
+`ArrayType` of `TimestampType`, and nested `StructType`.
 
 ### Setting Arrow Batch Size
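
For reviewers, a minimal sketch (not part of this patch) of what the updated documentation describes: verifying the locally installed PyArrow version and enabling Arrow-based conversion in PySpark. The config key `spark.sql.execution.arrow.enabled` is the one used by this era of the docs; the app name and sample data are illustrative assumptions.

```python
# Illustrative sketch only; not part of this patch.
import pyarrow
import pandas as pd
from pyspark.sql import SparkSession

# The patch bumps the documented supported PyArrow version to 0.12.1.
print(pyarrow.__version__)

spark = SparkSession.builder.appName("arrow-example").getOrCreate()

# Enable Arrow-based columnar transfers for toPandas() and
# createDataFrame() with a pandas DataFrame.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = pd.DataFrame({"a": [1, 2, 3]})
df = spark.createDataFrame(pdf)  # uses Arrow when enabled
result = df.toPandas()           # uses Arrow when enabled
```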