From bc98e4ebd09ac7f3533eaa95eca336dd8e66ba2c Mon Sep 17 00:00:00 2001 From: Alessandro Molina Date: Wed, 25 Aug 2021 11:50:47 +0200 Subject: [PATCH 1/8] Add links to the cookbook --- docs/source/index.rst | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/docs/source/index.rst b/docs/source/index.rst index 65aeb47ea9f..5579e8cd781 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -55,6 +55,16 @@ target environment.** Rust status +.. _toc.cookbook: + +.. toctree:: + :maxdepth: 1 + :caption: Cookbooks + + C++ + Python + R + .. _toc.columnar: .. toctree:: From 7a5f0fae39890243f236cf136a1b5add80b17862 Mon Sep 17 00:00:00 2001 From: Alessandro Molina Date: Wed, 25 Aug 2021 16:32:15 +0200 Subject: [PATCH 2/8] Improve doc for new users --- docs/source/python/getstarted.rst | 149 ++++++++++++++++++++++++++++++ docs/source/python/index.rst | 16 +++- 2 files changed, 160 insertions(+), 5 deletions(-) create mode 100644 docs/source/python/getstarted.rst diff --git a/docs/source/python/getstarted.rst b/docs/source/python/getstarted.rst new file mode 100644 index 00000000000..4af82b367e1 --- /dev/null +++ b/docs/source/python/getstarted.rst @@ -0,0 +1,149 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +.. _getstarted: + +Getting Started +=============== + +Arrow manages data in Arrays (:class:`pyarrow.Array`), which can be +grouped in tables (:class:`pyarrow.Table`) to represent columns of data +in tabular data. + +Arrow also exposes supports for various formats to get those tabular +data in and out of disk and networks. Most commonly used formats are +Parquet (:ref:`parquet`) and the IPC format (:ref:`ipc`). + +Creating Arrays and Tables +-------------------------- + +Arrays in Arrow are collections of data of uniform type. That allows +arrow to use the best performing implementation to store the data and +perform computation of it. So each array is meant to have data and +a type + +.. ipython:: python + + import pyarrow as pa + + days = pa.array([1, 12, 17, 23, 28], type=pa.int8()) + +multiple arrays can be combined in tables to form the columns +in tabular data according to a provided schema + +.. ipython:: python + + months = pa.array([1, 3, 5, 7, 1], type=pa.int8()) + years = pa.array([1990, 2000, 1995, 2000, 1995], type=pa.int16()) + + birthdays_table = pa.table([days, months, years], + schema=pa.schema([ + ('days', days.type), + ('months', months.type), + ('years', years.type) + ])) + + birthdays_table + +See :ref:`data` for more details. + +Saving and Loading Tables +------------------------- + +Once you have a tabular data, Arrow provides out of the box +the features to save and restore that data for common formats +like parquet + +.. 
ipython:: python
+
+    import pyarrow.parquet as pq
+
+    pq.write_table(birthdays_table, 'birthdays.parquet')
+
+Once you have your data on disk, loading it back is as easy,
+and Arrow is heavily optimized for memory and speed so loading
+data will be as quick as possible
+
+.. ipython:: python
+
+    reloaded_birthdays = pq.read_table('birthdays.parquet')
+
+    reloaded_birthdays
+
+Saving and loading back data in arrow is usually done through
+:ref:`parquet`, :ref:`ipc`, :ref:`csv` or :ref:`json` formats.
+
+Performing Computations
+-----------------------
+
+Arrow ships with a bunch of compute functions that can be applied
+to its arrays, so through the compute functions it's possible to apply
+transformations to the data
+
+.. ipython:: python
+
+    import pyarrow.compute as pc
+
+    pc.value_counts(birthdays_table["years"])
+
+See :ref:`compute` for a list of available compute functions and
+how to use them.
+
+Working with big data
+---------------------
+
+Arrow also provides the :class:`pyarrow.dataset` api to work with
+big data, which will handle for you partitioning of your data in
+smaller chunks
+
+.. ipython:: python
+
+    import pyarrow.dataset as ds
+
+    ds.write_dataset(birthdays_table, "savedir", format="parquet",
+                     partitioning=ds.partitioning(
+                        pa.schema([birthdays_table.schema.field("years")])
+                     ))
+
+Loading back the partitioned dataset will detect the chunks
+
+.. ipython:: python
+
+    birthdays_dataset = ds.dataset("savedir", schema=birthdays_table.schema,
+                                   partitioning=ds.partitioning(field_names=["years"]))
+
+    birthdays_dataset.files
+
+and will lazily load chunks of data only when iterating over them
+
+.. ipython:: python
+
+    import datetime
+
+    current_year = datetime.datetime.utcnow().year
+    for table_chunk in birthdays_dataset.to_batches():
+        print("AGES", pc.abs(pc.subtract(table_chunk["years"], current_year)))
+
+For further details on how to work with big datasets, how to filter them,
+how to project them etc... refer to :ref:`dataset` documentation.
+
+Continuing from here
+--------------------
+
+For digging further into Arrow, you might want to read the
+:doc:`PyArrow Documentation <./index>` itself or the
+`Arrow Python Cookbook `_
diff --git a/docs/source/python/index.rst b/docs/source/python/index.rst
index cc7383044e0..14fe21b9bfa 100644
--- a/docs/source/python/index.rst
+++ b/docs/source/python/index.rst
@@ -15,12 +15,17 @@
 .. specific language governing permissions and limitations
 .. under the License.
 
-Python bindings
-===============
+PyArrow - Apache Arrow Python bindings
+======================================
 
 This is the documentation of the Python API of Apache Arrow. For more details
-on the Arrow format and other language bindings see the
-:doc:`parent documentation <../index>`.
+on the Arrow format and other language bindings
+
+Apache Arrow is a development platform for in-memory analytics.
+It contains a set of technologies that enable big data systems to store, process and move data fast.
+
+See the :doc:`parent documentation <../index>` for additional details on
+the Arrow Project itself.
 
 The Arrow Python bindings (also named "PyArrow") have first-class integration
 with NumPy, pandas, and built-in Python objects. They are based on the C++
@@ -34,9 +39,10 @@ files into Arrow structures.
:maxdepth: 2 install - memory + getstarted data compute + memory ipc filesystems filesystems_deprecated From a534f37fc283db7d5e251f05f654a5c26b4b14f1 Mon Sep 17 00:00:00 2001 From: Alessandro Molina Date: Mon, 30 Aug 2021 12:01:37 +0200 Subject: [PATCH 3/8] Apply suggestions from code review Co-authored-by: Joris Van den Bossche --- docs/source/python/getstarted.rst | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/docs/source/python/getstarted.rst b/docs/source/python/getstarted.rst index 4af82b367e1..79ab43977ff 100644 --- a/docs/source/python/getstarted.rst +++ b/docs/source/python/getstarted.rst @@ -24,7 +24,7 @@ Arrow manages data in Arrays (:class:`pyarrow.Array`), which can be grouped in tables (:class:`pyarrow.Table`) to represent columns of data in tabular data. -Arrow also exposes supports for various formats to get those tabular +Arrow also provides support for various formats to get those tabular data in and out of disk and networks. Most commonly used formats are Parquet (:ref:`parquet`) and the IPC format (:ref:`ipc`). @@ -32,8 +32,8 @@ Creating Arrays and Tables -------------------------- Arrays in Arrow are collections of data of uniform type. That allows -arrow to use the best performing implementation to store the data and -perform computation of it. So each array is meant to have data and +Arrow to use the best performing implementation to store the data and +perform computations on it. So each array is meant to have data and a type .. ipython:: python @@ -42,7 +42,7 @@ a type days = pa.array([1, 12, 17, 23, 28], type=pa.int8()) -multiple arrays can be combined in tables to form the columns +Multiple arrays can be combined in tables to form the columns in tabular data according to a provided schema .. ipython:: python @@ -64,9 +64,9 @@ See :ref:`data` for more details. Saving and Loading Tables ------------------------- -Once you have a tabular data, Arrow provides out of the box +Once you have tabular data, Arrow provides out of the box the features to save and restore that data for common formats -like parquet +like Parquet: .. ipython:: python @@ -85,7 +85,7 @@ data will be as quick as possible reloaded_birthdays Saving and loading back data in arrow is usually done through -:ref:`parquet`, :ref:`ipc`, :ref:`csv` or :ref:`json` formats. +:ref:`parquet`, :ref:`ipc` (:ref:`feather`), :ref:`csv` or :ref:`json` formats. Performing Computations ----------------------- @@ -123,8 +123,7 @@ Loading back the partitioned dataset will detect the chunks .. ipython:: python - birthdays_dataset = ds.dataset("savedir", schema=birthdays_table.schema, - partitioning=ds.partitioning(field_names=["years"])) + birthdays_dataset = ds.dataset("savedir", format="parquet", partitioning=["years"]) birthdays_dataset.files @@ -136,10 +135,10 @@ and will lazily load chunks of data only when iterating over them current_year = datetime.datetime.utcnow().year for table_chunk in birthdays_dataset.to_batches(): - print("AGES", pc.abs(pc.subtract(table_chunk["years"], current_year))) + print("AGES", pc.subtract(current_year, table_chunk["years"])) For further details on how to work with big datasets, how to filter them, -how to project them etc... refer to :ref:`dataset` documentation. +how to project them, etc., refer to :ref:`dataset` documentation. 
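+
+As a minimal sketch of that filtering (``ds.field`` and the ``filter=``
+argument of ``to_table`` are the :mod:`pyarrow.dataset` pieces assumed
+here), a single partition can be read back without scanning the rest:
+
+.. ipython:: python
+
+    birthdays_dataset.to_table(filter=ds.field("years") == 2000)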
 Continuing from here
 --------------------

From c59d08c8a6bdc0ed82bdba02d61ec9f02757227d Mon Sep 17 00:00:00 2001
From: Alessandro Molina
Date: Mon, 30 Aug 2021 15:39:26 +0200
Subject: [PATCH 4/8] Address feedback

---
 docs/source/python/getstarted.rst | 12 ++++--------
 docs/source/python/index.rst      |  5 ++---
 2 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/docs/source/python/getstarted.rst b/docs/source/python/getstarted.rst
index 79ab43977ff..756e6157efe 100644
--- a/docs/source/python/getstarted.rst
+++ b/docs/source/python/getstarted.rst
@@ -43,19 +43,15 @@ a type
    days = pa.array([1, 12, 17, 23, 28], type=pa.int8())
 
 Multiple arrays can be combined in tables to form the columns
-in tabular data according to a provided schema
+in tabular data when attached to a column name
 
 .. ipython:: python
 
     months = pa.array([1, 3, 5, 7, 1], type=pa.int8())
     years = pa.array([1990, 2000, 1995, 2000, 1995], type=pa.int16())
 
-    birthdays_table = pa.table([days, months, years],
-                               schema=pa.schema([
-                                   ('days', days.type),
-                                   ('months', months.type),
-                                   ('years', years.type)
-                               ]))
+    birthdays_table = pa.table([days, months, years],
+                               names=["days", "months", "years"])
 
     birthdays_table
 
@@ -74,7 +70,7 @@ like Parquet:
 
    pq.write_table(birthdays_table, 'birthdays.parquet')
 
-Once you have your data on disk, loading it back is as easy,
+Once you have your data on disk, loading it back is a single function call,
 and Arrow is heavily optimized for memory and speed so loading
 data will be as quick as possible
 
diff --git a/docs/source/python/index.rst b/docs/source/python/index.rst
index 14fe21b9bfa..0ffa40545d9 100644
--- a/docs/source/python/index.rst
+++ b/docs/source/python/index.rst
@@ -18,14 +18,13 @@
 PyArrow - Apache Arrow Python bindings
 ======================================
 
-This is the documentation of the Python API of Apache Arrow. For more details
-on the Arrow format and other language bindings
+This is the documentation of the Python API of Apache Arrow.
 
 Apache Arrow is a development platform for in-memory analytics.
 It contains a set of technologies that enable big data systems to store, process and move data fast.
 
 See the :doc:`parent documentation <../index>` for additional details on
-the Arrow Project itself.
+the Arrow Project itself, on the Arrow format and the other language bindings.
 
 The Arrow Python bindings (also named "PyArrow") have first-class integration
 with NumPy, pandas, and built-in Python objects. They are based on the C++

From a6ca9448353dafd2b141e8b8b736045b3f475bbe Mon Sep 17 00:00:00 2001
From: Alessandro Molina
Date: Mon, 30 Aug 2021 15:45:04 +0200
Subject: [PATCH 5/8] provide link names

---
 docs/source/python/getstarted.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/source/python/getstarted.rst b/docs/source/python/getstarted.rst
index 756e6157efe..90de8bf2d08 100644
--- a/docs/source/python/getstarted.rst
+++ b/docs/source/python/getstarted.rst
@@ -81,7 +81,8 @@ data will be as quick as possible
 
     reloaded_birthdays
 
 Saving and loading back data in arrow is usually done through
-:ref:`parquet`, :ref:`ipc` (:ref:`feather`), :ref:`csv` or :ref:`json` formats.
+:ref:`Parquet `, :ref:`IPC format ` (:ref:`feather`), :ref:`CSV ` or
+:ref:`Line-Delimited JSON ` formats.
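+
+As a minimal sketch of one alternative (the :mod:`pyarrow.feather` module
+used here ships with pyarrow itself), the same table can be round-tripped
+through Feather as well:
+
+.. ipython:: python
+
+    import pyarrow.feather as feather
+
+    feather.write_feather(birthdays_table, 'birthdays.feather')
+    feather.read_table('birthdays.feather')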
Performing Computations ----------------------- From 5cc90ea8aaa6e6a280d2da3730db7d290b696f3a Mon Sep 17 00:00:00 2001 From: Alessandro Molina Date: Tue, 31 Aug 2021 15:55:28 +0200 Subject: [PATCH 6/8] replace big data with large data --- docs/source/python/getstarted.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/python/getstarted.rst b/docs/source/python/getstarted.rst index 90de8bf2d08..95b647bc167 100644 --- a/docs/source/python/getstarted.rst +++ b/docs/source/python/getstarted.rst @@ -100,11 +100,11 @@ transformations to the data See :ref:`compute` for a list of available compute functions and how to use them. -Working with big data ---------------------- +Working with large data +----------------------- Arrow also provides the :class:`pyarrow.dataset` api to work with -big data, which will handle for you partitioning of your data in +large data, which will handle for you partitioning of your data in smaller chunks .. ipython:: python From eea3795e103ab3c6c1a391ded545a98eeb711324 Mon Sep 17 00:00:00 2001 From: Alessandro Molina Date: Tue, 31 Aug 2021 15:56:21 +0200 Subject: [PATCH 7/8] tables too --- docs/source/python/getstarted.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/python/getstarted.rst b/docs/source/python/getstarted.rst index 95b647bc167..45c6e7958d3 100644 --- a/docs/source/python/getstarted.rst +++ b/docs/source/python/getstarted.rst @@ -88,8 +88,8 @@ Performing Computations ----------------------- Arrow ships with a bunch of compute functions that can be applied -to its arrays, so through the compute functions it's possible to apply -transformations to the data +to its arrays and tables, so through the compute functions +it's possible to apply transformations to the data .. ipython:: python From c131048773e5c98711df569854fa024be10ff705 Mon Sep 17 00:00:00 2001 From: Alessandro Molina Date: Wed, 1 Sep 2021 11:15:35 +0200 Subject: [PATCH 8/8] Tweak --- docs/source/python/getstarted.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/python/getstarted.rst b/docs/source/python/getstarted.rst index 45c6e7958d3..36e4707ad61 100644 --- a/docs/source/python/getstarted.rst +++ b/docs/source/python/getstarted.rst @@ -20,7 +20,7 @@ Getting Started =============== -Arrow manages data in Arrays (:class:`pyarrow.Array`), which can be +Arrow manages data in arrays (:class:`pyarrow.Array`), which can be grouped in tables (:class:`pyarrow.Table`) to represent columns of data in tabular data. @@ -81,8 +81,8 @@ data will be as quick as possible reloaded_birthdays Saving and loading back data in arrow is usually done through -:ref:`Parquet `, :ref:`IPC format ` (:ref:`feather`), :ref:`CSV ` or -:ref:`Line-Delimited JSON ` formats. +:ref:`Parquet `, :ref:`IPC format ` (:ref:`feather`), +:ref:`CSV ` or :ref:`Line-Delimited JSON ` formats. Performing Computations ----------------------- @@ -103,7 +103,7 @@ how to use them. Working with large data ----------------------- -Arrow also provides the :class:`pyarrow.dataset` api to work with +Arrow also provides the :class:`pyarrow.dataset` API to work with large data, which will handle for you partitioning of your data in smaller chunks
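
A minimal standalone sketch of the point PATCH 7 makes (compute functions
accept a table's chunked columns, not just plain arrays; the function names
below come from the public :mod:`pyarrow.compute` API, and the table is
invented for illustration)::

    import pyarrow as pa
    import pyarrow.compute as pc

    # table["years"] is a ChunkedArray rather than a plain Array
    table = pa.table({"years": [1990, 2000, 1995, 2000, 1995]})

    # compute kernels accept the chunked column directly
    print(pc.min_max(table["years"]))       # min/max across all chunks
    print(pc.value_counts(table["years"]))  # same call as on a plain Array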