From 66022a6a0e8cb611b78817f0437f564118812fb5 Mon Sep 17 00:00:00 2001
From: Fokko Driesprong
Date: Fri, 10 Nov 2023 21:00:31 +0100
Subject: [PATCH 1/2] Docs: Add section on pandas

---
 mkdocs/docs/api.md | 41 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/mkdocs/docs/api.md b/mkdocs/docs/api.md
index d716a138a2..613c33e51b 100644
--- a/mkdocs/docs/api.md
+++ b/mkdocs/docs/api.md
@@ -318,7 +318,7 @@ In this case it is up to the engine itself to filter the file itself. Below, `to
 
 !!! note "Requirements"
 
-    This requires [PyArrow to be installed](index.md).
+    This requires [`pyarrow` to be installed](index.md).
 
 
 
@@ -346,6 +346,45 @@ tpep_dropoff_datetime: [[2021-04-01 00:47:59.000000,...,2021-05-01 00:14:47.0000
 
 This will only pull in the files that might contain matching rows.
 
+### Pandas
+
+
+
+!!! note "Requirements"
+    This requires [`pandas` to be installed](index.md).
+
+
+
+PyIceberg makes it easy to filter out data from a huge table and pull it into a Pandas dataframe locally. This will only fetch Parquet files that might contain matching data. This will reduce IO and therefore improve performance and reduce cost.
+
+```python
+table.scan(
+    row_filter="trip_distance >= 10.0",
+    selected_fields=("VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime"),
+).to_pandas()
+```
+
+This will return a Pandas dataframe:
+
+```
+        VendorID      tpep_pickup_datetime     tpep_dropoff_datetime
+0              2 2021-04-01 00:28:05+00:00 2021-04-01 00:47:59+00:00
+1              1 2021-04-01 00:39:01+00:00 2021-04-01 00:57:39+00:00
+2              2 2021-04-01 00:14:42+00:00 2021-04-01 00:42:59+00:00
+3              1 2021-04-01 00:17:17+00:00 2021-04-01 00:43:38+00:00
+4              1 2021-04-01 00:24:04+00:00 2021-04-01 00:56:20+00:00
+...          ...                       ...                       ...
+116976         2 2021-04-30 23:56:18+00:00 2021-05-01 00:29:13+00:00
+116977         2 2021-04-30 23:07:41+00:00 2021-04-30 23:37:18+00:00
+116978         2 2021-04-30 23:38:28+00:00 2021-05-01 00:12:04+00:00
+116979         2 2021-04-30 23:33:00+00:00 2021-04-30 23:59:00+00:00
+116980         2 2021-04-30 23:44:25+00:00 2021-05-01 00:14:47+00:00
+
+[116981 rows x 3 columns]
+```
+
+It is recommended to use Pandas 2 or later, because it stores the data in an [Apache Arrow backend](https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i), which avoids copying the data.
+
 ### DuckDB
 
 

From aa8941316aaf25051c0e98e29ae6a6966ceda9cd Mon Sep 17 00:00:00 2001
From: Fokko Driesprong
Date: Tue, 14 Nov 2023 17:55:00 +0100
Subject: [PATCH 2/2] Update mkdocs/docs/api.md

---
 mkdocs/docs/api.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mkdocs/docs/api.md b/mkdocs/docs/api.md
index 613c33e51b..e2f726afe8 100644
--- a/mkdocs/docs/api.md
+++ b/mkdocs/docs/api.md
@@ -355,7 +355,7 @@
 
 
 
-PyIceberg makes it easy to filter out data from a huge table and pull it into a Pandas dataframe locally. This will only fetch Parquet files that might contain matching data. This will reduce IO and therefore improve performance and reduce cost.
+PyIceberg makes it easy to filter out data from a huge table and pull it into a Pandas dataframe locally. This will only fetch the relevant Parquet files for the query and apply the filter. This will reduce IO and therefore improve performance and reduce cost.
 
 ```python
 table.scan(
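A note on the section the patches above add: the object returned by `.to_pandas()` is a regular Pandas dataframe, so any further work happens locally with plain Pandas. Below is a minimal sketch of that; the dataframe here merely mimics the documented output schema with made-up values (it is not produced by a real Iceberg scan), and the `duration_min` column is a hypothetical derived field, not part of the table.

```python
import pandas as pd

# Illustrative stand-in for the result of table.scan(...).to_pandas();
# the schema mirrors the documented output, the values are made up.
df = pd.DataFrame(
    {
        "VendorID": [2, 1, 2],
        "tpep_pickup_datetime": pd.to_datetime(
            ["2021-04-01 00:28:05", "2021-04-01 00:39:01", "2021-04-01 00:14:42"],
            utc=True,
        ),
        "tpep_dropoff_datetime": pd.to_datetime(
            ["2021-04-01 00:47:59", "2021-04-01 00:57:39", "2021-04-01 00:42:59"],
            utc=True,
        ),
    }
)

# Filtering and column pruning already happened at scan time; anything
# further, like deriving a trip duration in minutes, is plain Pandas.
df["duration_min"] = (
    df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
).dt.total_seconds() / 60
```

Because `row_filter` and `selected_fields` were applied during the scan, the dataframe is already small when it reaches memory; only post-scan transformations like the one above run locally.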