
Conversation

@jorisvandenbossche (Member)
No description provided.

@jorisvandenbossche force-pushed the ARROW-8063-dataset-docs branch 2 times, most recently from 0915cfb to 818add7 on April 6, 2020 20:45
@kszucs (Member) commented Apr 6, 2020

Don't forget to trigger the test-ubuntu-18.04-docs crossbow task before merging.

@jorisvandenbossche force-pushed the ARROW-8063-dataset-docs branch 2 times, most recently from 47bdaf2 to 5afd61f on April 9, 2020 15:38
@bkietz (Member) left a comment

This looks like a great start, thanks @jorisvandenbossche!

I have some suggestions on wording but the overall structure seems solid.

Comment on lines 322 to 339

Member

Suggested change

From:

    # creating a dummy dataset: directory with two files
    table = pa.table({'col1': range(3), 'col2': np.random.randn(3)})
    pq.write_table(table, "parquet_dataset_manual/data_file1.parquet")
    pq.write_table(table, "parquet_dataset_manual/data_file2.parquet")

    To create a Dataset from a list of files, we need to specify the schema, format,
    filesystem, and paths manually:

    .. ipython:: python

        import pyarrow.fs

        schema = pa.schema([("file", pa.int64()), ("col1", pa.int64()), ("col2", pa.float64())])

        dataset = ds.FileSystemDataset(
            schema, None, ds.ParquetFileFormat(), pa.fs.LocalFileSystem(),
            ["parquet_dataset_manual/data_file1.parquet", "parquet_dataset_manual/data_file2.parquet"],
            [ds.field('file') == 1, ds.field('file') == 2])

To:

    # creating a dummy dataset: directory with two files
    table = pa.table({'col1': range(3), 'col2': np.random.randn(3)})
    pq.write_table(table, "parquet_dataset_manual/old.parquet")
    pq.write_table(table, "parquet_dataset_manual/new.parquet")

    To create a Dataset from a list of files, we need to specify the schema, format,
    filesystem, and paths manually:

    .. ipython:: python

        import pyarrow.fs

        schema = pa.schema([("year", pa.int32()), ("col1", pa.int64()), ("col2", pa.float64())])

        dataset = ds.FileSystemDataset(
            schema, None, ds.ParquetFileFormat(), pa.fs.LocalFileSystem(),
            ["parquet_dataset_manual/old.parquet", "parquet_dataset_manual/new.parquet"],
            [ds.field('year') < 2020, ds.field('year') >= 2020])

Member Author (@jorisvandenbossche)

I like the idea of your improvement, but this is actually an interesting case: if you use such a greater-than/less-than expression, it can't really be added to the table as a column (I just tried, and currently it gives all nulls).

While interesting, it might not be the best example to show first.

But I will change to using years in the file names (just with exact names); that will already make the example nicer.
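A minimal sketch of the all-nulls behaviour described above, assuming the PR-era FileSystemDataset constructor used in the suggestion:

    import pathlib

    import numpy as np
    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.fs
    import pyarrow.parquet as pq

    # the dummy dataset from the suggestion: two plain Parquet files where
    # 'year' exists only as a partition expression, not as a column on disk
    pathlib.Path("parquet_dataset_manual").mkdir(exist_ok=True)
    table = pa.table({'col1': range(3), 'col2': np.random.randn(3)})
    pq.write_table(table, "parquet_dataset_manual/old.parquet")
    pq.write_table(table, "parquet_dataset_manual/new.parquet")

    schema = pa.schema([("year", pa.int64()), ("col1", pa.int64()), ("col2", pa.float64())])
    dataset = ds.FileSystemDataset(
        schema, None, ds.ParquetFileFormat(), pa.fs.LocalFileSystem(),
        ["parquet_dataset_manual/old.parquet", "parquet_dataset_manual/new.parquet"],
        [ds.field('year') < 2020, ds.field('year') >= 2020])

    # a range expression has no single value that could be materialized,
    # so the 'year' column comes back as all nulls, as noted above
    print(dataset.to_table().column('year'))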

Member Author (@jorisvandenbossche)

Also, filtering with that kind of expression is not very intuitive.
If you have the partition expressions [ds.field('year') < 2020, ds.field('year') >= 2020], then when filtering with ds.field('year') == 2020 you might intuitively expect it to read the second file, as "year == 2020" is a subset of "year >= 2020", but this is not the case.
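Continuing the sketch above (dataset as constructed there), the counterintuitive behaviour at the time of this discussion:

    # with partition expressions [year < 2020, year >= 2020], one might expect
    # this filter to select the second file, since year == 2020 implies
    # year >= 2020; as described above, it did not, and no rows came back
    result = dataset.to_table(filter=ds.field('year') == 2020)
    print(result.num_rows)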

Member

> you might intuitively expect it to read the second file, as "year == 2020" is a subset of "year >= 2020", but this is not the case.

I certainly did expect that. I will investigate.

Member Author (@jorisvandenbossche)

Also, the change here from int64 -> int32 in the schema doesn't work, because the partition expressions specified for the files default to int64.
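A hypothetical illustration of the mismatch (the variable names here are just for this sketch):

    import pyarrow as pa
    import pyarrow.dataset as ds

    # the integer literal in this partition expression defaults to int64 ...
    expr = ds.field('year') < 2020

    # ... so an int32 'year' field conflicts with the expressions above,
    schema_i32 = pa.schema([("year", pa.int32()), ("col1", pa.int64()), ("col2", pa.float64())])
    # while an int64 'year' field matches their default literal type:
    schema_i64 = pa.schema([("year", pa.int64()), ("col1", pa.int64()), ("col2", pa.float64())])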

@jorisvandenbossche (Member, Author)

@bkietz thanks for the feedback!

@jorisvandenbossche marked this pull request as ready for review on April 14, 2020 12:00
@pitrou (Member) left a comment

Just two things that stuck out.


    .. TODO Full blown example with NYC taxi data to show off, afterwards explain all parts:

    .. ipython:: python
Member

Hmm, I'm frankly not fond of "ipython" directives. The more we add, the slower the docs build becomes, and the more that deters people from editing/improving the docs.

Member Author (@jorisvandenbossche)

Yeah, I am happy to change this (I mostly used it because it was already used elsewhere).

However, I think ideally we should still check the code examples for correctness where applicable, for example by running pytest doctests on them. That can run separately as tests and doesn't need to be part of the doc build.
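A minimal sketch of that idea, assuming the examples are written in doctest form inside the .rst sources, so that pytest can collect them with its --doctest-glob option independently of the Sphinx build:

    >>> import pyarrow as pa
    >>> table = pa.table({'col1': [1, 2, 3]})
    >>> table.num_rows
    3

Running, e.g., pytest --doctest-glob="*.rst" on the docs directory would then exercise such snippets as ordinary tests.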

@kszucs (Member) left a comment

LGTM

@kszucs closed this in 9274c1b on Apr 14, 2020
@jorisvandenbossche deleted the ARROW-8063-dataset-docs branch on April 15, 2020 07:41