ARROW-8063: [Python][Dataset] Start user guide for pyarrow.dataset #6779
Conversation
Force-pushed 0915cfb to 818add7
Don't forget to trigger …
Force-pushed 47bdaf2 to 5afd61f
bkietz left a comment:
This looks like a great start, thanks @jorisvandenbossche!
I have some suggestions on wording but the overall structure seems solid.
docs/source/python/dataset.rst (Outdated)
Suggested change:

      # creating a dummy dataset: directory with two files
      table = pa.table({'col1': range(3), 'col2': np.random.randn(3)})
    - pq.write_table(table, "parquet_dataset_manual/data_file1.parquet")
    - pq.write_table(table, "parquet_dataset_manual/data_file2.parquet")
    + pq.write_table(table, "parquet_dataset_manual/old.parquet")
    + pq.write_table(table, "parquet_dataset_manual/new.parquet")

      To create a Dataset from a list of files, we need to specify the schema, format,
      filesystem, and paths manually:

      .. ipython:: python

          import pyarrow.fs
    -     schema = pa.schema([("file", pa.int64()), ("col1", pa.int64()), ("col2", pa.float64())])
    +     schema = pa.schema([("year", pa.int32()), ("col1", pa.int64()), ("col2", pa.float64())])
          dataset = ds.FileSystemDataset(
              schema, None, ds.ParquetFileFormat(), pa.fs.LocalFileSystem(),
    -         ["parquet_dataset_manual/data_file1.parquet", "parquet_dataset_manual/data_file2.parquet"],
    -         [ds.field('file') == 1, ds.field('file') == 2])
    +         ["parquet_dataset_manual/old.parquet", "parquet_dataset_manual/new.parquet"],
    +         [ds.field('year') < 2020, ds.field('year') >= 2020])
I like the idea of your improvement, but this is actually an interesting case: if you use such a larger/smaller filter as the partition expression, it can't really be added to the table as a column (I just tried, and currently it gives all nulls).
While interesting, it might not be the best example to show first.
But I will change to using years in the file names (just with exact names); that already makes the example nicer.
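For reference, a minimal sketch of that "exact names" variant, assuming a recent pyarrow (it uses FileSystemDataset.from_paths with keyword arguments rather than the positional constructor in the diff above; the file names and dummy data are illustrative):

    import pathlib

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.fs
    import pyarrow.parquet as pq

    # create a dummy dataset: the year is only encoded in the file names
    pathlib.Path("parquet_dataset_manual").mkdir(exist_ok=True)
    table = pa.table({'col1': [0, 1, 2], 'col2': [0.1, 0.2, 0.3]})
    pq.write_table(table, "parquet_dataset_manual/data_2019.parquet")
    pq.write_table(table, "parquet_dataset_manual/data_2020.parquet")

    schema = pa.schema([("year", pa.int64()), ("col1", pa.int64()),
                        ("col2", pa.float64())])
    dataset = ds.FileSystemDataset.from_paths(
        ["parquet_dataset_manual/data_2019.parquet",
         "parquet_dataset_manual/data_2020.parquet"],
        schema=schema,
        format=ds.ParquetFileFormat(),
        filesystem=pa.fs.LocalFileSystem(),
        partitions=[ds.field('year') == 2019, ds.field('year') == 2020])

    # with exact equality expressions, the virtual 'year' column can be
    # materialized when reading
    print(dataset.to_table())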
Also, filtering with that kind of expression is not very intuitive.
If you have the partition expressions [ds.field('year') < 2020, ds.field('year') >= 2020], then when filtering with ds.field('year') == 2020 you might intuitively expect it to read the second file, since "year == 2020" is a subset of "year >= 2020", but this is not the case.
> you might intuitively expect it to read the second file, since "year == 2020" is a subset of "year >= 2020", but this is not the case.

I certainly did expect that. I will investigate.
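One way to investigate is to ask the dataset which fragments a given filter keeps. A hedged sketch, assuming the old.parquet/new.parquet files from the suggestion above exist and a recent pyarrow (where a fragment is pruned only when its partition expression contradicts the filter; the behavior at the time of this PR may have differed):

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.fs

    schema = pa.schema([("year", pa.int64()), ("col1", pa.int64()),
                        ("col2", pa.float64())])
    dataset = ds.FileSystemDataset.from_paths(
        ["parquet_dataset_manual/old.parquet",
         "parquet_dataset_manual/new.parquet"],
        schema=schema,
        format=ds.ParquetFileFormat(),
        filesystem=pa.fs.LocalFileSystem(),
        partitions=[ds.field('year') < 2020, ds.field('year') >= 2020])

    # list the fragments whose partition expression is compatible with the filter
    for fragment in dataset.get_fragments(filter=ds.field('year') == 2020):
        print(fragment.path, '|', fragment.partition_expression)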
Also, the change here from int64 to int32 in the schema doesn't work, because the partition expressions specified for the files default to int64.
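A small sketch of why, assuming a recent pyarrow: a bare Python int in an expression is inferred as int64, so the literals in the partition expressions above carry int64, not int32:

    import pyarrow as pa
    import pyarrow.dataset as ds

    # a plain Python integer is inferred as int64 ...
    print(pa.scalar(2020).type)   # -> int64

    # ... so the literal inside the partition expression is int64 as well,
    # which conflicts with declaring 'year' as int32 in the dataset schema
    print(ds.field('year') >= 2020)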
@bkietz thanks for the feedback!
Force-pushed b0bb657 to c30b3b8
Co-Authored-By: Benjamin Kietzman <bengilgit@gmail.com>
Force-pushed c30b3b8 to d1e771c
pitrou left a comment:
Just two things that stuck out.
docs/source/python/dataset.rst (Outdated)

    .. TODO Full blown example with NYC taxi data to show off, afterwards explain all parts:

    .. ipython:: python
Hmm, I'm frankly not fond of "ipython" directives. The more we add, the slower the docs become to build, and the more it deters people from editing/improving them.
Yeah, I am happy to change this (I mostly used it because it was already used elsewhere).
However, I think ideally we should still check the code examples for correctness where applicable, for example by running pytest doctests on them. That can be run separately as tests and doesn't need to be part of the doc build.
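For illustration only (not what this PR ended up doing): a doctest-style block like the one below can be checked with pytest --doctest-glob="*.rst" docs/source/python/dataset.rst, separately from the Sphinx build:

    # hypothetical doctest snippet: pytest runs the ">>>" lines and compares
    # the actual output against the expected text underneath
    >>> import pyarrow as pa
    >>> pa.array([1, 2, 3]).type
    DataType(int64)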
Force-pushed b234947 to 7d7ba3f
kszucs left a comment:
LGTM