
Conversation

@jorisvandenbossche (Member)
No description provided.

@jorisvandenbossche force-pushed the ARROW-8063-dataset-docs branch 2 times, most recently from 0915cfb to 818add7 on April 6, 2020 20:45
@kszucs (Member) commented Apr 6, 2020

Don't forget to trigger the test-ubuntu-18.04-docs crossbow task before merging.

@jorisvandenbossche force-pushed the ARROW-8063-dataset-docs branch 2 times, most recently from 47bdaf2 to 5afd61f on April 9, 2020 15:38
@bkietz (Member) left a comment

This looks like a great start, thanks @jorisvandenbossche!

I have some suggestions on wording but the overall structure seems solid.

Comment on lines 322 to 339

Member

Suggested change

From:

    # creating a dummy dataset: directory with two files
    table = pa.table({'col1': range(3), 'col2': np.random.randn(3)})
    pq.write_table(table, "parquet_dataset_manual/data_file1.parquet")
    pq.write_table(table, "parquet_dataset_manual/data_file2.parquet")

    To create a Dataset from a list of files, we need to specify the schema, format,
    filesystem, and paths manually:

    .. ipython:: python

        import pyarrow.fs

        schema = pa.schema([("file", pa.int64()), ("col1", pa.int64()), ("col2", pa.float64())])

        dataset = ds.FileSystemDataset(
            schema, None, ds.ParquetFileFormat(), pa.fs.LocalFileSystem(),
            ["parquet_dataset_manual/data_file1.parquet", "parquet_dataset_manual/data_file2.parquet"],
            [ds.field('file') == 1, ds.field('file') == 2])

To:

    # creating a dummy dataset: directory with two files
    table = pa.table({'col1': range(3), 'col2': np.random.randn(3)})
    pq.write_table(table, "parquet_dataset_manual/old.parquet")
    pq.write_table(table, "parquet_dataset_manual/new.parquet")

    To create a Dataset from a list of files, we need to specify the schema, format,
    filesystem, and paths manually:

    .. ipython:: python

        import pyarrow.fs

        schema = pa.schema([("year", pa.int32()), ("col1", pa.int64()), ("col2", pa.float64())])

        dataset = ds.FileSystemDataset(
            schema, None, ds.ParquetFileFormat(), pa.fs.LocalFileSystem(),
            ["parquet_dataset_manual/old.parquet", "parquet_dataset_manual/new.parquet"],
            [ds.field('year') < 2020, ds.field('year') >= 2020])

Member Author (@jorisvandenbossche)

I like the idea of your improvement, but this is actually an interesting case: if you use such a greater-than/less-than expression, it can't really be added to the table as a column (I just tried, and currently it gives all nulls).

While interesting, it might not be the best example to show first.

But I will change to using years in the file names (just with exact names); that will already make the example nicer.
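A minimal sketch of the all-nulls behaviour described above, assuming the PR-era FileSystemDataset constructor used in the suggestion:

    import pathlib

    import numpy as np
    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.fs
    import pyarrow.parquet as pq

    # the dummy dataset from the suggestion: two plain Parquet files where
    # 'year' exists only as a partition expression, not as a column on disk
    pathlib.Path("parquet_dataset_manual").mkdir(exist_ok=True)
    table = pa.table({'col1': range(3), 'col2': np.random.randn(3)})
    pq.write_table(table, "parquet_dataset_manual/old.parquet")
    pq.write_table(table, "parquet_dataset_manual/new.parquet")

    schema = pa.schema([("year", pa.int64()), ("col1", pa.int64()), ("col2", pa.float64())])
    dataset = ds.FileSystemDataset(
        schema, None, ds.ParquetFileFormat(), pa.fs.LocalFileSystem(),
        ["parquet_dataset_manual/old.parquet", "parquet_dataset_manual/new.parquet"],
        [ds.field('year') < 2020, ds.field('year') >= 2020])

    # a range expression has no single value that could be materialized,
    # so the 'year' column comes back as all nulls, as noted above
    print(dataset.to_table().column('year'))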

Member Author (@jorisvandenbossche)

Also, filtering with that kind of expression is not very intuitive.
If you have the partition expressions [ds.field('year') < 2020, ds.field('year') >= 2020], then when filtering with ds.field('year') == 2020 you might intuitively expect it to read the second file, as "year == 2020" is a subset of "year >= 2020", but this is not the case.
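Continuing the sketch above (dataset as constructed there), the counterintuitive behaviour at the time of this discussion:

    # with partition expressions [year < 2020, year >= 2020], one might expect
    # this filter to select the second file, since year == 2020 implies
    # year >= 2020; as described above, it did not, and no rows came back
    result = dataset.to_table(filter=ds.field('year') == 2020)
    print(result.num_rows)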

Member

> you might intuitively expect it to read the second file, as "year == 2020" is a subset of "year >= 2020", but this is not the case.

I certainly did expect that. I will investigate.

Member Author (@jorisvandenbossche)

Also, the change here from int64 -> int32 in the schema doesn't work, because the partition expressions specified for the files default to int64.
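A hypothetical illustration of the mismatch (the variable names here are just for this sketch):

    import pyarrow as pa
    import pyarrow.dataset as ds

    # the integer literal in this partition expression defaults to int64 ...
    expr = ds.field('year') < 2020

    # ... so an int32 'year' field conflicts with the expressions above,
    schema_i32 = pa.schema([("year", pa.int32()), ("col1", pa.int64()), ("col2", pa.float64())])
    # while an int64 'year' field matches their default literal type:
    schema_i64 = pa.schema([("year", pa.int64()), ("col1", pa.int64()), ("col2", pa.float64())])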

@jorisvandenbossche (Member, Author)

@bkietz thanks for the feedback!

@jorisvandenbossche marked this pull request as ready for review on April 14, 2020 12:00
@pitrou (Member) left a comment

Just two things that stuck out.


    .. TODO Full blown example with NYC taxi data to show off, afterwards explain all parts:

    .. ipython:: python
Member

Hmm, I'm frankly not fond of "ipython" directives. The more we add, the slower the docs build becomes, and the more that deters people from editing/improving the docs.

Member Author (@jorisvandenbossche)

Yeah, I am happy to change this (I mostly used it because it was already used elsewhere).

However, I think ideally we should still check the code examples for correctness where applicable, for example by running pytest doctests on them. That can run separately as tests and doesn't need to be part of the doc build.
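A minimal sketch of that idea, assuming the examples are written in doctest form inside the .rst sources, so that pytest can collect them with its --doctest-glob option independently of the Sphinx build:

    >>> import pyarrow as pa
    >>> table = pa.table({'col1': [1, 2, 3]})
    >>> table.num_rows
    3

Running, e.g., pytest --doctest-glob="*.rst" on the docs directory would then exercise such snippets as ordinary tests.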

@kszucs (Member) left a comment

LGTM

@kszucs closed this in 9274c1b on Apr 14, 2020
@jorisvandenbossche deleted the ARROW-8063-dataset-docs branch on April 15, 2020 07:41