ARROW-7965: [Python] Refine higher level dataset API #6505
Conversation
That makes the usage indeed a bit easier. But I am wondering: how expensive is "finishing" the factory? As it stands, you are finishing all sub-datasets, only to reuse the non-finished factory later on.
Yeah, that's a downside until we wrap both the DatasetFactory and the Dataset in a single class which does the discovery lazily.
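A minimal sketch of what such a lazy wrapper could look like; `LazyDataset` is a hypothetical name for illustration (not code from this PR), assuming the factory exposes the usual `finish()` method:

```python
class LazyDataset:
    """Hypothetical wrapper: keep the factory, discover lazily."""

    def __init__(self, factory):
        self._factory = factory
        self._dataset = None  # finished Dataset, materialized on demand

    def _materialize(self):
        # Discovery (factory.finish()) runs at most once; the finished
        # Dataset is cached so later calls reuse it.
        if self._dataset is None:
            self._dataset = self._factory.finish()
        return self._dataset

    def to_table(self, **kwargs):
        return self._materialize().to_table(**kwargs)
```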
I agree with the goal here, but I wonder if the solution should perhaps be in C++? That way we wouldn't have to reimplement this in R too.
cc @bkietz |
If we push this into C++, it will have the same problem of paying the dataset discovery cost twice.
@jorisvandenbossche in order to reuse the inspected schemas, AFAICS we need to "cache" them in the C++ instances and return them from […]
Trying to put into words, a bit more structured, what I said yesterday in the meeting: right now we need to keep a reference to the dataset factories to be able to reuse those factories, since […]. But instead of fixing this on the dataset side by holding this reference, can we change the implementation of […]? Looking at the […]
OK, I wanted to put this in some pseudo Python code to explain it better, which led me to see that my suggestion is indeed not possible right now ;) The current logic: […]. The issue that makes it impossible right now to pass a vector of datasets is that the sub-datasets need to be finished with a potentially different (unified) schema.
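In roughly today's Python API terms, the flow described above might be sketched as follows; `build_union_dataset` is an illustrative name, not this PR's code, while `pa.unify_schemas`, `DatasetFactory.finish(schema=...)`, and `ds.UnionDataset` are current pyarrow spellings:

```python
import pyarrow as pa
import pyarrow.dataset as ds

def build_union_dataset(child_factories):
    # Inspect every child factory to get its (possibly differing) schema.
    schemas = [factory.inspect() for factory in child_factories]
    # Unify them into a single common schema.
    unified = pa.unify_schemas(schemas)
    # Each child must be finished *with the unified schema*, which is why
    # a vector of already-finished Datasets (whose schemas were fixed at
    # finish() time) cannot simply be passed in instead.
    children = [factory.finish(schema=unified) for factory in child_factories]
    return ds.UnionDataset(unified, children)
```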
@bkietz @fsaintjacques with your deeper knowledge of the C++ dataset code, could you confirm whether my idea above is possible: having a UnionDataset constructor take a vector of Dataset objects instead of a vector of DatasetFactories?
@jorisvandenbossche this is already supported in C++, but currently the schemas must be identical rather than merely compatible. Creating a view of a dataset with a differing schema would probably be straightforward; I created https://issues.apache.org/jira/browse/ARROW-8164 to track this feature.
Thanks for opening the issue!
What's the status of this patch? It's still in the 0.17.0 backlog.
This is blocked by #6721 (which is blocked by a strange R failure, I think). I added it to the 0.17 milestone, as it would be nice to get this in, as I noted in the JIRA:
But it is also certainly not a blocker, since the datasets API is still experimental anyway.
#6721 has been merged, so this is unblocked.
Force-pushed from 95a2508 to f761dbd.
jorisvandenbossche left a comment
Thanks a lot for this PR! The diff is a bit hard to interpret, but generally looks good to me.
One thing I am not really sure about is the string URI value to specify the filesystem. What's the use for that? If you have a URI, just pass it to the source? This seems a needless complication to me, and not something we support elsewhere? (or do we?)
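For concreteness, the two spellings at issue, written against today's `ds.dataset()` API with made-up paths; the review question is whether the second form adds anything over the first:

```python
import pyarrow.dataset as ds

# Passing one full URI as the source ...
data = ds.dataset("s3://my-bucket/data/", format="parquet")

# ... versus a relative path plus the filesystem given as a URI string,
# the spelling the comment questions.
data = ds.dataset("data/", filesystem="s3://my-bucket", format="parquet")
```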
@github-actions crossbow submit conda-win-vs2015-py36
Revision: dacded6
Submitted crossbow builds: ursa-labs/crossbow @ actions-90
Force-pushed from 3f21c9a to 5692edc.
@github-actions crossbow submit conda-win-vs2015-py36
Revision: 5692edc
Submitted crossbow builds: ursa-labs/crossbow @ actions-111
Failures are due to the "ignore_prefix" -> "selector_ignore_prefix" rename on the C++ side. Since all Ursabot Python builds are passing, it also seems they are not running any of the parquet- or dataset-related tests.
@github-actions crossbow submit conda-win-vs2015-py36
Revision: acbe2cb
Submitted crossbow builds: ursa-labs/crossbow @ actions-113
Updated.
Yes, but we're not actively maintaining the ursabot builds now.
pitrou left a comment
I only skimmed through this, a few comments.
def factory(path_or_paths, filesystem=None, partitioning=None,
            format=None):
...
def _ensure_filesystem(fs_or_uri):
Can be for a follow-up JIRA, but we might want to move this helper to pyarrow.fs, and use it also in other places where we accept filesystems? (Although I actually don't know whether there are already many such places.)
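A sketch of what such a shared helper in pyarrow.fs could look like; the helper itself is hypothetical here, while `FileSystem.from_uri()` (which returns a `(filesystem, path)` pair) is the existing pyarrow API:

```python
from pyarrow.fs import FileSystem

def _ensure_filesystem(fs_or_uri):
    # Accept either a FileSystem instance or a URI string and always
    # return a FileSystem instance.
    if isinstance(fs_or_uri, str):
        filesystem, _path = FileSystem.from_uri(fs_or_uri)
        return filesystem
    if isinstance(fs_or_uri, FileSystem):
        return fs_or_uri
    raise TypeError(
        "expected a pyarrow.fs.FileSystem instance or a URI string"
    )
```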
@kszucs can you also add this patch to this branch:

--- a/python/pyarrow/tests/test_parquet.py
+++ b/python/pyarrow/tests/test_parquet.py
@@ -2490,13 +2490,7 @@ def _assert_dataset_paths(dataset, paths, use_legacy_dataset):
         assert set(map(str, paths)) == {x.path for x in dataset.pieces}
     else:
         paths = [str(path.as_posix()) for path in paths]
-        if hasattr(dataset._dataset, 'files'):
-            assert set(paths) == set(dataset._dataset.files)
-        else:
-            # UnionDataset
-            # TODO(temp hack) remove this branch once ARROW-7965 is in (which
-            # will change this to a FileSystemDataset)
-            assert dataset.read().num_rows == 50
+        assert set(paths) == set(dataset._dataset.files)

(That's a clean-up for a hack I added yesterday because a list of paths was giving a UnionDataset, which this PR is fixing.)
pitrou left a comment
Thank you. A few more comments.
Thanks Joris, Ben, Antoine! Merging.
Provides a more intuitive way to construct nested datasets:
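The example code did not survive extraction; something along these lines (paths made up, written against today's `ds.dataset()` API) illustrates the kind of nesting meant:

```python
import pyarrow.dataset as ds

# Child datasets discovered from separate locations ...
child_a = ds.dataset("data/2019/", format="parquet")
child_b = ds.dataset("data/2020/", format="parquet")

# ... combined into a single (union) dataset by passing them as a list.
nested = ds.dataset([child_a, child_b])
```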
In the future we might want to introduce a new Dataset class which wraps the functionality of both the dataset factory and the materialized dataset, enabling optimizations over rediscovery of already materialized datasets.