Conversation

@kszucs (Member) commented Feb 28, 2020

Provides a more intuitive way to construct a nested dataset:

# instead of using the confusing factory function
dataset([
    factory("s3://old-taxi-data", format="parquet"),
    factory("local/path/to/new/data", format="csv")
])

# let the user construct a new dataset directly from dataset objects
dataset([
    dataset("s3://old-taxi-data", format="parquet"),
    dataset("local/path/to/new/data", format="csv")
])

In the future we might want to introduce a new Dataset class which wraps the functionality of both the dataset factory and the materialized dataset, enabling optimizations such as avoiding rediscovery of already materialized datasets.
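
A minimal sketch of what such a wrapper could look like (the LazyDataset name and the exact shape of the class are illustrative assumptions, not part of the actual pyarrow API; factory.finish() is the existing materialization call):

class LazyDataset:
    """Hypothetical wrapper holding both the factory and the dataset."""

    def __init__(self, factory):
        self._factory = factory   # kept around for later re-discovery
        self._dataset = None      # materialized lazily

    @property
    def dataset(self):
        # run the (potentially expensive) discovery only once
        if self._dataset is None:
            self._dataset = self._factory.finish()
        return self._dataset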

@kszucs kszucs changed the title [Python] Hold a reference to the dataset factory for later reuse ARROW-7965: [Python] Hold a reference to the dataset factory for later reuse Feb 28, 2020
@apache apache deleted a comment from github-actions bot Feb 28, 2020
@jorisvandenbossche (Member)

That indeed makes the usage a bit easier. But I am wondering: how expensive is "finishing" the factory? As it stands, you are finishing all sub-datasets, just to reuse the non-finished factory later on.

@kszucs (Member, Author) commented Feb 28, 2020

Yeah, that's a downside until we wrap both the DatasetFactory and Dataset with a single class which does the discovery lazily.

@nealrichardson (Member)

I agree with the goal here, but I wonder if the solution should perhaps be in C++? That way we wouldn't have to reimplement this in R too.

@kszucs (Member, Author) commented Mar 2, 2020

> I agree with the goal here, but I wonder if the solution should perhaps be in C++? That way we wouldn't have to reimplement this in R too.

cc @bkietz

@jorisvandenbossche (Member)

If we push this into C++, it will have the same problem of paying the dataset discovery cost twice.

@kszucs (Member, Author) commented Mar 2, 2020

@jorisvandenbossche in order to reuse the inspected schemas, AFAICS we need to "cache" them in the C++ instances and return them from InspectSchemas() unless an explicit re-discovery is requested.
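
In Python terms, that caching could look roughly like this (a sketch only; inspect_schemas mirrors the C++ InspectSchemas() and is not an actual pyarrow method, and the force flag standing in for "explicit re-discovery" is a hypothetical parameter):

class SchemaCachingFactory:
    """Hypothetical wrapper caching the result of schema inspection."""

    def __init__(self, factory):
        self._factory = factory
        self._schemas = None   # cached inspected schemas

    def inspect_schemas(self, force=False):
        # reuse the cached schemas unless explicit re-discovery is requested
        if self._schemas is None or force:
            self._schemas = self._factory.inspect_schemas()
        return self._schemas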

@jorisvandenbossche (Member)

Trying to put what I said yesterday in the meeting into somewhat more structured words:

Right now, we need to keep a reference to the dataset factories to be able to reuse those factories, since UnionDatasetFactory is written to expect a vector of factories.

But instead of fixing this on the dataset side by holding this reference, can we change the implementation of UnionDatasetFactory to actually accept a vector of datasets instead of a vector of factories?

Looking at the UnionDatasetFactory implementation, it doesn't look like it needs the schemas of all files/fragments of its sub-datasets. Rather, it only needs the finished schema of each sub-dataset (and then it checks whether those are compatible / can be unified). But assuming that those sub-datasets were also created with a factory, the schema returned from child_factory->Inspect() will be the same as the schema of the finished dataset from that child_factory.

@jorisvandenbossche (Member)

OK, I wanted to put this in some pseudo Python code to explain it better, which led me to see that my suggestion is indeed not possible right now ;)

The current logic:

def UnionDatasetFactory(factories):
    # inspect
    schemas = []
    for fact in factories:
        schemas.append(fact.inspect())
    schema = UnifySchemas(schemas)

    # finish
    datasets = []
    for fact in factories:
        datasets.append(fact.finish(schema))
    return UnionDataset(datasets, schema)

The issue that currently makes it impossible to pass a vector of datasets is that the sub-datasets need to be finished with a potentially different (unified) schema.
So my question is: would it be possible to create a new "view" on an existing dataset with a different (but compatible!) schema? In the end, you could create a new Dataset passing all the building blocks of the original dataset (the format, filesystem, forest, file_partitions) but specifying a different schema?
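
Continuing the pseudocode above, the dataset-based variant might then look like this (with_schema is a hypothetical stand-in for that "view with a different but compatible schema" operation):

def UnionDatasetFromDatasets(datasets):
    # unify the already-finished schemas of the sub-datasets
    schema = UnifySchemas([ds.schema for ds in datasets])

    # create a "view" of each sub-dataset with the unified schema,
    # reusing its building blocks (format, filesystem, forest, ...)
    views = [ds.with_schema(schema) for ds in datasets]
    return UnionDataset(views, schema)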

@jorisvandenbossche (Member)

@bkietz @fsaintjacques with your deeper knowledge of the C++ dataset code, could you confirm whether my idea above is possible: having a UnionDataset constructor take a vector of Dataset objects instead of a vector of DatasetFactories?

@bkietz (Member) commented Mar 19, 2020

@jorisvandenbossche this is already supported in C++, but currently the schemas must be identical rather than merely compatible. Creating a view of a dataset with a differing schema would probably be straightforward; I created https://issues.apache.org/jira/browse/ARROW-8164 to track this feature.
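
From Python, the feature tracked in ARROW-8164 might eventually be used along these lines (a sketch; the replace_schema method name and its exact semantics are assumptions about what the tracked feature could look like, not an API confirmed in this thread):

import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset("local/path/to/data", format="parquet")

# a compatible schema, e.g. the unified schema across several datasets
unified = pa.schema([("x", pa.int64()), ("y", pa.string())])

# create a view over the same files with the compatible schema
view = dataset.replace_schema(unified)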

@jorisvandenbossche (Member)

Thanks for opening the issue!

@wesm (Member) commented Apr 3, 2020

What's the status of this patch? It's still in the 0.17.0 backlog

@jorisvandenbossche (Member)

This is blocked by #6721 (which is blocked by a strange R failure, I think).

I added it to the 0.17 milestone, as it would be nice to get this in, as I noted in the JIRA:

> This depends on ARROW-8164, but if that gets merged quickly, it would be nice to tackle this issue for 0.17 since that would enable us to remove factory() from the high-level user API (which wasn't there yet in 0.16, so this would avoid it ever being in a released version)

But it is also certainly not a blocker, since the datasets API is still experimental anyway.

@nealrichardson (Member)

#6721 has been merged, so this is unblocked

@kszucs kszucs force-pushed the dataset-factory-reference branch from 95a2508 to f761dbd Compare April 9, 2020 18:32
@kszucs kszucs changed the title ARROW-7965: [Python] Hold a reference to the dataset factory for later reuse ARROW-7965: [Python] Refine higher level dataset API Apr 10, 2020
@jorisvandenbossche (Member) left a comment

Thanks a lot for this PR! The diff is a bit hard to interpret, but generally looks good to me.

One thing I am not really sure about is the string URI value to specify the filesystem. What's the use case for that? If you have a URI, you can just pass it as the source? This seems a needless complication to me, and not something we support elsewhere (or do we?)
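
For context, the two spellings being compared would look something like this (a sketch; whether the filesystem= string form should be supported is exactly what is being questioned here):

import pyarrow.dataset as ds

# passing the URI directly as the source
data = ds.dataset("s3://bucket/path/to/data", format="parquet")

# the questioned alternative: a path plus the filesystem as a URI string
data = ds.dataset("path/to/data", filesystem="s3://bucket",
                  format="parquet")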

@kszucs (Member, Author) commented Apr 10, 2020

@github-actions crossbow submit conda-win-vs2015-py36

@github-actions

Revision: dacded6

Submitted crossbow builds: ursa-labs/crossbow @ actions-90

Task status: conda-win-vs2015-py36 (Azure)

@kszucs kszucs force-pushed the dataset-factory-reference branch from 3f21c9a to 5692edc Compare April 14, 2020 02:16
@kszucs (Member, Author) commented Apr 14, 2020

@github-actions crossbow submit conda-win-vs2015-py36

@github-actions

Revision: 5692edc

Submitted crossbow builds: ursa-labs/crossbow @ actions-111

Task status: conda-win-vs2015-py36 (Azure)

@jorisvandenbossche (Member)

Failures are due to the "ignore_prefix" -> "selector_ignore_prefix" rename on the C++ side.

Since all Ursabot Python builds are passing, it also seems they are not running any of the parquet or dataset-related tests...

@kszucs (Member, Author) commented Apr 14, 2020

@github-actions crossbow submit conda-win-vs2015-py36

@github-actions

Revision: acbe2cb

Submitted crossbow builds: ursa-labs/crossbow @ actions-113

Task status: conda-win-vs2015-py36 (Azure)

@kszucs (Member, Author) commented Apr 14, 2020

> Failures are due to the "ignore_prefix" -> "selector_ignore_prefix" rename on the C++ side.

Updated.

> Since all Ursabot Python builds are passing, it also seems they are not running any of the parquet or dataset-related tests...

Yes, but we're not actively maintaining the ursabot builds now.

@pitrou (Member) left a comment

I only skimmed through this, a few comments.


Inline review comment (Member) on:

def factory(path_or_paths, filesystem=None, partitioning=None,
            format=None):
    ...

def _ensure_filesystem(fs_or_uri):
    ...

Can be for a follow-up JIRA, but we might want to move this helper to pyarrow.fs, and also use it in other places where we accept filesystems? (although I actually don't know if there are already many such places...)
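
The helper presumably does something along these lines (a sketch inferred only from the signature shown above; the actual implementation in the PR may differ):

from pyarrow.fs import FileSystem

def _ensure_filesystem(fs_or_uri):
    # accept either a FileSystem instance or a URI string
    if isinstance(fs_or_uri, str):
        # FileSystem.from_uri() returns a (filesystem, path) tuple
        filesystem, _path = FileSystem.from_uri(fs_or_uri)
        return filesystem
    elif isinstance(fs_or_uri, FileSystem):
        return fs_or_uri
    raise TypeError(
        "expected a FileSystem or URI string, got {!r}".format(fs_or_uri))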

@jorisvandenbossche (Member)

@kszucs can you also add this patch to this branch:

--- a/python/pyarrow/tests/test_parquet.py
+++ b/python/pyarrow/tests/test_parquet.py
@@ -2490,13 +2490,7 @@ def _assert_dataset_paths(dataset, paths, use_legacy_dataset):
         assert set(map(str, paths)) == {x.path for x in dataset.pieces}
     else:
         paths = [str(path.as_posix()) for path in paths]
-        if hasattr(dataset._dataset, 'files'):
-            assert set(paths) == set(dataset._dataset.files)
-        else:
-            # UnionDataset
-            # TODO(temp hack) remove this branch once ARROW-7965 is in (which
-            # will change this to a FileSystemDataset)
-            assert dataset.read().num_rows == 50
+        assert set(paths) == set(dataset._dataset.files)

(that's a clean-up for a hack I added yesterday because a list of paths was giving a UnionDataset, which this PR is fixing)

@pitrou (Member) left a comment

Thank you. A few more comments.

@kszucs (Member, Author) commented Apr 14, 2020

Thanks Joris, Ben, Antoine! Merging.
