Skip to content

Conversation

@lidavidm
Copy link
Member

The API here is a little different than before, but this allows you to give a partition schema with dictionary fields, without having to fill in the dictionaries themselves. I opted to do this as part of the factory to avoid having to let partitionings be 'half constructed', and because the factory has the necessary logic already. As a bonus, this also lets you now check the inferred and actual types against each other.

I also fixed a bug when using non-int32 dictionary indices: a cast was missing.

@github-actions
Copy link

@bkietz bkietz self-requested a review March 11, 2021 17:50
Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! Some minor comments

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Only looked at the cython/python part, which looks good.

We should only still update the partitioning() function/docstring in dataset.py, I think.
Now that the discover method also takes a schema, it's not fully clear what the difference is between doing DirectoryPartitioning(schema) and DirectoryPartitioning.discover(schema) The first requires passing the dictionaries as well, I suppose? (in case of having dictionary fields) But then that should be documented in the docstring, and we might want to update the partitioning function to use the discover() method in case a schema is provided but no dictionaries?

lidavidm and others added 2 commits March 12, 2021 14:54
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
@lidavidm
Copy link
Member Author

I tried updating ds.partitioning() to give a PartitioningFactory when a schema (but not dictionaries) is given, but that breaks things (e.g. writers expecting ds.partitioning() to give a Partitioning still)…how about a separate flag?

@jorisvandenbossche
Copy link
Member

What you currently pushed here breaks other tests?

Maybe we could also check if the schema has any dictionary type?

@lidavidm
Copy link
Member Author

What you currently pushed here breaks other tests?

Yup, I realized that right after I pushed.

Maybe we could also check if the schema has any dictionary type?

We could; we'd have to do that recursively, right? In case of a nested dictionary. (…though is that handled anyways?) It also doesn't help the fact that we need a Partitioning, not a PartitioningFactory, when we want to write data, so the auto-detection might be a little too magical…

@jorisvandenbossche
Copy link
Member

We could; we'd have to do that recursively, right? In case of a nested dictionary. (…though is that handled anyways?)

I don't think we can parse nested types from the file paths?
In that case, we wouldn't need to check it recursively.

From a user point of view, having to specify dictionaries="infer" feels superfluous, as it is clear that's needed (but to be clear, this PR is already a nice improvement compared to the current situation! ;))

It also doesn't help the fact that we need a Partitioning, not a PartitioningFactory, when we want to write data, so the auto-detection might be a little too magical…

Hmm, yes, that complicates things. When writing, you don't need to specify the dictionaries. But indeed you still need the actual Partitioning and not the factory. So returning the factory if the schemas has a dictionary type and no dictionaries are passed, would then fail when writing ..

The current API mixing both for reading/writing and the full object / the factory makes it a bit complex ..

@lidavidm
Copy link
Member Author

We could also split out read_partitioning and write_partitioning functions, perhaps, or add a similar flag, and accept the API break.

@jorisvandenbossche
Copy link
Member

I think the current PR (with the dictionary="infer") is already a clear, strict improvement over the situation in master, so that's fine for me.
(having to split in a reading/writing partitioning object is also not very nice)

@bkietz
Copy link
Member

bkietz commented Mar 15, 2021

+1, merging

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants