Skip to content

Conversation

@xhochy
Copy link
Member

@xhochy xhochy commented Sep 25, 2018

No description provided.

@xhochy
Copy link
Member Author

xhochy commented Sep 25, 2018

Just as a heads-up. This is missing some more granular tests and the exact filtering converts to Pandas and back again. Changing that to work on pure Arrow tables will be a lot more work.

I will separate out some things into smaller pull requests.

@xhochy
Copy link
Member Author

xhochy commented Sep 25, 2018

cc @rgruener I think you were one of the interested people in this topic.

Copy link

@rgruener rgruener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good, thanks for this!

if isinstance(filters[0][0], six.string_types):
# We have encountered the situation where we have one nesting level too few:
# We have [(,,), ..] instead of [[(,,), ..]]
filters = [filters]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This automatic fix isn't my favorite though I see that it is necessary to not have a braking change in the api for ParquetDataset (with the filters argument). Perhaps though it would be better to throw an error here and have this fix in that specific case instead of allowing a wrong nesting level in all cases.

inner_indexer = (ser < value) & inner_indexer
elif op == ">":
inner_indexer = (ser > value) & inner_indexer
elif op == "in":

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the operator "not in"?

return min_value < val
elif op == ">":
return max_value > val
elif op == "in":

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about "not in"

self.validate_schemas()

if filters:
if filters is not None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filters can now be either List[Tuple] or List[List[Tuple]] and can either filter on partitions or row groups depending on if the dataset is partitioned. Either way the current state should be mentioned in the docstring (unless that will be worked out before this PR is merged)

@emkornfield
Copy link
Contributor

@xhochy is this still WIP? Anything I can help with?

@xhochy
Copy link
Member Author

xhochy commented Mar 3, 2019

@emkornfield I'm not actively working on this at the moment. Feel free to pick this up or just to integrate parts of the PR into arrow.

@wesm
Copy link
Member

wesm commented Jun 14, 2019

Closing this for now. I'll refer to this PR in the course of implementing Scanner logic in the Datasets project

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants