Remove predicate pushdown from read_parquet #305
Conversation
Yes, memory usage is very low (around 10% of the VM), so it doesn't really matter.
Then in general I'm good with this. CC'ing @rjzamora in case he has thoughts. My sense is that it would be good if this worked faster in Arrow, and I'd be excited to add this functionality back when it does, but it probably doesn't make sense to keep it in until Arrow's performance improves.
I propose waiting until tomorrow to give @rjzamora a chance to weigh in, but not much longer than that (this is easy to revert if anyone has strong thoughts).
rjzamora left a comment
> I propose waiting until tomorrow to give @rjzamora a chance to weigh in
Thanks for the ping, and sorry I have been so distracted for the past few weeks.
@phofl - Can you remind me if dask/dask#10478 helped with this at all? I agree that we should (at least temporarily) avoid row-wise filtering in Arrow, since there is an obvious performance penalty. However, it would be nice if we could still "absorb" the filter expression into the read-parquet expression.
I'm not sure I understand this. You're suggesting that we absorb it into the expression and then use pandas to perform the actual filtering?
Sort of. There are a few different reasons we push down filters into the parquet expression, so I'll try to describe them separately:

1. Simplicity: the filter collapses into the read-parquet expression, so the graph holds one expression rather than two.
2. File/row-group pruning: when the filter acts on a sorted (or hive-partitioned) column, parquet statistics let us skip entire files or row groups.
3. Memory: filtering row-wise during the read avoids materializing the unfiltered data in pandas.
As far as I understand, (3) is the primary problem here. There is probably some overhead from attempting (2) when the filters act on unsorted columns, but row-wise filtering is the main bottleneck we are trying to remove.

Even with dask/dask#10478, I suspect the benchmarks will still be faster with this PR, because we are probably filtering on unsorted columns, and so any time we spend parsing the parquet statistics is unnecessary.

In the long term, I'd really like us to find a clean way to decide (or configure) whether we should gather and parse statistics. However, it's also okay to simply turn this behavior off until we can come up with that system. With that said, I'd feel a bit better if we continued testing predicate pushdown and just avoided it by default.
Item 1 doesn't seem compelling to me. Things don't seem simpler, just more collapsed (one complex term rather than two more orthogonal ones).

Item 2 does seem compelling, especially when row groups can be culled. This is a genuine tension between processing less data and processing more data, but quickly. My hope is that over time dask/arrow get more efficient and this tension goes away.

Item 3 seems less compelling, especially given that we know this memory use is temporary and bounded.

Feel free to disagree with any of the above; I could be wrong on any of it. My sense is that we decide the tension based on what the benchmarks tell us: keep things simple for now, wait until we run into serious use cases where people have sorted row groups, and then revisit this. Thoughts?
Agree that this is not a sufficient reason as long as other optimizations are working correctly. I honestly don't remember the specific case where not pushing down the filter was blocking other optimizations, but I'm glad that it's not a problem anymore.
I don't think dask or arrow is particularly "bad" at this. The problem with the benchmarks in question is just that we are reading from remote storage, and so the time needed to parse metadata is non-trivial (not really "slow"). I don't think we should notice much of a difference between this PR and dask/dask#10478 for local storage. This is why I'd like for us to always turn off row-wise filtering within dask, and simply avoid file/row-group filtering by default (but still make it optional). I'm sure people working with local storage or hive-partitioned data would typically want to opt in.
Also agree with this.
Ah, interesting. Well, we could also selectively run this optimization based on the path. My inclination is to switch it off for now, and re-enable it selectively (e.g. for local storage) once we have compelling evidence. Maybe we have that evidence today: @rjzamora, do you have workloads today where you see measurable performance improvements from partition pruning? Are there workloads where you think this will provide a good speedup?
I also want to "turn off" this optimization for now. There are some real cases where predicate pushdown is beneficial, but my sense is that these cases are rare. The users I'm thinking of tend to specify the filters argument manually, so turning off this optimization shouldn't affect them anyway. With that said, I am certainly bummed that I will no longer be able to demonstrate this optimization for naive code that happens to target hive-partitioned and/or timeseries data. However, I definitely don't think that's a good enough reason to keep the optimization in place. It would be nice if I could still use a utility to extract the filter expression that could be applied (if desired), but this is tricky to do without actually pushing the filters down.
I agree; I think we can do better here, but it will require some effort. I'll merge this for now; reverting is very easy if we discover something later on.


Removing this temporarily speeds up our workflows by a lot. Here is a TPC-H run: https://github.com/coiled/benchmarks/actions/runs/6249699594
Base is current main; sample is this PR.