
Conversation

@phofl (Collaborator) commented Sep 20, 2023

Removing this temporarily speeds up our workflows by a lot; here is a TPC-H run: https://github.com/coiled/benchmarks/actions/runs/6249699594

Base is current main; sample is this PR.

@mrocklin (Member)

It looks like performance increases a ton, but memory usage increases also?

[Screenshots: benchmark charts comparing runtime and memory usage between the base and this PR]

This probably makes sense to do if memory usage on these is pretty low anyway. My guess is that this memory use is transitory, as larger pandas dataframes come into and out of memory.

@phofl (Collaborator, Author) commented Sep 20, 2023

Yes, memory usage is very low (around 10% of the VM), so it doesn't really matter.

@mrocklin (Member)

Then in general I'm good with this. CC'ing @rjzamora in case he has thoughts. My sense is that it would be good if this worked faster in Arrow, and I'm excited to add this functionality back when it does, but it probably doesn't make sense to keep it in until Arrow performance improves.

@mrocklin (Member)

I propose waiting until tomorrow to give @rjzamora a chance to weigh in, but not much longer than that (this is easy to revert if anyone has strong thoughts).

@rjzamora (Member) left a comment

I propose waiting until tomorrow to give @rjzamora a chance to weigh in

Thanks for the ping, and sorry I have been so distracted for the past few weeks.

@phofl - Can you remind me if dask/dask#10478 helped with this at all? I agree that we should (at least temporarily) avoid row-wise filtering in Arrow since there is an obvious performance penalty. However, it would be nice if we could still "absorb" the filter expression into the read-parquet expression.

@mrocklin (Member)

However, it would be nice if we could still "absorb" the filter expression into the read-parquet expression.

I'm not sure I understand this. You're suggesting that we absorb it into the expression, and then use pandas to perform the actual filtering?

@rjzamora (Member)

I'm not sure I understand this. You're suggesting that we absorb it into the expression, and then use pandas to perform the actual filtering?

Sort of. There are a few different reasons we push down filters into the parquet expression, so I'll try to describe them separately:

  1. To simplify the general expression graph by pushing down and combining various filtering operations into a single DNF filter representation within the read-parquet task. This used to be important for proper expr optimization, but probably doesn't matter as much anymore (judging by the fact that tests are passing).
  2. To drop unnecessary files and/or row groups before IO.
  3. To minimize memory pressure by performing row-wise filtering at IO time.

As far as I understand, (3) is the primary problem here. There is probably some overhead from attempting (2) when the filters are acting on unsorted columns, but row-wise filtering is the primary bottleneck we are trying to remove here. In dask/dask#10478, I found that read_parquet filtering was significantly faster when we delayed the row-wise filtering step until just after IO (still using pyarrow, but not at IO time).
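
For context, here is a minimal sketch of the two query shapes being compared, using the public dask.dataframe API; the dataset path and column name below are placeholders:

```python
import dask.dataframe as dd

# Pushdown variant: the predicate is handed to read_parquet as a DNF-style
# `filters` argument, so pyarrow applies it (including row-wise filtering)
# during the read itself.
pushed = dd.read_parquet("s3://bucket/dataset/", filters=[("x", ">", 5)])

# Delayed variant: read everything, then filter each pandas partition after IO.
df = dd.read_parquet("s3://bucket/dataset/")
filtered = df[df["x"] > 5]
```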

I suspect that the benchmarks will still be faster with this PR than with dask/dask#10478, because we are probably filtering on unsorted columns, and so it is unnecessary for us to spend any time parsing the parquet statistics.

In the long term, I'd really like for us to find a clean way to "decide" (or configure) whether or not we should gather and parse statistics. However, it's also okay if there is a desire to simply turn this behavior off until we can come up with that system. With that said, I'd feel a bit better if we were able to continue testing predicate pushdown and were just avoiding it by default.
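
To make reason (2) concrete, here is a rough sketch (not dask's actual implementation) of how parquet row-group statistics can prune row groups for a simple `column > value` predicate before any data is read; it assumes a single local file whose column chunks carry min/max statistics. Gathering exactly this kind of metadata is the step that becomes expensive on remote storage:

```python
import pyarrow.parquet as pq

def prunable_row_groups(path, column, lower_bound):
    """Return indices of row groups that can still satisfy `column > lower_bound`."""
    metadata = pq.ParquetFile(path).metadata
    keep = []
    for rg in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg)
        for i in range(row_group.num_columns):
            chunk = row_group.column(i)
            if chunk.path_in_schema != column:
                continue
            stats = chunk.statistics
            # Without min/max statistics we cannot prune safely; otherwise keep
            # the row group only if its maximum value can still pass the predicate.
            if stats is None or not stats.has_min_max or stats.max > lower_bound:
                keep.append(rg)
            break
    return keep
```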

@mrocklin (Member)

Item 1 doesn't seem compelling to me. Things don't seem simpler, they just seem more collapsed (one complex term, rather than two more orthogonal ones).

Item 2 does seem compelling to me, especially in the case when row groups can be culled. This seems like a genuine tension between processing less data and processing more data, but quickly. My hope is that over time dask/arrow get more efficient and this tension goes away.

Item 3 seems less compelling to me, especially given that we know that this memory use is temporary and bounded.

Feel free to disagree with any of the above; I could be wrong on any of that. My sense is that we should probably resolve the tension based on what the benchmarks tell us, keep things simple for now, wait until we run into serious use cases where people have sorted row groups, and then revisit this.

Thoughts?

@rjzamora (Member)

Item 1 doesn't seem compelling to me. Things don't seem simpler, they just seem more collapsed (one complex term, rather than two more orthogonal ones).

Agree that this is not a sufficient reason as long as other optimizations are working correctly. I honestly don't remember the specific case where not pushing down the filter was blocking other optimizations, but I'm glad that it's not a problem anymore.

Item 2 does seem compelling to me, especially in the case when row groups can be culled. This seems like a genuine tension between processing less data and processing more data, but quickly. My hope is that over time dask/arrow get more efficient and this tension goes away.

I don't think dask or arrow is particularly "bad" at this. The problem with the benchmarks in question is just that we are reading from remote storage, and so the time needed to parse metadata is non-trivial (not really "slow"). I don't think we should notice much of a difference between this PR and dask/dask#10478 for local storage.

This is why I'd like for us to always turn off row-wise filtering within dask, and simply avoid file/row-group filtering by default (but still make it optional). I'm sure people working with local storage or hive-partitioned data would typically want to opt in.
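
As an illustration of what opting in could look like with the existing keyword (the dataset path and partition columns are made up for the example): users who know their data is hive-partitioned can pass `filters` explicitly and still get file/partition pruning without any automatic pushdown:

```python
import dask.dataframe as dd

# Explicit filters on a hive-partitioned dataset: whole year=.../month=...
# directories can be skipped before any data files are read.
df = dd.read_parquet(
    "s3://bucket/events/",  # hypothetical hive-partitioned dataset
    filters=[("year", "==", 2023), ("month", ">=", 6)],
)
```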

Item 3 seems less compelling to me, especially given that we know that this memory use is temporary and bounded.

Also agree with this.

@mrocklin (Member)

The problem with the benchmarks in question is just that we are reading from remote storage, and so the time needed to parse metadata is non-trivial (not really "slow"). I don't think we should notice much of a difference between this PR and dask/dask#10478 for local storage.

Ah, interesting. Well, we can also selectively run this optimization based on the path.

My inclination is to switch this optimization off for now, and then switch it back on selectively for local storage once we find compelling information to do so. Maybe we have that compelling information today. @rjzamora, do you have workloads today where you see measurable performance improvements from partition pruning? Are there workloads where you think this will provide a good speedup?
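
A hypothetical sketch of the path-based gate mentioned above (this is not an existing dask code path): fsspec can distinguish local paths from remote ones, which could decide whether statistics are gathered at all:

```python
from fsspec.utils import get_protocol

def is_local_path(path: str) -> bool:
    # fsspec maps bare paths without a protocol to "file"; "file" and "local"
    # both indicate local storage, where parsing parquet metadata is cheap.
    return get_protocol(path) in ("file", "local")

# Example gate: only attempt statistics-based pruning for local data.
gather_statistics = is_local_path("s3://bucket/dataset/")  # -> False
```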

@rjzamora (Member)

I do also want to “turn off” this optimization for now. There are some real cases where predicate pushdown is beneficial, but my sense is that these cases are rare. The users I’m thinking of tend to specify the filters argument manually, so turning off this optimization shouldn’t affect them anyway.

With that said, I am certainly bummed that I will no longer be able to demonstrate this optimization for naive code that happens to be targeting hive-partitioned and/or timeseries data. However, I definitely don’t think that’s a good enough reason to keep the optimization in place. It would be nice if I could still use a utility to extract the filter expression that could be applied (if desired), but this is tricky to do without actually pushing the filters down.

@phofl (Collaborator, Author) commented Sep 21, 2023

I agree; I think we can get better here, but that will require some effort. I'll merge this for now; reverting is very easy if we discover something later on.

@phofl merged commit bd03286 into dask:main on Sep 21, 2023
@phofl deleted the parquet branch on September 21, 2023 at 10:12