fix: remove unnecessary column projection for count aggregate#5950
fix: remove unnecessary column projection for count aggregate#5950jackye1995 merged 8 commits intolance-format:mainfrom
Conversation
Code ReviewThis PR simplifies P1: Performance Regression - Full Projection in COUNT(*)The new implementation now reads all columns instead of just metadata, which is a significant performance regression for large tables: Old plan (optimized): New plan (unoptimized): The old Recommendation: Before calling scanner.project(&Vec::<String>::default())?;
scanner.with_row_id();Or the aggregate planner should be taught to optimize P1: Breaking Change - Public API Removal
Recommendation: Either:
Test coverage is good with new tests covering various scenarios (filters, vector search, FTS). |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
| with pytest.raises( | ||
| ValueError, match="should not be called on a plan selecting columns" | ||
| ): | ||
| ds.scanner(filter="a < 50", columns=["a"], with_row_id=True).count_rows() |
There was a problem hiding this comment.
I think it makes sense to actually return a value instead of failing this case. With the latest way of optimizing, it will always just do a metadata projection and avoid data scan.
There was a problem hiding this comment.
It is doing a data scan isn't it? a is not indexed and so there is no way to satisfy the count request without actually scanning the column.
I don't entirely agree that this shouldn't be an error but I also don't disagree enough to complain. I think the only valid concern I could have is that a user doing something like... ds.scanner(columns=["a"]).count_rows() might think this is the same as SELECT COUNT(a) FROM ... (i.e. that it returns the count of non-null rows) but that's a pretty weak argument.
So...feel free to ignore this comment 😛
There was a problem hiding this comment.
oh okay, I somehow ignored the filter part when trying to reason why it was asserting failure...
ds.scanner(columns=["a"]).count_rows() might think this is the same as SELECT COUNT(a) FROM ...
ohh I see the reasoning now, thanks for explaining. I think it is still clear, that ds.scanner(columns=["a"]).count_rows() is not the same as SELECT COUNT(a) because it should be equivalent to something like ds.scanner(columns=["a"], filter="a IS NOT NULL").count_rows().
Addressed in a subsequent commit
The function is not really used anywhere publicly. |
| with pytest.raises( | ||
| ValueError, match="should not be called on a plan selecting columns" | ||
| ): | ||
| ds.scanner(filter="a < 50", columns=["a"], with_row_id=True).count_rows() |
There was a problem hiding this comment.
It is doing a data scan isn't it? a is not indexed and so there is no way to satisfy the count request without actually scanning the column.
I don't entirely agree that this shouldn't be an error but I also don't disagree enough to complain. I think the only valid concern I could have is that a user doing something like... ds.scanner(columns=["a"]).count_rows() might think this is the same as SELECT COUNT(a) FROM ... (i.e. that it returns the count of non-null rows) but that's a pretty weak argument.
So...feel free to ignore this comment 😛
| pub aggregates: Vec<Expr>, | ||
| /// Column names required by this aggregate (computed at construction). | ||
| /// For COUNT(*), this is empty. For SUM(x), GROUP BY y, this contains [x, y]. | ||
| pub required_columns: Vec<String>, |
There was a problem hiding this comment.
Super minor nit: I think either Aggregate should not have pub fields (you could add an accessor for group_by and aggregates if you wanted) or required_columns should just be computed at plan time and not stored as part of this struct.
The current structure (public fields) makes it seem like Aggregate is a "data class" and users are expected to do something like...
let agg = Aggregate { group_by: ..., aggregates: ..., required_columns: ... }
But really there is no easy way to create this other than Aggregate::new and required_columns is an implementation detail.
| .boxed() | ||
| } | ||
| let mut scanner = self.clone(); | ||
| scanner.aggregate(AggregateExpr::builder().count_star().build())?; |
| if self.limit.is_some() || self.offset.is_some() { | ||
| log::warn!( | ||
| "count_rows called with limit or offset which could have surprising results" | ||
| ); | ||
| } |
There was a problem hiding this comment.
You can probably get rid of this warning at this point. The entire operation will fail so no need to warn about it.
089a829 to
37be886
Compare
Make sure we always only do a metadata projection to avoid scanning data when doing a count.
Also:
create_count_plansincecount_rowsis now just a case of aggregate, no longer need a dedicated query plan.aggregate_required_columnsby storing required columns directly at aggregate construction time.