
fix: remove unnecessary column projection for count aggregate #5950

Merged
jackye1995 merged 8 commits into lance-format:main from jackye1995:deprecate-count-rows
Feb 14, 2026

@jackye1995
Contributor

@jackye1995 jackye1995 commented Feb 13, 2026

Make sure a count always uses a metadata-only projection, so that no column data is scanned.

Also:

  1. Remove create_count_plan: count_rows is now just a special case of aggregate, so it no longer needs a dedicated query plan.
  2. Remove duplicated calls to aggregate_required_columns by storing the required columns directly at aggregate construction time.

@github-actions
Contributor

Code Review

This PR simplifies count_rows by using the new scanner aggregate functionality from #5911. The refactor is clean and reduces code duplication.

P1: Performance Regression - Full Projection in COUNT(*)

The new implementation now reads all columns instead of just metadata, which is a significant performance regression for large tables:

Old plan (optimized):

AggregateExec: mode=Single, gby=[], aggr=[count_rows]
  LanceRead: uri=..., projection=[], row_id=true  # ← empty projection

New plan (unoptimized):

AggregateExec: mode=Single, gby=[], aggr=[count(Int32(1))]
  LanceRead: uri=..., projection=[x, y, category]  # ← ALL columns loaded

The old create_count_plan specifically enforced is_metadata_only() projection and required with_row_id() to ensure minimal data loading. The new approach loses this optimization and will load unnecessary column data for COUNT(*) queries.

Recommendation: Before calling scanner.aggregate(...) in count_rows(), ensure the scanner is configured with an empty projection + row_id, similar to the original implementation's requirements. This could be done with:

scanner.project(&Vec::<String>::default())?;
scanner.with_row_id();

Or the aggregate planner should be taught to optimize COUNT(*) to avoid reading unnecessary columns.
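
The second option, having the planner derive the projection from the aggregate itself, could look roughly like the sketch below. It is a simplified model written for illustration, not Lance's planner: `AggExpr`, `Projection`, `required_columns`, and `projection_for` are hypothetical names, and filters are left out entirely.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified aggregate expressions (not Lance or DataFusion types).
#[derive(Debug, Clone, PartialEq)]
pub enum AggExpr {
    /// COUNT(*): needs no column data.
    CountStar,
    /// SUM(column): needs the named column.
    Sum(String),
}

/// What the scan has to read in this toy model.
#[derive(Debug, Clone, PartialEq)]
pub enum Projection {
    /// No data columns at all; only row-level metadata is needed.
    MetadataOnly,
    /// Only the listed data columns, in sorted order.
    Columns(Vec<String>),
}

/// Collect the columns referenced by the group-by keys and the aggregates.
pub fn required_columns(group_by: &[String], aggregates: &[AggExpr]) -> Vec<String> {
    let mut cols: BTreeSet<String> = group_by.iter().cloned().collect();
    for agg in aggregates {
        if let AggExpr::Sum(col) = agg {
            cols.insert(col.clone());
        }
    }
    cols.into_iter().collect()
}

/// Pick the narrowest projection that still satisfies the aggregate,
/// regardless of which columns were originally selected on the scanner.
pub fn projection_for(group_by: &[String], aggregates: &[AggExpr]) -> Projection {
    let cols = required_columns(group_by, aggregates);
    if cols.is_empty() {
        Projection::MetadataOnly
    } else {
        Projection::Columns(cols)
    }
}
```

In this model, `projection_for(&[], &[AggExpr::CountStar])` yields `Projection::MetadataOnly`, which mirrors the empty projection in the old plan. A real planner would also have to add any columns referenced by a filter.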

P1: Breaking Change - Public API Removal

create_count_plan is a public function (pub fn) that's being removed without deprecation. If any external code calls this method, it will break at compile time.

Recommendation: Either:

  1. Add #[deprecated(...)] attribute pointing to the new approach, or
  2. Document this as a breaking change in release notes if the function was always considered internal

Test coverage is good with new tests covering various scenarios (filters, vector search, FTS).

@jackye1995 jackye1995 marked this pull request as draft February 13, 2026 05:46
@codecov

codecov Bot commented Feb 13, 2026

Codecov Report

❌ Patch coverage is 85.00000% with 6 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
rust/lance/src/dataset/scanner.rs | 75.00% | 0 Missing and 6 partials ⚠️


@jackye1995 jackye1995 changed the title from "refactor: use scanner aggregate for count_rows" to "fix: remove column projection for count aggregate" Feb 13, 2026
@github-actions github-actions Bot added the bug Something isn't working label Feb 13, 2026
with pytest.raises(
    ValueError, match="should not be called on a plan selecting columns"
):
    ds.scanner(filter="a < 50", columns=["a"], with_row_id=True).count_rows()
Contributor Author

I think it makes sense to actually return a value instead of failing this case. With the latest way of optimizing, it will always just do a metadata projection and avoid data scan.

Member

It is doing a data scan, isn't it? Column a is not indexed, so there is no way to satisfy the count request without actually scanning the column.

I don't entirely agree that this shouldn't be an error but I also don't disagree enough to complain. I think the only valid concern I could have is that a user doing something like... ds.scanner(columns=["a"]).count_rows() might think this is the same as SELECT COUNT(a) FROM ... (i.e. that it returns the count of non-null rows) but that's a pretty weak argument.

So...feel free to ignore this comment 😛

Contributor Author

@jackye1995 jackye1995 Feb 13, 2026

Oh okay, I somehow ignored the filter part when trying to reason about why it was asserting a failure...

> ds.scanner(columns=["a"]).count_rows() might think this is the same as SELECT COUNT(a) FROM ...

Ohh, I see the reasoning now, thanks for explaining. I think it is still clear that ds.scanner(columns=["a"]).count_rows() is not the same as SELECT COUNT(a), because SELECT COUNT(a) would instead be equivalent to something like ds.scanner(columns=["a"], filter="a IS NOT NULL").count_rows().
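
To make the distinction in this thread concrete, here is a minimal model in plain Rust. It uses no Lance APIs: a slice of `Option<i32>` stands in for a nullable column `a`, and the two function names are made up for this illustration.

```rust
/// Models COUNT(*): every row counts, whether or not `a` is null.
/// This is the behaviour discussed for count_rows() without a filter.
pub fn count_star(column: &[Option<i32>]) -> usize {
    column.len()
}

/// Models COUNT(a): only rows where `a` is non-null count.
/// This matches counting rows after a filter such as "a IS NOT NULL".
pub fn count_non_null(column: &[Option<i32>]) -> usize {
    column.iter().filter(|value| value.is_some()).count()
}
```

For a column holding `[Some(1), None, Some(3), None]`, `count_star` returns 4 while `count_non_null` returns 2, so selecting column `a` alone does not turn a row count into a non-null count.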

@jackye1995 jackye1995 changed the title from "fix: remove column projection for count aggregate" to "fix: remove unnecessary column projection for count aggregate" Feb 13, 2026
@jackye1995
Contributor Author

P1: Performance Regression - Full Projection in COUNT(*)

Addressed in a subsequent commit

P1: Breaking Change - Public API Removal

The function is not really used anywhere publicly.

@jackye1995 jackye1995 marked this pull request as ready for review February 13, 2026 07:51

Comment thread rust/lance-datafusion/src/aggregate.rs Outdated
    pub aggregates: Vec<Expr>,
    /// Column names required by this aggregate (computed at construction).
    /// For COUNT(*), this is empty. For SUM(x), GROUP BY y, this contains [x, y].
    pub required_columns: Vec<String>,
Member

Super minor nit: I think either Aggregate should not have pub fields (you could add an accessor for group_by and aggregates if you wanted) or required_columns should just be computed at plan time and not stored as part of this struct.

The current structure (public fields) makes it seem like Aggregate is a "data class" and users are expected to do something like...

let agg = Aggregate { group_by: ..., aggregates: ..., required_columns: ... }

But really there is no easy way to create this other than Aggregate::new and required_columns is an implementation detail.
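
One way to read the first suggestion is sketched below: keep the fields private, compute `required_columns` once in the constructor, and expose read-only accessors. This is an illustration under assumptions, not the code in the PR. `AggExpr` is a hypothetical stand-in for the real expression type, and the constructor signature is invented for the example.

```rust
/// Hypothetical stand-in for an aggregate expression (not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
pub enum AggExpr {
    CountStar,
    Sum(String),
}

/// Aggregate specification whose fields cannot be set directly by callers.
#[derive(Debug, Clone)]
pub struct Aggregate {
    group_by: Vec<String>,
    aggregates: Vec<AggExpr>,
    // Derived once in `new`; callers can read it but never supply it.
    required_columns: Vec<String>,
}

/// Append `name` unless it is already present, preserving first-seen order.
fn push_unique(columns: &mut Vec<String>, name: &str) {
    if !columns.iter().any(|existing| existing == name) {
        columns.push(name.to_string());
    }
}

impl Aggregate {
    /// The only way to build an `Aggregate`; derives the required columns here.
    pub fn new(group_by: Vec<String>, aggregates: Vec<AggExpr>) -> Self {
        let mut required_columns = Vec::new();
        for agg in &aggregates {
            if let AggExpr::Sum(column) = agg {
                push_unique(&mut required_columns, column);
            }
        }
        for key in &group_by {
            push_unique(&mut required_columns, key);
        }
        Self { group_by, aggregates, required_columns }
    }

    pub fn group_by(&self) -> &[String] {
        &self.group_by
    }

    pub fn aggregates(&self) -> &[AggExpr] {
        &self.aggregates
    }

    pub fn required_columns(&self) -> &[String] {
        &self.required_columns
    }
}
```

With this shape, a struct literal such as the one above no longer compiles outside the defining module, so `required_columns` cannot drift out of sync with `group_by` and `aggregates`.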

.boxed()
}
let mut scanner = self.clone();
scanner.aggregate(AggregateExpr::builder().count_star().build())?;
Member

Nice cleanup!

Comment thread rust/lance/src/dataset/scanner.rs Outdated
Comment on lines 1925 to 1929
if self.limit.is_some() || self.offset.is_some() {
    log::warn!(
        "count_rows called with limit or offset which could have surprising results"
    );
}
Member

You can probably get rid of this warning at this point. The entire operation will fail so no need to warn about it.

@github-actions github-actions Bot added the java label Feb 13, 2026
@jackye1995 jackye1995 merged commit f829eef into lance-format:main Feb 14, 2026
28 checks passed

Labels

bug Something isn't working java python

2 participants