fix: remove unnecessary column projection for count aggregate by jackye1995 · Pull Request #5950 · lance-format/lance

jackye1995 · 2026-02-13T05:28:56Z

Make sure a count always uses only a metadata projection, so that no column data is scanned unnecessarily.

Also:

Remove create_count_plan: count_rows is now just a case of aggregate, so a dedicated query plan is no longer needed.
Remove duplicated calls to aggregate_required_columns by storing the required columns directly at aggregate construction time.

github-actions · 2026-02-13T05:30:06Z

Code Review

This PR simplifies count_rows by using the new scanner aggregate functionality from #5911. The refactor is clean and reduces code duplication.

P1: Performance Regression - Full Projection in COUNT(*)

The new implementation now reads all columns instead of just metadata, which is a significant performance regression for large tables:

Old plan (optimized):

AggregateExec: mode=Single, gby=[], aggr=[count_rows]
  LanceRead: uri=..., projection=[], row_id=true  # ← empty projection

New plan (unoptimized):

AggregateExec: mode=Single, gby=[], aggr=[count(Int32(1))]
  LanceRead: uri=..., projection=[x, y, category]  # ← ALL columns loaded

The old create_count_plan specifically enforced is_metadata_only() projection and required with_row_id() to ensure minimal data loading. The new approach loses this optimization and will load unnecessary column data for COUNT(*) queries.

Recommendation: Before calling scanner.aggregate(...) in count_rows(), ensure the scanner is configured with an empty projection + row_id, similar to the original implementation's requirements. This could be done with:

scanner.project(&Vec::<String>::default())?;
scanner.with_row_id();

Or the aggregate planner should be taught to optimize COUNT(*) to avoid reading unnecessary columns.
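
As a rough illustration of the optimization discussed here (a toy model, not Lance's planner code), the columns a scan has to read can be treated as the union of the columns the aggregate references and the columns the filter references. `scan_columns` is a hypothetical helper; `is_metadata_only` only borrows the name mentioned in the review.

```python
from typing import Iterable, List


def scan_columns(aggregate_columns: Iterable[str],
                 filter_columns: Iterable[str] = ()) -> List[str]:
    """Columns a scan would have to read, de-duplicated in first-seen order.

    Hypothetical helper: a toy model, not Lance's planner.
    """
    return list(dict.fromkeys([*aggregate_columns, *filter_columns]))


def is_metadata_only(columns: List[str]) -> bool:
    """In this model, an empty column list means no data columns are read."""
    return not columns


# COUNT(*) with no filter references no columns at all.
count_star = scan_columns(aggregate_columns=[])
# COUNT(*) with a filter on `a` still has to read `a` to evaluate the filter.
filtered_count = scan_columns(aggregate_columns=[], filter_columns=["a"])
# SUM(x) GROUP BY y needs both x and y.
sum_by = scan_columns(aggregate_columns=["x", "y"])

print(count_star, is_metadata_only(count_star))
print(filtered_count, is_metadata_only(filtered_count))
print(sum_by, is_metadata_only(sum_by))
```

Under this model only the unfiltered COUNT(*) ends up with an empty projection; any column referenced by the aggregate or the filter has to be read.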

P1: Breaking Change - Public API Removal

create_count_plan is a public function (pub fn) that's being removed without deprecation. If any external code calls this method, it will break at compile time.

Recommendation: Either:

Add #[deprecated(...)] attribute pointing to the new approach, or
Document this as a breaking change in release notes if the function was always considered internal

Test coverage is good with new tests covering various scenarios (filters, vector search, FTS).

codecov · 2026-02-13T06:02:30Z

Codecov Report

❌ Patch coverage is 85.00000% with 6 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance/src/dataset/scanner.rs	75.00%	0 Missing and 6 partials ⚠️

jackye1995 · 2026-02-13T07:10:09Z

-    with pytest.raises(
-        ValueError, match="should not be called on a plan selecting columns"
-    ):
-        ds.scanner(filter="a < 50", columns=["a"], with_row_id=True).count_rows()


I think it makes sense to actually return a value instead of failing this case. With the latest way of optimizing, it will always just do a metadata projection and avoid data scan.

It is doing a data scan isn't it? a is not indexed and so there is no way to satisfy the count request without actually scanning the column.

I don't entirely agree that this shouldn't be an error but I also don't disagree enough to complain. I think the only valid concern I could have is that a user doing something like... ds.scanner(columns=["a"]).count_rows() might think this is the same as SELECT COUNT(a) FROM ... (i.e. that it returns the count of non-null rows) but that's a pretty weak argument.

So...feel free to ignore this comment 😛

oh okay, I somehow ignored the filter part when trying to reason why it was asserting failure...

ds.scanner(columns=["a"]).count_rows() might think this is the same as SELECT COUNT(a) FROM ...

ohh I see the reasoning now, thanks for explaining. I think it is still clear that ds.scanner(columns=["a"]).count_rows() is not the same as SELECT COUNT(a), because SELECT COUNT(a) should be equivalent to something like ds.scanner(columns=["a"], filter="a IS NOT NULL").count_rows().
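
The SQL distinction in this thread can be checked with a small standalone sketch. It uses SQLite from the Python standard library as a stand-in for SQL semantics, not Lance; the table `t` and its contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
# Five rows, two of which have a NULL in column `a`.
conn.executemany("INSERT INTO t (a) VALUES (?)",
                 [(1,), (None,), (3,), (None,), (5,)])

# COUNT(*) counts rows, regardless of NULLs.
count_star = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
# COUNT(a) counts only rows where `a` is not NULL.
count_a = conn.execute("SELECT COUNT(a) FROM t").fetchone()[0]
# Counting rows under an explicit IS NOT NULL filter matches COUNT(a).
count_not_null = conn.execute(
    "SELECT COUNT(*) FROM t WHERE a IS NOT NULL"
).fetchone()[0]

print(count_star, count_a, count_not_null)  # prints: 5 3 3
conn.close()
```

Selecting a column and then counting rows corresponds to the first query; getting COUNT(a) behaviour requires the explicit non-null filter, as in the third query.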

jackye1995 · 2026-02-13T07:33:49Z

P1: Performance Regression - Full Projection in COUNT(*)

Addressed in a subsequent commit

P1: Breaking Change - Public API Removal

The function is not really used anywhere publicly.

westonpace · 2026-02-13T14:10:36Z

-    with pytest.raises(
-        ValueError, match="should not be called on a plan selecting columns"
-    ):
-        ds.scanner(filter="a < 50", columns=["a"], with_row_id=True).count_rows()


It is doing a data scan isn't it? a is not indexed and so there is no way to satisfy the count request without actually scanning the column.

I don't entirely agree that this shouldn't be an error but I also don't disagree enough to complain. I think the only valid concern I could have is that a user doing something like... ds.scanner(columns=["a"]).count_rows() might think this is the same as SELECT COUNT(a) FROM ... (i.e. that it returns the count of non-null rows) but that's a pretty weak argument.

So...feel free to ignore this comment 😛

westonpace · 2026-02-13T14:14:48Z

    pub aggregates: Vec<Expr>,
+    /// Column names required by this aggregate (computed at construction).
+    /// For COUNT(*), this is empty. For SUM(x), GROUP BY y, this contains [x, y].
+    pub required_columns: Vec<String>,


Super minor nit: I think either Aggregate should not have pub fields (you could add an accessor for group_by and aggregates if you wanted) or required_columns should just be computed at plan time and not stored as part of this struct.

The current structure (public fields) makes it seem like Aggregate is a "data class" and users are expected to do something like...

let agg = Aggregate { group_by: ..., aggregates: ..., required_columns: ... }

But really there is no easy way to create this other than Aggregate::new and required_columns is an implementation detail.
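
The design point can be sketched in Python as an analogy only: the real type is a Rust struct, and everything below (field types, the tuple encoding of aggregate calls) is an assumption made for illustration. The idea is that required_columns is derived at construction and cannot be supplied or changed by the caller.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass(frozen=True)
class Aggregate:
    """Illustrative stand-in for the Rust struct; not Lance's actual type."""
    group_by: Tuple[str, ...]
    # Each aggregate is (function, column); a None column models COUNT(*).
    aggregates: Tuple[Tuple[str, Optional[str]], ...]
    # Derived value: excluded from __init__, so callers cannot pass it in.
    required_columns: Tuple[str, ...] = field(init=False)

    def __post_init__(self) -> None:
        seen = []
        for _func, column in self.aggregates:
            if column is not None and column not in seen:
                seen.append(column)
        for column in self.group_by:
            if column not in seen:
                seen.append(column)
        # frozen=True blocks normal assignment, so set the derived field directly.
        object.__setattr__(self, "required_columns", tuple(seen))


count_star = Aggregate(group_by=(), aggregates=(("count", None),))
sum_by = Aggregate(group_by=("y",), aggregates=(("sum", "x"),))
print(count_star.required_columns)
print(sum_by.required_columns)
```

Computing the field lazily at plan time, the other option raised in the comment, would drop the stored field entirely and replace it with a function of group_by and aggregates.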

westonpace · 2026-02-13T14:16:19Z

-        .boxed()
-    }
+            let mut scanner = self.clone();
+            scanner.aggregate(AggregateExpr::builder().count_star().build())?;


Nice cleanup!

westonpace · 2026-02-13T14:17:12Z

            if self.limit.is_some() || self.offset.is_some() {
                log::warn!(
                    "count_rows called with limit or offset which could have surprising results"
                );
            }


You can probably get rid of this warning at this point. The entire operation will fail so no need to warn about it.

jackye1995 marked this pull request as draft February 13, 2026 05:46

github-actions Bot added the python label Feb 13, 2026

jackye1995 changed the title from "refactor: use scanner aggregate for count_rows" to "fix: remove column projection for count aggregate" Feb 13, 2026

github-actions Bot added the bug Something isn't working label Feb 13, 2026

jackye1995 commented Feb 13, 2026

jackye1995 changed the title from "fix: remove column projection for count aggregate" to "fix: remove unnecessary column projection for count aggregate" Feb 13, 2026

jackye1995 marked this pull request as ready for review February 13, 2026 07:51

westonpace approved these changes Feb 13, 2026

jackye1995 added 8 commits February 13, 2026 14:25

a49a091 refactor: use scanner aggregate for count_rows
cb1bfaf cleanup
7c33453 fix optimization
37c9cba fix test
fcccfc6 fix lint
552edae avoid duplicated call to aggregate_required_columns
009df29 address comments
37be886 fix clippy

jackye1995 force-pushed the deprecate-count-rows branch from 089a829 to 37be886 Compare February 13, 2026 22:31

github-actions Bot added the java label Feb 13, 2026

jackye1995 merged commit f829eef into lance-format:main Feb 14, 2026
28 checks passed

andrea-reale mentioned this pull request Mar 30, 2026

emilk/fix write starvation rerun-io/lance#12

Closed
