
feat: support aggregate in scanner #5911

Merged
jackye1995 merged 9 commits into lance-format:main from jackye1995:substrait-df
Feb 13, 2026

Conversation

Contributor

@jackye1995 jackye1995 commented Feb 8, 2026

The main use case of this PR is to allow engines like Spark and Trino to push down an aggregate into the Lance scanner on a distributed worker when possible. Today we technically already support COUNT(*) pushdown through scanner.count_rows(), which counts the rows of each fragment in a distributed manner; this is a more generic version of that. My plan is to allow an engine to pass a Substrait Aggregate expression to the scanner in the worker, to support pushing down other aggregations like SUM, MAX, and MIN.

Another alternative I considered is to update the dataset.sql() API to accept a full Substrait plan, so we could execute a plan with an aggregate and update the distributed worker to run a SQL statement instead of running the scanner. But doing this feature in the scanner feels more aligned with how engines implement distributed execution: essentially, whatever could be executed by a single worker in a distributed environment (predicate pushdown, column projection, aggregate pushdown) should be supported by the scanner.

Note that with this change we could technically remove create_count_plan, since it is just a special case of create_aggregate_plan, but we are not doing that in this PR. Once we agree on this direction, I will open a separate PR to refactor it.
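As a sketch of the intended call pattern (the method and type names below are stand-ins based on this discussion, not the final Lance API): a worker receives serialized Substrait AggregateRel bytes from the engine and attaches them to its scanner before planning.

```rust
/// Stand-in for the aggregate an engine hands to a worker.
#[derive(Debug, Clone, PartialEq)]
enum AggregateExpr {
    /// Serialized `substrait.AggregateRel` message bytes.
    Substrait(Vec<u8>),
}

/// Minimal mock of the scanner surface discussed in this PR.
struct Scanner {
    aggregate: Option<AggregateExpr>,
}

impl Scanner {
    fn new() -> Self {
        Self { aggregate: None }
    }

    /// Attach an engine-provided aggregate for pushdown.
    fn aggregate(&mut self, agg: AggregateExpr) -> &mut Self {
        self.aggregate = Some(agg);
        self
    }
}
```

In the real API the scanner would then build an execution plan that applies the parsed aggregate on top of the scan, analogous to how count_rows() works today.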

@github-actions github-actions Bot added the enhancement New feature or request label Feb 8, 2026
Contributor

github-actions Bot commented Feb 8, 2026

Code Review

Summary

This PR adds Substrait aggregate pushdown support to Lance, enabling query engines like Trino to push aggregates (COUNT, SUM, AVG, MIN, MAX with GROUP BY) to the storage layer.

P0/P1 Issues

P1: Test assertion is a no-op (line 311 in substrait.rs)

assert!(result.is_ok() || result.is_err());

This assertion always passes. Either remove the test or make it meaningful by testing specific expected behavior (e.g., assert the specific error type when extensions are missing).
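A tighter version of that test might pin down the expected failure instead. The error type and resolver below are illustrative stand-ins, not the actual substrait.rs code:

```rust
/// Illustrative error type for an unresolvable aggregate function.
#[derive(Debug, PartialEq)]
enum ParseError {
    MissingExtension(String),
}

/// Illustrative resolver: fails when the function is absent from the
/// extension registry, mirroring the "missing extensions" case the
/// review suggests asserting on.
fn resolve_function(registry: &[&str], name: &str) -> Result<(), ParseError> {
    if registry.contains(&name) {
        Ok(())
    } else {
        Err(ParseError::MissingExtension(name.to_string()))
    }
}
```

A test can then assert the exact variant, e.g. `assert_eq!(resolve_function(&[], "avg"), Err(ParseError::MissingExtension("avg".to_string())))`, which fails if the behavior regresses.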

P1: Memory allocation on hot path (scanner.rs:1382)

self.substrait_aggregate = Some(aggregate_rel.to_vec());

The aggregate_substrait() method copies the entire byte buffer. Consider accepting Vec<u8> directly with an into() pattern to allow callers to avoid the copy when they already have ownership.
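The ownership-transfer pattern the review suggests can be sketched as follows (the struct is a minimal stand-in for the real scanner):

```rust
/// Minimal mock carrying only the field relevant to this comment.
struct Scanner {
    substrait_aggregate: Option<Vec<u8>>,
}

impl Scanner {
    /// Accepting `impl Into<Vec<u8>>` lets a caller that already owns a
    /// `Vec<u8>` move it in with no copy, while a `&[u8]` caller passes
    /// `slice.to_vec()` explicitly, making the allocation visible.
    fn aggregate_substrait(&mut self, aggregate_rel: impl Into<Vec<u8>>) -> &mut Self {
        self.substrait_aggregate = Some(aggregate_rel.into());
        self
    }
}
```

With this signature the copy only happens when the caller chooses it, rather than unconditionally inside the setter.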

P1: Incomplete implementation marker (substrait.rs:231)

let order_by = Vec::new(); // TODO: parse agg_func.sorts if needed

ORDER BY in aggregates (e.g., ARRAY_AGG(...) ORDER BY ...) is not supported, but no error is returned if a user provides sorts. Consider either implementing it or returning an error when !agg_func.sorts.is_empty().
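The fail-fast option could look roughly like this (the struct is a stand-in for the Substrait-generated AggregateFunction type; later commits in this PR implement sorts parsing instead):

```rust
/// Stand-in for the generated Substrait type; `sorts` models
/// `agg_func.sorts` (the ORDER BY clauses inside the aggregate call).
struct AggregateFunction {
    sorts: Vec<String>,
}

/// Illustrative check: reject ordered aggregates up front instead of
/// silently dropping the sort keys.
fn check_sorts(agg_func: &AggregateFunction) -> Result<(), String> {
    if !agg_func.sorts.is_empty() {
        return Err("ORDER BY within an aggregate call is not supported".to_string());
    }
    Ok(())
}
```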

Minor Observations (not blocking)

  • The parse_aggregate_rel function with default extensions (line 142-149) may be confusing since it will silently fail to resolve most aggregate functions. The documentation could clarify when this variant is appropriate to use.

  • Consider documenting thread-safety guarantees for the new substrait_aggregate field if Scanner is intended to be used across threads.

Overall: Good test coverage with 14 tests covering various aggregate scenarios. The implementation correctly leverages DataFusion's physical planning and Substrait parsing infrastructure.

@jackye1995 jackye1995 changed the title feat(lance): add Substrait aggregate pushdown support feat(lance): support Substrait aggregate in scanner Feb 8, 2026
@jackye1995 jackye1995 changed the title feat(lance): support Substrait aggregate in scanner feat: support Substrait aggregate in scanner Feb 8, 2026

codecov Bot commented Feb 8, 2026

Codecov Report

❌ Patch coverage is 61.18211% with 243 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset/scanner.rs 40.68% 144 Missing and 12 partials ⚠️
rust/lance-datafusion/src/substrait.rs 76.03% 66 Missing and 21 partials ⚠️


@jackye1995 jackye1995 force-pushed the substrait-df branch 3 times, most recently from afc2170 to cf4c194 on February 9, 2026 at 06:55
Add support for aggregates via Substrait AggregateRel specification.

Key changes:
- Add `AggregateSpec` enum with Substrait and Datafusion variants
- Add `aggregate_substrait()` and `aggregate_expr()` methods to Scanner
- Add `create_aggregate_plan()` to build execution plan with AggregateExec
- Add Substrait parsing utilities in lance-datafusion for AggregateRel
- Implement type coercion for UserDefined signature functions (e.g., AVG)
- Support output column aliases via RelRoot.names

Supported: COUNT, SUM, AVG, MIN, MAX with GROUP BY.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
jackye1995 and others added 6 commits February 8, 2026 23:21
…e API

- Rename AggregateSpec to AggregateExpr for consistency
- Add helper constructors: substrait() and datafusion()
- Combine aggregate_substrait() and aggregate_expr() into single aggregate() method
- Update tests to use new API

- Create lance-datafusion/src/aggregate.rs for Aggregate struct
- Remove #[cfg(feature = "substrait")] from create_aggregate_plan
- create_aggregate_plan now works without substrait feature

Parse agg_func.sorts for ordered aggregates like ARRAY_AGG.

Add tests for FIRST_VALUE with ORDER BY ASC and DESC to verify
the sorts parsing works correctly.

@jackye1995 jackye1995 changed the title feat: support Substrait aggregate in scanner feat: support aggregate in scanner Feb 9, 2026
Member

@westonpace westonpace left a comment


I think this is a good addition. The scanner creates very limited linear SQL plans (SCAN -> FILTER -> SORT? -> LIMIT? -> PROJECT) and so I think this still fits (SCAN -> FILTER -> SORT? -> LIMIT? -> PROJECT -> AGGREGATE?).

My only concern is making sure we have a good non-Substrait path. But we can add that in a follow-up too.

Comment on lines +11 to +12
pub group_by: Vec<Expr>,
pub aggregates: Vec<Expr>,
Member


Let's go ahead and comment these two.

Comment on lines +1840 to +1843
/// Create an execution plan with aggregation.
///
/// Requires `aggregate()` to be called first.
pub fn create_aggregate_plan(&self) -> BoxFuture<'_, Result<Arc<dyn ExecutionPlan>>> {
Member


Why do we need a separate create_aggregate_plan? Can we just append the aggregate expr onto the end of create_plan when aggregate is set? It would be surprising to me as a user if I called aggregate and then create_plan just ignored it.

Contributor


Is there a use-case (like my plan_splits proposal?) that would benefit from separating this logic? Maybe create_aggregate_plan is called in plan_splits and create_plan? If the goal is to create two separate entrypoints, one for executing an expr and one for breaking it into multiple partitions, maybe there is utility. I agree that setting an aggregate and then having it not apply when calling create_plan would surprise me as a user.

Comment on lines +471 to +475
Datafusion {
group_by: Vec<Expr>,
aggregates: Vec<Expr>,
output_names: Vec<String>,
},
Member


We can do this in a follow-up but there should be a way for callers to specify an AggregateExpr without needing to use DF or Substrait. Ideally with some kind of builder.

Comment thread rust/lance/src/dataset/scanner.rs Outdated
Datafusion {
group_by: Vec<Expr>,
aggregates: Vec<Expr>,
output_names: Vec<String>,
Member


Why is this needed? Datafusion Expr already has a name concept. In other words, if a user wants the MAX(temp) aggregate to be named max_temp they can do .alias("max_temp") on the Expr right?

Ok(agg_expr)
}

/// Apply type coercion to aggregate arguments for UserDefined signature functions.
Member


Why is this needed? Why don't we do type coercion on non-user defined functions?

Comment thread rust/lance/src/dataset/scanner.rs Outdated
coerced_expr
};

let (agg_expr, _filter, _order_by) = create_aggregate_expr_and_maybe_filter(
Member


What does it mean if _filter or _order_by are set? Do we need to utilize these?

Comment thread rust/lance/src/dataset/scanner.rs Outdated
agg_func.params.null_treatment,
)))
}
other => Ok(other.clone()),
Member


Should it be an error if we hit this branch?

Comment thread rust/lance-datafusion/src/aggregate.rs Outdated
Comment on lines +13 to +14
/// Output column names in order: group_by columns first, then aggregates.
pub output_names: Vec<String>,
Member


I mention this in scanner.rs too but having output_names be a separate property is a Substrait thing. I think we could instead alias the expressions when we are parsing the Substrait. This keeps the logic in Scanner simpler.


/// Set aggregation.
pub fn aggregate(&mut self, aggregate: AggregateExpr) -> &mut Self {
self.aggregate = Some(aggregate);
Member


We should probably emit a warning or error if a user calls both aggregate and limit.

Comment thread rust/lance/src/dataset/scanner.rs Outdated
)
})?;

let plan = self.create_plan().await?;
Member


If a user calls order_by how should we interpret it? Ordering the data before the aggregate (what it is doing today) is just meaningless work. We could interpret it as a request to order the data after the aggregate? Or we could just return an error and say you can't do both? Or we could just log a warning and ignore the sort?

Contributor

@hamersaw hamersaw left a comment


Is the premise that some aggregates can be performed more efficiently by the scanner than by the distributed execution framework? I think in scenarios where we have to read all of the data it should be pretty similar performance-wise if this is done by a datafusion step or by the framework. There are specific cases, like MIN / MAX where we could serve from Zonemaps if there is no filter; is this the kind of query we're trying to optimize for?

Comment on lines +1840 to +1843
/// Create an execution plan with aggregation.
///
/// Requires `aggregate()` to be called first.
pub fn create_aggregate_plan(&self) -> BoxFuture<'_, Result<Arc<dyn ExecutionPlan>>> {
Contributor


Is there a use-case (like my plan_splits proposal?) that would benefit from separating this logic? Maybe create_aggregate_plan is called in plan_splits and create_plan? If the goal is to create two separate entrypoints, one for executing an expr and one for breaking it into multiple partitions, maybe there is utility. I agree that setting an aggregate and then having it not apply when calling create_plan would surprise me as a user.

Member

@westonpace westonpace left a comment


Some minor nits but good to go otherwise. Thanks for tackling this!

Comment on lines +548 to +563
/// Add a column to group by.
pub fn group_by(mut self, column: impl Into<String>) -> Self {
self.group_by.push(col(column.into()));
self
}

/// Add multiple columns to group by.
pub fn group_by_columns(
mut self,
columns: impl IntoIterator<Item = impl Into<String>>,
) -> Self {
for column in columns {
self.group_by.push(col(column.into()));
}
self
}
Member


Minor nit: can we comment that multiple invocations of group_by or group_by_columns will add to the list (and not replace it). E.g. .group_by("x").group_by_columns(["y", "z"]) will group by x, y, and z.

Comment thread rust/lance/src/dataset/scanner.rs Outdated
Comment on lines +573 to +574
/// Add COUNT(column) aggregate.
pub fn count(self, column: impl Into<String>) -> AggregateExprBuilderWithPendingAggregate {
Member


Minor nit: the difference between count_star and count is always subtle for newcomers to SQL. Can we expand this comment?

    /// Add COUNT(column) aggregate.
    ///
    /// Unlike count_star this will only return the number of rows where `column`
    /// is not NULL

Comment thread rust/lance/src/dataset/scanner.rs Outdated
Comment on lines +622 to +627
/// Builder state with a pending aggregate that can be aliased.
#[derive(Debug, Clone)]
pub struct AggregateExprBuilderWithPendingAggregate {
builder: AggregateExprBuilder,
pending: Expr,
}
Member


Hmm, I don't love having two structs (and duplicating all the methods) but it isn't the end of the world. One solution I've seen to this problem has been to use const generics (not sure the below compiles but should communicate the idea)...

pub struct AggregateExprBuilder<const HAS_PENDING_AGG: bool> {
    group_by: Vec<Expr>,
    aggregates: Vec<Expr>,
}

impl AggregateExprBuilder<false> {
    pub fn new() -> Self {
        Self {
            group_by: Vec::default(),
            aggregates: Vec::default(),
        }
    }
}

impl<const HAS_PENDING_AGG: bool> AggregateExprBuilder<HAS_PENDING_AGG> {
    ...
    /// Add SUM(column) aggregate.
    pub fn sum(mut self, column: impl Into<String>) -> AggregateExprBuilder<true> {
        self.aggregates
            .push(functions_aggregate::sum::sum(col(column.into())));
        AggregateExprBuilder::<true> {
            group_by: self.group_by,
            aggregates: self.aggregates,
        }
    }
}

impl AggregateExprBuilder<true> {
    /// Set an alias for the pending (most recently added) aggregate.
    pub fn alias(mut self, name: impl Into<String>) -> Self {
        let aliased = self
            .aggregates
            .pop()
            .expect("HAS_PENDING_AGG guarantees a pending aggregate")
            .alias(name.into());
        self.aggregates.push(aliased);
        self
    }
}

Comment thread rust/lance/src/dataset/scanner.rs Outdated
if self.aggregate.is_some() {
if self.limit.is_some() || self.offset.is_some() {
return Err(Error::InvalidInput {
source: "Cannot use limit/offset with aggregate. Apply limit after aggregation instead.".into(),
Member


Suggested change
source: "Cannot use limit/offset with aggregate. Apply limit after aggregation instead.".into(),
source: "Cannot use limit/offset with aggregate. Apply limit to the result instead.".into(),

Comment thread rust/lance/src/dataset/scanner.rs Outdated
}
if self.ordering.is_some() {
return Err(Error::InvalidInput {
source: "Cannot use order_by with aggregate. Apply ordering after aggregation instead.".into(),
Member


Suggested change
source: "Cannot use order_by with aggregate. Apply ordering after aggregation instead.".into(),
source: "Cannot use order_by with aggregate. Apply ordering to the result instead.".into(),

// Stage 2.5: aggregate (if set, applies aggregate and returns early)
if let Some(agg_spec) = &self.aggregate {
// Take columns needed for aggregation
plan = self.take(plan, self.projection_plan.physical_projection.clone())?;
Member


I think this a no-op? We should always have the physical projection loaded already? I don't think aggregates are allowed to reference additional columns so we shouldn't need a take here?

Contributor Author


This is actually not a no-op if we do an aggregate on top of vector search or full text search: those sources return search results with scores rather than the full projection columns. The take() function handles both cases gracefully, since TakeExec::try_new returns None when no new columns are needed, causing it to simply return the original plan unchanged. So while it's often a no-op for simple scans, we keep it there for correctness with other source types, and the cost is just a minimal check.

Contributor Author


I added a few new tests to cover those cases

Member


Ah, ok. Normally we'd do that take at stage 5. So we are pulling it up here instead which is fine since we prevent an aggregate with a limit (otherwise we'd be taking too much if we did that here)

All good!

Comment thread rust/lance/src/dataset/scanner.rs Outdated
// Stage 2: filter
plan = filter_plan.refine_filter(plan, self).await?;

// Stage 2.5: aggregate (if set, applies aggregate and returns early)
Member


We should really just get rid of the whole "stage" concept in the comments 😆

plan = filter_plan.refine_filter(plan, self).await?;

// Stage 2.5: aggregate (if set, applies aggregate and returns early)
if let Some(agg_spec) = &self.aggregate {
Member


I'm slightly torn on whether it makes sense to apply the aggregate before or after the projection. I fear we will have complaints either way.

If we apply the aggregate here (before projection) we cannot reference system columns or projected columns in the aggregate (e.g. can't do MAX(_rowoffset) or MAX(x * 2)). If we apply the aggregate later (stage 7.5) then we can't reference the aggregate in the projection (e.g. can't do 2 * MAX(x)).

I suppose it only really makes sense to be utilizing aggregates that can be pushed down and we wouldn't push down something like MAX(x * 2) anyways (I mean...we could...but I have no desire to do so).

Ok...I think I've convinced myself I like this how it is!

Contributor Author

Thanks for the reviews!

@jackye1995 jackye1995 merged commit 8457062 into lance-format:main Feb 13, 2026
32 checks passed
jackye1995 added a commit that referenced this pull request Feb 13, 2026

Labels

enhancement New feature or request


3 participants