Skip to content

add support for filtered projections#18342

Merged
clintropolis merged 10 commits intoapache:masterfrom
clintropolis:filtered-projections
Aug 7, 2025
Merged

add support for filtered projections#18342
clintropolis merged 10 commits intoapache:masterfrom
clintropolis:filtered-projections

Conversation

@clintropolis
Copy link
Copy Markdown
Member

@clintropolis clintropolis commented Jul 30, 2025

Description

This PR adds a filter property to AggregateProjectionSpec to allow projections to be pre-filtered to further reduce row counts for views to power queries with commonly used filters. The filter is built into a ValueMatcher in OnHeapAggregateProjection and rows are checked to match before adding to the facts table. At query time, if a projection has a filter we check to see that the query either contains the exact filter, or is part of an AND filter to ensure that the projection contains all of the rows which would normally match if querying the base table instead. If the query filter matches, the projection filter is removed from the query since it is no longer necessary.

The filter matching is pretty brittle right now and requires an exact match, but this can be improved in follow-up work.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions github-actions Bot added Area - Batch Ingestion Area - Segment Format and Ser/De Area - Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Jul 30, 2025
@@ -553,6 +555,7 @@ public void testInsertOnExternalDataSourceWithCatalogProjections(String contextN
new AggregateProjectionMetadata.Schema(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a builder to avoid the need for null filter everywhere?

}

// if the projection has a filter, the query must contain this filter match
if (filter != null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, put this logic and rewriteFilter in their own dedicated file. The logic will only grow more complex over time.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, this whole method is giant so I started to split all the sub parts out into separate functions to like do the grouping match, aggregator match, and realized there is another place we are doing the query filter match, i think they need to be consolidated to be able to support projections where the filter inputs aren't actually stored in the projection as well as it would be more efficient to look at filter stuff once. I think we would need to shift the filter matching to happen last since it needs the remapped columns the be populated

Will consider splitting into a separate file, though in that case should almost all of the logic be moved out of this file into like Projections which already has findMatchingProjection which is what calls this matches method? That seems fine to me, just pass in the AggregateProjectionMetadata.Schema to most of these methods. Or I guess i could make a utility file per match type, like a GroupingMatcher and FilterMatcher and AggregatorMatcher 🤷

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, locally i've moved most of the logic into Projections (and moved the ConstantTimeColumn stuff in Projections to its own file), and it seems nicer so far, more to do tho so havent pushed yet

}
// at least one child must have been rewritten to rewrite the AND
if (childRewritten) {
if (newChildren.size() > 1) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can newChildren be empty, if all filters under the AND of queryFilter were part of projectionFilter?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think not with the current rather weak equals based matching logic - either the parent AND would match a projection AND, or at most a single child of this AND can match. Doing an overlap of AND children would be an improvement, and run into the problem you're asking about I think. I'll see if I can wire that up tho since it seems important for a query with something like x = 'a' and y = 'b' and z = 'c' to match a projection with x = 'a' and y = 'b'

@Nullable
public static Filter rewriteFilter(@Nullable Filter projectionFilter, @Nullable Filter queryFilter)
{
if (projectionFilter == null || queryFilter == null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can queryFilter really be null? It seems like this method is only called for nonnull query filters.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, not anymore

this.timeColumnName = timeColumnName;
this.timeColumnPosition = foundTimePosition;
this.granularity = granularity == null ? Granularities.ALL : granularity;
this.effectiveGranularity = granularity == null ? Granularities.ALL : granularity;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we're expecting the same size/ ordering bwteen groupingColumns and OrderBy between line 223 and 224, this assumption is guaranteed from ProjectionsSpec but not obvious here, maybe worth adding some checks.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also it seems like this effectiveGranularity and orderingWithTimeSubstitution logic is very similar to the computeOrdering function, maybe it could be merged?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we're expecting the same size/ ordering bwteen groupingColumns and OrderBy between line 223 and 224, this assumption is guaranteed from ProjectionsSpec but not obvious here, maybe worth adding some checks.

yea, good spot, it is true they unfortunately are coupled right now, though in the future I think i would like to make them more independent rather than consolidate to keep us free to potentially allow user choice in ordering in the future. Given that this is not new in this PR I would rather deal with this in a follow-up

}
AggregateProjectionSpec that = (AggregateProjectionSpec) o;
return Objects.equals(name, that.name)
&& Objects.equals(filter, that.filter)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it intentional timeColumnName is not included in equals/ hashCode/ toString method, but ordering is? Both are derived so seems like not needed.

Copy link
Copy Markdown
Member Author

@clintropolis clintropolis Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is sort of aspirational - I think it would be possible someday for ordering to not be fixed to grouping if we ever open up the comparator used to sort data when building segments. However, no such plans for the time column.

(however the other part you noticed about relation between grouping and ordering maybe sort of breaks this)

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, with some thoughts on the tests.

import java.util.Map;
import java.util.Set;

class ProjectionsTest
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public class? not sure it matters, just odd not to see the keyword.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

used intellij to generate a test file and saw it without the public, i looked into it and apparently in junit5 it is not necessary so have been rolling with it 🤷 https://docs.junit.org/current/user-guide/#writing-tests-classes-and-methods

}

@Test
void testSchemaMatchFilter()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're going to need easier, more concise ways to write tests for filter matching. As the filter matching gets more sophisticated, there will be a lot of variations to test to ensure everything is working properly.

Maybe we can define a few "standard" datasources for the tests, each one with a base table and some projection(s). Then each test would just need to define a CursorBuildSpec and the expected ProjectionMatch.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yea, i was writing these tests as mainly synthetic to smoke test stuff as they are not backed by real data. I think for the bulk of tests I would agree we want something like what CursorFactoryProjectionTest is doing with a matrix of incremental and immutable segments but with a more cursor + filter focus and make it test that results are the same with and without the projection similar to what the native filter tests are doing

cursorBuildSpec,
new RowSignatureChecker(baseTable)
);
Projections.ProjectionMatch expected = new Projections.ProjectionMatch(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit- moving ProjectionMatch out into its own class would help a little with making the test code more concise.

@clintropolis clintropolis merged commit fe57bc3 into apache:master Aug 7, 2025
95 of 96 checks passed
@clintropolis clintropolis deleted the filtered-projections branch August 7, 2025 18:16
@cecemei cecemei added this to the 35.0.0 milestone Oct 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area - Batch Ingestion Area - Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Area - Segment Format and Ser/De

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants