[feature](inverted index) Implement es-like boolean query #58545
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
zclllyybb
left a comment
do we need some tests?
Temporary commit, code is not yet complete. |
48331af to
431d2bf
Compare
|
run buildall |
TPC-H: Total hot run time: 36138 ms |
TPC-DS: Total hot run time: 179060 ms |
ClickBench: Total hot run time: 27.34 s |
|
run buildall |
TPC-H: Total hot run time: 35414 ms |
TPC-DS: Total hot run time: 178707 ms |
ClickBench: Total hot run time: 27.18 s |
|
run buildall |
TPC-H: Total hot run time: 35988 ms |
TPC-DS: Total hot run time: 178006 ms |
ClickBench: Total hot run time: 27.47 s |
BE UT Coverage Report: Increment line coverage (Increment coverage report)
|
BE Regression && UT Coverage Report: Increment line coverage (Increment coverage report)
|
8feba37 to
f864ddb
Compare
|
run buildall |
TPC-H: Total hot run time: 35117 ms |
TPC-DS: Total hot run time: 177988 ms |
ClickBench: Total hot run time: 27.57 s |
BE UT Coverage Report: Increment line coverage (Increment coverage report)
|
BE Regression && UT Coverage Report: Increment line coverage (Increment coverage report)
|
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
airborne12
left a comment
LGTM
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #58545

Problem Summary:

This PR introduces two new features for the SEARCH function:

#### 1. Lucene Boolean Mode

Adds a `mode` option to enable Lucene/Elasticsearch-style query parsing:

```sql
-- Enable Lucene mode via JSON options
SELECT * FROM docs WHERE search('apple AND banana', '{"default_field":"title","mode":"lucene"}');

-- With minimum_should_match
SELECT * FROM docs WHERE search('apple AND banana OR cherry', '{"default_field":"title","mode":"lucene","minimum_should_match":1}');
```

**Key differences from standard mode:**
- AND/OR/NOT work as left-to-right modifiers (not traditional boolean algebra)
- Uses MUST/SHOULD/MUST_NOT internally (like Lucene's Occur enum)
- Pure NOT queries return empty results (need positive clause)

**Behavior comparison:**

| Query | Standard Mode | Lucene Mode |
|-------|--------------|-------------|
| `a AND b` | a ∩ b | +a +b (both MUST) |
| `a OR b` | a ∪ b | a b (both SHOULD, min=1) |
| `NOT a` | ¬a | Empty (no positive clause) |
| `a AND NOT b` | a ∩ ¬b | +a -b (MUST a, MUST_NOT b) |
| `a AND b OR c` | (a ∩ b) ∪ c | +a b c (only a is MUST) |

#### 2. Escape Characters in DSL

Support for escaping special characters using backslash:

| Escape | Description | Example |
|--------|-------------|---------|
| `\ ` | Literal space | `title:First\ Value` matches "First Value" |
| `\(` `\)` | Literal parentheses | `title:hello\(world\)` matches "hello(world)" |
| `\:` | Literal colon | `title:key\:value` matches "key:value" |
| `\\` | Literal backslash | `title:path\\to\\file` matches "path\to\file" |
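The Lucene Mode column of the behavior comparison table can be reproduced with a small left-to-right pass over the tokens. The sketch below is written in Python for illustration only and is not the Doris implementation. The rule it encodes is an assumption inferred from the table rows: `AND` marks both the preceding and the current clause as MUST, `OR` marks both as SHOULD, and `NOT` marks the current clause as MUST_NOT. The names `Occur`, `assign_occurs`, and `render` are hypothetical, and parentheses, phrases, and field prefixes are not handled.

```python
from enum import Enum


class Occur(Enum):
    # The values double as the prefix used in Lucene's "+a -b c" notation.
    MUST = "+"
    SHOULD = ""
    MUST_NOT = "-"


def assign_occurs(query: str) -> list[tuple[str, Occur]]:
    """Assign an Occur to each term, scanning whitespace-separated tokens left to right."""
    clauses: list[tuple[str, Occur]] = []
    conj = None      # pending "AND" / "OR" seen before the current term
    negate = False   # pending "NOT" seen before the current term
    for token in query.split():
        if token in ("AND", "OR"):
            conj = token
            continue
        if token == "NOT":
            negate = True
            continue
        # Assumed rule: the operator also rewrites the clause to its left,
        # unless that clause is already prohibited.
        if clauses and conj is not None:
            prev_term, prev_occur = clauses[-1]
            if prev_occur is not Occur.MUST_NOT:
                new_occur = Occur.MUST if conj == "AND" else Occur.SHOULD
                clauses[-1] = (prev_term, new_occur)
        if negate:
            occur = Occur.MUST_NOT
        elif conj == "AND":
            occur = Occur.MUST
        else:
            occur = Occur.SHOULD
        clauses.append((token, occur))
        conj = None
        negate = False
    return clauses


def render(clauses: list[tuple[str, Occur]]) -> str:
    """Format clauses in Lucene's prefix notation."""
    return " ".join(occur.value + term for term, occur in clauses)


print(render(assign_occurs("a AND b OR c")))  # +a b c
```

In `a AND b OR c`, the `OR` demotes `b` from MUST back to SHOULD, which is why only `a` stays required. This is the main way the result differs from the standard-mode reading `(a ∩ b) ∪ c`.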
…feature

This commit adds the necessary dependency files from PR #58545 to fix compilation errors in the cherry-picked PR #59394 (lucene bool mode for search function).

Changes include:
- Updated clucene submodule to include skipToBlock/nextDeltaPosition methods
- Added OccurBooleanQuery and related classes (occur.h, occur_boolean_query.h, occur_boolean_weight.h/cpp, boolean_query_builder.h)
- Moved operator.h to boolean_query/ directory and fixed include paths
- Updated function_search.h/cpp to use correct include paths
- Various query_v2 file updates for compatibility

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…uery to branch-4.0

Cherry-pick the full implementation and unit tests from PR #58545 to branch-4.0. Most source code was already added in previous commits as dependencies for PR #59394. This commit completes the cherry-pick by adding:

Source file fixes:
- regexp_weight.cpp: Fixed to use make_segment_postings() helper

Unit test files (new):
- boolean_query/boolean_query_builder_test.cpp: Tests for query builders
- buffered_union_test.cpp: Tests for BufferedUnion scorer
- disjunction_scorer_test.cpp: Tests for DisjunctionScorer
- exclude_scorer_test.cpp: Tests for ExcludeScorer
- occur_boolean_query_test.cpp: Tests for OccurBooleanQuery
- reqopt_scorer_test.cpp: Tests for ReqOptScorer

Unit test files (updated to PR version):
- boolean_query_test.cpp: Updated to use OperatorBooleanQueryBuilder
- intersection_test.cpp: Updated API calls
- segment_postings_test.cpp: Updated to PR version

All tests compile and pass verification.

Related PR: #58545

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Cherry-pick PR #58545 to branch-4.0.

This PR implements ES-like boolean query for inverted index, including:

Source files:
- OccurBooleanQuery and related classes for ES-style MUST/SHOULD/MUST_NOT
- OperatorBooleanQuery refactored to separate file
- DisjunctionScorer, ExcludeScorer, ReqOptScorer implementations
- BufferedUnion for efficient union operations
- Updated intersection and segment_postings APIs

Unit tests:
- boolean_query_builder_test.cpp
- buffered_union_test.cpp
- disjunction_scorer_test.cpp
- exclude_scorer_test.cpp
- occur_boolean_query_test.cpp
- reqopt_scorer_test.cpp
- Updated existing test files

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…58545 (#59766)

## Summary

Cherry-pick PR #58545 to branch-4.0.

This PR implements ES-like boolean query for inverted index, including:

**Source files:**
- OccurBooleanQuery and related classes for ES-style MUST/SHOULD/MUST_NOT
- OperatorBooleanQuery refactored to separate file
- DisjunctionScorer, ExcludeScorer, ReqOptScorer implementations
- BufferedUnion for efficient union operations
- Updated intersection and segment_postings APIs

**Unit tests:**
- boolean_query_builder_test.cpp
- buffered_union_test.cpp
- disjunction_scorer_test.cpp
- exclude_scorer_test.cpp
- occur_boolean_query_test.cpp
- reqopt_scorer_test.cpp
- Updated existing test files

## Test plan

- [x] BE unit tests compile successfully
- [x] Unit tests pass verification

Related PR: #58545

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
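As a rough guide to what ES-style MUST/SHOULD/MUST_NOT means for the matching document set, here is a set-based sketch in Python. It is not the scorer code from this PR: the function name `evaluate_occur_query` is hypothetical, and the default for `minimum_should_match` (0 when a MUST clause exists, otherwise 1) is an assumption based on the usual Lucene/Elasticsearch convention. Each clause is modeled as a plain set of matching document IDs.

```python
from typing import Optional


def evaluate_occur_query(
    must: list[set[int]],
    should: list[set[int]],
    must_not: list[set[int]],
    minimum_should_match: Optional[int] = None,
) -> set[int]:
    """Return the doc IDs matching a MUST/SHOULD/MUST_NOT clause combination."""
    # A query with only MUST_NOT clauses has no positive clause: empty result.
    if not must and not should:
        return set()
    # Assumed default: SHOULD is optional next to MUST, required otherwise.
    if minimum_should_match is None:
        minimum_should_match = 0 if must else 1
    # Candidates come from the required clauses if any, else from the optional ones.
    if must:
        candidates = set.intersection(*must)
    else:
        candidates = set().union(*should)
    excluded = set().union(*must_not)
    return {
        doc
        for doc in candidates
        if doc not in excluded
        and sum(doc in clause for clause in should) >= minimum_should_match
    }


# +a b c -d with a={1,2,3}, b={2}, c={3,4}, d={3}
print(evaluate_occur_query([{1, 2, 3}], [{2}, {3, 4}], [{3}]))  # {1, 2}
```

Passing `minimum_should_match=1` to the same call narrows the result to `{2}`, because document 1 matches no SHOULD clause and document 3 is excluded.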
…unction #59394 (#59745)

Cherry-picked from #59394

**Note:** This PR depends on #59766 (cherry-pick of #58545) being merged first.

## Summary

Introduce lucene bool mode for search function.

## Test plan

- [ ] Regression tests (after dependency PR merged)

Related PRs: #59394

Depends on: #59766

Co-authored-by: Jack <jiangkai@selectdb.com>
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)