[feature](search) introduce lucene bool mode for search function #59394
run buildall
Cloud UT Coverage Report (increment line coverage, increment coverage report)
FE UT Coverage Report (increment line coverage)
TPC-H: Total hot run time: 36862 ms
TPC-DS: Total hot run time: 179700 ms
ClickBench: Total hot run time: 28.42 s
FE Regression Coverage Report (increment line coverage)
run buildall
Cloud UT Coverage Report (increment line coverage, increment coverage report)
TPC-H: Total hot run time: 35186 ms
TPC-DS: Total hot run time: 179364 ms
ClickBench: Total hot run time: 28.6 s
BE UT Coverage Report (increment line coverage, increment coverage report)
BE Regression && UT Coverage Report (increment line coverage, increment coverage report)
FE Regression Coverage Report (increment line coverage)
run buildall
Cloud UT Coverage Report (increment line coverage, increment coverage report)
TPC-H: Total hot run time: 34694 ms
TPC-DS: Total hot run time: 179520 ms
ClickBench: Total hot run time: 28.45 s
FE UT Coverage Report (increment line coverage)
BE UT Coverage Report (increment line coverage, increment coverage report)
BE Regression && UT Coverage Report (increment line coverage, increment coverage report)
FE Regression Coverage Report (increment line coverage)
… search function

Add documentation for two new features in the SEARCH function:

1. Lucene Boolean Mode:
   - JSON-based options parameter (mode, minimum_should_match)
   - Left-to-right modifier parsing (MUST/SHOULD/MUST_NOT)
   - Behavior comparison table with standard mode
2. Escape Characters:
   - Support for escaping special characters in DSL
   - Backslash escapes for space, parentheses, colon, backslash

Updated both English and Chinese versions of search-function.md.

Related PR: apache/doris#59394

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
run check_coverage
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
zclllyybb
left a comment
LGTM
)

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #58545

Problem Summary:

This PR introduces two new features for the SEARCH function:

#### 1. Lucene Boolean Mode

Adds a `mode` option to enable Lucene/Elasticsearch-style query parsing:

```sql
-- Enable Lucene mode via JSON options
SELECT * FROM docs WHERE search('apple AND banana', '{"default_field":"title","mode":"lucene"}');

-- With minimum_should_match
SELECT * FROM docs WHERE search('apple AND banana OR cherry', '{"default_field":"title","mode":"lucene","minimum_should_match":1}');
```

**Key differences from standard mode:**
- AND/OR/NOT work as left-to-right modifiers (not traditional boolean algebra)
- Uses MUST/SHOULD/MUST_NOT internally (like Lucene's Occur enum)
- Pure NOT queries return empty results (need positive clause)

**Behavior comparison:**

| Query | Standard Mode | Lucene Mode |
|-------|--------------|-------------|
| `a AND b` | a ∩ b | +a +b (both MUST) |
| `a OR b` | a ∪ b | a b (both SHOULD, min=1) |
| `NOT a` | ¬a | Empty (no positive clause) |
| `a AND NOT b` | a ∩ ¬b | +a -b (MUST a, MUST_NOT b) |
| `a AND b OR c` | (a ∩ b) ∪ c | +a b c (only a is MUST) |

#### 2. Escape Characters in DSL

Support for escaping special characters using backslash:

| Escape | Description | Example |
|--------|-------------|---------|
| `\ ` | Literal space | `title:First\ Value` matches "First Value" |
| `\(` `\)` | Literal parentheses | `title:hello\(world\)` matches "hello(world)" |
| `\:` | Literal colon | `title:key\:value` matches "key:value" |
| `\\` | Literal backslash | `title:path\\to\\file` matches "path\to\file" |
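The left-to-right modifier rules in the behavior comparison table can be modelled in a few lines. What follows is a minimal Python sketch, not the Doris parser: `Occur`, `assign_occurs`, `render`, and `has_positive_clause` are hypothetical names, the rules are inferred only from the rows of that table, and parentheses, field prefixes, and `minimum_should_match` are deliberately ignored.

```python
from enum import Enum
from typing import List, Tuple


class Occur(Enum):
    """Clause roles, rendered with Lucene-style prefixes."""
    MUST = "+"
    SHOULD = ""
    MUST_NOT = "-"


def assign_occurs(query: str) -> List[Tuple[str, Occur]]:
    """Assign an Occur to each term, scanning tokens left to right.

    Assumed rules (inferred from the comparison table, not from Doris source):
      - AND marks the preceding clause and the next term as MUST
      - OR marks the preceding clause and the next term as SHOULD
      - NOT marks the next term as MUST_NOT
      - a MUST_NOT clause is never rewritten by a later operator
    """
    clauses: List[Tuple[str, Occur]] = []
    conj = None      # pending AND/OR seen before the next term
    negate = False   # pending NOT seen before the next term
    for token in query.split():
        if token in ("AND", "OR"):
            conj = token
        elif token == "NOT":
            negate = True
        else:
            # Rewrite the previous clause according to the pending operator.
            if conj is not None and clauses and clauses[-1][1] is not Occur.MUST_NOT:
                previous = Occur.MUST if conj == "AND" else Occur.SHOULD
                clauses[-1] = (clauses[-1][0], previous)
            if negate:
                occur = Occur.MUST_NOT
            elif conj == "AND":
                occur = Occur.MUST
            else:
                occur = Occur.SHOULD
            clauses.append((token, occur))
            conj, negate = None, False
    return clauses


def render(clauses: List[Tuple[str, Occur]]) -> str:
    """Render clauses in +term / term / -term notation."""
    return " ".join(occur.value + term for term, occur in clauses)


def has_positive_clause(clauses: List[Tuple[str, Occur]]) -> bool:
    """A query with only MUST_NOT clauses has nothing to match against."""
    return any(occur is not Occur.MUST_NOT for _, occur in clauses)


print(render(assign_occurs("a AND b OR c")))
```

Under these assumed rules, `a AND b OR c` renders as `+a b c`: the `OR` demotes the preceding `b` back to SHOULD, which is why only `a` stays MUST. `NOT a` produces a single MUST_NOT clause, so `has_positive_clause` is false, consistent with the table's empty result for pure NOT queries.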
…feature

This commit adds the necessary dependency files from PR #58545 to fix compilation errors in the cherry-picked PR #59394 (lucene bool mode for search function).

Changes include:
- Updated clucene submodule to include skipToBlock/nextDeltaPosition methods
- Added OccurBooleanQuery and related classes (occur.h, occur_boolean_query.h, occur_boolean_weight.h/cpp, boolean_query_builder.h)
- Moved operator.h to boolean_query/ directory and fixed include paths
- Updated function_search.h/cpp to use correct include paths
- Various query_v2 file updates for compatibility

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…uery to branch-4.0

Cherry-pick the full implementation and unit tests from PR #58545 to branch-4.0. Most source code was already added in previous commits as dependencies for PR #59394. This commit completes the cherry-pick by adding:

Source file fixes:
- regexp_weight.cpp: Fixed to use make_segment_postings() helper

Unit test files (new):
- boolean_query/boolean_query_builder_test.cpp: Tests for query builders
- buffered_union_test.cpp: Tests for BufferedUnion scorer
- disjunction_scorer_test.cpp: Tests for DisjunctionScorer
- exclude_scorer_test.cpp: Tests for ExcludeScorer
- occur_boolean_query_test.cpp: Tests for OccurBooleanQuery
- reqopt_scorer_test.cpp: Tests for ReqOptScorer

Unit test files (updated to PR version):
- boolean_query_test.cpp: Updated to use OperatorBooleanQueryBuilder
- intersection_test.cpp: Updated API calls
- segment_postings_test.cpp: Updated to PR version

All tests compile and pass verification.

Related PR: #58545

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…unction #59394 (#59745)

Cherry-picked from #59394

**Note:** This PR depends on #59766 (cherry-pick of #58545) being merged first.

## Summary

Introduce lucene bool mode for search function.

## Test plan

- [ ] Regression tests (after dependency PR merged)

Related PRs: #59394
Depends on: #59766

Co-authored-by: Jack <jiangkai@selectdb.com>
#59845)

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #59394

Problem Summary:

This PR adds `fields` and `type` parameters to the SEARCH function, allowing queries to search across multiple fields with a single query term. This is similar to Elasticsearch's multi_match query with `best_fields` and `cross_fields` types.

#### Multi-Field Search Support

```sql
-- Single term across multiple fields (best_fields mode - default)
SELECT * FROM docs WHERE search('hello', '{"fields":["title","content"]}');
-- Equivalent to: (title:hello) OR (content:hello)

-- Multi-term with AND operator (best_fields mode - default)
SELECT * FROM docs WHERE search('hello world', '{"fields":["title","content"],"default_operator":"and"}');
-- Equivalent to: (title:hello AND title:world) OR (content:hello AND content:world)

-- Multi-term with cross_fields mode
SELECT * FROM docs WHERE search('hello world', '{"fields":["title","content"],"default_operator":"and","type":"cross_fields"}');
-- Equivalent to: (title:hello OR content:hello) AND (title:world OR content:world)

-- Combined with Lucene mode
SELECT * FROM docs WHERE search('machine AND learning', '{"fields":["title","content"],"mode":"lucene","minimum_should_match":0}');
```

#### Type Parameter Options

| Type | Description | Behavior |
|------|-------------|----------|
| `best_fields` (default) | All terms must match within the **SAME** field | `"hello world"` → `(title:hello AND title:world) OR (content:hello AND content:world)` |
| `cross_fields` | Terms can match across **DIFFERENT** fields | `"hello world"` → `(title:hello OR content:hello) AND (title:world OR content:world)` |

**Key features:**
- `type` parameter controls how terms are matched across fields
- `best_fields` (default): Finds documents where all terms appear in the same field - ideal for relevance ranking
- `cross_fields`: Treats multiple fields as one big field - ideal for name searches across first_name/last_name
- Compatible with both standard mode and Lucene boolean mode
- `fields` and `default_field` are mutually exclusive
- Supports functions (EXACT, ANY, ALL) across fields
- Supports wildcard queries across fields

**Behavior examples:**

| Query | Fields | Type | Expanded DSL |
|-------|--------|------|--------------|
| `hello` | `["title","content"]` | best_fields | `(title:hello) OR (content:hello)` |
| `hello world` (AND) | `["title","content"]` | best_fields | `(title:hello AND title:world) OR (content:hello AND content:world)` |
| `hello world` (AND) | `["title","content"]` | cross_fields | `(title:hello OR content:hello) AND (title:world OR content:world)` |
| `EXACT(foo bar)` | `["title","content"]` | any | `(title:EXACT(foo bar) OR content:EXACT(foo bar))` |
| `hello AND category:tech` | `["title","content"]` | any | `(title:hello OR content:hello) AND category:tech` |

**Use case examples:**
- **Product search**: Use `best_fields` when searching product name and description - prefer products where query terms appear together
- **Person name search**: Use `cross_fields` when searching first_name and last_name - "John Smith" should match documents with `first_name:John` and `last_name:Smith`

### Release note

- Add multi-field search support for SEARCH function (`fields` parameter)
- Add `type` parameter with `best_fields` (default) and `cross_fields` modes
  - `best_fields`: All terms must match within the same field (default, matches Elasticsearch behavior)
  - `cross_fields`: Terms can match across different fields
- Compatible with Lucene mode for MUST/SHOULD/MUST_NOT semantics
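The difference between `best_fields` and `cross_fields` comes down to which loop is the outer one when expanding terms over fields. A minimal Python sketch of that expansion follows; `expand_multi_field` is a hypothetical helper written for illustration (it is not part of Doris), and it only handles plain terms, not functions such as EXACT or explicit `field:term` clauses.

```python
from typing import List


def expand_multi_field(terms: List[str], fields: List[str],
                       match_type: str = "best_fields",
                       operator: str = "AND") -> str:
    """Expand plain terms over several fields into a single DSL string.

    best_fields:  one group per field, terms joined by `operator`, groups ORed
    cross_fields: one group per term, fields ORed, groups joined by `operator`
    """
    joiner = f" {operator} "
    if match_type == "best_fields":
        # Outer loop over fields: all terms are evaluated within one field.
        groups = [joiner.join(f"{field}:{term}" for term in terms)
                  for field in fields]
        return " OR ".join(f"({group})" for group in groups)
    if match_type == "cross_fields":
        # Outer loop over terms: each term may match in any of the fields.
        groups = [" OR ".join(f"{field}:{term}" for field in fields)
                  for term in terms]
        return joiner.join(f"({group})" for group in groups)
    raise ValueError(f"unsupported type: {match_type}")


print(expand_multi_field(["hello", "world"], ["title", "content"]))
print(expand_multi_field(["hello", "world"], ["title", "content"], "cross_fields"))
```

With the "John Smith" case from the use case list, `cross_fields` over `first_name` and `last_name` expands to `(first_name:John OR last_name:John) AND (first_name:Smith OR last_name:Smith)`, which a document with `first_name:John` and `last_name:Smith` satisfies, whereas the `best_fields` expansion would require both terms in one field.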
… search function (#3276)

Add documentation for two new features in the SEARCH function:

1. Lucene Boolean Mode:
   - JSON-based options parameter (mode, minimum_should_match)
   - Left-to-right modifier parsing (MUST/SHOULD/MUST_NOT)
   - Behavior comparison table with standard mode
2. Escape Characters:
   - Support for escaping special characters in DSL
   - Backslash escapes for space, parentheses, colon, backslash

Updated both English and Chinese versions of search-function.md.

Related PR: apache/doris#59394

## Versions

- [ ] dev
- [ ] 4.x
- [ ] 3.x
- [ ] 2.1

## Languages

- [ ] Chinese
- [ ] English

## Docs Checklist

- [ ] Checked by AI
- [ ] Test Cases Built

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
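The backslash escapes for space, parentheses, colon, and backslash can be illustrated with a small Python sketch of how a single `field:value` clause might be split and unescaped. `split_field_and_value` is a hypothetical helper, not the Doris DSL parser, and it assumes the clause has already been isolated from the rest of the query.

```python
from typing import Optional, Tuple


def split_field_and_value(clause: str) -> Tuple[Optional[str], str]:
    """Split one `field:value` clause on the first unescaped colon.

    A backslash makes the following character literal, so `\\:` does not act
    as the field separator and `\\ ` keeps a space inside the value.
    Returns (None, value) when the clause has no unescaped colon.
    """
    field: Optional[str] = None
    buffer = []
    i = 0
    while i < len(clause):
        ch = clause[i]
        if ch == "\\" and i + 1 < len(clause):
            # Escaped character: keep it literally and skip the backslash.
            buffer.append(clause[i + 1])
            i += 2
            continue
        if ch == ":" and field is None:
            # First unescaped colon ends the field name.
            field = "".join(buffer)
            buffer = []
        else:
            buffer.append(ch)
        i += 1
    return field, "".join(buffer)


print(split_field_and_value(r"title:key\:value"))
```

In this sketch, `title:key\:value` splits into the field `title` and the value `key:value`, because the escaped colon is consumed as a literal before the separator check runs.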
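### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #58545

Problem Summary:

This PR introduces two new features for the SEARCH function:

#### 1. Lucene Boolean Mode

Adds a `mode` option to enable Lucene/Elasticsearch-style query parsing.

**Key differences from standard mode:**
- AND/OR/NOT work as left-to-right modifiers (not traditional boolean algebra)
- Uses MUST/SHOULD/MUST_NOT internally (like Lucene's Occur enum)
- Pure NOT queries return empty results (need positive clause)

**Behavior comparison:**

| Query | Standard Mode | Lucene Mode |
|-------|--------------|-------------|
| `a AND b` | a ∩ b | +a +b (both MUST) |
| `a OR b` | a ∪ b | a b (both SHOULD, min=1) |
| `NOT a` | ¬a | Empty (no positive clause) |
| `a AND NOT b` | a ∩ ¬b | +a -b (MUST a, MUST_NOT b) |
| `a AND b OR c` | (a ∩ b) ∪ c | +a b c (only a is MUST) |

#### 2. Escape Characters in DSL

Support for escaping special characters using backslash:

| Escape | Description | Example |
|--------|-------------|---------|
| `\ ` | Literal space | `title:First\ Value` matches "First Value" |
| `\(` `\)` | Literal parentheses | `title:hello\(world\)` matches "hello(world)" |
| `\:` | Literal colon | `title:key\:value` matches "key:value" |
| `\\` | Literal backslash | `title:path\\to\\file` matches "path\to\file" |

### Release note

… `mode: "lucene"`, `minimum_should_match`)

### Check List (For Author)

- Test
- Behavior changed:
- Does this need documentation?

### Check List (For Reviewer who merge this PR)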