feat: opensearch multimodal: support filters, adjust defaults#12319
Conversation
Update OpenSearch multimodal vector store component to parse and apply filter_expression JSON to search queries, wrapping existing queries in a bool with filter clauses and applying limit (size) and min_score from the filter object when present. Also validate filter_expression JSON and raise a clear ValueError on parse errors. Adjust component inputs and defaults: remove "JSON" from input_types, change default auth_mode to "jwt", and set bearer_prefix default to false. Uses existing _coerce_filter_clauses helper to build filter clauses.
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Reference PR: langflow-ai/openrag#1247 |
There was a problem hiding this comment.
Pull request overview
This PR updates the OpenSearch multimodal vector store component to tighten input validation, change authentication defaults, and enhance raw_search() so it can apply the same filter_expression semantics as search().
Changes:
- Restricts
docs_metadatainput types to["Data"]only. - Switches auth defaults to JWT and disables the
Bearerprefix by default. - Extends
raw_search()to parse/applyfilter_expression, including propagatinglimitand score threshold into the query body.
Reviewed changes
Copilot reviewed 1 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py |
Adjusts input/auth defaults and adds filter-expression support to raw_search(). |
src/lfx/src/lfx/_assets/component_index.json |
Updates the component registry snapshot (defaults/code hash) to reflect the new component behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Apply filter_expression if configured (same parsing as search()) | ||
| filter_obj = None | ||
| if getattr(self, "filter_expression", "") and self.filter_expression.strip(): | ||
| try: | ||
| filter_obj = json.loads(self.filter_expression) | ||
| except json.JSONDecodeError as e: | ||
| msg = f"Invalid filter_expression JSON: {e}" | ||
| raise ValueError(msg) from e | ||
|
|
||
| filter_clauses = self._coerce_filter_clauses(filter_obj) | ||
|
|
There was a problem hiding this comment.
filter_expression is parsed with json.loads(...), but the result isn’t validated to be an object/dict before using .get(...). Valid JSON like [], "foo", or 123 will raise an AttributeError here (and may also break inside _coerce_filter_clauses). Recommendation: after json.loads, enforce isinstance(filter_obj, dict) (otherwise raise a ValueError explaining the expected shape), and consider coercing limit to an int (or rejecting non-integers) before assigning it to query_body["size"].
| if filter_obj: | ||
| # Apply limit if not already set in the raw query | ||
| if "size" not in query_body: | ||
| limit = filter_obj.get("limit") | ||
| if limit is not None: | ||
| query_body["size"] = limit | ||
|
|
||
| # Apply score_threshold / scoreThreshold as min_score if not already set | ||
| if "min_score" not in query_body: | ||
| score_threshold = filter_obj.get("score_threshold") or filter_obj.get("scoreThreshold") | ||
| if isinstance(score_threshold, (int, float)) and score_threshold > 0: | ||
| query_body["min_score"] = score_threshold |
There was a problem hiding this comment.
filter_expression is parsed with json.loads(...), but the result isn’t validated to be an object/dict before using .get(...). Valid JSON like [], "foo", or 123 will raise an AttributeError here (and may also break inside _coerce_filter_clauses). Recommendation: after json.loads, enforce isinstance(filter_obj, dict) (otherwise raise a ValueError explaining the expected shape), and consider coercing limit to an int (or rejecting non-integers) before assigning it to query_body["size"].
| if filter_clauses: | ||
| if "query" in query_body: | ||
| original_query = query_body["query"] | ||
| query_body["query"] = { | ||
| "bool": { | ||
| "must": [original_query], | ||
| "filter": filter_clauses, | ||
| } | ||
| } | ||
| else: | ||
| query_body["query"] = { | ||
| "bool": { | ||
| "must": [{"match_all": {}}], | ||
| "filter": filter_clauses, | ||
| } | ||
| } |
There was a problem hiding this comment.
This mutates query_body in-place. When raw_query is a dict (or when self.search_query is a dict), query_body likely aliases the caller-owned object; applying filters then permanently changes that dict for subsequent calls and can cause surprising side effects. Recommendation: copy the raw dict before modification (e.g., deep copy if nested structures are expected) so raw_search() operates on an isolated query body.
| # Apply filter_expression if configured (same parsing as search()) | ||
| filter_obj = None | ||
| if getattr(self, "filter_expression", "") and self.filter_expression.strip(): | ||
| try: | ||
| filter_obj = json.loads(self.filter_expression) | ||
| except json.JSONDecodeError as e: | ||
| msg = f"Invalid filter_expression JSON: {e}" | ||
| raise ValueError(msg) from e | ||
|
|
||
| filter_clauses = self._coerce_filter_clauses(filter_obj) |
There was a problem hiding this comment.
filter_expression parsing/validation logic is now duplicated between search() and raw_search(). To reduce drift (e.g., one path adding support for new keys like scoreThreshold while the other doesn’t), consider extracting a shared helper (e.g., _parse_filter_expression() returning (filter_obj, filter_clauses)), and reuse it in both methods.
Codecov Report✅ All modified and coverable lines are covered by tests. ❌ Your project status has failed because the head coverage (45.54%) is below the target coverage (60.00%). You can increase the head coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## release-1.9.0 #12319 +/- ##
=================================================
+ Coverage 48.86% 48.88% +0.01%
=================================================
Files 1897 1897
Lines 167656 167656
Branches 23193 23188 -5
=================================================
+ Hits 81928 81957 +29
+ Misses 84817 84789 -28
+ Partials 911 910 -1
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Cristhianzl
left a comment
There was a problem hiding this comment.
Summary
This PR makes three types of changes to OpenSearchVectorStoreComponentMultimodalMultiEmbedding:
- Default changes:
auth_modedefault"basic"→"jwt",bearer_prefixdefaultTrue→False,input_typesrestricted from["Data", "JSON"]to["Data"]. - New feature:
raw_search()now parses and appliesfilter_expression(JSON filters,limit,score_threshold/scoreThreshold) — reusing_coerce_filter_clauses()and mirroringsearch()logic.
Files Changed
| File | Change | Lines |
|---|---|---|
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py |
Filter support in raw_search, default changes |
+44 / -3 |
src/lfx/src/lfx/_assets/component_index.json |
Reflects new defaults and hash (ignored) | +6 / -7 |
Review by Category
🔴 CRITICAL: Security & PII
| Item | ✓ | Notes |
|---|---|---|
| No PII in logs | ✅ | No new logging added; existing logger.info(f"query: {query_body}") at line ~462 logs the full query body including filter clauses — pre-existing, not introduced by this PR |
| No secrets/credentials in code | ✅ | No secrets introduced |
| No hardcoded API keys | ✅ | N/A |
| No internal details in error messages to end users | ✅ | ValueError for invalid JSON includes the parse error — acceptable for developer-facing component |
JSON injection via filter_expression |
✅ | json.loads() parses the string, then _coerce_filter_clauses() validates structure (only accepts known term/terms patterns). No raw string interpolation into queries |
Result: ✅ PASS
🔴 CRITICAL: DRY Principle
| Item | ✓ | Notes |
|---|---|---|
| No duplicate type definitions | ✅ | No new types added |
| No duplicate logic | The filter-parsing block (lines ~417-425) is a copy-paste of the identical block in search() (lines ~1533-1539): same getattr guard, same json.loads, same ValueError message. This is introduced duplication |
|
| No duplicate constants | ✅ | N/A |
| Shared module reuse | ✅ | _coerce_filter_clauses() is correctly reused — good |
Observation: The filter-parsing + _coerce_filter_clauses + bool-wrapping + limit/score_threshold extraction pattern appears in both search() and raw_search(). The parsing and _coerce_filter_clauses call are short (~10 lines) but the full pattern with limit/score_threshold is ~25 lines duplicated. This could be extracted to a helper like _parse_and_apply_filters(query_body).
Result:
🔴 CRITICAL: File Structure Limits
| Item | ✓ | Notes |
|---|---|---|
opensearch_multimodal.py line count |
❌ | 2021 lines — far exceeds the 500-line limit. This is pre-existing but the PR adds 41 more lines to an already oversized file |
| Functions per file | 33 methods/functions — very high, pre-existing | |
| Main classes per file | ✅ | 1 class + 2 module-level helpers |
| Mixed responsibilities | Pre-existing: auth, embedding, search, indexing, filtering all in one class |
Result:
🟠 IMPORTANT: Architecture & Structure
| Item | ✓ | Notes |
|---|---|---|
| Single Responsibility | ✅ | Changes are scoped to raw_search filter support and default adjustments |
| Layer Separation | ✅ | No layer violations |
| No business logic in handlers | ✅ | N/A |
Observation: The raw_search filter integration mirrors search() correctly — same parse → coerce → wrap-in-bool flow. The approach is consistent.
Result: ✅ PASS
🟠 IMPORTANT: Code Quality
| Item | ✓ | Notes |
|---|---|---|
| Strong typing | ✅ | Types maintained |
No any/object types |
✅ | N/A |
| Immutability | ✅ | query_body is mutated in-place but is locally created — acceptable |
| Clean naming | ✅ | filter_obj, filter_clauses, score_threshold — consistent with search() |
Observation (default change risk): Changing auth_mode default from "basic" to "jwt" and bearer_prefix from True to False is a breaking change for users who rely on the current defaults. Existing configurations saved in flows/JSON will keep their stored values, but new component instances will default to JWT. This should be called out in release notes.
Result: ✅ PASS (code quality is fine; default change is a product decision)
🟠 IMPORTANT: Error Handling
| Item | ✓ | Notes |
|---|---|---|
| No silent failures | ✅ | Invalid JSON raises ValueError with context |
| Explicit error handling | ✅ | json.JSONDecodeError caught and re-raised as ValueError with message |
| Meaningful error context | ✅ | f"Invalid filter_expression JSON: {e}" — same pattern as search() |
Concern: If filter_expression attribute doesn't exist, getattr(self, "filter_expression", "") returns "" safely. However, if it exists but is None, .strip() will raise AttributeError. The getattr guard with "" default plus the truthy check should prevent this — the and self.filter_expression.strip() short-circuits. This is safe.
Result: ✅ PASS
🟡 RECOMMENDED: Observability
| Item | ✓ | Notes |
|---|---|---|
| Logging at key points | ✅ | The existing logger.info(f"query: {query_body}") at line ~462 will now include filter clauses — sufficient |
| No sensitive data in logs | Pre-existing — logger.info(f"query: {query_body}") logs the full query body. If filter clauses contain user-identifying data (e.g., owner field values), these go to logs. Not introduced by this PR |
Result: ✅ PASS
🟡 RECOMMENDED: Comments
| Item | ✓ | Notes |
|---|---|---|
| No comments explaining WHAT | ✅ | — |
| Only WHY comments | ✅ | # Apply filter_expression if configured (same parsing as search()) — explains intent and cross-references |
| No TODO without ticket | ✅ | No TODOs added |
Result: ✅ PASS
🟢 TESTING
| Item | ✓ | Notes |
|---|---|---|
| Unit tests for core logic | ❌ | No tests provided |
| Happy path covered | ❌ | Missing: raw_search with valid filter_expression |
| Adversarial/error cases | ❌ | Missing: raw_search with invalid JSON filter_expression |
| Edge cases | ❌ | Missing: raw_search with empty filter, filter without "query" key in body |
| Default value changes tested | ❌ | Missing: verify auth_mode defaults to "jwt", bearer_prefix defaults to False |
Missing Test Scenarios (required)
| Scenario | Severity |
|---|---|
raw_search with valid filter_expression JSON — verify clauses are injected into query body |
🔴 CRITICAL |
raw_search with invalid JSON in filter_expression — verify ValueError raised |
🔴 CRITICAL |
raw_search with filter_expression containing limit — verify size is set |
🟡 RECOMMENDED |
raw_search with filter_expression containing score_threshold — verify min_score is set |
🟡 RECOMMENDED |
raw_search with query body that already has "query" key vs one that doesn't — verify both bool-wrap paths |
🟡 RECOMMENDED |
raw_search with filter_expression but size already in query_body — verify no override |
🟡 RECOMMENDED |
Default values for auth_mode and bearer_prefix on new component instance |
🟡 RECOMMENDED |
Result: ❌ FAIL — no tests provided for new functionality
🟢 LEGACY CODE AWARENESS
| Item | ✓ | Notes |
|---|---|---|
| Not prolonging bad patterns | Adds 41 lines to a 2021-line file. The duplication with search() could be avoided with a shared helper |
|
| No copy-paste from legacy | The filter-parsing block IS copy-pasted from search() |
Result:
Overall Assessment
Score: ⚠️ REQUEST CHANGES
| Category | ✓ |
|---|---|
| 🔴 Security & PII | ✅ |
| 🔴 DRY | |
| 🔴 File Structure | |
| 🟠 Architecture | ✅ |
| 🟠 Code Quality | ✅ |
| 🟠 Error Handling | ✅ |
| 🟡 Observability | ✅ |
| 🟡 Comments | ✅ |
| 🟢 Testing | ❌ |
| 🟢 Legacy Awareness |
Blocking Issues
- No tests — The PR adds ~41 lines of new logic (filter parsing, bool query wrapping, limit/score_threshold extraction) with zero test coverage. At minimum, tests for the happy path and invalid JSON error path are required.
Recommendations (non-blocking)
- Extract shared filter logic — The filter-parsing + apply pattern (~25 lines) is duplicated between
search()andraw_search(). Extract to a private method like_apply_filter_expression(query_body: dict) -> dictto keep DRY. - Default change documentation — The
auth_mode(basic→jwt) andbearer_prefix(True→False) changes are breaking for new instances. Document in release notes. input_typesrestriction — Removing"JSON"from accepted input types for the table input could break existing flows that feed JSON data. Verify backwards compatibility.- Pre-existing: File is 2021 lines with 33 methods — consider splitting in a future refactor.
# Conflicts: # src/lfx/src/lfx/_assets/component_index.json
* opensearch multimodal: support filters, adjust defaults Update OpenSearch multimodal vector store component to parse and apply filter_expression JSON to search queries, wrapping existing queries in a bool with filter clauses and applying limit (size) and min_score from the filter object when present. Also validate filter_expression JSON and raise a clear ValueError on parse errors. Adjust component inputs and defaults: remove "JSON" from input_types, change default auth_mode to "jwt", and set bearer_prefix default to false. Uses existing _coerce_filter_clauses helper to build filter clauses. * Update component_index.json * [autofix.ci] apply automated fixes * [autofix.ci] apply automated fixes * resolve review comments * [autofix.ci] apply automated fixes * fix ruff errors * [autofix.ci] apply automated fixes * fix ruff errors * [autofix.ci] apply automated fixes --------- Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com> Co-authored-by: himavarshagoutham <himavarshajan17@gmail.com> Co-authored-by: Himavarsha <40851462+HimavarshaVS@users.noreply.github.com>
* opensearch multimodal: support filters, adjust defaults Update OpenSearch multimodal vector store component to parse and apply filter_expression JSON to search queries, wrapping existing queries in a bool with filter clauses and applying limit (size) and min_score from the filter object when present. Also validate filter_expression JSON and raise a clear ValueError on parse errors. Adjust component inputs and defaults: remove "JSON" from input_types, change default auth_mode to "jwt", and set bearer_prefix default to false. Uses existing _coerce_filter_clauses helper to build filter clauses. * Update component_index.json * [autofix.ci] apply automated fixes * [autofix.ci] apply automated fixes * resolve review comments * [autofix.ci] apply automated fixes * fix ruff errors * [autofix.ci] apply automated fixes * fix ruff errors * [autofix.ci] apply automated fixes --------- Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com> Co-authored-by: himavarshagoutham <himavarshajan17@gmail.com> Co-authored-by: Himavarsha <40851462+HimavarshaVS@users.noreply.github.com>
This pull request updates the
OpenSearchVectorStoreComponentMultimodalMultiEmbeddingclass inopensearch_multimodal.pyto improve how authentication and filtering are handled. The main changes include stricter input types, defaulting to JWT authentication, altering the default bearer prefix setting, and enhancing theraw_searchmethod to support filter expressions in a more robust way.Authentication and Input Handling:
"basic"to"jwt"in theauth_modedropdown input, making JWT the default for new configurations.bearer_prefixboolean input fromTruetoFalse, so the "Bearer " prefix is not included by default in authentication headers.input_typesfor a specific input from["Data", "JSON"]to just["Data"], enforcing stricter input validation.Filtering and Query Improvements:
Enhanced the
raw_searchmethod to supportfilter_expressionparsing and application, including:search()method.limitandscore_threshold/scoreThresholdfrom the filter expression to the query if not already set.Testing