Skip to content

feat(wren-ai-service): fix by ai endpoint (ai-env-changed)#1628

Merged
cyyeh merged 9 commits into
mainfrom
feat/ai-service/fix-by-ai-api
May 6, 2025
Merged

feat(wren-ai-service): fix by ai endpoint (ai-env-changed)#1628
cyyeh merged 9 commits into
mainfrom
feat/ai-service/fix-by-ai-api

Conversation

@cyyeh
Copy link
Copy Markdown
Member

@cyyeh cyyeh commented May 6, 2025

Summary by CodeRabbit

  • New Features

    • Introduced a new pipeline for extracting SQL table names from SQL queries, enhancing SQL correction and related workflows.
  • Improvements

    • Updated retrieval pipelines to use a more specific database schema retrieval method for improved accuracy.
    • SQL correction now incorporates relevant schema context, providing more reliable suggestions and corrections.
  • Bug Fixes

    • Improved error handling in SQL correction to provide clearer feedback when corrections cannot be made.
  • Documentation

    • Added configuration examples for the new SQL tables extraction pipeline across multiple sample config files.
  • Tests

    • Updated tests and mocks to align with new retrieval and extraction pipeline implementations.

@cyyeh cyyeh added module/ai-service ai-service related ci/ai-service ai-service related labels May 6, 2025
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 6, 2025

Walkthrough

A new pipeline, sql_tables_extraction, was introduced across multiple configuration files and integrated into the codebase. The retrieval logic was refactored to use DbSchemaRetrieval instead of Retrieval, and a new pipeline module for extracting SQL table names using an LLM was added. The SQL correction flow now leverages this extraction pipeline for improved schema context.

Changes

Files/Paths Change Summary
deployment/kustomizations/base/cm.yaml
docker/config.example.yaml
wren-ai-service/docs/config_examples/config..yaml
wren-ai-service/tools/config/config.
.yaml
Added sql_tables_extraction pipeline using litellm_llm.default LLM to pipeline lists in various config files.
wren-ai-service/src/pipelines/generation/sql_tables_extraction.py Introduced a new pipeline module for extracting SQL table names via LLM, including prompt, schema, and pipeline.
wren-ai-service/src/pipelines/generation/init.py Exported SQLTablesExtraction in the module’s public API.
wren-ai-service/src/pipelines/retrieval/db_schema_retrieval.py Renamed Retrieval class to DbSchemaRetrieval; improved filter logic in retrieval function.
wren-ai-service/src/pipelines/retrieval/init.py Replaced Retrieval export with DbSchemaRetrieval.
wren-ai-service/eval/pipelines.py
wren-ai-service/src/globals.py
wren-ai-service/src/web/v1/services/ask.py
wren-ai-service/src/web/v1/services/question_recommendation.py
wren-ai-service/tests/pytest/pipelines/generation/test_ask.py
wren-ai-service/tests/pytest/services/mocks.py
wren-ai-service/tests/pytest/services/test_ask.py
Replaced usage of Retrieval with DbSchemaRetrieval in pipelines, services, and tests.
wren-ai-service/src/web/v1/services/sql_corrections.py Refactored SQL correction to use sql_tables_extraction and db_schema_retrieval pipelines for schema context.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant SQLCorrectionService
    participant SQLTablesExtractionPipeline
    participant DbSchemaRetrievalPipeline
    participant SQLCorrectionPipeline

    Client->>SQLCorrectionService: Submit SQL correction request
    SQLCorrectionService->>SQLTablesExtractionPipeline: Extract table names from SQL
    SQLTablesExtractionPipeline-->>SQLCorrectionService: List of table names
    SQLCorrectionService->>DbSchemaRetrievalPipeline: Retrieve schema for extracted tables
    DbSchemaRetrievalPipeline-->>SQLCorrectionService: Table DDLs
    SQLCorrectionService->>SQLCorrectionPipeline: Run correction with schema context
    SQLCorrectionPipeline-->>SQLCorrectionService: Corrected SQL or error
    SQLCorrectionService-->>Client: Return result
Loading

Possibly related PRs

  • Canner/WrenAI#1527: Adds the config.lm_studio.yaml configuration file, which is extended in this PR by adding the new sql_tables_extraction pipeline.
  • Canner/WrenAI#1420: Introduces the SQL Correction Service and its API endpoints, which are now extended to use the sql_tables_extraction pipeline for improved schema context.

Suggested reviewers

  • paopa

Poem

In the garden of SQL, a new path is grown,
With table names plucked by LLMs alone.
Retrievals are sharper, corrections more bright,
Pipelines now dance in the soft data light.
With each schema extracted, the queries take flight—
A bunny’s delight in the code’s gentle night.
🐇✨

✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@github-actions github-actions Bot changed the title feat(wren-ai-service): fix by ai endpoint feat(wren-ai-service): fix by ai endpoint (ai-env-changed) May 6, 2025
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (4)
wren-ai-service/docs/config_examples/config.groq.yaml (1)

141-142: Consistent Pipeline Addition

sql_tables_extraction added correctly to the pipeline list with the appropriate llm specification.

Consider updating the top-of-file instructions (lines 1–6) to mention the new pipeline if you maintain a curated list of supported pipes in your documentation.

wren-ai-service/src/globals.py (2)

203-207: Avoid duplicating large retrieval sizes across services

DbSchemaRetrieval is instantiated here with the exact same table_retrieval_size
and table_column_retrieval_size used in AskService.
Consider extracting these two integers into a small dataclass or helper so any future
tuning effort touches one place only. This reduces the probability of
configuration drift between services.


257-264: Pipeline order guarantees context but increases latency

sql_tables_extraction ➔ db_schema_retrieval ➔ sql_correction produces a much
richer context for the correction LLM – great!
Keep in mind that this adds two extra LLM calls on the hot path. If the median
latency becomes noticeable, caching the extracted tables (keyed by the raw SQL
fingerprint) or pre-embedding table names for faster fuzzy retrieval can shave
~1-2 seconds off the user-perceived response.

wren-ai-service/src/pipelines/generation/sql_tables_extraction.py (1)

95-109: Minor: tighten type-hints for public API

llm_provider: LLMProvider, **kwargs is fine, yet the class never consumes
kwargs. Deleting the splat improves the signal/noise ratio and stops IDEs
from flagging the parameter as unused.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ca5567 and 1be8d90.

📒 Files selected for processing (25)
  • deployment/kustomizations/base/cm.yaml (1 hunks)
  • docker/config.example.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.anthropic.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.azure.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.deepseek.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.google_ai_studio.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.grok.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.groq.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.lm_studio.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.ollama.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.open_router.yaml (1 hunks)
  • wren-ai-service/eval/pipelines.py (2 hunks)
  • wren-ai-service/src/globals.py (3 hunks)
  • wren-ai-service/src/pipelines/generation/__init__.py (2 hunks)
  • wren-ai-service/src/pipelines/generation/sql_tables_extraction.py (1 hunks)
  • wren-ai-service/src/pipelines/retrieval/__init__.py (1 hunks)
  • wren-ai-service/src/pipelines/retrieval/db_schema_retrieval.py (3 hunks)
  • wren-ai-service/src/web/v1/services/ask.py (1 hunks)
  • wren-ai-service/src/web/v1/services/question_recommendation.py (1 hunks)
  • wren-ai-service/src/web/v1/services/sql_corrections.py (1 hunks)
  • wren-ai-service/tests/pytest/pipelines/generation/test_ask.py (2 hunks)
  • wren-ai-service/tests/pytest/services/mocks.py (1 hunks)
  • wren-ai-service/tests/pytest/services/test_ask.py (1 hunks)
  • wren-ai-service/tools/config/config.example.yaml (1 hunks)
  • wren-ai-service/tools/config/config.full.yaml (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
wren-ai-service/src/pipelines/generation/__init__.py (1)
wren-ai-service/src/pipelines/generation/sql_tables_extraction.py (1)
  • SQLTablesExtraction (95-127)
wren-ai-service/src/pipelines/retrieval/db_schema_retrieval.py (1)
wren-ai-service/src/core/pipeline.py (1)
  • BasicPipeline (15-21)
wren-ai-service/src/globals.py (2)
wren-ai-service/src/pipelines/retrieval/db_schema_retrieval.py (1)
  • DbSchemaRetrieval (430-495)
wren-ai-service/src/pipelines/generation/sql_tables_extraction.py (1)
  • SQLTablesExtraction (95-127)
wren-ai-service/eval/pipelines.py (1)
wren-ai-service/src/pipelines/retrieval/db_schema_retrieval.py (1)
  • DbSchemaRetrieval (430-495)
🔇 Additional comments (28)
wren-ai-service/tests/pytest/services/mocks.py (1)

8-8: Good update to align with the DbSchemaRetrieval refactoring

The mock class has been updated to inherit from retrieval.DbSchemaRetrieval instead of retrieval.Retrieval, which matches the refactoring being done throughout the codebase.

wren-ai-service/src/pipelines/generation/__init__.py (2)

17-17: LGTM: New SQLTablesExtraction import added correctly

Adding the import for the new SQL table extraction pipeline module.


38-38: LGTM: SQLTablesExtraction exposed in all

The new pipeline class is correctly exposed in the module's public API.

wren-ai-service/src/pipelines/retrieval/__init__.py (2)

1-1: LGTM: Import updated for DbSchemaRetrieval

Replacing the Retrieval import with DbSchemaRetrieval as part of the schema-focused retrieval refactoring.


12-12: LGTM: all updated to export DbSchemaRetrieval

The public API is correctly updated to export DbSchemaRetrieval instead of Retrieval.

wren-ai-service/tests/pytest/services/test_ask.py (1)

43-45: LGTM: Test fixture updated to use DbSchemaRetrieval

The ask_service fixture has been correctly updated to instantiate DbSchemaRetrieval instead of Retrieval. The configuration key has also been updated from "retrieval" to "db_schema_retrieval", which aligns with the changes to the service configuration.

docker/config.example.yaml (1)

140-141: SQL Tables Extraction Pipeline: Approved
The addition of sql_tables_extraction configured with litellm_llm.default is correct and mirrors the entries in other example configs.

wren-ai-service/tools/config/config.example.yaml (1)

151-152: SQL Tables Extraction Pipeline: Approved
The new sql_tables_extraction pipeline using litellm_llm.default is added correctly and consistently with the other tool configuration examples.

wren-ai-service/docs/config_examples/config.google_ai_studio.yaml (1)

137-138: SQL Tables Extraction Pipeline: Approved
This pipeline entry is correctly inserted and matches the specification in other example configs.

wren-ai-service/docs/config_examples/config.lm_studio.yaml (1)

140-141: SQL Tables Extraction Pipeline: Approved
The sql_tables_extraction pipeline with litellm_llm.default is properly added and aligns with the broader updates across config examples.

wren-ai-service/docs/config_examples/config.anthropic.yaml (1)

137-138: SQL Tables Extraction Pipeline: Approved
The new pipeline entry is correctly configured and consistent with the other example configurations.

wren-ai-service/docs/config_examples/config.deepseek.yaml (1)

160-161: Added new sql_tables_extraction pipeline looks good.

The addition of this pipeline with the default LLM model is consistent with other pipeline configurations in this file and aligns with the architectural changes to improve SQL correction services.

wren-ai-service/docs/config_examples/config.open_router.yaml (1)

138-139: Pipeline configuration is consistent across configuration files.

The sql_tables_extraction pipeline configuration matches the implementation in other config files, maintaining consistency across different LLM providers.

wren-ai-service/src/web/v1/services/ask.py (1)

411-411: Updated pipeline key to the new naming convention.

Pipeline key has been updated from "retrieval" to "db_schema_retrieval", consistent with the project-wide refactoring from generic Retrieval class to the more specific DbSchemaRetrieval class.

wren-ai-service/src/web/v1/services/question_recommendation.py (1)

68-68: Pipeline key update aligns with system-wide naming changes.

Updated pipeline key from "retrieval" to "db_schema_retrieval" to match the refactored schema retrieval implementation. This change keeps the service consistent with other services in the application.

wren-ai-service/docs/config_examples/config.azure.yaml (1)

150-152:

Details

❓ Verification inconclusive

New Pipeline Addition Approved

The sql_tables_extraction pipeline has been correctly added and follows the established pattern of other pipeline definitions. Indentation and field naming align with existing entries.

Verify Coverage Across All Example Configs

Ensure that this pipeline addition is consistently applied to all other example configuration files under wren-ai-service/docs/config_examples/. You can run:

This should return no files.


🏁 Script executed:

#!/bin/bash
# Verify that 'sql_tables_extraction' appears in all example configs
fd 'config\..*\.yaml' wren-ai-service/docs/config_examples --exec grep -L 'sql_tables_extraction' {}

Length of output: 102


Let’s re-run the check using a simpler find+grep approach to ensure we catch every YAML:

#!/bin/bash
# List any example config YAML missing the new pipeline
find wren-ai-service/docs/config_examples -name '*.yaml' -exec grep -L 'sql_tables_extraction' {} +

Let’s try a simpler find+grep to surface any YAML missing the new pipeline:

#!/bin/bash
# Check all example configs for missing 'sql_tables_extraction'
find wren-ai-service/docs/config_examples -type f -name '*.yaml' -exec grep -L 'sql_tables_extraction' {} +

Approve & Verify Pipeline in All Example Configs

The sql_tables_extraction entry looks correct and matches the existing pattern. Please confirm that it appears in every example config under wren-ai-service/docs/config_examples:

  • Run the following to find any YAML missing the new pipeline:
    grep -RL 'sql_tables_extraction' wren-ai-service/docs/config_examples/*.yaml
  • This command should return no files.
wren-ai-service/tools/config/config.full.yaml (1)

150-153: Full Config Pipeline Sync

The sql_tables_extraction pipeline is properly inserted in the full configuration file, mirroring the structure of other pipelines.

wren-ai-service/docs/config_examples/config.ollama.yaml (1)

138-140: Ollama Config Updated

The new pipeline entry for SQL table extraction aligns with the pipeline conventions and correct llm provider usage.

deployment/kustomizations/base/cm.yaml (1)

190-192: Deployment ConfigMap Updated

The sql_tables_extraction pipeline has been added to the wren-ai-service-config ConfigMap and accurately reflects the application’s pipeline schema.

Please verify that environment-specific overlays or example manifests also include this new pipeline entry.

wren-ai-service/docs/config_examples/config.grok.yaml (1)

142-143: New sql_tables_extraction pipeline added.

This change adds a new pipeline configuration for sql_tables_extraction that utilizes the litellm_llm.default LLM. This aligns with the new SQL table extraction functionality mentioned in the AI summary.

wren-ai-service/tests/pytest/pipelines/generation/test_ask.py (2)

9-9: Renamed import from Retrieval to DbSchemaRetrieval.

The import has been updated to use the more specific DbSchemaRetrieval class from the corresponding module, which better reflects its purpose for database schema retrieval.


37-37: Instantiation updated to use DbSchemaRetrieval.

The code now creates an instance of DbSchemaRetrieval instead of Retrieval, matching the class rename. The functionality remains the same but with a more descriptive name.

wren-ai-service/eval/pipelines.py (2)

211-211: Updated retrieval instance to DbSchemaRetrieval.

The RetrievalPipeline class now instantiates retrieval.DbSchemaRetrieval instead of retrieval.Retrieval, maintaining consistent naming across the codebase.


376-376: Updated retrieval instance to DbSchemaRetrieval in AskPipeline.

The AskPipeline class now also uses retrieval.DbSchemaRetrieval instead of retrieval.Retrieval, ensuring consistent usage of the renamed class throughout the codebase.

wren-ai-service/src/pipelines/retrieval/db_schema_retrieval.py (3)

179-196: Optimized query execution with early return for empty conditions.

Added a conditional check to only build and execute filters when table_name_conditions is non-empty. This optimization prevents unnecessary query execution when there are no tables to filter on, improving performance.


430-430: Renamed class from Retrieval to DbSchemaRetrieval.

The class has been renamed from Retrieval to DbSchemaRetrieval to better reflect its specific purpose of retrieving database schema information, improving code readability and maintenance.


502-502: Updated dry run pipeline to use DbSchemaRetrieval.

The dry run test code has been updated to use the renamed DbSchemaRetrieval class, maintaining consistency throughout the codebase.

wren-ai-service/src/globals.py (1)

100-105:

Details

✅ Verification successful

Verify that pipe_components["db_schema_retrieval"] is configured in every deployment profile

DbSchemaRetrieval is now required by both AskService and SqlCorrectionService.
If any environment (e.g. local, staging, SaaS) is still shipping an older pipe_components map that lacks the db_schema_retrieval key, create_service_container() will immediately raise a KeyError on application start-up.


🏁 Script executed:

#!/bin/bash
# Sanity-check all config files tracked in the repo for the new key
fd -e yaml -e yml -a | xargs -I{} rg -q "db_schema_retrieval" {} || \
  echo "❌   Some config variants are still missing db_schema_retrieval"

Length of output: 203


I’ll list which YAML config files are missing the db_schema_retrieval key so we can update them.


🏁 Script executed:

#!/bin/bash
# List all YAML files that don’t contain the db_schema_retrieval key
fd -e yaml -e yml -a | xargs -I{} rg -L "db_schema_retrieval" {}

Length of output: 1168


Let’s use ripgrep’s globbing to reliably list YAML files missing the key:

#!/bin/bash
# List all .yaml/.yml config files that don’t contain db_schema_retrieval
rg --glob '*.yaml' --glob '*.yml' -L "db_schema_retrieval"

Let’s try a more reliable find | grep -L approach to list exactly which config files are missing the key:


🏁 Script executed:

#!/bin/bash
# List all .yaml/.yml files that don’t contain "db_schema_retrieval"
find . -type f \( -name "*.yaml" -o -name "*.yml" \) -print0 \
  | xargs -0 grep -L "db_schema_retrieval"

Length of output: 1884


Let’s narrow down to only those configs that actually define pipe_components and see which of them lack the new key:

#!/bin/bash
# Find YAML files that mention pipe_components but are missing db_schema_retrieval
find . -type f \( -name "*.yaml" -o -name "*.yml" \) -exec grep -l "pipe_components" {} \; \
  | xargs grep -L "db_schema_retrieval"

Let’s run a more robust ripgrep pipeline to find only the configs that define pipe_components but lack the new key:


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Find YAML files referencing pipe_components but not db_schema_retrieval
rg --files-with-matches 'pipe_components' -g '*.yaml' -g '*.yml' \
  | xargs -r rg -L 'db_schema_retrieval'

Length of output: 121


All deployment configs include db_schema_retrieval

I searched every YAML/YML file that defines pipe_components and found none missing the db_schema_retrieval entry. All deployment profiles have been updated—no further changes needed.

Comment on lines 105 to 119
if not valid:
error = invalid[0]["error"]
raise Exception(
f"Unable to correct the SQL query. Error: {error}. Please try with a different SQL query or simplify your request."
error_message = invalid[0]["error"]
self._handle_exception(
event_id,
f"An error occurred during SQL correction: {error_message}",
trace_id=trace_id,
)
else:
corrected = valid[0]["sql"]
self._cache[event_id] = self.Event(
event_id=event_id,
status="finished",
trace_id=trace_id,
response=corrected,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

invalid[0] can raise IndexError when the LLM returns nothing

A rare but observed failure mode is the generator replying with an empty list
for both valid_generation_results and invalid_generation_results. Accessing
invalid[0]["error"] will then crash the handler you just improved.

- error_message = invalid[0]["error"]
+ error_message = invalid[0]["error"] if invalid else "No correction candidates returned"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if not valid:
error = invalid[0]["error"]
raise Exception(
f"Unable to correct the SQL query. Error: {error}. Please try with a different SQL query or simplify your request."
error_message = invalid[0]["error"]
self._handle_exception(
event_id,
f"An error occurred during SQL correction: {error_message}",
trace_id=trace_id,
)
else:
corrected = valid[0]["sql"]
self._cache[event_id] = self.Event(
event_id=event_id,
status="finished",
trace_id=trace_id,
response=corrected,
)
if not valid:
error_message = invalid[0]["error"] if invalid else "No correction candidates returned"
self._handle_exception(
event_id,
f"An error occurred during SQL correction: {error_message}",
trace_id=trace_id,
)
else:
corrected = valid[0]["sql"]
self._cache[event_id] = self.Event(
event_id=event_id,
status="finished",
trace_id=trace_id,
response=corrected,
)
🤖 Prompt for AI Agents (early access)
In wren-ai-service/src/web/v1/services/sql_corrections.py around lines 105 to 119, the code accesses invalid[0]["error"] without checking if invalid is non-empty, which can raise an IndexError if invalid is empty. To fix this, add a check to ensure invalid is not empty before accessing invalid[0], and handle the case where both valid and invalid lists are empty by providing a fallback error message or handling logic to avoid the crash.

Comment on lines +73 to +82
"sql": sql,
"error": error,
}

tables = (
await self._pipelines["sql_tables_extraction"].run(
sql=sql,
)
)["post_process"]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Handle empty / malformed responses from sql_tables_extraction

tables = (await …)["post_process"] presumes that:

  1. the pipeline returns a dict,
  2. the key post_process exists, and
  3. the value is a non-empty list.

Any deviation (e.g. the LLM fails JSON-mode, returns [], or the pipeline
throws) will bubble into the general except, lose the root cause and surface
as “OTHERS”. Guard early so we can surface a user-friendly message and still
attempt a correction without table context:

- tables = (await self._pipelines["sql_tables_extraction"].run(sql=sql))["post_process"]
+ extraction_res = await self._pipelines["sql_tables_extraction"].run(sql=sql)
+ tables = extraction_res.get("post_process", [])
+ if not tables:
+     logger.warning("No tables extracted for event %s – continuing without schema context", event_id)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"sql": sql,
"error": error,
}
tables = (
await self._pipelines["sql_tables_extraction"].run(
sql=sql,
)
)["post_process"]
"sql": sql,
"error": error,
}
extraction_res = await self._pipelines["sql_tables_extraction"].run(sql=sql)
tables = extraction_res.get("post_process", [])
if not tables:
logger.warning(
"No tables extracted for event %s – continuing without schema context",
event_id,
)
🤖 Prompt for AI Agents (early access)
In wren-ai-service/src/web/v1/services/sql_corrections.py around lines 73 to 82, the code assumes the pipeline response is a dict containing a non-empty 'post_process' list, which can cause unhandled exceptions if the response is empty, malformed, or missing the key. Modify the code to safely check if the response is a dict, verify the 'post_process' key exists, and confirm its value is a non-empty list before accessing it. If these conditions are not met, handle the case gracefully by providing a user-friendly message and proceeding without table context to avoid masking the root cause in the general exception handler.

Comment on lines +83 to +94
documents = (
(
await self._pipelines["db_schema_retrieval"].run(
project_id=project_id,
tables=tables,
)
)
.get("construct_retrieval_results", {})
.get("retrieval_results", [])
)
table_ddls = [document.get("table_ddl") for document in documents]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

documents may be None – protect downstream list comprehension

DbSchemaRetrieval could legitimately return an empty dict if no matching DDL
is found. Performing [d.get("table_ddl") for d in documents] on None
raises a TypeError.

- documents = (...).get("retrieval_results", [])
- table_ddls = [document.get("table_ddl") for document in documents]
+ documents = (...).get("retrieval_results") or []
+ table_ddls = [doc.get("table_ddl") for doc in documents if doc.get("table_ddl")]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
documents = (
(
await self._pipelines["db_schema_retrieval"].run(
project_id=project_id,
tables=tables,
)
)
.get("construct_retrieval_results", {})
.get("retrieval_results", [])
)
table_ddls = [document.get("table_ddl") for document in documents]
documents = (
(
await self._pipelines["db_schema_retrieval"].run(
project_id=project_id,
tables=tables,
)
)
.get("construct_retrieval_results", {})
.get("retrieval_results")
or []
)
table_ddls = [doc.get("table_ddl") for doc in documents if doc.get("table_ddl")]
🤖 Prompt for AI Agents (early access)
In wren-ai-service/src/web/v1/services/sql_corrections.py around lines 83 to 94, the variable 'documents' may be None if the 'DbSchemaRetrieval' returns an empty dict, causing a TypeError in the list comprehension. To fix this, ensure 'documents' is assigned an empty list if the retrieval results are None before the list comprehension, so that iterating over it is safe.

Comment on lines +71 to +74
async def post_process(
extract_sql_tables: dict,
) -> list[str]:
return orjson.loads(extract_sql_tables.get("replies")[0])["tables"]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add robust JSON parsing & fallback logic

A single malformed LLM response currently raises a hard exception, cascading
into a service-wide failure. Wrap parsing in a try/except and return an empty
list on failure so downstream pipelines can decide how to proceed.

-    return orjson.loads(extract_sql_tables.get("replies")[0])["tables"]
+    try:
+        reply = extract_sql_tables.get("replies", [])[0]
+        tables = orjson.loads(reply).get("tables", [])
+        if not isinstance(tables, list):
+            raise ValueError("`tables` is not a list")
+        return tables
+    except (IndexError, orjson.JSONDecodeError, ValueError) as exc:
+        logger.warning("Unable to parse tables from LLM response: %s", exc)
+        return []
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async def post_process(
extract_sql_tables: dict,
) -> list[str]:
return orjson.loads(extract_sql_tables.get("replies")[0])["tables"]
async def post_process(
extract_sql_tables: dict,
) -> list[str]:
try:
reply = extract_sql_tables.get("replies", [])[0]
tables = orjson.loads(reply).get("tables", [])
if not isinstance(tables, list):
raise ValueError("`tables` is not a list")
return tables
except (IndexError, orjson.JSONDecodeError, ValueError) as exc:
logger.warning("Unable to parse tables from LLM response: %s", exc)
return []
🤖 Prompt for AI Agents (early access)
In wren-ai-service/src/pipelines/generation/sql_tables_extraction.py around lines 71 to 74, the JSON parsing of extract_sql_tables["replies"][0] is not protected against malformed data, causing hard exceptions. Wrap the orjson.loads call in a try/except block and return an empty list if parsing fails, ensuring the function handles errors gracefully without crashing the service.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🔭 Outside diff range comments (1)
wren-ai-service/src/web/v1/services/ask.py (1)

753-756: ⚠️ Potential issue

Inconsistent pipeline reference in ask_feedback method.

While you've updated the pipeline key from "retrieval" to "db_schema_retrieval" in the ask method, there's still a reference to the old "retrieval" pipeline in the ask_feedback method. This inconsistency could cause issues.

Apply this change:

- self._pipelines["retrieval"].run(
+ self._pipelines["db_schema_retrieval"].run(
♻️ Duplicate comments (1)
wren-ai-service/src/globals.py (1)

203-207: Same concern applies to QuestionRecommendation

See the previous comment – the same defensive check is advisable here as well.

🧹 Nitpick comments (5)
wren-ai-service/src/globals.py (1)

257-264: Consider extracting common retrieval wiring to avoid duplication

sql_correction_service, ask_service, and question_recommendation now all repeat the exact DbSchemaRetrieval instantiation block. If the signature changes again (e.g., a new top_k parameter) you must patch three places.

A tiny helper can keep the container DRY:

def _build_db_schema_retrieval(pipe_components, settings):
    return retrieval.DbSchemaRetrieval(
        **pipe_components["db_schema_retrieval"],
        table_retrieval_size=settings.table_retrieval_size,
        table_column_retrieval_size=settings.table_column_retrieval_size,
    )

… and then:

"db_schema_retrieval": _build_db_schema_retrieval(pipe_components, settings),
wren-ai-service/src/web/v1/services/sql_corrections.py (2)

106-120: Initial event status is never cached

Clients polling /sql-corrections/{event_id} before the coroutine finishes will get “not found”.
Persist a “correcting” placeholder right at the beginning:

self._cache[event_id] = self.Event(event_id=event_id, status="correcting")

Place this before the try: block.


123-124: Minor nit: redundant f-string

str(e) is already a string.

-f"An error occurred during SQL correction: {str(e)}"
+f"An error occurred during SQL correction: {e}"
wren-ai-service/src/pipelines/generation/sql_tables_extraction.py (2)

84-92: Minor: schema can be passed directly without the extra name wrapper

OpenAI/Anthropic JSON-mode APIs accept the JSON schema object directly; the additional name field is unnecessary unless the provider explicitly requires it. If not needed, you can simplify:

 "response_format": {
-    "type": "json_schema",
-    "json_schema": {
-        "name": "sql_tables_extraction_result",
-        "schema": SQLTablesExtractionResult.model_json_schema(),
-    },
+    "type": "json_schema",
+    "json_schema": SQLTablesExtractionResult.model_json_schema(),
 },

This keeps the payload minimal and avoids confusion about the top-level structure.


111-113: Pass self._components to BasicPipeline for clarity

BasicPipeline is instantiated with only the Hamilton AsyncDriver.
Because self._components is referenced later in run(), the current code works, but the dependency is implicit. For better readability and to avoid future refactor gotchas, consider handing the component dict to the super-class (or storing it in a dedicated field exposed by BasicPipeline) so the wiring is explicit.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ca5567 and 1be8d90.

📒 Files selected for processing (25)
  • deployment/kustomizations/base/cm.yaml (1 hunks)
  • docker/config.example.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.anthropic.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.azure.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.deepseek.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.google_ai_studio.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.grok.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.groq.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.lm_studio.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.ollama.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.open_router.yaml (1 hunks)
  • wren-ai-service/eval/pipelines.py (2 hunks)
  • wren-ai-service/src/globals.py (3 hunks)
  • wren-ai-service/src/pipelines/generation/__init__.py (2 hunks)
  • wren-ai-service/src/pipelines/generation/sql_tables_extraction.py (1 hunks)
  • wren-ai-service/src/pipelines/retrieval/__init__.py (1 hunks)
  • wren-ai-service/src/pipelines/retrieval/db_schema_retrieval.py (3 hunks)
  • wren-ai-service/src/web/v1/services/ask.py (1 hunks)
  • wren-ai-service/src/web/v1/services/question_recommendation.py (1 hunks)
  • wren-ai-service/src/web/v1/services/sql_corrections.py (1 hunks)
  • wren-ai-service/tests/pytest/pipelines/generation/test_ask.py (2 hunks)
  • wren-ai-service/tests/pytest/services/mocks.py (1 hunks)
  • wren-ai-service/tests/pytest/services/test_ask.py (1 hunks)
  • wren-ai-service/tools/config/config.example.yaml (1 hunks)
  • wren-ai-service/tools/config/config.full.yaml (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
wren-ai-service/src/pipelines/generation/__init__.py (1)
wren-ai-service/src/pipelines/generation/sql_tables_extraction.py (1)
  • SQLTablesExtraction (95-127)
wren-ai-service/src/pipelines/retrieval/db_schema_retrieval.py (1)
wren-ai-service/src/core/pipeline.py (1)
  • BasicPipeline (15-21)
wren-ai-service/src/web/v1/services/ask.py (1)
wren-ai-service/src/pipelines/retrieval/historical_question_retrieval.py (2)
  • run (23-34)
  • run (151-161)
wren-ai-service/eval/pipelines.py (1)
wren-ai-service/src/pipelines/retrieval/db_schema_retrieval.py (1)
  • DbSchemaRetrieval (430-495)
🔇 Additional comments (30)
wren-ai-service/tests/pytest/services/mocks.py (1)

8-8: LGTM: Correct inheritance update from Retrieval to DbSchemaRetrieval

The mock class now correctly inherits from the renamed DbSchemaRetrieval class instead of the previous Retrieval class, maintaining consistency with the changes in the main codebase.

docker/config.example.yaml (1)

140-141: LGTM: Properly configured new SQL tables extraction pipeline

The new sql_tables_extraction pipeline is correctly configured with the default LLM provider. This pipeline will extract table names from SQL queries, which aligns with the broader changes to the SQL correction workflow.

wren-ai-service/src/pipelines/retrieval/__init__.py (2)

1-1: LGTM: Correct import update for the renamed retrieval class

The import statement has been properly updated to use the renamed DbSchemaRetrieval class instead of the previous Retrieval class.


12-12: LGTM: Correctly updated export list with the renamed class

The __all__ list has been updated to include "DbSchemaRetrieval" instead of "Retrieval", ensuring the renamed class is properly exposed through the module's public API.

wren-ai-service/src/pipelines/generation/__init__.py (2)

17-17: LGTM: Properly imported the new SQLTablesExtraction class

The import statement correctly brings in the new SQLTablesExtraction class from the corresponding module.


38-38: LGTM: Correctly exposed the new class through the module's API

The __all__ list has been updated to include "SQLTablesExtraction", ensuring the new class is properly exposed through the module's public API.

wren-ai-service/docs/config_examples/config.open_router.yaml (1)

138-139: The sql_tables_extraction pipeline addition looks good.

This change adds a new pipeline for extracting table names from SQL queries, which will be used in the SQL correction workflow. The configuration correctly specifies using the litellm_llm.default LLM without requiring embedder or document store components.

wren-ai-service/docs/config_examples/config.lm_studio.yaml (1)

140-141: The sql_tables_extraction pipeline addition looks good.

This change adds a new pipeline for extracting table names from SQL queries, which will be used in the SQL correction workflow. The configuration correctly specifies using the litellm_llm.default LLM without requiring embedder or document store components.

wren-ai-service/tests/pytest/services/test_ask.py (1)

43-45:

Details

✅ Verification successful

Updated retrieval pipeline reference to use more semantically appropriate name.

The change correctly updates the test service to use DbSchemaRetrieval instead of the more generic Retrieval class, which better reflects its specific purpose of retrieving database schema information. The component parameters are also correctly updated to use pipe_components["db_schema_retrieval"].


🏁 Script executed:

#!/bin/bash
# Verify if the DbSchemaRetrieval class is imported and exported correctly
echo "Checking imports for DbSchemaRetrieval..."
rg -A 2 "from src.pipelines.retrieval import DbSchemaRetrieval" 
echo "Checking if the DbSchemaRetrieval class is defined..."
rg -A 5 "class DbSchemaRetrieval"

Length of output: 897


🏁 Script executed:

#!/bin/bash
# Locate how `retrieval` is imported in the test file
rg -n "import.*retrieval" wren-ai-service/tests/pytest/services/test_ask.py
rg -n "from .*retrieval" wren-ai-service/tests/pytest/services/test_ask.py

Length of output: 274


🏁 Script executed:

#!/bin/bash
# Check that DbSchemaRetrieval is exported in the retrieval package __init__.py
rg -n "DbSchemaRetrieval" wren-ai-service/src/pipelines/retrieval/__init__.py || echo "No direct reference found"

Length of output: 158


Approve updated retrieval pipeline reference

The test now correctly uses retrieval.DbSchemaRetrieval, and the class is properly defined and exported in src/pipelines/retrieval. No further changes required.

wren-ai-service/tools/config/config.full.yaml (1)

151-152:

Details

✅ Verification successful

The sql_tables_extraction pipeline addition looks good.

This change adds a new pipeline for extracting table names from SQL queries, which will be used in the SQL correction workflow. The configuration correctly specifies using the litellm_llm.default LLM without requiring embedder or document store components.


🏁 Script executed:

#!/bin/bash
# Check if the SQLTablesExtraction class is implemented
echo "Checking for SQLTablesExtraction implementation..."
rg -A 10 "class SQLTablesExtraction"

# Check where the new pipeline is used in the codebase
echo "Checking usage of sql_tables_extraction pipeline..."
rg -A 5 "\"sql_tables_extraction\""

# Check how it's integrated with the SQL correction workflow
echo "Checking integration with SQL correction workflow..."
rg -A 10 "sql_tables_extraction.*sql_correction"

Length of output: 4048


sql_tables_extraction pipeline addition approved

  • Confirmed SQLTablesExtraction is implemented in wren-ai-service/src/pipelines/generation/sql_tables_extraction.py.
  • Verified pipeline registration under "sql_tables_extraction" in wren-ai-service/src/globals.py.
  • Confirmed invocation in the SQL correction workflow in wren-ai-service/src/web/v1/services/sql_corrections.py.
  • Configuration in wren-ai-service/tools/config/config.full.yaml (lines 151–152) correctly specifies litellm_llm.default without needing embedder or document store components.
wren-ai-service/docs/config_examples/config.google_ai_studio.yaml (1)

146-147: Confirm pipeline configuration consistency for sql_tables_extraction.

The new pipeline entry correctly uses the name sql_tables_extraction and specifies only llm: litellm_llm.default, matching the implementation in src/pipelines/generation/sql_tables_extraction.py. Please verify that no additional parameters (e.g., embedder, document_store, or engine) are required by the runtime.

wren-ai-service/docs/config_examples/config.grok.yaml (1)

142-143: Confirm pipeline configuration consistency for sql_tables_extraction.

The newly added pipeline entry aligns with the expected sql_tables_extraction implementation, using llm: litellm_llm.default. Ensure that the service correctly handles a pipeline defined without embedder, document_store, or engine settings.

wren-ai-service/docs/config_examples/config.deepseek.yaml (1)

160-161: Confirm pipeline configuration consistency for sql_tables_extraction.

The pipeline addition matches the implementation’s signature (sql_tables_extraction with llm: litellm_llm.default). Double-check that no other fields are needed for proper execution in the SQL tables extraction workflow.

wren-ai-service/docs/config_examples/config.groq.yaml (1)

141-142: Confirm pipeline configuration consistency for sql_tables_extraction.

The new entry is correctly named and uses llm: litellm_llm.default, consistent with SQLTablesExtraction logic. Verify that the pipeline loader supports a minimal configuration (LLM only) for table extraction.

wren-ai-service/docs/config_examples/config.ollama.yaml (1)

138-139: Confirm pipeline configuration consistency for sql_tables_extraction.

The addition of sql_tables_extraction with llm: litellm_llm.default is consistent across example configs. Please ensure the service properly initializes this pipeline without additional embedder or document_store settings.

wren-ai-service/docs/config_examples/config.azure.yaml (1)

150-151: New pipeline configuration added for SQL table extraction.

This new pipeline configuration for sql_tables_extraction that uses the default LLM aligns with the PR objective to extract SQL tables before performing corrections. This is a necessary addition to support the architectural change in the SQL correction workflow.

wren-ai-service/src/web/v1/services/ask.py (1)

411-416: Pipeline key updated from "retrieval" to "db_schema_retrieval".

This change correctly implements the migration from the generic retrieval pipeline to the more specialized db_schema_retrieval pipeline, ensuring that database schema information is properly retrieved when processing user queries.

wren-ai-service/src/web/v1/services/question_recommendation.py (1)

68-71: Pipeline key updated from "retrieval" to "db_schema_retrieval".

This change correctly implements the migration from the generic retrieval pipeline to the more specialized db_schema_retrieval pipeline in the question recommendation service. The update maintains consistency with other service changes in the PR.

wren-ai-service/tools/config/config.example.yaml (1)

151-152: New pipeline configuration added for SQL table extraction.

This new pipeline configuration for sql_tables_extraction is consistent with the changes in other configuration files and supports the architectural change to extract SQL tables before performing corrections.

wren-ai-service/docs/config_examples/config.anthropic.yaml (1)

137-138: LGTM: New SQL tables extraction pipeline added correctly.

The new sql_tables_extraction pipeline has been properly configured to use the default LLM model. This addition aligns with the broader changes introducing a pipeline to extract table names from SQL queries for improved SQL correction workflows.

deployment/kustomizations/base/cm.yaml (1)

190-192: LGTM: SQL tables extraction pipeline properly added to Kubernetes ConfigMap.

The new pipeline definition is consistent with the configuration pattern used throughout this file and matches the changes made in other configuration files.

wren-ai-service/tests/pytest/pipelines/generation/test_ask.py (2)

9-9: Successfully migrated to DbSchemaRetrieval import.

The import statement has been correctly updated to use the new DbSchemaRetrieval class from the appropriate module.


37-37: Successfully updated retrieval pipeline instantiation.

The retrieval pipeline instantiation has been properly updated to use DbSchemaRetrieval instead of Retrieval while maintaining the same parameter structure.

wren-ai-service/eval/pipelines.py (2)

211-211: Successfully migrated RetrievalPipeline to use DbSchemaRetrieval.

The RetrievalPipeline class now correctly uses the more specialized DbSchemaRetrieval class while maintaining the same initialization parameters.


376-376: Successfully migrated AskPipeline to use DbSchemaRetrieval.

The AskPipeline class has been properly updated to use the DbSchemaRetrieval class, consistent with the changes throughout the codebase.

wren-ai-service/src/globals.py (1)

100-105: Guard against missing db_schema_retrieval pipeline in configuration

pipe_components["db_schema_retrieval"] is now mandatory for the ask_service.
If a deployer upgrades the code but forgets to add the new block to the YAML/ENV config, create_service_container will raise a KeyError and the application will fail during boot.

 try:
     "db_schema_retrieval": retrieval.DbSchemaRetrieval(
         **pipe_components["db_schema_retrieval"],
         table_retrieval_size=settings.table_retrieval_size,
         table_column_retrieval_size=settings.table_column_retrieval_size,
     ),
 except KeyError as exc:
     raise RuntimeError(
         "`db_schema_retrieval` is missing from the pipeline config"
     ) from exc

Adding a short defensive check (or a descriptive error message) shields operators from a non-obvious startup failure.

wren-ai-service/src/pipelines/retrieval/db_schema_retrieval.py (2)

179-196: Early-return optimisation is good, but watch out for explicit-table calls

Short-circuiting when table_name_conditions is empty prevents a needless query – nice!
However, an upstream caller may pass an explicit tables=[] (empty list) on purpose. In that case the function now returns [], which later stages silently treat as “no schema” and may degrade the answer quality without explanation.

Recommendation:

-if table_name_conditions:
+if not table_name_conditions:
+    logger.debug("No table names found; skipping DB-schema retrieval")
+    return []
+
+# Existing logic

Adding a log (and possibly surfacing a warning in the API response) helps debugging why no context was retrieved.


430-457: Renamed class needs explicit re-export & docs

class DbSchemaRetrieval replaces the old Retrieval. Ensure that:

  1. src/pipelines/retrieval/__init__.py adds DbSchemaRetrieval to __all__ so from … import retrieval continues to work.
  2. Any external SDK / README mentioning Retrieval is updated to avoid broken imports.

Failure to do so will surface as ImportError for library users pinning past examples.

wren-ai-service/src/web/v1/services/sql_corrections.py (1)

128-128: Cache may miss metadata if an unexpected exception occurs earlier

self._cache[event_id] is read unconditionally. In extremely early failures (e.g., I/O error before we set the cache), this will raise KeyError, masking the original error.

Wrap in self._cache.get(event_id) or self.Event(event_id=event_id, status="failed").

wren-ai-service/src/pipelines/generation/sql_tables_extraction.py (1)

66-68: Verify that the generator’s call signature matches the one used here

generator is invoked with a single keyword argument prompt=....
Depending on the concrete LLM wrapper returned by llm_provider.get_generator, the callable might instead expect:

  • a positional prompt/messages argument,
  • additional generation parameters (temperature, max_tokens, …),
  • or a fully-formed dictionary containing the model, prompt, etc.

A mismatch will raise a TypeError at runtime and the whole pipeline will fail before we even reach post-processing. Please double-check the provider’s contract or guard this call accordingly.

Comment on lines 95 to 99
res = await self._pipelines["sql_correction"].run(
contexts=[],
contexts=table_ddls,
invalid_generation_results=[_invalid],
project_id=request.project_id,
project_id=project_id,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Fallback when no schema context is found

If table_ddls ends up empty the LLM will receive zero context, which often yields another invalid SQL. A low-effort safety net:

if not table_ddls:
    logger.warning("No DDL context found; falling back to correction without context")

Optionally rerun db_schema_retrieval with a similarity search fallback.

🤖 Prompt for AI Agents (early access)
In wren-ai-service/src/web/v1/services/sql_corrections.py around lines 95 to 99, add a check before calling the SQL correction pipeline to see if table_ddls is empty. If it is empty, log a warning message indicating no DDL context was found and optionally rerun the db_schema_retrieval with a similarity search fallback to provide some context. This prevents the LLM from receiving zero context and generating invalid SQL.

Comment on lines +77 to +94
tables = (
await self._pipelines["sql_tables_extraction"].run(
sql=sql,
)
)["post_process"]

documents = (
(
await self._pipelines["db_schema_retrieval"].run(
project_id=project_id,
tables=tables,
)
)
.get("construct_retrieval_results", {})
.get("retrieval_results", [])
)
table_ddls = [document.get("table_ddl") for document in documents]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Validate output of sql_tables_extraction before use

tables = (await …)["post_process"] assumes:

  1. the pipeline succeeds;
  2. the key post_process exists;
  3. the value is an iterable of table names.

If the LLM responds with an unexpected shape (common) a KeyError or TypeError will propagate and the whole correction fails.

-tables = (
-    await self._pipelines["sql_tables_extraction"].run(sql=sql)
-)["post_process"]
+extraction_res = await self._pipelines["sql_tables_extraction"].run(sql=sql)
+
+tables = extraction_res.get("post_process") or []
+if not isinstance(tables, (list, tuple)):
+    self._handle_exception(
+        event_id,
+        "Table-extraction failed or returned invalid format",
+        trace_id=trace_id,
+    )
+    return self._cache[event_id].with_metadata()

Fail-fast keeps the service responsive and prevents puzzling stack traces.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
tables = (
await self._pipelines["sql_tables_extraction"].run(
sql=sql,
)
)["post_process"]
documents = (
(
await self._pipelines["db_schema_retrieval"].run(
project_id=project_id,
tables=tables,
)
)
.get("construct_retrieval_results", {})
.get("retrieval_results", [])
)
table_ddls = [document.get("table_ddl") for document in documents]
# Validate output of sql_tables_extraction before use
extraction_res = await self._pipelines["sql_tables_extraction"].run(sql=sql)
tables = extraction_res.get("post_process") or []
if not isinstance(tables, (list, tuple)):
self._handle_exception(
event_id,
"Table-extraction failed or returned invalid format",
trace_id=trace_id,
)
return self._cache[event_id].with_metadata()
documents = (
(
await self._pipelines["db_schema_retrieval"].run(
project_id=project_id,
tables=tables,
)
)
.get("construct_retrieval_results", {})
.get("retrieval_results", [])
)
table_ddls = [document.get("table_ddl") for document in documents]
🤖 Prompt for AI Agents (early access)
In wren-ai-service/src/web/v1/services/sql_corrections.py around lines 77 to 94, the code assumes the pipeline call returns a successful result containing the key "post_process" with an iterable of table names, but this may not always be true and can cause KeyError or TypeError. To fix this, add validation after awaiting the pipeline to check if the result is a dict, contains the "post_process" key, and that its value is an iterable. If any check fails, raise a clear exception or handle the error to fail fast and avoid confusing stack traces.

Comment on lines +70 to +75
@observe(capture_input=False)
async def post_process(
extract_sql_tables: dict,
) -> list[str]:
return orjson.loads(extract_sql_tables.get("replies")[0])["tables"]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

post_process is brittle – add validation & graceful error handling

The function blindly assumes extract_sql_tables has the structure
{"replies": ["{...valid-json...}"]} and that the JSON contains a tables key. Any of the following will currently explode with an uncaught exception:

  • provider returns an error object instead of replies,
  • the text is not valid JSON (model hallucination / truncation),
  • the JSON schema is violated (missing or non-list tables),
  • the list contains duplicates that should be deduplicated.

Consider hardening this step and leveraging the SQLTablesExtractionResult Pydantic model you already defined:

 @observe(capture_input=False)
 async def post_process(
     extract_sql_tables: dict,
 ) -> list[str]:
-    return orjson.loads(extract_sql_tables.get("replies")[0])["tables"]
+    try:
+        raw_reply = extract_sql_tables["replies"][0]
+        result = SQLTablesExtractionResult.model_validate_json(raw_reply)
+    except (KeyError, IndexError, ValueError, orjson.JSONDecodeError) as exc:
+        logger.error("Malformed LLM response in sql_tables_extraction: %s", exc)
+        # Decide how you want to surface the error – raising keeps the failure explicit
+        raise
+
+    # Remove duplicates while preserving order
+    seen = set()
+    tables = [t for t in result.tables if not (t in seen or seen.add(t))]
+    return tables

This ensures the pipeline fails fast with a meaningful log message and prevents downstream components from receiving corrupted data.

🤖 Prompt for AI Agents (early access)
In wren-ai-service/src/pipelines/generation/sql_tables_extraction.py around lines 70 to 75, the post_process function assumes extract_sql_tables always contains a valid "replies" list with JSON having a "tables" key, which can cause uncaught exceptions if the structure is invalid or JSON parsing fails. To fix this, add validation to check the presence and type of "replies" and its first element, handle JSON parsing errors gracefully, validate the parsed data against the SQLTablesExtractionResult Pydantic model, and deduplicate the tables list before returning it. Include meaningful error logging and raise exceptions to fail fast if validation fails.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
wren-ai-service/docs/config_examples/config.open_router.yaml (1)

138-139: Verify YAML indentation consistency

Ensure the indentation (two spaces) for the new - name: sql_tables_extraction entry matches other pipeline items to avoid YAML parsing errors.

wren-ai-service/docs/config_examples/config.azure.yaml (1)

150-151: Verify indentation of new pipeline entry

Confirm that the new - name: sql_tables_extraction block uses the same two-space indentation as other entries under pipes.

wren-ai-service/docs/config_examples/config.deepseek.yaml (1)

160-161: Ensure YAML indentation consistency

Double-check that the indentation for the new sql_tables_extraction entry aligns with other pipeline definitions (two spaces).

wren-ai-service/tests/pytest/services/test_ask.py (1)

43-45: Consider aligning component key with implementation class

The service component is registered under the key "retrieval" but is instantiated with DbSchemaRetrieval class and uses configuration from "db_schema_retrieval". This mismatch between the dictionary key and the implementation details could lead to confusion.

Consider renaming the component key to match its implementation:

- "retrieval": retrieval.DbSchemaRetrieval(
+ "db_schema_retrieval": retrieval.DbSchemaRetrieval(

This would make the code more consistent and self-documenting.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ca5567 and 1be8d90.

📒 Files selected for processing (25)
  • deployment/kustomizations/base/cm.yaml (1 hunks)
  • docker/config.example.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.anthropic.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.azure.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.deepseek.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.google_ai_studio.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.grok.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.groq.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.lm_studio.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.ollama.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.open_router.yaml (1 hunks)
  • wren-ai-service/eval/pipelines.py (2 hunks)
  • wren-ai-service/src/globals.py (3 hunks)
  • wren-ai-service/src/pipelines/generation/__init__.py (2 hunks)
  • wren-ai-service/src/pipelines/generation/sql_tables_extraction.py (1 hunks)
  • wren-ai-service/src/pipelines/retrieval/__init__.py (1 hunks)
  • wren-ai-service/src/pipelines/retrieval/db_schema_retrieval.py (3 hunks)
  • wren-ai-service/src/web/v1/services/ask.py (1 hunks)
  • wren-ai-service/src/web/v1/services/question_recommendation.py (1 hunks)
  • wren-ai-service/src/web/v1/services/sql_corrections.py (1 hunks)
  • wren-ai-service/tests/pytest/pipelines/generation/test_ask.py (2 hunks)
  • wren-ai-service/tests/pytest/services/mocks.py (1 hunks)
  • wren-ai-service/tests/pytest/services/test_ask.py (1 hunks)
  • wren-ai-service/tools/config/config.example.yaml (1 hunks)
  • wren-ai-service/tools/config/config.full.yaml (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
wren-ai-service/src/pipelines/generation/__init__.py (1)
wren-ai-service/src/pipelines/generation/sql_tables_extraction.py (1)
  • SQLTablesExtraction (95-127)
wren-ai-service/src/globals.py (2)
wren-ai-service/src/pipelines/retrieval/db_schema_retrieval.py (1)
  • DbSchemaRetrieval (430-495)
wren-ai-service/src/pipelines/generation/sql_tables_extraction.py (1)
  • SQLTablesExtraction (95-127)
wren-ai-service/src/pipelines/retrieval/db_schema_retrieval.py (1)
wren-ai-service/src/core/pipeline.py (1)
  • BasicPipeline (15-21)
🔇 Additional comments (40)
wren-ai-service/tests/pytest/pipelines/generation/test_ask.py (2)

9-9: Import statement updated to use DbSchemaRetrieval instead of Retrieval.

The change correctly updates the import to use the more specific DbSchemaRetrieval class from the new module path, which aligns with the broader refactoring effort across the codebase.


37-37: Pipeline instantiation updated to use DbSchemaRetrieval.

The test now correctly instantiates the DbSchemaRetrieval class and passes the appropriate component configuration. This change maintains the test's functionality while adapting to the refactored retrieval mechanism.

wren-ai-service/tests/pytest/services/mocks.py (1)

8-8: Mock class inheritance updated to DbSchemaRetrieval.

The RetrievalMock class has been correctly updated to inherit from retrieval.DbSchemaRetrieval instead of retrieval.Retrieval, ensuring the mock aligns with the refactored implementation while maintaining the same interface and behavior.

docker/config.example.yaml (1)

140-141:

Details

✅ Verification successful

New pipeline added for SQL tables extraction.

The new sql_tables_extraction pipeline is properly configured to use the default LLM. This addition supports the new functionality for extracting SQL table names to improve schema context for SQL correction flows.


🏁 Script executed:

#!/bin/bash
# Check if the sql_tables_extraction implementation exists in the codebase

echo "Searching for the SQL tables extraction implementation file..."
fd -t f "sql_tables_extraction.py" src/pipelines

echo "Checking for references to this pipeline in the codebase..."
rg "sql_tables_extraction" src

Length of output: 488


Let’s broaden the search to the entire repo to ensure we pick up any non-standard paths:


🏁 Script executed:

#!/bin/bash
# Search for the pipeline implementation file anywhere in the repository
echo "🔍 Looking for sql_tables_extraction.py across the repo..."
fd -t f "sql_tables_extraction.py"

# Search for any references to this pipeline name in the code
echo "🔍 Scanning for references to 'sql_tables_extraction' in all files..."
rg -n "sql_tables_extraction" .

Length of output: 2787


sql_tables_extraction pipeline verified

The new sql_tables_extraction pipeline is fully implemented and correctly registered:

• Implementation: wren-ai-service/src/pipelines/generation/sql_tables_extraction.py
• Registration: wren-ai-service/src/globals.py (lines 257–258)
• Configuration examples updated:
docker/config.example.yaml (lines 140–141)
wren-ai-service/tools/config/config.example.yaml
wren-ai-service/docs/config_examples/*

All references and config entries align. No further changes required.

wren-ai-service/docs/config_examples/config.anthropic.yaml (1)

137-138: Consistent pipeline configuration added across example configs.

The same sql_tables_extraction pipeline configuration has been correctly added to the Anthropic configuration example, ensuring consistency across different deployment options. This change properly maintains alignment with the docker example configuration.

wren-ai-service/docs/config_examples/config.open_router.yaml (1)

138-139: Add SQL tables extraction to pipeline list

The sql_tables_extraction pipeline is now correctly included using only the LLM provider, matching its implementation signature.

wren-ai-service/src/pipelines/generation/__init__.py (2)

17-17: Expose SQLTablesExtraction in generation module

The import of SQLTablesExtraction is added correctly, enabling its discovery by the generation pipeline registry.


38-38: Include SQLTablesExtraction in __all__

Adding "SQLTablesExtraction" to __all__ ensures it is exported as part of the public API for this package.

wren-ai-service/src/pipelines/retrieval/__init__.py (2)

1-1: Rename retrieval pipeline import to DbSchemaRetrieval

The import has been updated to reference the renamed DbSchemaRetrieval class, aligning with its new module.


12-12: Update __all__ to export DbSchemaRetrieval

Including "DbSchemaRetrieval" in __all__ maintains consistency in the public API after the class rename.

wren-ai-service/docs/config_examples/config.azure.yaml (1)

150-151: Add sql_tables_extraction pipeline entry

The new pipeline is correctly added to the Azure config, specifying only the LLM provider as required by its implementation.

wren-ai-service/docs/config_examples/config.deepseek.yaml (1)

160-161: Include sql_tables_extraction in Deepseek pipeline

The sql_tables_extraction pipeline entry is correctly added, matching the implementation requirement for LLM-only configuration.

wren-ai-service/docs/config_examples/config.groq.yaml (1)

141-142: LGTM: Clean addition of sql_tables_extraction pipeline

The addition of the new sql_tables_extraction pipeline is well-structured and consistent with other pipeline definitions in the file.

wren-ai-service/docs/config_examples/config.ollama.yaml (1)

138-139: LGTM: Consistent addition of sql_tables_extraction pipeline

The new sql_tables_extraction pipeline is added consistently with the same configuration as in other config files.

wren-ai-service/docs/config_examples/config.lm_studio.yaml (1)

140-141: LGTM: Clean addition of sql_tables_extraction pipeline

The addition of the sql_tables_extraction pipeline with the litellm_llm.default model aligns with the other configuration files.

wren-ai-service/docs/config_examples/config.google_ai_studio.yaml (1)

146-147: LGTM: Consistent sql_tables_extraction pipeline configuration

The new sql_tables_extraction pipeline is correctly configured with the litellm_llm.default model, consistent with other config files.

wren-ai-service/tools/config/config.example.yaml (1)

151-152: Configuration added for new SQL table extraction pipeline.

The addition of the sql_tables_extraction pipeline is consistent with the overall changes in this PR, which introduces a specialized pipeline for extracting table names from SQL queries using an LLM. This is properly configured to use the default LLM specified in the config.

wren-ai-service/tools/config/config.full.yaml (1)

151-152: Consistent configuration for the SQL table extraction pipeline.

The addition of the sql_tables_extraction pipeline matches the implementation across other configuration files. Good to see consistency in pipeline configuration across different environment setups.

wren-ai-service/docs/config_examples/config.grok.yaml (1)

142-143: SQL tables extraction pipeline added to Grok config example.

The sql_tables_extraction pipeline has been properly added to this example configuration file as well, maintaining consistency across all configuration templates in the project.

wren-ai-service/src/web/v1/services/question_recommendation.py (1)

68-69: Updated pipeline key to use the renamed retrieval pipeline.

The pipeline key has been updated from "retrieval" to "db_schema_retrieval", which aligns with the refactoring of the retrieval mechanism throughout the codebase. This change ensures consistency with the renamed DbSchemaRetrieval class.

deployment/kustomizations/base/cm.yaml (1)

190-192: LGTM: Addition of the sql_tables_extraction pipeline.

The new pipeline configuration is correctly structured with the required LLM specified. This aligns with the PR objective to add SQL table extraction functionality.

wren-ai-service/src/web/v1/services/ask.py (1)

411-416: Pipeline reference updated correctly.

The change from "retrieval" to "db_schema_retrieval" is consistent with the broader refactoring where the retrieval component has been renamed.

wren-ai-service/eval/pipelines.py (2)

211-215: Class reference updated correctly.

The change from retrieval.Retrieval to retrieval.DbSchemaRetrieval aligns with the refactoring of the retrieval component while maintaining the same interface.


376-380: Class reference updated consistently.

Similar to the previous change, the AskPipeline class has been correctly updated to use DbSchemaRetrieval.

wren-ai-service/src/pipelines/retrieval/db_schema_retrieval.py (3)

179-196: Improved efficiency with conditional filtering.

The refactored code now avoids unnecessary query operations when no table names are provided. This optimization prevents constructing and executing empty filters.


430-430: Class rename aligns with functionality.

Renaming Retrieval to DbSchemaRetrieval better communicates the specific purpose of this class, making the codebase more maintainable.


502-502: Updated reference in dry run pipeline.

The dry run pipeline reference has been consistently updated to use the new class name.

wren-ai-service/src/globals.py (3)

100-104: LGTM! Pipeline key renamed for better clarity.

The change from "retrieval" to "db_schema_retrieval" improves the code by using a more specific and descriptive pipeline key name that better reflects its purpose. This makes the code more maintainable and easier to understand.


203-207: LGTM! Consistent renaming of pipeline key.

This change maintains consistency with the pipeline renaming pattern used throughout the codebase. The pipeline parameters for table retrieval are appropriately carried over.


257-264: Well-structured pipeline enhancement for SQL correction.

The addition of two new pipelines to the SQL correction service:

  1. "sql_tables_extraction" - Extracts table names from the SQL query
  2. "db_schema_retrieval" - Fetches relevant database schema based on extracted tables

This creates a more intelligent SQL correction flow that can leverage database schema context.

wren-ai-service/src/web/v1/services/sql_corrections.py (5)

66-69: LGTM! Parameter extraction improves readability.

Extracting the request parameters into well-named local variables makes the code more readable and maintainable.


77-94: Strong implementation of the enhanced correction workflow.

This code segment effectively implements a 3-step workflow:

  1. Extract table names from the SQL using the new "sql_tables_extraction" pipeline
  2. Use these table names to fetch relevant schema information
  3. Prepare table DDL statements as context for correction

This approach intelligently narrows down the schema context to only what's relevant for the SQL query.


96-99: LGTM! Improved context for SQL correction.

Adding table DDLs as context for the SQL correction pipeline will significantly improve the quality of corrections, as the LLM now has access to the relevant schema information.


106-119: LGTM! Refactored error handling with better variable usage.

This refactored error handling code improves upon the previous implementation by:

  1. Using the extracted event_id variable consistently
  2. Centralizing error handling through the _handle_exception method
  3. Maintaining better code structure with clear condition branches

128-128: LGTM! Consistent cache key usage.

Using the extracted event_id variable for cache access ensures consistency throughout the method.

wren-ai-service/src/pipelines/generation/sql_tables_extraction.py (5)

18-54: Well-crafted prompts for table extraction.

The system and user prompts are:

  • Clear in their instruction to extract tables from SQL
  • Provide helpful examples covering various SQL patterns
  • Specify a structured JSON output format
  • Include a "step by step" thinking prompt to improve LLM reasoning

This is a robust approach for leveraging an LLM to extract table names from SQL queries.


57-76: Clean pipeline implementation with observability.

The pipeline functions are:

  • Well-separated with clear responsibilities
  • Decorated with observability annotations that avoid capturing sensitive inputs
  • Concise and focused on their specific tasks
  • Follow the established pattern for pipeline implementations

The JSON parsing in post_process correctly extracts the "tables" field from the LLM response.


80-92: LGTM! Well-structured schema validation.

Using Pydantic for schema validation is an excellent approach to ensure the LLM returns data in the expected format. The configuration properly sets up JSON schema validation for the LLM provider.


95-127: LGTM! Clean pipeline class implementation.

The SQLTablesExtraction class:

  • Follows the standard pipeline class structure
  • Properly initializes required components with appropriate configuration
  • Has a focused run method with clear logging
  • Uses dependency injection for the LLM provider

This implementation maintains consistency with the rest of the codebase and is easy to understand.


130-137: Helpful dry-run capability.

The dry-run implementation allows for easy testing of the pipeline in isolation, which is valuable for development and debugging. The simple test case with "SELECT * FROM table" provides a good baseline test.

Copy link
Copy Markdown
Contributor

@imAsterSun imAsterSun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reviewed and approved

@cyyeh cyyeh merged commit 689bf7d into main May 6, 2025
14 checks passed
@cyyeh cyyeh deleted the feat/ai-service/fix-by-ai-api branch May 6, 2025 05:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants