Skip to content

VER-265 - Optimize fetch related snippets#21

Merged
quancao-ea merged 1 commit intomainfrom
features/optimize-fetch-related-snippets
Oct 1, 2025
Merged

VER-265 - Optimize fetch related snippets#21
quancao-ea merged 1 commit intomainfrom
features/optimize-fetch-related-snippets

Conversation

@quancao-ea
Copy link
Copy Markdown
Collaborator

@quancao-ea quancao-ea commented Sep 30, 2025

Important

Optimize search_related_snippets_public function with CTEs for performance and readability improvements, adding a 30-second timeout.

  • Optimization:
    • Refactor search_related_snippets_public to use CTEs snippets_with_similarity, similar_snippets, label_summary, and final_snippets for better performance and readability.
    • Add statement_timeout of 30 seconds to prevent long-running queries.
  • Behavior:
    • Filters snippets by similarity > match_threshold and limits results to match_count.
    • Aggregates labels for snippets using label_summary CTE.
    • Constructs final JSON response from final_snippets CTE.

This description was created by Ellipsis for 0744d20. You can customize this summary. It will automatically update as commits are pushed.

Summary by CodeRabbit

  • New Features
    • Language-aware summaries in related snippet results.
  • Bug Fixes
    • Ensures labels are always present in results for consistency.
  • Refactor
    • Streamlined query pipeline to compute similarity, enrich metadata, and assemble final results more reliably.
    • Improved output structure for related snippets, providing clearer JSON fields and metadata.

@quancao-ea quancao-ea self-assigned this Sep 30, 2025
@linear
Copy link
Copy Markdown

linear Bot commented Sep 30, 2025

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Sep 30, 2025

Walkthrough

Refactors the SQL for search_related_snippets_public into multiple CTEs: computes vector similarity, filters by threshold, aggregates labels JSON, selects language-specific summaries, and assembles final snippet fields before JSON construction. Adjusts JOIN placement and field aliases, and sets a statement_timeout at the end.

Changes

Cohort / File(s) Summary
Related snippets SQL refactor
supabase/database/sql/search_related_snippets_public.sql
Split single-CTE query into staged CTEs: compute similarity (snippets_with_similarity), filter (similar_snippets), precompute labels (label_summary), assemble final fields (final_snippets), select language-specific summary, COALESCE labels, update JSON field sources, and append statement_timeout.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant DB as Postgres
  participant CTE1 as snippets_with_similarity
  participant CTE2 as similar_snippets
  participant CTE3 as label_summary
  participant CTE4 as final_snippets

  Client->>DB: call search_related_snippets_public(params)
  DB->>CTE1: compute vector similarity for candidate snippets
  CTE1-->>CTE2: filter by similarity threshold
  CTE2-->>CTE3: aggregate labels per snippet (JSON)
  CTE2-->>CTE4: assemble fields (summary by language, metadata)
  CTE3-->>CTE4: join labels (COALESCE)
  CTE4-->>DB: return finalized rows as JSON
  DB-->>Client: JSON array of related snippets
  note over DB: SET statement_timeout applied at end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • nhphong

Poem

I hop through CTEs, step by step,
From vectors to labels, tidy and kept.
A whisker-twitch, thresholds align—
Summaries speak in the reader’s tongue fine.
JSON baskets neatly spun—
Query refactor: job well done! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Out of Scope Changes Check ⚠️ Warning In addition to query restructuring for performance, the PR introduces functional changes such as language-based summary selection, label JSON aggregation, and a statement_timeout setting that alter output semantics. These enhancements are not mentioned in the linked issue objectives, which focus solely on investigating performance issues, so they appear to be out of scope. Consider splitting out functional enhancements into a separate pull request or isolating purely performance-related SQL changes here to keep the scope aligned with the investigation objective.
Linked Issues Check ❓ Inconclusive The linked issue VER-265 calls for investigating and diagnosing performance problems when fetching related snippets, but the PR applies an optimized SQL implementation without providing any diagnostic findings, benchmarks, or root-cause analysis. It is therefore unclear whether the PR fulfills the investigation objective or simply delivers an optimization. Please include documentation of investigation steps, performance measurements before and after the changes, or a summary of root-cause analysis to demonstrate how the PR satisfies the linked issue’s investigative requirements.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title “VER-265 - Optimize fetch related snippets” succinctly and accurately captures the main change of this pull request, which is to improve the performance of the related snippets query. It references the linked issue and uses clear, specific language without extraneous details. This makes it immediately clear to reviewers what the primary focus of the PR is.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch features/optimize-fetch-related-snippets

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🧪 Early access (Sonnet 4.5): enabled

We are currently testing the Sonnet 4.5 model, which is expected to improve code review quality. However, this model may lead to increased noise levels in the review comments. Please disable the early access features if the noise level causes any inconvenience.

Note:

  • Public repositories are always opted into early access features.
  • You can enable or disable early access features from the CodeRabbit UI or by updating the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.

@quancao-ea quancao-ea requested a review from nhphong September 30, 2025 11:53
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello @quancao-ea, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the efficiency and reliability of the search_related_snippets_public SQL function. The primary goal is to optimize the retrieval of similar code snippets by restructuring the query logic. This involves applying filtering and limiting operations earlier in the execution flow and isolating the aggregation of labels into a dedicated CTE. These changes collectively aim to reduce computational load, improve response times, and prevent potential performance bottlenecks by introducing a statement timeout.

Highlights

  • Query Performance Optimization: The SQL function for fetching related snippets has been refactored to improve its performance and efficiency.
  • Early Filtering and Limiting: The LIMIT and ORDER BY clauses are now applied earlier in the query execution plan, reducing the dataset size before more complex operations.
  • Decoupled Label Aggregation: Label aggregation has been moved into a separate Common Table Expression (CTE), optimizing data processing by only aggregating labels for already identified similar snippets.
  • Statement Timeout: A SET statement_timeout TO '30s'; has been added to the PL/pgSQL function to prevent long-running queries and improve stability.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the search_related_snippets_public function to improve performance by using Common Table Expressions (CTEs). The new structure correctly defers expensive joins and aggregations until after the most similar snippets have been identified and limited, which is a significant improvement. I have one suggestion to further refine the query to ensure optimal performance by combining two of the CTEs. The addition of a statement_timeout is also a great defensive measure against long-running queries.

Comment on lines +24 to +40
snippets_with_similarity AS (
SELECT
s.id,
s.title,
s.file_path,
s.recorded_at,
s.comment_count,
s.start_time,
a.radio_station_name,
a.radio_station_code,
a.location_state,
CASE
WHEN p_language = 'spanish' THEN s.summary ->> 'spanish'
ELSE s.summary ->> 'english'
END AS summary,
1 - (se.embedding <=> source_embedding) as similarity,
jsonb_agg(l) as labels
1 - (se.embedding <=> source_embedding) as similarity
FROM snippet_embeddings se
JOIN snippets s ON s.id = se.snippet
JOIN audio_files a ON a.id = s.audio_file
LEFT JOIN snippet_labels sl ON s.id = sl.snippet
LEFT JOIN labels l ON sl.label = l.id
WHERE
se.snippet != snippet_id
AND 1 - (se.embedding <=> source_embedding) > match_threshold
AND s.status = 'Processed'
GROUP BY
),
similar_snippets AS (
SELECT *
FROM snippets_with_similarity
WHERE similarity > match_threshold
ORDER BY similarity DESC
LIMIT match_count
),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

While breaking the query into multiple CTEs improves readability, separating snippets_with_similarity and similar_snippets this way could lead to a significant performance degradation. The snippets_with_similarity CTE might be materialized by the query planner, forcing a full scan of snippet_embeddings to calculate similarity for every snippet. Although modern PostgreSQL planners often optimize this by inlining the CTE, it's not guaranteed.

To ensure optimal performance and make the intent clearer, it's better to combine these two CTEs. This guarantees that the match_threshold filter is applied during the initial vector search, which is much more efficient.

    similar_snippets AS (
        SELECT
            s.id,
            1 - (se.embedding <=> source_embedding) as similarity
        FROM snippet_embeddings se
        JOIN snippets s ON s.id = se.snippet
        WHERE
            se.snippet != snippet_id
            AND s.status = 'Processed'
            AND 1 - (se.embedding <=> source_embedding) > match_threshold
        ORDER BY similarity DESC
        LIMIT match_count
    ),

Copy link
Copy Markdown
Contributor

@ellipsis-dev ellipsis-dev Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

Looks good to me! 👍

Reviewed everything up to 0744d20 in 2 minutes and 1 seconds. Click for details.
  • Reviewed 123 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 3 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. supabase/database/sql/search_related_snippets_public.sql:58
  • Draft comment:
    The ordering from the 'similar_snippets' CTE may be lost after joining in 'final_snippets'. Consider adding an explicit ORDER BY (e.g. by similarity DESC) in the final CTE or outer SELECT to ensure the JSON array is returned in the intended order.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% While the comment raises a valid theoretical concern about SQL ordering guarantees, PostgreSQL generally preserves CTE ordering in simple cases like this. The joins in final_snippets are all 1:1 joins that shouldn't affect ordering. Adding another ORDER BY would be redundant and could impact performance. The comment seems overly cautious. I could be wrong about PostgreSQL's ordering guarantees across CTEs. There might be edge cases where the order could be affected. Even if there are edge cases, the code has worked this way and adding another ORDER BY would be redundant in the common case. The theoretical risk doesn't justify the complexity/performance trade-off. The comment should be removed as it suggests a change that would add complexity without clear benefit, given PostgreSQL's typical behavior with CTE ordering.
2. supabase/database/sql/search_related_snippets_public.sql:101
  • Draft comment:
    Consider adding a comment explaining the rationale for setting 'statement_timeout' to 30s to aid future maintainers.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 30% <= threshold 50% The comment is asking the author to add an explanation for a specific code change. While it doesn't directly ask for confirmation or testing, it suggests adding a comment for future maintainers. This is not a direct code suggestion or a request for a test, so it might not be considered useful under the given rules.
3. supabase/database/sql/search_related_snippets_public.sql:24
  • Draft comment:
    By removing the match_threshold filter from the base similarity CTE, the function now computes similarity for all processed snippets. Consider applying the threshold earlier (if supported by your index/knn search) to reduce computation on large datasets.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 50% The comment is technically correct - more rows will be processed before filtering. However, without knowledge of the indexing strategy, table sizes, and whether the database can push down the threshold filter, it's speculative whether this is actually a significant performance issue. The comment itself acknowledges this uncertainty with "if supported by your index/knn search". I might be too quick to dismiss performance concerns. Large table scans could be very expensive. While performance could be an issue, we don't have enough context to know if this change actually causes problems. The comment is speculative. Delete the comment because it's speculative about potential performance issues without clear evidence that it's actually problematic.

Workflow ID: wflow_RQQvtBmbGPFpsTWP

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
supabase/database/sql/search_related_snippets_public.sql (1)

23-33: Consider applying the similarity threshold earlier.

The snippets_with_similarity CTE computes similarity for all snippets before filtering by match_threshold in the next CTE. For large datasets, this could be inefficient as it calculates similarity scores that will be discarded.

Consider combining the first two CTEs and applying the threshold filter directly:

-WITH
-snippets_with_similarity AS (
+WITH similar_snippets AS (
     SELECT
         s.id,
         1 - (se.embedding <=> source_embedding) as similarity
     FROM snippet_embeddings se
     JOIN snippets s ON s.id = se.snippet
     WHERE
         se.snippet != snippet_id
         AND s.status = 'Processed'
-),
-similar_snippets AS (
-    SELECT *
-    FROM snippets_with_similarity
-    WHERE similarity > match_threshold
+        AND (1 - (se.embedding <=> source_embedding)) > match_threshold
     ORDER BY similarity DESC
     LIMIT match_count
 ),

This allows the query planner to potentially use vector index operators more efficiently and avoids materializing intermediate results.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 29cf672 and 0744d20.

📒 Files selected for processing (1)
  • supabase/database/sql/search_related_snippets_public.sql (2 hunks)
🔇 Additional comments (4)
supabase/database/sql/search_related_snippets_public.sql (4)

41-57: LGTM! Effective pre-aggregation of labels.

Pre-aggregating labels into JSON at this stage is a solid optimization. It avoids repeated joins and the COALESCE correctly handles snippets without labels.


58-79: LGTM! Clean assembly of final result set.

The final_snippets CTE effectively assembles all required fields:

  • Language-specific summary selection using CASE is clean
  • Appropriate use of LEFT JOIN for optional labels
  • All necessary fields are included for the final JSON output

80-98: LGTM! Correct JSON construction and NULL handling.

The final JSON aggregation is properly structured and the COALESCE at line 98 correctly handles the case where no similar snippets are found.


100-101: Verify the 30-second statement_timeout is sufficient.

The function-level statement_timeout is a solid safeguard, but no existing benchmarks or load tests cover search_related_snippets_public (supabase/database/sql/search_related_snippets_public.sql lines 100–101). Please validate—ideally with production-scale data volumes, varied match_count values, and concurrent execution—that 30 s meets your SLAs, and adjust or add performance tests as needed.

@quancao-ea quancao-ea merged commit ab5c558 into main Oct 1, 2025
2 checks passed
@quancao-ea quancao-ea deleted the features/optimize-fetch-related-snippets branch October 1, 2025 03:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants