VER-265 - Optimize fetch related snippets by quancao-ea · Pull Request #21 · PublicDataWorks/verdad

quancao-ea · 2025-09-30T11:53:52Z

Important

Optimize search_related_snippets_public function with CTEs for performance and readability improvements, adding a 30-second timeout.

Optimization:
- Refactor search_related_snippets_public to use CTEs snippets_with_similarity, similar_snippets, label_summary, and final_snippets for better performance and readability.
- Add statement_timeout of 30 seconds to prevent long-running queries.
Behavior:
- Filters snippets by similarity > match_threshold and limits results to match_count.
- Aggregates labels for snippets using label_summary CTE.
- Constructs final JSON response from final_snippets CTE.

This description was created automatically for 0744d20. It will automatically update as commits are pushed.

Summary by CodeRabbit

New Features
- Language-aware summaries in related snippet results.
Bug Fixes
- Ensures labels are always present in results for consistency.
Refactor
- Streamlined query pipeline to compute similarity, enrich metadata, and assemble final results more reliably.
- Improved output structure for related snippets, providing clearer JSON fields and metadata.

linear · 2025-09-30T11:53:54Z

VER-265 Investigate fetching related snippets performance issue

coderabbitai · 2025-09-30T11:53:59Z

Walkthrough

Refactors the SQL for search_related_snippets_public into multiple CTEs: computes vector similarity, filters by threshold, aggregates labels JSON, selects language-specific summaries, and assembles final snippet fields before JSON construction. Adjusts JOIN placement and field aliases, and sets a statement_timeout at the end.
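The staged shape described in this walkthrough could look roughly like the sketch below. The CTE names, the table names, and the similarity expression come from the diff in this PR. The bodies of label_summary and final_snippets are not shown anywhere in the review, so their joins and columns are assumptions reconstructed from the removed lines, and to_jsonb stands in for whatever JSON construction the real function uses.

```sql
-- Sketch only: meant to sit inside the body of search_related_snippets_public,
-- where snippet_id, source_embedding, match_threshold, match_count and
-- p_language are already in scope.
WITH snippets_with_similarity AS (
    -- Stage 1: similarity for every processed candidate except the source.
    SELECT
        s.id,
        1 - (se.embedding <=> source_embedding) AS similarity
    FROM snippet_embeddings se
    JOIN snippets s ON s.id = se.snippet
    WHERE se.snippet != snippet_id
      AND s.status = 'Processed'
),
similar_snippets AS (
    -- Stage 2: threshold filter, then keep the top match_count rows.
    SELECT *
    FROM snippets_with_similarity
    WHERE similarity > match_threshold
    ORDER BY similarity DESC
    LIMIT match_count
),
label_summary AS (
    -- Stage 3 (assumed shape): aggregate labels only for the surviving snippets.
    SELECT
        sl.snippet,
        jsonb_agg(l) AS labels
    FROM snippet_labels sl
    JOIN labels l ON l.id = sl.label
    WHERE sl.snippet IN (SELECT id FROM similar_snippets)
    GROUP BY sl.snippet
),
final_snippets AS (
    -- Stage 4 (assumed shape): enrich the few remaining rows with metadata.
    SELECT
        s.id,
        s.title,
        s.file_path,
        s.recorded_at,
        s.comment_count,
        s.start_time,
        a.radio_station_name,
        a.radio_station_code,
        a.location_state,
        CASE
            WHEN p_language = 'spanish' THEN s.summary ->> 'spanish'
            ELSE s.summary ->> 'english'
        END AS summary,
        ss.similarity,
        COALESCE(ls.labels, '[]'::jsonb) AS labels  -- fallback value is an assumption
    FROM similar_snippets ss
    JOIN snippets s ON s.id = ss.id
    JOIN audio_files a ON a.id = s.audio_file
    LEFT JOIN label_summary ls ON ls.snippet = ss.id
)
-- In the function this result would be returned or selected INTO a variable.
SELECT COALESCE(jsonb_agg(to_jsonb(fs)), '[]'::jsonb)
FROM final_snippets fs;
```

The point of the staging is that the joins to audio_files and labels touch at most match_count rows instead of every candidate snippet.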

Changes

Cohort / File(s)	Summary
Related snippets SQL refactor `supabase/database/sql/search_related_snippets_public.sql`	Split single-CTE query into staged CTEs: compute similarity (snippets_with_similarity), filter (similar_snippets), precompute labels (label_summary), assemble final fields (final_snippets), select language-specific summary, COALESCE labels, update JSON field sources, and append statement_timeout.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant DB as Postgres
  participant CTE1 as snippets_with_similarity
  participant CTE2 as similar_snippets
  participant CTE3 as label_summary
  participant CTE4 as final_snippets

  Client->>DB: call search_related_snippets_public(params)
  DB->>CTE1: compute vector similarity for candidate snippets
  CTE1-->>CTE2: filter by similarity threshold
  CTE2-->>CTE3: aggregate labels per snippet (JSON)
  CTE2-->>CTE4: assemble fields (summary by language, metadata)
  CTE3-->>CTE4: join labels (COALESCE)
  CTE4-->>DB: return finalized rows as JSON
  DB-->>Client: JSON array of related snippets
  note over DB: SET statement_timeout applied at end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

VER-254 - Related snippet for public view #15 — Also modifies search_related_snippets_public with similar query logic; this PR further refactors into multiple CTEs and adjusts assembly.
[f] VER-261 - Optimize get snippets with labels #18 — Touches related-snippets label aggregation and CTE-based similarity; overlaps in label/summary assembly.
[f] 248 - Related snippets update #13 — Introduces language-based summary selection and label retrieval changes in the same query area.

Suggested reviewers

nhphong

Poem

I hop through CTEs, step by step,
From vectors to labels, tidy and kept.
A whisker-twitch, thresholds align—
Summaries speak in the reader’s tongue fine.
JSON baskets neatly spun—
Query refactor: job well done! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Out of Scope Changes Check	⚠️ Warning	In addition to query restructuring for performance, the PR introduces functional changes such as language-based summary selection, label JSON aggregation, and a statement_timeout setting that alter output semantics. These enhancements are not mentioned in the linked issue objectives, which focus solely on investigating performance issues, so they appear to be out of scope.	Consider splitting out functional enhancements into a separate pull request or isolating purely performance-related SQL changes here to keep the scope aligned with the investigation objective.
Linked Issues Check	❓ Inconclusive	The linked issue VER-265 calls for investigating and diagnosing performance problems when fetching related snippets, but the PR applies an optimized SQL implementation without providing any diagnostic findings, benchmarks, or root-cause analysis. It is therefore unclear whether the PR fulfills the investigation objective or simply delivers an optimization.	Please include documentation of investigation steps, performance measurements before and after the changes, or a summary of root-cause analysis to demonstrate how the PR satisfies the linked issue’s investigative requirements.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title “VER-265 - Optimize fetch related snippets” succinctly and accurately captures the main change of this pull request, which is to improve the performance of the related snippets query. It references the linked issue and uses clear, specific language without extraneous details. This makes it immediately clear to reviewers what the primary focus of the PR is.
Docstring Coverage	✅ Passed	No functions found in the changes. Docstring coverage check skipped.


gemini-code-assist · 2025-09-30T11:54:05Z

Summary of Changes

Hello @quancao-ea, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the efficiency and reliability of the search_related_snippets_public SQL function. The primary goal is to optimize the retrieval of similar snippets by restructuring the query logic. This involves applying filtering and limiting operations earlier in the execution flow and isolating the aggregation of labels into a dedicated CTE. Together, these changes aim to reduce computational load and improve response times, and a statement timeout is added to cap long-running queries.

Highlights

Query Performance Optimization: The SQL function for fetching related snippets has been refactored to improve its performance and efficiency.
Early Filtering and Limiting: The LIMIT and ORDER BY clauses are now applied earlier in the query execution plan, reducing the dataset size before more complex operations.
Decoupled Label Aggregation: Label aggregation has been moved into a separate Common Table Expression (CTE), optimizing data processing by only aggregating labels for already identified similar snippets.
Statement Timeout: SET statement_timeout TO '30s' has been added to the PL/pgSQL function to prevent long-running queries and improve stability.
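One way to confirm that the timeout is stored with the function rather than set on a session is to look at the function's saved configuration. This is a minimal sketch, assuming the function name is not overloaded. The ALTER FUNCTION statement is an alternative way to attach the same setting and is not taken from this PR.

```sql
-- Per-function settings are stored in pg_proc.proconfig as 'name=value' entries.
SELECT p.proname, p.proconfig
FROM pg_proc p
WHERE p.proname = 'search_related_snippets_public';

-- Alternative (assumption, not from the PR): attach or change the setting
-- after the function exists. The argument list can be omitted only when the
-- function name is unique.
ALTER FUNCTION search_related_snippets_public SET statement_timeout TO '30s';
```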


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request refactors the search_related_snippets_public function to improve performance by using Common Table Expressions (CTEs). The new structure correctly defers expensive joins and aggregations until after the most similar snippets have been identified and limited, which is a significant improvement. I have one suggestion to further refine the query to ensure optimal performance by combining two of the CTEs. The addition of a statement_timeout is also a great defensive measure against long-running queries.

gemini-code-assist · 2025-09-30T11:55:43Z

+    snippets_with_similarity AS (
        SELECT
            s.id,
-            s.title,
-            s.file_path,
-            s.recorded_at,
-            s.comment_count,
-            s.start_time,
-            a.radio_station_name,
-            a.radio_station_code,
-            a.location_state,
-            CASE
-                WHEN p_language = 'spanish' THEN s.summary ->> 'spanish'
-                ELSE s.summary ->> 'english'
-            END AS summary,
-            1 - (se.embedding <=> source_embedding) as similarity,
-            jsonb_agg(l) as labels
+            1 - (se.embedding <=> source_embedding) as similarity
        FROM snippet_embeddings se
        JOIN snippets s ON s.id = se.snippet
-        JOIN audio_files a ON a.id = s.audio_file
-        LEFT JOIN snippet_labels sl ON s.id = sl.snippet
-        LEFT JOIN labels l ON sl.label = l.id
        WHERE
            se.snippet != snippet_id
-            AND 1 - (se.embedding <=> source_embedding) > match_threshold
            AND s.status = 'Processed'
-        GROUP BY 
+    ),
+    similar_snippets AS (
+        SELECT *
+        FROM snippets_with_similarity
+        WHERE similarity > match_threshold
+        ORDER BY similarity DESC
+        LIMIT match_count
+    ),


While breaking the query into multiple CTEs improves readability, separating snippets_with_similarity and similar_snippets this way could lead to a significant performance degradation. The snippets_with_similarity CTE might be materialized by the query planner, forcing a full scan of snippet_embeddings to calculate similarity for every snippet. Although modern PostgreSQL planners often optimize this by inlining the CTE, it's not guaranteed.

To ensure optimal performance and make the intent clearer, it's better to combine these two CTEs. This guarantees that the match_threshold filter is applied during the initial vector search, which is much more efficient.

similar_snippets AS (
    SELECT
        s.id,
        1 - (se.embedding <=> source_embedding) as similarity
    FROM snippet_embeddings se
    JOIN snippets s ON s.id = se.snippet
    WHERE
        se.snippet != snippet_id
        AND s.status = 'Processed'
        AND 1 - (se.embedding <=> source_embedding) > match_threshold
    ORDER BY similarity DESC
    LIMIT match_count
),

ellipsis-dev

Important

Looks good to me! 👍

Reviewed everything up to 0744d20 in 2 minutes and 1 second.

Reviewed 123 lines of code in 1 file.
Skipped 0 files when reviewing.
Skipped posting 3 draft comments, listed below.

1. supabase/database/sql/search_related_snippets_public.sql:58

Draft comment:
The ordering from the 'similar_snippets' CTE may be lost after joining in 'final_snippets'. Consider adding an explicit ORDER BY (e.g. by similarity DESC) in the final CTE or outer SELECT to ensure the JSON array is returned in the intended order.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% While the comment raises a valid theoretical concern about SQL ordering guarantees, PostgreSQL generally preserves CTE ordering in simple cases like this. The joins in final_snippets are all 1:1 joins that shouldn't affect ordering. Adding another ORDER BY would be redundant and could impact performance. The comment seems overly cautious. I could be wrong about PostgreSQL's ordering guarantees across CTEs. There might be edge cases where the order could be affected. Even if there are edge cases, the code has worked this way and adding another ORDER BY would be redundant in the common case. The theoretical risk doesn't justify the complexity/performance trade-off. The comment should be removed as it suggests a change that would add complexity without clear benefit, given PostgreSQL's typical behavior with CTE ordering.
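For reference, the change this unposted draft describes would be small. The sketch below is self-contained and uses made-up rows in place of the real final_snippets CTE. It assumes the outer query builds the array with jsonb_agg and that a similarity column is available at that point, neither of which the review shows.

```sql
-- Stand-in data: ids and similarity values are invented for illustration.
WITH final_snippets(id, similarity) AS (
    VALUES (1, 0.71), (2, 0.93), (3, 0.80)
)
-- An ORDER BY inside the aggregate fixes the element order of the JSON array,
-- independent of the order in which the planner happens to read the rows.
SELECT jsonb_agg(to_jsonb(fs) ORDER BY fs.similarity DESC)
FROM final_snippets fs;
```

SQL gives no ordering guarantee for rows read from a CTE or join unless the consuming query asks for one, so the in-aggregate ORDER BY is the only place where the array order is actually specified.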

2. supabase/database/sql/search_related_snippets_public.sql:101

Draft comment:
Consider adding a comment explaining the rationale for setting 'statement_timeout' to 30s to aid future maintainers.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 30% <= threshold 50% The comment is asking the author to add an explanation for a specific code change. While it doesn't directly ask for confirmation or testing, it suggests adding a comment for future maintainers. This is not a direct code suggestion or a request for a test, so it might not be considered useful under the given rules.

3. supabase/database/sql/search_related_snippets_public.sql:24

Draft comment:
By removing the match_threshold filter from the base similarity CTE, the function now computes similarity for all processed snippets. Consider applying the threshold earlier (if supported by your index/knn search) to reduce computation on large datasets.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 50% The comment is technically correct - more rows will be processed before filtering. However, without knowledge of the indexing strategy, table sizes, and whether the database can push down the threshold filter, it's speculative whether this is actually a significant performance issue. The comment itself acknowledges this uncertainty with "if supported by your index/knn search". I might be too quick to dismiss performance concerns. Large table scans could be very expensive. While performance could be an issue, we don't have enough context to know if this change actually causes problems. The comment is speculative. Delete the comment because it's speculative about potential performance issues without clear evidence that it's actually problematic.

Workflow ID: wflow_RQQvtBmbGPFpsTWP


coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

supabase/database/sql/search_related_snippets_public.sql (1)
23-33: Consider applying the similarity threshold earlier.

The snippets_with_similarity CTE computes similarity for all snippets before filtering by match_threshold in the next CTE. For large datasets, this could be inefficient as it calculates similarity scores that will be discarded.

Consider combining the first two CTEs and applying the threshold filter directly:
-WITH
-snippets_with_similarity AS (
+WITH similar_snippets AS (
     SELECT
         s.id,
         1 - (se.embedding <=> source_embedding) as similarity
     FROM snippet_embeddings se
     JOIN snippets s ON s.id = se.snippet
     WHERE
         se.snippet != snippet_id
         AND s.status = 'Processed'
-),
-similar_snippets AS (
-    SELECT *
-    FROM snippets_with_similarity
-    WHERE similarity > match_threshold
+        AND (1 - (se.embedding <=> source_embedding)) > match_threshold
     ORDER BY similarity DESC
     LIMIT match_count
 ),
This allows the query planner to potentially use vector index operators more efficiently and avoids materializing intermediate results.
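If the <=> operator here is pgvector's cosine distance (an assumption, since the review does not name the extension), its approximate indexes are generally only considered when the query orders by the distance expression itself, ascending, with a LIMIT. Ordering by the derived similarity alias in descending order is a different expression. A hedged variant of the combined CTE that orders by distance and rewrites the threshold in terms of distance:

```sql
similar_snippets AS (
    SELECT
        s.id,
        1 - (se.embedding <=> source_embedding) AS similarity
    FROM snippet_embeddings se
    JOIN snippets s ON s.id = se.snippet
    WHERE se.snippet != snippet_id
      AND s.status = 'Processed'
      -- similarity > match_threshold  <=>  distance < 1 - match_threshold
      AND (se.embedding <=> source_embedding) < 1 - match_threshold
    -- Ascending distance is the same order as descending similarity.
    ORDER BY se.embedding <=> source_embedding
    LIMIT match_count
),
```

With an approximate index, filters are applied after the index scan, so this form can return fewer than match_count rows even when more matches exist. Whether that trade-off is acceptable depends on the data and on the index settings.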

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 29cf672 and 0744d20.

📒 Files selected for processing (1)

supabase/database/sql/search_related_snippets_public.sql (2 hunks)

🔇 Additional comments (4)

supabase/database/sql/search_related_snippets_public.sql (4)

41-57: LGTM! Effective pre-aggregation of labels.

Pre-aggregating labels into JSON at this stage is a solid optimization. It avoids repeated joins and the COALESCE correctly handles snippets without labels.

58-79: LGTM! Clean assembly of final result set.

The final_snippets CTE effectively assembles all required fields:

Language-specific summary selection using CASE is clean

Appropriate use of LEFT JOIN for optional labels

All necessary fields are included for the final JSON output

80-98: LGTM! Correct JSON construction and NULL handling.

The final JSON aggregation is properly structured and the COALESCE at line 98 correctly handles the case where no similar snippets are found.

100-101: Verify the 30-second statement_timeout is sufficient.

The function-level statement_timeout is a solid safeguard, but no existing benchmarks or load tests cover search_related_snippets_public (supabase/database/sql/search_related_snippets_public.sql lines 100–101). Please validate—ideally with production-scale data volumes, varied match_count values, and concurrent execution—that 30 s meets your SLAs, and adjust or add performance tests as needed.
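A rough way to collect such measurements from a psql session is sketched below. The parameter names are the identifiers visible in the diff. Whether each one is really a function argument, what its type is, and the values used here are all assumptions to be replaced with real ones.

```sql
BEGIN;
-- Mirror the 30-second budget for this transaction only.
SET LOCAL statement_timeout TO '30s';

-- Hypothetical call: argument names, types and values are assumptions.
EXPLAIN (ANALYZE, BUFFERS)
SELECT search_related_snippets_public(
    snippet_id      => '00000000-0000-0000-0000-000000000000',
    p_language      => 'english',
    match_threshold => 0.7,
    match_count     => 10
);
ROLLBACK;
```

EXPLAIN on a function call reports the total execution time but not the plan of the query inside the function. To see that plan, one option is the auto_explain module with auto_explain.log_nested_statements enabled, or running the CTE query directly with literal values.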

[f] Optimize fetch related snippets

0744d20

quancao-ea self-assigned this Sep 30, 2025

quancao-ea requested a review from nhphong September 30, 2025 11:53

gemini-code-assist Bot reviewed Sep 30, 2025

ellipsis-dev Bot reviewed Sep 30, 2025

coderabbitai Bot reviewed Sep 30, 2025

nhphong approved these changes Sep 30, 2025

quancao-ea merged commit ab5c558 into main Oct 1, 2025
2 checks passed

quancao-ea deleted the features/optimize-fetch-related-snippets branch October 1, 2025 03:55

coderabbitai Bot mentioned this pull request Oct 4, 2025

[f] VER-267: Adaptive retrieval for searching related snippets #23

Merged

coderabbitai Bot mentioned this pull request Jan 15, 2026

fix: resolve search timeout by using HNSW index #54

Open

3 tasks
