VER-265 - Optimize fetch related snippets #21
Conversation
Walkthrough
Refactors the SQL for search_related_snippets_public into multiple CTEs: it computes vector similarity, filters by threshold, aggregates labels as JSON, selects language-specific summaries, and assembles the final snippet fields before JSON construction. It also adjusts JOIN placement and field aliases, and sets a statement_timeout at the end.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Client
participant DB as Postgres
participant CTE1 as snippets_with_similarity
participant CTE2 as similar_snippets
participant CTE3 as label_summary
participant CTE4 as final_snippets
Client->>DB: call search_related_snippets_public(params)
DB->>CTE1: compute vector similarity for candidate snippets
CTE1-->>CTE2: filter by similarity threshold
CTE2-->>CTE3: aggregate labels per snippet (JSON)
CTE2-->>CTE4: assemble fields (summary by language, metadata)
CTE3-->>CTE4: join labels (COALESCE)
CTE4-->>DB: return finalized rows as JSON
DB-->>Client: JSON array of related snippets
note over DB: SET statement_timeout applied at end
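The pipeline in the diagram above can be sketched in plain Python to illustrate the data flow. This is only an illustration with hypothetical snippet dicts, not the PR's SQL; it assumes pgvector's `<=>` cosine-distance semantics, where similarity = 1 - distance, and borrows the function's parameter names (match_threshold, match_count).

```python
import math

def cosine_similarity(a, b):
    # pgvector's `<=>` returns cosine distance; similarity = 1 - distance,
    # which equals the normalized dot product computed here
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search_related(source_embedding, candidates, match_threshold, match_count):
    # CTE1 (snippets_with_similarity): score every candidate snippet
    scored = [
        {**snip, "similarity": cosine_similarity(source_embedding, snip["embedding"])}
        for snip in candidates
    ]
    # CTE2 (similar_snippets): filter by threshold, order by similarity, limit
    return sorted(
        (s for s in scored if s["similarity"] > match_threshold),
        key=lambda s: s["similarity"],
        reverse=True,
    )[:match_count]
```

The label aggregation and final JSON assembly (CTE3/CTE4) would then join extra fields onto the surviving rows, mirroring the LEFT JOIN with COALESCE in the SQL.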
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
🧪 Early access (Sonnet 4.5): enabled. We are currently testing the Sonnet 4.5 model, which is expected to improve code review quality. However, this model may lead to increased noise levels in the review comments. Please disable the early access features if the noise level causes any inconvenience.
Summary of Changes
Hello @quancao-ea, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request focuses on enhancing the efficiency and reliability of the search_related_snippets_public function.
Code Review
This pull request refactors the search_related_snippets_public function to improve performance by using Common Table Expressions (CTEs). The new structure correctly defers expensive joins and aggregations until after the most similar snippets have been identified and limited, which is a significant improvement. I have one suggestion to further refine the query to ensure optimal performance by combining two of the CTEs. The addition of a statement_timeout is also a great defensive measure against long-running queries.
 snippets_with_similarity AS (
     SELECT
         s.id,
-        s.title,
-        s.file_path,
-        s.recorded_at,
-        s.comment_count,
-        s.start_time,
-        a.radio_station_name,
-        a.radio_station_code,
-        a.location_state,
-        CASE
-            WHEN p_language = 'spanish' THEN s.summary ->> 'spanish'
-            ELSE s.summary ->> 'english'
-        END AS summary,
-        1 - (se.embedding <=> source_embedding) as similarity,
-        jsonb_agg(l) as labels
+        1 - (se.embedding <=> source_embedding) as similarity
     FROM snippet_embeddings se
     JOIN snippets s ON s.id = se.snippet
-    JOIN audio_files a ON a.id = s.audio_file
-    LEFT JOIN snippet_labels sl ON s.id = sl.snippet
-    LEFT JOIN labels l ON sl.label = l.id
     WHERE
         se.snippet != snippet_id
-        AND 1 - (se.embedding <=> source_embedding) > match_threshold
         AND s.status = 'Processed'
-    GROUP BY
 ),
+similar_snippets AS (
+    SELECT *
+    FROM snippets_with_similarity
+    WHERE similarity > match_threshold
+    ORDER BY similarity DESC
+    LIMIT match_count
+),
While breaking the query into multiple CTEs improves readability, separating snippets_with_similarity and similar_snippets this way could lead to a significant performance degradation. The snippets_with_similarity CTE might be materialized by the query planner, forcing a full scan of snippet_embeddings to calculate similarity for every snippet. Although modern PostgreSQL planners often optimize this by inlining the CTE, it's not guaranteed.
To ensure optimal performance and make the intent clearer, it's better to combine these two CTEs. This guarantees that the match_threshold filter is applied during the initial vector search, which is much more efficient.
similar_snippets AS (
SELECT
s.id,
1 - (se.embedding <=> source_embedding) as similarity
FROM snippet_embeddings se
JOIN snippets s ON s.id = se.snippet
WHERE
se.snippet != snippet_id
AND s.status = 'Processed'
AND 1 - (se.embedding <=> source_embedding) > match_threshold
ORDER BY similarity DESC
LIMIT match_count
),
Important
Looks good to me! 👍
Reviewed everything up to 0744d20 in 2 minutes and 1 seconds. Click for details.
- Reviewed 123 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 3 draft comments; view those below
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. supabase/database/sql/search_related_snippets_public.sql:58
- Draft comment: The ordering from the 'similar_snippets' CTE may be lost after joining in 'final_snippets'. Consider adding an explicit ORDER BY (e.g. by similarity DESC) in the final CTE or outer SELECT to ensure the JSON array is returned in the intended order.
- Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%. While the comment raises a valid theoretical concern about SQL ordering guarantees, PostgreSQL generally preserves CTE ordering in simple cases like this. The joins in final_snippets are all 1:1 joins that shouldn't affect ordering. Adding another ORDER BY would be redundant and could impact performance, so the comment seems overly cautious. I could be wrong about PostgreSQL's ordering guarantees across CTEs; there might be edge cases where the order could be affected. Even if there are edge cases, the code has worked this way, and adding another ORDER BY would be redundant in the common case. The theoretical risk doesn't justify the complexity/performance trade-off. The comment should be removed, as it suggests a change that would add complexity without clear benefit, given PostgreSQL's typical behavior with CTE ordering.
2. supabase/database/sql/search_related_snippets_public.sql:101
- Draft comment: Consider adding a comment explaining the rationale for setting 'statement_timeout' to 30s to aid future maintainers.
- Reason this comment was not posted: Comment did not seem useful. Confidence in usefulness = 30% <= threshold 50%. The comment is asking the author to add an explanation for a specific code change. While it doesn't directly ask for confirmation or testing, it suggests adding a comment for future maintainers. This is not a direct code suggestion or a request for a test, so it might not be considered useful under the given rules.
3. supabase/database/sql/search_related_snippets_public.sql:24
- Draft comment: By removing the match_threshold filter from the base similarity CTE, the function now computes similarity for all processed snippets. Consider applying the threshold earlier (if supported by your index/knn search) to reduce computation on large datasets.
- Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 50%. The comment is technically correct: more rows will be processed before filtering. However, without knowledge of the indexing strategy, table sizes, and whether the database can push down the threshold filter, it's speculative whether this is actually a significant performance issue. The comment itself acknowledges this uncertainty with "if supported by your index/knn search". I might be too quick to dismiss performance concerns, and large table scans could be very expensive, but we don't have enough context to know if this change actually causes problems. Delete the comment because it's speculative about potential performance issues without clear evidence that it's actually problematic.
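The defensive ORDER BY discussed in draft comment 1 above has a simple analogue outside SQL: re-sorting the assembled rows by similarity after all joins guarantees output order regardless of planner behavior. A sketch with hypothetical row dicts (not the function's real schema):

```python
# Rows as they might emerge from the final_snippets joins, order not guaranteed
rows_after_joins = [
    {"id": 2, "similarity": 0.81, "labels": []},
    {"id": 1, "similarity": 0.93, "labels": ["politics"]},
]
# The Python equivalent of adding ORDER BY similarity DESC to the outer SELECT
ordered = sorted(rows_after_joins, key=lambda r: r["similarity"], reverse=True)
```

Whether the extra sort is worth it is exactly the redundancy-versus-safety trade-off the draft comment weighs.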
Workflow ID: wflow_RQQvtBmbGPFpsTWP
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 0
🧹 Nitpick comments (1)
supabase/database/sql/search_related_snippets_public.sql (1)
23-33: Consider applying the similarity threshold earlier.

The snippets_with_similarity CTE computes similarity for all snippets before filtering by match_threshold in the next CTE. For large datasets, this could be inefficient, as it calculates similarity scores that will be discarded. Consider combining the first two CTEs and applying the threshold filter directly:

-WITH
-snippets_with_similarity AS (
+WITH similar_snippets AS (
     SELECT
         s.id,
         1 - (se.embedding <=> source_embedding) as similarity
     FROM snippet_embeddings se
     JOIN snippets s ON s.id = se.snippet
     WHERE
         se.snippet != snippet_id
         AND s.status = 'Processed'
-),
-similar_snippets AS (
-    SELECT *
-    FROM snippets_with_similarity
-    WHERE similarity > match_threshold
+    AND (1 - (se.embedding <=> source_embedding)) > match_threshold
     ORDER BY similarity DESC
     LIMIT match_count
 ),

This allows the query planner to potentially use vector index operators more efficiently and avoids materializing intermediate results.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
supabase/database/sql/search_related_snippets_public.sql (2 hunks)
🔇 Additional comments (4)
supabase/database/sql/search_related_snippets_public.sql (4)
41-57: LGTM! Effective pre-aggregation of labels. Pre-aggregating labels into JSON at this stage is a solid optimization: it avoids repeated joins, and the COALESCE correctly handles snippets without labels.
58-79: LGTM! Clean assembly of final result set. The final_snippets CTE effectively assembles all required fields:
- Language-specific summary selection using CASE is clean
- Appropriate use of LEFT JOIN for optional labels
- All necessary fields are included for the final JSON output
80-98: LGTM! Correct JSON construction and NULL handling. The final JSON aggregation is properly structured, and the COALESCE at line 98 correctly handles the case where no similar snippets are found.
100-101: Verify the 30-second statement_timeout is sufficient. The function-level statement_timeout is a solid safeguard, but no existing benchmarks or load tests cover search_related_snippets_public (supabase/database/sql/search_related_snippets_public.sql lines 100-101). Please validate, ideally with production-scale data volumes, varied match_count values, and concurrent execution, that 30 s meets your SLAs, and adjust or add performance tests as needed.
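A minimal shape for the validation asked for above is a timing harness around the call. This is a sketch under stated assumptions: run_search is a hypothetical stand-in for invoking search_related_snippets_public through your database client, and a real run would need production-scale data and concurrent load.

```python
import time

def time_call(fn, *args):
    # Measure wall-clock duration of a single call
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def run_search(match_count):
    # Placeholder: swap in a real query against production-scale data
    return list(range(match_count))

TIMEOUT_S = 30.0  # must stay under the function's statement_timeout
for match_count in (5, 20, 100):
    _, elapsed = time_call(run_search, match_count)
    assert elapsed < TIMEOUT_S, f"match_count={match_count} exceeded timeout"
```

Running this against varied match_count values under concurrency gives concrete evidence for (or against) the 30 s budget.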
Important
Optimize search_related_snippets_public with CTEs for performance and readability improvements, adding a 30-second timeout.
- Refactors search_related_snippets_public to use the CTEs snippets_with_similarity, similar_snippets, label_summary, and final_snippets for better performance and readability.
- Adds a statement_timeout of 30 seconds to prevent long-running queries.
- Filters by similarity > match_threshold and limits results to match_count.
- Aggregates labels in the label_summary CTE.
- Assembles the final fields in the final_snippets CTE.

This description was created by
for 0744d20. You can customize this summary. It will automatically update as commits are pushed.