VER-305: Integrate knowledge base with Stage 1 - Initial detection process (#67)
Conversation
Use semantic search to retrieve relevant verified facts from the KB and inject them into both the initial and main detection prompts, helping Gemini better identify known disinformation and avoid false positives.

- Add kb_context.py: chunked embedding + deduplication across chunks to cover all topics in 30-minute radio broadcasts
- Refactor detection prompts as templates with .format() placeholders for kb_context, metadata, and transcription
- Thread the OpenAI client and kb_context through flows, tasks, and executors
- Add KB_STAGE1_CHUNK_SIZE and KB_STAGE1_MATCH_COUNT_PER_CHUNK constants
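As a rough illustration of the template refactor, here is a minimal sketch of filling one detection prompt template with `.format()`. The placeholder names come from the description above and the template path from the files changed in this PR; the variables and their values are hypothetical, not the actual pipeline code.

```python
from pathlib import Path

# Hypothetical inputs; in the pipeline these come from earlier tasks.
kb_context = "- Verified fact: ..."                     # from fetch_kb_context()
metadata = {"station": "WXYZ", "duration_minutes": 30}  # audio file metadata
transcription = "Transcribed broadcast text..."

# Template path taken from the files changed in this PR.
template = Path("prompts/stage_1/main/detection_user_prompt.md").read_text()
prompt = template.format(
    kb_context=kb_context,
    metadata=metadata,
    transcription=transcription,
)
```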
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the disinformation detection capabilities in Stage 1 of the processing pipeline by integrating a knowledge base. The system can now retrieve verified facts relevant to a given transcription and provide them to the LLM, guiding it to more accurately identify disinformation and avoid flagging truthful content. This integration aims to reduce false positives and improve the overall reliability of the detection process.
Code Review
This pull request successfully integrates a knowledge base into the Stage 1 detection process. The changes include updating LLM prompts to utilize knowledge base context, adding a new kb_context.py module for retrieving facts, and modifying processing flows to incorporate this new context. The implementation is solid, though I've identified a minor code duplication in src/processing_pipeline/stage_1/flows.py for fetching the knowledge base context. I've suggested a refactoring to improve maintainability. Overall, the changes are well-aligned with the feature's goal.
```python
# Fetch KB context using the initial transcription
initial_transcription = stage_1_llm_response.get("initial_transcription", "")
kb_context = fetch_kb_context(supabase_client, openai_client, initial_transcription)
```
This logic for fetching the knowledge base context is duplicated in the regenerate_timestamped_transcript flow on lines 262-264. To improve maintainability and adhere to the DRY (Don't Repeat Yourself) principle, consider extracting this logic into a private helper function.
For example:
```python
def _get_kb_context_for_response(
    supabase_client: SupabaseClient,
    openai_client: OpenAI,
    stage_1_llm_response: dict,
) -> str | None:
    """Fetches KB context for a given stage 1 LLM response."""
    initial_transcription = stage_1_llm_response.get("initial_transcription", "")
    return fetch_kb_context(supabase_client, openai_client, initial_transcription)
```

This helper can then be called in both redo_main_detection and regenerate_timestamped_transcript to reduce redundancy.
Walkthrough

This PR integrates knowledge base context retrieval into the Stage 1 detection pipeline. It introduces a new kb_context module that retrieves and formats KB entries using OpenAI embeddings and Supabase queries, updates prompt templates with knowledge base guidance sections, modifies executor and task signatures to accept kb_context parameters, and wires OpenAI client creation throughout the flow.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Flow as Stage 1 Flow
    participant Audio as process_audio_file()
    participant Initial as initial_disinformation_<br/>detection_with_gemini()
    participant KB as KB Context<br/>Retrieval
    participant OpenAI as OpenAI Client<br/>(Embeddings)
    participant Supabase as Supabase KB<br/>Query
    participant Detect as Disinformation<br/>Detection Executor

    Flow->>Audio: Call with openai_client
    Audio->>Initial: Run with transcription
    Initial->>KB: fetch_kb_context(transcription)
    KB->>OpenAI: Embed text chunks
    OpenAI-->>KB: Return embeddings
    KB->>Supabase: Query KB with embeddings
    Supabase-->>KB: Return matching entries
    KB-->>Initial: Return formatted KB context
    Initial->>Detect: Run with kb_context
    Detect-->>Initial: Return detection results
```
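Taken together, the walkthrough and diagram suggest roughly the shape sketched below. This is a hedged reconstruction, not the actual module: `search_kb_entries`, the `seen` dict, and the three constants are named elsewhere in this review, but their values, the keyword arguments, the embedding model, and the entry fields (`id`, `content`) are all assumptions.

```python
from openai import OpenAI

# Constants named in the PR; these values are illustrative assumptions.
KB_STAGE1_CHUNK_SIZE = 4000
KB_STAGE1_MATCH_COUNT_PER_CHUNK = 5
KB_SEARCH_MATCH_THRESHOLD = 0.8


def _split_into_chunks(text: str, chunk_size: int) -> list[str]:
    # Fixed-size character windows, matching the original implementation
    # discussed in the nitpick comments below.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def fetch_kb_context(supabase_client, openai_client: OpenAI, transcription: str) -> str | None:
    """Embed transcription chunks, search the KB, and format deduplicated matches."""
    chunks = _split_into_chunks(transcription, KB_STAGE1_CHUNK_SIZE)
    if not chunks:
        return None

    # One embeddings request covering all chunks; the model name is an assumption.
    response = openai_client.embeddings.create(
        model="text-embedding-3-small", input=chunks
    )

    # Deduplicate entries matched by multiple chunks, keyed by entry id.
    seen: dict = {}
    for item in response.data:
        entries = supabase_client.search_kb_entries(
            embedding=item.embedding,
            match_threshold=KB_SEARCH_MATCH_THRESHOLD,
            match_count=KB_STAGE1_MATCH_COUNT_PER_CHUNK,
        )
        for entry in entries:
            seen.setdefault(entry["id"], entry)

    if not seen:
        return None
    return "\n".join(f"- {entry['content']}" for entry in seen.values())
```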
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
⚠️ Warning: There were issues while running some tools. Pylint (4.0.5) reported errors for src/processing_pipeline/stage_1/kb_context.py, constants.py, and executors.py (output truncated).
🧹 Nitpick comments (3)
src/processing_pipeline/stage_1/flows.py (1)
298-307: Inconsistent error handling between Gemini and OpenAI client creation.

_create_gemini_client() returns None when the API key is missing, while _create_openai_client() raises a ValueError. This inconsistency could lead to confusing behavior:

- Missing GOOGLE_GEMINI_KEY → silent None → runtime error later when gemini_client is used
- Missing OPENAI_API_KEY → immediate ValueError with a clear message

Consider aligning the behavior for consistency.
♻️ Suggested fix for consistent error handling
```diff
 def _create_gemini_client() -> genai.Client | None:
     gemini_key = os.getenv("GOOGLE_GEMINI_KEY")
-    return genai.Client(api_key=gemini_key) if gemini_key else None
+    if not gemini_key:
+        raise ValueError("GOOGLE_GEMINI_KEY environment variable is not set")
+    return genai.Client(api_key=gemini_key)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/processing_pipeline/stage_1/flows.py` around lines 298-307: _create_gemini_client currently returns None when GOOGLE_GEMINI_KEY is unset, causing inconsistent behavior with _create_openai_client, which raises a ValueError; change _create_gemini_client to validate the env var and raise a ValueError with a clear message if GOOGLE_GEMINI_KEY is missing (mirroring the pattern used in _create_openai_client) so callers always get a concrete client or an explicit error.

src/processing_pipeline/stage_1/kb_context.py (2)
52-58: Character-based chunking may split words or sentences mid-stream.

The current implementation splits text at fixed character boundaries regardless of word or sentence boundaries. This could result in incomplete or nonsensical chunks being embedded, potentially degrading KB search quality.

Consider splitting on sentence or paragraph boundaries, or at minimum on whitespace near the chunk boundary.
♻️ Suggested improvement for word-boundary-aware chunking
```diff
 def _split_into_chunks(text: str, chunk_size: int) -> list[str]:
     chunks = []
-    for i in range(0, len(text), chunk_size):
-        chunk = text[i:i + chunk_size]
-        if chunk:
-            chunks.append(chunk)
+    start = 0
+    while start < len(text):
+        end = start + chunk_size
+        if end < len(text):
+            # Try to find a whitespace near the boundary to avoid splitting words
+            space_idx = text.rfind(' ', start, end)
+            if space_idx > start:
+                end = space_idx + 1
+        chunk = text[start:end].strip()
+        if chunk:
+            chunks.append(chunk)
+        start = end
     return chunks
```
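As a quick sanity check of the boundary-aware version, a usage sketch (the input string and chunk size are arbitrary):

```python
# Assumes the boundary-aware _split_into_chunks from the diff above.
chunks = _split_into_chunks("The quick brown fox jumps over the lazy dog.", chunk_size=16)
print(chunks)  # ['The quick brown', 'fox jumps over', 'the lazy dog.']
```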
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/processing_pipeline/stage_1/kb_context.py` around lines 52-58: the _split_into_chunks function currently slices by fixed character windows and can cut words or sentences; make it boundary-aware by finding the nearest safe break (sentence or whitespace) before the chunk_size limit. For example, attempt to split on sentence boundaries (using a sentence tokenizer like nltk.sent_tokenize or a simple regex for sentence-ending punctuation); if none is available within the window, fall back to the last whitespace before chunk_size (str.rfind(' ', 0, i + chunk_size) or similar), and only hard-split if no whitespace exists. Update _split_into_chunks to iterate through the text advancing by these boundary-aware cut points and ensure it still returns list[str].
25-40: Consider handling partial failures during KB search loop.

If the OpenAI embeddings call succeeds but a subsequent search_kb_entries call fails mid-loop, all progress is lost. While the task-level retry (in fetch_kb_context) will restart the operation, this could be inefficient for large transcriptions.

For resilience, consider catching and logging individual search failures while continuing with other chunks.
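A minimal sketch of that pattern, assuming the loop shape described in the AI-agent prompt below; the `embeddings` and `seen` variables, the logger, and the keyword arguments are assumptions:

```python
import logging

logger = logging.getLogger(__name__)

# Inside the KB search loop: tolerate per-chunk failures so entries
# already collected in `seen` are preserved.
for i, embedding in enumerate(embeddings):
    try:
        entries = supabase_client.search_kb_entries(
            embedding=embedding,
            match_threshold=KB_SEARCH_MATCH_THRESHOLD,
            match_count=KB_STAGE1_MATCH_COUNT_PER_CHUNK,
        )
    except Exception:
        logger.exception("KB search failed for chunk %d; skipping", i)
        continue
    for entry in entries:
        seen.setdefault(entry["id"], entry)
```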
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/processing_pipeline/stage_1/kb_context.py` around lines 25 - 40, The KB search loop can fail mid-iteration and lose all progress; wrap the call to supabase_client.search_kb_entries inside a try/except in the embeddings loop (the block that iterates over embeddings and calls supabase_client.search_kb_entries with KB_SEARCH_MATCH_THRESHOLD and KB_STAGE1_MATCH_COUNT_PER_CHUNK), log the exception (including the embedding index or a short context) and continue to the next embedding so already-collected entries in the seen dict are preserved; keep updating seen as before when results are returned and ensure fetch_kb_context’s task-level retry still applies for total failure cases.
📒 Files selected for processing (7)
- prompts/stage_1/main/detection_user_prompt.md
- prompts/stage_1/preprocess/initial_detection_user_prompt.md
- src/processing_pipeline/stage_1/constants.py
- src/processing_pipeline/stage_1/executors.py
- src/processing_pipeline/stage_1/flows.py
- src/processing_pipeline/stage_1/kb_context.py
- src/processing_pipeline/stage_1/tasks.py