
Integrate Gemini 2.5 into the Analysis Pipeline#16

Merged
quancao-ea merged 9 commits into main from
features/integrate-gemini-2.0-flash-exp
Sep 17, 2025

Conversation

@nhphong
Collaborator

@nhphong nhphong commented Dec 30, 2024

Summary by CodeRabbit

  • New Features

    • New Gemini 2.5 transcription pipeline; disinformation snippets are now surfaced for reviewer workflows.
  • Improvements

    • Migrated to Google Gemini 2.5 (Pro/Flash) for transcription and analysis.
    • Stronger safety controls (including civic integrity) and improved JSON validation/fallbacks.
    • Streamlined reviewer prompt for clearer, transcription-focused guidance.
    • Switched audio/data retrieval to Supabase.
  • Chores

    • Added google-genai and upgraded supabase dependency.

@nhphong nhphong self-assigned this Dec 30, 2024
@coderabbitai

coderabbitai Bot commented Dec 30, 2024

Walkthrough

Adds Google GenAI v2 client usage and Gemini 2.5 model identifiers across the pipeline, replaces the Gemini 1206 transcription generator with a Gemini 2.5 Pro implementation, migrates stages to client.models.generate_content with GenerateContentConfig/SafetySetting, updates Supabase/main flows, and bumps dependencies and tests.

Changes

Cohort / File(s) Change Summary
Dependencies
requirements.txt
Added google-genai==1.11.0; upgraded supabase 2.9.0 → 2.15.0.
Model constants & prompts
src/processing_pipeline/constants.py, prompts/*
Added GEMINI_2_5_PRO and GEMINI_2_5_FLASH; added prompt/schema loader functions (get_timestamped_transcription_generation_prompt(), get_timestamped_transcription_generation_output_schema()); renamed prompt loader to get_gemini_2_5_pro_transcription_generation_prompt(); updated prompts/Stage_4_review_prompt.md.
Transcription generator: removal & addition
src/processing_pipeline/gemini_1206_transcription_generator.py (removed), src/processing_pipeline/gemini_2_5_pro_transcription_generator.py (added)
Removed Gemini1206TranscriptionGenerator; added Gemini25ProTranscriptionGenerator using genai.Client for file upload, polling, client.models.generate_content with GenerateContentConfig/ThinkingConfig and SafetySetting list, and file cleanup.
Stage migrations (1, 1 preprocess, 3, timestamped, 4)
src/processing_pipeline/stage_1.py, src/processing_pipeline/stage_1_preprocess.py, src/processing_pipeline/stage_3.py, src/processing_pipeline/timestamped_transcription_generator.py, src/processing_pipeline/stage_4.py
Replaced legacy google.generativeai usage with google.genai.Client; switched to client.models.generate_content and GenerateContentConfig/SafetySetting lists; updated model constants to GEMINI_2_5_*; Stage4 now constructs a GoogleSearch Tool in the config, parses response.text and falls back to schema-driven validation via the client when parsing fails; grounding metadata extraction preserved.
Supabase, main, and utils
src/processing_pipeline/supabase_utils.py, src/main.py
Replaced S3-based test harness with Supabase snippet fetch; prepare_snippet_for_review now returns (transcription, disinformation_snippet, metadata, analysis_json); Stage4Executor.run updated to accept disinformation_snippet; reviewed_by hardcoded value switched to GEMINI_2_5_PRO.
Tests (many modules)
tests/processing_pipeline/*, tests/test_supabase_utils.py
Updated tests to mock google.genai.Client and client.models.generate_content, adjusted expected call signatures/configs and response shapes (response.text, response.candidates), switched assertions to use GEMINI_2_5_PRO, added tests for Gemini25ProTranscriptionGenerator, and removed old 1206 tests.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Caller as Main / Test
    participant Supabase as SupabaseClient
    participant Stage4 as Stage4Executor
    participant GenAI as google.genai.Client
    participant Tool as GoogleSearchTool

    Caller->>Supabase: fetch_snippet(id)
    Supabase-->>Caller: snippet (transcription, metadata, analysis_json, disinformation_snippet)
    Caller->>Stage4: run(transcription, disinformation_snippet, metadata, analysis_json)
    Stage4->>GenAI: client.models.generate_content(model=GEMINI_2_5_PRO, contents, config{tools, safety,...})
    Note right of GenAI: config may include GoogleSearch Tool and SafetySetting list
    GenAI-->>Stage4: response (text, candidates...)
    alt response.text parses as JSON
        Stage4->>Stage4: extract JSON and grounding_metadata
    else parse fails
        Stage4->>GenAI: client.models.generate_content(model=GEMINI_2_5_PRO, contents=validation_prompt, config{schema/mime})
        GenAI-->>Stage4: validated JSON text
        Stage4->>Stage4: extract JSON and grounding_metadata
    end
    Stage4->>Supabase: submit_snippet_review(result, reviewed_by=GEMINI_2_5_PRO)
    Supabase-->>Stage4: ack
    Stage4-->>Caller: processed result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

"A Rabbit's Note on the Merge"
I hopped through constants, changed the GenAI trail,
New 2.5 prompts and clients set sail.
Tools, safety, tests—cleaned up the nest,
Snippets fetched, validated, and given rest.
Nibble, run, and then—retest! 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
Title Check (✅ Passed): The PR title "Integrate Gemini 2.5 into the Analysis Pipeline" is a concise, single-sentence summary that accurately captures the main change (migrating the pipeline and tests to Gemini 2.5 model identifiers and the new genai client API), so it is clear and relevant for reviewers scanning history.
✨ Finishing touches
  • 📝 Generate Docstrings
  • 🧪 Generate unit tests:
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch features/integrate-gemini-2.0-flash-exp

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Pylint (3.3.7)
src/processing_pipeline/stage_4.py

************* Module .pylintrc
.pylintrc:1:0: F0011: error while parsing the configuration: File contains no section headers.
file: '.pylintrc', line: 1
'disable=C0116\n' (config-parse-error)
[
{
"type": "convention",
"module": "src.processing_pipeline.stage_4",
"obj": "",
"line": 103,
"column": 0,
"endLine": null,
"endColumn": null,
"path": "src/processing_pipeline/stage_4.py",
"symbol": "line-too-long",
"message": "Line too long (118/100)",
"message-id": "C0301"
},
{
"type": "convention",
"module": "src.processing_pipeline.stage_4",
"obj": "",
"line": 158,
"column": 0,
"endLine": null,
"endColumn": null,
"path": "src/processing_pipeline/stage_4.py",
"symbol": "line-too-long",
"message": "Line too long (116/100)",
"message-id": "C0301"
},
{
"type": "convention",
"module": "src

... [truncated 11710 characters] ...

eline.stage_4",
"obj": "Stage4Executor.__ensure_json_format",
"line": 332,
"column": 8,
"endLine": 335,
"endColumn": 111,
"path": "src/processing_pipeline/stage_4.py",
"symbol": "no-else-return",
"message": "Unnecessary "else" after "return", remove the "else" and de-indent the code inside it",
"message-id": "R1705"
},
{
"type": "refactor",
"module": "src.processing_pipeline.stage_4",
"obj": "Stage4Executor",
"line": 190,
"column": 0,
"endLine": 190,
"endColumn": 20,
"path": "src/processing_pipeline/stage_4.py",
"symbol": "too-few-public-methods",
"message": "Too few public methods (1/2)",
"message-id": "R0903"
}
]


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
src/processing_pipeline/stage_4.py (2)

6-14: Remove unused imports.

HarmCategory and HarmBlockThreshold are never referenced in the code. They can be safely removed to reduce clutter and align with the static analysis suggestions.

🧰 Tools
🪛 Ruff (0.8.2)

11-11: google.genai.types.HarmCategory imported but unused

Remove unused import

(F401)


12-12: google.genai.types.HarmBlockThreshold imported but unused

Remove unused import

(F401)


318-318: Potential JSON parsing exception.

json.loads(response.text) will raise if the response is invalid JSON. Although the subsequent "is_convertible": false logic catches that case, consider logging the partial response text to aid debugging when an unhandled exception occurs.
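As a minimal sketch of the logging suggestion, a guarded parser that returns None on failure and logs a short preview of the raw text (the function name parse_model_response is hypothetical, not from the PR):

```python
import json
import logging

logger = logging.getLogger(__name__)

def parse_model_response(text: str):
    """Parse a model response as JSON, logging a preview of the raw text on failure."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        # Log only a short prefix so very large responses don't flood the logs
        logger.warning("Response is not valid JSON (%s); first 200 chars: %r", e, text[:200])
        return None

# Well-formed input parses; malformed input returns None instead of raising
assert parse_model_response('{"is_convertible": false}') == {"is_convertible": False}
assert parse_model_response("Sure! Here is the JSON: {...") is None
```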

src/processing_pipeline/constants.py (2)

47-47: Consider file handling guard.

In get_timestamped_transcription_generation_prompt(), if the file is missing or locked, an exception is raised. Consider a try/except or with open(...) as f pattern for safer file operations.


51-51: Add fallback logic for missing JSON file.

Similar to the prompt file, consider gracefully handling file IO errors.
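A minimal sketch of the safer file-handling pattern suggested for both loaders, using a context manager and an explicit fallback (load_json_schema is a hypothetical helper, not code from the PR; the file names are for illustration only):

```python
import json
import os
import tempfile
from pathlib import Path

def load_json_schema(path: str, default=None):
    """Read a JSON schema file; return `default` if the file is missing or unreadable."""
    try:
        # The context manager guarantees the file descriptor is closed, even on error
        with Path(path).open("r", encoding="utf-8") as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError):
        return default

# Demonstrate with a temporary directory
with tempfile.TemporaryDirectory() as d:
    schema_path = os.path.join(d, "schema.json")
    Path(schema_path).write_text('{"type": "object"}', encoding="utf-8")
    assert load_json_schema(schema_path) == {"type": "object"}
    assert load_json_schema(os.path.join(d, "missing.json"), default={}) == {}
```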

src/processing_pipeline/supabase_utils.py (1)

255-255: Avoid hard-coding in production.

"reviewed_by": GEMINI_2_0_FLASH_EXP is explicitly set. Consider passing the model name as a parameter or storing in a configuration.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 248d06e and e6e2c84.

📒 Files selected for processing (6)
  • requirements.txt (1 hunks)
  • src/processing_pipeline/constants.py (2 hunks)
  • src/processing_pipeline/stage_4.py (5 hunks)
  • src/processing_pipeline/supabase_utils.py (2 hunks)
  • tests/processing_pipeline/test_stage_4.py (8 hunks)
  • tests/test_supabase_utils.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • requirements.txt
🧰 Additional context used
🪛 Ruff (0.8.2)
src/processing_pipeline/stage_4.py

11-11: google.genai.types.HarmCategory imported but unused

Remove unused import

(F401)


12-12: google.genai.types.HarmBlockThreshold imported but unused

Remove unused import

(F401)

🔇 Additional comments (19)
src/processing_pipeline/stage_4.py (5)

18-18: Model constant added.

The definition of GEMINI_2_0_FLASH_EXP ensures alignment with the new model version. This change looks good.


208-210: Ensure consistent environment usage.

We set gemini_key from GOOGLE_GEMINI_PAID_KEY. Confirm that future references (e.g., in __ensure_json_format) safely rely on the same or a compatible environment variable.


221-251: Confirm safety settings approach.

All categories are set to "BLOCK_NONE". Verify if this is the intended behavior, as it permits any content.


268-269: Different environment variable for second prompt.

Here we read GOOGLE_GEMINI_KEY instead of GOOGLE_GEMINI_PAID_KEY. Ensure that using a separate key for the JSON-formatting prompt is intentional.


287-315: High token limit.

max_output_tokens=8192 may be memory-intensive. Confirm that you won't encounter performance or quota issues.

src/processing_pipeline/constants.py (1)

5-5: Added new model constant.

GEMINI_2_0_FLASH_EXP = "gemini-2.0-flash-exp" looks good.

src/processing_pipeline/supabase_utils.py (1)

3-3: Introduced new constant import.

Pulling in GEMINI_2_0_FLASH_EXP aligns with the move to the new model.

tests/processing_pipeline/test_stage_4.py (10)

3-3: Extended mock import usage.

Importing ANY from unittest.mock enhances test flexibility. This increment in mocking scope looks good.


17-17: Retain legacy constant for coverage.

GEMINI_1_5_PRO is still used in tests to confirm backward compatibility or second call usage.


36-45: Mocking the new client approach.

These lines correctly patch google.genai.Client and configure the model. Ensure any additional calls to client in the code are also mocked if needed.


102-103: Multiple response candidates.

Adding candidates helps test grounding metadata usage. Good improvement for capturing more test scenarios.


104-108: Safe usage of models.generate_content mock.

Switching to models.generate_content for test calls is consistent with the library changes.


459-459: Multiple calls in single test.

Using side_effect with a list is effective for testing consecutive calls. Nice approach to verifying each call in sequence.


475-476: Validate first model call.

Ensuring the first call uses GEMINI_2_0_FLASH_EXP demonstrates that the pipeline picks the correct primary model.


480-482: Validate second model call.

The second call uses GEMINI_1_5_PRO. If this fallback is intentional for JSON validation, the test is correct.


598-598: Simulate empty response.

"is_convertible": false is a good scenario to ensure the snippet is not updated. This negative path coverage is valuable.


744-744: API error tested.

Raising Exception("API Error") confirms Stage4Executor.run logic handles generative call failures.

tests/test_supabase_utils.py (2)

4-4: Removed reference to old constant.

Importing GEMINI_2_0_FLASH_EXP ensures tests align with the new model references in production code.


699-699: Confirm new reviewed_by constant usage.

The test’s updated "reviewed_by" field matches the main code’s move to GEMINI_2_0_FLASH_EXP.

@nhphong nhphong changed the title Integrate Gemini 2.0 Flash Exp into Stage 4 (Review) of the Pipeline Integrate Gemini 2.5 into Stage 4 (Review) of the Pipeline Sep 15, 2025

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/main.py (1)

37-38: Remove or clarify the outdated comment.

The comment about changing the paid key to free key appears to be a TODO or note. Since the code doesn't actually perform this change, this comment should either be removed or clarified with more context about when/how this should be done.

-    # We need to change the paid key to the free key in stage 4
+    # TODO: Consider using free tier key for testing to avoid costs

Or simply remove it if it's no longer relevant.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e6e2c84 and c72eb00.

📒 Files selected for processing (8)
  • prompts/Stage_4_review_prompt.md (3 hunks)
  • requirements.txt (1 hunks)
  • src/main.py (1 hunks)
  • src/processing_pipeline/constants.py (2 hunks)
  • src/processing_pipeline/stage_4.py (5 hunks)
  • src/processing_pipeline/supabase_utils.py (2 hunks)
  • tests/processing_pipeline/test_stage_4.py (8 hunks)
  • tests/test_supabase_utils.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • requirements.txt
  • src/processing_pipeline/constants.py
  • tests/test_supabase_utils.py
🧰 Additional context used
🧬 Code graph analysis (2)
src/main.py (2)
src/processing_pipeline/stage_4.py (3)
  • Stage4Executor (189-334)
  • prepare_snippet_for_review (27-50)
  • run (196-270)
src/processing_pipeline/supabase_utils.py (1)
  • get_snippet_by_id (29-31)
tests/processing_pipeline/test_stage_4.py (2)
tests/processing_pipeline/test_stage_3.py (1)
  • mock_gemini_model (40-59)
tests/processing_pipeline/test_stage_1.py (1)
  • mock_gemini_model (71-77)
🪛 markdownlint-cli2 (0.17.2)
prompts/Stage_4_review_prompt.md

161-161: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


174-174: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

🪛 Ruff (0.12.2)
src/processing_pipeline/stage_4.py

259-259: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (15)
prompts/Stage_4_review_prompt.md (3)

145-145: LGTM!

The header change from "Primacy of Transcription and Metadata" to "Transcription and Metadata" maintains clarity while removing unnecessary emphasis. The substantive guideline remains unchanged.


161-161: Important accuracy directive added.

The new instruction to verify and correct inaccurate claims in the Analysis JSON is crucial for Stage 4's review function and aligns well with the PR's integration of the new Gemini model. This ensures reviewers are explicitly prompted to validate rather than accept claims at face value.


174-174: Simplification of wording is appropriate.

Changing "well-written" to "good" maintains the intent while using simpler language that's less subjective.

src/processing_pipeline/supabase_utils.py (1)

3-3: Model constant update looks good.

The import and usage of GEMINI_2_5_PRO correctly replaces GEMINI_1_5_PRO for the reviewed_by field, maintaining consistency with the new model integration across the pipeline.

Also applies to: 255-255

src/processing_pipeline/stage_4.py (6)

1-12: LGTM! Clean migration to Google GenAI v2 API.

The imports correctly reflect the new API surface with google.genai.Client and necessary configuration types. The removal of the old generativeai imports is complete.


207-209: API migration implemented correctly.

The transition from GenerativeModel to the new Client-based API is properly implemented. The google_search_tool is correctly instantiated for grounding support.


220-251: Comprehensive safety settings and proper API usage.

The generate_content call correctly uses the new API structure with:

  • Proper GenerateContentConfig setup
  • Google Search tool integration for grounding
  • Comprehensive safety settings including HARM_CATEGORY_CIVIC_INTEGRITY
  • Appropriate response modality set to TEXT

This aligns well with the PR's goal of integrating the new Gemini model.


253-265: Good JSON extraction logic with proper fallback.

The implementation correctly:

  1. Attempts to extract JSON from the response text
  2. Handles cases where the response might contain additional text around the JSON
  3. Falls back to __ensure_json_format for validation when initial parsing fails

This robust approach ensures reliable JSON output.
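The extract-JSON-from-surrounding-text step can be sketched with a simple brace-balance scan (a hypothetical stand-in for the PR's implementation; note it does not handle braces inside JSON string values):

```python
import json

def extract_json_block(text: str):
    """Extract the first top-level {...} object from text that may contain prose around it."""
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:  # found the matching closing brace
                try:
                    return json.loads(text[start : i + 1])
                except json.JSONDecodeError:
                    return None
    return None  # unbalanced braces: no complete object

reply = 'Here is my analysis:\n{"verdict": "accurate", "score": 0.9}\nLet me know!'
assert extract_json_block(reply) == {"verdict": "accurate", "score": 0.9}
assert extract_json_block("no json here") is None
```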


278-279: Verify the use of different API keys for fallback.

The fallback method uses GOOGLE_GEMINI_KEY while the main method uses GOOGLE_GEMINI_PAID_KEY. Also note that the fallback uses GEMINI_1_5_PRO instead of GEMINI_2_5_PRO.

Is this intentional to use different API keys and models for the fallback validation? If the paid key has higher limits, it might make sense to use it for both calls to ensure consistency. Also, consider whether using the older model for validation might introduce discrepancies.


297-327: Excellent structured JSON validation approach.

The fallback validation correctly:

  • Uses response_mime_type and response_schema for strict JSON output
  • Includes the same comprehensive safety settings
  • Properly validates convertibility before returning

This ensures the output always conforms to the expected schema.

src/main.py (2)

1-7: LGTM! Clean migration to Supabase-based workflow.

The imports correctly reflect the removal of AWS/S3 dependencies and the addition of Supabase and Stage 4 components, aligning with the PR's objectives.


15-35: Test harness properly updated for new Stage 4 API.

The test function correctly:

  1. Fetches a snippet from Supabase using the new client
  2. Calls prepare_snippet_for_review with the correct signature
  3. Passes all four required parameters to Stage4Executor.run
  4. Includes the new disinformation_snippet parameter

The structured output formatting with JSON pretty-printing improves readability.

tests/processing_pipeline/test_stage_4.py (3)

36-109: Test mock correctly updated for new GenAI Client API.

The mock setup properly reflects the new API structure:

  • Creates a Client mock with models attribute
  • Configures models.generate_content to return the expected response
  • Includes comprehensive test data matching the expected schema
  • Sets up candidates with grounding metadata

This ensures tests accurately validate the new integration.
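The mock shape described above can be reproduced with plain MagicMock objects, without the google-genai package installed (in the real tests one would patch google.genai.Client to return such a mock; the response fields here are illustrative):

```python
import json
from unittest.mock import MagicMock

# Build a stand-in for google.genai.Client with a models.generate_content attribute
mock_client = MagicMock()
mock_response = MagicMock()
mock_response.text = json.dumps({"is_convertible": True, "verdict": "accurate"})
mock_response.candidates = [MagicMock(grounding_metadata=None)]
mock_client.models.generate_content.return_value = mock_response

# Code under test would call: client.models.generate_content(model=..., contents=..., config=...)
result = mock_client.models.generate_content(model="gemini-2.5-pro", contents="prompt", config=None)
assert json.loads(result.text)["is_convertible"] is True
mock_client.models.generate_content.assert_called_once()
```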


459-459: Test assertions correctly validate the new API calls.

The test properly verifies:

  1. The mock client is called twice (main + fallback)
  2. First call uses GEMINI_2_5_PRO with the new API signature
  3. Second call uses GEMINI_1_5_PRO for JSON validation
  4. Both calls use the correct parameter structure with model, contents, and config

Also applies to: 475-482


598-641: Test coverage for error scenarios is comprehensive.

The tests properly handle:

  • Empty/invalid responses with is_convertible: false
  • API errors with appropriate exception handling
  • Missing fields and invalid data

The mock setup correctly simulates these error conditions using the new API structure.

Also applies to: 744-744

@quancao-ea quancao-ea changed the title Integrate Gemini 2.5 into Stage 4 (Review) of the Pipeline Integrate Gemini 2.5 into the Analysis Pipeline Sep 17, 2025

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/processing_pipeline/constants.py (1)

1-7: Fix FD leaks and set UTF‑8 encoding for all prompt/schema loaders.

Current open(...).read()/json.load(open(...)) leak fds and rely on locale encodings. Use Path helpers with explicit UTF‑8.

+from pathlib import Path
 import json
 
 GEMINI_1_5_PRO = "gemini-1.5-pro-002"
 GEMINI_1_5_FLASH = "gemini-1.5-flash"
 GEMINI_2_5_FLASH = "gemini-2.5-flash"
 GEMINI_2_5_PRO = "gemini-2.5-pro"
 
+def _read_text(path: str) -> str:
+    return Path(path).read_text(encoding="utf-8")
+
+def _read_json(path: str):
+    return json.loads(Path(path).read_text(encoding="utf-8"))
 
 def get_transcription_prompt_for_stage_1_preprocess():
-    return open("prompts/Stage_1_Preprocess_transcription_prompt.md", "r").read()
+    return _read_text("prompts/Stage_1_Preprocess_transcription_prompt.md")
 
 def get_detection_prompt_for_stage_1():
-    return open("prompts/Stage_1_detection_prompt.md", "r").read()
+    return _read_text("prompts/Stage_1_detection_prompt.md")
 
 def get_detection_prompt_for_stage_1_preprocess():
-    return open("prompts/Stage_1_Preprocess_detection_prompt.md", "r").read()
+    return _read_text("prompts/Stage_1_Preprocess_detection_prompt.md")
 
 def get_system_instruction_for_stage_1():
-    return open("prompts/Stage_1_system_instruction.md", "r").read()
+    return _read_text("prompts/Stage_1_system_instruction.md")
 
 def get_system_instruction_for_stage_1_preprocess():
-    return open("prompts/Stage_1_Preprocess_system_instruction.md", "r").read()
+    return _read_text("prompts/Stage_1_Preprocess_system_instruction.md")
 
 def get_output_schema_for_stage_1():
-    return json.load(open("prompts/Stage_1_output_schema.json", "r"))
+    return _read_json("prompts/Stage_1_output_schema.json")
 
 def get_output_schema_for_stage_1_preprocess():
-    return json.load(open("prompts/Stage_1_Preprocess_output_schema.json", "r"))
+    return _read_json("prompts/Stage_1_Preprocess_output_schema.json")
 
 def get_user_prompt_for_stage_3():
-    return open("prompts/Stage_3_analysis_prompt.md", "r").read()
+    return _read_text("prompts/Stage_3_analysis_prompt.md")
 
 def get_system_instruction_for_stage_3():
-    return open("prompts/Stage_3_system_instruction.md", "r").read()
+    return _read_text("prompts/Stage_3_system_instruction.md")
 
 def get_output_schema_for_stage_3():
-    return json.load(open("prompts/Stage_3_output_schema.json", "r"))
+    return _read_json("prompts/Stage_3_output_schema.json")
 
 def get_timestamped_transcription_generation_prompt():
-    return open("prompts/Timestamped_transcription_generation_prompt.md", "r").read()
+    return _read_text("prompts/Timestamped_transcription_generation_prompt.md")
 
 def get_timestamped_transcription_generation_output_schema():
-    return json.load(open("prompts/Timestamped_transcription_generation_output_schema.json", "r"))
+    return _read_json("prompts/Timestamped_transcription_generation_output_schema.json")
 
 def get_system_instruction_for_stage_4():
-    return open("prompts/Stage_4_system_instruction.md", "r").read()
+    return _read_text("prompts/Stage_4_system_instruction.md")
 
 def get_user_prompt_for_stage_4():
-    return open("prompts/Stage_4_review_prompt.md", "r").read()
+    return _read_text("prompts/Stage_4_review_prompt.md")
 
 def get_output_schema_for_stage_4():
-    return json.load(open("prompts/Stage_4_output_schema.json", "r"))
+    return _read_json("prompts/Stage_4_output_schema.json")
 
 def get_gemini_2_5_pro_transcription_generation_prompt():
-    return open("prompts/Gemini_2_5_pro_transcription_generation_prompt.md", "r").read()
+    return _read_text("prompts/Gemini_2_5_pro_transcription_generation_prompt.md")

Also applies to: 9-71

🧹 Nitpick comments (27)
tests/processing_pipeline/test_timestamped_transcription_generator.py (3)

63-66: Relax strict length assertions for MP3 rounding jitter.

Exact equality on MP3 durations is flaky across encoders/platforms. Use a tolerance.

Apply:

-            assert abs(len(audio) - segment_length_ms) == 0
+            assert abs(len(audio) - segment_length_ms) <= 150  # allow small encoder jitter
-        assert len(audio1) == 10000 # The first part must be rounded down to the nearest multiple of 10 seconds
-        assert len(audio2) == 20000 # The remaining part of the audio
+        assert abs(len(audio1) - 10000) <= 150  # rounded down to nearest 10s
+        assert abs(len(audio2) - 20000) <= 150  # remainder

Also applies to: 78-83


84-98: Patching and wiring look correct; add a minimal call assertion.

Assert we invoked the client once to catch regressions.

         mock_client_class.return_value = mock_client
+        mock_client.models.generate_content.assert_not_called()
 ...
         result = TimestampedTranscriptionGenerator.transcribe_segments(
             segments, "fake-api-key"
         )
+        mock_client.models.generate_content.assert_called_once()

132-136: Avoid brittle exact error string match.

Error messages vary by OS/ffmpeg. Match on the errno phrase only.

-        with pytest.raises(FileNotFoundError) as exc_info:
+        with pytest.raises(FileNotFoundError) as exc_info:
             TimestampedTranscriptionGenerator.run("non_existed_file.mp3", "fake-api-key", 10)
-        assert str(exc_info.value) == "[Errno 2] No such file or directory: 'non_existed_file.mp3'"
+        assert "No such file or directory" in str(exc_info.value)
src/processing_pipeline/timestamped_transcription_generator.py (3)

87-91: Use splat instead of list concatenation (nit).

Slightly cleaner and avoids creating an intermediate list.

-        result = client.models.generate_content(
+        result = client.models.generate_content(
             model=GEMINI_2_5_PRO,
-            contents=[cls.USER_PROMPT] + segments,
+            contents=[cls.USER_PROMPT, *segments],

111-113: Harden JSON parsing and empty responses.

Guard against missing/invalid result.text to produce clearer errors.

-        return json.loads(result.text)["segments"]
+        if not getattr(result, "text", None):
+            raise ValueError("Empty response from Gemini when transcribing segments")
+        try:
+            return json.loads(result.text)["segments"]
+        except (json.JSONDecodeError, KeyError) as e:
+            raise ValueError(f"Invalid JSON in Gemini response: {e}") from e

95-109: Review safety/thinking settings for Stage 4 parity and policy.

All thresholds are BLOCK_NONE and include_thoughts=True. If that’s intentional, consider surfacing them via config/env to align with Stage 4 defaults and compliance posture.

Would you like these values to be driven by env vars (e.g., GEMINI_SAFETY_BLOCK_NONE=1)?

tests/processing_pipeline/test_stage_3.py (1)

163-167: Patch the object where it’s used for isolation.

Patching google.genai.Client works, but patching the import site is more robust.

-    @patch('google.genai.Client')
+    @patch('processing_pipeline.stage_3.genai.Client')

Apply similarly to other tests in this file.

Also applies to: 205-207, 228-230, 315-317, 343-344

tests/processing_pipeline/test_stage_1_preprocess.py (1)

49-65: Add safety settings assertions for stronger contract.

Also assert the SafetySetting categories/thresholds to catch regressions.

Apply this diff:

         # Check config
         config = kwargs['config']
         assert config.response_mime_type == "application/json"
         assert config.response_schema == Stage1PreprocessTranscriptionExecutor.OUTPUT_SCHEMA
         assert config.max_output_tokens == 8192
+        # Assert safety settings for completeness
+        categories = {s.category for s in config.safety_settings}
+        assert categories == {
+            HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
+            HarmCategory.HARM_CATEGORY_HATE_SPEECH,
+            HarmCategory.HARM_CATEGORY_HARASSMENT,
+            HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
+            HarmCategory.HARM_CATEGORY_CIVIC_INTEGRITY,
+        }
src/processing_pipeline/stage_3.py (3)

141-147: Outdated log message (“Gemini Pro 1.5”).

Update to reflect 2.5 usage.

-        print(f"Processing snippet: {local_file} with Gemini Pro 1.5")
+        print(f"Processing snippet: {local_file} with Gemini 2.5 Pro")

249-253: Add a timeout/attempt cap to file processing poll.

Avoid potential infinite wait if the upload never leaves PROCESSING.

-        while audio_file.state.name == "PROCESSING":
-            print("Processing the uploaded audio file...")
-            time.sleep(1)
-            audio_file = client.files.get(name=audio_file.name)
+        start_time = time.time()
+        max_wait_seconds = 300
+        while audio_file.state.name == "PROCESSING":
+            if time.time() - start_time > max_wait_seconds:
+                raise TimeoutError("Audio processing timed out")
+            print("Processing the uploaded audio file...")
+            time.sleep(1)
+            audio_file = client.files.get(name=audio_file.name)

269-286: Align safety settings with other stages (add CIVIC_INTEGRITY).

Stage 1/4 include CIVIC_INTEGRITY; Stage 3 should, too, for consistency.

                     SafetySetting(
                         category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
                         threshold=HarmBlockThreshold.BLOCK_NONE,
                     ),
+                    SafetySetting(
+                        category=HarmCategory.HARM_CATEGORY_CIVIC_INTEGRITY,
+                        threshold=HarmBlockThreshold.BLOCK_NONE,
+                    ),
src/processing_pipeline/stage_4.py (3)

219-249: Use enums for SafetySetting and enforce JSON in the primary call.

  • Swap string categories/thresholds to HarmCategory/HarmBlockThreshold for type safety.
  • Ask the model for JSON directly with response_mime_type/response_schema to reduce fallback usage.
  • Pass contents as a list for consistency with other stages.
-from google.genai.types import (
-    GenerateContentConfig,
-    GoogleSearch,
-    SafetySetting,
-    Tool,
-)
+from google.genai.types import (
+    GenerateContentConfig,
+    GoogleSearch,
+    SafetySetting,
+    Tool,
+    HarmCategory,
+    HarmBlockThreshold,
+)
@@
-        response = client.models.generate_content(
-            model=model_id,
-            contents=user_prompt,
-            config=GenerateContentConfig(
+        response = client.models.generate_content(
+            model=model_id,
+            contents=[user_prompt],
+            config=GenerateContentConfig(
                 tools=[google_search_tool],
                 response_modalities=["TEXT"],
+                response_mime_type="application/json",
+                response_schema=cls.OUTPUT_SCHEMA,
                 system_instruction=cls.SYSTEM_INSTRUCTION,
                 max_output_tokens=8192,
                 safety_settings=[
-                    SafetySetting(
-                        category="HARM_CATEGORY_SEXUALLY_EXPLICIT",
-                        threshold="BLOCK_NONE",
-                    ),
-                    SafetySetting(
-                        category="HARM_CATEGORY_HATE_SPEECH",
-                        threshold="BLOCK_NONE",
-                    ),
-                    SafetySetting(
-                        category="HARM_CATEGORY_HARASSMENT",
-                        threshold="BLOCK_NONE",
-                    ),
-                    SafetySetting(
-                        category="HARM_CATEGORY_DANGEROUS_CONTENT",
-                        threshold="BLOCK_NONE",
-                    ),
-                    SafetySetting(
-                        category="HARM_CATEGORY_CIVIC_INTEGRITY",
-                        threshold="BLOCK_NONE",
-                    ),
+                    SafetySetting(category=HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT, threshold=HarmBlockThreshold.BLOCK_NONE),
+                    SafetySetting(category=HarmCategory.HARM_CATEGORY_HATE_SPEECH,       threshold=HarmBlockThreshold.BLOCK_NONE),
+                    SafetySetting(category=HarmCategory.HARM_CATEGORY_HARASSMENT,        threshold=HarmBlockThreshold.BLOCK_NONE),
+                    SafetySetting(category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT, threshold=HarmBlockThreshold.BLOCK_NONE),
+                    SafetySetting(category=HarmCategory.HARM_CATEGORY_CIVIC_INTEGRITY,   threshold=HarmBlockThreshold.BLOCK_NONE),
                 ],
             ),
         )

296-326: Use enums for SafetySetting in fallback, too.

Match the primary call’s types for consistency.

-                safety_settings=[
-                    SafetySetting(
-                        category="HARM_CATEGORY_SEXUALLY_EXPLICIT",
-                        threshold="BLOCK_NONE",
-                    ),
-                    SafetySetting(
-                        category="HARM_CATEGORY_HATE_SPEECH",
-                        threshold="BLOCK_NONE",
-                    ),
-                    SafetySetting(
-                        category="HARM_CATEGORY_HARASSMENT",
-                        threshold="BLOCK_NONE",
-                    ),
-                    SafetySetting(
-                        category="HARM_CATEGORY_DANGEROUS_CONTENT",
-                        threshold="BLOCK_NONE",
-                    ),
-                    SafetySetting(
-                        category="HARM_CATEGORY_CIVIC_INTEGRITY",
-                        threshold="BLOCK_NONE",
-                    ),
-                ],
+                safety_settings=[
+                    SafetySetting(category=HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT, threshold=HarmBlockThreshold.BLOCK_NONE),
+                    SafetySetting(category=HarmCategory.HARM_CATEGORY_HATE_SPEECH,       threshold=HarmBlockThreshold.BLOCK_NONE),
+                    SafetySetting(category=HarmCategory.HARM_CATEGORY_HARASSMENT,        threshold=HarmBlockThreshold.BLOCK_NONE),
+                    SafetySetting(category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT, threshold=HarmBlockThreshold.BLOCK_NONE),
+                    SafetySetting(category=HarmCategory.HARM_CATEGORY_CIVIC_INTEGRITY,   threshold=HarmBlockThreshold.BLOCK_NONE),
+                ],

333-334: Shorten the exception message per TRY003; surface the offending text via logging (or an exception attribute) rather than embedding it in the exception string.

Keep the error readable and avoid overly long exception strings.

-            raise ValueError(f"[Stage 4] The provided text is not convertible to a valid JSON object:\n{text}")
+            raise ValueError("[Stage 4] The provided text is not convertible to a valid JSON object")
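If the team wants to keep the failing payload available without tripping TRY003, a small exception class can own the short message and carry the text as an attribute. A sketch with an illustrative class name, not code from the repository:

```python
class InvalidJsonResponseError(ValueError):
    """Raised when a model response cannot be parsed as a JSON object."""

    def __init__(self, stage: str, text: str):
        # Keep the message short; attach the payload for callers and log handlers.
        super().__init__(f"[{stage}] Response is not a valid JSON object")
        self.text = text


try:
    raise InvalidJsonResponseError("Stage 4", "not json at all")
except InvalidJsonResponseError as e:
    print(e)       # [Stage 4] Response is not a valid JSON object
    print(e.text)  # not json at all
```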
src/processing_pipeline/stage_1_preprocess.py (1)

40-46: Consider providing mime_type and add a timeout to polling.

  • Supplying mime_type can speed up/clarify file processing.
  • Bound the polling loop to avoid indefinite waits.
-        audio_file = client.files.upload(file=audio_file)
+        audio_file = client.files.upload(file=audio_file)  # optionally: config={"mime_type": "audio/mp3"}
@@
-        while audio_file.state.name == "PROCESSING":
-            print("Processing the uploaded audio file...")
-            time.sleep(1)
-            audio_file = client.files.get(name=audio_file.name)
+        start_time = time.time()
+        max_wait_seconds = 300
+        while audio_file.state.name == "PROCESSING":
+            if time.time() - start_time > max_wait_seconds:
+                raise TimeoutError("Audio processing timed out")
+            print("Processing the uploaded audio file...")
+            time.sleep(1)
+            audio_file = client.files.get(name=audio_file.name)
tests/processing_pipeline/test_gemini_2_5_pro_transcription_generator.py (5)

60-62: Avoid hardcoding the model string; assert against the constant.

-from processing_pipeline.gemini_2_5_pro_transcription_generator import Gemini25ProTranscriptionGenerator
+from processing_pipeline.gemini_2_5_pro_transcription_generator import Gemini25ProTranscriptionGenerator
+from processing_pipeline.constants import GEMINI_2_5_PRO
@@
-        assert kwargs["model"] == "gemini-2.5-pro"
+        assert kwargs["model"] == GEMINI_2_5_PRO

76-87: Remove unused fixture argument.

mock_audio_file isn’t used in this test.

-    def test_run_with_processing_audio(self, mock_genai, mock_audio_file):
+    def test_run_with_processing_audio(self, mock_genai):

103-107: Remove unused fixture argument.

mock_genai isn’t used here.

-    def test_run_without_api_key(self, mock_genai):
+    def test_run_without_api_key(self):

174-175: Drop unused local variable (Ruff F841).

-        result = Gemini25ProTranscriptionGenerator.run("test.mp3", "fake-api-key")
+        Gemini25ProTranscriptionGenerator.run("test.mp3", "fake-api-key")

145-162: Naming nit: this test checks “thinking_config,” not timeouts.

Consider renaming to reflect intent (optional).

-    def test_timeout_configuration(self, mock_genai, mock_audio_file):
-        """Test timeout configuration in generate_content"""
+    def test_thinking_config_configuration(self, mock_genai, mock_audio_file):
+        """Test thinking_config in generate_content"""
tests/processing_pipeline/test_stage_4.py (2)

36-40: Patch where it’s used to avoid aliasing fragility.

Stage4Executor likely does from google import genai and then genai.Client(...). Patching "processing_pipeline.stage_4.genai.Client" is safer than "google.genai.Client".

-with patch("google.genai.Client") as mock:
+with patch("processing_pipeline.stage_4.genai.Client") as mock:
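This matters because patch replaces the name in the namespace where it is looked up, not where it was defined. A self-contained demonstration using synthetic in-memory modules; the module and function names are hypothetical:

```python
import sys
import types
from unittest.mock import patch

# Simulate a library module and a consumer that does `from lib import helper`.
lib = types.ModuleType("lib")
lib.helper = lambda: "real"
sys.modules["lib"] = lib

consumer = types.ModuleType("consumer")
exec("from lib import helper\ndef use():\n    return helper()", consumer.__dict__)
sys.modules["consumer"] = consumer

# Patching the original location does NOT affect the already-imported alias...
with patch("lib.helper", return_value="mocked"):
    print(consumer.use())  # real

# ...patching the namespace where the name is used does.
with patch("consumer.helper", return_value="mocked"):
    print(consumer.use())  # mocked
```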

459-477: Grounding type assertion is inconsistent with process_snippet usage.

process_snippet forwards grounding_metadata as a dict to Supabase. Assert dict, not str.

-assert isinstance(grounding, str)
+assert isinstance(grounding, dict)
tests/processing_pipeline/test_stage_1.py (2)

709-717: Update expected transcriptor and logs to 2.5 Pro to match the code migration.

Tests still assert "gemini-1206". The code path now uses the 2.5 Pro transcriptor; align expectations.

-    call(1, {"timestamped_transcription": "Test transcription"}, "gemini-1206"),
-    call(2, {"timestamped_transcription": "Test transcription"}, "gemini-1206"),
+    call(1, {"timestamped_transcription": "Test transcription"}, "gemini-2.5-pro"),
+    call(2, {"timestamped_transcription": "Test transcription"}, "gemini-2.5-pro"),

Also applies to: 728-731, 732-753, 755-761


138-138: Silence Ruff ARG002 (unused fixture args) without losing fixture effects.

Access or delete the fixture vars at the start of each test.

 def test_transcribe_with_gemini_success(self, mock_environment):
+    del mock_environment
     """Test successful transcription with Gemini"""
 def test_transcribe_with_gemini_2_5_pro_success(self, mock_environment):
+    del mock_environment
 def test_transcribe_with_gemini_2_5_pro_rate_limit(self, mock_environment):
+    del mock_environment
 def test_initial_detection_success(self, mock_environment):
+    del mock_environment
 def test_disinformation_detection_success(self, mock_environment):
+    del mock_environment
 def test_run_success(self, mock_environment, mock_genai):
+    del mock_environment
-def test_disinformation_detection_with_unicode_handling(self, mock_supabase_client, mock_genai):
+def test_disinformation_detection_with_unicode_handling(self, mock_genai):

Also applies to: 164-164, 174-174, 183-183, 194-194, 209-209, 634-634

src/processing_pipeline/stage_1.py (3)

246-251: Two stray string literals — missing print.

These lines won’t log; convert to print calls.

-            if len(flagged_snippets) == 0:
-                "No flagged snippets found, inserting the llm response with status 'Processed'"
+            if len(flagged_snippets) == 0:
+                print("No flagged snippets found, inserting the llm response with status 'Processed'")
@@
-            else:
-                "Flagged snippets found, inserting the llm response with status 'New'"
+            else:
+                print("Flagged snippets found, inserting the llm response with status 'New'")

Also applies to: 265-287
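For background: a bare string expression in Python is evaluated and immediately discarded, so these branches produce no output at all. A tiny demonstration:

```python
import io
from contextlib import redirect_stdout


def broken():
    if True:
        "No flagged snippets found"  # bare string expression: evaluated, discarded, never printed


def fixed():
    if True:
        print("Flagged snippets found")  # actually logs


buf = io.StringIO()
with redirect_stdout(buf):
    broken()
    fixed()
print(repr(buf.getvalue()))  # 'Flagged snippets found\n'
```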


428-434: Update log to reflect 2.5 Pro.

-print("Processing the timestamped transcription with Gemini 1.5 Pro")
+print("Processing the timestamped transcription with Gemini 2.5 Pro")

485-498: Regenerate flow: transcriptor label and logs still reference 1206/1.5.

Align with 2.5 Pro; this also requires updating tests that assert the transcriptor.

-                try:
-                    transcriptor = "gemini-1206"
-                    timestamped_transcription = transcribe_audio_file_with_gemini_2_5_pro(local_file)
+                try:
+                    transcriptor = "gemini-2.5-pro"
+                    timestamped_transcription = transcribe_audio_file_with_gemini_2_5_pro(local_file)
                 except google_exceptions.ResourceExhausted as e:
                     print(
-                        f"Failed to transcribe the audio file with Gemini 1206 due to Rate Limit: {e}\n"
+                        f"Failed to transcribe the audio file with Gemini 2.5 Pro due to Rate Limit: {e}\n"
                         "Falling back to the custom timestamped-transcript generator"
                     )
@@
-                print("Processing the timestamped transcription with Gemini 1.5 Pro")
+                print("Processing the timestamped transcription with Gemini 2.5 Pro")

Please also update tests/processing_pipeline/test_stage_1.py lines 749-751 to expect "gemini-2.5-pro" (see separate test comment).

Also applies to: 500-505

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between c72eb00 and 8db69d3.

📒 Files selected for processing (15)
  • src/processing_pipeline/constants.py (4 hunks)
  • src/processing_pipeline/gemini_1206_transcription_generator.py (0 hunks)
  • src/processing_pipeline/gemini_2_5_pro_transcription_generator.py (1 hunks)
  • src/processing_pipeline/stage_1.py (12 hunks)
  • src/processing_pipeline/stage_1_preprocess.py (3 hunks)
  • src/processing_pipeline/stage_3.py (3 hunks)
  • src/processing_pipeline/stage_4.py (5 hunks)
  • src/processing_pipeline/timestamped_transcription_generator.py (2 hunks)
  • tests/processing_pipeline/test_gemini_1206_transcription_generator.py (0 hunks)
  • tests/processing_pipeline/test_gemini_2_5_pro_transcription_generator.py (1 hunks)
  • tests/processing_pipeline/test_stage_1.py (23 hunks)
  • tests/processing_pipeline/test_stage_1_preprocess.py (7 hunks)
  • tests/processing_pipeline/test_stage_3.py (4 hunks)
  • tests/processing_pipeline/test_stage_4.py (14 hunks)
  • tests/processing_pipeline/test_timestamped_transcription_generator.py (1 hunks)
💤 Files with no reviewable changes (2)
  • tests/processing_pipeline/test_gemini_1206_transcription_generator.py
  • src/processing_pipeline/gemini_1206_transcription_generator.py
🧰 Additional context used
🧬 Code graph analysis (8)
src/processing_pipeline/gemini_2_5_pro_transcription_generator.py (3)
src/processing_pipeline/constants.py (1)
  • get_gemini_2_5_pro_transcription_generation_prompt (69-70)
src/processing_pipeline/stage_1.py (1)
  • run (535-579)
src/processing_pipeline/timestamped_transcription_generator.py (1)
  • run (30-65)
tests/processing_pipeline/test_stage_4.py (2)
src/processing_pipeline/stage_4.py (2)
  • process_snippet (92-129)
  • analysis_review (154-185)
src/processing_pipeline/supabase_utils.py (3)
  • set_snippet_status (69-79)
  • submit_snippet_review (238-260)
  • create_new_label (322-340)
tests/processing_pipeline/test_stage_1_preprocess.py (2)
src/processing_pipeline/stage_1_preprocess.py (4)
  • Stage1PreprocessTranscriptionExecutor (22-81)
  • Stage1PreprocessDetectionExecutor (84-135)
  • run (32-81)
  • run (91-135)
tests/processing_pipeline/test_stage_1.py (1)
  • mock_genai (71-77)
tests/processing_pipeline/test_gemini_2_5_pro_transcription_generator.py (2)
src/processing_pipeline/gemini_2_5_pro_transcription_generator.py (2)
  • Gemini25ProTranscriptionGenerator (10-59)
  • run (15-59)
src/processing_pipeline/stage_1.py (1)
  • run (535-579)
src/processing_pipeline/timestamped_transcription_generator.py (1)
src/processing_pipeline/constants.py (2)
  • get_timestamped_transcription_generation_output_schema (53-54)
  • get_timestamped_transcription_generation_prompt (49-50)
src/processing_pipeline/stage_1.py (5)
src/processing_pipeline/constants.py (3)
  • get_system_instruction_for_stage_1 (21-22)
  • get_output_schema_for_stage_1 (29-30)
  • get_detection_prompt_for_stage_1 (13-14)
src/processing_pipeline/gemini_2_5_pro_transcription_generator.py (2)
  • Gemini25ProTranscriptionGenerator (10-59)
  • run (15-59)
src/processing_pipeline/stage_3.py (1)
  • run (241-290)
src/processing_pipeline/timestamped_transcription_generator.py (1)
  • run (30-65)
src/processing_pipeline/stage_1_preprocess.py (3)
  • run (32-81)
  • run (91-135)
  • Stage1PreprocessDetectionExecutor (84-135)
tests/processing_pipeline/test_stage_3.py (2)
src/processing_pipeline/stage_3.py (3)
  • process_snippet (139-175)
  • Stage3Executor (234-290)
  • run (241-290)
src/processing_pipeline/supabase_utils.py (2)
  • update_snippet (185-227)
  • set_snippet_status (69-79)
tests/processing_pipeline/test_stage_1.py (4)
src/processing_pipeline/stage_1.py (11)
  • transcribe_audio_file_with_gemini_2_5_flash (80-83)
  • transcribe_audio_file_with_custom_timestamped_transcription_generator (139-143)
  • initial_disinformation_detection_with_gemini_2_5_pro (147-150)
  • disinformation_detection_with_gemini_2_5_pro (154-168)
  • transcribe_audio_file_with_gemini_2_5_pro (131-135)
  • run (535-579)
  • Stage1Executor (528-579)
  • initial_disinformation_detection (298-342)
  • regenerate_timestamped_transcript (443-525)
  • update_stage_1_llm_response_timestamped_transcription (351-352)
  • process_audio_file (204-294)
tests/processing_pipeline/test_gemini_2_5_pro_transcription_generator.py (2)
  • mock_genai (10-25)
  • test_run_success (35-74)
tests/processing_pipeline/test_stage_1_preprocess.py (4)
  • mock_genai (13-18)
  • mock_genai (111-116)
  • test_run_success (27-67)
  • test_run_success (118-160)
tests/test_main.py (1)
  • mock_environment (19-29)
🪛 Ruff (0.12.2)
src/processing_pipeline/gemini_2_5_pro_transcription_generator.py

17-17: Avoid specifying long messages outside the exception class

(TRY003)


55-55: Avoid specifying long messages outside the exception class

(TRY003)

tests/processing_pipeline/test_gemini_2_5_pro_transcription_generator.py

76-76: Unused method argument: mock_audio_file

(ARG002)


103-103: Unused method argument: mock_genai

(ARG002)


174-174: Local variable result is assigned to but never used

Remove assignment to unused variable result

(F841)

src/processing_pipeline/timestamped_transcription_generator.py

89-89: Consider [cls.USER_PROMPT, *segments] instead of concatenation

Replace with [cls.USER_PROMPT, *segments]

(RUF005)

src/processing_pipeline/stage_4.py

258-258: Avoid specifying long messages outside the exception class

(TRY003)

tests/processing_pipeline/test_stage_1.py

138-138: Unused method argument: mock_environment

(ARG002)


164-164: Unused method argument: mock_environment

(ARG002)


174-174: Unused method argument: mock_environment

(ARG002)


183-183: Unused method argument: mock_environment

(ARG002)


194-194: Unused method argument: mock_environment

(ARG002)


209-209: Unused method argument: mock_environment

(ARG002)


634-634: Unused method argument: mock_supabase_client

(ARG002)

🔇 Additional comments (24)
src/processing_pipeline/gemini_2_5_pro_transcription_generator.py (1)

28-59: Overall flow LGTM.

Upload, wait, generate, and cleanup are correctly structured. Once polling is fixed, this should be robust.

Consider adding a max wait (e.g., 5–10 min) to prevent infinite loops on stalled processing.

tests/processing_pipeline/test_stage_3.py (1)

247-255: LGTM: Stage3Executor smoke test.

Asserting the result is JSON‑parseable and using GEMINI_2_5_PRO matches the migration.

tests/processing_pipeline/test_stage_1_preprocess.py (3)

15-18: Good mocking pattern and result assertions.

Client/file ops and result checks look correct for the v2 client path.

Also applies to: 29-37


78-94: Nice polling test.

The processing→processed flow and sleep assertions are spot on.


101-108: Error paths covered well.

Upload and generation error propagation are correctly tested.

Also applies to: 167-174

src/processing_pipeline/stage_1_preprocess.py (1)

48-79: Stage 1 v2 migration looks solid.

Model selection, config, safety settings, and cleanup are consistent and correct.

Also applies to: 103-135

src/processing_pipeline/stage_4.py (1)

202-209: Unify Gemini env var usage: prefer GOOGLE_GEMINI_KEY or document the paid variant.

  • Found mismatch: stage_4.run reads GOOGLE_GEMINI_PAID_KEY (src/processing_pipeline/stage_4.py:202–204) while Stage4Executor.__ensure_json_format reads GOOGLE_GEMINI_KEY (src/processing_pipeline/stage_4.py:273–275).
  • Majority of code, README and tests use GOOGLE_GEMINI_KEY (src/main.py, src/processing_pipeline/stage_1.py, src/processing_pipeline/stage_3.py, README.md, tests/).
  • Action: either change stage_4 to use GOOGLE_GEMINI_KEY for consistency, or keep GOOGLE_GEMINI_PAID_KEY but update README/tests and add a short comment explaining why a separate paid key is required.
tests/processing_pipeline/test_stage_4.py (6)

3-3: LGTM — imports updated for v2 client and new model constant.

Also applies to: 17-17


41-106: LGTM — solid client/models/response scaffolding.

Mocked response payload and grounding metadata shape look correct for downstream code.


331-336: LGTM — end-to-end process_snippet happy path is exercised.

Also applies to: 341-353, 355-363


545-557: LGTM — analysis_review loop wiring and calls look correct (non-repeating path).


589-651: LGTM — empty/invalid JSON fallback path behavior is verified correctly.

Also applies to: 640-648, 645-651


718-721: LGTM — error, invalid snippet, API error, specific IDs, and empty-category cases covered.

Also applies to: 732-735, 743-751, 763-781, 786-810

tests/processing_pipeline/test_stage_1.py (7)

12-16: LGTM — imports aligned with 2.5 Flash/Pro surfaces.

Also applies to: 24-25


71-77: LGTM — v2 client fixture covers models.generate_content; good default JSON.


138-153: LGTM — Stage 1 transcription/detection tests updated for v2 API path and pass-through JSON.

Also applies to: 164-173, 174-180, 183-193, 194-206, 209-221


233-239: LGTM — main-flow hooks (initial/redo/regenerate) patch points and assertions look right.

Also applies to: 269-275, 304-305


316-354: LGTM — process_audio_file success and error paths validated with 2.5 Flash + preprocess detection.

Also applies to: 373-379


464-505: LGTM — redo_main_detection over multiple responses with v2 client.


634-642: LGTM — Unicode payload handling through v2 client.

Also applies to: 653-661

src/processing_pipeline/stage_1.py (4)

8-14: LGTM — migrated to google.genai v2 client and type imports; model constant updated.

Also applies to: 24-29


79-84: LGTM — Stage 1 public API functions now target Gemini 2.5 Flash/Pro and preprocess executors.

Also applies to: 131-135, 147-151, 154-169


206-260: LGTM — fallback to custom generator on 2.5 Pro rate limit and main detection using Stage1Executor.

Also applies to: 242-259


539-579: LGTM — Stage1Executor.run migrated to v2 API with schema’d JSON and safety settings.

Comment thread: src/processing_pipeline/stage_4.py
Comment on lines +252 to 265
        # Check if response.text is a good json
        try:
            # Find the first { and last } to extract the JSON object
            start_idx = response.text.find("{")
            end_idx = response.text.rfind("}")
            if start_idx == -1 or end_idx == -1:
                raise ValueError("No JSON object found in response")
            response_text = response.text[start_idx : end_idx + 1]
            result = json.loads(response_text)
        except json.JSONDecodeError:
            print("[Stage 4] Response from gemini 2.5 pro is not a valid JSON object.")
            # Use another prompt to ensure the response is a "valid" json
            result = Stage4Executor.__ensure_json_format(response.text)


⚠️ Potential issue

Fallback not triggered when no braces (ValueError escapes).

The try only catches JSONDecodeError; “No JSON object found” raises ValueError and skips fallback.

-        try:
+        try:
             # Find the first { and last } to extract the JSON object
             start_idx = response.text.find("{")
             end_idx = response.text.rfind("}")
             if start_idx == -1 or end_idx == -1:
                 raise ValueError("No JSON object found in response")
             response_text = response.text[start_idx : end_idx + 1]
             result = json.loads(response_text)
-        except json.JSONDecodeError:
+        except (json.JSONDecodeError, ValueError):
             print("[Stage 4] Response from gemini 2.5 pro is not a valid JSON object.")
             # Use another prompt to ensure the response is a "valid" json
             result = Stage4Executor.__ensure_json_format(response.text)
🧰 Tools
🪛 Ruff (0.12.2)

258-258: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In src/processing_pipeline/stage_4.py around lines 252 to 265, the current
try/except only catches json.JSONDecodeError so the manually raised ValueError
("No JSON object found") bypasses the fallback; update the exception handling to
catch both json.JSONDecodeError and ValueError (e.g., except
(json.JSONDecodeError, ValueError):) so the fallback
Stage4Executor.__ensure_json_format(response.text) is invoked for both malformed
JSON and missing braces, and keep the existing log message or add a concise one
before calling the fallback.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (3)
src/processing_pipeline/stage_4.py (1)

254-267: Fallback not triggered when no braces (ValueError escapes).

Expand the except clause to include ValueError so the fallback runs when braces are missing. This mirrors the prior review note.

-        except json.JSONDecodeError:
+        except (json.JSONDecodeError, ValueError):
             print("[Stage 4] Response from gemini 2.5 pro is not a valid JSON object.")
             # Use another prompt to ensure the response is a "valid" json
             result = Stage4Executor.__ensure_json_format(response.text)
src/processing_pipeline/gemini_2_5_pro_transcription_generator.py (2)

34-35: Gate chain‑of‑thought behind an env flag (default off)

Prevents accidental thought leakage to logs/telemetry while allowing opt‑in.

-                    thinking_config=ThinkingConfig(include_thoughts=False),
+                    thinking_config=ThinkingConfig(
+                        include_thoughts=bool(int(os.getenv("GEMINI_INCLUDE_THOUGHTS", "0")))
+                    ),

Apply outside-selected lines:

import os

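Note that bool(int(...)) raises ValueError on inputs like "true" or "yes", so a slightly more forgiving parser may be safer for a feature flag. A sketch with an illustrative helper name:

```python
import os

_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean feature flag from the environment, tolerating common spellings."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


os.environ["GEMINI_INCLUDE_THOUGHTS"] = "true"
print(env_flag("GEMINI_INCLUDE_THOUGHTS"))  # True
os.environ["GEMINI_INCLUDE_THOUGHTS"] = "0"
print(env_flag("GEMINI_INCLUDE_THOUGHTS"))  # False
```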
21-59: Harden upload polling: add timeout, failure-state handling, robust enum access, and guaranteed cleanup

The current loop can hang indefinitely and doesn’t handle terminal failure states. It also reuses the audio_file arg for the uploaded handle and hardcodes audio/mp3. Patch below addresses:

  • bounded wait with exponential backoff + timeout
  • failure-state detection
  • robust enum access (getattr(..., "name", ...))
  • keep original arg intact (uploaded_file variable)
  • best‑effort cleanup even if polling fails
  • correct default MIME (audio/mpeg) with detection from filename
-        # Upload the audio file and wait for it to finish processing
-        audio_file = client.files.upload(file=audio_file, config={"mime_type": "audio/mp3"})
-        while audio_file.state.name == "PROCESSING":
-            print("Processing the uploaded audio file...")
-            time.sleep(1)
-            audio_file = client.files.get(name=audio_file.name)
-
-        try:
+        # Upload the audio file and wait for it to finish processing
+        # Detect MIME type when possible; default to audio/mpeg for MP3
+        filename = getattr(audio_file, "name", None)
+        mime_type = (mimetypes.guess_type(filename)[0] if filename else None) or "audio/mpeg"
+        uploaded_file = client.files.upload(file=audio_file, config={"mime_type": mime_type})
+        try:
+            # Poll with bounded wait and robust state handling
+            start = time.time()
+            backoff = 1.0
+            state = getattr(uploaded_file.state, "name", uploaded_file.state)
+            while state == "PROCESSING":
+                print("Processing the uploaded audio file...")
+                time.sleep(backoff)
+                backoff = min(backoff * 1.5, 5.0)
+                uploaded_file = client.files.get(name=uploaded_file.name)
+                state = getattr(uploaded_file.state, "name", uploaded_file.state)
+                if time.time() - start > 300:
+                    raise TimeoutError(f"Upload processing timed out after 300s (state={state}).")
+            if state in {"FAILED", "ERROR", "CANCELLED"}:
+                raise RuntimeError(f"Upload failed with terminal state: {state}")
             result = client.models.generate_content(
                 model=GEMINI_2_5_PRO,
-                contents=[cls.USER_PROMPT, audio_file],
+                contents=[cls.USER_PROMPT, uploaded_file],
                 config=GenerateContentConfig(
                     max_output_tokens=8192,
                     thinking_config=ThinkingConfig(include_thoughts=False),
                     safety_settings=[
                         SafetySetting(
                             category=HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
                             threshold=HarmBlockThreshold.BLOCK_NONE,
                         ),
                         SafetySetting(
                             category=HarmCategory.HARM_CATEGORY_HATE_SPEECH, threshold=HarmBlockThreshold.BLOCK_NONE
                         ),
                         SafetySetting(
                             category=HarmCategory.HARM_CATEGORY_HARASSMENT, threshold=HarmBlockThreshold.BLOCK_NONE
                         ),
                         SafetySetting(
                             category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
                             threshold=HarmBlockThreshold.BLOCK_NONE,
                         ),
                     ],
                 ),
             )
 
             if not result.text:
                 raise ValueError("No content in response - likely truncated or blocked")
 
             return result.text
         finally:
-            client.files.delete(name=audio_file.name)
+            try:
+                client.files.delete(name=uploaded_file.name)
+            except Exception:
+                # Best-effort cleanup; ignore delete errors
+                pass

Apply outside-selected lines:

# top-level imports
import mimetypes
🧹 Nitpick comments (6)
src/processing_pipeline/stage_4.py (3)

221-252: Ask for JSON up front; drop brittle brace slicing.

Request JSON with schema in the primary call to avoid substring extraction and reduce failures.

         response = client.models.generate_content(
             model=model_id,
             contents=user_prompt,
             config=GenerateContentConfig(
                 tools=[google_search_tool],
                 response_modalities=["TEXT"],
                 system_instruction=cls.SYSTEM_INSTRUCTION,
                 max_output_tokens=8192,
+                response_mime_type="application/json",
+                response_schema=cls.OUTPUT_SCHEMA,
                 safety_settings=[
                     SafetySetting(
                         category=HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
                         threshold=HarmBlockThreshold.BLOCK_NONE,
                     ),
                     ...
                 ],
             ),
         )
 
-        # Check if response.text is a good json
-        try:
-            # Find the first { and last } to extract the JSON object
-            start_idx = response.text.find("{")
-            end_idx = response.text.rfind("}")
-            if start_idx == -1 or end_idx == -1:
-                raise ValueError("No JSON object found in response")
-            response_text = response.text[start_idx : end_idx + 1]
-            result = json.loads(response_text)
+        # Parse model JSON response
+        try:
+            result = json.loads(response.text)

Also applies to: 254-262
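The parse-first flow suggested above can be sketched independently of the SDK (function name hypothetical); with `response_mime_type="application/json"` the direct parse should succeed, and the brace slice survives only as a last resort:

```python
import json

def parse_model_json(text: str) -> dict:
    """Parse a model response as JSON, falling back to brace extraction."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Last resort: extract the outermost {...} span from prose-wrapped output.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end == -1 or end <= start:
            raise ValueError("No JSON object found in response")
        return json.loads(text[start : end + 1])

print(parse_model_json('Here you go: {"label": "ok"}'))  # {'label': 'ok'}
```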


269-271: Guard against missing grounding_metadata on candidate.

Avoid AttributeError when candidates exist but lack grounding_metadata.

-        grounding_metadata = str(response.candidates[0].grounding_metadata) if response.candidates else None
+        candidate = response.candidates[0] if getattr(response, "candidates", None) else None
+        gm = getattr(candidate, "grounding_metadata", None) if candidate else None
+        grounding_metadata = str(gm) if gm is not None else None
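The defensive access can be exercised with plain stand-in objects (the `SimpleNamespace` instances below mimic the SDK's response/candidate classes; the helper name is hypothetical):

```python
from types import SimpleNamespace

def extract_grounding(response):
    """Return str(grounding_metadata) or None, tolerating missing attributes."""
    candidates = getattr(response, "candidates", None)
    candidate = candidates[0] if candidates else None
    gm = getattr(candidate, "grounding_metadata", None) if candidate else None
    return str(gm) if gm is not None else None

with_gm = SimpleNamespace(candidates=[SimpleNamespace(grounding_metadata={"q": 1})])
without = SimpleNamespace(candidates=[SimpleNamespace()])
print(extract_grounding(with_gm))  # {'q': 1}
print(extract_grounding(without))  # None
```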

221-252: Confirm token limit, safety thresholds, and add per-call timeout.

  • Token limit: Gemini 2.5 Pro max_output_tokens = 65,536 — 8192 is within limit.
  • Timeout: GenerateContentConfig supports http_options (genai.types.HttpOptions) with timeout in milliseconds; add a per-call timeout (e.g., HttpOptions(timeout=30000)) on the config or client to avoid long hangs.
  • Safety: All thresholds set to BLOCK_NONE — confirm this is intentional.

File: src/processing_pipeline/stage_4.py lines 221–252

src/processing_pipeline/gemini_2_5_pro_transcription_generator.py (3)

36-51: Align safety settings with other stages: include CIVIC_INTEGRITY

Other modules include CIVIC_INTEGRITY; keep parity to avoid unexpected blocks later.

                         SafetySetting(
                             category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
                             threshold=HarmBlockThreshold.BLOCK_NONE,
                         ),
+                        SafetySetting(
+                            category=HarmCategory.HARM_CATEGORY_CIVIC_INTEGRITY,
+                            threshold=HarmBlockThreshold.BLOCK_NONE,
+                        ),
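Parity could also be enforced mechanically with one shared category list that every stage builds its settings from (constant and helper names are hypothetical; the real SDK types would be passed in as parameters):

```python
# Hypothetical shared constant: adding CIVIC_INTEGRITY here propagates to all
# call sites at once instead of being patched stage by stage.
UNBLOCKED_CATEGORIES = [
    "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "HARM_CATEGORY_HATE_SPEECH",
    "HARM_CATEGORY_HARASSMENT",
    "HARM_CATEGORY_DANGEROUS_CONTENT",
    "HARM_CATEGORY_CIVIC_INTEGRITY",
]

def build_safety_settings(setting_cls, category_enum, threshold):
    """Construct one setting per category; SDK types are injected by the caller."""
    return [
        setting_cls(category=getattr(category_enum, name), threshold=threshold)
        for name in UNBLOCKED_CATEGORIES
    ]
```

With the real SDK this would be called as `build_safety_settings(SafetySetting, HarmCategory, HarmBlockThreshold.BLOCK_NONE)`.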

24-24: Prefer logging over print for observability

Switch to logging for structured, level‑controlled output.

-            print("Processing the uploaded audio file...")
+            logger.info("Processing the uploaded audio file...")

Apply outside the selected lines:

import logging
logger = logging.getLogger(__name__)

17-17: Ruff TRY003: long exception messages

Either shorten messages or define a small custom exception class to satisfy linter and keep messages centralized.

Also applies to: 55-55
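The custom-exception route satisfies TRY003 by centralizing the message; a minimal sketch (class name hypothetical):

```python
class TranscriptionError(RuntimeError):
    """Raised when Gemini returns no usable transcription text."""

    def __init__(self, reason: str = "empty or blocked response"):
        super().__init__(f"Transcription failed: {reason}")

# Call sites stay short, which is what the Ruff rule wants:
try:
    raise TranscriptionError()
except TranscriptionError as e:
    print(e)  # Transcription failed: empty or blocked response
```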

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8db69d3 and 84eee12.

📒 Files selected for processing (3)
  • src/processing_pipeline/gemini_2_5_pro_transcription_generator.py (1 hunks)
  • src/processing_pipeline/stage_1.py (12 hunks)
  • src/processing_pipeline/stage_4.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/processing_pipeline/stage_1.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/processing_pipeline/gemini_2_5_pro_transcription_generator.py (3)
src/processing_pipeline/constants.py (1)
  • get_gemini_2_5_pro_transcription_generation_prompt (69-70)
src/processing_pipeline/stage_1.py (1)
  • run (534-578)
src/processing_pipeline/stage_4.py (1)
  • run (197-271)
🪛 Ruff (0.12.2)
src/processing_pipeline/gemini_2_5_pro_transcription_generator.py

17-17: Avoid specifying long messages outside the exception class

(TRY003)


55-55: Avoid specifying long messages outside the exception class

(TRY003)

src/processing_pipeline/stage_4.py

260-260: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (1)
src/processing_pipeline/gemini_2_5_pro_transcription_generator.py (1)

31-31: Verified — passing uploaded File object in contents is supported

Gemini/GenAI v2 docs and the Python SDK show three supported file-attachment methods: (1) pass the uploaded File object directly in contents (e.g., contents=[prompt, uploaded_file]); (2) reference an uploaded file via parts/file_data or Part.from_uri with file_uri; (3) include raw bytes inline (inline_data / Part.from_bytes). The snippet in src/processing_pipeline/gemini_2_5_pro_transcription_generator.py (line 31) — contents=[cls.USER_PROMPT, audio_file] — is a supported pattern; no change required.

Comment on lines +208 to 211
client = genai.Client(api_key=gemini_key)
model_id = GEMINI_2_5_PRO
google_search_tool = Tool(google_search=GoogleSearch())


⚠️ Potential issue

Unify Gemini API key env var across both paths (fallback may 401).

run() uses GOOGLE_GEMINI_PAID_KEY while the fallback uses GOOGLE_GEMINI_KEY. Use a single, consistent lookup.

-        gemini_key = os.getenv("GOOGLE_GEMINI_KEY")
+        gemini_key = os.getenv("GOOGLE_GEMINI_PAID_KEY") or os.getenv("GOOGLE_GEMINI_KEY")
         if not gemini_key:
             raise ValueError("Google Gemini API key was not set!")
 
         client = genai.Client(api_key=gemini_key)

Also applies to: 275-281

🤖 Prompt for AI Agents
In src/processing_pipeline/stage_4.py around lines 208-211 and 275-281, the code
reads Gemini API keys from two different environment variables
(GOOGLE_GEMINI_PAID_KEY in run() and GOOGLE_GEMINI_KEY in the fallback), which
can cause authentication errors; standardize on a single env var name across
both code paths by choosing one canonical variable (e.g., GOOGLE_GEMINI_PAID_KEY
or GOOGLE_GEMINI_KEY), update both lookups to use that same variable, and ensure
any helper/utility that reads the key falls back consistently (e.g., try the
primary name then a single documented fallback) so both the main path and
fallback use identical credentials.
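A single lookup helper would make both paths identical by construction (helper name hypothetical; the env var names are the ones used in the file):

```python
import os

def get_gemini_key() -> str:
    """Return the Gemini API key, preferring the paid key, then the free one."""
    key = os.getenv("GOOGLE_GEMINI_PAID_KEY") or os.getenv("GOOGLE_GEMINI_KEY")
    if not key:
        raise ValueError("Google Gemini API key was not set!")
    return key

os.environ.pop("GOOGLE_GEMINI_PAID_KEY", None)
os.environ["GOOGLE_GEMINI_KEY"] = "test-key"
print(get_gemini_key())  # test-key
```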

Comment thread src/processing_pipeline/stage_4.py

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/processing_pipeline/stage_4.py (1)

333-335: Avoid leaking full model output in exception messages (privacy + lint).

Current raise includes the entire text; it can end up in Supabase status/logs. Use a generic message.

-            raise ValueError(f"[Stage 4] The provided text is not convertible to a valid JSON object:\n{text}")
+            raise ValueError("[Stage 4] The provided text is not convertible to the expected JSON schema.")

This also satisfies TRY003.

♻️ Duplicate comments (2)
src/processing_pipeline/stage_4.py (2)

204-207: Unify Gemini API key env across both paths (prevent 401 in fallback).

Standardize key lookup so main and fallback use identical credentials.

-        gemini_key = os.getenv("GOOGLE_GEMINI_PAID_KEY")
+        gemini_key = os.getenv("GOOGLE_GEMINI_PAID_KEY") or os.getenv("GOOGLE_GEMINI_KEY")
         if not gemini_key:
             raise ValueError("Google Gemini API key was not set!")
-        gemini_key = os.getenv("GOOGLE_GEMINI_KEY")
+        gemini_key = os.getenv("GOOGLE_GEMINI_PAID_KEY") or os.getenv("GOOGLE_GEMINI_KEY")
         if not gemini_key:
             raise ValueError("Google Gemini API key was not set!")

Also applies to: 275-281


254-266: Fallback not triggered when no braces (ValueError escapes).

Also reported previously; catch both exceptions so the JSON-fixer runs.

-        except json.JSONDecodeError:
+        except (json.JSONDecodeError, ValueError):
             print("[Stage 4] Response from gemini 2.5 pro is not a valid JSON object.")
             # Use another prompt to ensure the response is a "valid" json
             result = Stage4Executor.__ensure_json_format(response.text)
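The escape can be demonstrated in isolation: the no-brace path raises a plain ValueError, which a bare `except json.JSONDecodeError` would not catch (function and fixer names are hypothetical; the extraction mirrors the code under review):

```python
import json

def extract_and_parse(text: str) -> dict:
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in response")  # not a JSONDecodeError
    return json.loads(text[start : end + 1])

def parse_with_fallback(text: str, fixer) -> dict:
    try:
        return extract_and_parse(text)
    except (json.JSONDecodeError, ValueError):  # ValueError covers the no-brace case
        return fixer(text)

print(parse_with_fallback("no braces here", lambda t: {"fixed": True}))  # {'fixed': True}
```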
🧹 Nitpick comments (1)
src/processing_pipeline/stage_4.py (1)

221-251: Emit JSON directly from primary call (reduce brittle brace parsing).

Ask the model for JSON up front; add schema + MIME so you can parse response.text directly and avoid substring heuristics.

         response = client.models.generate_content(
             model=model_id,
             contents=user_prompt,
             config=GenerateContentConfig(
                 tools=[google_search_tool],
                 response_modalities=["TEXT"],
                 system_instruction=cls.SYSTEM_INSTRUCTION,
+                response_mime_type="application/json",
+                response_schema=cls.OUTPUT_SCHEMA,
                 max_output_tokens=8192,
                 safety_settings=[
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 84eee12 and 8609bc5.

📒 Files selected for processing (1)
  • src/processing_pipeline/stage_4.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
src/processing_pipeline/stage_4.py

260-260: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (2)
src/processing_pipeline/stage_4.py (2)

306-325: SafetySetting enum usage looks correct.

Enums are used consistently in the fallback path; this addresses the earlier type‑mismatch issue.


6-14: Good: v2 client/types and enums are correctly imported.

Imports align with the v2 API surface and enum usage throughout.
