Update by JuanBenitezG · Pull Request #62 · PublicDataWorks/verdad

JuanBenitezG · 2026-02-14T02:53:26Z

Summary by CodeRabbit

Documentation
- Added comprehensive documentation covering getting started, system architecture, pipeline stages, configuration, integration, and maintenance with detailed guides for database setup, testing, and deployment.
Infrastructure
- Migrated backend database from Supabase to PostgreSQL for improved performance and flexibility.
- Switched storage from cloud-based R2 to local filesystem for simplified deployment.
- Updated all pipeline stages to use new PostgreSQL and local storage backends.
Chores
- Removed legacy recording worker infrastructure and radio station adapter implementations.
- Added project configuration file (pyproject.toml) with Poetry support, testing setup, and coverage configuration.

… and standard pattern

Reorganized and improved documentation to follow a more modular and standard pattern

…lities - Deleted test_stations.py which contained unit tests for various radio station classes. - Removed test_generic_recording.py that included tests for the generic recording module. - Eliminated test_recording.py which had tests for the recording module and its associated functions.

- Replaced SupabaseClient with PostgresClient in stages 1 to 5 for database interactions. - Introduced LocalStorage for file handling instead of S3. - Updated audio file download and upload functions to work with the new storage client. - Enhanced audio file metadata handling to support both ISO strings and datetime objects. - Added environment variable loading with dotenv for better configuration management. - Created SQL migration scripts to reset the database and establish a minimal local schema. - Loaded initial prompt versions and heuristics into the database through migration scripts.

…itical debate fact-checking system - Created TESTING_GUIDE.md to outline testing procedures, including database connection, RPC function verification, and full processing pipeline tests. - Introduced DATABASE_SETUP_GUIDE.md detailing PostgreSQL setup, schema migrations, and function applications necessary for local development.

coderabbitai · 2026-02-14T02:53:44Z

Caution

Review failed

The pull request is closed.

Walkthrough

This PR refactors VERDAD from Supabase/S3 infrastructure to PostgreSQL/local storage, removes browser-based radio recording workers, introduces database and storage abstraction layers, and adds comprehensive documentation for pipeline stages, configuration, testing, and deployment workflows.

Changes

Cohort / File(s)	Summary
Infrastructure Deletion `Dockerfile.generic_recording_worker`, `Dockerfile.recording_worker`, `fly.generic_recording_worker.toml`, `fly.recording_worker.toml`, `scripts/generic_recording.sh`, `scripts/recording.sh`, `scripts/start_recording.sh`	Removes entire Docker image definitions and Fly.io process configurations for recording workers, along with shell scripts that orchestrated recording startup and stream capture.
Database/Storage Layer `src/processing_pipeline/postgres_client.py`, `src/processing_pipeline/local_storage.py`, `supabase/migrations/00_reset_database.sql`, `supabase/migrations/01_local_schema.sql`, `supabase/migrations/02_load_prompts.sql`	Introduces PostgreSQL client (replacing Supabase), local filesystem storage abstraction (replacing S3/R2), and comprehensive database schema migrations with tables for audio files, LLM responses, snippets, embeddings, prompts, and collaboration features.
Pipeline Stage Updates `src/processing_pipeline/stage_1.py`, `src/processing_pipeline/stage_2.py`, `src/processing_pipeline/stage_3.py`, `src/processing_pipeline/stage_4.py`, `src/processing_pipeline/stage_5.py`	Replaces SupabaseClient with PostgresClient and S3 storage with LocalStorage abstraction; updates function signatures and client instantiation across all pipeline stages; maintains functional equivalence with new backends.
Recording Module Removal `src/recording.py`, `src/generic_recording.py`	Deletes entire Prefect-based audio recording pipelines for both generic browser-based and station-specific direct URL recording modes, including metadata handling and database insertion logic.
Radio Station Classes Removal `src/radiostations/base.py`, `src/radiostations/khot.py`, `src/radiostations/kisf.py`, `src/radiostations/krgt.py`, `src/radiostations/wado.py`, `src/radiostations/waqi.py`, `src/radiostations/wkaq.py`, `src/radiostations/__init__.py`	Removes RadioStation base class with Selenium/PulseAudio browser automation and all six station implementations (KHOT, Kisf, Krgt, Wado, Waqi, Wkaq); eliminates public exports from radiostations package.
Test Suite Removals `tests/radiostations/test_base.py`, `tests/radiostations/test_stations.py`, `tests/test_generic_recording.py`, `tests/test_recording.py`	Deletes all test coverage for radio station implementations and recording modules (466, 119, 472, and 374 lines respectively).
Project Configuration `pyproject.toml`, `requirements.txt`, `src/processing_pipeline/__init__.py`, `src/main.py`, `scripts/load_prompts.py`	Adds Poetry project metadata, pytest/coverage configuration; updates module initialization with unified public API exports; switches SupabaseClient to PostgresClient in main; introduces prompt loading script with database seeding.
Comprehensive Documentation `docs/index.md`, `docs/getting-started/`, `docs/architecture/`, `docs/configuration/`, `docs/integration/`, `docs/maintenance/`, `docs/pipeline-stages/`, `docs/setup/`, `docs/testing/`	Adds 25+ new documentation files covering platform overview, quick-start, architecture deep-dives, data lifecycle, system design, Docker containerization, Fly.io deployment, prompt management, database setup, API integration, Slack/email notifications, real-time collaboration, testing strategy, and maintenance scripts (totaling 3,100+ lines).
README Update `README.md`	Expands Gemini LLM model references from single versions (1.5 Flash/Pro) to flexible ranges (1.5 or 2.5 Flash/Pro); adds documentation section index with links to comprehensive guides.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~150+ minutes

This PR requires deep review due to: large heterogeneous scope (infrastructure migration, client replacements, major deletions, extensive new documentation), new database layer implementation with complex RPC/CRUD operations, storage abstraction design decisions, SQL schema design and migrations, and impacts across five pipeline stages. The combination of logic-dense new code (postgres_client, local_storage), structural changes to core abstractions, and broad file distribution demands comprehensive cross-functional evaluation.

Possibly related PRs

[f] VER-266 - Replace Gemini 2.5 Pro with Flash with thinking in some stages #22: Modifies src/processing_pipeline/stage_1.py with related prompt and client changes in the same stage implementation.
[f] VER-274: Allow snippets from stage 3 to skip analysis review from stage 4 #30: Updates stage 3 and stage 4 safety settings and introduces postprocess utilities affecting pipeline analysis flows.
VER-297: Reactivate the preprocess step: Initial detection in Stage 1 #58: Overlaps on Stage 1 preprocessing, transcription code, and associated prompt/schema artifacts.

Suggested reviewers

nhphong

Poem

🐰 A migration whispers through the warren,
Supabase gave way to Postgres dawn,
Local storage hops where R2 once stood,
Browser bots retired to their bunker hood,
Documentation blooms in papers tall—
A refactored VERDAD, rebuilt for all! ✨

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Pylint (4.0.4)

src/processing_pipeline/__init__.py

************* Module .pylintrc
.pylintrc:1:0: F0011: error while parsing the configuration: File contains no section headers.
file: '.pylintrc', line: 1
'disable=C0116\n' (config-parse-error)
[
{
"type": "convention",
"module": "src.processing_pipeline.init",
"obj": "",
"line": 76,
"column": 0,
"endLine": null,
"endColumn": null,
"path": "src/processing_pipeline/init.py",
"symbol": "trailing-whitespace",
"message": "Trailing whitespace",
"message-id": "C0303"
},
{
"type": "convention",
"module": "src.processing_pipeline.init",
"obj": "",
"line": 85,
"column": 0,
"endLine": null,
"endColumn": null,
"path": "src/processing_pipeline/init.py",
"symbol": "trailing-whitespace",
"message": "Trailing whitespace",
"message-id": "C0303"
},
{
"type": "convention",
"module"

... [truncated 5256 characters] ...

els'",
"message-id": "E0401"
},
{
"type": "error",
"module": "src.processing_pipeline",
"obj": "",
"line": 64,
"column": 0,
"endLine": 66,
"endColumn": 1,
"path": "src/processing_pipeline/init.py",
"symbol": "import-error",
"message": "Unable to import 'processing_pipeline.stage_4'",
"message-id": "E0401"
},
{
"type": "error",
"module": "src.processing_pipeline",
"obj": "",
"line": 68,
"column": 0,
"endLine": 70,
"endColumn": 1,
"path": "src/processing_pipeline/init.py",
"symbol": "import-error",
"message": "Unable to import 'processing_pipeline.stage_5'",
"message-id": "E0401"
}
]

src/main.py

************* Module .pylintrc
.pylintrc:1:0: F0011: error while parsing the configuration: File contains no section headers.
file: '.pylintrc', line: 1
'disable=C0116\n' (config-parse-error)
[
{
"type": "convention",
"module": "src.main",
"obj": "",
"line": 19,
"column": 0,
"endLine": null,
"endColumn": null,
"path": "src/main.py",
"symbol": "line-too-long",
"message": "Line too long (114/100)",
"message-id": "C0301"
},
{
"type": "convention",
"module": "src.main",
"obj": "",
"line": 1,
"column": 0,
"endLine": null,
"endColumn": null,
"path": "src/main.py",
"symbol": "missing-module-docstring",
"message": "Missing module docstring",
"message-id": "C0114"
},
{
"type": "error",
"module": "src.main",
"obj": "",
"line": 6,
"column": 0,
"endLine": 6,
"endColumn": 82,
"path": "src/main.py",
"symbol": "import-error",
"message": "Unable to import 'processing_pipeline.stage_4'",
"message-id": "E0401"
},
{
"type": "error",
"module": "src.main",
"obj": "",
"line": 7,
"column": 0,
"endLine": 7,
"endColumn": 62,
"path": "src/main.py",
"symbol": "import-error",
"message": "Unable to import 'processing_pipeline.postgres_client'",
"message-id": "E0401"
},
{
"type": "convention",
"module": "src.main",
"obj": "test_stage_4",
"line": 15,
"column": 0,
"endLine": 15,
"endColumn": 16,
"path": "src/main.py",
"symbol": "missing-function-docstring",
"message": "Missing function or method docstring",
"message-id": "C0116"
}
]

scripts/load_prompts.py

************* Module .pylintrc
.pylintrc:1:0: F0011: error while parsing the configuration: File contains no section headers.
file: '.pylintrc', line: 1
'disable=C0116\n' (config-parse-error)
[
{
"type": "convention",
"module": "load_prompts",
"obj": "",
"line": 35,
"column": 0,
"endLine": null,
"endColumn": null,
"path": "scripts/load_prompts.py",
"symbol": "trailing-whitespace",
"message": "Trailing whitespace",
"message-id": "C0303"
},
{
"type": "convention",
"module": "load_prompts",
"obj": "",
"line": 37,
"column": 0,
"endLine": null,
"endColumn": null,
"path": "scripts/load_prompts.py",
"symbol": "trailing-whitespace",
"message": "Trailing whitespace",
"message-id": "C0303"
},
{
"type": "convention",
"module": "load_prompts",
"obj": "",
"line": 77,

... [truncated 7292 characters] ...

to a protected member _execute of a client class",
"message-id": "W0212"
},
{
"type": "warning",
"module": "load_prompts",
"obj": "load_prompts",
"line": 169,
"column": 8,
"endLine": 169,
"endColumn": 19,
"path": "scripts/load_prompts.py",
"symbol": "protected-access",
"message": "Access to a protected member _execute of a client class",
"message-id": "W0212"
},
{
"type": "warning",
"module": "load_prompts",
"obj": "",
"line": 8,
"column": 0,
"endLine": 8,
"endColumn": 9,
"path": "scripts/load_prompts.py",
"symbol": "unused-import",
"message": "Unused import os",
"message-id": "W0611"
}
]

7 others

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist · 2026-02-14T02:54:35Z

Summary of Changes

Hello @JuanBenitezG, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a fundamental shift in the project's operational backbone by replacing external cloud dependencies with a robust local infrastructure. This change aims to provide a more self-contained and manageable environment for the multi-stage AI pipeline. Concurrently, a substantial documentation effort has been completed, offering detailed insights into every aspect of the system, from its architectural design to specific pipeline stages and configuration options.

Highlights

Infrastructure Refactoring: The project's backend infrastructure has been significantly refactored, transitioning from cloud-specific services (Supabase and Cloudflare R2/S3) to a local PostgreSQL database and local filesystem storage for enhanced control and simplified local development.
Comprehensive Documentation: Extensive new documentation has been added, covering system architecture, pipeline stages, configuration, testing, and maintenance, providing a complete guide for developers and researchers.
Dependency Updates: Python dependencies have been updated to reflect the new local database and storage clients, along with additions for Pydantic, OpenAI, and Tiktoken, streamlining the development environment.
Radio Recording System Overhaul: The previous radio recording system, including Dockerfiles, shell scripts, and Python modules for specific radio station adapters, has been removed, indicating a shift in how audio ingestion is managed.

Changelog

Dockerfile.generic_recording_worker
- Removed Dockerfile for generic recording worker.
Dockerfile.recording_worker
- Removed Dockerfile for recording worker.
README.md
- Updated LLM model references to include Gemini 2.5 Flash/Pro.
- Added a new 'Documentation' section with a link to the comprehensive docs index.
docs/STAGE_1_DEEP_DIVE.md
- Added detailed documentation for Stage 1: Initial Disinformation Detection.
docs/TESTING_GUIDE.md
- Added a comprehensive guide for testing the entire processing pipeline locally.
docs/architecture/data-lifecycle.md
- Added documentation detailing the data lifecycle within the VERDAD platform.
docs/architecture/iterative-refinement.md
- Added documentation explaining iterative refinement and prompt versioning strategies.
docs/architecture/system-architecture.md
- Added documentation outlining the overall system architecture and pipeline orchestration.
docs/configuration/disinformation-heuristics.md
- Added documentation detailing the disinformation detection heuristics used by the system.
docs/configuration/docker-containerization.md
- Added documentation on Docker and containerization strategies for workers.
docs/configuration/flyio-configuration.md
- Added documentation for Fly.io deployment and process group configuration.
docs/configuration/heuristics-output-schemas.md
- Added documentation on heuristics and structured output schemas for AI analysis.
docs/configuration/prompt-management-advanced.md
- Added advanced documentation for prompt management, versioning, and iterative refinement.
docs/configuration/prompt-management.md
- Added documentation on general prompt management concepts and retrieval.
docs/configuration/radio-recording-system.md
- Added documentation for the radio recording system, including generic and direct URL modes.
docs/configuration/radio-station-adapters.md
- Added documentation on radio station adapters and their configuration.
docs/configuration/recording-workers.md
- Added documentation for recording workers and their deployment.
docs/getting-started/introduction.md
- Added an introduction to the VERDAD platform and its mission.
docs/getting-started/quick-start.md
- Added a quick start guide for setting up the recording and analysis pipeline.
docs/index.md
- Added the main index for all project documentation.
docs/integration/express-server-api.md
- Added documentation for the Express server and its API endpoints.
docs/integration/real-time-collaboration.md
- Added documentation on real-time collaboration features using Liveblocks.
docs/integration/slack-email-notifications.md
- Added documentation for Slack and email notification systems.
docs/integration/supabase-postgres-schema.md
- Added documentation detailing the Supabase and Postgres database schema.
docs/maintenance/bigquery-archival.md
- Added documentation for BigQuery archival processes.
docs/maintenance/maintenance-scripts.md
- Added documentation for various maintenance and administrative scripts.
docs/maintenance/utility-scripts.md
- Added documentation for utility scripts, including flow management and undo operations.
docs/pipeline-stages/stage-1-detection-transcription.md
- Added detailed documentation for Stage 1: Detection & Transcription.
docs/pipeline-stages/stage-2-audio-clipping.md
- Added detailed documentation for Stage 2: Audio Clipping.
docs/pipeline-stages/stage-3-in-depth-analysis.md
- Added detailed documentation for Stage 3: In-Depth Analysis.
docs/pipeline-stages/stage-4-analysis-review.md
- Added detailed documentation for Stage 4: Analysis Review.
docs/pipeline-stages/stage-5-semantic-embedding.md
- Added detailed documentation for Stage 5: Semantic Embedding.
docs/setup/DATABASE_SETUP_GUIDE.md
- Added a complete guide for setting up the PostgreSQL database locally.
docs/testing/testing-suite.md
- Added documentation for the project's testing suite.
fly.generic_recording_worker.toml
- Removed Fly.io configuration for generic recording worker.
fly.recording_worker.toml
- Removed Fly.io configuration for recording worker.
pyproject.toml
- Updated project metadata and added new Python dependencies including psycopg2-binary, pydub, openai, tiktoken, and pydantic.
requirements.txt
- Updated Python dependencies, removing commented-out cloud-specific packages.
scripts/generic_recording.sh
- Removed shell script for generic recording.
scripts/load_prompts.py
- Added a Python script to load prompts and heuristics into the PostgreSQL database.
scripts/recording.sh
- Removed shell script for direct URL recording.
scripts/start_recording.sh
- Removed shell script for starting recording deployments.
src/generic_recording.py
- Removed Python script for generic recording.
src/main.py
- Updated client initialization from Supabase to Postgres for Stage 4 testing.
src/processing_pipeline/init.py
- Updated imports and exports to reflect the new local storage and Postgres client modules.
src/processing_pipeline/local_storage.py
- Added a local filesystem storage client to replace S3/R2 interactions.
src/processing_pipeline/postgres_client.py
- Added a PostgreSQL client to replace Supabase interactions.
src/processing_pipeline/stage_1.py
- Updated client and storage interactions from Supabase/S3 to Postgres/LocalStorage.
- Improved datetime parsing for recorded_at metadata.
src/processing_pipeline/stage_2.py
- Updated client and storage interactions from Supabase/S3 to Postgres/LocalStorage.
src/processing_pipeline/stage_3.py
- Updated client and storage interactions from Supabase/S3 to Postgres/LocalStorage.
src/processing_pipeline/stage_4.py
- Updated client initialization from Supabase to Postgres.
src/processing_pipeline/stage_5.py
- Updated client initialization from Supabase to Postgres.
src/radiostations/init.py
- Removed radio station module initialization.
src/radiostations/base.py
- Removed base class for radio station adapters.
src/radiostations/khot.py
- Removed specific radio station adapter.
src/radiostations/kisf.py
- Removed specific radio station adapter.
src/radiostations/krgt.py
- Removed specific radio station adapter.
src/radiostations/wado.py
- Removed specific radio station adapter.
src/radiostations/waqi.py
- Removed specific radio station adapter.
src/radiostations/wkaq.py
- Removed specific radio station adapter.
src/recording.py
- Removed Python script for direct URL recording.
supabase/migrations/00_reset_database.sql
- Added SQL script to reset the public schema and enable the pgvector extension.
supabase/migrations/01_local_schema.sql
- Added SQL script for a minimal local PostgreSQL schema, removing Supabase-specific dependencies.
supabase/migrations/02_load_prompts.sql
- Added SQL script with placeholder entries for initial prompt versions and heuristics.

Activity

JuanBenitezG created this pull request with a generic title 'Update' and an empty description, indicating the intent needs to be inferred from the code changes.
The pull request introduces a significant refactoring of the project's data layer and adds extensive documentation, suggesting a major development effort rather than a minor update.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

Copilot

Pull request overview

This PR represents a major architectural refactoring that transitions the system from a cloud-based (Supabase/S3) to a local PostgreSQL/filesystem setup. The changes include:

Changes:

Replaces SupabaseClient with PostgresClient for direct PostgreSQL access
Replaces boto3/S3 storage with LocalStorage filesystem implementation
Removes entire recording infrastructure (recording.py, generic_recording.py, all RadioStation adapters)
Deletes all test files (test_recording.py, test_generic_recording.py, test_stations.py, test_base.py)
Adds comprehensive documentation (30+ markdown files)
Adds PostgreSQL migration files for local schema setup
Adds load_prompts.py script for initializing database prompts
Updates pyproject.toml with Poetry configuration
Removes Dockerfiles for recording workers

Reviewed changes

Copilot reviewed 67 out of 69 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
src/processing_pipeline/postgres_client.py	New PostgreSQL client replacing SupabaseClient
src/processing_pipeline/local_storage.py	New filesystem storage replacing S3/boto3
supabase/migrations/01_local_schema.sql	Database schema for local PostgreSQL
supabase/migrations/02_load_prompts.sql	Prompt initialization migration
scripts/load_prompts.py	Script to load prompts from files into database
src/processing_pipeline/stage_*.py	Updated to use PostgresClient and LocalStorage
src/main.py	Updated to use PostgresClient
docs/*	Extensive new documentation added
tests/*	All test files removed
src/recording.py, src/generic_recording.py, src/radiostations/*	Recording system removed

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-14T02:57:11Z

+    def update_snippet(self, id, transcription, translation, title, summary,
+                       explanation, disinformation_categories, keywords_detected,
+                       language, confidence_scores, emotional_tone, context,
+                       political_leaning, grounding_metadata, thought_summaries,
+                       analyzed_by, status, error_message, stage_3_prompt_version_id=None):
+        """Update snippet with analysis results."""
+        return self._execute("""
+            UPDATE snippets SET
+                transcription = %s, translation = %s, title = %s, summary = %s,
+                explanation = %s, disinformation_categories = %s, keywords_detected = %s,
+                language = %s, confidence_scores = %s, emotional_tone = %s, context = %s,
+                political_leaning = %s, grounding_metadata = %s, thought_summaries = %s,
+                analyzed_by = %s, previous_analysis = NULL, status = %s, error_message = %s,
+                stage_3_prompt_version_id = %s, updated_at = NOW()
+            WHERE id = %s
+        """, (transcription, translation, title, summary, explanation,
+              disinformation_categories, keywords_detected, language,
+              Json(confidence_scores), Json(emotional_tone), context,
+              Json(political_leaning), Json(grounding_metadata),
+              Json(thought_summaries), analyzed_by, status, error_message,
+              stage_3_prompt_version_id, id))


The update_snippet method accepts 17 parameters but the table schema may not have all these columns. Specifically, parameters like 'transcription', 'translation', 'title', 'summary', 'explanation', 'keywords_detected', 'language', 'emotional_tone', 'context', 'analyzed_by', 'thought_summaries', 'grounding_metadata', and 'stage_3_prompt_version_id' are being set, but the schema in 01_local_schema.sql doesn't define these columns on the snippets table. This will cause SQL errors when trying to update snippets.

Copilot · 2026-02-14T02:57:11Z

+CREATE TABLE snippets (
+    id UUID NOT NULL DEFAULT gen_random_uuid() PRIMARY KEY,
+    created_at TIMESTAMPTZ NOT NULL DEFAULT (NOW() AT TIME ZONE 'utc'),
+    updated_at TIMESTAMPTZ NOT NULL DEFAULT (NOW() AT TIME ZONE 'utc'),
+    audio_file UUID NOT NULL REFERENCES audio_files(id) ON DELETE CASCADE,
+    stage_1_llm_response UUID REFERENCES stage_1_llm_responses(id) ON DELETE CASCADE,
+    file_path TEXT NOT NULL,
+    file_size BIGINT NOT NULL,
+    duration INTEGER NOT NULL,
+    recorded_at TIMESTAMPTZ NOT NULL,
+    start_time INTEGER NOT NULL,
+    end_time INTEGER NOT NULL,
+    transcription TEXT,
+    previous_analysis JSONB,
+    final_review JSONB,
+    status processing_status NOT NULL DEFAULT 'New',
+    error_message TEXT,
+    hidden BOOLEAN DEFAULT FALSE,
+    prompt_version UUID
+);


The snippets table schema only includes 'transcription', 'previous_analysis', 'final_review' fields, but the PostgresClient.update_snippet() method tries to set many individual columns like 'translation', 'title', 'summary', 'explanation', 'disinformation_categories', 'keywords_detected', 'language', 'confidence_scores', 'emotional_tone', 'context', 'political_leaning', 'grounding_metadata', 'thought_summaries', and 'analyzed_by'. These columns don't exist in the schema. All analysis data should be stored in the JSONB 'previous_analysis' field instead.

Copilot · 2026-02-14T02:57:12Z

+INSERT INTO prompt_versions (
+    stage,
+    version_number,
+    llm_model,
+    prompt_text,
+    system_instruction,
+    output_schema,
+    is_active,
+    change_explanation
+) VALUES (
+    'stage_1',
+    1,
+    'gemini-2.5-flash',
+    -- prompt_text will need to be loaded from Stage_1_detection_prompt.md
+    'This is a placeholder - prompts need to be loaded via script',
+    'This is a placeholder - system instructions need to be loaded via script',
+    '{"type": "object"}'::jsonb,
+    TRUE,
+    'Initial version from migration'
+);


This migration inserts placeholder prompt data that will never be replaced by actual prompts. The comments state "prompts need to be loaded via script" but the INSERT still runs with placeholder text. This creates unusable prompt records that may cause pipeline failures if the load_prompts.py script isn't run afterwards. Consider either making this migration depend on the script running first, or removing these placeholder INSERTs entirely and only creating them via the script.

Copilot · 2026-02-14T02:57:12Z

+    def get_a_new_audio_file_and_reserve_it(self):
+        """Reserve an audio file for processing (Stage 1)."""
+        result = self._execute(
+            "SELECT fetch_a_new_audio_file_and_reserve_it()",
+            fetch_one=True
+        )
+        # Extract jsonb value from the single-column result
+        return result['fetch_a_new_audio_file_and_reserve_it'] if result else None


The PostgresClient RPC methods return different data structures than the original SupabaseClient. The Supabase RPC calls returned .data directly, but these return the result of the SQL function call which is wrapped differently. For example, line 95 extracts result['fetch_a_new_audio_file_and_reserve_it'] - this assumes the RPC function returns a single column with that exact name. This won't work correctly if the RPC function returns multiple columns or has a different return structure. Verify that the RPC functions in the database return data in the expected format.

Copilot · 2026-02-14T02:57:12Z

+CREATE TABLE snippet_embeddings (
+    id UUID NOT NULL DEFAULT gen_random_uuid() PRIMARY KEY,
+    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
+    snippet UUID NOT NULL REFERENCES snippets(id) ON DELETE CASCADE,
+    embedding vector(768) NOT NULL
+);


The schema defines a vector embedding column as vector(768) but the vector extension must be created first. The migration 00_reset_database.sql creates the extension, but if migrations run out of order or the extension creation fails, this table creation will fail. Consider adding a comment or check to ensure the vector extension is available before using the vector type.

Copilot · 2026-02-14T02:57:12Z

+    def insert_stage_1_llm_response(self, audio_file_id, initial_transcription,
+                                     initial_detection_result, transcriptor,
+                                     timestamped_transcription, detection_result,
+                                     status, detection_prompt_version_id=None,
+                                     transcription_prompt_version_id=None):
+        """Insert Stage 1 LLM response."""
+        # Note: Schema only has timestamped_transcription, detection_result, and prompt_version
+        # Using detection_prompt_version_id for prompt_version (main prompt used)
+        return self._execute("""
+            INSERT INTO stage_1_llm_responses 
+            (audio_file, timestamped_transcription, detection_result, status, prompt_version)
+            VALUES (%s, %s, %s, %s, %s)
+            RETURNING *
+        """, (audio_file_id, Json(timestamped_transcription), Json(detection_result),
+              status, detection_prompt_version_id),
+        fetch_one=True)


The PostgresClient doesn't implement several methods that exist in the original SupabaseClient, including methods for handling transcriptor fields, initial_transcription, and initial_detection_result. The insert_stage_1_llm_response method (lines 223-238) only inserts timestamped_transcription and detection_result, ignoring the initial_transcription, initial_detection_result, and transcriptor parameters that are passed in. This data loss could break existing code that relies on these fields.

gemini-code-assist

Code Review

This pull request significantly refactors the project's database and storage layers, migrating from Supabase-specific clients and S3 (R2) storage to a local PostgreSQL client (psycopg2) and local filesystem storage. This change involved removing boto3 and related S3/R2 configurations, introducing new PostgresClient and LocalStorage classes, and updating all processing pipeline stages (Stage 1 through Stage 5) to use these new local clients. Additionally, the entire generic and direct URL radio recording system, including its Dockerfiles, Python scripts, and associated radiostations modules, has been removed, indicating a deprecation or externalization of audio ingestion. New SQL migration files (00_reset_database.sql, 01_local_schema.sql, 02_load_prompts.sql) were added to support a local PostgreSQL setup, replacing the previous Supabase-dependent schema. Documentation was extensively updated to reflect these architectural changes, including new sections on local database setup, testing, and configuration, while also updating references to Gemini model versions. Review comments highlighted the need to add python-dateutil as a dependency, address potential prompt injection vulnerabilities by sanitizing LLM inputs, refactor PostgresClient's __del__ method for reliable connection closing, remove the placeholder 02_load_prompts.sql migration file to avoid ambiguity, remove a hardcoded default password from PostgresClient, and update hardcoded future dates in the documentation.

gemini-code-assist · 2026-02-14T02:58:07Z

+    recorded_at_str = audio_file["recorded_at"]
+    if isinstance(recorded_at_str, str):
+        # Parse ISO 8601 format with timezone (handles microseconds and offsets)
+        from dateutil import parser


The code now uses dateutil.parser, but python-dateutil is not listed as a dependency in pyproject.toml. This will cause an ImportError at runtime. Please add python-dateutil to the project's dependencies.

Additionally, per PEP 8, this import should be moved to the top of the file.

gemini-code-assist · 2026-02-14T02:58:08Z

        user_prompt = (
-            f"{prompt_version['user_prompt']}\n\n"
+            f"{prompt_version['prompt_text']}\n\n"
            f"Here is the metadata of the transcription:\n\n{json.dumps(metadata, indent=2)}\n\n"
            f"Here is the timestamped transcription:\n\n{timestamped_transcription}"
        )


Untrusted data from radio transcriptions is directly concatenated into LLM prompts. An attacker who can influence the content of a radio broadcast could inject malicious instructions into the prompt, potentially causing the LLM to bypass disinformation detection, return incorrect analysis, or leak internal prompt instructions.

Remediation: Implement robust prompt engineering techniques to mitigate injection, such as using clear delimiters for untrusted content, providing explicit system instructions on how to handle untrusted data, and using few-shot examples to reinforce desired behavior.

gemini-code-assist · 2026-02-14T02:58:08Z

+    def __del__(self):
+        """Cleanup on object destruction."""
+        self.close()


Using __del__ to close database connections is unreliable because its execution is not guaranteed by the Python interpreter. This can lead to resource leaks (unclosed connections). A more robust pattern is to implement a context manager (__enter__ and __exit__) so the client can be used with a with statement, ensuring the connection is always closed. This would also require updating the flows to use the with statement or a try...finally block to explicitly call close().

Suggested change

def __del__(self):

"""Cleanup on object destruction."""

self.close()

def __enter__(self):

return self

def __exit__(self, exc_type, exc_val, exc_tb):

self.close()

gemini-code-assist · 2026-02-14T02:58:08Z

+-- Load initial prompt versions from the prompts directory
+-- This migration should be run after 01_local_schema.sql
+
+-- Stage 1 Prompts
+INSERT INTO prompt_versions (
+    stage,
+    version_number,
+    llm_model,
+    prompt_text,
+    system_instruction,
+    output_schema,
+    is_active,
+    change_explanation
+) VALUES (
+    'stage_1',
+    1,
+    'gemini-2.5-flash',
+    -- prompt_text will need to be loaded from Stage_1_detection_prompt.md
+    'This is a placeholder - prompts need to be loaded via script',
+    'This is a placeholder - system instructions need to be loaded via script',
+    '{"type": "object"}'::jsonb,
+    TRUE,
+    'Initial version from migration'
+);
+
+-- Gemini Timestamped Transcription Prompts
+INSERT INTO prompt_versions (
+    stage,
+    version_number,
+    llm_model,
+    prompt_text,
+    system_instruction,
+    output_schema,
+    is_active,
+    change_explanation
+) VALUES (
+    'gemini_timestamped_transcription',
+    1,
+    'gemini-2.5-flash',
+    -- prompt_text will need to be loaded from Gemini_timestamped_transcription_generation_prompt.md
+    'This is a placeholder - prompts need to be loaded via script',
+    NULL,
+    '{"type": "object"}'::jsonb,
+    TRUE,
+    'Initial version from migration'
+);
+
+-- Stage 3 Prompts
+INSERT INTO prompt_versions (
+    stage,
+    version_number,
+    llm_model,
+    prompt_text,
+    system_instruction,
+    output_schema,
+    is_active,
+    change_explanation
+) VALUES (
+    'stage_3',
+    1,
+    'gemini-2.5-flash',
+    -- prompt_text will need to be loaded from Stage_3_analysis_prompt.md
+    'This is a placeholder - prompts need to be loaded via script',
+    'This is a placeholder - system instructions need to be loaded via script',
+    '{"type": "object"}'::jsonb,
+    TRUE,
+    'Initial version from migration'
+);
+
+-- Stage 1 Heuristics
+INSERT INTO heuristics (
+    stage,
+    version_number,
+    content,
+    is_active,
+    change_explanation
+) VALUES (
+    'stage_1',
+    1,
+    'This is a placeholder - heuristics need to be loaded via script',
+    TRUE,
+    'Initial version from migration'
+);
+
+-- Stage 3 Heuristics
+INSERT INTO heuristics (
+    stage,
+    version_number,
+    content,
+    is_active,
+    change_explanation
+) VALUES (
+    'stage_3',
+    1,
+    'This is a placeholder - heuristics need to be loaded via script',
+    TRUE,
+    'Initial version from migration'
+);


This SQL migration file appears to insert only placeholder data for prompts and heuristics, with comments indicating that a script should be used instead. This is confusing and could lead to incorrect data being loaded if run as part of an automated migration process. Since scripts/load_prompts.py is provided to correctly load this data from source files, this SQL migration file should probably be removed to avoid ambiguity and potential errors. The setup documentation should clearly state that scripts/load_prompts.py must be run to populate these tables.

gemini-code-assist · 2026-02-14T02:58:08Z

+            connection_string = os.getenv(
+                'DATABASE_URL',
+                'postgresql://verdad_user:your_password@localhost:5432/verdad_debates'
+            )


The PostgresClient class contains a hardcoded default connection string with a password (your_password). If the DATABASE_URL environment variable is not set, the application will attempt to connect using these credentials. This poses a risk if the application is deployed without proper environment configuration.

Remediation: Remove the default connection string or at least the hardcoded password. Ensure that the application fails if DATABASE_URL is not provided.

gemini-code-assist · 2026-02-14T02:58:08Z

+---
+
+**Document Version**: 1.0  
+**Last Updated**: January 23, 2026  


This documentation contains a hardcoded future date. It's better to use a placeholder like [Date] to avoid confusion and prevent the documentation from becoming outdated.

gemini-code-assist · 2026-02-14T02:58:08Z

+Move quick-start.md to getting-started/
+```


This looks like a leftover instruction from editing the file. It should be removed.

Suggested change

Move quick-start.md to getting-started/

```

gemini-code-assist · 2026-02-14T02:58:08Z

+    location_city,
+    status
+) VALUES (
+    'test/debate_sample_2026.mp3',


This documentation contains a hardcoded future date ('2026'). It's better to use a placeholder like [Date] or a more generic example to avoid confusion and prevent the documentation from becoming outdated. This occurs in multiple places in this file.

gemini-code-assist · 2026-02-14T02:58:08Z

+flake8 = ">=7.3.0,<8.0.0"
+isort = ">=7.0.0,<8.0.0"
+prefect = ">=3.6.12,<4.0.0"
+boto3 = ">=1.42.31,<2.0.0"


The boto3 dependency is included, but it seems all usage of it (and the S3 client) has been replaced by the new LocalStorage client. If boto3 is no longer used anywhere in the project, it should be removed from the dependencies to keep the project lean.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 52f33b8fcc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-02-14T02:59:02Z

+                transcription = %s, translation = %s, title = %s, summary = %s,
+                explanation = %s, disinformation_categories = %s, keywords_detected = %s,
+                language = %s, confidence_scores = %s, emotional_tone = %s, context = %s,
+                political_leaning = %s, grounding_metadata = %s, thought_summaries = %s,


Align snippet update query with migrated schema

update_snippet writes fields like translation, title, summary, explanation, keywords_detected, political_leaning, grounding_metadata, and stage_3_prompt_version_id, but the new local migration defines snippets without these columns (supabase/migrations/01_local_schema.sql only creates core clip metadata + transcription/previous_analysis). With the new PostgreSQL path, Stage 3/4 updates will fail at runtime with column does not exist errors as soon as this query runs.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-02-14T02:59:02Z

+                    snippet_document = %s, document_token_count = %s, embedding = %s,
+                    model_name = %s, status = %s, error_message = %s, updated_at = NOW()


Match embedding upsert columns to snippet_embeddings table

The upsert query assumes snippet_embeddings has snippet_document, document_token_count, model_name, status, error_message, and updated_at, but the new migration creates only id, created_at, snippet, and embedding (supabase/migrations/01_local_schema.sql:60-65). Stage 5 will therefore fail when attempting either INSERT or UPDATE through this method.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-02-14T02:59:02Z

+    def get_stage_1_llm_response_by_id(self, id, select="*"):
+        """Get Stage 1 LLM response by ID."""
+        return self._execute(
+            f"SELECT {select} FROM stage_1_llm_responses WHERE id = %s",


Handle Supabase-style relation selects before SQL execution

This method now interpolates select directly into raw SQL, but callers still pass Supabase relation syntax (e.g. stage_1.py uses select="*, audio_file(...)", and stage_3.py uses audio_file(...), stage_1_llm_response(...)). PostgreSQL will reject those expressions, so flows that fetch records via these helpers (redo/regenerate Stage 1 and Stage 3 by explicit snippet ID) fail with SQL syntax/function errors.

Useful? React with 👍 / 👎.

JuanBenitezG · 2026-02-14T03:01:42Z

PR creado por error, ignoren por favor
PR by mistake, ignore

JsNcAr added 5 commits January 20, 2026 11:35

docs: Reorganized and improved documentation to follow a more modular…

d81020a

… and standard pattern

Merge pull request #1 from PersonallyAI/docs/documentation-overhaul

9219abd

Reorganized and improved documentation to follow a more modular and standard pattern

Copilot AI review requested due to automatic review settings February 14, 2026 02:53

Copilot started reviewing on behalf of JuanBenitezG February 14, 2026 02:53 View session

JuanBenitezG closed this Feb 14, 2026

Copilot AI reviewed Feb 14, 2026

View reviewed changes

gemini-code-assist Bot reviewed Feb 14, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Feb 14, 2026

View reviewed changes

		snippet_document = %s, document_token_count = %s, embedding = %s,
		model_name = %s, status = %s, error_message = %s, updated_at = NOW()

Conversation

JuanBenitezG commented Feb 14, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Feb 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

gemini-code-assist Bot commented Feb 14, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Feb 14, 2026

Choose a reason for hiding this comment

JuanBenitezG commented Feb 14, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Feb 14, 2026 •

edited

Loading