Update#62
Conversation
… and standard pattern
Reorganized and improved documentation to follow a more modular and standard pattern
…lities - Deleted test_stations.py which contained unit tests for various radio station classes. - Removed test_generic_recording.py that included tests for the generic recording module. - Eliminated test_recording.py which had tests for the recording module and its associated functions.
- Replaced SupabaseClient with PostgresClient in stages 1 to 5 for database interactions. - Introduced LocalStorage for file handling instead of S3. - Updated audio file download and upload functions to work with the new storage client. - Enhanced audio file metadata handling to support both ISO strings and datetime objects. - Added environment variable loading with dotenv for better configuration management. - Created SQL migration scripts to reset the database and establish a minimal local schema. - Loaded initial prompt versions and heuristics into the database through migration scripts.
…itical debate fact-checking system - Created TESTING_GUIDE.md to outline testing procedures, including database connection, RPC function verification, and full processing pipeline tests. - Introduced DATABASE_SETUP_GUIDE.md detailing PostgreSQL setup, schema migrations, and function applications necessary for local development.
|
Caution Review failedThe pull request is closed. WalkthroughThis PR refactors VERDAD from Supabase/S3 infrastructure to PostgreSQL/local storage, removes browser-based radio recording workers, introduces database and storage abstraction layers, and adds comprehensive documentation for pipeline stages, configuration, testing, and deployment workflows. Changes
Estimated code review effort🎯 5 (Critical) | ⏱️ ~150+ minutes This PR requires deep review due to: large heterogeneous scope (infrastructure migration, client replacements, major deletions, extensive new documentation), new database layer implementation with complex RPC/CRUD operations, storage abstraction design decisions, SQL schema design and migrations, and impacts across five pipeline stages. The combination of logic-dense new code (postgres_client, local_storage), structural changes to core abstractions, and broad file distribution demands comprehensive cross-functional evaluation. Possibly related PRs
Suggested reviewers
Poem
✨ Finishing touches
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 Pylint (4.0.4)src/processing_pipeline/__init__.py************* Module .pylintrc ... [truncated 5256 characters] ... els'", src/main.py************* Module .pylintrc scripts/load_prompts.py************* Module .pylintrc ... [truncated 7292 characters] ... to a protected member _execute of a client class",
Tip Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @JuanBenitezG, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a fundamental shift in the project's operational backbone by replacing external cloud dependencies with a robust local infrastructure. This change aims to provide a more self-contained and manageable environment for the multi-stage AI pipeline. Concurrently, a substantial documentation effort has been completed, offering detailed insights into every aspect of the system, from its architectural design to specific pipeline stages and configuration options. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Pull request overview
This PR represents a major architectural refactoring that transitions the system from a cloud-based (Supabase/S3) to a local PostgreSQL/filesystem setup. The changes include:
Changes:
- Replaces SupabaseClient with PostgresClient for direct PostgreSQL access
- Replaces boto3/S3 storage with LocalStorage filesystem implementation
- Removes entire recording infrastructure (recording.py, generic_recording.py, all RadioStation adapters)
- Deletes all test files (test_recording.py, test_generic_recording.py, test_stations.py, test_base.py)
- Adds comprehensive documentation (30+ markdown files)
- Adds PostgreSQL migration files for local schema setup
- Adds load_prompts.py script for initializing database prompts
- Updates pyproject.toml with Poetry configuration
- Removes Dockerfiles for recording workers
Reviewed changes
Copilot reviewed 67 out of 69 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| src/processing_pipeline/postgres_client.py | New PostgreSQL client replacing SupabaseClient |
| src/processing_pipeline/local_storage.py | New filesystem storage replacing S3/boto3 |
| supabase/migrations/01_local_schema.sql | Database schema for local PostgreSQL |
| supabase/migrations/02_load_prompts.sql | Prompt initialization migration |
| scripts/load_prompts.py | Script to load prompts from files into database |
| src/processing_pipeline/stage_*.py | Updated to use PostgresClient and LocalStorage |
| src/main.py | Updated to use PostgresClient |
| docs/* | Extensive new documentation added |
| tests/* | All test files removed |
| src/recording.py, src/generic_recording.py, src/radiostations/* | Recording system removed |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def update_snippet(self, id, transcription, translation, title, summary, | ||
| explanation, disinformation_categories, keywords_detected, | ||
| language, confidence_scores, emotional_tone, context, | ||
| political_leaning, grounding_metadata, thought_summaries, | ||
| analyzed_by, status, error_message, stage_3_prompt_version_id=None): | ||
| """Update snippet with analysis results.""" | ||
| return self._execute(""" | ||
| UPDATE snippets SET | ||
| transcription = %s, translation = %s, title = %s, summary = %s, | ||
| explanation = %s, disinformation_categories = %s, keywords_detected = %s, | ||
| language = %s, confidence_scores = %s, emotional_tone = %s, context = %s, | ||
| political_leaning = %s, grounding_metadata = %s, thought_summaries = %s, | ||
| analyzed_by = %s, previous_analysis = NULL, status = %s, error_message = %s, | ||
| stage_3_prompt_version_id = %s, updated_at = NOW() | ||
| WHERE id = %s | ||
| """, (transcription, translation, title, summary, explanation, | ||
| disinformation_categories, keywords_detected, language, | ||
| Json(confidence_scores), Json(emotional_tone), context, | ||
| Json(political_leaning), Json(grounding_metadata), | ||
| Json(thought_summaries), analyzed_by, status, error_message, | ||
| stage_3_prompt_version_id, id)) |
There was a problem hiding this comment.
The update_snippet method accepts 17 parameters but the table schema may not have all these columns. Specifically, parameters like 'transcription', 'translation', 'title', 'summary', 'explanation', 'keywords_detected', 'language', 'emotional_tone', 'context', 'analyzed_by', 'thought_summaries', 'grounding_metadata', and 'stage_3_prompt_version_id' are being set, but the schema in 01_local_schema.sql doesn't define these columns on the snippets table. This will cause SQL errors when trying to update snippets.
| CREATE TABLE snippets ( | ||
| id UUID NOT NULL DEFAULT gen_random_uuid() PRIMARY KEY, | ||
| created_at TIMESTAMPTZ NOT NULL DEFAULT (NOW() AT TIME ZONE 'utc'), | ||
| updated_at TIMESTAMPTZ NOT NULL DEFAULT (NOW() AT TIME ZONE 'utc'), | ||
| audio_file UUID NOT NULL REFERENCES audio_files(id) ON DELETE CASCADE, | ||
| stage_1_llm_response UUID REFERENCES stage_1_llm_responses(id) ON DELETE CASCADE, | ||
| file_path TEXT NOT NULL, | ||
| file_size BIGINT NOT NULL, | ||
| duration INTEGER NOT NULL, | ||
| recorded_at TIMESTAMPTZ NOT NULL, | ||
| start_time INTEGER NOT NULL, | ||
| end_time INTEGER NOT NULL, | ||
| transcription TEXT, | ||
| previous_analysis JSONB, | ||
| final_review JSONB, | ||
| status processing_status NOT NULL DEFAULT 'New', | ||
| error_message TEXT, | ||
| hidden BOOLEAN DEFAULT FALSE, | ||
| prompt_version UUID | ||
| ); |
There was a problem hiding this comment.
The snippets table schema only includes 'transcription', 'previous_analysis', 'final_review' fields, but the PostgresClient.update_snippet() method tries to set many individual columns like 'translation', 'title', 'summary', 'explanation', 'disinformation_categories', 'keywords_detected', 'language', 'confidence_scores', 'emotional_tone', 'context', 'political_leaning', 'grounding_metadata', 'thought_summaries', and 'analyzed_by'. These columns don't exist in the schema. All analysis data should be stored in the JSONB 'previous_analysis' field instead.
| INSERT INTO prompt_versions ( | ||
| stage, | ||
| version_number, | ||
| llm_model, | ||
| prompt_text, | ||
| system_instruction, | ||
| output_schema, | ||
| is_active, | ||
| change_explanation | ||
| ) VALUES ( | ||
| 'stage_1', | ||
| 1, | ||
| 'gemini-2.5-flash', | ||
| -- prompt_text will need to be loaded from Stage_1_detection_prompt.md | ||
| 'This is a placeholder - prompts need to be loaded via script', | ||
| 'This is a placeholder - system instructions need to be loaded via script', | ||
| '{"type": "object"}'::jsonb, | ||
| TRUE, | ||
| 'Initial version from migration' | ||
| ); |
There was a problem hiding this comment.
This migration inserts placeholder prompt data that will never be replaced by actual prompts. The comments state "prompts need to be loaded via script" but the INSERT still runs with placeholder text. This creates unusable prompt records that may cause pipeline failures if the load_prompts.py script isn't run afterwards. Consider either making this migration depend on the script running first, or removing these placeholder INSERTs entirely and only creating them via the script.
| def get_a_new_audio_file_and_reserve_it(self): | ||
| """Reserve an audio file for processing (Stage 1).""" | ||
| result = self._execute( | ||
| "SELECT fetch_a_new_audio_file_and_reserve_it()", | ||
| fetch_one=True | ||
| ) | ||
| # Extract jsonb value from the single-column result | ||
| return result['fetch_a_new_audio_file_and_reserve_it'] if result else None |
There was a problem hiding this comment.
The PostgresClient RPC methods return different data structures than the original SupabaseClient. The Supabase RPC calls returned .data directly, but these return the result of the SQL function call which is wrapped differently. For example, line 95 extracts result['fetch_a_new_audio_file_and_reserve_it'] - this assumes the RPC function returns a single column with that exact name. This won't work correctly if the RPC function returns multiple columns or has a different return structure. Verify that the RPC functions in the database return data in the expected format.
| CREATE TABLE snippet_embeddings ( | ||
| id UUID NOT NULL DEFAULT gen_random_uuid() PRIMARY KEY, | ||
| created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), | ||
| snippet UUID NOT NULL REFERENCES snippets(id) ON DELETE CASCADE, | ||
| embedding vector(768) NOT NULL | ||
| ); |
There was a problem hiding this comment.
The schema defines a vector embedding column as vector(768) but the vector extension must be created first. The migration 00_reset_database.sql creates the extension, but if migrations run out of order or the extension creation fails, this table creation will fail. Consider adding a comment or check to ensure the vector extension is available before using the vector type.
| def insert_stage_1_llm_response(self, audio_file_id, initial_transcription, | ||
| initial_detection_result, transcriptor, | ||
| timestamped_transcription, detection_result, | ||
| status, detection_prompt_version_id=None, | ||
| transcription_prompt_version_id=None): | ||
| """Insert Stage 1 LLM response.""" | ||
| # Note: Schema only has timestamped_transcription, detection_result, and prompt_version | ||
| # Using detection_prompt_version_id for prompt_version (main prompt used) | ||
| return self._execute(""" | ||
| INSERT INTO stage_1_llm_responses | ||
| (audio_file, timestamped_transcription, detection_result, status, prompt_version) | ||
| VALUES (%s, %s, %s, %s, %s) | ||
| RETURNING * | ||
| """, (audio_file_id, Json(timestamped_transcription), Json(detection_result), | ||
| status, detection_prompt_version_id), | ||
| fetch_one=True) |
There was a problem hiding this comment.
The PostgresClient doesn't implement several methods that exist in the original SupabaseClient, including methods for handling transcriptor fields, initial_transcription, and initial_detection_result. The insert_stage_1_llm_response method (lines 223-238) only inserts timestamped_transcription and detection_result, ignoring the initial_transcription, initial_detection_result, and transcriptor parameters that are passed in. This data loss could break existing code that relies on these fields.
There was a problem hiding this comment.
Code Review
This pull request significantly refactors the project's database and storage layers, migrating from Supabase-specific clients and S3 (R2) storage to a local PostgreSQL client (psycopg2) and local filesystem storage. This change involved removing boto3 and related S3/R2 configurations, introducing new PostgresClient and LocalStorage classes, and updating all processing pipeline stages (Stage 1 through Stage 5) to use these new local clients. Additionally, the entire generic and direct URL radio recording system, including its Dockerfiles, Python scripts, and associated radiostations modules, has been removed, indicating a deprecation or externalization of audio ingestion. New SQL migration files (00_reset_database.sql, 01_local_schema.sql, 02_load_prompts.sql) were added to support a local PostgreSQL setup, replacing the previous Supabase-dependent schema. Documentation was extensively updated to reflect these architectural changes, including new sections on local database setup, testing, and configuration, while also updating references to Gemini model versions. Review comments highlighted the need to add python-dateutil as a dependency, address potential prompt injection vulnerabilities by sanitizing LLM inputs, refactor PostgresClient's __del__ method for reliable connection closing, remove the placeholder 02_load_prompts.sql migration file to avoid ambiguity, remove a hardcoded default password from PostgresClient, and update hardcoded future dates in the documentation.
| recorded_at_str = audio_file["recorded_at"] | ||
| if isinstance(recorded_at_str, str): | ||
| # Parse ISO 8601 format with timezone (handles microseconds and offsets) | ||
| from dateutil import parser |
There was a problem hiding this comment.
| user_prompt = ( | ||
| f"{prompt_version['user_prompt']}\n\n" | ||
| f"{prompt_version['prompt_text']}\n\n" | ||
| f"Here is the metadata of the transcription:\n\n{json.dumps(metadata, indent=2)}\n\n" | ||
| f"Here is the timestamped transcription:\n\n{timestamped_transcription}" | ||
| ) |
There was a problem hiding this comment.
Untrusted data from radio transcriptions is directly concatenated into LLM prompts. An attacker who can influence the content of a radio broadcast could inject malicious instructions into the prompt, potentially causing the LLM to bypass disinformation detection, return incorrect analysis, or leak internal prompt instructions.
Remediation: Implement robust prompt engineering techniques to mitigate injection, such as using clear delimiters for untrusted content, providing explicit system instructions on how to handle untrusted data, and using few-shot examples to reinforce desired behavior.
| def __del__(self): | ||
| """Cleanup on object destruction.""" | ||
| self.close() |
There was a problem hiding this comment.
Using __del__ to close database connections is unreliable because its execution is not guaranteed by the Python interpreter. This can lead to resource leaks (unclosed connections). A more robust pattern is to implement a context manager (__enter__ and __exit__) so the client can be used with a with statement, ensuring the connection is always closed. This would also require updating the flows to use the with statement or a try...finally block to explicitly call close().
| def __del__(self): | |
| """Cleanup on object destruction.""" | |
| self.close() | |
| def __enter__(self): | |
| return self | |
| def __exit__(self, exc_type, exc_val, exc_tb): | |
| self.close() |
| -- Load initial prompt versions from the prompts directory | ||
| -- This migration should be run after 01_local_schema.sql | ||
|
|
||
| -- Stage 1 Prompts | ||
| INSERT INTO prompt_versions ( | ||
| stage, | ||
| version_number, | ||
| llm_model, | ||
| prompt_text, | ||
| system_instruction, | ||
| output_schema, | ||
| is_active, | ||
| change_explanation | ||
| ) VALUES ( | ||
| 'stage_1', | ||
| 1, | ||
| 'gemini-2.5-flash', | ||
| -- prompt_text will need to be loaded from Stage_1_detection_prompt.md | ||
| 'This is a placeholder - prompts need to be loaded via script', | ||
| 'This is a placeholder - system instructions need to be loaded via script', | ||
| '{"type": "object"}'::jsonb, | ||
| TRUE, | ||
| 'Initial version from migration' | ||
| ); | ||
|
|
||
| -- Gemini Timestamped Transcription Prompts | ||
| INSERT INTO prompt_versions ( | ||
| stage, | ||
| version_number, | ||
| llm_model, | ||
| prompt_text, | ||
| system_instruction, | ||
| output_schema, | ||
| is_active, | ||
| change_explanation | ||
| ) VALUES ( | ||
| 'gemini_timestamped_transcription', | ||
| 1, | ||
| 'gemini-2.5-flash', | ||
| -- prompt_text will need to be loaded from Gemini_timestamped_transcription_generation_prompt.md | ||
| 'This is a placeholder - prompts need to be loaded via script', | ||
| NULL, | ||
| '{"type": "object"}'::jsonb, | ||
| TRUE, | ||
| 'Initial version from migration' | ||
| ); | ||
|
|
||
| -- Stage 3 Prompts | ||
| INSERT INTO prompt_versions ( | ||
| stage, | ||
| version_number, | ||
| llm_model, | ||
| prompt_text, | ||
| system_instruction, | ||
| output_schema, | ||
| is_active, | ||
| change_explanation | ||
| ) VALUES ( | ||
| 'stage_3', | ||
| 1, | ||
| 'gemini-2.5-flash', | ||
| -- prompt_text will need to be loaded from Stage_3_analysis_prompt.md | ||
| 'This is a placeholder - prompts need to be loaded via script', | ||
| 'This is a placeholder - system instructions need to be loaded via script', | ||
| '{"type": "object"}'::jsonb, | ||
| TRUE, | ||
| 'Initial version from migration' | ||
| ); | ||
|
|
||
| -- Stage 1 Heuristics | ||
| INSERT INTO heuristics ( | ||
| stage, | ||
| version_number, | ||
| content, | ||
| is_active, | ||
| change_explanation | ||
| ) VALUES ( | ||
| 'stage_1', | ||
| 1, | ||
| 'This is a placeholder - heuristics need to be loaded via script', | ||
| TRUE, | ||
| 'Initial version from migration' | ||
| ); | ||
|
|
||
| -- Stage 3 Heuristics | ||
| INSERT INTO heuristics ( | ||
| stage, | ||
| version_number, | ||
| content, | ||
| is_active, | ||
| change_explanation | ||
| ) VALUES ( | ||
| 'stage_3', | ||
| 1, | ||
| 'This is a placeholder - heuristics need to be loaded via script', | ||
| TRUE, | ||
| 'Initial version from migration' | ||
| ); |
There was a problem hiding this comment.
This SQL migration file appears to insert only placeholder data for prompts and heuristics, with comments indicating that a script should be used instead. This is confusing and could lead to incorrect data being loaded if run as part of an automated migration process. Since scripts/load_prompts.py is provided to correctly load this data from source files, this SQL migration file should probably be removed to avoid ambiguity and potential errors. The setup documentation should clearly state that scripts/load_prompts.py must be run to populate these tables.
| connection_string = os.getenv( | ||
| 'DATABASE_URL', | ||
| 'postgresql://verdad_user:your_password@localhost:5432/verdad_debates' | ||
| ) |
There was a problem hiding this comment.
The PostgresClient class contains a hardcoded default connection string with a password (your_password). If the DATABASE_URL environment variable is not set, the application will attempt to connect using these credentials. This poses a risk if the application is deployed without proper environment configuration.
Remediation: Remove the default connection string or at least the hardcoded password. Ensure that the application fails if DATABASE_URL is not provided.
| --- | ||
|
|
||
| **Document Version**: 1.0 | ||
| **Last Updated**: January 23, 2026 |
| Move quick-start.md to getting-started/ | ||
| ``` |
| location_city, | ||
| status | ||
| ) VALUES ( | ||
| 'test/debate_sample_2026.mp3', |
| flake8 = ">=7.3.0,<8.0.0" | ||
| isort = ">=7.0.0,<8.0.0" | ||
| prefect = ">=3.6.12,<4.0.0" | ||
| boto3 = ">=1.42.31,<2.0.0" |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 52f33b8fcc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| transcription = %s, translation = %s, title = %s, summary = %s, | ||
| explanation = %s, disinformation_categories = %s, keywords_detected = %s, | ||
| language = %s, confidence_scores = %s, emotional_tone = %s, context = %s, | ||
| political_leaning = %s, grounding_metadata = %s, thought_summaries = %s, |
There was a problem hiding this comment.
Align snippet update query with migrated schema
update_snippet writes fields like translation, title, summary, explanation, keywords_detected, political_leaning, grounding_metadata, and stage_3_prompt_version_id, but the new local migration defines snippets without these columns (supabase/migrations/01_local_schema.sql only creates core clip metadata + transcription/previous_analysis). With the new PostgreSQL path, Stage 3/4 updates will fail at runtime with column does not exist errors as soon as this query runs.
Useful? React with 👍 / 👎.
| snippet_document = %s, document_token_count = %s, embedding = %s, | ||
| model_name = %s, status = %s, error_message = %s, updated_at = NOW() |
There was a problem hiding this comment.
Match embedding upsert columns to snippet_embeddings table
The upsert query assumes snippet_embeddings has snippet_document, document_token_count, model_name, status, error_message, and updated_at, but the new migration creates only id, created_at, snippet, and embedding (supabase/migrations/01_local_schema.sql:60-65). Stage 5 will therefore fail when attempting either INSERT or UPDATE through this method.
Useful? React with 👍 / 👎.
| def get_stage_1_llm_response_by_id(self, id, select="*"): | ||
| """Get Stage 1 LLM response by ID.""" | ||
| return self._execute( | ||
| f"SELECT {select} FROM stage_1_llm_responses WHERE id = %s", |
There was a problem hiding this comment.
Handle Supabase-style relation selects before SQL execution
This method now interpolates select directly into raw SQL, but callers still pass Supabase relation syntax (e.g. stage_1.py uses select="*, audio_file(...)", and stage_3.py uses audio_file(...), stage_1_llm_response(...)). PostgreSQL will reject those expressions, so flows that fetch records via these helpers (redo/regenerate Stage 1 and Stage 3 by explicit snippet ID) fail with SQL syntax/function errors.
Useful? React with 👍 / 👎.
|
PR creado por error, ignoren por favor |
Summary by CodeRabbit
Documentation
Infrastructure
Chores