VER-297: Reactivate the preprocess step: Initial detection in Stage 1#58
VER-297: Reactivate the preprocess step: Initial detection in Stage 1#58quancao-ea merged 4 commits intomainfrom
Conversation
- Split main.py into executors.py, tasks.py, and flows.py - Re-enable Stage 1 Preprocess - Refactor executor classes to accept gemini_client instead of gemini_key - Create Gemini client once in flows and pass through tasks to executors
There was a problem hiding this comment.
Important
Looks good to me! 👍
Reviewed everything up to 8163a06 in 11 seconds. Click for details.
- Reviewed
2711lines of code in21files - Skipped
0files when reviewing. - Skipped posting
0draft comments. View those below. - Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
Workflow ID: wflow_lUUywnteahdyQZ7M
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
WalkthroughThis PR restructures the Stage 1 audio processing pipeline from a monolithic module into a modular architecture with separate executors, flows, and tasks. It introduces new prompt versions for initial transcription and detection preprocessing, reorganizes prompt files into a stage-specific directory structure, and updates constants and import scripts to reflect the new enum-based organization. Changes
Sequence Diagram(s)sequenceDiagram
participant Flow as initial_disinformation_detection
participant Tasks as tasks module
participant Supabase as Supabase
participant S3 as S3/R2 Storage
participant Gemini as Gemini API
participant Executors as Executor Classes
Flow->>Supabase: fetch_a_new_audio_file_from_supabase()
Supabase-->>Flow: audio_file record
Flow->>S3: download_audio_file_from_s3()
S3-->>Flow: local audio file
Flow->>Tasks: process_audio_file()
Tasks->>Executors: Stage1PreprocessTranscriptionExecutor.run()
Executors->>Gemini: upload audio + generate_content(user_prompt)
Gemini-->>Executors: initial_transcription
Executors-->>Tasks: initial_transcription result
Tasks->>Executors: Stage1PreprocessDetectionExecutor.run()
Executors->>Gemini: generate_content(transcription + metadata)
Gemini-->>Executors: initial_detection_result
Executors-->>Tasks: initial_detection result
alt flagged_snippets exist
Tasks->>Executors: GeminiTimestampTranscriptionGenerator.run()
Executors->>Executors: split_audio_into_segments()
Executors->>Executors: transcribe_batch() for each segment
Executors->>Gemini: generate_content(segments)
Gemini-->>Executors: timestamped_transcription
Executors-->>Tasks: timestamped_transcription result
Tasks->>Executors: Stage1Executor.run()
Executors->>Gemini: generate_content(timestamped_transcription)
Gemini-->>Executors: main_detection_result
Executors-->>Tasks: main_detection result
end
Tasks->>Supabase: insert_stage_1_llm_response()
Supabase-->>Tasks: response inserted
Tasks->>Supabase: set_audio_file_status()
Supabase-->>Tasks: status updated
Tasks-->>Flow: process completed
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 Pylint (4.0.4)src/processing_pipeline/stage_1/__init__.py************* Module .pylintrc src/processing_pipeline/constants.py************* Module .pylintrc ... [truncated 9025 characters] ... ini_timestamped_transcription_generation_prompt", src/processing_pipeline/stage_1/executors.py************* Module .pylintrc ... [truncated 12854 characters] ... inal_transcription",
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @quancao-ea, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request reactivates and enhances the preprocessing capabilities within Stage 1 of the disinformation detection system. It introduces a two-stage detection process: an initial, lightweight transcription and detection, followed by a more detailed timestamped transcription and main detection only if potential disinformation is found in the initial pass. This change is accompanied by a comprehensive refactoring of the Stage 1 codebase into a modular structure, improving clarity and maintainability, and a reorganization of prompt files to support the new workflow. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request significantly refactors the Stage 1 processing pipeline, modularizing the code by splitting stage_1.py into separate files for flows, tasks, and executors, and introduces a two-step preprocess for initial transcription and detection. While this improves structure and maintainability, a critical security vulnerability exists: the direct concatenation of untrusted data (transcriptions and metadata) into LLM prompts makes the system susceptible to prompt injection attacks, which could bypass disinformation detection logic. This requires immediate attention, ideally through the implementation of delimiters and improved prompt engineering. Furthermore, the review suggests enhancing maintainability by removing magic numbers, reducing code duplication, and ensuring consistent client handling across tasks.
| user_prompt = ( | ||
| f"{prompt_version['user_prompt']}\n\n" | ||
| f"Here is the metadata of the transcription:\n\n{json.dumps(metadata, indent=2)}\n\n" | ||
| f"Here is the transcription:\n\n{transcription}" | ||
| ) |
There was a problem hiding this comment.
The user prompt is constructed by directly concatenating the transcription and metadata into the prompt string. Since the transcription is derived from external audio content, it can be used to perform a prompt injection attack. An attacker could include spoken instructions in the audio that, when transcribed, manipulate the LLM's behavior in the detection stage (e.g., to ignore disinformation or output specific malicious content). Delimiters and clear instructions should be used to separate untrusted content from the prompt's instructions.
| user_prompt = ( | ||
| f"{prompt_version['user_prompt']}\n\n" | ||
| f"Here is the metadata of the transcription:\n\n{json.dumps(metadata, indent=2)}\n\n" | ||
| f"Here is the timestamped transcription:\n\n{timestamped_transcription}" | ||
| ) |
There was a problem hiding this comment.
The user prompt is constructed by directly concatenating the timestamped_transcription and metadata into the prompt string. This allows for prompt injection attacks where malicious content in the transcription or metadata can override the LLM's intended instructions. Using delimiters (e.g., XML-like tags) and updating the system instructions to treat content within those delimiters as data only can help mitigate this risk.
| all_transcripts[absolute_segment_num] = segment["transcript"] | ||
|
|
||
| print(f"Batch complete: transcribed {actual_count} segments") | ||
| time.sleep(2) |
There was a problem hiding this comment.
The use of time.sleep(2) introduces a magic number. To improve code clarity and maintainability, consider defining this value as a named constant at the top of the class or module (e.g., _BATCH_PROCESSING_DELAY_SECONDS = 2). This makes the purpose of the delay explicit (e.g., to respect API rate limits) and simplifies future adjustments.
| def transcribe_audio_file_with_custom_timestamped_transcription_generator(audio_file): | ||
| print(f"Transcribing the audio file {audio_file} with the custom timestamped-transcription-generator") | ||
| gemini_key = os.getenv("GOOGLE_GEMINI_KEY") | ||
| timestamped_transcription = TimestampedTranscriptionGenerator.run(audio_file, gemini_key, 10) | ||
| return {"timestamped_transcription": timestamped_transcription} |
There was a problem hiding this comment.
This task creates its own Gemini API key from environment variables, which is inconsistent with other tasks like transcribe_audio_file_with_timestamp_with_gemini that receive a gemini_client instance. To improve consistency and centralize client management, consider refactoring this task and the underlying TimestampedTranscriptionGenerator to accept a gemini_client instance instead of creating one from an API key.
| if len(flagged_snippets) == 0: | ||
| print("No flagged snippets found during initial detection. Skipping timestamped transcription.") | ||
| insert_stage_1_llm_response( | ||
| supabase_client=supabase_client, | ||
| audio_file_id=audio_file["id"], | ||
| initial_transcription=initial_transcription, | ||
| initial_detection_result=initial_detection_result, | ||
| transcriptor=None, | ||
| timestamped_transcription=None, | ||
| detection_result=None, | ||
| status="Processed", | ||
| detection_prompt_version_id=None, | ||
| transcription_prompt_version_id=None, | ||
| ) | ||
| else: | ||
| # Timestamped transcription | ||
| transcriptor = GeminiModel.GEMINI_FLASH_LATEST | ||
| timestamped_transcription = transcribe_audio_file_with_timestamp_with_gemini( | ||
| gemini_client=gemini_client, | ||
| audio_file=local_file, | ||
| prompt_version=transcription_prompt_version, | ||
| model_name=transcriptor, | ||
| ) | ||
|
|
||
| # Main detection | ||
| detection_result = disinformation_detection_with_gemini( | ||
| gemini_client=gemini_client, | ||
| timestamped_transcription=timestamped_transcription["timestamped_transcription"], | ||
| metadata=metadata, | ||
| prompt_version=detection_prompt_version, | ||
| model_name=GeminiModel.GEMINI_FLASH_LATEST, | ||
| ) | ||
| print(f"Main detection result:\n{json.dumps(detection_result, indent=2, ensure_ascii=False)}\n") | ||
|
|
||
| main_flagged_snippets = detection_result["flagged_snippets"] | ||
|
|
||
| if len(main_flagged_snippets) == 0: | ||
| print("No flagged snippets found during main detection. Setting status to 'Processed'.") | ||
| insert_stage_1_llm_response( | ||
| supabase_client=supabase_client, | ||
| audio_file_id=audio_file["id"], | ||
| initial_transcription=initial_transcription, | ||
| initial_detection_result=initial_detection_result, | ||
| transcriptor=transcriptor, | ||
| timestamped_transcription=timestamped_transcription, | ||
| detection_result=detection_result, | ||
| status="Processed", | ||
| detection_prompt_version_id=detection_prompt_version["id"], | ||
| transcription_prompt_version_id=transcription_prompt_version["id"], | ||
| ) | ||
| else: | ||
| print(f"Found {len(main_flagged_snippets)} flagged snippets during main detection. Setting status to 'New'.") | ||
| insert_stage_1_llm_response( | ||
| supabase_client=supabase_client, | ||
| audio_file_id=audio_file["id"], | ||
| initial_transcription=initial_transcription, | ||
| initial_detection_result=initial_detection_result, | ||
| transcriptor=transcriptor, | ||
| timestamped_transcription=timestamped_transcription, | ||
| detection_result=detection_result, | ||
| status="New", | ||
| detection_prompt_version_id=detection_prompt_version["id"], | ||
| transcription_prompt_version_id=transcription_prompt_version["id"], | ||
| ) |
There was a problem hiding this comment.
This section has a nested if/else structure that leads to duplicated calls to insert_stage_1_llm_response. You can simplify this logic by preparing the parameters for the insertion and then making a single call at the end. This will make the code more readable and easier to maintain.
transcriptor = None
timestamped_transcription = None
detection_result = None
status = "Processed"
detection_prompt_version_id = None
transcription_prompt_version_id = None
if len(flagged_snippets) > 0:
# Timestamped transcription
transcriptor = GeminiModel.GEMINI_FLASH_LATEST
timestamped_transcription = transcribe_audio_file_with_timestamp_with_gemini(
gemini_client=gemini_client,
audio_file=local_file,
prompt_version=transcription_prompt_version,
model_name=transcriptor,
)
# Main detection
detection_result = disinformation_detection_with_gemini(
gemini_client=gemini_client,
timestamped_transcription=timestamped_transcription["timestamped_transcription"],
metadata=metadata,
prompt_version=detection_prompt_version,
model_name=GeminiModel.GEMINI_FLASH_LATEST,
)
print(f"Main detection result:\n{json.dumps(detection_result, indent=2, ensure_ascii=False)}\n")
detection_prompt_version_id = detection_prompt_version["id"]
transcription_prompt_version_id = transcription_prompt_version["id"]
if len(detection_result["flagged_snippets"]) > 0:
print(f"Found {len(detection_result['flagged_snippets'])} flagged snippets during main detection. Setting status to 'New'.")
status = "New"
else:
print("No flagged snippets found during main detection. Setting status to 'Processed'.")
else:
print("No flagged snippets found during initial detection. Skipping timestamped transcription.")
insert_stage_1_llm_response(
supabase_client=supabase_client,
audio_file_id=audio_file["id"],
initial_transcription=initial_transcription,
initial_detection_result=initial_detection_result,
transcriptor=transcriptor,
timestamped_transcription=timestamped_transcription,
detection_result=detection_result,
status=status,
detection_prompt_version_id=detection_prompt_version_id,
transcription_prompt_version_id=transcription_prompt_version_id,
)There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@src/processing_pipeline/stage_1/executors.py`:
- Around line 34-37: The polling loop that checks uploaded_file.state using
gemini_client.files.get can hang indefinitely; add a timeout mechanism: record a
start time (e.g., start = time.time()), introduce a configurable
max_wait_seconds (or DEFAULT_MAX_WAIT) and on each iteration check if
time.time() - start > max_wait_seconds, then break/raise an exception or set
uploaded_file.state to a failure status and log an error. Ensure the code
references the same uploaded_file and gemini_client.files.get call and keeps the
existing sleep interval while enforcing the timeout to avoid infinite looping.
In `@src/processing_pipeline/stage_1/tasks.py`:
- Line 153: The print call uses an unnecessary f-string prefix without any
placeholders; locate the print statement that prints "Processing initial
transcription with Gemini for disinformation detection" in
src/processing_pipeline/stage_1/tasks.py (the print(...) line) and remove the
leading "f" so it becomes a normal string literal (print("Processing initial
transcription with Gemini for disinformation detection")).
- Around line 382-385: The log message in reset_status_of_audio_files contains a
typo ("Reseting"); update the print statement in the reset_status_of_audio_files
function to use the correct spelling "Resetting the status of the audio files"
while keeping the rest of the message and behavior intact (the call to
supabase_client.reset_audio_file_status(audio_file_ids) should remain
unchanged).
- Around line 102-107: The file handle for audio_file is opened directly in the
call to client.audio.transcriptions.create, which can leak descriptors on
exceptions; change this to open the file with a context manager (use with
open(audio_file, "rb") as f:) and pass that handle (f) into
client.audio.transcriptions.create so the file is always closed after the call
(refer to response and client.audio.transcriptions.create in tasks.py).
🧹 Nitpick comments (10)
src/processing_pipeline/constants.py (2)
43-52: Consider using context managers for file operations.The
open().read()pattern without closing the file handle can lead to resource leaks, especially if these functions are called frequently.♻️ Suggested refactor using context managers
def get_user_prompt_for_stage_3(): - return open("prompts/Stage_3_analysis_prompt.md", "r").read() + with open("prompts/Stage_3_analysis_prompt.md", "r") as f: + return f.read() def get_system_instruction_for_stage_3(): - return open("prompts/Stage_3_system_instruction.md", "r").read() + with open("prompts/Stage_3_system_instruction.md", "r") as f: + return f.read() def get_output_schema_for_stage_3(): - return json.load(open("prompts/Stage_3_output_schema.json", "r")) + with open("prompts/Stage_3_output_schema.json", "r") as f: + return json.load(f)Apply the same pattern to other file-reading functions in this module.
79-126: Large block of commented-out code.This
__main__block contains extensive commented-out code. If it's no longer needed, consider removing it to reduce clutter. If it's useful for debugging, consider moving it to a separate script or documenting its purpose.src/processing_pipeline/stage_1/flows.py (4)
84-102: Consider handling cleanup failure after processing error.If
process_audio_fileraises an exception, the local file may not be cleaned up sinceos.removeis outside any try/finally block. This could cause disk space issues over time.♻️ Proposed fix to ensure file cleanup
if audio_file: local_file = download_audio_file_from_s3(s3_client, audio_file["file_path"]) - # Process the audio file - process_audio_file( - supabase_client=supabase_client, - gemini_client=gemini_client, - audio_file=audio_file, - local_file=local_file, - initial_transcription_prompt_version=initial_transcription_prompt_version, - initial_detection_prompt_version=initial_detection_prompt_version, - transcription_prompt_version=transcription_prompt_version, - detection_prompt_version=detection_prompt_version, - ) - processed_audio_files += 1 - print(f"Processed {processed_audio_files}/{limit} audio files") - - print(f"Delete the downloaded audio file: {local_file}") - os.remove(local_file) + try: + # Process the audio file + process_audio_file( + supabase_client=supabase_client, + gemini_client=gemini_client, + audio_file=audio_file, + local_file=local_file, + initial_transcription_prompt_version=initial_transcription_prompt_version, + initial_detection_prompt_version=initial_detection_prompt_version, + transcription_prompt_version=transcription_prompt_version, + detection_prompt_version=detection_prompt_version, + ) + processed_audio_files += 1 + print(f"Processed {processed_audio_files}/{limit} audio files") + finally: + print(f"Delete the downloaded audio file: {local_file}") + if os.path.exists(local_file): + os.remove(local_file)
158-158: Datetime parsing assumes fixed UTC offset format.The format string
%Y-%m-%dT%H:%M:%S+00:00will fail if the database returns a different timezone offset (e.g.,+05:30). Consider using a more robust parser.♻️ Proposed fix for flexible datetime parsing
+from datetime import datetime, timezone +from dateutil import parser as dateutil_parser ... - recorded_at = datetime.strptime(audio_file["recorded_at"], "%Y-%m-%dT%H:%M:%S+00:00") + recorded_at = dateutil_parser.isoparse(audio_file["recorded_at"])Alternatively, if you want to avoid adding a dependency:
recorded_at = datetime.fromisoformat(audio_file["recorded_at"].replace("+00:00", "+00:00"))Note:
datetime.fromisoformat()in Python 3.11+ handles ISO 8601 formats including timezone offsets.
241-256: Inconsistent transcriptor identifier.Line 242 uses a hardcoded string
"gemini-1206"while other parts of the codebase useGeminiModelenum values (e.g.,GeminiModel.GEMINI_FLASH_LATEST). Consider using the enum for consistency, or define this as a constant.♻️ Proposed fix
try: - transcriptor = "gemini-1206" + transcriptor = str(GeminiModel.GEMINI_FLASH_LATEST) timestamped_transcription = transcribe_audio_file_with_timestamp_with_gemini(
286-288: File cleanup should be in a finally block.Similar to
initial_disinformation_detection, if an exception occurs during processing, the downloaded file won't be cleaned up.♻️ Wrap processing in try/finally for cleanup
if stage_1_llm_response: print(f"Found stage 1 llm response {id}") # Get metadata of the transcription audio_file = stage_1_llm_response["audio_file"] local_file = download_audio_file_from_s3(s3_client, audio_file["file_path"]) - recorded_at = datetime.strptime(audio_file["recorded_at"], "%Y-%m-%dT%H:%M:%S+00:00") - ... - print(f"Processing completed for stage 1 llm response {id}") - print(f"Delete the downloaded audio file: {local_file}") - os.remove(local_file) + try: + recorded_at = datetime.strptime(audio_file["recorded_at"], "%Y-%m-%dT%H:%M:%S+00:00") + ... + print(f"Processing completed for stage 1 llm response {id}") + finally: + print(f"Delete the downloaded audio file: {local_file}") + if os.path.exists(local_file): + os.remove(local_file)src/processing_pipeline/stage_1/executors.py (2)
229-242: Consider using list unpacking for cleaner code.Per static analysis suggestion, use unpacking instead of concatenation for better readability.
♻️ Proposed fix
result = gemini_client.models.generate_content( model=model_name, - contents=[prompt_version["user_prompt"]] + segments, + contents=[prompt_version["user_prompt"], *segments], config=GenerateContentConfig(
267-285: Audio format is hardcoded to MP3.
AudioSegment.from_mp3()assumes the input is always MP3. If other audio formats are provided, this will fail with an unclear error. Consider usingAudioSegment.from_file()with format detection or explicit format parameter.♻️ Proposed fix for flexible format handling
`@classmethod` def split_audio_into_segments(cls, audio_file: str, segment_length_ms: int) -> list: - audio = AudioSegment.from_mp3(audio_file) + # Detect format from extension, default to mp3 + ext = pathlib.Path(audio_file).suffix.lower().lstrip('.') + audio = AudioSegment.from_file(audio_file, format=ext or "mp3") segments = []src/processing_pipeline/stage_1/tasks.py (2)
42-55: Same datetime format issue as in flows.py.Line 44 uses the same hardcoded format string. Consider extracting datetime parsing to a shared utility function.
357-359: Consider logging the full traceback for debugging.While catching bare
Exceptionis reasonable here to ensure status updates, the current code only prints the exception message. Consider logging the full traceback for easier debugging.♻️ Proposed fix
+import traceback ... except Exception as e: - print(f"Failed to process audio file {local_file}: {e}") + print(f"Failed to process audio file {local_file}: {e}\n{traceback.format_exc()}") set_audio_file_status(supabase_client, audio_file["id"], ProcessingStatus.ERROR, str(e))
| while uploaded_file.state.name == "PROCESSING": | ||
| print("Processing the uploaded audio file...") | ||
| time.sleep(1) | ||
| uploaded_file = gemini_client.files.get(name=uploaded_file.name) |
There was a problem hiding this comment.
Polling loop lacks timeout - potential infinite loop.
If the uploaded file never transitions out of PROCESSING state (due to API issues), this loop will run indefinitely. Consider adding a timeout.
🛡️ Proposed fix to add timeout
+ MAX_WAIT_SECONDS = 300 # 5 minutes
+ waited_seconds = 0
while uploaded_file.state.name == "PROCESSING":
print("Processing the uploaded audio file...")
time.sleep(1)
+ waited_seconds += 1
+ if waited_seconds >= MAX_WAIT_SECONDS:
+ gemini_client.files.delete(name=uploaded_file.name)
+ raise TimeoutError(f"File processing timed out after {MAX_WAIT_SECONDS} seconds")
uploaded_file = gemini_client.files.get(name=uploaded_file.name)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| while uploaded_file.state.name == "PROCESSING": | |
| print("Processing the uploaded audio file...") | |
| time.sleep(1) | |
| uploaded_file = gemini_client.files.get(name=uploaded_file.name) | |
| MAX_WAIT_SECONDS = 300 # 5 minutes | |
| waited_seconds = 0 | |
| while uploaded_file.state.name == "PROCESSING": | |
| print("Processing the uploaded audio file...") | |
| time.sleep(1) | |
| waited_seconds += 1 | |
| if waited_seconds >= MAX_WAIT_SECONDS: | |
| gemini_client.files.delete(name=uploaded_file.name) | |
| raise TimeoutError(f"File processing timed out after {MAX_WAIT_SECONDS} seconds") | |
| uploaded_file = gemini_client.files.get(name=uploaded_file.name) |
🤖 Prompt for AI Agents
In `@src/processing_pipeline/stage_1/executors.py` around lines 34 - 37, The
polling loop that checks uploaded_file.state using gemini_client.files.get can
hang indefinitely; add a timeout mechanism: record a start time (e.g., start =
time.time()), introduce a configurable max_wait_seconds (or DEFAULT_MAX_WAIT)
and on each iteration check if time.time() - start > max_wait_seconds, then
break/raise an exception or set uploaded_file.state to a failure status and log
an error. Ensure the code references the same uploaded_file and
gemini_client.files.get call and keeps the existing sleep interval while
enforcing the timeout to avoid infinite looping.
| response = client.audio.transcriptions.create( | ||
| model="whisper-1", | ||
| file=open(audio_file, "rb"), | ||
| response_format="verbose_json", | ||
| timestamp_granularities=["segment"], | ||
| ) |
There was a problem hiding this comment.
File handle not properly closed - potential resource leak.
The file is opened without a context manager, which could leak file descriptors if an exception occurs before the API call completes.
🐛 Proposed fix
# Transcribe the audio file
- response = client.audio.transcriptions.create(
- model="whisper-1",
- file=open(audio_file, "rb"),
- response_format="verbose_json",
- timestamp_granularities=["segment"],
- )
+ with open(audio_file, "rb") as f:
+ response = client.audio.transcriptions.create(
+ model="whisper-1",
+ file=f,
+ response_format="verbose_json",
+ timestamp_granularities=["segment"],
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| response = client.audio.transcriptions.create( | |
| model="whisper-1", | |
| file=open(audio_file, "rb"), | |
| response_format="verbose_json", | |
| timestamp_granularities=["segment"], | |
| ) | |
| with open(audio_file, "rb") as f: | |
| response = client.audio.transcriptions.create( | |
| model="whisper-1", | |
| file=f, | |
| response_format="verbose_json", | |
| timestamp_granularities=["segment"], | |
| ) |
🤖 Prompt for AI Agents
In `@src/processing_pipeline/stage_1/tasks.py` around lines 102 - 107, The file
handle for audio_file is opened directly in the call to
client.audio.transcriptions.create, which can leak descriptors on exceptions;
change this to open the file with a context manager (use with open(audio_file,
"rb") as f:) and pass that handle (f) into client.audio.transcriptions.create so
the file is always closed after the call (refer to response and
client.audio.transcriptions.create in tasks.py).
| metadata: dict, | ||
| prompt_version: dict, | ||
| ): | ||
| print(f"Processing initial transcription with Gemini for disinformation detection") |
There was a problem hiding this comment.
Remove extraneous f-string prefix.
Static analysis correctly identifies that this f-string has no placeholders.
🐛 Proposed fix
- print(f"Processing initial transcription with Gemini for disinformation detection")
+ print("Processing initial transcription with Gemini for disinformation detection")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| print(f"Processing initial transcription with Gemini for disinformation detection") | |
| print("Processing initial transcription with Gemini for disinformation detection") |
🧰 Tools
🪛 Ruff (0.14.14)
[error] 153-153: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
In `@src/processing_pipeline/stage_1/tasks.py` at line 153, The print call uses an
unnecessary f-string prefix without any placeholders; locate the print statement
that prints "Processing initial transcription with Gemini for disinformation
detection" in src/processing_pipeline/stage_1/tasks.py (the print(...) line) and
remove the leading "f" so it becomes a normal string literal (print("Processing
initial transcription with Gemini for disinformation detection")).
| @optional_task(log_prints=True, retries=3) | ||
| def reset_status_of_audio_files(supabase_client, audio_file_ids): | ||
| print(f"Reseting the status of the audio files: {audio_file_ids}") | ||
| supabase_client.reset_audio_file_status(audio_file_ids) |
There was a problem hiding this comment.
Minor typo in log message.
"Reseting" should be "Resetting".
📝 Proposed fix
`@optional_task`(log_prints=True, retries=3)
def reset_status_of_audio_files(supabase_client, audio_file_ids):
- print(f"Reseting the status of the audio files: {audio_file_ids}")
+ print(f"Resetting the status of the audio files: {audio_file_ids}")
supabase_client.reset_audio_file_status(audio_file_ids)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| @optional_task(log_prints=True, retries=3) | |
| def reset_status_of_audio_files(supabase_client, audio_file_ids): | |
| print(f"Reseting the status of the audio files: {audio_file_ids}") | |
| supabase_client.reset_audio_file_status(audio_file_ids) | |
| `@optional_task`(log_prints=True, retries=3) | |
| def reset_status_of_audio_files(supabase_client, audio_file_ids): | |
| print(f"Resetting the status of the audio files: {audio_file_ids}") | |
| supabase_client.reset_audio_file_status(audio_file_ids) |
🤖 Prompt for AI Agents
In `@src/processing_pipeline/stage_1/tasks.py` around lines 382 - 385, The log
message in reset_status_of_audio_files contains a typo ("Reseting"); update the
print statement in the reset_status_of_audio_files function to use the correct
spelling "Resetting the status of the audio files" while keeping the rest of the
message and behavior intact (the call to
supabase_client.reset_audio_file_status(audio_file_ids) should remain
unchanged).
Important
Reactivates Stage 1 preprocess step with initial transcription and detection, adding new prompts and updating the processing pipeline.
prompts/stage_1/preprocess/.initial_disinformation_detection()inflows.pyto include new preprocess steps.initial_transcription_user_prompt.mdandinitial_detection_user_prompt.mdfor transcription and detection.initial_transcription_output_schema.jsonandinitial_detection_output_schema.jsonfor output schemas.import_prompts_to_db.pyto include new prompt stages.stage_1.pyintostage_1/directory withexecutors.py,flows.py, andtasks.py.constants.pyto include new prompt stagesSTAGE_1_INITIAL_TRANSCRIPTIONandSTAGE_1_INITIAL_DETECTION.tasks.pyto handle new transcription and detection processes.This description was created by
for 8163a06. You can customize this summary. It will automatically update as commits are pushed.
Summary by CodeRabbit
Release Notes
New Features
Refactor
✏️ Tip: You can customize this high-level summary in your review settings.