
VER-306: Fix crashing issue due to out of memory in Stage 3 machine#69

Merged
quancao-ea merged 4 commits into main from fix/stage-3-out-of-memory-issue on Mar 20, 2026

Conversation

@quancao-ea (Collaborator) commented Mar 20, 2026

Summary by CodeRabbit

  • New Features

    • Web-search-enabled fact-checking and web-content fetching for richer, more grounded analysis
    • Fully asynchronous processing for faster, non-blocking content analysis and polling
  • Bug Fixes

    • Improved validation, clearer error messages, and more resilient fallback handling during analysis
  • Chores

    • Pinned new runtime dependencies to support async HTTP access and HTML→markdown conversion (aiohttp, html2text)

Implement SearXNG web search and URL content extraction functionality to enable web-based information gathering in Stage 3 processing. These tools provide asynchronous web search capabilities and HTML-to-markdown conversion for content extraction.

Migrate the Stage 3 processing pipeline from synchronous to asynchronous execution with enhanced web search capabilities. Replace the Gemini CLI and Google Search grounding with direct SDK web search tools.

Key changes:
- Convert executor and flow to async/await pattern
- Replace CLI and Google Search with custom searxng_web_search/web_url_read tools
- Add dedicated constants module for model configuration
- Simplify error handling with unified fallback strategy
- Pass gemini_client instance instead of API key
- Improve memory efficiency with streaming operations
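The async/await conversion listed above can be sketched as follows. This is a hypothetical illustration using only names from this PR's summary (`Stage3Executor`, `run_async`); the body is a placeholder, not the actual executor:

```python
import asyncio

class Stage3Executor:
    # Illustrative only: the real executor uploads audio and calls the
    # GenAI SDK; here the awaited call is stubbed with asyncio.sleep(0).
    async def run_async(self, gemini_client, model: str, audio_file: str) -> dict:
        # Awaiting I/O yields the event loop instead of blocking a worker
        # thread, so concurrent snippets don't pile up in memory.
        await asyncio.sleep(0)  # stands in for awaited SDK calls
        return {"model": model, "audio_file": audio_file}

result = asyncio.run(Stage3Executor().run_async(None, "gemini-2.5-pro", "clip.mp3"))
print(result["model"])
```

The caller drives the coroutine with `await` (or `asyncio.run` at the top level), which is the shape the converted flow and task functions take in this PR.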
@linear Bot commented Mar 20, 2026

@coderabbitai Bot commented Mar 20, 2026

Walkthrough

This PR converts Stage 3 to async execution, introduces SDK-based web-search grounding using SearXNG and URL reading, centralizes a shared GenAI client (removing direct API-key passing), adds web tooling and model constants, and refactors executors, flows, and tasks to use the new async flow.

Changes

  • Dependencies — requirements.txt
    Added aiohttp==3.13.3 and html2text==2025.4.15.
  • Model Constants — src/processing_pipeline/stage_3/constants.py
    New module exposing MAIN_MODEL = GeminiModel.GEMINI_2_5_PRO and FALLBACK_MODEL = GeminiModel.GEMINI_2_5_FLASH.
  • Web Tooling — src/processing_pipeline/stage_3/web_tools.py
    New async utilities: searxng_web_search() (SearXNG JSON normalization) and web_url_read() (fetch + HTML→markdown via html2text), sharing an aiohttp timeout and SSL config.
  • Core Executor Refactoring — src/processing_pipeline/stage_3/executors.py
    Replaced sync run() with async run_async(gemini_client: genai.Client, ...); removed API-key handling and the CLI fallback; added SDK-based web-search/function-calling, async schema structuring, and file cleanup via gemini_client.files.delete().
  • Async Flow & Task Updates — src/processing_pipeline/stage_3/flows.py, src/processing_pipeline/stage_3/tasks.py
    Converted in_depth_analysis, process_snippet, and analyze_snippet to async; create and pass a shared genai.Client; updated error handling to preserve auth failures, attempt the fallback model, and build detailed error messages for complex exceptions.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Flow as in_depth_analysis (Flow)
    participant Task as process_snippet (Task)
    participant Executor as Stage3Executor.run_async()
    participant GenAI as GenAI Client (SDK)
    participant Files as GenAI Files API
    participant Search as searxng_web_search (Web Tools)
    participant WebRead as web_url_read (Web Tools)

    Flow->>Task: await process_snippet(gemini_client, snippet)
    Task->>Executor: await run_async(gemini_client, model, audio_file)
    Executor->>GenAI: upload file (aio)
    GenAI->>Files: store file -> returns file_id
    Executor->>GenAI: aio.models.generate_content (with automatic_function_calling)
    GenAI->>Search: call searxng_web_search(query)
    Search-->>GenAI: results
    GenAI->>WebRead: call web_url_read(url)
    WebRead-->>GenAI: markdown content
    GenAI-->>Executor: analysis + grounding
    Executor->>Files: delete(file_id)
    Executor-->>Task: analysis result
    Task-->>Flow: final result
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • nhphong

"🐰
I hop through code with eager cheer,
Async hops and web-tools near,
SearXNG crumbs and markdown bright,
Gemini hums through day and night,
Cheers — the pipeline's taking flight!"

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 44.44%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check — ❓ Inconclusive. The title references a specific issue (VER-306) and describes the fix (crashing/out of memory in Stage 3), but the actual changes are a comprehensive refactor from synchronous to asynchronous execution, new web tools, and a restructured Stage 3 pipeline, not solely a memory fix. Resolution: clarify whether the title should emphasize the primary architectural change (the async refactor) or memory optimization; consider a more descriptive title like 'VER-306: Refactor Stage 3 to async execution with web tools' if the async conversion is the main solution.
✅ Passed checks (1 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Pylint (4.0.5)
src/processing_pipeline/stage_3/tasks.py

************* Module .pylintrc
.pylintrc:1:0: F0011: error while parsing the configuration: File contains no section headers.
file: '.pylintrc', line: 1
'disable=C0116\n' (config-parse-error)
[
{
"type": "convention",
"module": "src.processing_pipeline.stage_3.tasks",
"obj": "",
"line": 24,
"column": 0,
"endLine": null,
"endColumn": null,
"path": "src/processing_pipeline/stage_3/tasks.py",
"symbol": "line-too-long",
"message": "Line too long (180/100)",
"message-id": "C0301"
},
{
"type": "convention",
"module": "src.processing_pipeline.stage_3.tasks",
"obj": "",
"line": 108,
"column": 0,
"endLine": null,
"endColumn": null,
"path": "src/processing_pipeline/stage_3/tasks.py",
"symbol": "line-too-long",
"message": "Line too long (119/100)",
"message-id": "C0301"
},
{
"type": "convention",

... [truncated 10800 characters] ...

"module": "src.processing_pipeline.stage_3.tasks",
"obj": "process_snippet",
"line": 219,
"column": 11,
"endLine": 219,
"endColumn": 20,
"path": "src/processing_pipeline/stage_3/tasks.py",
"symbol": "broad-exception-caught",
"message": "Catching too general exception Exception",
"message-id": "W0718"
},
{
"type": "error",
"module": "src.processing_pipeline.stage_3.tasks",
"obj": "process_snippet",
"line": 221,
"column": 82,
"endLine": 221,
"endColumn": 94,
"path": "src/processing_pipeline/stage_3/tasks.py",
"symbol": "no-member",
"message": "Instance of 'Exception' has no 'exceptions' member",
"message-id": "E1101"
}
]

src/processing_pipeline/stage_3/web_tools.py

************* Module .pylintrc
.pylintrc:1:0: F0011: error while parsing the configuration: File contains no section headers.
file: '.pylintrc', line: 1
'disable=C0116\n' (config-parse-error)
[
{
"type": "convention",
"module": "src.processing_pipeline.stage_3.web_tools",
"obj": "",
"line": 51,
"column": 0,
"endLine": null,
"endColumn": null,
"path": "src/processing_pipeline/stage_3/web_tools.py",
"symbol": "line-too-long",
"message": "Line too long (120/100)",
"message-id": "C0301"
},
{
"type": "convention",
"module": "src.processing_pipeline.stage_3.web_tools",
"obj": "",
"line": 1,
"column": 0,
"endLine": null,
"endColumn": null,
"path": "src/processing_pipeline/stage_3/web_tools.py",
"symbol": "missing-module-docstring",
"message": "Missing module docstring",
"message-id": "C0114"
},
{
"type": "error",
"module": "src.processing_pipeline.stage_3.web_tools",
"obj": "",
"line": 6,
"column": 0,
"endLine": 6,
"endColumn": 16,
"path": "src/processing_pipeline/stage_3/web_tools.py",
"symbol": "import-error",
"message": "Unable to import 'html2text'",
"message-id": "E0401"
}
]



@gemini-code-assist (Contributor) commented

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical out-of-memory issue in the Stage 3 processing machine by overhauling how it interacts with the Gemini API and external web resources. The primary change involves migrating the entire Stage 3 pipeline to an asynchronous architecture, utilizing asyncio and aiohttp for more efficient resource handling. This refactoring also modernizes the Gemini integration by switching from CLI-based methods to the GenAI SDK's automatic function calling, enabling robust web search and URL content reading capabilities for enhanced fact-checking.

Highlights

  • Asynchronous Processing Migration: Migrated Stage 3 Gemini interactions from synchronous calls to asynchronous operations using asyncio and aiohttp to improve resource management and prevent out-of-memory issues.
  • Gemini SDK Integration with Automatic Function Calling: Replaced the Gemini CLI and Google Search grounding with direct GenAI SDK integration, leveraging automatic function calling for new web tools.
  • New Web Tools for Fact-Checking: Introduced searxng_web_search and web_url_read as asynchronous tools for dynamic fact-checking and content extraction from URLs.
  • Dependency Updates: Added aiohttp and html2text to requirements.txt to support the new asynchronous web tools and HTML parsing capabilities.
  • Centralized Model Configuration: Centralized Gemini model selection (MAIN_MODEL, FALLBACK_MODEL) in a new constants.py file for Stage 3, improving maintainability.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist Bot left a comment
Code Review

This pull request is a significant and well-executed refactoring of the Stage 3 processing pipeline to address out-of-memory issues. By moving from a synchronous, subprocess-based approach to a fully asynchronous implementation using asyncio and the Google GenAI async SDK, the code is now more efficient, scalable, and robust. The introduction of web_tools.py for handling web requests is a clean separation of concerns. The overall changes are excellent. I have one high-severity suggestion regarding exception handling to make the service more manageable by avoiding a broad BaseException catch.

Comment thread src/processing_pipeline/stage_3/tasks.py Outdated

@coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/processing_pipeline/stage_3/executors.py`:
- Around line 60-64: The processing wait loop using
uploaded_audio_file.state.name == "PROCESSING" can hang indefinitely; modify the
logic around that loop (the block updating uploaded_audio_file and calling
gemini_client.files.get) to enforce a maximum wait: introduce a timeout
parameter (e.g., max_wait_seconds or deadline) and break/raise once elapsed,
checking elapsed time each iteration (or use asyncio.wait_for around a helper
coroutine) and surface a clear error or change state if the timeout is reached;
update references in the same function where uploaded_audio_file and
gemini_client.files.get are used so the loop exits deterministically on timeout.
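The bounded wait loop suggested in that comment could be sketched like this. It is a sketch of the reviewer's suggestion, not the PR's code; the client and file objects are assumed to follow the google-genai Files API shape (`state.name`, `files.get`):

```python
import asyncio
import time

async def wait_for_active(gemini_client, uploaded_audio_file,
                          max_wait_seconds: float = 300.0,
                          poll_interval: float = 2.0):
    """Poll the Files API until the upload leaves PROCESSING, with a deadline."""
    deadline = time.monotonic() + max_wait_seconds
    while uploaded_audio_file.state.name == "PROCESSING":
        if time.monotonic() > deadline:
            # Surface a clear error instead of hanging the flow indefinitely
            raise TimeoutError(
                f"{uploaded_audio_file.name} still PROCESSING after {max_wait_seconds}s"
            )
        await asyncio.sleep(poll_interval)  # avoid a hot polling loop
        uploaded_audio_file = gemini_client.files.get(name=uploaded_audio_file.name)
    return uploaded_audio_file
```

Wrapping the loop in `asyncio.wait_for` would achieve the same effect; the explicit deadline makes the per-iteration elapsed-time check visible.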

In `@src/processing_pipeline/stage_3/web_tools.py`:
- Around line 48-51: searxng_web_search lacks a timeout and can hang; mirror
web_url_read by adding a 10-second aiohttp timeout. Update the call that opens
the client/session (in searxng_web_search) to pass
timeout=aiohttp.ClientTimeout(total=10) either on ClientSession(...) or on
session.get(...), so response.raise_for_status() and await response.json() are
bounded by 10s; keep existing SSL connector usage
(aiohttp.TCPConnector(ssl=_ssl_context)) when adding the timeout.
- Line 8: SEARXNG_URL currently defaults to an empty string which leads to
requests to "/search" and confusing failures; update the module to validate the
configuration and fail fast: change the default to None (or keep env lookup) and
either (a) raise a clear ValueError during import if SEARXNG_URL is falsy, or
(b) add a guard at the start of searxng_web_search(...) that checks "if not
SEARXNG_URL: raise ValueError('SEARXNG_URL environment variable is not set')" so
callers get an explicit error instead of silent broken requests; reference
SEARXNG_URL and searxng_web_search to locate where to add the validation.
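Both web_tools suggestions (the 10-second timeout and the fail-fast SEARXNG_URL guard) could be combined as below. This is a sketch of the reviewer's proposals; the query parameters and result normalization are assumptions:

```python
import asyncio
import os

import aiohttp

SEARXNG_URL = os.environ.get("SEARXNG_URL")  # no empty-string default

async def searxng_web_search(query: str) -> list:
    """Query a SearXNG instance's JSON API, failing fast on bad config."""
    if not SEARXNG_URL:
        raise ValueError("SEARXNG_URL environment variable is not set")
    timeout = aiohttp.ClientTimeout(total=10)  # bound the whole request at 10s
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(
            f"{SEARXNG_URL}/search", params={"q": query, "format": "json"}
        ) as response:
            response.raise_for_status()
            payload = await response.json()
    # Normalize to the minimal fields downstream tools need
    return [
        {"title": r.get("title", ""), "url": r.get("url", "")}
        for r in payload.get("results", [])
    ]
```

With the guard up front, a missing configuration surfaces as an explicit ValueError rather than a confusing request to "/search".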

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b2615480-abce-46c6-a62a-a782f94a19df

📥 Commits

Reviewing files that changed from the base of the PR and between 2d51fb3 and d4efd8d.

📒 Files selected for processing (6)
  • requirements.txt
  • src/processing_pipeline/stage_3/constants.py
  • src/processing_pipeline/stage_3/executors.py
  • src/processing_pipeline/stage_3/flows.py
  • src/processing_pipeline/stage_3/tasks.py
  • src/processing_pipeline/stage_3/web_tools.py

Comment thread src/processing_pipeline/stage_3/web_tools.py
Comment thread src/processing_pipeline/stage_3/web_tools.py Outdated
- Narrow BaseException to Exception to avoid catching SystemExit/KeyboardInterrupt
- Validate SEARXNG_URL is set
- Add 10s HTTP timeout to searxng_web_search (matching web_url_read)
@PublicDataWorks PublicDataWorks deleted a comment from coderabbitai Bot Mar 20, 2026

@coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/processing_pipeline/stage_3/tasks.py`:
- Around line 220-223: Add a safe import for ExceptionGroup and use the backport
on older interpreters: in the modules that reference ExceptionGroup (the
exception-handling block building error_message), wrap the import with a
try/except to import builtins.ExceptionGroup and fall back to the exceptiongroup
backport if ImportError; alternatively, if the project requires Python 3.11+,
ensure pyproject.toml specifies that and add the exceptiongroup backport to
dev/test requirements so imports succeed during development.
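The safe-import pattern described in that comment is small enough to show in full. This is a sketch of the reviewer's suggestion; `exceptiongroup` is the PyPI backport it names, and `describe` is a hypothetical helper standing in for the PR's error-message builder:

```python
# Python 3.11+ ships ExceptionGroup as a builtin; older interpreters need
# the `exceptiongroup` backport, imported only when the builtin is absent.
try:
    _ = ExceptionGroup  # probe the builtin
except NameError:
    from exceptiongroup import ExceptionGroup  # type: ignore

def describe(exc: BaseException) -> str:
    """Build a detailed message, flattening exception groups."""
    if isinstance(exc, ExceptionGroup):
        return "; ".join(f"{type(e).__name__}: {e}" for e in exc.exceptions)
    return f"{type(exc).__name__}: {exc}"
```

On 3.11+ the `try` succeeds and the backport is never imported, so pinning the backport only in dev/test requirements (the comment's alternative) also works.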

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 44039503-e8fd-47a7-af97-168b200768e1

📥 Commits

Reviewing files that changed from the base of the PR and between d4efd8d and 761ee60.

📒 Files selected for processing (2)
  • src/processing_pipeline/stage_3/tasks.py
  • src/processing_pipeline/stage_3/web_tools.py
✅ Files skipped from review due to trivial changes (1)
  • src/processing_pipeline/stage_3/web_tools.py

@PublicDataWorks PublicDataWorks deleted a comment from coderabbitai Bot Mar 20, 2026
@quancao-ea quancao-ea merged commit a02f1fd into main Mar 20, 2026
2 checks passed
