
feat: integrate file exporter to monorepo as a celery worker #57

Merged
aturret merged 28 commits into main from celery-update on Feb 20, 2026

Conversation


@aturret aturret commented Feb 20, 2026

Summary by CodeRabbit

  • New Features

    • Background file-export worker (Celery + Redis) for video download, PDF export, and audio transcription; CI now builds and publishes a worker image and tags releases.
  • Bug Fixes / Chores

    • docker-compose and deployment templates updated to include Redis and the worker; workspace/dependency config adjusted.
  • Removed / Breaking Changes

    • Removed multiple scraping integrations and related APIs, and Telegram bot/webhook endpoints; many configuration/auth helpers also removed.
  • Documentation

    • Expanded architecture and developer guidance.
  • Tests

    • Added tests for PDF conversion, transcription utilities, and video-download helpers.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.
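
The auto-pause threshold mentioned above is controlled from the repository's CodeRabbit config file; a minimal sketch of the dotted setting path (the commit count shown is illustrative, not this project's actual value):

```yaml
# .coderabbit.yaml
reviews:
  auto_review:
    auto_pause_after_reviewed_commits: 40
```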

📝 Walkthrough

Removes the legacy monolith under app/ (routers, scrapers, models, templates, telegram/telegraph, in‑app file‑export and S3). Adds a Celery-backed worker, a shared file‑export package, Celery tasks, CI/Docker/workspace updates, Redis integration, and tests for file‑export (PDF/video/transcribe).

Changes

  • CI & Repo Metadata (/.github/workflows/ci.yml, pyproject.toml, .gitignore, CLAUDE.md, template.env): enable tag triggers, add the worker to the build matrix, add Celery dependencies and workspace sources, ignore apps/worker/conf/, and add CELERY env vars.
  • App Core Removal (app/config.py, app/main.py, app/auth.py, app/database.py): delete app-level configuration, the entrypoint/lifespan, API-key/Telegram auth helpers, and DB startup/shutdown/save utilities.
  • API Routers & Endpoints (app/routers/*, app/routers/scraper_routers.py): remove all FastAPI routers/endpoints and webhook handlers (feed_push, scraper, inoreader, telegram, twitter, wechat, etc.).
  • Models, Templates & Re-exports (app/models/*, app/templates/*, app/utils/*): remove many data models, enums, Jinja2 templates, and local re-export wrappers (public API surface reduced).
  • Scrapers & Scraper Infra (app/services/scrapers/*, app/services/scrapers/general/*): remove scraper implementations, base classes, ScraperManager, and related clients (Firecrawl, Zyte, XHS) and utilities.
  • App File-Export & S3 Removal (app/services/file_export/*, app/services/amazon/s3.py): remove in-app video/pdf/transcribe and S3 upload logic (migrated to the new shared package plus Celery tasks).
  • Telegram & Telegraph Removal (app/services/telegram_bot/*, app/services/telegraph/__init__.py): remove Telegram wiring, handlers, the message sender, and Telegraph posting logic.
  • API: Celery Integration & Adapters (apps/api/src/config.py, apps/api/src/services/celery_client.py, apps/api/src/services/file_export/*, apps/api/src/services/scrapers/common.py): add the Celery client and config; replace direct HTTP exporter calls with Celery task submissions; update PdfExport usage and timeouts.
  • Worker Service (apps/worker/Dockerfile, apps/worker/pyproject.toml, apps/worker/worker_core/*): introduce the worker microservice: Celery app, task modules (video/pdf/transcribe), config, Dockerfile, and packaging.
  • Shared File-Export Package (packages/file-export/*): add a shared package implementing video_download, pdf_export (WeasyPrint), and transcribe (OpenAI/Whisper plus postprocessing), used by worker tasks.
  • Docker Compose / Deployment (docker-compose.template.yml, template.env): add a Redis service, remove fast-yt-downloader, add the worker service, and wire CELERY env vars and shared volumes.
  • Tests (tests/cases/test_file_export/*): add unit tests for PDF conversion, video orientation/format selection, and audio silence detection.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client
    participant API as API (apps/api)
    participant Broker as Redis (Broker/Backend)
    participant Worker as Worker (Celery)
    participant FileExport as fastfetchbot_file_export

    Client->>API: request export/download/transcribe
    API->>Broker: enqueue Celery task
    Broker->>Worker: deliver task
    Worker->>FileExport: invoke video/pdf/transcribe logic
    FileExport-->>Worker: return result (file path / transcript)
    Worker->>Broker: store task result
    API->>Broker: fetch task result
    Broker-->>API: return result
    API-->>Client: respond with URL/transcript
```
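The enqueue/execute/fetch flow in the diagram can be sketched without a broker, using a stdlib queue and a worker thread to stand in for Redis and Celery. All names below (submit_export, worker_loop) are illustrative, not the project's actual API:

```python
import queue
import threading

task_queue: queue.Queue = queue.Queue()   # stands in for the Redis broker
results: dict[int, str] = {}              # stands in for the Celery result backend

def worker_loop() -> None:
    """Worker side: pull tasks, run the (pretend) export, store results."""
    while True:
        task_id, url = task_queue.get()
        results[task_id] = f"exported:{url}"  # placeholder for video/pdf/transcribe logic
        task_queue.task_done()

threading.Thread(target=worker_loop, daemon=True).start()

def submit_export(task_id: int, url: str) -> str:
    """API side: enqueue the task, then block until the result is available
    (analogous to Celery's result.get(timeout=...))."""
    task_queue.put((task_id, url))
    task_queue.join()
    return results[task_id]
```

In the real system the API process never blocks its event loop on this; the PR wraps the blocking `result.get` in `asyncio.to_thread`.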

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~150 minutes

Possibly related PRs

Poem

🐰 I hopped through tangled files and seams,

Queues now hum and celery dreams,
Workers fetch and PDFs blend,
Redis gardens help tasks ascend,
A carrot‑tidy codebase springs.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 42.86%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title 'feat: integrate file exporter to monorepo as a celery worker' clearly describes the main change: integrating the file exporter into the monorepo as a Celery worker.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 18

🧹 Nitpick comments (12)
.gitignore (1)

261-261: Consider tracking an apps/worker/conf/.gitkeep placeholder for consistency.

The root-level conf/ is kept traceable via conf/* + !conf/.gitkeep, so a fresh git clone always materialises that directory. The new /apps/worker/conf/ rule ignores the directory wholesale — including any .gitkeep — so developers who clone the repo won't have the directory created automatically. No Docker layer, Makefile target, or startup script currently creates it. If the worker process expects apps/worker/conf/ to exist at start-up, it will fail on a fresh clone.

Mirror the root convention:

📁 Proposed fix
 /apps/worker/conf/
+!/apps/worker/conf/.gitkeep

Then add an empty apps/worker/conf/.gitkeep file to the repo so the directory is scaffolded on clone.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore at line 261, The .gitignore entry "/apps/worker/conf/" currently
ignores the entire directory (including any .gitkeep), so a fresh clone won't
scaffold apps/worker/conf; update the .gitignore to mirror the root convention
by ignoring the directory contents but allowing a placeholder (e.g., change the
rule to "apps/worker/conf/*" and add an exception "!apps/worker/conf/.gitkeep"),
then add an empty file named "apps/worker/conf/.gitkeep" to the repo so the
directory is created on clone; ensure you update the same .gitignore entry and
commit the new .gitkeep file.
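The difference between ignoring the whole directory and the contents-plus-.gitkeep convention can be checked with `git check-ignore` in a throwaway repo (paths mirror the PR's apps/worker/conf layout):

```shell
# Demonstrate that `conf/*` + `!conf/.gitkeep` ignores contents but keeps the placeholder.
tmp=$(mktemp -d) && cd "$tmp" && git init -q .
printf 'apps/worker/conf/*\n!apps/worker/conf/.gitkeep\n' > .gitignore
git check-ignore -q apps/worker/conf/settings.toml && echo "settings.toml ignored"
git check-ignore -q apps/worker/conf/.gitkeep || echo ".gitkeep tracked"
```

The re-inclusion works here because only the directory's contents are excluded; gitignore cannot re-include a file whose parent directory is itself excluded, which is why the `/apps/worker/conf/` form breaks `.gitkeep`.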
packages/file-export/fastfetchbot_file_export/pdf_export.py (2)

26-39: Same implicit Optional issue in export_pdf, and consider adding Loguru logging.

The str = None parameters here have the same PEP 484 issue. Additionally, per the project's coding guidelines, Loguru should be used for comprehensive logging — a brief log statement on PDF generation (input type, output path) would aid debugging in the Celery worker context.

Proposed fix
+from loguru import logger
+
+
 def export_pdf(
-    html_string: str = None,
-    html_file: str = None,
+    html_string: str | None = None,
+    html_file: str | None = None,
     output_filename: str = "output.pdf",
     download_dir: str = "/tmp",
 ) -> str:
     """Export HTML to PDF and return the output file path."""
     output_path = os.path.join(download_dir, output_filename)
+    logger.info("Exporting PDF to {}", output_path)
     convert_html_to_pdf(
         output_filename=output_path,
         html_string=html_string,
         html_file=html_file,
     )
+    logger.info("PDF export complete: {}", output_path)
     return output_path

As per coding guidelines: "Use Loguru for comprehensive logging in Python modules."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/file-export/fastfetchbot_file_export/pdf_export.py` around lines 26
- 39, export_pdf currently uses parameters typed as str = None which implicitly
relies on None for a non-Optional type; update the signature to use
typing.Optional[str] for html_string and html_file and import Optional from
typing, and add a Loguru logger call (using from loguru import logger) inside
export_pdf to log the input mode (whether html_string or html_file is used) and
the resolved output_path before calling convert_html_to_pdf; reference the
function name export_pdf and the convert_html_to_pdf call so you place the
logger and type changes in the same function.

9-23: Use explicit str | None type annotations instead of implicit Optional.

PEP 484 prohibits implicit Optional (assigning None default to a str-annotated parameter). Since the project targets Python ≥3.12, use the modern union syntax. Ruff RUF013 also flags this.

Proposed fix
 def convert_html_to_pdf(
     output_filename: str,
-    html_string: str = None,
-    html_file: str = None,
+    html_string: str | None = None,
+    html_file: str | None = None,
 ) -> None:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/file-export/fastfetchbot_file_export/pdf_export.py` around lines 9 -
23, The function convert_html_to_pdf currently uses implicit Optional by giving
None defaults without modern union annotations; update the signature to use
explicit union types (PEP 604) for the optional params so html_string and
html_file are annotated as str | None (e.g.,
convert_html_to_pdf(output_filename: str, html_string: str | None = None,
html_file: str | None = None) -> None), leaving the body unchanged and no extra
imports required for Python ≥3.12; keep the function name and parameter names
exactly as-is to locate the change.
apps/worker/Dockerfile (1)

25-37: System packages are installed in both stages — consider cleaning apt cache in production.

The production stage installs runtime system packages but doesn't clean up the apt cache, increasing image size. Adding && rm -rf /var/lib/apt/lists/* at the end of the RUN command would reduce the final image size.

Also applies to: 52-63

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/Dockerfile` around lines 25 - 37, The Dockerfile RUN that
installs system packages should remove apt lists to reduce image size: update
the RUN commands that install packages (the multi-line RUN installing ffmpeg,
libpango, libjpeg, libopenjp2, libffi, build-essential, fonts-*) and the similar
RUN in the second stage (lines referenced in the diff) to append a cleanup step
that deletes /var/lib/apt/lists/* after apt-get install (and ensure apt-get
update && apt-get install ... && rm -rf /var/lib/apt/lists/* are in the same RUN
so caches aren’t preserved).
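A sketch of the combined install-and-clean RUN (package list abbreviated; the real list lives in apps/worker/Dockerfile):

```dockerfile
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        ffmpeg \
    && rm -rf /var/lib/apt/lists/*
```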
packages/file-export/pyproject.toml (1)

5-11: Dependencies lack upper-bound version constraints, inconsistent with the rest of the monorepo.

The root pyproject.toml pins all dependencies with upper bounds (e.g., "loguru>=0.7.2,<0.8.0", "openai>=2.15.0,<3.0.0", "pydub>=0.25.1,<0.26.0"), but this package uses only lower bounds. This can cause resolution conflicts within the uv workspace or allow unintended major-version upgrades that break compatibility.

Proposed fix: add upper-bound constraints consistent with the root
 dependencies = [
-    "yt-dlp[default]>=2026.02.04",
-    "weasyprint>=65.1",
-    "pydub>=0.25.1",
-    "openai>=2.15.0",
-    "loguru>=0.7.2",
+    "yt-dlp[default]>=2026.02.04,<2027.0.0",
+    "weasyprint>=65.1,<66.0",
+    "pydub>=0.25.1,<0.26.0",
+    "openai>=2.15.0,<3.0.0",
+    "loguru>=0.7.2,<0.8.0",
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/file-export/pyproject.toml` around lines 5 - 11, Update the
dependency entries in the dependencies list to include the same upper-bound
constraints used by the monorepo root; specifically change the strings for
yt-dlp, weasyprint, pydub, openai, and loguru to include appropriate "<" upper
bounds consistent with the root (e.g., "loguru>=0.7.2,<0.8.0",
"openai>=2.15.0,<3.0.0", "pydub>=0.25.1,<0.26.0") so the package uses bounded
ranges matching the workspace policy.
packages/file-export/fastfetchbot_file_export/transcribe.py (1)

80-80: Hardcoded "m4a" format limits input flexibility.

AudioSegment.from_file(audio_file, "m4a") forces m4a parsing regardless of actual file format. Consider inferring the format from the file extension or omitting the format parameter to let pydub auto-detect.

♻️ Suggested change
-    audio_item = AudioSegment.from_file(audio_file, "m4a")
+    audio_item = AudioSegment.from_file(audio_file)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/file-export/fastfetchbot_file_export/transcribe.py` at line 80, The
call to AudioSegment.from_file(audio_file, "m4a") in transcribe.py hardcodes the
format; change it so the format is inferred (e.g., derive ext =
Path(audio_file).suffix.lstrip('.') and pass that) or simply call
AudioSegment.from_file(audio_file) to let pydub auto-detect, updating the
AudioSegment.from_file invocation that produces audio_item and ensuring any
downstream logic that expects a specific format still works.
packages/file-export/fastfetchbot_file_export/video_download.py (1)

41-54: Implicit Optional type hints.

Parameters extractor, video_format, cookie_file_path, and config default to None but are annotated as str / dict. PEP 484 requires explicit Optional[str] or str | None.

♻️ Proposed fix
+from __future__ import annotations
+
 def init_yt_downloader(
     hd: bool = False,
     audio_only: bool = False,
-    extractor: str = None,
+    extractor: str | None = None,
     no_proxy: bool = False,
     extract_only: bool = False,
-    video_format: str = None,
+    video_format: str | None = None,
     download_dir: str = "/tmp",
-    cookie_file_path: str = None,
+    cookie_file_path: str | None = None,
     proxy_mode: bool = False,
     proxy_url: str = "",
     youtube_cookie: bool = False,
     bilibili_cookie: bool = False,
 ) -> YoutubeDL:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/file-export/fastfetchbot_file_export/video_download.py` around lines
41 - 54, The function init_yt_downloader has parameters extractor, video_format,
cookie_file_path (and any config param) defaulting to None but typed as
str/dict; update their annotations to explicit optional types (e.g.,
Optional[str] or str | None, and Optional[dict] or dict | None) to satisfy PEP
484/PEP 604 and avoid implicit Optional usage — modify the function signature of
init_yt_downloader and any related type hints where config is declared to use
the explicit Optional form.
tests/cases/test_file_export/test_video_download.py (1)

1-30: Tests are correct and aligned with the implementation.

All assertions have been verified against the get_video_orientation and get_format_for_orientation implementations. The bilibili_non_hd test correctly checks for "480" in the returned format string.

Optionally, consider adding a test for the ValueError raised when an unknown extractor is passed to get_format_for_orientation, since that code path exists in the implementation:

def test_get_format_unknown_extractor():
    with pytest.raises(ValueError, match="no available extractor found"):
        get_format_for_orientation("unknown", "horizontal", hd=False)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/cases/test_file_export/test_video_download.py` around lines 1 - 30, Add
a test that verifies get_format_for_orientation raises a ValueError for unknown
extractors: create a new test function (e.g., test_get_format_unknown_extractor)
that uses pytest.raises to call get_format_for_orientation("unknown",
"horizontal", hd=False) and asserts the raised error message contains "no
available extractor found"; reference get_format_for_orientation and the test
name so reviewers can locate the correct place to add it.
apps/worker/worker_core/config.py (2)

24-24: Empty default for OPENAI_API_KEY will cause silent failures.

If OPENAI_API_KEY is not set, the worker will attempt transcription with an empty string, leading to cryptic API errors from OpenAI instead of a clear startup-time failure. Consider either validating at import time or logging a warning.

Proposed improvement
 OPENAI_API_KEY = env.get("OPENAI_API_KEY", "")
+if not OPENAI_API_KEY:
+    import logging
+    logging.getLogger(__name__).warning("OPENAI_API_KEY is not set; transcription tasks will fail")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/worker_core/config.py` at line 24, The config currently sets
OPENAI_API_KEY = env.get("OPENAI_API_KEY", "") which allows an empty string and
leads to silent runtime failures; change the code in config.py to fetch the key
without a permissive empty default and perform a fail-fast validation (e.g., set
OPENAI_API_KEY = env.get("OPENAI_API_KEY") or None, then if not OPENAI_API_KEY
either raise a RuntimeError with a clear message or log an error via the module
logger and exit), so any startup without a valid OpenAI key fails immediately
and clearly instead of producing cryptic API errors.
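The fail-fast variant can be captured in a small helper (require_env is a hypothetical name, shown here as one way to implement the suggestion):

```python
import os

def require_env(name: str) -> str:
    """Fail fast at startup instead of letting an empty key
    surface later as a cryptic OpenAI auth error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; refusing to start the worker")
    return value
```

Config would then use `OPENAI_API_KEY = require_env("OPENAI_API_KEY")` so a misconfigured worker dies at import time with a clear message.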

10-11: Default broker/backend URLs point to localhost.

These defaults are fine for local development, but in a containerized deployment (Docker Compose), the worker will fail to connect if the env vars aren't set because localhost won't resolve to the Redis container. Consider documenting this or using a Docker-friendly default like redis://redis:6379/0 if the primary deployment target is Docker.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/worker_core/config.py` around lines 10 - 11, The
CELERY_BROKER_URL and CELERY_RESULT_BACKEND defaults currently point to
localhost which breaks containerized deployments; update the defaults in config
by changing CELERY_BROKER_URL and CELERY_RESULT_BACKEND to Docker-friendly hosts
(e.g., redis://redis:6379/0 and redis://redis:6379/1) or add a clear
comment/docstring near the CELERY_BROKER_URL and CELERY_RESULT_BACKEND
definitions instructing operators to set environment variables in Docker Compose
(or update README/compose docs) so the worker uses the Redis service hostname
instead of localhost.
apps/api/src/services/file_export/document_export/pdf_export.py (1)

36-36: DOWNLOAD_VIDEO_TIMEOUT is a misleading name for PDF export timeout.

This timeout constant is reused across video download, audio transcription, and PDF export. Consider renaming it to something generic like CELERY_TASK_TIMEOUT or introducing per-task timeouts, since PDF generation likely has different timing characteristics than video downloads.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/services/file_export/document_export/pdf_export.py` at line 36,
The constant DOWNLOAD_VIDEO_TIMEOUT is misleading when used in pdf_export.py for
PDF export; define and use a more appropriate timeout (either a generic
CELERY_TASK_TIMEOUT or a specific PDF_EXPORT_TIMEOUT) and replace the reference
to DOWNLOAD_VIDEO_TIMEOUT in the PDF export code (the async call that awaits
result.get in pdf_export.py) so the timeout name matches the task purpose;
update any relevant imports/usages and tests to reference the new constant and
remove or stop reusing DOWNLOAD_VIDEO_TIMEOUT for non-video tasks.
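One way to get per-task timeouts without multiplying constants is an env-backed lookup; the function name and `FILE_EXPORT_<TASK>_TIMEOUT` variable scheme below are assumptions, not the project's existing config:

```python
import os

def task_timeout(task_name: str, default_seconds: int) -> int:
    """Per-task timeout: FILE_EXPORT_<TASK>_TIMEOUT overrides the generic default."""
    env_key = f"FILE_EXPORT_{task_name.upper()}_TIMEOUT"
    return int(os.environ.get(env_key, default_seconds))
```

The PDF path could then call `result.get(timeout=task_timeout("pdf", 600))` while video keeps its own, longer default.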
apps/worker/worker_core/tasks/video.py (1)

1-9: No task-level logging.

The task body has no logging. The coding guidelines require Loguru for comprehensive logging. Adding a log line on entry (with the URL and key params) would aid debugging in the worker.

Proposed fix
 from worker_core.main import app
 from worker_core.config import (
     DOWNLOAD_DIR,
     COOKIE_FILE_PATH,
     PROXY_MODE,
     PROXY_URL,
     YOUTUBE_COOKIE,
     BILIBILI_COOKIE,
 )
 from fastfetchbot_file_export.video_download import download_video
+from fastfetchbot_shared.utils.logger import logger


 @app.task(name="file_export.video_download")
 def video_download_task(
     url: str,
     download: bool = True,
     hd: bool = False,
     extractor: str = "youtube",
     audio_only: bool = False,
 ) -> dict:
+    logger.info(f"video_download_task started: url={url}, extractor={extractor}, hd={hd}, audio_only={audio_only}")
     config = {

As per coding guidelines, "Use Loguru for comprehensive logging in Python modules."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/worker_core/tasks/video.py` around lines 1 - 9, Add task-level
logging using Loguru: import logger with "from loguru import logger" in this
module and, at the start of the task function(s) that perform video
processing/download in video.py, add a logger.info entry that records the
incoming URL and key parameters (e.g., DOWNLOAD_DIR, PROXY_MODE, PROXY_URL,
COOKIE_FILE_PATH) and indicates which cookie type is used (YOUTUBE_COOKIE or
BILIBILI_COOKIE) but do not log raw cookie values—mask or indicate presence
only. Ensure the log message is concise and executed immediately on task entry
so it appears in worker logs for debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/api/src/services/file_export/audio_transcribe/__init__.py`:
- Around line 22-23: The call that awaits the blocking result.get (the line
using asyncio.to_thread(result.get, timeout=int(DOWNLOAD_VIDEO_TIMEOUT))) must
be wrapped in a try/except that catches celery.exceptions.TimeoutError,
backend/connection exceptions and any generic Exception from the worker; log the
failure with the caught exception and context (e.g., which task/result and
DOWNLOAD_VIDEO_TIMEOUT) and then raise a domain-appropriate error (e.g.,
AudioTranscriptionError or FileExportError) instead of letting the original
exception propagate; apply the same fix to the identical patterns in
video_download/__init__.py and pdf_export.py (the lines that call
asyncio.to_thread(result.get, ... ) and then access response["transcript"] or
similar keys).
- Around line 17-23: The API is leaking OPENAI_API_KEY via Celery kwargs: stop
including OPENAI_API_KEY in the task payload in the send_task call in
apps/api/src/services/file_export/audio_transcribe/__init__.py (remove
"openai_api_key" from kwargs and any callers expecting it) and update the worker
task implementation (file_export.transcribe) to import OPENAI_API_KEY from
apps/worker/worker_core/config.py and read it from the worker environment
instead of from task kwargs; also adjust the task signature (and any tests) so
it only requires audio_file and does not depend on a passed API key.

In `@apps/api/src/services/file_export/document_export/__init__.py`:
- Around line 11-13: DocumentExport.export is synchronous and calling
pdf_export.PdfExport(...) returns a coroutine and is given wrong positional
args; change export to async def export(self) -> str, construct PdfExport with
proper keyword args (e.g., pdf_export.PdfExport(title=self.document.get("title",
""), html_string=self.document.get("content")) ) and await its .export() (return
await pdf_export.PdfExport(...).export()), ensuring you reference
pdf_export.PdfExport.__init__ and PdfExport.export correctly; alternatively
remove the DocumentExport class if you prefer not to maintain dead code.

In `@apps/api/src/services/file_export/document_export/pdf_export.py`:
- Around line 27-37: The PdfExport.export method is async but is being called
from a synchronous caller, causing a coroutine to be returned; update the caller
(the export function that invokes
pdf_export.PdfExport(self.document["content"]).export()) to be async and await
PdfExport.export, i.e. change the caller signature to async def export(self) and
replace the direct call with await pdf_export.PdfExport(...).export(), ensuring
any upstream call sites either await this new async export or are adapted to
async as well; keep the PdfExport.export implementation as-is.

In `@apps/worker/Dockerfile`:
- Around line 49-68: The production Docker stage (FROM python-base AS
production) currently runs as root; create and switch to a non-root user there:
add a non-root group/user (e.g., app or worker with a fixed UID/GID), chown the
copied directories ($PYSETUP_PATH, /app/packages, /app/apps/worker) and any
runtime directories to that user, and add a USER directive before CMD so the
Celery process (CMD ["celery", "-A", "worker_core.main:app", "worker",
"--loglevel=info", "--concurrency=2"]) runs unprivileged; ensure file ownership
and PYTHONPATH remain correct for the new user.

In `@apps/worker/worker_core/tasks/pdf.py`:
- Around line 6-13: The task pdf_export_task is missing Loguru logs; add Loguru
logging at the start and after a successful export so failures are easier to
debug: import logger from loguru, log an info/debug at the start of
pdf_export_task including the output_filename and any key params (html_string
length or DOWNLOAD_DIR), call export_pdf as-is, then log an info on successful
completion with output_path (output_filename) and any timing/size details; also
log an error/exception if export_pdf raises so failures are recorded. Ensure you
reference pdf_export_task, export_pdf, and DOWNLOAD_DIR when adding the logs.

In `@apps/worker/worker_core/tasks/transcribe.py`:
- Around line 7-10: transcribe_task currently lacks Loguru logging and silently
passes a None API key to get_audio_text; add a Loguru logger and log task entry
(with audio_file), successful exit (including transcript length or summary), and
exceptions; before calling get_audio_text validate api_key (from openai_api_key
or OPENAI_API_KEY) and if missing log an explicit error and raise/return a clear
failure (e.g., raise ValueError or return {"error": "...", "message": "failed"})
so OpenAI(auth) errors are not cryptic; update references in transcribe_task and
ensure any caught exceptions from get_audio_text are logged via logger.exception
for debugging.

In `@CLAUDE.md`:
- Around line 158-170: Replace the prompt-like Migration plan section with
concrete documentation: remove imperative language ("Please analyze...", "Please
investigate...") and fix grammar (e.g., "make the think" → reword, "it should
has its own" → "it should have its own"). Describe the implemented architecture
for FastFileExporter: that its code has been moved to packages/shared
(FastFileExporter shared package), that a new Celery worker app runs under
apps/worker with its own Dockerfile and pyproject.toml, that API and Telegram
components enqueue tasks to Redis (message broker) for the worker, and that the
worker executes synchronous tasks (yt-dlp, WeasyPrint) and returns results
asynchronously; include brief notes on startup & deployment expectations (how to
run worker, broker, and API) and list key components/paths (packages/shared,
apps/worker, Dockerfile, pyproject.toml, FastFileExporter, yt-dlp, WeasyPrint,
Redis, Celery).
- Around line 19-23: Add the new Celery worker service to the architecture docs:
update the top-level architecture tree and the services table in CLAUDE.md to
include the Worker service for apps/worker/ (reference the service’s Dockerfile
and CI entry added in the PR), listing its package name (e.g.,
fastfetchbot-worker), port (— if none), and entry point (use the actual Celery
start command from the worker Dockerfile/CI, e.g., the celery invocation used in
apps/worker/). Ensure the entry mirrors the naming/format used for API Server
and Telegram Bot rows so readers can locate apps/worker/ in the repo easily.

In `@docker-compose.template.yml`:
- Around line 57-64: The docker-compose template exposes the Redis service
(service name "redis", container_name "fastfetchbot-redis") to the host via the
ports mapping "- 6379:6379" without authentication; either remove the ports
mapping so Redis is only reachable on Docker's internal network, or enable Redis
authentication/ACLs (e.g., configure --requirepass or a redis.conf with ACL
rules) and update any redis:// connection strings in the app config to include
credentials; modify the "redis" service block accordingly and ensure any
references to the Redis URL (used by the app) are updated to use the new
credentialed URL or internal hostname.

In `@packages/file-export/fastfetchbot_file_export/transcribe.py`:
- Around line 10-21: The system prompt constant PUNCTUATION_SYSTEM_PROMPT
contains a typo "capialization" that should be "capitalization"; update the
string inside PUNCTUATION_SYSTEM_PROMPT (the multi-line prompt around the rule
list) to correct that spelling and ensure surrounding punctuation and spacing
remain unchanged so the prompt semantics and formatting are preserved.
- Around line 85-106: The loop that builds segments is using inverted limits and
mixed units; update the enumerate(range(...)) call to iterate over milliseconds
of the audio: range(0, audio_length * 1000, SEGMENT_LENGTH * 1000). Compute
end_time in milliseconds as end_time = i + SEGMENT_LENGTH * 1000 (not (i +
SEGMENT_LENGTH) * 1000). Fix the slice conditional so that if end_time >=
audio_length * 1000 you take the unbounded tail audio_item[start_time:] else
take the bounded slice audio_item[start_time:end_time]; keep existing file
naming (audio_file_non_ext, audio_file_ext, segment_path) and
export/transcribe/remove logic unchanged.
- Around line 45-68: Both punctuation_assistant and summary_assistant currently
reference the retired model "gpt-3.5-turbo-16k"; update the model parameter in
both functions (punctuation_assistant and summary_assistant) to a supported
model such as "gpt-3.5-turbo" (16k context is now default) or "gpt-4o-mini" so
API calls succeed at runtime, ensuring you replace the model string passed to
client.chat.completions.create in each function.

In `@packages/file-export/fastfetchbot_file_export/video_download.py`:
- Around line 8-18: The get_video_orientation function misclassifies when
"formats" is missing or when the first format is audio-only; update
get_video_orientation to treat missing formats as unknown/default horizontal
instead of vertical, and inspect the formats list to find the first video format
(e.g., where a format dict has an "aspect_ratio" or both "width" and "height"
and not an "acodec" only) rather than blindly using formats[0]; compute
orientation from a valid format's aspect_ratio (or width/height) and if no valid
video format is found, return "horizontal" as a safe default.
- Around line 77-101: Replace the duplicated format-selection block in
init_yt_downloader with a call to get_format_for_orientation(extractor, hd,
bilibili_cookie) (preserving the extractor None check) so video_format is
delegated to that helper; remove the duplicated elif/else branches for "youtube"
and "bilibili" and use the returned format for ydl_opts["format"]. Also scope
the referer so you only add "referer": "https://www.bilibili.com/" to ydl_opts
when extractor == "bilibili" (leave it out for YouTube and others); keep using
base_opts, download_dir, and outtmpl as before.
- Around line 208-210: Replace the raw traceback print with Loguru logging: in
the except Exception as e block in video_download.py, remove
traceback.print_exc() and call logger.exception("Error downloading video") (or a
context-appropriate message) to log the exception and stack trace via Loguru,
then re-raise (raise) as before; ensure you import or reference the module-level
Loguru logger used elsewhere in this module.
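The orientation fix above can be sketched as follows. The info-dict keys (`formats`, `aspect_ratio`, `width`, `height`, `vcodec`) follow yt-dlp's conventions; the scan-and-default logic is the reviewer's suggestion, not the module's current behavior.

```python
def get_video_orientation(content_info: dict) -> str:
    """Classify from the first real video format; default to horizontal."""
    formats = content_info.get("formats") or []
    for fmt in formats:
        if fmt.get("vcodec") == "none":
            continue  # audio-only entry, skip it
        ratio = fmt.get("aspect_ratio")
        if ratio is None and fmt.get("width") and fmt.get("height"):
            ratio = fmt["width"] / fmt["height"]
        if ratio is not None:
            return "vertical" if ratio < 1 else "horizontal"
    # Missing or unusable metadata: safe default rather than "vertical".
    return "horizontal"
```

With an audio-only first entry followed by a 1080x1920 video stream this returns `"vertical"`, and an empty info dict falls through to `"horizontal"`.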

In `@tests/cases/test_file_export/test_pdf_export.py`:
- Around line 6-10: The import block that tries to import convert_html_to_pdf
currently only catches OSError and thus misses a missing package
(ModuleNotFoundError/ImportError) causing tests to fail; update the except to
catch ImportError as well (e.g., except (OSError, ImportError)) so that
HAS_WEASYPRINT is set to False when fastfetchbot_file_export or its dependencies
are absent, keeping the import attempt around convert_html_to_pdf and the
HAS_WEASYPRINT flag intact.
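The widened import guard would look like this (module and function names mirror the ones referenced in the comment; `ModuleNotFoundError` is a subclass of `ImportError`, so catching `ImportError` covers both):

```python
try:
    from fastfetchbot_file_export.pdf_export import convert_html_to_pdf  # noqa: F401
    HAS_WEASYPRINT = True
except (OSError, ImportError):
    # OSError: weasyprint's native libraries are missing;
    # ImportError: the package itself is not installed.
    HAS_WEASYPRINT = False
```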

---

Duplicate comments:
In `@apps/api/src/services/file_export/document_export/pdf_export.py`:
- Around line 40-43: The code assumes the worker-produced PDF path in
output_filename is accessible from the API host and tries to call
upload_file_to_s3(...) and aiofiles.os.remove(...) using that local path, which
will fail when API and worker do not share storage; instead, move the S3 upload
and local-file removal to the worker process that actually creates the PDF (so
the worker calls upload_file_to_s3(...) and deletes its own local file), and
change the API-facing contract of the function that returns output_filename to
expect an S3 key/URL (or have upload_file_to_s3 accept file bytes streamed from
the worker). Update usages of AWS_STORAGE_ON, upload_file_to_s3,
output_filename, local_filename and aiofiles.os.remove accordingly so the API
never tries to access or delete a worker-local filesystem path.

---

Nitpick comments:
In @.gitignore:
- Line 261: The .gitignore entry "/apps/worker/conf/" currently ignores the
entire directory (including any .gitkeep), so a fresh clone won't scaffold
apps/worker/conf; update the .gitignore to mirror the root convention by
ignoring the directory contents but allowing a placeholder (e.g., change the
rule to "apps/worker/conf/*" and add an exception "!apps/worker/conf/.gitkeep"),
then add an empty file named "apps/worker/conf/.gitkeep" to the repo so the
directory is created on clone; ensure you update the same .gitignore entry and
commit the new .gitkeep file.

In `@apps/api/src/services/file_export/document_export/pdf_export.py`:
- Line 36: The constant DOWNLOAD_VIDEO_TIMEOUT is misleading when used in
pdf_export.py for PDF export; define and use a more appropriate timeout (either
a generic CELERY_TASK_TIMEOUT or a specific PDF_EXPORT_TIMEOUT) and replace the
reference to DOWNLOAD_VIDEO_TIMEOUT in the PDF export code (the async call that
awaits result.get in pdf_export.py) so the timeout name matches the task
purpose; update any relevant imports/usages and tests to reference the new
constant and remove or stop reusing DOWNLOAD_VIDEO_TIMEOUT for non-video tasks.

In `@apps/worker/Dockerfile`:
- Around line 25-37: The Dockerfile RUN that installs system packages should
remove apt lists to reduce image size: update the RUN commands that install
packages (the multi-line RUN installing ffmpeg, libpango, libjpeg, libopenjp2,
libffi, build-essential, fonts-*) and the similar RUN in the second stage (lines
referenced in the diff) to append a cleanup step that deletes
/var/lib/apt/lists/* after apt-get install (and ensure apt-get update && apt-get
install ... && rm -rf /var/lib/apt/lists/* are in the same RUN so caches aren’t
preserved).
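A minimal sketch of the suggested same-RUN cleanup (the package list is abbreviated; the real Dockerfile installs more fonts and libraries):

```dockerfile
# Install and clean up in one RUN so the apt cache never lands in a layer.
RUN apt-get update && apt-get install -y --no-install-recommends \
        ffmpeg \
        build-essential \
    && rm -rf /var/lib/apt/lists/*
```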

In `@apps/worker/worker_core/config.py`:
- Line 24: The config currently sets OPENAI_API_KEY = env.get("OPENAI_API_KEY",
"") which allows an empty string and leads to silent runtime failures; change
the code in config.py to fetch the key without a permissive empty default and
perform a fail-fast validation (e.g., set OPENAI_API_KEY =
env.get("OPENAI_API_KEY") or None, then if not OPENAI_API_KEY either raise a
RuntimeError with a clear message or log an error via the module logger and
exit), so any startup without a valid OpenAI key fails immediately and clearly
instead of producing cryptic API errors.
- Around line 10-11: The CELERY_BROKER_URL and CELERY_RESULT_BACKEND defaults
currently point to localhost which breaks containerized deployments; update the
defaults in config by changing CELERY_BROKER_URL and CELERY_RESULT_BACKEND to
Docker-friendly hosts (e.g., redis://redis:6379/0 and redis://redis:6379/1) or
add a clear comment/docstring near the CELERY_BROKER_URL and
CELERY_RESULT_BACKEND definitions instructing operators to set environment
variables in Docker Compose (or update README/compose docs) so the worker uses
the Redis service hostname instead of localhost.
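Both config suggestions above can be combined into one fail-fast loader. The `redis` hostname defaults and the `RuntimeError` behavior follow the review's wording; the function shape is illustrative, not the current `config.py`.

```python
import os


def load_config(env=os.environ) -> dict:
    """Load worker settings, failing fast when OPENAI_API_KEY is missing."""
    key = env.get("OPENAI_API_KEY") or None
    if not key:
        # Fail at startup with a clear message instead of producing
        # cryptic OpenAI API errors later.
        raise RuntimeError("OPENAI_API_KEY is not set; refusing to start the worker")
    return {
        # Docker-friendly defaults: the Redis service hostname, not localhost.
        "CELERY_BROKER_URL": env.get("CELERY_BROKER_URL", "redis://redis:6379/0"),
        "CELERY_RESULT_BACKEND": env.get("CELERY_RESULT_BACKEND", "redis://redis:6379/1"),
        "OPENAI_API_KEY": key,
    }
```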

In `@apps/worker/worker_core/tasks/video.py`:
- Around line 1-9: Add task-level logging using Loguru: import logger with "from
loguru import logger" in this module and, at the start of the task function(s)
that perform video processing/download in video.py, add a logger.info entry that
records the incoming URL and key parameters (e.g., DOWNLOAD_DIR, PROXY_MODE,
PROXY_URL, COOKIE_FILE_PATH) and indicates which cookie type is used
(YOUTUBE_COOKIE or BILIBILI_COOKIE) but do not log raw cookie values—mask or
indicate presence only. Ensure the log message is concise and executed
immediately on task entry so it appears in worker logs for debugging.
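One way to keep cookie values out of the log is to build the message in a helper that only records which cookie is present. The setting names (`DOWNLOAD_DIR`, `PROXY_MODE`, `YOUTUBE_COOKIE`, `BILIBILI_COOKIE`) come from the comment above; in the task itself this would be emitted via `logger.info(...)` from Loguru.

```python
def build_entry_log(url: str, settings: dict) -> str:
    """Format a task-entry log line that never interpolates raw cookie values."""
    if settings.get("YOUTUBE_COOKIE"):
        cookie = "youtube"
    elif settings.get("BILIBILI_COOKIE"):
        cookie = "bilibili"
    else:
        cookie = "none"
    return (
        f"download_video start url={url} "
        f"dir={settings.get('DOWNLOAD_DIR')} "
        f"proxy_mode={settings.get('PROXY_MODE')} "
        f"cookie={cookie}"  # presence only, never the secret itself
    )
```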

In `@packages/file-export/fastfetchbot_file_export/pdf_export.py`:
- Around line 26-39: export_pdf currently uses parameters typed as str = None
which implicitly relies on None for a non-Optional type; update the signature to
use typing.Optional[str] for html_string and html_file and import Optional from
typing, and add a Loguru logger call (using from loguru import logger) inside
export_pdf to log the input mode (whether html_string or html_file is used) and
the resolved output_path before calling convert_html_to_pdf; reference the
function name export_pdf and the convert_html_to_pdf call so you place the
logger and type changes in the same function.
- Around line 9-23: The function convert_html_to_pdf currently uses implicit
Optional by giving None defaults without modern union annotations; update the
signature to use explicit union types (PEP 604) for the optional params so
html_string and html_file are annotated as str | None (e.g.,
convert_html_to_pdf(output_filename: str, html_string: str | None = None,
html_file: str | None = None) -> None), leaving the body unchanged and no extra
imports required for Python ≥3.12; keep the function name and parameter names
exactly as-is to locate the change.
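The explicit-Optional signature for `export_pdf` could look like the sketch below. The body is stubbed (it returns the input mode instead of calling `convert_html_to_pdf`) purely to show where the type change and the logging call would go.

```python
from typing import Optional


def export_pdf(
    output_path: str,
    html_string: Optional[str] = None,
    html_file: Optional[str] = None,
) -> str:
    """Explicit Optional types instead of the implicit `str = None` defaults."""
    if html_string is None and html_file is None:
        raise ValueError("either html_string or html_file is required")
    mode = "string" if html_string is not None else "file"
    # Real module: logger.info(f"export_pdf mode={mode} -> {output_path}")
    # followed by convert_html_to_pdf(output_path, html_string, html_file).
    return mode
```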

In `@packages/file-export/fastfetchbot_file_export/transcribe.py`:
- Line 80: The call to AudioSegment.from_file(audio_file, "m4a") in
transcribe.py hardcodes the format; change it so the format is inferred (e.g.,
derive ext = Path(audio_file).suffix.lstrip('.') and pass that) or simply call
AudioSegment.from_file(audio_file) to let pydub auto-detect, updating the
AudioSegment.from_file invocation that produces audio_item and ensuring any
downstream logic that expects a specific format still works.
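Extension inference for the `AudioSegment.from_file` call could be a one-liner helper, with `"m4a"` kept only as a last-resort fallback for extensionless paths:

```python
from pathlib import Path


def infer_audio_format(audio_file: str) -> str:
    """Derive the pydub format string from the file's extension."""
    ext = Path(audio_file).suffix.lstrip(".")
    return ext or "m4a"  # fall back to the previously hardcoded format
```

Usage would then be `audio_item = AudioSegment.from_file(audio_file, infer_audio_format(audio_file))`, which also handles names with multiple dots such as `audio.backup.m4a`.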

In `@packages/file-export/fastfetchbot_file_export/video_download.py`:
- Around line 41-54: The function init_yt_downloader has parameters extractor,
video_format, cookie_file_path (and any config param) defaulting to None but
typed as str/dict; update their annotations to explicit optional types (e.g.,
Optional[str] or str | None, and Optional[dict] or dict | None) to satisfy PEP
484/PEP 604 and avoid implicit Optional usage — modify the function signature of
init_yt_downloader and any related type hints where config is declared to use
the explicit Optional form.

In `@packages/file-export/pyproject.toml`:
- Around line 5-11: Update the dependency entries in the dependencies list to
include the same upper-bound constraints used by the monorepo root; specifically
change the strings for yt-dlp, weasyprint, pydub, openai, and loguru to include
appropriate "<" upper bounds consistent with the root (e.g.,
"loguru>=0.7.2,<0.8.0", "openai>=2.15.0,<3.0.0", "pydub>=0.25.1,<0.26.0") so the
package uses bounded ranges matching the workspace policy.
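A bounded dependency list would look like the fragment below; the version numbers are the review's own examples, and the actual upper bounds should be copied from the monorepo root rather than taken from here.

```toml
dependencies = [
    "loguru>=0.7.2,<0.8.0",
    "openai>=2.15.0,<3.0.0",
    "pydub>=0.25.1,<0.26.0",
]
```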

In `@tests/cases/test_file_export/test_video_download.py`:
- Around line 1-30: Add a test that verifies get_format_for_orientation raises a
ValueError for unknown extractors: create a new test function (e.g.,
test_get_format_unknown_extractor) that uses pytest.raises to call
get_format_for_orientation("unknown", "horizontal", hd=False) and asserts the
raised error message contains "no available extractor found"; reference
get_format_for_orientation and the test name so reviewers can locate the correct
place to add it.
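The suggested negative test would use `pytest.raises(ValueError, match="no available extractor found")` around the call. Below, `get_format_for_orientation` is stubbed to mirror the documented error behavior (the real function lives in `fastfetchbot_file_export.video_download`), so the error-path check can be shown self-contained:

```python
def get_format_for_orientation(extractor, orientation, hd=False):
    """Stub mirroring the documented 'no available extractor found' behavior."""
    if extractor not in ("youtube", "bilibili"):
        raise ValueError("no available extractor found")
    return "bestvideo+bestaudio/best"
```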

Comment on lines +45 to +68
def punctuation_assistant(client: OpenAI, transcript: str) -> str:
"""Use GPT to add punctuation and formatting to raw transcript."""
response = client.chat.completions.create(
model="gpt-3.5-turbo-16k",
temperature=0,
messages=[
{"role": "system", "content": PUNCTUATION_SYSTEM_PROMPT},
{"role": "user", "content": transcript},
],
)
return response.choices[0].message.content


def summary_assistant(client: OpenAI, transcript: str) -> str:
"""Use GPT to generate a summary of the transcript."""
response = client.chat.completions.create(
model="gpt-3.5-turbo-16k",
temperature=0,
messages=[
{"role": "system", "content": SUMMARY_SYSTEM_PROMPT},
{"role": "user", "content": transcript},
],
)
return response.choices[0].message.content

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

OpenAI gpt-3.5-turbo-16k deprecation status January 2026

💡 Result:

As of January 2026, the old gpt-3.5-turbo-16k line is already retired:

  • The 16k “0613” snapshot (gpt-3.5-turbo-16k-0613) was deprecated and had a shutdown date of Sep 13, 2024, so it is no longer available by Jan 2026. [1]
  • OpenAI’s current docs no longer have a model page for gpt-3.5-turbo-16k (it returns “Page not found”), indicating it’s not an active, supported model identifier. [3]
  • If you need 16k context on “GPT‑3.5 Turbo”, OpenAI’s guidance is to use gpt-3.5-turbo (the 16k context became the default in the updated GPT‑3.5 Turbo line), and OpenAI also notes newer replacements (e.g., gpt-4o-mini) for many use cases. [1][2]

🏁 Script executed:

# Find all references to gpt-3.5-turbo-16k in the codebase
rg "gpt-3.5-turbo-16k" -i

Repository: aturret/FastFetchBot

Length of output: 253


🏁 Script executed:

# Check the structure and other model references
find . -name "*.py" -type f | head -20

Repository: aturret/FastFetchBot

Length of output: 842


Replace gpt-3.5-turbo-16k with an active model.

The gpt-3.5-turbo-16k model was retired by OpenAI on September 13, 2024, and is no longer available. Both punctuation_assistant and summary_assistant reference this model, causing API calls to fail at runtime.

Use gpt-3.5-turbo (16k context is now the default) or gpt-4o-mini as a replacement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/file-export/fastfetchbot_file_export/transcribe.py` around lines 45
- 68, Both punctuation_assistant and summary_assistant currently reference the
retired model "gpt-3.5-turbo-16k"; update the model parameter in both functions
(punctuation_assistant and summary_assistant) to a supported model such as
"gpt-3.5-turbo" (16k context is now default) or "gpt-4o-mini" so API calls
succeed at runtime, ensuring you replace the model string passed to
client.chat.completions.create in each function.

@aturret
Copy link
Owner Author

aturret commented Feb 20, 2026

Review the code @codex

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6c6f25944a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
packages/file-export/fastfetchbot_file_export/transcribe.py (2)

55-55: response.choices[0].message.content can be None.

Both punctuation_assistant and summary_assistant return response.choices[0].message.content directly. This value is None when the model's finish reason is not "stop" (e.g., "length", "content_filter"), and accessing index 0 raises IndexError if choices is empty. Both cases result in propagating None or an unhandled exception into the transcript string.

♻️ Proposed fix
-    return response.choices[0].message.content
+    content = response.choices[0].message.content if response.choices else None
+    return content or transcript  # fallback to original on empty/null response

Also applies to: 68-68

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/file-export/fastfetchbot_file_export/transcribe.py` at line 55, The
current returns in punctuation_assistant and summary_assistant assume
response.choices[0].message.content exists, which can be None or choices may be
empty; update both functions to validate the OpenAI response: ensure
response.choices is non-empty, check the chosen item's finish_reason is "stop"
(or at minimum that choice.message.content is not None), and handle error cases
by either raising a clear exception or returning a safe fallback (e.g., empty
string) and logging the response (including finish_reason and full choices) for
debugging instead of directly returning response.choices[0].message.content.

93-106: File path computation belongs outside the loop, and split(".") is fragile.

audio_file_list, audio_file_ext, and audio_file_non_ext (Lines 93–96) are recomputed on every iteration despite being invariant. Moving them before the loop avoids redundant work. Additionally, audio_file.split(".") mis-handles filenames with multiple dots (e.g., audio.backup.m4a) — the non-extension part will be truncated. Use os.path.splitext instead.

♻️ Proposed refactor
+    audio_file_non_ext, _ext = os.path.splitext(audio_file)
+    audio_file_ext = _ext.lstrip(".")
     for index, start_ms in enumerate(range(0, audio_length * 1000, SEGMENT_LENGTH * 1000)):
         ...
-        audio_file_list = audio_file.split(".")
-        audio_file_ext = audio_file_list[-1]
-        audio_file_non_ext = ".".join(audio_file_list[:-1])
         segment_path = f"{audio_file_non_ext}-{index + 1}.{audio_file_ext}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/file-export/fastfetchbot_file_export/transcribe.py` around lines 93
- 106, The filename splitting and base-name computation are done inside the loop
and use fragile audio_file.split("."); move the decomposition out of the loop
and replace it with os.path.splitext to correctly handle names with multiple
dots: compute base (audio_file_non_ext) and ext (audio_file_ext) once before
iterating, then inside the loop build segment_path as
f"{audio_file_non_ext}-{index + 1}{audio_file_ext if ext.startswith('.') else
'.'+audio_file_ext}" (or simply use the tuple from splitext to format safely),
keep exporting audio_segment, calling
client.audio.transcriptions.create(model=TRANSCRIBE_MODEL, file=f) and appending
result.text to transcript, and still remove the segment_path after the request;
update references to audio_file_list/audio_file_ext/audio_file_non_ext to use
the new variables so the loop no longer recomputes them.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/file-export/fastfetchbot_file_export/transcribe.py`:
- Line 111: The f-string in transcribe.py uses fullwidth colons (U+FF1A) which
triggers Ruff RUF001; update the string in the expression that calls
summary_assistant(client, transcript) to use ASCII colons ":" instead of the
fullwidth ":" OR, if the fullwidth punctuation is intentional for Chinese
formatting, append a "# noqa: RUF001" comment to that line to suppress the lint
error and document the intent (refer to the f-string that includes
summary_assistant(client, transcript) and transcript).
- Around line 108-112: The code currently calls os.remove(audio_file) before
post-processing (punctuation_assistant and summary_assistant), risking loss if
those calls fail; update the flow so the original audio file is deleted only
after punctuation_assistant(client, transcript) and summary_assistant(client,
transcript) have both succeeded (or persist the raw transcript first).
Concretely: remove or relocate the os.remove(audio_file) call to after the
transcript is saved and the two assistants complete successfully (or write the
raw transcript to disk/DB under the variable transcript_raw before calling
punctuation_assistant) and ensure errors from
punctuation_assistant/summary_assistant abort deletion and propagate for retry.
- Line 80: The code always calls AudioSegment.from_file(audio_file, "m4a") which
ignores the actual file extension extracted later; change the logic to derive
the extension once from the audio filename (e.g., ext =
os.path.splitext(audio_file)[1].lstrip('.') or similar) and reuse that variable
when loading (AudioSegment.from_file(audio_file, ext)) and when constructing
export paths/filenames so the format is consistent; update any references around
the audio_item creation and the later dynamic extension usage to use this single
ext variable.

---

Duplicate comments:
In `@packages/file-export/fastfetchbot_file_export/transcribe.py`:
- Around line 45-55: The code calls the retired model name "gpt-3.5-turbo-16k"
inside the functions punctuation_assistant and summary_assistant; update the
model parameter in both OpenAI chat completion calls to a supported model (e.g.,
"gpt-3.5-turbo" or "gpt-4o-mini") so runtime requests won't fail, ensuring you
change the model string in the client.chat.completions.create(...) invocation in
both functions.
- Around line 85-91: The segmentation loop in transcribe.py is wrong: swap the
range bounds to iterate over audio length (use range(0, audio_length * 1000,
SEGMENT_LENGTH * 1000)), compute end_time in consistent milliseconds (use
end_time = start_time + SEGMENT_LENGTH * 1000 rather than (i + SEGMENT_LENGTH) *
1000), and invert the slice conditional so that when end_time >= audio_length *
1000 you take the tail audio_item[start_time:] else take
audio_item[start_time:end_time]; update variable references (index,
i/start_time/end_time, SEGMENT_LENGTH, audio_length, audio_item) accordingly.

---

Nitpick comments:
In `@packages/file-export/fastfetchbot_file_export/transcribe.py`:
- Line 55: The current returns in punctuation_assistant and summary_assistant
assume response.choices[0].message.content exists, which can be None or choices
may be empty; update both functions to validate the OpenAI response: ensure
response.choices is non-empty, check the chosen item's finish_reason is "stop"
(or at minimum that choice.message.content is not None), and handle error cases
by either raising a clear exception or returning a safe fallback (e.g., empty
string) and logging the response (including finish_reason and full choices) for
debugging instead of directly returning response.choices[0].message.content.
- Around line 93-106: The filename splitting and base-name computation are done
inside the loop and use fragile audio_file.split("."); move the decomposition
out of the loop and replace it with os.path.splitext to correctly handle names
with multiple dots: compute base (audio_file_non_ext) and ext (audio_file_ext)
once before iterating, then inside the loop build segment_path as
f"{audio_file_non_ext}-{index + 1}{audio_file_ext if ext.startswith('.') else
'.'+audio_file_ext}" (or simply use the tuple from splitext to format safely),
keep exporting audio_segment, calling
client.audio.transcriptions.create(model=TRANSCRIBE_MODEL, file=f) and appending
result.text to transcript, and still remove the segment_path after the request;
update references to audio_file_list/audio_file_ext/audio_file_non_ext to use
the new variables so the loop no longer recomputes them.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
packages/file-export/fastfetchbot_file_export/video_download.py (2)

202-210: Move the success return to an else block (TRY300).

Placing the return statement inside the try body makes it harder to distinguish happy-path logic from exception-prone code. An else block runs only when no exception was raised.

♻️ Proposed refactor
-        return {
-            "message": "success",
-            "content_info": content_info,
-            "orientation": orientation,
-            "file_path": file_path_output,
-        }
     except Exception:
         logger.exception("Video download failed")
         raise
+    else:
+        return {
+            "message": "success",
+            "content_info": content_info,
+            "orientation": orientation,
+            "file_path": file_path_output,
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/file-export/fastfetchbot_file_export/video_download.py` around lines
202 - 210, The return of the success payload is inside the try block; move it
into an else block so the happy-path (returning {"message": "success",
"content_info": content_info, "orientation": orientation, "file_path":
file_path_output}) executes only when no exception is raised. In the function in
video_download.py (the block that currently catches Exception as e and calls
logger.exception("Video download failed")), wrap the existing success return in
an else: after the except handler and keep the logger.exception/raise in except.
This separates success logic from exception-prone code and preserves the same
variables (content_info, orientation, file_path_output).

44-49: PEP 484: Use explicit T | None annotations for nullable parameters.

Four parameters use implicit Optional (str = None, dict = None), which Ruff (RUF013) flags:

  • extractor: str = None (line 44)
  • video_format: str = None (line 47)
  • cookie_file_path: str = None (line 49)
  • config: dict = None (line 125)
♻️ Proposed fix
-    extractor: str = None,
+    extractor: str | None = None,
 ...
-    video_format: str = None,
+    video_format: str | None = None,
 ...
-    cookie_file_path: str = None,
+    cookie_file_path: str | None = None,
-    config: dict = None,
+    config: dict | None = None,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/file-export/fastfetchbot_file_export/video_download.py` around lines
44 - 49, Update the nullable parameter annotations to use explicit union types:
change extractor: str = None, video_format: str = None, and cookie_file_path:
str = None to extractor: str | None = None, video_format: str | None = None,
cookie_file_path: str | None = None, and change config: dict = None to config:
dict | None = None; if the module must remain compatible with older evaluation
semantics, add from __future__ import annotations at the top so the PEP 604 pipe
syntax is allowed without runtime issues.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/file-export/fastfetchbot_file_export/video_download.py`:
- Line 48: The default download_dir set to "/tmp" is insecure; update
init_yt_downloader (and the fallback logic in download_video) to avoid a
hardcoded world-writable path by requiring an explicit config or creating a
dedicated temporary directory using tempfile.mkdtemp(prefix="fastfetchbot_",
dir=None) and immediately set restrictive permissions (os.chmod(path, 0o700));
replace default parameter download_dir: str = "/tmp" with either download_dir:
Optional[str] = None and if None create the secure temp dir, ensure the created
directory is used consistently in download_video, and add proper cleanup of that
directory when done.
- Around line 208-210: The except clause currently binds an unused variable with
"except Exception as e" — remove the unused binding by changing it to "except
Exception:" so logger.exception("Video download failed") still captures the
active exception and the subsequent raise re-raises it; update the except block
surrounding the video download logic where logger.exception is called to
eliminate the unused variable `e`.
- Around line 193-200: The current download code uses
prepare_filename(content_info) where content_info was obtained from
extract_info(..., download=False), so prepare_filename still reflects the
pre-merge ext (e.g., webm) and yields a non-existent path when merging to mp4;
instead, perform the actual download via extract_info(url, download=True) (or
use the info dict returned by downloader.download/extract_info when
download=True) during Phase 2 and use the returned post-processed info's
filepath/filename (or ext) as the real output path rather than calling
prepare_filename(content_info); update the download block around
downloader.download/extract_info and replace prepare_filename(content_info) with
the post-download info's filepath/filename so merged ext (merge_output_format)
is honored.

---

Duplicate comments:
In `@packages/file-export/fastfetchbot_file_export/video_download.py`:
- Around line 77-101: The branch in init_yt_downloader duplicates the
format-selection done in get_format_for_orientation and always sets a Bilibili
referer; instead, when video_format is None call get_format_for_orientation(...)
to pick the format (preserving the current extractor/hd/bilibili_cookie inputs
and keeping the same ValueError checks for extractor being None/unrecognized),
and only add "referer": "https://www.bilibili.com/" into ydl_opts when extractor
== "bilibili" (leave referer out for YouTube and others). Ensure you reference
video_format, extractor, bilibili_cookie, get_format_for_orientation, and
ydl_opts in the change so the duplicated logic is removed and the referer is
conditional.
- Around line 8-18: The get_video_orientation function currently treats missing
formats or absent aspect_ratio as vertical; change it to default to "horizontal"
on metadata failures and avoid the biased 0.56 default: when extractor ==
"youtube", check content_info.get("formats") exists and is non-empty and then
scan formats (not just formats[0]) for the first video stream that has a usable
aspect ratio or width/height (ignore audio-only entries that lack width/height
or have a vcodec of "none"); compute aspect_ratio from aspect_ratio field if
present, otherwise from width/height when both exist, and only if that computed
ratio < 1 return "vertical", otherwise "horizontal".

---

Nitpick comments:
In `@packages/file-export/fastfetchbot_file_export/video_download.py`:
- Around line 202-210: The return of the success payload is inside the try
block; move it into an else block so the happy-path (returning {"message":
"success", "content_info": content_info, "orientation": orientation,
"file_path": file_path_output}) executes only when no exception is raised. In
the function in video_download.py (the block that currently catches Exception as
e and calls logger.exception("Video download failed")), wrap the existing
success return in an else: after the except handler and keep the
logger.exception/raise in except. This separates success logic from
exception-prone code and preserves the same variables (content_info,
orientation, file_path_output).
- Around line 44-49: Update the nullable parameter annotations to use explicit
union types: change extractor: str = None, video_format: str = None, and
cookie_file_path: str = None to extractor: str | None = None, video_format: str
| None = None, cookie_file_path: str | None = None, and change config: dict =
None to config: dict | None = None; if the module must remain compatible with
older evaluation semantics, add from __future__ import annotations at the top so
the PEP 604 pipe syntax is allowed without runtime issues.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
.github/workflows/ci.yml (2)

10-12: ⚠️ Potential issue | 🟠 Major

Concurrency group is not scoped by ref — tag builds can cancel in-flight branch builds and vice versa.

With a single static group key fastfetchbot and cancel-in-progress: true, any new workflow run (whether triggered by a v* tag or a main push) will cancel whatever is currently running. This means:

  • Pushing to main after tagging can abort a production image push mid-flight, leaving a partial/corrupted image in the container registry.
  • Two rapid tag pushes will cancel each other.

Scope the group key by github.ref to isolate concurrency per branch/tag:

🛠️ Proposed fix
 concurrency:
-  group: fastfetchbot
+  group: fastfetchbot-${{ github.ref }}
   cancel-in-progress: true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml around lines 10 - 12, The concurrency block
currently uses a static group key ("fastfetchbot") so runs across refs cancel
each other; update the concurrency.group to include the ref (e.g., interpolate
github.ref or another ref-specific variable) so the key is scoped per branch/tag
(keep cancel-in-progress: true); specifically modify the concurrency -> group
value to something like "fastfetchbot-${{ github.ref }}" to isolate runs while
leaving concurrency and cancel-in-progress as-is.

🧹 Nitpick comments (11)
CLAUDE.md (1)

156-163: Consider documenting Celery worker conventions alongside the other conventions.

The Key Conventions section covers the shared library, API, bot, and templates but says nothing about the new worker. Future contributors would benefit from a brief note covering when to route work to the Celery worker (e.g., video download, PDF export, audio transcription), how tasks are enqueued from the API/bot side, and which package owns task definitions.

📝 Suggested addition
 ### Key Conventions
 - Shared models and utilities go in `packages/shared/fastfetchbot_shared/`
 - API-specific code goes in `apps/api/src/`
 - Telegram bot code goes in `apps/telegram-bot/core/`
+- Long-running/blocking operations (video download, PDF generation, audio transcription) go in Celery tasks under `apps/worker/worker_core/tasks/`; the API and bot enqueue them via the Celery client rather than executing them inline
 - The bot communicates with the API only via HTTP — no direct imports of API code
 - Jinja2 templates for output formatting, with i18n support via Babel
 - Loguru for logging, Sentry for production error monitoring
 - Store sensitive cookies/tokens in environment variables, never in code
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CLAUDE.md` around lines 156 - 163, Add a short Celery worker subsection to
"Key Conventions" explaining when to route long-running jobs (e.g., video
download, PDF export, audio transcription), how tasks are enqueued
(API/telegram-bot should trigger tasks via the HTTP task enqueue endpoint or by
calling the shared enqueue helper in packages/shared/fastfetchbot_shared/), and
which package owns task definitions (declare the worker app/package—e.g.,
apps/worker or packages/worker/tasks.py—as the single source of Celery task
definitions and include a note to import only task names, not implementation,
from other services); also mention monitoring/visibility (Loguru/Sentry
integration) and that sensitive credentials for broker/backend must live in
environment variables.
apps/api/src/services/file_export/video_download/__init__.py (2)

169-180: Move send_task inside the try block and use else for the happy path (Ruff TRY300).

Two related issues:

  1. celery_app.send_task(...) at line 169 sits outside the try/except, so broker-level errors (e.g., kombu.exceptions.OperationalError) propagate without the structured log context.
  2. return content_info at line 174 is inside the try block — Ruff TRY300 recommends moving it to an else clause so the intent is unambiguous and exception-path logic is not mixed with the success path.

Both issues are present in the AudioTranscribe sibling at apps/api/src/services/file_export/audio_transcribe/__init__.py lines 18-23.

♻️ Proposed refactor
-        result = celery_app.send_task("file_export.video_download", kwargs=body)
         try:
+            result = celery_app.send_task("file_export.video_download", kwargs=body)
             response = await asyncio.to_thread(result.get, timeout=int(DOWNLOAD_VIDEO_TIMEOUT))
-            content_info = response["content_info"]
-            content_info["file_path"] = response["file_path"]
-            return content_info
         except Exception:
             logger.exception(
                 f"file_export.video_download task failed: url={url}, extractor={extractor}, "
                 f"timeout={DOWNLOAD_VIDEO_TIMEOUT}"
             )
             raise
+        else:
+            content_info = response["content_info"]
+            content_info["file_path"] = response["file_path"]
+            return content_info
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/services/file_export/video_download/__init__.py` around lines
169 - 180, Move the call to celery_app.send_task(...) into the try block so
broker-level errors are caught and logged by the existing except handler, and
after calling result.get(...) keep the success-path return out of the try by
placing the return of content_info inside an else clause; specifically update
the block that uses celery_app.send_task, result.get, DOWNLOAD_VIDEO_TIMEOUT,
logger.exception, and content_info to call send_task inside try, handle
exceptions in except as-is, and put the happy-path code (assigning
content_info["file_path"] and returning content_info) in an else so only
successful execution reaches the return; apply the same change to the
AudioTranscribe sibling that uses the same pattern.

9-10: Consolidate two src.config imports into one line.

Lines 10 and 13 both import from src.config; they should be merged into a single statement.

♻️ Proposed refactor
 from src.services.celery_client import celery_app
-from src.config import DOWNLOAD_VIDEO_TIMEOUT
+from src.config import DOWNLOAD_VIDEO_TIMEOUT, JINJA2_ENV
 from fastfetchbot_shared.utils.parse import unix_timestamp_to_utc, second_to_time, wrap_text_into_html
 from fastfetchbot_shared.utils.logger import logger
-from src.config import JINJA2_ENV

Also applies to: 13-13

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/services/file_export/video_download/__init__.py` around lines 9
- 10, There are two separate imports from the same module; consolidate the
`src.config` imports into one statement by combining all imported names (e.g.,
keep `from src.services.celery_client import celery_app` as-is and replace
multiple `from src.config` lines with a single `from src.config import
DOWNLOAD_VIDEO_TIMEOUT, <other_config_names>` so all config symbols used in this
module are imported in one line; update references accordingly (look for
`celery_app` and `DOWNLOAD_VIDEO_TIMEOUT` to locate the relevant imports).
packages/file-export/fastfetchbot_file_export/video_download.py (4)

198-203: Consider moving return to an else block (TRY300).

Having a return inside a try block is flagged by Ruff and makes control flow slightly harder to follow. Moving it to an explicit else clause makes the success path unambiguous.

♻️ Proposed fix
-        return {
-            "message": "success",
-            "content_info": content_info,
-            "orientation": orientation,
-            "file_path": file_path_output,
-        }
     except Exception:
         logger.exception(f"download_video failed: url={url}\n{traceback.format_exc()}")
         raise
+    else:
+        return {
+            "message": "success",
+            "content_info": content_info,
+            "orientation": orientation,
+            "file_path": file_path_output,
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/file-export/fastfetchbot_file_export/video_download.py` around lines
198 - 203, The current try block returns the success dict directly, which Ruff
flags; refactor by moving the return out of the try into an explicit
try/except/else structure: perform risky operations inside try, handle
exceptions in the existing except, and place the return {"message": "success",
"content_info": content_info, "orientation": orientation, "file_path":
file_path_output} in the else block so the success path is unambiguous; update
the surrounding function (video_download.py, the function that defines
content_info/orientation/file_path_output) accordingly.

48-53: Fix implicit Optional annotations (RUF013).

extractor, video_format, and cookie_file_path all default to None but are typed as plain str, which violates PEP 484. The same issue applies to config at line 118.

♻️ Proposed fix
 def init_yt_downloader(
     hd: bool = False,
     audio_only: bool = False,
-    extractor: str = None,
+    extractor: str | None = None,
     no_proxy: bool = False,
     extract_only: bool = False,
-    video_format: str = None,
+    video_format: str | None = None,
     download_dir: str = "/tmp",
-    cookie_file_path: str = None,
+    cookie_file_path: str | None = None,
     ...
 def download_video(
     ...
-    config: dict = None,
+    config: dict | None = None,
 ) -> dict:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/file-export/fastfetchbot_file_export/video_download.py` around lines
48 - 53, The parameters extractor, video_format, and cookie_file_path are
annotated as plain str but default to None—change their annotations to
Optional[str] and add "from typing import Optional" to the imports; also update
the variable/parameter named config (at its declaration around line 118) to
Optional[...] with the appropriate inner type, and run a quick type-check to
ensure uses of extractor, video_format, cookie_file_path, and config handle None
safely (e.g., by checking for None before string operations).

2-2: traceback.format_exc() is redundant — Loguru's logger.exception() already captures the traceback.

logger.exception() is Loguru's shorthand for logger.opt(exception=True).error(...), which automatically records the active exception and its full traceback. Embedding traceback.format_exc() in the f-string causes the traceback to appear twice in the log output. Removing it also makes the import traceback at line 2 unnecessary.

♻️ Proposed fix
 import os
-import traceback
 
 from loguru import logger
 ...
     except Exception:
-        logger.exception(f"download_video failed: url={url}\n{traceback.format_exc()}")
+        logger.exception(f"download_video failed: url={url}")
         raise

Also applies to: 204-205

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/file-export/fastfetchbot_file_export/video_download.py` at line 2,
Remove the redundant use of traceback.format_exc() and the unused import
traceback: replace calls that do logger.exception(f"...
{traceback.format_exc()}") with logger.exception("...") so Loguru will record
the exception and full traceback automatically, and delete the top-level import
traceback; also update the other logger.exception usages that embed
traceback.format_exc() (the occurrences later in the file) the same way.
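
The behavior is easy to verify. The sketch below uses the stdlib `logging` module, whose `exception()` records the active traceback the same way Loguru's shorthand does, so no `traceback.format_exc()` is needed in the message.

```python
import io
import logging

# Capture root-logger output in a buffer so we can inspect it.
buf = io.StringIO()
logging.basicConfig(stream=buf, level=logging.ERROR, force=True)

try:
    1 / 0
except Exception:
    # The traceback is appended automatically; embedding format_exc() would duplicate it.
    logging.exception("download_video failed: url=%s", "https://example.com/v")

log_output = buf.getvalue()
```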

45-58: no_proxy parameter in init_yt_downloader is unreachable through download_video.

no_proxy controls whether the proxy is applied (if proxy_mode and not no_proxy), but download_video never extracts or forwards it from config. Every call made through download_video silently uses the default no_proxy=False, making the parameter inaccessible from the public API.

Either expose no_proxy via config in download_video or document that the parameter is only usable when calling init_yt_downloader directly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/file-export/fastfetchbot_file_export/video_download.py` around lines
45 - 58, The download_video path never forwards the no_proxy flag to
init_yt_downloader, so no_proxy is effectively unreachable; update
download_video to read no_proxy from the incoming config (e.g., no_proxy =
config.get("no_proxy", False)) and pass it into init_yt_downloader when
constructing the YoutubeDL instance (alongside
proxy_mode/proxy_url/youtube_cookie/bilibili_cookie), ensuring the existing
conditional (if proxy_mode and not no_proxy) in init_yt_downloader works as
intended; alternatively, if you prefer not to expose it, add documentation to
the function docstrings stating no_proxy is only effective when calling
init_yt_downloader directly.
apps/api/src/services/file_export/audio_transcribe/__init__.py (1)

3-3: DOWNLOAD_VIDEO_TIMEOUT is a misleading constant for the audio transcription timeout.

The constant name implies it governs video downloads, but it's reused here as the Celery result timeout for audio transcription. If audio transcription (Whisper + post-GPT processing) takes longer than video downloads, this can produce spurious TimeoutErrors.

Consider either introducing a dedicated TRANSCRIBE_TIMEOUT config key or renaming the existing constant to something broader (e.g., FILE_EXPORT_TASK_TIMEOUT) and using it consistently across all file-export tasks.

Also applies to: 22-22

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/services/file_export/audio_transcribe/__init__.py` at line 3,
The imported constant DOWNLOAD_VIDEO_TIMEOUT in
apps/api/src/services/file_export/audio_transcribe/__init__.py is misleading for
Celery result timeout used by audio transcription; introduce a dedicated
TRANSCRIBE_TIMEOUT in the config (or rename DOWNLOAD_VIDEO_TIMEOUT to a broader
FILE_EXPORT_TASK_TIMEOUT) and update this module to import and use that new
symbol (TRANSCRIBE_TIMEOUT) for the transcription Celery result timeout; ensure
you also update any other file-export task modules to use the new broader
constant if you chose renaming so all file-export timeouts are consistent.
apps/worker/worker_core/tasks/transcribe.py (1)

12-12: Consider moving the ValueError message into a dedicated exception class (Ruff TRY003).

Ruff flags long inline messages on raise statements as TRY003. This is a style concern rather than a correctness issue, but it keeps the message co-located with the exception type for reuse.

♻️ Proposed refactor
+class MissingAPIKeyError(ValueError):
+    """Raised when OPENAI_API_KEY is not configured in the worker environment."""

 @app.task(name="file_export.transcribe")
 def transcribe_task(audio_file: str) -> dict:
     logger.info(f"transcribe_task started: audio_file={audio_file}")
     if not OPENAI_API_KEY:
         logger.error("transcribe_task failed: OPENAI_API_KEY is not set")
-        raise ValueError("OPENAI_API_KEY is not configured in the worker environment")
+        raise MissingAPIKeyError
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/worker_core/tasks/transcribe.py` at line 12, Replace the inline
ValueError message in transcribe.py with a dedicated exception class (e.g.,
define class OpenAIKeyNotConfigured(Exception): with the long message as the
class docstring or default args in __init__) and then raise
OpenAIKeyNotConfigured() instead of raise ValueError("OPENAI_API_KEY is not
configured in the worker environment"); update any imports/usages in this module
to reference OpenAIKeyNotConfigured so the message is co-located with the
exception type.
apps/api/src/services/file_export/document_export/pdf_export.py (1)

8-8: DOWNLOAD_VIDEO_TIMEOUT repurposed as the PDF Celery task timeout.

The config constant name implies a video-specific setting. Reusing it for a PDF export task makes the timeout semantics opaque and could be accidentally changed during a video-only refactor.

🔧 Suggested fix

Consider adding a dedicated config key, e.g. PDF_EXPORT_TIMEOUT, or at minimum a more generic CELERY_TASK_TIMEOUT. Change the import and usage accordingly:

-from src.config import DOWNLOAD_VIDEO_TIMEOUT, AWS_STORAGE_ON
+from src.config import PDF_EXPORT_TIMEOUT, AWS_STORAGE_ON
 ...
-            response = await asyncio.to_thread(result.get, timeout=int(DOWNLOAD_VIDEO_TIMEOUT))
+            response = await asyncio.to_thread(result.get, timeout=int(PDF_EXPORT_TIMEOUT))

Also applies to: 37-37

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/services/file_export/document_export/pdf_export.py` at line 8,
The PDF export task is using DOWNLOAD_VIDEO_TIMEOUT which is misnamed for PDFs;
add a new config constant (e.g. PDF_EXPORT_TIMEOUT or CELERY_TASK_TIMEOUT) in
the config module, update the import in pdf_export.py to use the new name
instead of DOWNLOAD_VIDEO_TIMEOUT, and replace all usages in the file
(references to DOWNLOAD_VIDEO_TIMEOUT) with the new constant (ensure any other
files that share the same value are updated if you choose a generic name).
pyproject.toml (1)

57-57: celery-types is missing an upper-bound version constraint.

Every other dev dependency in [dependency-groups] is pinned with an upper bound (e.g., black>=25.1.0,<26.0.0). celery-types>=0.24.0 has only a lower bound, which deviates from the project's convention.

🔧 Suggested fix
-    "celery-types>=0.24.0",
+    "celery-types>=0.24.0,<0.25.0",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` at line 57, The dev dependency entry "celery-types>=0.24.0"
in pyproject.toml lacks an upper-bound constraint and should follow the
project's pinning convention; update that dependency to include a reasonable
upper bound (for example "<0.25.0" or another next-major cutoff) so it matches
other entries like "black>=25.1.0,<26.0.0", i.e., change the spec for
celery-types to "celery-types>=0.24.0,<0.25.0" (or an appropriate upper bound)
to enforce compatibility.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/ci.yml:
- Around line 52-54: The if condition is using a pre-evaluated expression "${{
github.ref }}" which risks shell injection; update the condition to use the
runtime environment variable $GITHUB_REF instead (i.e. change if [[ "${{
github.ref }}" == refs/tags/* ]]; then to if [[ "$GITHUB_REF" == refs/tags/* ]];
then) and ensure VERSION_TAG=${GITHUB_REF#refs/tags/} remains unchanged; also
apply the same replacement inside the commented deploy block (the commented
if/then that references github.ref) so all checks use $GITHUB_REF at runtime.
- Around line 52-54: Replace direct interpolation of the expression "${{
github.ref }}" inside the run script with GitHub Actions env mapping: set an env
entry GITHUB_REF: ${{ github.ref }} for the job/step and then reference the
shell variable GITHUB_REF in the run block (e.g., change the if test to use
"$GITHUB_REF" == refs/tags/*). Apply the same env/GITHUB_REF pattern to the
commented deploy block before enabling it so both the VERSION_TAG logic and any
deploy steps use the environment-mapped GITHUB_REF instead of direct expression
interpolation.

In `@CLAUDE.md`:
- Line 9: The fenced code block that contains the directory tree entry
"FastFetchBot/" is missing a language specifier (MD040); update that fenced
block (the one enclosing "FastFetchBot/") to include a language tag such as text
(e.g., ```text) so the markdown linter stops reporting the violation.
- Around line 113-125: The two Markdown tables under the "Required" heading and
the "Service Communication (Docker)" heading are missing blank lines around them
(MD058); edit CLAUDE.md to insert a blank line immediately before and after each
table so each table is separated from surrounding paragraphs/headings—i.e.,
ensure there is an empty line above the "### Required" table and below it, and
likewise an empty line above and below the "### Service Communication (Docker)"
table.

---

Outside diff comments:
In @.github/workflows/ci.yml:
- Around line 10-12: The concurrency block currently uses a static group key
("fastfetchbot") so runs across refs cancel each other; update the
concurrency.group to include the ref (e.g., interpolate github.ref or another
ref-specific variable) so the key is scoped per branch/tag (keep
cancel-in-progress: true); specifically modify the concurrency -> group value to
something like "fastfetchbot-${{ github.ref }}" to isolate runs while leaving
concurrency and cancel-in-progress as-is.

---

Duplicate comments:
In `@apps/api/src/services/file_export/document_export/__init__.py`:
- Around line 11-13: The export path is calling PdfExport incorrectly:
PdfExport.__init__ expects (title: str, html_string: str = None) and
PdfExport.export is async, so change the call in export() to instantiate
PdfExport with the title (e.g. self.document["title"]) and pass the HTML content
as the html_string keyword (html_string=self.document["content"]), then await
the async PdfExport.export() call (await PdfExport(...).export()); update
references to PdfExport.export, PdfExport.__init__, and the export() method in
the class accordingly.

In `@packages/file-export/fastfetchbot_file_export/transcribe.py`:
- Line 109: Replace the two fullwidth colon characters in the f-string that
constructs the final message (the f-string calling summary_assistant(client,
transcript) and including transcript) with ASCII colons, or if the fullwidth
punctuation is intentional, add a per-line linter suppression comment "# noqa:
RUF001" directly after that line; locate the f-string using summary_assistant
and transcript to make the change.
- Around line 45-68: Both punctuation_assistant and summary_assistant call a
retired model ("gpt-3.5-turbo-16k") and will fail; update the model string in
both functions (punctuation_assistant and summary_assistant) to a supported
alternative such as "gpt-4o-mini" or "gpt-3.5-turbo" to restore runtime calls,
keeping the rest of the chat.completions.create call and parameters unchanged so
behavior and prompt wiring remains intact.

In `@packages/file-export/fastfetchbot_file_export/video_download.py`:
- Line 52: The default parameter download_dir: str = "/tmp" is insecure; change
the signature to avoid a public /tmp default by either setting download_dir
default to None (and require callers to provide it) or using a secure
per-process temp directory created with tempfile.mkdtemp() at runtime, ensure
the code that uses download_dir (the download_dir parameter in this
module/function) validates the path, creates the directory with restrictive
permissions, and schedules cleanup; update any callers/tests accordingly.

---

Nitpick comments:
In `@apps/api/src/services/file_export/audio_transcribe/__init__.py`:
- Line 3: The imported constant DOWNLOAD_VIDEO_TIMEOUT in
apps/api/src/services/file_export/audio_transcribe/__init__.py is misleading for
Celery result timeout used by audio transcription; introduce a dedicated
TRANSCRIBE_TIMEOUT in the config (or rename DOWNLOAD_VIDEO_TIMEOUT to a broader
FILE_EXPORT_TASK_TIMEOUT) and update this module to import and use that new
symbol (TRANSCRIBE_TIMEOUT) for the transcription Celery result timeout; ensure
you also update any other file-export task modules to use the new broader
constant if you chose renaming so all file-export timeouts are consistent.

In `@apps/api/src/services/file_export/document_export/pdf_export.py`:
- Line 8: The PDF export task is using DOWNLOAD_VIDEO_TIMEOUT which is misnamed
for PDFs; add a new config constant (e.g. PDF_EXPORT_TIMEOUT or
CELERY_TASK_TIMEOUT) in the config module, update the import in pdf_export.py to
use the new name instead of DOWNLOAD_VIDEO_TIMEOUT, and replace all usages in
the file (references to DOWNLOAD_VIDEO_TIMEOUT) with the new constant (ensure
any other files that share the same value are updated if you choose a generic
name).

In `@apps/api/src/services/file_export/video_download/__init__.py`:
- Around line 169-180: Move the call to celery_app.send_task(...) into the try
block so broker-level errors are caught and logged by the existing except
handler, and after calling result.get(...) keep the success-path return out of
the try by placing the return of content_info inside an else clause;
specifically update the block that uses celery_app.send_task, result.get,
DOWNLOAD_VIDEO_TIMEOUT, logger.exception, and content_info to call send_task
inside try, handle exceptions in except as-is, and put the happy-path code
(assigning content_info["file_path"] and returning content_info) in an else so
only successful execution reaches the return; apply the same change to the
AudioTranscribe sibling that uses the same pattern.
- Around line 9-10: There are two separate imports from the same module;
consolidate the `src.config` imports into one statement by combining all
imported names (e.g., keep `from src.services.celery_client import celery_app`
as-is and replace multiple `from src.config` lines with a single `from
src.config import DOWNLOAD_VIDEO_TIMEOUT, <other_config_names>` so all config
symbols used in this module are imported in one line; update references
accordingly (look for `celery_app` and `DOWNLOAD_VIDEO_TIMEOUT` to locate the
relevant imports).

In `@apps/worker/worker_core/tasks/transcribe.py`:
- Line 12: Replace the inline ValueError message in transcribe.py with a
dedicated exception class (e.g., define class OpenAIKeyNotConfigured(Exception):
with the long message as the class docstring or default args in __init__) and
then raise OpenAIKeyNotConfigured() instead of raise ValueError("OPENAI_API_KEY
is not configured in the worker environment"); update any imports/usages in this
module to reference OpenAIKeyNotConfigured so the message is co-located with the
exception type.

In `@CLAUDE.md`:
- Around line 156-163: Add a short Celery worker subsection to "Key Conventions"
explaining when to route long-running jobs (e.g., video download, PDF export,
audio transcription), how tasks are enqueued (API/telegram-bot should trigger
tasks via the HTTP task enqueue endpoint or by calling the shared enqueue helper
in packages/shared/fastfetchbot_shared/), and which package owns task
definitions (declare the worker app/package—e.g., apps/worker or
packages/worker/tasks.py—as the single source of Celery task definitions and
include a note to import only task names, not implementation, from other
services); also mention monitoring/visibility (Loguru/Sentry integration) and
that sensitive credentials for broker/backend must live in environment
variables.

In `@packages/file-export/fastfetchbot_file_export/video_download.py`:
- Around line 198-203: The current try block returns the success dict directly,
which Ruff flags; refactor by moving the return out of the try into an explicit
try/except/else structure: perform risky operations inside try, handle
exceptions in the existing except, and place the return {"message": "success",
"content_info": content_info, "orientation": orientation, "file_path":
file_path_output} in the else block so the success path is unambiguous; update
the surrounding function (video_download.py, the function that defines
content_info/orientation/file_path_output) accordingly.
- Around line 48-53: The parameters extractor, video_format, and
cookie_file_path are annotated as plain str but default to None—change their
annotations to Optional[str] and add "from typing import Optional" to the
imports; also update the variable/parameter named config (at its declaration
around line 118) to Optional[...] with the appropriate inner type, and run a
quick type-check to ensure uses of extractor, video_format, cookie_file_path,
and config handle None safely (e.g., by checking for None before string
operations).
- Line 2: Remove the redundant use of traceback.format_exc() and the unused
import traceback: replace calls that do logger.exception(f"...
{traceback.format_exc()}") with logger.exception("...") so Loguru will record
the exception and full traceback automatically, and delete the top-level import
traceback; also update the other logger.exception usages that embed
traceback.format_exc() (the occurrences later in the file) the same way.
- Around line 45-58: The download_video path never forwards the no_proxy flag to
init_yt_downloader, so no_proxy is effectively unreachable; update
download_video to read no_proxy from the incoming config (e.g., no_proxy =
config.get("no_proxy", False)) and pass it into init_yt_downloader when
constructing the YoutubeDL instance (alongside
proxy_mode/proxy_url/youtube_cookie/bilibili_cookie), ensuring the existing
conditional (if proxy_mode and not no_proxy) in init_yt_downloader works as
intended; alternatively, if you prefer not to expose it, add documentation to
the function docstrings stating no_proxy is only effective when calling
init_yt_downloader directly.

In `@pyproject.toml`:
- Line 57: The dev dependency entry "celery-types>=0.24.0" in pyproject.toml
lacks an upper-bound constraint and should follow the project's pinning
convention; update that dependency to include a reasonable upper bound (for
example "<0.25.0" or another next-major cutoff) so it matches other entries like
"black>=25.1.0,<26.0.0", i.e., change the spec for celery-types to
"celery-types>=0.24.0,<0.25.0" (or an appropriate upper bound) to enforce
compatibility.


## Architecture

```

⚠️ Potential issue | 🟡 Minor

Add a language specifier to the fenced code block.

The directory-tree block has no language tag, which violates MD040.

📝 Proposed fix
-```
+```text
 FastFetchBot/
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 9-9: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CLAUDE.md` at line 9, The fenced code block that contains the directory tree
entry "FastFetchBot/" is missing a language specifier (MD040); update that
fenced block (the one enclosing "FastFetchBot/") to include a language tag such
as text (e.g., ```text) so the markdown linter stops reporting the violation.

Comment on lines +113 to +125
### Required
| Variable | Description |
|----------|-------------|
| `BASE_URL` | Public server domain (used for webhook URL construction) |
| `TELEGRAM_BOT_TOKEN` | Bot token from @BotFather |
| `TELEGRAM_CHAT_ID` | Default chat ID for the bot |

### Critical Setup Notes
- Most social media scrapers require authentication cookies/tokens
### Service Communication (Docker)
| Variable | Default | Description |
|----------|---------|-------------|
| `API_SERVER_URL` | `http://localhost:10450` | URL the Telegram Bot uses to call the API. `http://api:10450` in Docker. |
| `TELEGRAM_BOT_CALLBACK_URL` | `http://localhost:10451` | URL the API uses to call the Telegram Bot. `http://telegram-bot:10451` in Docker. |
| `TELEGRAM_BOT_MODE` | `polling` | `polling` (dev) or `webhook` (production with HTTPS) |

⚠️ Potential issue | 🟡 Minor

Add blank lines around the environment variable tables (MD058).

Both the "Required" table (around line 114) and the "Service Communication (Docker)" table (around line 121) are flagged by markdownlint for missing surrounding blank lines.

📝 Proposed fix
 ### Required
+
 | Variable | Description |
 |----------|-------------|
 | `BASE_URL` | Public server domain (used for webhook URL construction) |
 | `TELEGRAM_BOT_TOKEN` | Bot token from `@BotFather` |
 | `TELEGRAM_CHAT_ID` | Default chat ID for the bot |
+
 ### Service Communication (Docker)
+
 | Variable | Default | Description |
 |----------|---------|-------------|
 | `API_SERVER_URL` | `http://localhost:10450` | URL the Telegram Bot uses to call the API. `http://api:10450` in Docker. |
 | `TELEGRAM_BOT_CALLBACK_URL` | `http://localhost:10451` | URL the API uses to call the Telegram Bot. `http://telegram-bot:10451` in Docker. |
 | `TELEGRAM_BOT_MODE` | `polling` | `polling` (dev) or `webhook` (production with HTTPS) |
+
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 114-114: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)


[warning] 121-121: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CLAUDE.md` around lines 113 - 125, The two Markdown tables under the
"Required" heading and the "Service Communication (Docker)" heading are missing
blank lines around them (MD058); edit CLAUDE.md to insert a blank line
immediately before and after each table so each table is separated from
surrounding paragraphs/headings—i.e., ensure there is an empty line above the
"### Required" table and below it, and likewise an empty line above and below
the "### Service Communication (Docker)" table.

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/ci.yml (1)

10-12: ⚠️ Potential issue | 🟠 Major

Static concurrency group breaks isolation between tag and branch builds.

With group: fastfetchbot and cancel-in-progress: true, a branch push (staging) and a near-simultaneous tag push (production) compete for the same slot — whichever arrives second cancels the first. Since tag builds publish the latest production image, silently canceling them mid-flight means the image is never pushed.

Fix: include ${{ github.ref }} in the group key so tag workflows and branch workflows run independently:

🔒 Proposed fix
 concurrency:
-  group: fastfetchbot
+  group: fastfetchbot-${{ github.ref }}
   cancel-in-progress: true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml around lines 10 - 12, The concurrency block
currently uses a static group ("group: fastfetchbot") with "cancel-in-progress:
true" which causes tag and branch workflows to cancel each other; update the
concurrency group key to include the workflow ref (e.g., interpolate ${{
github.ref }}) so each ref gets its own concurrency slot—modify the "group:
fastfetchbot" value to embed the ref variable (preserving "cancel-in-progress:
true") so tag builds and branch builds run independently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In @.github/workflows/ci.yml:
- Around line 10-12: The concurrency block currently uses a static group
("group: fastfetchbot") with "cancel-in-progress: true" which causes tag and
branch workflows to cancel each other; update the concurrency group key to
include the workflow ref (e.g., interpolate ${{ github.ref }}) so each ref gets
its own concurrency slot—modify the "group: fastfetchbot" value to embed the ref
variable (preserving "cancel-in-progress: true") so tag builds and branch builds
run independently.

---

Duplicate comments:
In @.github/workflows/ci.yml:
- Around line 103-108: The workflow currently injects secrets by macro-expanding
`${{ secrets.* }}` directly into the shell script body for TOKEN and WEBHOOK_URL
(the commented deploy block), which is unsafe if secrets contain shell
meta-characters; change this to pass those secrets via an `env:` mapping (e.g.,
set TOKEN and WEBHOOK_URL in the job or step `env:` using `${{
secrets.PROD_WATCHTOWER_TOKEN }}` / `${{ secrets.STAGE_WATCHTOWER_TOKEN }}` and
the matching webhook secrets) before you uncomment the deploy block so the shell
only reads them as environment variables instead of performing inline expansion.
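The `env:` pattern described above can be sketched as follows. This is a minimal illustration, not the repository's actual deploy step: the step name, the `curl` invocation, and the `PROD_WATCHTOWER_WEBHOOK_URL` secret name are assumptions (the comment names only the token secrets and refers to "the matching webhook secrets").

```yaml
# Sketch only: the secret values reach the shell as environment variables,
# so shell meta-characters in them are never macro-expanded into the script.
- name: Notify Watchtower (illustrative)
  env:
    TOKEN: ${{ secrets.PROD_WATCHTOWER_TOKEN }}
    WEBHOOK_URL: ${{ secrets.PROD_WATCHTOWER_WEBHOOK_URL }}  # assumed name
  run: |
    # "$TOKEN" and "$WEBHOOK_URL" are read by the shell at runtime,
    # unlike inline ${{ secrets.* }} expansion into the script body.
    curl -fsS -H "Authorization: Bearer $TOKEN" "$WEBHOOK_URL"
```

The key difference is that `${{ secrets.* }}` interpolation happens before the shell ever runs, splicing the raw secret text into the script, whereas the `env:` mapping hands the value over only as a variable the shell quotes safely.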

@aturret aturret merged commit 61f3dcc into main Feb 20, 2026
2 checks passed
@aturret aturret deleted the celery-update branch February 23, 2026 17:33