feat: add Async worker to decouple API and core logic#69
Conversation
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds an ARQ-based async worker service, refactors file-export logic into shared Celery-backed wrappers, and integrates queue mode into the Telegram bot with Redis-backed outbox and ARQ enqueueing; API app classes now delegate to shared implementations. Changes
Sequence DiagramssequenceDiagram
participant User as User (Telegram)
participant Bot as Telegram Bot
participant ARQ as ARQ Queue (Redis)
participant Worker as Async Worker
participant Celery as Celery Worker
participant Outbox as Outbox (Redis)
participant Consumer as Outbox Consumer
User->>Bot: Send URL
Bot->>ARQ: enqueue_scrape(url, chat_id)
ARQ-->>Bot: job_id
Bot-->>User: "Item queued for processing..."
Worker->>ARQ: dequeue scrape_and_enrich
Worker->>Celery: send_task(video_download)
Celery-->>Worker: video metadata / file_path
Worker->>Celery: send_task(pdf_export)
Celery-->>Worker: pdf_url
Worker->>Outbox: push(job_id, metadata_item)
Consumer->>Outbox: BRPOP outbox_queue
Outbox-->>Consumer: metadata_item payload
Consumer->>Bot: send_item_message(metadata_item, chat_id)
Bot-->>User: Deliver message to chat
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Codecov Report❌ Patch coverage is
@@ Coverage Diff @@
## main #69 +/- ##
=======================================
Coverage ? 71.74%
=======================================
Files ? 62
Lines ? 3072
Branches ? 0
=======================================
Hits ? 2204
Misses ? 868
Partials ? 0
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
CLAUDE.md (1)
127-128:⚠️ Potential issue | 🟡 MinorUpdate uv version documentation to include the new async-worker Dockerfile.
The documentation mentions updating uv version in three Dockerfiles, but the new
apps/async-worker/Dockerfilealso pins uv to0.10.4(line 23). This should be included to keep all services in sync.📝 Suggested documentation update
-> **uv version in Docker**: All three Dockerfiles pin uv to `0.10.4` via `COPY --from=ghcr.io/astral-sh/uv:0.10.4`. -> To upgrade, update that tag in `apps/api/Dockerfile`, `apps/telegram-bot/Dockerfile`, and `apps/worker/Dockerfile`. +> **uv version in Docker**: All four Dockerfiles pin uv to `0.10.4` via `COPY --from=ghcr.io/astral-sh/uv:0.10.4`. +> To upgrade, update that tag in `apps/api/Dockerfile`, `apps/telegram-bot/Dockerfile`, `apps/worker/Dockerfile`, and `apps/async-worker/Dockerfile`.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@CLAUDE.md` around lines 127 - 128, Update the documentation note about pinned uv versions to include the missing Dockerfile that also pins uv to 0.10.4: search for the COPY line "COPY --from=ghcr.io/astral-sh/uv:0.10.4" in the async-worker Dockerfile and add it to the list of files mentioned (so the doc references the async-worker along with the api, telegram-bot, and worker Dockerfiles); ensure the doc text explicitly instructs updating that tag in the async-worker Dockerfile when bumping uv.apps/telegram-bot/core/handlers/url_process.py (1)
100-120:⚠️ Potential issue | 🟡 MinorMissing
messageparameter may cause issues in API mode.When
SCRAPE_MODEis not"queue", the_fetch_and_sendcall at line 111 doesn't pass amessageobject, butsend_item_messageat line 79 expects it for reply context. This could cause the bot to lose the reply threading in API mode for this code path.Suggested fix
await _fetch_and_send( url=url_metadata["url"], chat_id=message.chat_id, + message=message, source=url_metadata.get("source", ""), content_type=url_metadata.get("content_type", ""), )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/telegram-bot/core/handlers/url_process.py` around lines 100 - 120, The code path calling _fetch_and_send in url_process.py doesn't pass the message object, which breaks reply threading because send_item_message expects message when SCRAPE_MODE != "queue"; update the _fetch_and_send(...) invocation inside the "unknown" source branch to include the original message (pass message=message) or conditionally forward message when SCRAPE_MODE != "queue", ensuring _fetch_and_send and downstream send_item_message receive the message for correct reply context.
🧹 Nitpick comments (9)
apps/async-worker/Dockerfile (2)
1-1: Minor: Remove leading blank line.The Dockerfile starts with a blank line before the comment, which is unconventional.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/async-worker/Dockerfile` at line 1, Remove the leading blank line at the top of the Dockerfile so the first line begins immediately with the existing comment/instruction; open the Dockerfile and delete the initial empty line before the first token to ensure the file starts with the intended content.
44-55: Consider adding a non-root user for improved security posture.The container runs as root by default. While this may not be actively exploitable, running as a non-root user is a security best practice that reduces the blast radius if the container is compromised. No other services in the project use non-root users currently.
🔒 Optional: Add non-root user
FROM python-base AS production ENV PYTHONPATH=/app/apps/async-worker:$PYTHONPATH RUN apt-get update \ && apt-get install --no-install-recommends -y \ ca-certificates \ libffi-dev \ - && rm -rf /var/lib/apt/lists/* + && rm -rf /var/lib/apt/lists/* \ + && useradd --create-home --shell /bin/bash appuser COPY --from=builder-base $PYSETUP_PATH $PYSETUP_PATH COPY packages/ /app/packages/ COPY apps/async-worker/ /app/apps/async-worker/ WORKDIR /app/apps/async-worker +USER appuser CMD ["arq", "async_worker.main.WorkerSettings"]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/async-worker/Dockerfile` around lines 44 - 55, Add a non-root user and switch the container to run as that user to improve security: in the Dockerfile create a group and user (e.g., appuser) with a fixed non-root UID/GID, set a HOME, chown the application files under WORKDIR (/app/apps/async-worker) and any copied paths ($PYSETUP_PATH, /app/packages/) to that user, and then add a USER appuser line before the CMD ["arq", "async_worker.main.WorkerSettings"]; ensure PYTHONPATH and WORKDIR remain correct and that file ownership is updated so the non-root user can run the process.tests/unit/file_export/test_pdf_export.py (2)
128-128: Unused variable from unpacking.The
kwargsvariable is unpacked but never used. Prefix with underscore to indicate intentional discard.♻️ Fix unused variable
- args, kwargs = mock_celery.send_task.call_args + args, _kwargs = mock_celery.send_task.call_args🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/file_export/test_pdf_export.py` at line 128, The test unpacks call_args into args and kwargs but never uses kwargs; change the unpack to discard the unused dict by renaming it with a leading underscore (e.g., args, _kwargs = mock_celery.send_task.call_args or args, _ = mock_celery.send_task.call_args) so intent is clear and linter warnings are avoided; update the unpack at the mock_celery.send_task.call_args usage accordingly.
3-3: Remove unused import.
asynciois imported but never used in this test file.♻️ Remove unused import
-import asyncio from unittest.mock import MagicMock, patch🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/file_export/test_pdf_export.py` at line 3, Remove the unused top-level import of asyncio from the test file; simply delete the import statement referencing asyncio so the module no longer contains an unused import (look for the import asyncio line at the top of tests' PDF export test).apps/telegram-bot/core/handlers/buttons.py (1)
67-67: Remove extraneous f-string prefixes.These strings have no placeholders, so the
fprefix is unnecessary.♻️ Fix f-strings
- text=f"Item queued for processing...", + text="Item queued for processing...",- text=f"Item processing...", + text="Item processing...",- text=f"Item processed. Sending to the target...", + text="Item processed. Sending to the target...",- text=f"Item sent to the channel.", + text="Item sent to the channel.",Also applies to: 79-79, 85-85, 92-92
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/telegram-bot/core/handlers/buttons.py` at line 67, Several message strings in apps/telegram-bot/core/handlers/buttons.py are using unnecessary f-string prefixes (e.g., text=f"Item queued for processing...") even though they contain no placeholders; remove the leading "f" from those string literals so they become plain strings. Locate the occurrences where the parameter name text is set (the calls using text=f"...") — specifically the instances around the current lines 67, 79, 85, and 92 — and change text=f"..." to text="..." for each occurrence (no other logic changes required).packages/shared/fastfetchbot_shared/services/scrapers/common.py (1)
72-78: Keep video scraper registration in the canonical registry.This introduces a third resolution path outside
service_classes/ScraperManager, which makes video support harder to discover and extend consistently. Prefer wiring the lazy video loader through the existing registry mechanism instead of special-casing categories here.Based on learnings
Register new platform scrapers in InfoExtractService.service_classes (in packages/shared/fastfetchbot_shared/services/scrapers/common.py) or ScraperManager (for scrapers needing lazy initialization).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/shared/fastfetchbot_shared/services/scrapers/common.py` around lines 72 - 78, The _resolve_scraper_class method currently special-cases "youtube" and "bilibili" by calling _get_video_downloader, which bypasses the canonical registry (service_classes/ScraperManager); remove that special-case and make _resolve_scraper_class only consult self.service_classes and raise KeyError if missing. Instead, register the lazy video downloader via the existing registry: add entries for "youtube" and "bilibili" to InfoExtractService.service_classes (or register them with ScraperManager if lazy init is required) mapping to the _get_video_downloader factory/lazy wrapper so the registry, not _resolve_scraper_class, controls video scraper resolution.tests/unit/telegram_bot/test_outbox_consumer.py (1)
36-53: Covermessage_idforwarding in the success-path test.The outbox payload shape includes
message_id, but this test only proveschat_idreachessend_item_message. Without a regression here, queue mode can silently stop replying in-thread while the suite still passes.Also applies to: 92-126
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/telegram_bot/test_outbox_consumer.py` around lines 36 - 53, The success-path test in tests/unit/telegram_bot/test_outbox_consumer.py doesn't assert that message_id from the outbox payload is forwarded, only chat_id; update the test(s) that use _make_payload (and the helper itself if needed) to include a non-None message_id and add an assertion that send_item_message (the mocked function) is called with that message_id (or that the consumer passed it through to the API call), ensuring the payload's "message_id" field produced by _make_payload is included in the forwarded arguments in the success path.packages/shared/fastfetchbot_shared/services/file_export/video_download.py (1)
163-170: Use explicit| Nonetype hints for optional parameters.PEP 484 prohibits implicit
Optional. Parameters with defaultNoneshould use explicit union syntax.Suggested fix
async def get_video_info( self, - url: str = None, - download: bool = None, - extractor: str = None, - audio_only: bool = None, - hd: bool = None, + url: str | None = None, + download: bool | None = None, + extractor: str | None = None, + audio_only: bool | None = None, + hd: bool | None = None, ) -> dict:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/shared/fastfetchbot_shared/services/file_export/video_download.py` around lines 163 - 170, The function get_video_info currently uses default None for optional parameters without explicit union type hints; update its signature to use explicit "X | None" annotations for each optional parameter (url: str | None, download: bool | None, extractor: str | None, audio_only: bool | None, hd: bool | None) so the type hints are PEP 484/PEP 604 compliant while keeping the same default values and behavior.apps/async-worker/async_worker/services/enrichment.py (1)
40-42: Simplify redundant condition.The condition
store_document or (not store_document and metadata_item.get("telegraph_url") == "")can be simplified. The second branch only triggers whenstore_documentisFalse, sonot store_documentis always true there.Suggested simplification
- if store_document or ( - not store_document and metadata_item.get("telegraph_url") == "" - ): + if store_document or metadata_item.get("telegraph_url") == "":🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/async-worker/async_worker/services/enrichment.py` around lines 40 - 42, The if-condition in enrichment.py is redundant: replace the current condition `if store_document or (not store_document and metadata_item.get("telegraph_url") == "")` with the simplified equivalent `if store_document or metadata_item.get("telegraph_url") == ""` so the branch triggers when either storing is enabled or the telegraph_url is empty; update the conditional used in the relevant function where `metadata_item` and `store_document` are evaluated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/api/src/services/file_export/document_export/pdf_export.py`:
- Line 25: The constructor signature in PDF export uses a nullable parameter
annotated as `html_string: str = None`, which triggers RUF013; change the
parameter annotation in the __init__ method of the class (the PDF export class
in pdf_export.py) to use PEP 604 union syntax: `html_string: str | None = None`
so the type is explicit and conforms to Python 3.12/Black formatting.
In `@apps/async-worker/async_worker/main.py`:
- Around line 22-32: The parse_redis_url function currently converts
parsed.path.lstrip("/") to int directly which will raise a ValueError for
non-numeric DB paths (e.g., redis://host/abc); update parse_redis_url to
validate or guard that value before constructing RedisSettings: extract db_str =
parsed.path.lstrip("/") or "0", attempt to convert to int inside a try/except,
and on ValueError either default to 0 or raise a clearer error with context
(include the original url/db_str in the message); ensure the change is applied
in parse_redis_url and the returned RedisSettings construction uses the
validated integer.
In `@apps/async-worker/async_worker/services/enrichment.py`:
- Around line 62-65: The code unconditionally accesses metadata_item["title"]
which can raise KeyError or TypeError; update the logic around metadata_item (in
enrichment.py where metadata_item["title"] is set) to safely handle missing or
non-string titles by checking "title" in metadata_item or using
metadata_item.get("title", "") and only calling .strip() when the value is a
string (fallback to an empty string or None as appropriate), ensuring no
exception is raised if the key is absent or value is not a string.
In `@apps/async-worker/async_worker/tasks/scrape.py`:
- Line 41: Replace the direct logging of the untrusted `url` in the logger.info
call with a sanitized version: parse the `url` (e.g., using
urllib.parse.urlparse), drop or mask the query and fragment components (and any
bearer/token-like values), then log `job_id`, `source`, and the sanitized URL;
update the logger.info at the site where `logger.info(f"[{job_id}] Starting
scrape: url={url}, source={source}")` is called so it uses the sanitized
variable instead of the raw `url`.
- Around line 43-75: The current try block mixes scraping/enrichment with
delivery so failures in outbox.push() can mark the whole job as a scrape
failure; split delivery into its own error handling: keep the
scraping/enrichment sequence (UrlMetadata, InfoExtractService.get_item,
enrichment.enrich) in the original try and return success once metadata_item is
produced, then perform outbox.push(job_id, chat_id, message_id, metadata_item)
in a separate try/except that logs delivery failures and attempts the fallback
outbox.push(error=...) without changing the returned scrape status; apply the
same separation to the other delivery call around the block that spans the code
referenced at lines 77-89 so outbox.push failures never override a successful
scrape (refer to outbox.push, service.get_item, enrichment.enrich, and job_id to
locate the spots).
In `@apps/telegram-bot/core/handlers/buttons.py`:
- Around line 66-76: Wrap the enqueue call in a try/except around
queue_client.enqueue_scrape in the handler so failures are caught and the user
is notified instead of silently deleting the "Item queued for processing..."
message; use the existing replying_message (created via
query.message.reply_text) and, on exception, update or replace it with an error
message that includes brief context, log the exception, and only delete the
success message when enqueue_scrape completes successfully. Ensure you catch
broad exceptions but log the stack/exception for debugging.
In `@apps/telegram-bot/core/services/outbox_consumer.py`:
- Around line 61-71: The _send_error_to_chat function lacks a guard for a None
chat_id; update _send_error_to_chat to check if chat_id is None (or empty) and
return early (or log a warning) instead of calling application.bot.send_message
with None, ensuring you still catch and log other exceptions; reference the
_send_error_to_chat function and the caller that extracts chat_id via .get() so
reviewers can locate the calling site and validate the guard.
In `@packages/shared/fastfetchbot_shared/services/file_export/pdf_export.py`:
- Around line 40-48: The filename is built from raw self.title (output_filename
= f"{self.title}-{uuid.uuid4()}.pdf") which can include path separators or
unsafe characters; sanitize self.title before constructing output_filename by
stripping or replacing path separators (/, \), control characters and other
disallowed filename chars, collapsing whitespace (e.g., to underscores), and
truncating to a safe length, then use the sanitized value in output_filename;
update the code around output_filename creation in pdf_export.py (the place that
references self.title and output_filename) to perform this sanitization before
sending the Celery task.
In `@packages/shared/fastfetchbot_shared/services/file_export/video_download.py`:
- Around line 70-78: get_video currently indexes video_info_funcs with
self.extractor which can KeyError for unsupported values; update get_video to
validate self.extractor against the allowed keys (e.g., "youtube", "bilibili")
before lookup, and either provide a clear fallback handler or raise a
descriptive error. Specifically, check self.extractor exists in
video_info_funcs, select the parser via video_info_funcs[self.extractor] only
after validation, and keep calling the chosen parser (_youtube_info_parse or
_bilibili_info_parse) and then _video_info_formatting; ensure the failure path
returns or raises a meaningful exception instead of allowing a raw KeyError.
- Around line 96-104: The nested helper _get_redirected_url inside _parse_url
returns mixed types (httpx.URL vs str); change it to always return a string by
converting resp.url to str (e.g., str(resp.url)) when status is 200 and use
resp.headers.get("Location") (or cast that to str) for 302, and ensure the
function returns a string in all branches so callers of _get_redirected_url and
_parse_url receive a consistent type.
In `@packages/shared/fastfetchbot_shared/services/scrapers/common.py`:
- Around line 88-90: The instantiation of scraper classes (scraper_cls =
self._resolve_scraper_class(...); scraper_item = scraper_cls(url=self.url,
category=self.category, data=self.data, **self.kwargs)) passes data and **kwargs
to all scrapers, which breaks dataclass-based scrapers like Weibo and Bluesky
(subclasses of MetadataItem) that only accept declared fields; change the
instantiation logic in the calling site to detect whether scraper_cls accepts
extra kwargs (e.g., use inspect.signature to see if **kwargs or a data parameter
is present) or whether scraper_cls is a subclass of MetadataItem, and if so call
it with only the allowed fields (url, category and other declared MetadataItem
fields) instead of passing data/**kwargs; alternatively, attempt instantiation
and on TypeError retry by creating the object with the minimal allowed
parameters for MetadataItem (url, category, telegraph_url, content, text,
media_files, author, title, author_url, message_type).
In `@pyproject.toml`:
- Line 47: The dependency entry "arq>=0.27.0" in pyproject.toml lacks an
explicit upper bound; update that entry to follow the project's dependency
policy by pinning an upper bound (e.g., change "arq>=0.27.0" to
"arq>=0.27.0,<0.28.0") so it matches the other runtime dependencies and prevents
unintended upgrades.
In `@template.env`:
- Around line 167-176: The SCRAPE_MODE default is inconsistent: template.env
sets SCRAPE_MODE=api but the application default in
apps/telegram-bot/core/config.py falls back to "queue" (e.g., where SCRAPE_MODE
is loaded via get_env or Config.SCRAPE_MODE with default "queue"); pick a single
safe default (recommend "api") and change the fallback in config.py to "api" (or
change template.env to "queue" if you prefer the other behavior), and update any
associated comments/docs so both the env template and the code use the same
SCRAPE_MODE default.
In `@tests/unit/async_worker/test_enrichment.py`:
- Around line 187-197: In test_uses_config_defaults_when_none move the
mock_telegraph unpacking before calling enrich so the fixture/patch is active
during the call and drop the unused instance variable; e.g. unpack
mock_telegraph into (MockTg, _) or (MockTg, instance) at the top of the test,
then call await enrich(base_metadata_item) and assert
MockTg.from_dict.assert_called_once(); ensure you still patch
STORE_TELEGRAPH/STORE_DOCUMENT as currently done.
In `@tests/unit/file_export/test_pdf_export.py`:
- Around line 110-116: The test currently asserts mock_celery.send_task against
values extracted from the same call (mock_celery.send_task.call_args), which is
circular and always passes; update the assertion in the test_pdf_export test to
compare the send_task call to concrete expected values (or to variables computed
earlier in the test) instead of pulling values from call_args (i.e., assert
mock_celery.send_task.assert_called_once_with("file_export.pdf_export",
kwargs={"html_string": expected_html_string, "output_filename":
expected_output_filename})), referencing the mock_celery.send_task call and the
expected HTML/output filename variables used in the test.
---
Outside diff comments:
In `@apps/telegram-bot/core/handlers/url_process.py`:
- Around line 100-120: The code path calling _fetch_and_send in url_process.py
doesn't pass the message object, which breaks reply threading because
send_item_message expects message when SCRAPE_MODE != "queue"; update the
_fetch_and_send(...) invocation inside the "unknown" source branch to include
the original message (pass message=message) or conditionally forward message
when SCRAPE_MODE != "queue", ensuring _fetch_and_send and downstream
send_item_message receive the message for correct reply context.
In `@CLAUDE.md`:
- Around line 127-128: Update the documentation note about pinned uv versions to
include the missing Dockerfile that also pins uv to 0.10.4: search for the COPY
line "COPY --from=ghcr.io/astral-sh/uv:0.10.4" in the async-worker Dockerfile
and add it to the list of files mentioned (so the doc references the
async-worker along with the api, telegram-bot, and worker Dockerfiles); ensure
the doc text explicitly instructs updating that tag in the async-worker
Dockerfile when bumping uv.
---
Nitpick comments:
In `@apps/async-worker/async_worker/services/enrichment.py`:
- Around line 40-42: The if-condition in enrichment.py is redundant: replace the
current condition `if store_document or (not store_document and
metadata_item.get("telegraph_url") == "")` with the simplified equivalent `if
store_document or metadata_item.get("telegraph_url") == ""` so the branch
triggers when either storing is enabled or the telegraph_url is empty; update
the conditional used in the relevant function where `metadata_item` and
`store_document` are evaluated.
In `@apps/async-worker/Dockerfile`:
- Line 1: Remove the leading blank line at the top of the Dockerfile so the
first line begins immediately with the existing comment/instruction; open the
Dockerfile and delete the initial empty line before the first token to ensure
the file starts with the intended content.
- Around line 44-55: Add a non-root user and switch the container to run as that
user to improve security: in the Dockerfile create a group and user (e.g.,
appuser) with a fixed non-root UID/GID, set a HOME, chown the application files
under WORKDIR (/app/apps/async-worker) and any copied paths ($PYSETUP_PATH,
/app/packages/) to that user, and then add a USER appuser line before the CMD
["arq", "async_worker.main.WorkerSettings"]; ensure PYTHONPATH and WORKDIR
remain correct and that file ownership is updated so the non-root user can run
the process.
In `@apps/telegram-bot/core/handlers/buttons.py`:
- Line 67: Several message strings in apps/telegram-bot/core/handlers/buttons.py
are using unnecessary f-string prefixes (e.g., text=f"Item queued for
processing...") even though they contain no placeholders; remove the leading "f"
from those string literals so they become plain strings. Locate the occurrences
where the parameter name text is set (the calls using text=f"...") —
specifically the instances around the current lines 67, 79, 85, and 92 — and
change text=f"..." to text="..." for each occurrence (no other logic changes
required).
In `@packages/shared/fastfetchbot_shared/services/file_export/video_download.py`:
- Around line 163-170: The function get_video_info currently uses default None
for optional parameters without explicit union type hints; update its signature
to use explicit "X | None" annotations for each optional parameter (url: str |
None, download: bool | None, extractor: str | None, audio_only: bool | None, hd:
bool | None) so the type hints are PEP 484/PEP 604 compliant while keeping the
same default values and behavior.
In `@packages/shared/fastfetchbot_shared/services/scrapers/common.py`:
- Around line 72-78: The _resolve_scraper_class method currently special-cases
"youtube" and "bilibili" by calling _get_video_downloader, which bypasses the
canonical registry (service_classes/ScraperManager); remove that special-case
and make _resolve_scraper_class only consult self.service_classes and raise
KeyError if missing. Instead, register the lazy video downloader via the
existing registry: add entries for "youtube" and "bilibili" to
InfoExtractService.service_classes (or register them with ScraperManager if lazy
init is required) mapping to the _get_video_downloader factory/lazy wrapper so
the registry, not _resolve_scraper_class, controls video scraper resolution.
In `@tests/unit/file_export/test_pdf_export.py`:
- Line 128: The test unpacks call_args into args and kwargs but never uses
kwargs; change the unpack to discard the unused dict by renaming it with a
leading underscore (e.g., args, _kwargs = mock_celery.send_task.call_args or
args, _ = mock_celery.send_task.call_args) so intent is clear and linter
warnings are avoided; update the unpack at the mock_celery.send_task.call_args
usage accordingly.
- Line 3: Remove the unused top-level import of asyncio from the test file;
simply delete the import statement referencing asyncio so the module no longer
contains an unused import (look for the import asyncio line at the top of tests'
PDF export test).
In `@tests/unit/telegram_bot/test_outbox_consumer.py`:
- Around line 36-53: The success-path test in
tests/unit/telegram_bot/test_outbox_consumer.py doesn't assert that message_id
from the outbox payload is forwarded, only chat_id; update the test(s) that use
_make_payload (and the helper itself if needed) to include a non-None message_id
and add an assertion that send_item_message (the mocked function) is called with
that message_id (or that the consumer passed it through to the API call),
ensuring the payload's "message_id" field produced by _make_payload is included
in the forwarded arguments in the success path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c1a7bdb6-3c96-4a8e-a338-3df323e0c7c6
⛔ Files ignored due to path filters (2)
apps/worker/celerybeat-schedule.dbis excluded by!**/*.dbuv.lockis excluded by!**/*.lock
📒 Files selected for processing (47)
.idea/FastFetchBot.iml.idea/runConfigurations/fullstack_polling_api.xmlCLAUDE.mdapps/api/src/services/file_export/audio_transcribe/__init__.pyapps/api/src/services/file_export/document_export/pdf_export.pyapps/api/src/services/file_export/video_download/__init__.pyapps/async-worker/Dockerfileapps/async-worker/async_worker/__init__.pyapps/async-worker/async_worker/celery_client.pyapps/async-worker/async_worker/config.pyapps/async-worker/async_worker/main.pyapps/async-worker/async_worker/services/__init__.pyapps/async-worker/async_worker/services/enrichment.pyapps/async-worker/async_worker/services/outbox.pyapps/async-worker/async_worker/tasks/__init__.pyapps/async-worker/async_worker/tasks/scrape.pyapps/async-worker/pyproject.tomlapps/telegram-bot/core/config.pyapps/telegram-bot/core/handlers/buttons.pyapps/telegram-bot/core/handlers/url_process.pyapps/telegram-bot/core/queue_client.pyapps/telegram-bot/core/services/bot_app.pyapps/telegram-bot/core/services/outbox_consumer.pyapps/telegram-bot/pyproject.tomldocker-compose.template.ymlpackages/shared/fastfetchbot_shared/services/file_export/__init__.pypackages/shared/fastfetchbot_shared/services/file_export/audio_transcribe.pypackages/shared/fastfetchbot_shared/services/file_export/pdf_export.pypackages/shared/fastfetchbot_shared/services/file_export/video_download.pypackages/shared/fastfetchbot_shared/services/scrapers/common.pypyproject.tomltemplate.envtests/unit/async_worker/__init__.pytests/unit/async_worker/conftest.pytests/unit/async_worker/test_enrichment.pytests/unit/async_worker/test_main.pytests/unit/async_worker/test_outbox.pytests/unit/async_worker/test_scrape_task.pytests/unit/file_export/__init__.pytests/unit/file_export/test_audio_transcribe.pytests/unit/file_export/test_pdf_export.pytests/unit/file_export/test_video_download.pytests/unit/scrapers/test_common.pytests/unit/telegram_bot/__init__.pytests/unit/telegram_bot/conftest.pytests/unit/telegram_bot/test_outbox_consumer.pytests/unit/telegram_bot/test_queue_client.py
| class PdfExport(BasePdfExport): | ||
| """API PDF export that adds optional S3 upload after Celery PDF generation.""" | ||
|
|
||
| def __init__(self, title: str, html_string: str = None): |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if file exists and examine the context around line 25
if [ -f "apps/api/src/services/file_export/document_export/pdf_export.py" ]; then
echo "=== File found. Examining lines 20-30 ==="
cat -n apps/api/src/services/file_export/document_export/pdf_export.py | sed -n '20,30p'
else
echo "File not found at expected path"
find . -name "pdf_export.py" -type f
fiRepository: aturret/FastFetchBot
Length of output: 528
🏁 Script executed:
#!/bin/bash
# Check Python version requirement to understand if str | None is supported
if [ -f "setup.py" ] || [ -f "pyproject.toml" ] || [ -f ".python-version" ]; then
echo "=== Checking Python version requirements ==="
[ -f "setup.py" ] && head -30 setup.py | grep -i python
[ -f "pyproject.toml" ] && cat pyproject.toml | grep -A 5 "python"
[ -f ".python-version" ] && cat .python-version
else
echo "Checking for version hints in common config files"
ls -la | grep -E "(setup|requirements|pyproject|\.python)"
fiRepository: aturret/FastFetchBot
Length of output: 851
🏁 Script executed:
#!/bin/bash
# Check imports and overall structure of the file
if [ -f "apps/api/src/services/file_export/document_export/pdf_export.py" ]; then
echo "=== First 50 lines of file ==="
head -50 apps/api/src/services/file_export/document_export/pdf_export.py
fiRepository: aturret/FastFetchBot
Length of output: 1420
🏁 Script executed:
#!/bin/bash
# Check the parent class BasePdfExport to ensure signature consistency
find . -name "pdf_export.py" -type f | xargs grep -l "class.*BasePdfExport\|class PdfExport" | grep -v "__pycache__"Repository: aturret/FastFetchBot
Length of output: 202
🏁 Script executed:
#!/bin/bash
# Search for other potential implicit Optional issues in this file
if [ -f "apps/api/src/services/file_export/document_export/pdf_export.py" ]; then
echo "=== Full file content ==="
cat apps/api/src/services/file_export/document_export/pdf_export.py
fiRepository: aturret/FastFetchBot
Length of output: 1415
🏁 Script executed:
#!/bin/bash
# Check the parent class BasePdfExport signature
if [ -f "packages/shared/fastfetchbot_shared/services/file_export/pdf_export.py" ]; then
echo "=== BasePdfExport class definition ==="
cat packages/shared/fastfetchbot_shared/services/file_export/pdf_export.py
fiRepository: aturret/FastFetchBot
Length of output: 2745
Use explicit Optional type annotation for html_string.
Line 25 has html_string: str = None which triggers Ruff RUF013. Since the project targets Python 3.12, use str | None instead to follow PEP 604 union syntax and satisfy black formatting standards.
✏️ Suggested change
- def __init__(self, title: str, html_string: str = None):
+ def __init__(self, title: str, html_string: str | None = None):📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def __init__(self, title: str, html_string: str = None): | |
| def __init__(self, title: str, html_string: str | None = None): |
🧰 Tools
🪛 Ruff (0.15.6)
[warning] 25-25: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/api/src/services/file_export/document_export/pdf_export.py` at line 25,
The constructor signature in PDF export uses a nullable parameter annotated as
`html_string: str = None`, which triggers RUF013; change the parameter
annotation in the __init__ method of the class (the PDF export class in
pdf_export.py) to use PEP 604 union syntax: `html_string: str | None = None` so
the type is explicit and conforms to Python 3.12/Black formatting.
| def parse_redis_url(url: str) -> RedisSettings: | ||
| """Parse a redis:// URL into ARQ RedisSettings.""" | ||
| from urllib.parse import urlparse | ||
|
|
||
| parsed = urlparse(url) | ||
| return RedisSettings( | ||
| host=parsed.hostname or "localhost", | ||
| port=parsed.port or 6379, | ||
| database=int(parsed.path.lstrip("/") or 0), | ||
| password=parsed.password, | ||
| ) |
There was a problem hiding this comment.
Potential ValueError on malformed database path.
If the Redis URL contains a non-numeric database path (e.g., redis://host/abc), int(parsed.path.lstrip("/") or 0) will raise a ValueError. Consider adding error handling or validation.
Suggested defensive fix
def parse_redis_url(url: str) -> RedisSettings:
"""Parse a redis:// URL into ARQ RedisSettings."""
from urllib.parse import urlparse
parsed = urlparse(url)
+ db_str = parsed.path.lstrip("/")
+ try:
+ database = int(db_str) if db_str else 0
+ except ValueError:
+ database = 0
return RedisSettings(
host=parsed.hostname or "localhost",
port=parsed.port or 6379,
- database=int(parsed.path.lstrip("/") or 0),
+ database=database,
password=parsed.password,
)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/async-worker/async_worker/main.py` around lines 22 - 32, The
parse_redis_url function currently converts parsed.path.lstrip("/") to int
directly which will raise a ValueError for non-numeric DB paths (e.g.,
redis://host/abc); update parse_redis_url to validate or guard that value before
constructing RedisSettings: extract db_str = parsed.path.lstrip("/") or "0",
attempt to convert to int inside a try/except, and on ValueError either default
to 0 or raise a clearer error with context (include the original url/db_str in
the message); ensure the change is applied in parse_redis_url and the returned
RedisSettings construction uses the validated integer.
| except Exception as e: | ||
| logger.error(f"Error exporting PDF: {e}") | ||
|
|
||
| metadata_item["title"] = metadata_item["title"].strip() |
There was a problem hiding this comment.
Guard against missing title key.
Line 65 accesses metadata_item["title"] directly, which will raise KeyError if the key is missing. Consider using .get() with a default or checking for presence first.
Suggested fix
- metadata_item["title"] = metadata_item["title"].strip()
+ if "title" in metadata_item and metadata_item["title"]:
+ metadata_item["title"] = metadata_item["title"].strip()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| except Exception as e: | |
| logger.error(f"Error exporting PDF: {e}") | |
| metadata_item["title"] = metadata_item["title"].strip() | |
| except Exception as e: | |
| logger.error(f"Error exporting PDF: {e}") | |
| if "title" in metadata_item and metadata_item["title"]: | |
| metadata_item["title"] = metadata_item["title"].strip() |
🧰 Tools
🪛 Ruff (0.15.6)
[warning] 62-62: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/async-worker/async_worker/services/enrichment.py` around lines 62 - 65,
The code unconditionally accesses metadata_item["title"] which can raise
KeyError or TypeError; update the logic around metadata_item (in enrichment.py
where metadata_item["title"] is set) to safely handle missing or non-string
titles by checking "title" in metadata_item or using metadata_item.get("title",
"") and only calling .strip() when the value is a string (fallback to an empty
string or None as appropriate), ensuring no exception is raised if the key is
absent or value is not a string.
| if job_id is None: | ||
| job_id = str(uuid.uuid4()) | ||
|
|
||
| logger.info(f"[{job_id}] Starting scrape: url={url}, source={source}") |
There was a problem hiding this comment.
Redact the user-submitted URL before logging it.
url is untrusted input, so this info log can persist signed query params or bearer-style tokens from private links. Log job_id/source plus a sanitized URL instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/async-worker/async_worker/tasks/scrape.py` at line 41, Replace the
direct logging of the untrusted `url` in the logger.info call with a sanitized
version: parse the `url` (e.g., using urllib.parse.urlparse), drop or mask the
query and fragment components (and any bearer/token-like values), then log
`job_id`, `source`, and the sanitized URL; update the logger.info at the site
where `logger.info(f"[{job_id}] Starting scrape: url={url}, source={source}")`
is called so it uses the sanitized variable instead of the raw `url`.
| try: | ||
| # Build UrlMetadata and scrape | ||
| url_metadata = UrlMetadata( | ||
| url=url, source=source, content_type=content_type | ||
| ) | ||
| service = InfoExtractService( | ||
| url_metadata=url_metadata, | ||
| store_telegraph=False, # We handle enrichment separately | ||
| store_document=False, | ||
| celery_app=celery_app, | ||
| timeout=DOWNLOAD_VIDEO_TIMEOUT, | ||
| **kwargs, | ||
| ) | ||
| metadata_item = await service.get_item() | ||
|
|
||
| # Enrich: Telegraph, PDF | ||
| metadata_item = await enrichment.enrich( | ||
| metadata_item, | ||
| store_telegraph=store_telegraph, | ||
| store_document=store_document, | ||
| ) | ||
|
|
||
| logger.info(f"[{job_id}] Scrape completed successfully") | ||
|
|
||
| # Push to outbox | ||
| await outbox.push( | ||
| job_id=job_id, | ||
| chat_id=chat_id, | ||
| message_id=message_id, | ||
| metadata_item=metadata_item, | ||
| ) | ||
|
|
||
| return {"job_id": job_id, "status": "success"} |
There was a problem hiding this comment.
Do not treat outbox delivery failures as scrape failures.
outbox.push() is inside the same try as scraping and enrichment. If Redis flakes after a successful scrape, this path reports the job as a scrape error and the fallback outbox.push(error=...) can mask the original result entirely. Split delivery into its own error handling so completed work is not misclassified. Worth adding a unit test that makes outbox.push() fail after get_item() succeeds.
Also applies to: 77-89
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/async-worker/async_worker/tasks/scrape.py` around lines 43 - 75, The
current try block mixes scraping/enrichment with delivery so failures in
outbox.push() can mark the whole job as a scrape failure; split delivery into
its own error handling: keep the scraping/enrichment sequence (UrlMetadata,
InfoExtractService.get_item, enrichment.enrich) in the original try and return
success once metadata_item is produced, then perform outbox.push(job_id,
chat_id, message_id, metadata_item) in a separate try/except that logs delivery
failures and attempts the fallback outbox.push(error=...) without changing the
returned scrape status; apply the same separation to the other delivery call
around the block that spans the code referenced at lines 77-89 so outbox.push
failures never override a successful scrape (refer to outbox.push,
service.get_item, enrichment.enrich, and job_id to locate the spots).
| scraper_cls = self._resolve_scraper_class(self.category) | ||
| scraper_item = scraper_cls( | ||
| url=self.url, category=self.category, data=self.data, **self.kwargs |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
python - <<'PY'
import ast
from pathlib import Path
targets = {
"twitter.py": "Twitter",
"threads.py": "Threads",
"reddit.py": "Reddit",
"weibo.py": "Weibo",
"wechat.py": "Wechat",
"instagram.py": "Instagram",
"douban.py": "Douban",
"zhihu.py": "Zhihu",
"xiaohongshu.py": "Xiaohongshu",
}
root = Path("packages/shared/fastfetchbot_shared/services/scrapers")
problems = []
for filename, class_name in targets.items():
path = root / filename
if not path.exists():
problems.append(f"{filename}: file not found")
continue
tree = ast.parse(path.read_text())
cls = next(
(node for node in tree.body if isinstance(node, ast.ClassDef) and node.name == class_name),
None,
)
if cls is None:
problems.append(f"{path}: class {class_name} not found")
continue
init = next(
(node for node in cls.body if isinstance(node, ast.FunctionDef) and node.name == "__init__"),
None,
)
if init is None:
print(f"{path}: {class_name} inherits __init__; verify its base class accepts category=...")
continue
args = [arg.arg for arg in init.args.args]
has_category = "category" in args
has_kwargs = init.args.kwarg is not None
print(f"{path}: {class_name}.__init__ args={args} kwargs={has_kwargs}")
if not (has_category or has_kwargs):
problems.append(f"{path}: {class_name}.__init__ accepts neither category nor **kwargs")
if problems:
print("\nPotential incompatibilities:")
for problem in problems:
print(f" - {problem}")
PYRepository: aturret/FastFetchBot
Length of output: 362
🏁 Script executed:
#!/bin/bash
# Find the actual scraper files in the repository
find packages/shared/fastfetchbot_shared/services/scrapers -type f -name "*.py" | head -20Repository: aturret/FastFetchBot
Length of output: 1510
🏁 Script executed:
#!/bin/bash
# Also check the structure of the scrapers directory
ls -la packages/shared/fastfetchbot_shared/services/scrapers/ 2>/dev/null || echo "Directory not found at expected path"Repository: aturret/FastFetchBot
Length of output: 1186
🏁 Script executed:
#!/bin/bash
# Search for scraper class definitions in the entire codebase
rg "class.*\(.*Scraper" --type py -A 3 packages/shared/fastfetchbot_shared/services/scrapers/Repository: aturret/FastFetchBot
Length of output: 2599
🏁 Script executed:
#!/bin/bash
# Read the base Scraper class
cat -n packages/shared/fastfetchbot_shared/services/scrapers/scraper.pyRepository: aturret/FastFetchBot
Length of output: 506
🏁 Script executed:
#!/bin/bash
# Check the call site in common.py around lines 88-90
cat -n packages/shared/fastfetchbot_shared/services/scrapers/common.py | sed -n '80,100p'Repository: aturret/FastFetchBot
Length of output: 1295
🏁 Script executed:
#!/bin/bash
# Extract all scraper __init__ signatures using ast-grep
ast-grep --pattern 'class $_ {
$$$
def __init__($$$) {
$$$
}
$$$
}'Repository: aturret/FastFetchBot
Length of output: 46
🏁 Script executed:
#!/bin/bash
# Use Python to inspect all scraper __init__ signatures
python3 << 'EOF'
import ast
from pathlib import Path
scrapers_dir = Path("packages/shared/fastfetchbot_shared/services/scrapers")
scrapers = []
# Check subdirectories for scraper.py files
for subdir in scrapers_dir.iterdir():
if subdir.is_dir() and subdir.name not in ['__pycache__', 'templates']:
scraper_file = subdir / 'scraper.py'
if scraper_file.exists():
try:
tree = ast.parse(scraper_file.read_text())
for node in tree.body:
if isinstance(node, ast.ClassDef):
init_method = next(
(m for m in node.body if isinstance(m, ast.FunctionDef) and m.name == '__init__'),
None
)
args = [arg.arg for arg in node.args.args] if init_method else None
if init_method:
args = [arg.arg for arg in init_method.args.args]
has_kwargs = init_method.args.kwarg is not None
scrapers.append({
'file': str(scraper_file),
'class': node.name,
'args': args,
'has_kwargs': has_kwargs,
'has_category': 'category' in args
})
else:
scrapers.append({
'file': str(scraper_file),
'class': node.name,
'args': 'inherits __init__',
'has_kwargs': None,
'has_category': None
})
except Exception as e:
print(f"Error parsing {scraper_file}: {e}")
# Print results
for s in scrapers:
print(f"{s['file']}: {s['class']}")
if isinstance(s['args'], list):
print(f" args: {s['args']}")
print(f" has_category: {s['has_category']}, has_kwargs: {s['has_kwargs']}")
else:
print(f" {s['args']}")
print()
EOFRepository: aturret/FastFetchBot
Length of output: 451
🏁 Script executed:
#!/bin/bash
# Corrected Python script to inspect all scraper __init__ signatures
python3 << 'EOF'
import ast
from pathlib import Path
scrapers_dir = Path("packages/shared/fastfetchbot_shared/services/scrapers")
scrapers = []
# Check subdirectories for scraper.py files
for subdir in sorted(scrapers_dir.iterdir()):
if subdir.is_dir() and subdir.name not in ['__pycache__', 'templates', 'general']:
scraper_file = subdir / 'scraper.py'
if scraper_file.exists():
try:
tree = ast.parse(scraper_file.read_text())
for node in tree.body:
if isinstance(node, ast.ClassDef):
init_method = next(
(m for m in node.body if isinstance(m, ast.FunctionDef) and m.name == '__init__'),
None
)
if init_method:
args = [arg.arg for arg in init_method.args.args]
has_kwargs = init_method.args.kwarg is not None
has_category = 'category' in args
print(f"{subdir.name}/{node.name}")
print(f" args: {args}")
print(f" has_category: {has_category}, has_kwargs: {has_kwargs}")
if not (has_category or has_kwargs):
print(f" ⚠️ INCOMPATIBLE: neither category nor **kwargs")
else:
print(f"{subdir.name}/{node.name}")
print(f" inherits __init__ (check base class)")
print()
except Exception as e:
print(f"Error parsing {scraper_file}: {e}\n")
# Also check general scrapers
general_dir = scrapers_dir / 'general'
for py_file in sorted(general_dir.glob('*.py')):
if py_file.name in ['scraper.py', 'base.py', 'firecrawl.py', 'zyte.py']:
try:
tree = ast.parse(py_file.read_text())
for node in tree.body:
if isinstance(node, ast.ClassDef):
init_method = next(
(m for m in node.body if isinstance(m, ast.FunctionDef) and m.name == '__init__'),
None
)
if init_method:
args = [arg.arg for arg in init_method.args.args]
has_kwargs = init_method.args.kwarg is not None
has_category = 'category' in args
print(f"general/{py_file.name}: {node.name}")
print(f" args: {args}")
print(f" has_category: {has_category}, has_kwargs: {has_kwargs}")
if not (has_category or has_kwargs):
print(f" ⚠️ INCOMPATIBLE: neither category nor **kwargs")
else:
print(f"general/{py_file.name}: {node.name}")
print(f" inherits __init__ (check base class)")
print()
except Exception as e:
print(f"Error parsing {py_file}: {e}\n")
EOFRepository: aturret/FastFetchBot
Length of output: 1643
🏁 Script executed:
#!/bin/bash
# Check what base classes the scrapers inherit from
rg "class.*Scraper\(" packages/shared/fastfetchbot_shared/services/scrapers/ --type py -A 1 | head -60Repository: aturret/FastFetchBot
Length of output: 1248
🏁 Script executed:
#!/bin/bash
# Let's look at a few specific scrapers to understand their structure
for scraper in bluesky twitter threads reddit; do
echo "=== $scraper ===="
if [ -f "packages/shared/fastfetchbot_shared/services/scrapers/$scraper/scraper.py" ]; then
head -30 "packages/shared/fastfetchbot_shared/services/scrapers/$scraper/scraper.py"
fi
echo ""
doneRepository: aturret/FastFetchBot
Length of output: 1529
🏁 Script executed:
#!/bin/bash
# Check all remaining scrapers' __init__ signatures
python3 << 'EOF'
import ast
from pathlib import Path
scrapers_dir = Path("packages/shared/fastfetchbot_shared/services/scrapers")
scrapers_to_check = ['twitter', 'threads', 'reddit', 'instagram', 'douban', 'wechat', 'xiaohongshu', 'zhihu']
print("Checking scraper __init__ signatures:\n")
for scraper_name in scrapers_to_check:
scraper_file = scrapers_dir / scraper_name / 'scraper.py'
if scraper_file.exists():
try:
tree = ast.parse(scraper_file.read_text())
for node in tree.body:
if isinstance(node, ast.ClassDef):
init_method = next(
(m for m in node.body if isinstance(m, ast.FunctionDef) and m.name == '__init__'),
None
)
if init_method:
args = [arg.arg for arg in init_method.args.args]
has_kwargs = init_method.args.kwarg is not None
has_category = 'category' in args
print(f"{scraper_name}: {node.name}")
print(f" args: {args}, kwargs: {has_kwargs}")
if not (has_category or has_kwargs):
print(f" ❌ INCOMPATIBLE")
else:
print(f"{scraper_name}: {node.name}")
print(f" inherits __init__ (base: Scraper - no __init__)")
print(f" ❌ INCOMPATIBLE")
print()
except Exception as e:
print(f"Error: {scraper_file}: {e}\n")
else:
print(f"{scraper_name}: scraper.py not found\n")
EOFRepository: aturret/FastFetchBot
Length of output: 351
🏁 Script executed:
#!/bin/bash
# Let's also check what _resolve_scraper_class returns to understand the full scope
grep -n "_resolve_scraper_class" packages/shared/fastfetchbot_shared/services/scrapers/common.py -A 10Repository: aturret/FastFetchBot
Length of output: 1237
🏁 Script executed:
#!/bin/bash
# Check what files are in the scraper directories
for dir in twitter threads reddit instagram douban wechat xiaohongshu zhihu; do
echo "=== $dir ==="
ls -la packages/shared/fastfetchbot_shared/services/scrapers/$dir/
echo ""
doneRepository: aturret/FastFetchBot
Length of output: 1936
🏁 Script executed:
#!/bin/bash
# Check service_classes registration in common.py
cat -n packages/shared/fastfetchbot_shared/services/scrapers/common.py | head -75Repository: aturret/FastFetchBot
Length of output: 3177
🏁 Script executed:
#!/bin/bash
# Find where service_classes is defined
grep -rn "service_classes" packages/shared/fastfetchbot_shared/services/scrapers/ --type pyRepository: aturret/FastFetchBot
Length of output: 181
🏁 Script executed:
#!/bin/bash
# Check the __init__ signatures of the actual scraper classes defined in __init__.py files
python3 << 'EOF'
import ast
from pathlib import Path
scrapers_dir = Path("packages/shared/fastfetchbot_shared/services/scrapers")
scrapers_to_check = ['twitter', 'threads', 'reddit', 'instagram', 'douban', 'wechat', 'xiaohongshu', 'zhihu', 'weibo', 'bluesky']
print("Scraper classes that will be instantiated with category= argument:\n")
for scraper_name in scrapers_to_check:
init_file = scrapers_dir / scraper_name / '__init__.py'
if init_file.exists():
try:
tree = ast.parse(init_file.read_text())
# Find the main scraper class (usually capitalized version of scraper_name)
for node in tree.body:
if isinstance(node, ast.ClassDef):
# Look for __init__ method
init_method = next(
(m for m in node.body if isinstance(m, ast.FunctionDef) and m.name == '__init__'),
None
)
if init_method:
args = [arg.arg for arg in init_method.args.args]
has_kwargs = init_method.args.kwarg is not None
has_category = 'category' in args
print(f"{scraper_name}: {node.name}")
print(f" args: {args}")
print(f" has_kwargs: {has_kwargs}, has_category: {has_category}")
if not (has_category or has_kwargs):
print(f" ❌ WILL FAIL: does not accept category= or **kwargs")
print()
else:
print(f"{scraper_name}: {node.name}")
print(f" inherits __init__")
print()
except Exception as e:
print(f"Error parsing {init_file}: {e}\n")
EOFRepository: aturret/FastFetchBot
Length of output: 991
🏁 Script executed:
#!/bin/bash
# Let's check a couple of these __init__.py files directly to understand their structure
echo "=== twitter/__init__.py (first 50 lines) ==="
head -50 packages/shared/fastfetchbot_shared/services/scrapers/twitter/__init__.pyRepository: aturret/FastFetchBot
Length of output: 1618
🏁 Script executed:
#!/bin/bash
# Check threads/__init__.py
echo "=== threads/__init__.py (first 50 lines) ==="
head -50 packages/shared/fastfetchbot_shared/services/scrapers/threads/__init__.pyRepository: aturret/FastFetchBot
Length of output: 1772
🏁 Script executed:
#!/bin/bash
# Check what Weibo and Bluesky inherit from and if their base classes accept category
python3 << 'EOF'
import ast
from pathlib import Path
# Check Weibo
weibo_file = Path("packages/shared/fastfetchbot_shared/services/scrapers/weibo/__init__.py")
bluesky_file = Path("packages/shared/fastfetchbot_shared/services/scrapers/bluesky/__init__.py")
print("=== Weibo ===")
if weibo_file.exists():
tree = ast.parse(weibo_file.read_text())
for node in tree.body:
if isinstance(node, ast.ClassDef) and node.name == "Weibo":
print(f"Class: {node.name}")
print(f"Bases: {[base.id if hasattr(base, 'id') else str(base) for base in node.bases]}")
# Check if __init__ is defined
init_method = next(
(m for m in node.body if isinstance(m, ast.FunctionDef) and m.name == '__init__'),
None
)
if init_method:
print("Has custom __init__")
else:
print("Inherits __init__ from base class")
print("\n=== Bluesky ===")
if bluesky_file.exists():
tree = ast.parse(bluesky_file.read_text())
for node in tree.body:
if isinstance(node, ast.ClassDef) and node.name == "Bluesky":
print(f"Class: {node.name}")
print(f"Bases: {[base.id if hasattr(base, 'id') else str(base) for base in node.bases]}")
# Check if __init__ is defined
init_method = next(
(m for m in node.body if isinstance(m, ast.FunctionDef) and m.name == '__init__'),
None
)
if init_method:
print("Has custom __init__")
else:
print("Inherits __init__ from base class")
EOFRepository: aturret/FastFetchBot
Length of output: 238
🏁 Script executed:
#!/bin/bash
# Check weibo/__init__.py to see the Weibo class definition
head -80 packages/shared/fastfetchbot_shared/services/scrapers/weibo/__init__.py | tail -50Repository: aturret/FastFetchBot
Length of output: 1616
🏁 Script executed:
#!/bin/bash
# Check bluesky/__init__.py to see the Bluesky class definition
head -50 packages/shared/fastfetchbot_shared/services/scrapers/bluesky/__init__.py | tail -30Repository: aturret/FastFetchBot
Length of output: 1167
🏁 Script executed:
#!/bin/bash
# Find and read the MetadataItem class
find packages/shared -name "metadata_item.py" -type fRepository: aturret/FastFetchBot
Length of output: 123
🏁 Script executed:
#!/bin/bash
# Check the MetadataItem __init__ signature
python3 << 'EOF'
import ast
from pathlib import Path
metadata_file = Path("packages/shared/fastfetchbot_shared/models/metadata_item.py")
if metadata_file.exists():
tree = ast.parse(metadata_file.read_text())
for node in tree.body:
if isinstance(node, ast.ClassDef) and node.name == "MetadataItem":
print(f"Class: {node.name}")
init_method = next(
(m for m in node.body if isinstance(m, ast.FunctionDef) and m.name == '__init__'),
None
)
if init_method:
args = [arg.arg for arg in init_method.args.args]
has_kwargs = init_method.args.kwarg is not None
has_category = 'category' in args
print(f"__init__ args: {args}")
print(f"has_category: {has_category}, has_kwargs: {has_kwargs}")
else:
print("No custom __init__ (inherits from parent)")
EOFRepository: aturret/FastFetchBot
Length of output: 125
🏁 Script executed:
#!/bin/bash
# Let's check the actual file content
wc -l packages/shared/fastfetchbot_shared/models/metadata_item.py
head -100 packages/shared/fastfetchbot_shared/models/metadata_item.pyRepository: aturret/FastFetchBot
Length of output: 2885
Weibo and Bluesky scrapers are incompatible with the new constructor signature.
The call at lines 88-90 passes data= and **self.kwargs, but Weibo and Bluesky inherit from the dataclass MetadataItem, whose auto-generated __init__ only accepts the declared fields (url, telegraph_url, content, text, media_files, author, title, author_url, category, message_type). They do not accept **kwargs, so this will fail at runtime with TypeError: __init__() got an unexpected keyword argument 'data' whenever data or additional kwargs are provided. The other scrapers (Twitter, Threads, Reddit, Instagram, etc.) all have **kwargs in their signatures and will work correctly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/shared/fastfetchbot_shared/services/scrapers/common.py` around lines
88 - 90, The instantiation of scraper classes (scraper_cls =
self._resolve_scraper_class(...); scraper_item = scraper_cls(url=self.url,
category=self.category, data=self.data, **self.kwargs)) passes data and **kwargs
to all scrapers, which breaks dataclass-based scrapers like Weibo and Bluesky
(subclasses of MetadataItem) that only accept declared fields; change the
instantiation logic in the calling site to detect whether scraper_cls accepts
extra kwargs (e.g., use inspect.signature to see if **kwargs or a data parameter
is present) or whether scraper_cls is a subclass of MetadataItem, and if so call
it with only the allowed fields (url, category and other declared MetadataItem
fields) instead of passing data/**kwargs; alternatively, attempt instantiation
and on TypeError retry by creating the object with the minimal allowed
parameters for MetadataItem (url, category, telegraph_url, content, text,
media_files, author, title, author_url, message_type).
| "celery[redis]>=5.4.0,<6.0.0", | ||
| "fastfetchbot-shared[postgres]", | ||
| "fastfetchbot-file-export", | ||
| "arq>=0.27.0", |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
For ARQ, what version pinning strategy is recommended for production when using 0.27.x? Are breaking changes expected in 0.28.x?
💡 Result:
For ARQ (python-arq job queue library) version 0.27.x in production, pin to a specific patch version (e.g. arq==0.27.0) rather than the loose 0.27.x range, as the project is in maintenance-only mode with limited updates and follows semantic versioning where minor releases (0.28.0) may introduce breaking changes. No 0.28.x release exists as of 2026-03-22 (latest is v0.27.0 released 2026-02-02), so breaking changes are possible but not yet released; treat 0.28.x upgrade as potentially breaking per semver conventions used in releases like v0.26.0b1 and earlier.
Citations:
- 1: https://github.com/python-arq/arq
- 2: https://arq-docs.helpmanual.io/
- 3: https://pypi.org/project/arq/
- 4:
v0.26release python-arq/arq#441 - 5: https://github.com/samuelcolvin/arq
- 6: https://newreleases.io/project/github/python-arq/arq/release/v0.26.0b1
- 7: https://newreleases.io/project/github/python-arq/arq/release/v0.25.0
🏁 Script executed:
cat -n pyproject.toml | head -60Repository: aturret/FastFetchBot
Length of output: 2217
Pin arq with an explicit upper bound to match the project's dependency strategy.
Line 47 is the only external runtime dependency without an upper bound; all 30+ other dependencies use explicit upper bounds (e.g., >=X.Y.Z,<UPPER.0.0). ARQ is also in maintenance-only mode with no 0.28.x release planned, making strict version bounds prudent.
Proposed diff
- "arq>=0.27.0",
+ "arq>=0.27.0,<0.28.0",📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "arq>=0.27.0", | |
| "arq>=0.27.0,<0.28.0", |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` at line 47, The dependency entry "arq>=0.27.0" in
pyproject.toml lacks an explicit upper bound; update that entry to follow the
project's dependency policy by pinning an upper bound (e.g., change
"arq>=0.27.0" to "arq>=0.27.0,<0.28.0") so it matches the other runtime
dependencies and prevents unintended upgrades.
|
|
||
| # Async Scraping Worker (ARQ) | ||
| # Scrape mode: "api" (sync via API server) or "queue" (async via ARQ worker). Default: `api` | ||
| SCRAPE_MODE=api | ||
|
|
||
| # Redis URL for ARQ task queue. Default: `redis://localhost:6379/2` | ||
| ARQ_REDIS_URL=redis://redis:6379/2 | ||
|
|
||
| # Redis URL for the result outbox. Default: `redis://localhost:6379/3` | ||
| OUTBOX_REDIS_URL=redis://redis:6379/3 |
There was a problem hiding this comment.
Default value inconsistency: SCRAPE_MODE differs between template.env and config.py.
The template.env sets SCRAPE_MODE=api (Line 170), but apps/telegram-bot/core/config.py (Line 125) defaults to "queue" when the variable is unset. This mismatch could cause unexpected behavior for users who deploy without explicitly setting SCRAPE_MODE.
Consider aligning the defaults—likely api should be the safe default in both places to ensure backwards compatibility for existing deployments.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@template.env` around lines 167 - 176, The SCRAPE_MODE default is
inconsistent: template.env sets SCRAPE_MODE=api but the application default in
apps/telegram-bot/core/config.py falls back to "queue" (e.g., where SCRAPE_MODE
is loaded via get_env or Config.SCRAPE_MODE with default "queue"); pick a single
safe default (recommend "api") and change the fallback in config.py to "api" (or
change template.env to "queue" if you prefer the other behavior), and update any
associated comments/docs so both the env template and the code use the same
SCRAPE_MODE default.
| class TestConfigDefaults: | ||
| @pytest.mark.asyncio | ||
| async def test_uses_config_defaults_when_none(self, base_metadata_item, mock_telegraph): | ||
| """When store_telegraph/store_document are None, config defaults should be used.""" | ||
| with patch("async_worker.services.enrichment.STORE_TELEGRAPH", True), \ | ||
| patch("async_worker.services.enrichment.STORE_DOCUMENT", False): | ||
| result = await enrich(base_metadata_item) | ||
|
|
||
| # STORE_TELEGRAPH=True means Telegraph should be called | ||
| MockTg, instance = mock_telegraph | ||
| MockTg.from_dict.assert_called_once() |
There was a problem hiding this comment.
Fixture mock_telegraph not applied correctly in this test.
The test uses mock_telegraph as a parameter but then unpacks it after the await enrich(...) call completes, meaning the fixture's patch context may not be active at assertion time. Additionally, the unpacked instance variable is unused.
Suggested fix
`@pytest.mark.asyncio`
async def test_uses_config_defaults_when_none(self, base_metadata_item, mock_telegraph):
"""When store_telegraph/store_document are None, config defaults should be used."""
+ MockTg, _ = mock_telegraph # Unpack before calling enrich
with patch("async_worker.services.enrichment.STORE_TELEGRAPH", True), \
patch("async_worker.services.enrichment.STORE_DOCUMENT", False):
result = await enrich(base_metadata_item)
# STORE_TELEGRAPH=True means Telegraph should be called
- MockTg, instance = mock_telegraph
MockTg.from_dict.assert_called_once()🧰 Tools
🪛 Ruff (0.15.6)
[warning] 196-196: Unpacked variable instance is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/async_worker/test_enrichment.py` around lines 187 - 197, In
test_uses_config_defaults_when_none move the mock_telegraph unpacking before
calling enrich so the fixture/patch is active during the call and drop the
unused instance variable; e.g. unpack mock_telegraph into (MockTg, _) or
(MockTg, instance) at the top of the test, then call await
enrich(base_metadata_item) and assert MockTg.from_dict.assert_called_once();
ensure you still patch STORE_TELEGRAPH/STORE_DOCUMENT as currently done.
| mock_celery.send_task.assert_called_once_with( | ||
| "file_export.pdf_export", | ||
| kwargs={ | ||
| "html_string": mock_celery.send_task.call_args.kwargs["kwargs"]["html_string"], | ||
| "output_filename": mock_celery.send_task.call_args.kwargs["kwargs"]["output_filename"], | ||
| }, | ||
| ) |
There was a problem hiding this comment.
Test assertion is circular and doesn't verify expected values.
The assertion compares send_task call args against values extracted from the same call args, which always passes. Consider asserting against expected values instead.
💚 Suggested fix
assert output == "/tmp/final.pdf"
- mock_celery.send_task.assert_called_once_with(
- "file_export.pdf_export",
- kwargs={
- "html_string": mock_celery.send_task.call_args.kwargs["kwargs"]["html_string"],
- "output_filename": mock_celery.send_task.call_args.kwargs["kwargs"]["output_filename"],
- },
- )
+ mock_celery.send_task.assert_called_once()
+ call_args = mock_celery.send_task.call_args
+ assert call_args[0][0] == "file_export.pdf_export"
+ sent_kwargs = call_args.kwargs["kwargs"]
+ assert "<html>" in sent_kwargs["html_string"]
+ assert "content" in sent_kwargs["html_string"]
+ assert sent_kwargs["output_filename"].endswith(".pdf")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/file_export/test_pdf_export.py` around lines 110 - 116, The test
currently asserts mock_celery.send_task against values extracted from the same
call (mock_celery.send_task.call_args), which is circular and always passes;
update the assertion in the test_pdf_export test to compare the send_task call
to concrete expected values (or to variables computed earlier in the test)
instead of pulling values from call_args (i.e., assert
mock_celery.send_task.assert_called_once_with("file_export.pdf_export",
kwargs={"html_string": expected_html_string, "output_filename":
expected_output_filename})), referencing the mock_celery.send_task call and the
expected HTML/output filename variables used in the test.
Summary by CodeRabbit
New Features
Chores
Tests