Merged (28 commits)
89b8c8e
refactor: delete old code
aturret Feb 19, 2026
eefcedd
feat(api)!: replace http call with celery
aturret Feb 19, 2026
15e9446
feat: add new celery worker process
aturret Feb 20, 2026
dc70b04
feat: add file-export as a new package
aturret Feb 20, 2026
c4ba717
chore: update sample files and docs
aturret Feb 20, 2026
6c6f259
chore: remove unused import
aturret Feb 20, 2026
e904f50
fix: remove open_api_key pass
aturret Feb 20, 2026
be73cb1
style: add celery-type
aturret Feb 20, 2026
bb17ffc
chore: set unused export function as async
aturret Feb 20, 2026
5f5cecd
fix: add logger to file export
aturret Feb 20, 2026
5ddee8a
feat: update github action for different envs
aturret Feb 20, 2026
a2c2cd7
fix: add logger to transcribe
aturret Feb 20, 2026
0cf8717
feat: add sanitizing for yt-dlp content info
aturret Feb 20, 2026
fda576e
Update packages/file-export/fastfetchbot_file_export/transcribe.py
aturret Feb 20, 2026
b34fbac
fix: update CLAUDE.md
aturret Feb 20, 2026
fb64e43
fix: fix the audio segment logic
aturret Feb 20, 2026
5404419
refactor: remove duplicated code
aturret Feb 20, 2026
c5c77e2
Update packages/file-export/fastfetchbot_file_export/video_download.py
aturret Feb 20, 2026
7c27c13
refactor: add exception handling for celery task
aturret Feb 20, 2026
35b1966
feat: add exception handling for celery tasks
aturret Feb 20, 2026
4bbb9af
fix: fix format check logic
aturret Feb 20, 2026
e172c78
fix: fix error exception handling
aturret Feb 20, 2026
e5b7e51
fix: fix remove file logic sequence
aturret Feb 20, 2026
06010a9
fix: fix audio_file ext name
aturret Feb 20, 2026
5e34fe3
fix: fix video download exception handling
aturret Feb 20, 2026
3c266cb
fix: update filepath generation logic for yt-dlp
aturret Feb 20, 2026
37e7b7d
Merge branch 'celery-update' of https://github.com/aturret/FastFetchB…
aturret Feb 20, 2026
69b457d
fix: update ci to avoid injection risk
aturret Feb 20, 2026
51 changes: 42 additions & 9 deletions .github/workflows/ci.yml
@@ -4,6 +4,8 @@ on:
push:
branches:
- main
tags:
- 'v*'

concurrency:
group: fastfetchbot
@@ -24,6 +26,9 @@ jobs:
- service: telegram-bot
dockerfile: apps/telegram-bot/Dockerfile
image_suffix: tgbot
- service: worker
dockerfile: apps/worker/Dockerfile
image_suffix: worker
steps:
- name: Checkout
uses: actions/checkout@v4
@@ -33,13 +38,29 @@
- name: Check commit message
id: check_message
run: |
MESSAGE=$(git log --format=%B -n 1 ${{ github.sha }})
MESSAGE=$(git log --format=%B -n 1 "$GITHUB_SHA")
if [[ "$MESSAGE" == *"[github-action]"* ]]; then
echo "skip=true" >> "$GITHUB_OUTPUT"
else
echo "skip=false" >> "$GITHUB_OUTPUT"
fi

- name: Determine Environment Tags
id: env_vars
run: |
# Check if the workflow was triggered by a tag or a branch push
if [[ "$GITHUB_REF" == refs/tags/* ]]; then
# Production Environment (Tag Trigger)
VERSION_TAG=${GITHUB_REF#refs/tags/}
echo "docker_tag=latest" >> "$GITHUB_OUTPUT"
echo "version_tag=$VERSION_TAG" >> "$GITHUB_OUTPUT"
else
# Staging Environment (Main Branch Trigger)
echo "docker_tag=stage" >> "$GITHUB_OUTPUT"
# Use the short commit SHA as a secondary tag for tracking
echo "version_tag=$(git rev-parse --short HEAD)" >> "$GITHUB_OUTPUT"
fi
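The tag-versus-branch logic above can be mirrored in plain Python for clarity (a sketch; the actual step runs as bash inside the workflow):

```python
def determine_env_tags(github_ref: str, short_sha: str) -> dict:
    """Mirror the CI step: tag pushes produce production tags,
    branch pushes produce staging tags."""
    if github_ref.startswith("refs/tags/"):
        # Production: strip the refs/tags/ prefix, publish as :latest
        version_tag = github_ref[len("refs/tags/"):]
        return {"docker_tag": "latest", "version_tag": version_tag}
    # Staging: tag as :stage, track builds by the short commit SHA
    return {"docker_tag": "stage", "version_tag": short_sha}


print(determine_env_tags("refs/tags/v1.2.0", "abc1234"))
# {'docker_tag': 'latest', 'version_tag': 'v1.2.0'}
print(determine_env_tags("refs/heads/main", "abc1234"))
# {'docker_tag': 'stage', 'version_tag': 'abc1234'}
```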

- name: Set up QEMU
uses: docker/setup-qemu-action@v3

@@ -67,12 +88,24 @@ jobs:
build-args: |
APP_VERSION=${{ env.APP_VERSION }}
tags: |
ghcr.io/${{ github.repository_owner }}/fastfetchbot-${{ matrix.image_suffix }}:latest
ghcr.io/${{ github.repository_owner }}/fastfetchbot-${{ matrix.image_suffix }}:${{ steps.env_vars.outputs.docker_tag }}
ghcr.io/${{ github.repository_owner }}/fastfetchbot-${{ matrix.image_suffix }}:${{ steps.env_vars.outputs.version_tag }}

deploy:
needs: build
runs-on: ubuntu-latest
steps:
- name: Trigger Watchtower deployment
run: |
curl -H "Authorization: Bearer ${{ secrets.WATCHTOWER_TOKEN }}" ${{ secrets.WATCHTOWER_WEBHOOK_URL }}
# deploy:
# needs: build
# runs-on: ubuntu-latest
# steps:
# - name: Trigger Watchtower deployment
# run: |
# # Route the webhook to the appropriate server based on the trigger
# if [[ "$GITHUB_REF" == refs/tags/* ]]; then
# echo "Deploying to Production..."
# TOKEN="${{ secrets.PROD_WATCHTOWER_TOKEN }}"
# WEBHOOK_URL="${{ secrets.PROD_WATCHTOWER_WEBHOOK_URL }}"
# else
# echo "Deploying to Staging..."
# TOKEN="${{ secrets.STAGE_WATCHTOWER_TOKEN }}"
# WEBHOOK_URL="${{ secrets.STAGE_WATCHTOWER_WEBHOOK_URL }}"
# fi
#
# curl -H "Authorization: Bearer $TOKEN" "$WEBHOOK_URL"
1 change: 1 addition & 0 deletions .gitignore
@@ -258,3 +258,4 @@ conf/*
/.run/
.DS_Store
/.claude/
/apps/worker/conf/
249 changes: 134 additions & 115 deletions CLAUDE.md
@@ -2,143 +2,162 @@

## Project Overview

FastFetchBot is a social media content fetching API built with FastAPI, designed to scrape and archive content from various social media platforms. It includes a Telegram Bot as the default client interface and supports multiple social media platforms including Twitter, Weibo, Xiaohongshu, Reddit, Bluesky, Instagram, Zhihu, Douban, YouTube, and Bilibili.
FastFetchBot is a social media content fetching service built as a **UV workspace monorepo** with three microservices: a FastAPI server (API), a Telegram Bot client, and a Celery worker for file operations. It scrapes and archives content from various social media platforms including Twitter, Weibo, Xiaohongshu, Reddit, Bluesky, Instagram, Zhihu, Douban, YouTube, and Bilibili.

## Architecture

> **Review comment (⚠️ Minor, markdownlint MD040):** the fenced directory-tree block below has no language specifier; the proposed fix opens the fence with a `text` language tag.

```text
FastFetchBot/
├── packages/shared/ # fastfetchbot-shared: common models, utilities, logger
├── packages/file-export/ # fastfetchbot-file-export: video download, PDF export, transcription
├── apps/api/ # FastAPI server: scrapers, storage, routing
├── apps/telegram-bot/ # Telegram Bot: webhook/polling, message handling
├── apps/worker/ # Celery worker: async file operations (video, PDF, audio)
├── app/ # Legacy re-export wrappers (backward compatibility)
├── pyproject.toml # Root workspace configuration
└── uv.lock # Lockfile for the entire workspace
```

| Service | Package Name | Port | Entry Point |
|---------|-------------|------|-------------|
| **API Server** (`apps/api/src/`) | `fastfetchbot-api` | 10450 | `gunicorn -k uvicorn.workers.UvicornWorker src.main:app --preload` |
| **Telegram Bot** (`apps/telegram-bot/core/`) | `fastfetchbot-telegram-bot` | 10451 | `python -m core.main` |
| **Worker** (`apps/worker/worker_core/`) | `fastfetchbot-worker` | — | `celery -A worker_core.main:app worker --loglevel=info --concurrency=2` |
| **Shared Library** (`packages/shared/fastfetchbot_shared/`) | `fastfetchbot-shared` | — | — |
| **File Export Library** (`packages/file-export/fastfetchbot_file_export/`) | `fastfetchbot-file-export` | — | — |

The Telegram Bot communicates with the API server over HTTP (`API_SERVER_URL`). In Docker, this is `http://api:10450`.
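As a sketch of that HTTP hop, the bot-side URL construction might look like this (the `scrape` path is an illustrative assumption, not the actual API contract):

```python
import os
from urllib.parse import urljoin

def build_api_url(path: str) -> str:
    """Compose an API-server URL the way the bot's HTTP client might,
    reading API_SERVER_URL from the environment."""
    # base defaults to the local dev address when the variable is unset
    base = os.environ.get("API_SERVER_URL", "http://localhost:10450")
    # Normalize the trailing slash so urljoin keeps the base intact
    return urljoin(base.rstrip("/") + "/", path.lstrip("/"))

print(build_api_url("scrape"))
```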

### API Server (`apps/api/src/`)

- **`main.py`** — FastAPI app setup, Sentry integration, lifecycle management
- **`config.py`** — Environment variable handling, platform credentials
- **`routers/`** — `scraper.py` (generic endpoint), `scraper_routers.py` (platform-specific), `inoreader.py`, `wechat.py`
- **`services/scrapers/`** — `scraper_manager.py` orchestrates platform scrapers (twitter, weibo, bluesky, xiaohongshu, reddit, instagram, zhihu, douban, threads, wechat, general)
- **`services/file_export/`** — PDF generation, audio transcription (OpenAI), video download
- **`services/amazon/s3.py`** — S3 storage integration
- **`services/telegraph/`** — Telegraph content publishing
- **`templates/`** — Jinja2 templates for platform-specific output formatting

### Telegram Bot (`apps/telegram-bot/core/`)

- **`main.py`** — Entry point
- **`api_client.py`** — HTTP client calling the API server
- **`handlers/`** — `messages.py`, `buttons.py`, `url_process.py`
- **`services/`** — `bot_app.py`, `message_sender.py`, `constants.py`
- **`webhook/server.py`** — Webhook/polling server
- **`templates/`** — Jinja2 templates for bot messages

### Shared Library (`packages/shared/fastfetchbot_shared/`)

- **`config.py`** — URL patterns (SOCIAL_MEDIA_WEBSITE_PATTERNS, VIDEO_WEBSITE_PATTERNS, BANNED_PATTERNS)
- **`models/`** — `classes.py` (NamedBytesIO), `metadata_item.py`, `telegraph_item.py`, `url_metadata.py`
- **`utils/`** — `parse.py` (URL parsing, HTML processing, `get_env_bool`), `image.py`, `logger.py`, `network.py`

### Legacy `app/` Directory

Re-export wrappers providing backward compatibility. Actual code lives in `apps/api/src/` and `packages/shared/`. For example, `app/config.py` imports `get_env_bool` from `fastfetchbot_shared.utils.parse`.
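As an illustration, a boolean env helper like `get_env_bool` could be implemented as follows (assumed behavior; the real implementation lives in `fastfetchbot_shared.utils.parse` and may differ):

```python
import os

def get_env_bool(name: str, default: bool = False) -> bool:
    """Parse a boolean-ish environment variable
    ("true", "1", "yes", "on" count as True)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}

os.environ["DATABASE_ON"] = "true"
print(get_env_bool("DATABASE_ON"))  # True
```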

## Development Commands

### Package Management
- `uv sync` - Install all dependencies (including dev)
- `uv sync --no-dev` - Install production dependencies only
- `uv sync --extra windows` - Install with Windows extras
- `uv lock` - Regenerate the lock file after pyproject.toml changes
- `uv sync` — Install all dependencies (including dev)
- `uv lock` — Regenerate the lock file after pyproject.toml changes

### Running the Application
- **Production**: `uv run gunicorn -k uvicorn.workers.UvicornWorker app.main:app --preload`
- **Development**: `uv run gunicorn -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:10450 wsgi:app`
### Running Locally

### Docker Commands
- `docker-compose up -d` - Start all services (FastFetchBot, Telegram Bot API, File Exporter)
- `docker-compose build` - Build the FastFetchBot container
```bash
# API server
cd apps/api
uv run gunicorn -k uvicorn.workers.UvicornWorker src.main:app --preload

> **uv version in Docker**: The Dockerfile pins uv to `0.8.18` via `COPY --from=ghcr.io/astral-sh/uv:0.8.18`.
> To upgrade, update that tag in `Dockerfile` line 24 and verify the build with `docker build -t fastfetchbot-test .`.
# Telegram Bot (separate terminal)
cd apps/telegram-bot
uv run python -m core.main
```

### Testing
- `uv run pytest` - Run all tests
- `uv run pytest tests/test_bluesky.py` - Run specific test file
- `uv run pytest -v` - Run tests with verbose output
- `uv run pytest` — Run all tests
- `uv run pytest tests/test_bluesky.py` — Run a specific test file
- `uv run pytest -v` — Verbose output

### Code Formatting
- `uv run black .` - Format all Python code using Black formatter

## Architecture Overview

### Core Components

**FastAPI Application (`app/main.py`)**
- Main application entry point with FastAPI instance
- Configures routers, middleware, and lifecycle management
- Integrates Sentry for error monitoring
- Handles Telegram bot webhook setup on startup

**Scraper Architecture (`app/services/scrapers/`)**
- `ScraperManager`: Centralized manager for all platform scrapers
- Individual scraper modules for each platform (twitter, weibo, bluesky, etc.)
- Each scraper implements platform-specific content extraction logic
- Common scraping utilities in `common.py`

**Router Structure (`app/routers/`)**
- Platform-specific routers (twitter.py, weibo.py, etc.)
- Generic scraper router for unified API endpoints
- Telegram bot webhook handler
- Feed processing and Inoreader integration

**Data Models (`app/models/`)**
- `classes.py`: Core data structures (NamedBytesIO)
- `database_model.py`: MongoDB/Beanie models
- Platform-specific metadata models
- Telegram chat and Telegraph item models

**Configuration (`app/config.py`)**
- Comprehensive environment variable handling
- Platform-specific API credentials and cookies
- Database, storage, and service configurations
- Template and localization settings

### Key Services

**Telegram Bot Service (`app/services/telegram_bot/`)**
- Handles webhook setup and message processing
- Integrates with local Telegram Bot API server for large file support
- Channel and admin management

**File Export Service (`app/services/file_export/`)**
- Document export (PDF generation)
- Audio transcription (OpenAI integration)
- Video download capabilities

**Storage Services**
- Amazon S3 integration for media storage
- Local file system management
- Telegraph integration for content publishing

### Platform Support

**Supported Social Media Platforms:**
- Twitter (requires ct0 and auth_token cookies)
- Weibo (requires cookies)
- Xiaohongshu (requires a1, webid, websession cookies)
- Bluesky (requires username/password)
- Reddit (requires API credentials)
- Instagram (requires X-RapidAPI key)
- Zhihu (requires cookies in conf/zhihu_cookies.json)
- Douban
- YouTube, Bilibili (video content)
- `uv run black .` — Format all Python code

### Docker

```bash
# Start all services (uses pre-built images from GHCR)
docker-compose up -d

# Build locally
docker build -f apps/api/Dockerfile -t fastfetchbot-api .
docker build -f apps/telegram-bot/Dockerfile -t fastfetchbot-telegram-bot .
docker build -f apps/worker/Dockerfile -t fastfetchbot-worker .
```

> **uv version in Docker**: All three Dockerfiles pin uv to `0.10.4` via `COPY --from=ghcr.io/astral-sh/uv:0.10.4`.
> To upgrade, update that tag in `apps/api/Dockerfile`, `apps/telegram-bot/Dockerfile`, and `apps/worker/Dockerfile`.

Docker Compose services (see `docker-compose.template.yml`):
- **api** — API server (port 10450)
- **telegram-bot** — Telegram Bot (port 10451)
- **telegram-bot-api** — Local Telegram Bot API for large file support (ports 8081-8082)
- **redis** — Message broker and result backend for Celery (port 6379)
- **worker** — Celery worker for file operations (video download, PDF export, audio transcription)
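Several commits in this PR add exception handling to Celery tasks. The general pattern can be sketched framework-free as a decorator that converts failures into a structured result instead of crashing the worker (names are illustrative, not the project's actual task API):

```python
import functools
import logging

logger = logging.getLogger("worker")

def task_errors_to_result(func):
    """Wrap a task so exceptions become a structured failure payload
    (a sketch of the pattern, not FastFetchBot's actual wrapper)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return {"status": "ok", "result": func(*args, **kwargs)}
        except Exception as exc:  # tasks report all failures upstream
            logger.exception("task %s failed", func.__name__)
            return {"status": "error", "error": f"{type(exc).__name__}: {exc}"}
    return wrapper

@task_errors_to_result
def export_pdf(url: str) -> str:
    if not url.startswith("http"):
        raise ValueError("not a URL")
    return f"{url}.pdf"

print(export_pdf("http://example.com/post"))
print(export_pdf("oops"))
```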

## Environment Configuration

### Required Variables
- `BASE_URL`: Server base URL
- `TELEGRAM_BOT_TOKEN`: Telegram bot token
- `TELEGRAM_CHAT_ID`: Default chat ID for bot
See `template.env` for a complete reference. Key variables:

### Required
| Variable | Description |
|----------|-------------|
| `BASE_URL` | Public server domain (used for webhook URL construction) |
| `TELEGRAM_BOT_TOKEN` | Bot token from @BotFather |
| `TELEGRAM_CHAT_ID` | Default chat ID for the bot |

### Critical Setup Notes
- Most social media scrapers require authentication cookies/tokens
### Service Communication (Docker)
| Variable | Default | Description |
|----------|---------|-------------|
| `API_SERVER_URL` | `http://localhost:10450` | URL the Telegram Bot uses to call the API. `http://api:10450` in Docker. |
| `TELEGRAM_BOT_CALLBACK_URL` | `http://localhost:10451` | URL the API uses to call the Telegram Bot. `http://telegram-bot:10451` in Docker. |
| `TELEGRAM_BOT_MODE` | `polling` | `polling` (dev) or `webhook` (production with HTTPS) |
> **Review comment on lines +113 to +125 (⚠️ Minor, markdownlint MD058):** the "Required" and "Service Communication (Docker)" tables should be surrounded by blank lines; the proposed fix inserts a blank line before and after each table.

### Platform Credentials
- Most scrapers require authentication cookies/tokens
- Use browser extension "Get cookies.txt LOCALLY" to extract cookies
- Store Zhihu cookies in `conf/zhihu_cookies.json`
- Template environment file available at `template.env`
- See `template.env` for all platform-specific variables (Twitter, Weibo, Xiaohongshu, Reddit, Instagram, Bluesky, etc.)

### Database Integration
- Optional MongoDB integration (set `DATABASE_ON=true`)
- Uses Beanie ODM for async MongoDB operations
- Database initialization handled in app lifecycle
### Database
- Optional MongoDB integration (`DATABASE_ON=true`)
- Uses Beanie ODM for async operations

### Docker Services
- **fastfetchbot**: Main application container
- **telegram-bot-api**: Local Telegram Bot API for large file support
- **fast-yt-downloader**: Separate service for video downloads
## CI/CD

## Development Guidelines
GitHub Actions (`.github/workflows/ci.yml`) builds and pushes all three images on push to `main`:
- `ghcr.io/aturret/fastfetchbot-api:latest`
- `ghcr.io/aturret/fastfetchbot-tgbot:latest`
- `ghcr.io/aturret/fastfetchbot-worker:latest`

### Cookie Management
- Platform scrapers depend on valid authentication cookies
- Store sensitive cookies in environment variables, never in code
- Test scraper functionality after cookie updates
Deployment is triggered via Watchtower webhook after builds complete. Include `[github-action]` in a commit message to skip the build.

### Adding New Platform Support
1. Create new scraper module in `app/services/scrapers/[platform]/`
## Development Guidelines

### Adding a New Platform Scraper
1. Create scraper module in `apps/api/src/services/scrapers/<platform>/`
2. Implement scraper class following existing patterns
3. Add platform-specific router in `app/routers/`
4. Update ScraperManager to include new scraper
5. Add configuration variables in `app/config.py`
3. Add platform-specific router in `apps/api/src/routers/`
4. Register the scraper in `ScraperManager`
5. Add configuration variables in `apps/api/src/config.py`
6. Create tests in `tests/cases/`
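Steps 1 and 2 might be sketched as follows, using a hypothetical base-class pattern (class and method names are assumptions, not the project's actual scraper interface):

```python
from abc import ABC, abstractmethod

class BaseScraper(ABC):
    """Hypothetical base class; the real interface lives in
    apps/api/src/services/scrapers/."""

    platform: str

    @abstractmethod
    def scrape(self, url: str) -> dict:
        """Return a metadata dict for the given post URL."""

class ExamplePlatformScraper(BaseScraper):
    platform = "exampleplatform"

    def scrape(self, url: str) -> dict:
        # A real scraper would fetch and parse the page here; this stub
        # just echoes the fields a MetadataItem-style result might carry.
        return {"platform": self.platform, "url": url, "title": "", "content": ""}

item = ExamplePlatformScraper().scrape("https://example.com/p/1")
print(item["platform"])  # exampleplatform
```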

### Template System
- Jinja2 templates in `app/templates/` for content formatting
- Platform-specific templates for different output formats
- Supports internationalization via gettext

### Error Handling and Logging
- Loguru for comprehensive logging
- Sentry integration for production error monitoring
- Platform-specific error handling in scrapers
### Key Conventions
- Shared models and utilities go in `packages/shared/fastfetchbot_shared/`
- API-specific code goes in `apps/api/src/`
- Telegram bot code goes in `apps/telegram-bot/core/`
- The bot communicates with the API only via HTTP — no direct imports of API code
- Jinja2 templates for output formatting, with i18n support via Babel
- Loguru for logging, Sentry for production error monitoring
- Store sensitive cookies/tokens in environment variables, never in code