Merged
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -272,3 +272,4 @@ conf/*
.DS_Store
/.claude/
/apps/worker/conf/
apps/worker/celerybeat-schedule.db
6 changes: 4 additions & 2 deletions CLAUDE.md
@@ -33,7 +33,7 @@ The Telegram Bot communicates with the API server over HTTP (`API_SERVER_URL`).
- **`main.py`** — FastAPI app setup, Sentry integration, lifecycle management
- **`config.py`** — Environment variable handling, platform credentials
- **`routers/`** — `scraper.py` (generic endpoint), `scraper_routers.py` (platform-specific), `inoreader.py`, `wechat.py`
- **`services/scrapers/`** — `scraper_manager.py` orchestrates platform scrapers (twitter, weibo, bluesky, xiaohongshu, reddit, instagram, zhihu, douban, threads, wechat, general)
- **`services/scrapers/`** — `scraper_manager.py` orchestrates platform scrapers (twitter, weibo, bluesky, xiaohongshu, reddit, instagram, zhihu, douban, threads, wechat, general); the Xiaohongshu scraper uses `xiaohongshu/adaptar.py` (`XhsSinglePostAdapter`) with an external sign server instead of the old Playwright-based crawler
- **`services/file_export/`** — PDF generation, audio transcription (OpenAI), video download
- **`services/amazon/s3.py`** — S3 storage integration
- **`services/telegraph/`** — Telegraph content publishing
@@ -50,7 +50,7 @@ The Telegram Bot communicates with the API server over HTTP (`API_SERVER_URL`).

### Shared Library (`packages/shared/fastfetchbot_shared/`)

- **`config.py`** — URL patterns (SOCIAL_MEDIA_WEBSITE_PATTERNS, VIDEO_WEBSITE_PATTERNS, BANNED_PATTERNS)
- **`config.py`** — URL patterns (SOCIAL_MEDIA_WEBSITE_PATTERNS, VIDEO_WEBSITE_PATTERNS, BANNED_PATTERNS); shared env vars including `SIGN_SERVER_URL` and `XHS_COOKIE_PATH`
- **`models/`** — `classes.py` (NamedBytesIO), `metadata_item.py`, `telegraph_item.py`, `url_metadata.py`
- **`utils/`** — `parse.py` (URL parsing, HTML processing, `get_env_bool`), `image.py`, `logger.py`, `network.py`

@@ -128,6 +128,8 @@ See `template.env` for a complete reference. Key variables:
- Most scrapers require authentication cookies/tokens
- Use browser extension "Get cookies.txt LOCALLY" to extract cookies
- Store Zhihu cookies in `conf/zhihu_cookies.json`
- Store Xiaohongshu cookies in `conf/xhs_cookies.txt` (single-line cookie string, e.g. `a1=x; web_id=x; web_session=x`)
- Xiaohongshu also requires an external **sign server** reachable at `SIGN_SERVER_URL` (default `http://localhost:8989`); the sign server is currently closed-source — you must supply your own compatible implementation
- See `template.env` for all platform-specific variables (Twitter, Weibo, Xiaohongshu, Reddit, Instagram, Bluesky, etc.)

### Database
42 changes: 39 additions & 3 deletions README.md
@@ -154,10 +154,46 @@ See `template.env` for a complete reference with comments.
| Twitter | `TWITTER_CT0`, `TWITTER_AUTH_TOKEN` |
| Reddit | `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USERNAME`, `REDDIT_PASSWORD` |
| Weibo | `WEIBO_COOKIES` |
| Xiaohongshu | `XIAOHONGSHU_A1`, `XIAOHONGSHU_WEBID`, `XIAOHONGSHU_WEBSESSION` |
| Xiaohongshu | See [Xiaohongshu Setup](#xiaohongshu-setup) below |
| Instagram | `X_RAPIDAPI_KEY` |
| Zhihu | Store cookies in `conf/zhihu_cookies.json` |

#### Xiaohongshu Setup

Xiaohongshu (XHS) API requests require a cryptographic signature (`x-s`, `x-t`, etc.) that must be computed by a dedicated signing proxy. FastFetchBot delegates this to an external **sign server**.

> **Note:** We currently use a closed-source sign server. You will need to run your own compatible signing proxy and point `SIGN_SERVER_URL` at it.

The sign server must accept `POST /signsrv/v1/xhs/sign` with a JSON body:

```json
{"uri": "/api/sns/web/v1/feed", "data": {...}, "cookies": "a1=..."}
```

and return:

```json
{"isok": true, "data": {"x_s": "...", "x_t": "...", "x_s_common": "...", "x_b3_traceid": "..."}}
```

**Cookie configuration** (two options; file takes priority):

- **File (recommended):** Create `apps/api/conf/xhs_cookies.txt` containing your XHS cookies as a single line:
```
a1=xxxxxxxx; web_id=xxxxxxxx; web_session=xxxxxxxx
```
Comment on lines +182 to +184 (Contributor):

**⚠️ Potential issue | 🟡 Minor — Add a language identifier to the fenced code block.**

The cookie-file example block has no language specifier, which triggers markdownlint MD040 (fenced-code-language). Since it is a plain-text example, use `text` (or `sh`):

````diff
-  ```
+  ```text
   a1=xxxxxxxx; web_id=xxxxxxxx; web_session=xxxxxxxx
````

Log in to [xiaohongshu.com](https://www.xiaohongshu.com) in your browser, then copy the cookie values from DevTools → Application → Cookies, or use the [Get cookies.txt LOCALLY](https://chrome.google.com/webstore/detail/get-cookiestxt-locally/cclelndahbckbenkjhflpdbgdldlbecc) extension.

- **Environment variables (legacy fallback):** Set `XIAOHONGSHU_A1`, `XIAOHONGSHU_WEBID`, and `XIAOHONGSHU_WEBSESSION` individually. Used only when the cookie file is absent.
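Either way, the scraper ends up with a single `key=value; key=value` cookie string. For reference, a string in that format can be split into a dict with a few lines of Python (illustrative only; not part of FastFetchBot's API):

```python
def parse_cookie_string(cookie_string: str) -> dict:
    """Split 'a1=x; web_id=y' into {'a1': 'x', 'web_id': 'y'}."""
    cookies = {}
    for part in cookie_string.split(";"):
        part = part.strip()
        if "=" in part:
            name, _, value = part.partition("=")
            cookies[name.strip()] = value.strip()
    return cookies


print(parse_cookie_string("a1=xxxxxxxx; web_id=xxxxxxxx; web_session=xxxxxxxx"))
# prints {'a1': 'xxxxxxxx', 'web_id': 'xxxxxxxx', 'web_session': 'xxxxxxxx'}
```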

| Variable | Default | Description |
|----------|---------|-------------|
| `SIGN_SERVER_URL` | `http://localhost:8989` | URL of the XHS signing proxy |
| `XHS_COOKIE_PATH` | `conf/xhs_cookies.txt` | Path to cookie file (overrides default location) |
| `XIAOHONGSHU_A1` | `None` | `a1` cookie value (legacy fallback) |
| `XIAOHONGSHU_WEBID` | `None` | `web_id` cookie value (legacy fallback) |
| `XIAOHONGSHU_WEBSESSION` | `None` | `web_session` cookie value (legacy fallback) |

#### Cloud Services

| Variable | Description |
@@ -193,7 +229,7 @@ See `template.env` for a complete reference with comments.
- [x] WeChat Public Account Articles
- [x] Zhihu
- [x] Douban
- [ ] Xiaohongshu
- [x] Xiaohongshu

### Video

@@ -211,7 +247,7 @@ The GitHub Actions pipeline (`.github/workflows/ci.yml`) automatically builds an

The HTML to Telegra.ph converter function is based on [html-telegraph-poster](https://github.com/mercuree/html-telegraph-poster). I separated it from this project as an independent Python package: [html-telegraph-poster-v2](https://github.com/aturret/html-telegraph-poster-v2).

The Xiaohongshu scraper is based on [MediaCrawler](https://github.com/NanmiCoder/MediaCrawler).
The original Xiaohongshu scraper was based on [MediaCrawler](https://github.com/NanmiCoder/MediaCrawler). The current implementation uses a custom httpx-based adapter with an external signing proxy.

The Weibo scraper is based on [weiboSpider](https://github.com/dataabc/weiboSpider).

27 changes: 27 additions & 0 deletions apps/api/src/config.py
@@ -6,6 +6,7 @@
import gettext
import secrets

from fastfetchbot_shared.utils.logger import logger
from fastfetchbot_shared.utils.parse import get_env_bool

env = os.environ
@@ -89,6 +90,32 @@
XHS_ENABLE_IP_PROXY = get_env_bool(env, "XHS_ENABLE_IP_PROXY", False)
XHS_SAVE_LOGIN_STATE = get_env_bool(env, "XHS_SAVE_LOGIN_STATE", True)

# XHS sign server and cookie file
from fastfetchbot_shared.config import SIGN_SERVER_URL as XHS_SIGN_SERVER_URL
from fastfetchbot_shared.config import XHS_COOKIE_PATH as _XHS_COOKIE_PATH

xhs_cookie_path = _XHS_COOKIE_PATH or os.path.join(conf_dir, "xhs_cookies.txt")

# Load XHS cookies from file (similar to Zhihu cookie loading)
XHS_COOKIE_STRING = ""
if os.path.exists(xhs_cookie_path):
try:
with open(xhs_cookie_path, "r", encoding="utf-8") as f:
XHS_COOKIE_STRING = f.read().strip()
except (IOError, OSError) as e:
logger.error(f"Error reading XHS cookie file: {e}")
XHS_COOKIE_STRING = ""
else:
# Fallback: build cookie string from individual env vars (backward compat)
cookie_parts = []
if XIAOHONGSHU_A1:
cookie_parts.append(f"a1={XIAOHONGSHU_A1}")
if XIAOHONGSHU_WEBID:
cookie_parts.append(f"web_id={XIAOHONGSHU_WEBID}")
if XIAOHONGSHU_WEBSESSION:
cookie_parts.append(f"web_session={XIAOHONGSHU_WEBSESSION}")
XHS_COOKIE_STRING = "; ".join(cookie_parts)

# Zhihu
FXZHIHU_HOST = env.get("FXZHIHU_HOST", "fxzhihu.com")

123 changes: 32 additions & 91 deletions apps/api/src/services/scrapers/xiaohongshu/__init__.py
@@ -1,23 +1,14 @@
import asyncio
from typing import Any
from urllib.parse import urlparse

import httpx
import jmespath

from fastfetchbot_shared.models.metadata_item import MetadataItem, MediaFile, MessageType
from fastfetchbot_shared.utils.network import HEADERS
from src.config import JINJA2_ENV, HTTP_REQUEST_TIMEOUT
from .xhs.core import XiaoHongShuCrawler
from .xhs.client import XHSClient
from .xhs import proxy_account_pool

from fastfetchbot_shared.utils.logger import logger
from fastfetchbot_shared.utils.parse import (
unix_timestamp_to_utc,
get_html_text_length,
wrap_text_into_html,
)
from src.config import JINJA2_ENV, XHS_COOKIE_STRING, XHS_SIGN_SERVER_URL
from .adaptar import XhsSinglePostAdapter

environment = JINJA2_ENV
short_text_template = environment.get_template("xiaohongshu_short_text.jinja2")
@@ -42,78 +33,51 @@ def __init__(self, url: str, data: Any, **kwargs):
self.raw_content = None

async def get_item(self) -> dict:
await self.get_xiaohongshu()
await self._get_xiaohongshu()
return self.to_dict()

async def get_xiaohongshu(self) -> None:
if self.url.find("xiaohongshu.com") == -1:
async with httpx.AsyncClient() as client:
resp = await client.get(
self.url,
headers=HEADERS,
follow_redirects=True,
timeout=HTTP_REQUEST_TIMEOUT,
)
if (
resp.history
): # if there is a redirect, the request will have a response chain
for h in resp.history:
print(h.status_code, h.url)
self.url = str(resp.url)
urlparser = urlparse(self.url)
self.id = urlparser.path.split("/")[-1]
crawler = XiaoHongShuCrawler()
account_pool = proxy_account_pool.create_account_pool()
crawler.init_config("xhs", "cookie", account_pool)
note_detail = None
for _ in range(5):
try:
note_detail = await crawler.start(id=self.id)
break
except Exception as e:
await asyncio.sleep(3)
logger.error(f"error: {e}")
logger.error(f"retrying...")
if not note_detail:
raise Exception("重试了这么多次还是无法签名成功,寄寄寄")
# logger.debug(f"json_data: {json.dumps(note_detail, ensure_ascii=False, indent=4)}")
parsed_data = self.process_note_json(note_detail)
await self.process_xiaohongshu_note(parsed_data)
async def _get_xiaohongshu(self) -> None:
async with XhsSinglePostAdapter(
cookies=XHS_COOKIE_STRING,
sign_server_endpoint=XHS_SIGN_SERVER_URL,
) as adapter:
result = await adapter.fetch_post(note_url=self.url)
Comment on lines +40 to +44:

**P2 — Retry Xiaohongshu fetch on transient sign/API failures.**

The new flow performs a single `fetch_post` call and propagates errors directly, while the previous implementation retried up to 5 times with a delay. Under transient sign-server/API/network failures (timeouts, brief 5xx responses, intermittent blocks), the entire scrape now fails immediately, which increases flaky user-facing errors. Reintroducing bounded retries here would restore the reliability behavior users previously had.
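A bounded-retry wrapper along the lines this comment suggests might look like the sketch below. The function name, attempt count, and broad `except` are illustrative; real code would narrow the caught exceptions to network/sign-server errors.

```python
import asyncio


async def fetch_with_retries(adapter, note_url: str, attempts: int = 5, delay: float = 3.0) -> dict:
    """Call adapter.fetch_post, retrying on transient failures like the old crawler did."""
    last_exc: Exception | None = None
    for _ in range(attempts):
        try:
            return await adapter.fetch_post(note_url=note_url)
        except Exception as exc:  # real code should catch httpx/network errors only
            last_exc = exc
            await asyncio.sleep(delay)
    raise RuntimeError(f"XHS fetch failed after {attempts} attempts") from last_exc
```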

note = result["note"]
self.id = note.get("note_id")
self.url = result["url"]
await self._process_xiaohongshu_note(note)

async def process_xiaohongshu_note(self, json_data: dict):
async def _process_xiaohongshu_note(self, json_data: dict):
user = json_data.get("user", {}) or {}
self.title = json_data.get("title")
self.author = json_data.get("author")
self.author = user.get("nickname")
if not self.title and self.author:
self.title = f"{self.author}的小红书笔记"
self.author_url = "https://www.xiaohongshu.com/user/profile/" + json_data.get(
"user_id"
self.author_url = (
"https://www.xiaohongshu.com/user/profile/" + user.get("user_id", "")
)
self.raw_content = json_data.get("raw_content")
logger.debug(f"{json_data.get('created')}")
self.raw_content = json_data.get("desc", "")
raw_time = json_data.get("time", 0)
raw_updated = json_data.get("last_update_time", 0)
self.created = (
unix_timestamp_to_utc(json_data.get("created") / 1000)
if json_data.get("created")
else None
unix_timestamp_to_utc(int(raw_time) / 1000) if raw_time else None
)
self.updated = (
unix_timestamp_to_utc(json_data.get("updated") / 1000)
if json_data.get("updated")
else None
unix_timestamp_to_utc(int(raw_updated) / 1000) if raw_updated else None
)
self.like_count = json_data.get("like_count")
self.like_count = json_data.get("liked_count")
self.collected_count = json_data.get("collected_count")
self.comment_count = json_data.get("comment_count")
self.share_count = json_data.get("share_count")
self.ip_location = json_data.get("ip_location")
if json_data.get("image_list"):
for image_url in json_data.get("image_list"):
self.media_files.append(MediaFile(url=image_url, media_type="image"))
if json_data.get("video"):
self.media_files.append(
MediaFile(url=json_data.get("video"), media_type="video")
)
for image_url in json_data.get("image_list", []) or []:
self.media_files.append(MediaFile(url=image_url, media_type="image"))
video_urls = json_data.get("video_urls", []) or []
if video_urls:
self.media_files.append(MediaFile(url=video_urls[0], media_type="video"))
data = self.__dict__
data["raw_content"] = data["raw_content"].replace("\t", "")
raw_content = self.raw_content or ""
data["raw_content"] = raw_content.replace("\t", "")
if data["raw_content"].endswith("\n"):
data["raw_content"] = data["raw_content"][:-1]
self.text = short_text_template.render(data=data)
@@ -124,30 +88,7 @@ async def process_xiaohongshu_note(self, json_data: dict):
if media_file.media_type == "image":
data["raw_content"] += f'<p><img src="{media_file.url}" alt=""/></p>'
elif media_file.media_type == "video":
data[
"raw_content"
] += (
data["raw_content"] += (
f'<p><video src="{media_file.url}" controls="controls"></video></p>'
)
self.content = content_template.render(data=data)

@staticmethod
def process_note_json(json_data: dict):
expression = """
{
title: title,
raw_content: desc,
author: user.nickname,
user_id: user.user_id,
image_list: image_list[*].url,
video: video.media.stream.h264[0].master_url,
like_count: interact_info.liked_count,
collected_count: interact_info.collected_count,
comment_count: interact_info.comment_count,
share_count: interact_info.share_count,
ip_location: ip_location,
created: time,
updated: last_update_time
}
"""
return jmespath.search(expression, json_data)