Conversation

⚠️ Caution: Review failed — the pull request is closed.

ℹ️ Recent review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
📝 Walkthrough

Replaces the Playwright-based Xiaohongshu crawler stack with a new HTTP adapter (XhsSinglePostAdapter), removes legacy xhs modules, adds sign-server and cookie-file configuration, updates scraper wiring and templates, and refreshes docs and env templates.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client/API
    participant Adapter as XhsSinglePostAdapter
    participant SignServer as Sign Server
    participant XhsAPI as Xiaohongshu API
    participant Parser as HTML Parser
    Client->>Adapter: fetch_post(note_url, with_comments?)
    Adapter->>Adapter: parse_xhs_note_url(note_url)
    Adapter->>SignServer: request signature for uri/data
    SignServer-->>Adapter: returns signed headers
    Adapter->>XhsAPI: _signed_post or _signed_get with signed headers
    alt API returns JSON
        XhsAPI-->>Adapter: JSON response
        Adapter->>Adapter: _parse_api_response() and _normalize_note()
    else API fails or non-JSON
        Adapter->>XhsAPI: fetch note HTML
        XhsAPI-->>Parser: HTML content
        Parser-->>Adapter: extracted fields
        Adapter->>Adapter: _normalize_note()
    end
    opt with_comments
        Adapter->>XhsAPI: _fetch_comments (paged)
        XhsAPI-->>Adapter: comment pages
        Adapter->>Adapter: _normalize_comment() (incl. sub-comments)
    end
    Adapter-->>Client: normalized note dict (media, content, metadata, comments?)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
@codex review this PR
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0d8b5c3d8d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```python
HTTP_REQUEST_TIMEOUT = env.get("HTTP_REQUEST_TIMEOUT", 30)

# XHS (Xiaohongshu) shared configuration
SIGN_SERVER_URL = env.get("SIGN_SERVER_URL", "http://localhost:8989")
```
Preserve localhost fallback when SIGN_SERVER_URL is empty
This reads SIGN_SERVER_URL with env.get("SIGN_SERVER_URL", "http://localhost:8989"), which does not apply the default when the key exists but is blank. In this repo’s common setup (docker-compose.template.yml uses env_file: .env, and template.env now adds SIGN_SERVER_URL=), the value becomes an empty string, so XHS signing requests are built from an invalid base URL instead of the documented localhost default and scraping fails at runtime. Treat empty values as unset (e.g., env.get(...) or "http://localhost:8989").
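The distinction can be illustrated with a small sketch; the helper name `resolve_sign_server_url` is hypothetical, while the key and default mirror the snippet above:

```python
import os

DEFAULT_SIGN_SERVER = "http://localhost:8989"

def resolve_sign_server_url(env=None):
    """Hypothetical helper illustrating the empty-value-as-unset fix."""
    env = os.environ if env is None else env
    # env.get(key, default) only falls back when the key is ABSENT;
    # an env_file line like `SIGN_SERVER_URL=` yields "" and slips through,
    # so `or` treats blank values as unset too.
    return env.get("SIGN_SERVER_URL", DEFAULT_SIGN_SERVER) or DEFAULT_SIGN_SERVER
```

With this shape, a blank `SIGN_SERVER_URL=` line in `.env` still resolves to the documented localhost default.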
```python
async with XhsSinglePostAdapter(
    cookies=XHS_COOKIE_STRING,
    sign_server_endpoint=XHS_SIGN_SERVER_URL,
) as adapter:
    result = await adapter.fetch_post(note_url=self.url)
```
Retry Xiaohongshu fetch on transient sign/API failures
The new flow performs a single fetch_post call and propagates errors directly, while the previous implementation retried up to 5 times with delay. Under transient sign-server/API/network failures (timeouts, brief 5xx, intermittent blocks), this now fails the entire scrape immediately and increases flaky user-facing errors. Reintroducing bounded retries here would restore the reliability behavior users previously had.
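One way to reintroduce bounded retries is a small generic wrapper around the fetch; this is a sketch (the helper name, attempt count, and backoff are illustrative, not the project's previous implementation):

```python
import asyncio

async def retry_async(fn, attempts=5, base_delay=1.0):
    """Retry an async callable with linear backoff, re-raising the last error."""
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return await fn()
        except Exception as exc:  # narrow this to transient error types in real code
            last_exc = exc
            if attempt < attempts:
                await asyncio.sleep(base_delay * attempt)
    raise last_exc
```

The scraper could then wrap the single `fetch_post` call, e.g. `await retry_async(lambda: self._fetch_once())`, so a brief sign-server 5xx or timeout no longer fails the whole scrape.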
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
apps/api/src/config.py (1)
82-90: ⚠️ Potential issue | 🟡 Minor — Remove unused legacy XHS configuration variables from `config.py`.

These variables (`XIAOHONGSHU_COOKIES`, `XHS_PHONE_LIST`, `XHS_IP_PROXY_LIST`, `XHS_ENABLE_IP_PROXY`, `XHS_SAVE_LOGIN_STATE`) are no longer referenced anywhere in the codebase after the migration to `XhsSinglePostAdapter`. The new adapter only uses `XHS_COOKIE_STRING` and `XHS_SIGN_SERVER_URL`. Remove these dead config variables to keep the configuration file clean.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/api/src/config.py` around lines 82 - 90, Remove the unused legacy XHS config variables from config.py by deleting the definitions XIAOHONGSHU_COOKIES, XHS_PHONE_LIST, XHS_IP_PROXY_LIST, XHS_ENABLE_IP_PROXY, and XHS_SAVE_LOGIN_STATE; ensure you leave the current XHS settings (XHS_COOKIE_STRING and XHS_SIGN_SERVER_URL) intact, and run a quick repo-wide search for those removed symbols to confirm they are not referenced elsewhere before committing.
🧹 Nitpick comments (1)
apps/api/src/config.py (1)
93-94: Move the mid-file imports to the top of the file.
PEP 8 specifies that all import statements should be "put at the top of the file, just after any module comments and docstrings, and before module globals and constants." These two imports are inserted after ~90 lines of variable assignments, which breaks the established pattern in the file and makes dependencies harder to locate.

♻️ Proposed fix

Remove lines 93–94 from their current position and add alongside the other `fastfetchbot_shared` import at the top of the file:

```diff
 from fastfetchbot_shared.utils.parse import get_env_bool
+from fastfetchbot_shared.config import SIGN_SERVER_URL as XHS_SIGN_SERVER_URL
+from fastfetchbot_shared.config import XHS_COOKIE_PATH as _XHS_COOKIE_PATH
```

Then delete them from lines 93–94.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/api/src/config.py` around lines 93 - 94, The two mid-file imports (SIGN_SERVER_URL as XHS_SIGN_SERVER_URL and XHS_COOKIE_PATH as _XHS_COOKIE_PATH from fastfetchbot_shared.config) must be moved to the top with the other fastfetchbot_shared imports; remove the duplicate lines at lines 93–94 and add those two import names alongside the existing fastfetchbot_shared import block so XHS_SIGN_SERVER_URL and _XHS_COOKIE_PATH are declared together with the module-level imports.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.gitignore:
- Around line 275-276: Remove the duplicate gitignore entry
'apps/worker/celerybeat-schedule.db' so it only appears once (or remove both),
and rely on the existing '*.db' pattern to cover it; update the .gitignore by
deleting the repeated 'apps/worker/celerybeat-schedule.db' line to avoid
redundancy.
In `@apps/api/src/config.py`:
- Around line 104-106: Replace the print call in the exception handler with
Loguru logging: in the except block that catches (IOError, OSError) where
XHS_COOKIE_STRING is set to "", call logger.error(...) and include the exception
details (e) and context ("Error reading XHS cookie file") instead of using
print; ensure Loguru's logger is imported/available in this module (logger from
loguru) before using it so the error is recorded consistently with the project's
logging conventions.
In `@apps/api/src/services/scrapers/xiaohongshu/adaptar.py`:
- Around line 43-61: In __init__, validate that a non-empty sign server URL is
available (either the provided sign_server_endpoint or the fallback
SIGN_SERVER_URL) and raise a clear error if both are empty to fail fast; update
the constructor logic around sign_server_endpoint (and its rstrip usage) so that
after resolving (sign_server_endpoint or SIGN_SERVER_URL).rstrip("/") you check
for empty string and raise ValueError (or a configuration-specific exception)
referencing the constructor/__init__ and _sign_headers behavior so
misconfiguration is surfaced immediately instead of causing httpx runtime URL
errors.
- Around line 225-260: The regex in _fetch_note_by_html currently uses ({.*})
which is greedy and won't match across newlines; change the search to use a
non-greedy capture and DOTALL mode so the JSON blob spanning multiple lines is
matched (e.g., use a pattern like "window.__INITIAL_STATE__=({.*?})</script>"
with the re.DOTALL flag) when calling re.search(...) so match.group(1) yields
the full JSON state for json.loads.
- Around line 438-459: The _extract_video_urls method builds a hardcoded HTTP
CDN URL using origin_video_key which can cause mixed-content warnings; update
the returned URL to use HTTPS (i.e., change the scheme for the
sns-video-bd.xhscdn.com URL built from origin_video_key in _extract_video_urls)
and keep the logic that falls back to collecting master_url entries from the
h264 stream unchanged.
In `@packages/shared/fastfetchbot_shared/utils/config.py`:
- Line 55: The regex entry in BANNED_PATTERNS for GitHub uses an unescaped dot
("github.com") which allows any character at that position; update the pattern
in BANNED_PATTERNS (the string currently
r"github.com\/[A-Za-z0-9_-]+\/[A-Za-z0-9_-]+") to escape the dot as
r"github\.com\/[A-Za-z0-9_-]+\/[A-Za-z0-9_-]+" so it only matches "github.com"
exactly.
In `@README.md`:
- Around line 182-184: The fenced code block containing the cookie-file example
(the block showing "a1=xxxxxxxx; web_id=xxxxxxxx; web_session=xxxxxxxx") is
missing a language identifier and triggers markdownlint MD040; update that
fenced block to include a language specifier such as "text" (or "sh") so the
block becomes ```text ... ``` to mark it as plain text.
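The escaping point in the `BANNED_PATTERNS` item above can be demonstrated directly (the look-alike string is contrived):

```python
import re

# Unescaped dot: matches ANY character in that position
unescaped = r"github.com\/[A-Za-z0-9_-]+\/[A-Za-z0-9_-]+"
# Escaped dot: matches only a literal "."
escaped = r"github\.com\/[A-Za-z0-9_-]+\/[A-Za-z0-9_-]+"

# The unescaped pattern lets a look-alike host through
assert re.search(unescaped, "githubXcom/user/repo") is not None
assert re.search(escaped, "githubXcom/user/repo") is None
assert re.search(escaped, "github.com/user/repo") is not None
```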
---
Outside diff comments:
In `@apps/api/src/config.py`:
- Around line 82-90: Remove the unused legacy XHS config variables from
config.py by deleting the definitions XIAOHONGSHU_COOKIES, XHS_PHONE_LIST,
XHS_IP_PROXY_LIST, XHS_ENABLE_IP_PROXY, and XHS_SAVE_LOGIN_STATE; ensure you
leave the current XHS settings (XHS_COOKIE_STRING and XHS_SIGN_SERVER_URL)
intact, and run a quick repo-wide search for those removed symbols to confirm
they are not referenced elsewhere before committing.
---
Nitpick comments:
In `@apps/api/src/config.py`:
- Around line 93-94: The two mid-file imports (SIGN_SERVER_URL as
XHS_SIGN_SERVER_URL and XHS_COOKIE_PATH as _XHS_COOKIE_PATH from
fastfetchbot_shared.config) must be moved to the top with the other
fastfetchbot_shared imports; remove the duplicate lines at lines 93–94 and add
those two import names alongside the existing fastfetchbot_shared import block
so XHS_SIGN_SERVER_URL and _XHS_COOKIE_PATH are declared together with the
module-level imports.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
- .gitignore
- CLAUDE.md
- README.md
- apps/api/src/config.py
- apps/api/src/services/scrapers/xiaohongshu/__init__.py
- apps/api/src/services/scrapers/xiaohongshu/adaptar.py
- apps/api/src/services/scrapers/xiaohongshu/xhs/__init__.py
- apps/api/src/services/scrapers/xiaohongshu/xhs/base_crawler.py
- apps/api/src/services/scrapers/xiaohongshu/xhs/client.py
- apps/api/src/services/scrapers/xiaohongshu/xhs/core.py
- apps/api/src/services/scrapers/xiaohongshu/xhs/exception.py
- apps/api/src/services/scrapers/xiaohongshu/xhs/field.py
- apps/api/src/services/scrapers/xiaohongshu/xhs/help.py
- apps/api/src/services/scrapers/xiaohongshu/xhs/login.py
- apps/api/src/services/scrapers/xiaohongshu/xhs/proxy_account_pool.py
- apps/api/src/services/scrapers/xiaohongshu/xhs/utils.py
- apps/api/src/templates/xiaohongshu_content.jinja2
- packages/shared/fastfetchbot_shared/config.py
- packages/shared/fastfetchbot_shared/utils/config.py
- template.env
💤 Files with no reviewable changes (10)
- apps/api/src/services/scrapers/xiaohongshu/xhs/login.py
- apps/api/src/services/scrapers/xiaohongshu/xhs/exception.py
- apps/api/src/services/scrapers/xiaohongshu/xhs/help.py
- apps/api/src/services/scrapers/xiaohongshu/xhs/client.py
- apps/api/src/services/scrapers/xiaohongshu/xhs/field.py
- apps/api/src/services/scrapers/xiaohongshu/xhs/__init__.py
- apps/api/src/services/scrapers/xiaohongshu/xhs/base_crawler.py
- apps/api/src/services/scrapers/xiaohongshu/xhs/proxy_account_pool.py
- apps/api/src/services/scrapers/xiaohongshu/xhs/utils.py
- apps/api/src/services/scrapers/xiaohongshu/xhs/core.py
```python
async def _fetch_note_by_html(
    self,
    note_id: str,
    xsec_token: str,
    xsec_source: str,
) -> Optional[Dict[str, Any]]:
    detail_url = f"{XHS_WEB_URL}/explore/{note_id}"
    if xsec_token:
        detail_url += f"?xsec_token={xsec_token}&xsec_source={xsec_source}"

    resp = await self._http.get(detail_url, headers=self._base_headers())
    if resp.status_code != 200:
        return None

    text = resp.text or ""
    match = re.search(r"window.__INITIAL_STATE__=({.*})</script>", text)
    if not match:
        return None

    try:
        state = json.loads(match.group(1).replace("undefined", "null"))
        note = (
            state.get("note", {})
            .get("noteDetailMap", {})
            .get(note_id, {})
            .get("note")
        )
    except json.JSONDecodeError:
        return None

    if not note:
        return None
    note["xsec_token"] = xsec_token
    note["xsec_source"] = xsec_source
    return self._normalize_note(note)
```
Make the HTML state regex DOTALL and non-greedy.
The HTML JSON blob often spans newlines, so the current regex is likely to miss and make the fallback path fail.
🔧 Suggested fix
```diff
-        match = re.search(r"window.__INITIAL_STATE__=({.*})</script>", text)
+        match = re.search(
+            r"window.__INITIAL_STATE__=(\{.*?\})</script>", text, re.S
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/api/src/services/scrapers/xiaohongshu/adaptar.py` around lines 225 -
260, The regex in _fetch_note_by_html currently uses ({.*}) which is greedy and
won't match across newlines; change the search to use a non-greedy capture and
DOTALL mode so the JSON blob spanning multiple lines is matched (e.g., use a
pattern like "window.__INITIAL_STATE__=({.*?})</script>" with the re.DOTALL
flag) when calling re.search(...) so match.group(1) yields the full JSON state
for json.loads.
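The behavioral difference can be checked with a contrived multiline page (the HTML below is a stand-in, not real Xiaohongshu markup):

```python
import json
import re

html = (
    "<script>window.__INITIAL_STATE__={\n"
    '  "note": {"noteDetailMap": {"abc": {"note": {"title": "hi"}}}}\n'
    "}</script>"
)

# Current pattern: without re.S, '.' stops at newlines, so multiline blobs never match
assert re.search(r"window.__INITIAL_STATE__=({.*})</script>", html) is None

# Suggested pattern: non-greedy capture with re.S so '.' crosses newlines
match = re.search(r"window\.__INITIAL_STATE__=(\{.*?\})</script>", html, re.S)
state = json.loads(match.group(1))
assert state["note"]["noteDetailMap"]["abc"]["note"]["title"] == "hi"
```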
```python
@staticmethod
def _extract_video_urls(note_item: Dict[str, Any]) -> List[str]:
    if note_item.get("type") != "video":
        return []
    video = note_item.get("video", {}) or {}
    consumer = video.get("consumer", {}) or {}
    origin_video_key = consumer.get("origin_video_key") or consumer.get("originVideoKey")
    if origin_video_key:
        return [f"http://sns-video-bd.xhscdn.com/{origin_video_key}"]

    urls = []
    h264_list = (
        (video.get("media", {}) or {})
        .get("stream", {})
        .get("h264", [])
        or []
    )
    for item in h264_list:
        master_url = item.get("master_url")
        if master_url:
            urls.append(master_url)
    return urls
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, check if the file exists
git ls-files | grep -i "xiaohongshu"
```

Repository: aturret/FastFetchBot
Length of output: 275

🏁 Script executed:

```shell
# Search for HTTP URLs in the codebase
rg "http://" --type py | head -20
```

Repository: aturret/FastFetchBot
Length of output: 1272

🏁 Script executed:

```shell
# Check for any configuration or environment variables related to this CDN
rg "sns-video-bd|xhscdn" --type py
```

Repository: aturret/FastFetchBot
Length of output: 191

🏁 Script executed:

```shell
# Look for similar patterns with HTTPS in other scrapers
rg "https://.*\.xhscdn\.com" --type py
```

Repository: aturret/FastFetchBot
Length of output: 46

🏁 Script executed:

```shell
# Check for comments or documentation about xiaohongshu/CDN
rg -A 5 -B 5 "xhscdn|xiaohongshu.*cdn" --type py
```

Repository: aturret/FastFetchBot
Length of output: 1099

🏁 Script executed:

```shell
# Look at the adaptar.py file to see more context around this function
cat -n apps/api/src/services/scrapers/xiaohongshu/adaptar.py | head -100
```

Repository: aturret/FastFetchBot
Length of output: 4388

🏁 Script executed:

```shell
# Check if there's any environment configuration for URL schemes
rg "HTTPS|https_only|secure" --type py apps/api/src/
```

Repository: aturret/FastFetchBot
Length of output: 46
🌐 Web query:
xiaohongshu xhscdn HTTPS support CDN
💡 Result:
- `xhscdn.com`/`xhscdn.net` are domains used by Xiaohongshu (XHS) for serving static assets (images/video) via many subdomains (it's not a "public CDN you sign up for"). XHS's own security center explicitly identifies `*.xhscdn.com` and `*.xhscdn.net` as their CDN domains used for user-uploaded content. [1]
- HTTPS support is not consistent at the apex/root domain: multiple checkers report that `xhscdn.com` itself (the bare domain) does not support HTTPS. [2]
- Many specific subdomains do support HTTPS/TLS (i.e., they present valid certificates), e.g. `magpie.xhscdn.com` is reported as having a valid SSL certificate. [3]

Implication: if you're trying to fetch an XHS asset, you typically need to use the exact subdomain hostname XHS provides (often works over `https://<subdomain>.xhscdn.com/...`), rather than assuming `https://xhscdn.com/...` will work. [2][3]
Sources
[1] Xiaohongshu Security Response Center: CDN domains include *.xhscdn.com / *.xhscdn.net
[2] Hypestat SSL checker result for xhscdn.com (reports no HTTPS on apex)
[3] ScamAdviser report for magpie.xhscdn.com (reports valid SSL certificate)
Use HTTPS for CDN video URLs to prevent mixed content warnings.
The hardcoded HTTP URL will trigger mixed content warnings when served in HTTPS contexts. While the base xhscdn.com domain has inconsistent HTTPS support, individual subdomains like sns-video-bd.xhscdn.com typically support HTTPS. This aligns with other Xiaohongshu API calls in this file which already use HTTPS (XHS_API_URL, XHS_WEB_URL).
🔧 Suggested fix

```diff
-        return [f"http://sns-video-bd.xhscdn.com/{origin_video_key}"]
+        return [f"https://sns-video-bd.xhscdn.com/{origin_video_key}"]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/api/src/services/scrapers/xiaohongshu/adaptar.py` around lines 438 -
459, The _extract_video_urls method builds a hardcoded HTTP CDN URL using
origin_video_key which can cause mixed-content warnings; update the returned URL
to use HTTPS (i.e., change the scheme for the sns-video-bd.xhscdn.com URL built
from origin_video_key in _extract_video_urls) and keep the logic that falls back
to collecting master_url entries from the h264 stream unchanged.
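Assuming the field names in the diff above are accurate, a standalone sketch of the helper with the HTTPS scheme applied looks like this (an illustration, not the committed implementation):

```python
from typing import Any, Dict, List

def extract_video_urls(note_item: Dict[str, Any]) -> List[str]:
    """Sketch of _extract_video_urls returning an https:// CDN URL."""
    if note_item.get("type") != "video":
        return []
    video = note_item.get("video") or {}
    consumer = video.get("consumer") or {}
    key = consumer.get("origin_video_key") or consumer.get("originVideoKey")
    if key:
        # https avoids mixed-content warnings when the page is served over TLS
        return [f"https://sns-video-bd.xhscdn.com/{key}"]
    # Fallback: collect master_url entries from the h264 stream, unchanged
    streams = ((video.get("media") or {}).get("stream") or {}).get("h264") or []
    return [s["master_url"] for s in streams if s.get("master_url")]
```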
```text
a1=xxxxxxxx; web_id=xxxxxxxx; web_session=xxxxxxxx
```
Add a language identifier to the fenced code block.
The cookie-file example block has no language specifier, which triggers markdownlint MD040. Since it's a plain text example, use text (or sh).
🔧 Proposed fix

````diff
-```
+```text
 a1=xxxxxxxx; web_id=xxxxxxxx; web_session=xxxxxxxx
````
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 182-182: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@README.md` around lines 182 - 184, The fenced code block containing the
cookie-file example (the block showing "a1=xxxxxxxx; web_id=xxxxxxxx;
web_session=xxxxxxxx") is missing a language identifier and triggers
markdownlint MD040; update that fenced block to include a language specifier
such as "text" (or "sh") so the block becomes ```text ... ``` to mark it as
plain text.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Summary by CodeRabbit
New Features
Improvements
Documentation
Chores