
feat: integrate FxZhihu implementation #66

Merged

aturret merged 1 commit into main from zhihu-local-scraper on Mar 20, 2026
Conversation

@aturret (Owner) commented Mar 20, 2026

Summary by CodeRabbit

  • New Features

    • Added API-based scraping method for Zhihu content as an alternative to existing methods
    • Introduced z_c0 authentication cookie configuration option for Zhihu API access
  • Improvements

    • Enhanced content processing for images, links, and reference extraction
  • Tests

    • Added test coverage for content processing functions

@coderabbitai (Contributor, bot) commented Mar 20, 2026

📝 Walkthrough

This PR extends the Zhihu scraper with a new "api" method alongside the existing "fxzhihu" approach: it introduces cookie-based direct API authentication via the ZHIHU_Z_C0 configuration, adds HTML post-processing utilities for image and link handling, and hardens API field extraction with safe defaults.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Application Configuration**<br>`apps/api/src/config.py`, `template.env` | Added the `ZHIHU_Z_C0` environment variable for Zhihu API authentication, with documentation describing its precedence over config-file-based cookies. |
| **Zhihu Scraper Configuration**<br>`apps/api/src/services/scrapers/zhihu/config.py` | Updated `ZHIHU_API_ANSWER_PARAMS` to unencoded format; expanded `ALL_METHODS` to include the "api" method; introduced a `ZHIHU_API_COOKIE` constant with fallback logic prioritizing `ZHIHU_Z_C0` over config-file-based cookies. |
| **Zhihu Scraper Core Logic**<br>`apps/api/src/services/scrapers/zhihu/__init__.py` | Switched the default method from "fxzhihu" to "api"; replaced the randomized User-Agent with a static "node"; set the Cookie header conditionally from `ZHIHU_API_COOKIE`; changed the API host for articles; added a post-processing pipeline (`fix_images_and_links`, `unmask_zhihu_links`) for API responses; hardened field extraction with `.get()` defaults and enhanced status/video URL parsing logic. |
| **Content Processing Module**<br>`apps/api/src/services/scrapers/zhihu/content_processing.py` | New utility module providing `fix_images_and_links()` to replace image sources and unwrap underline tags, `extract_references()` to aggregate and format reference metadata from HTML, and `unmask_zhihu_links()` to decode Zhihu redirect links. |
| **Tests**<br>`tests/test_zhihu_content_processing.py` | New test module validating the content processing functions: image source replacement, reference extraction, underline tag removal, and Zhihu link unmasking. |
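The cookie fallback described for `config.py` can be sketched as a small standalone function. This is a hypothetical illustration: the function name, the `env` parameter, and the `z_c0` key inside the config-file cookie dict are assumptions pieced together from this summary, not the actual code.

```python
def build_zhihu_api_cookie(env, config_cookies=None):
    """Build the Cookie header value for direct Zhihu API calls.

    Precedence (per the PR summary): the ZHIHU_Z_C0 environment variable
    wins over any z_c0 value found in config-file-based cookies.
    """
    z_c0 = env.get("ZHIHU_Z_C0")
    if not z_c0 and config_cookies:
        z_c0 = config_cookies.get("z_c0")  # hypothetical config-file cookie dict
    if not z_c0:
        return None  # no cookie available; caller can fall back to other methods
    return f"z_c0={z_c0}"
```

In the real module the first argument would presumably be `os.environ`; passing a plain dict here keeps the sketch testable in isolation.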

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Config as Configuration
    participant Scraper as Zhihu Scraper
    participant API as Zhihu API
    participant Processor as Content Processor

    Config->>Scraper: Load ZHIHU_Z_C0 or ZHIHU_COOKIES
    Note over Scraper: Build ZHIHU_API_COOKIE
    Scraper->>Scraper: Set method="api" if ZHIHU_API_COOKIE present
    Scraper->>API: GET request with Cookie header
    API-->>Scraper: Return JSON response
    Scraper->>Processor: Pass raw_content to fix_images_and_links()
    Processor-->>Scraper: Fixed HTML (images, links)
    Scraper->>Processor: Pass HTML to unmask_zhihu_links()
    Processor-->>Scraper: Unmasked links (decoded targets)
    Scraper->>Scraper: Extract fields with .get() defaults
    Scraper-->>Scraper: Return parsed content
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A new path emerges for Zhihu's tales so bright,
With cookies and magic, the API takes flight,
Images unmasked and links set free,
Content flows pure as can be! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 35.29%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main change: integrating the FxZhihu implementation by adding API support, content processing utilities, and updated configuration across multiple Zhihu-related files. |


@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
apps/api/src/services/scrapers/zhihu/__init__.py (1)

796-798: Simplify origin_pin check using .get().

As flagged by static analysis, the key check before dictionary access can be simplified: the pattern `"origin_pin" in data and data["origin_pin"]` can be replaced with `data.get("origin_pin")`.

♻️ Proposed fix
```diff
-        if "origin_pin" in data and data["origin_pin"]:
-            result["origin_pin_id"] = str(data["origin_pin"]["id"])
-            result["origin_pin_data"] = Zhihu._resolve_status_api_data(data["origin_pin"])
+        origin_pin = data.get("origin_pin")
+        if origin_pin:
+            result["origin_pin_id"] = str(origin_pin["id"])
+            result["origin_pin_data"] = Zhihu._resolve_status_api_data(origin_pin)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/services/scrapers/zhihu/__init__.py` around lines 796 - 798,
Replace the explicit key-and-truthy check "origin_pin" in data and
data["origin_pin"] with a single get call: retrieve origin_pin =
data.get("origin_pin") and if origin_pin: set result["origin_pin_id"] =
str(origin_pin["id"]) and result["origin_pin_data"] =
Zhihu._resolve_status_api_data(origin_pin); this simplifies the conditional and
avoids double dictionary lookup while preserving behavior.
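As a standalone sanity check (with made-up pin data, not the scraper's real structures), both forms return the same result across missing, falsy, and populated cases:

```python
def origin_pin_id_old(data):
    # original pattern: membership test plus truthiness check, three lookups
    if "origin_pin" in data and data["origin_pin"]:
        return str(data["origin_pin"]["id"])
    return None

def origin_pin_id_new(data):
    # refactored pattern: a single .get() lookup bound to a local name
    origin_pin = data.get("origin_pin")
    if origin_pin:
        return str(origin_pin["id"])
    return None

# missing key, explicit None, empty (falsy) dict, and a populated pin
cases = [{}, {"origin_pin": None}, {"origin_pin": {}}, {"origin_pin": {"id": 42}}]
assert all(origin_pin_id_old(c) == origin_pin_id_new(c) for c in cases)
```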
apps/api/src/services/scrapers/zhihu/content_processing.py (1)

39-44: Unused loop variable index.

The loop variable index is not used in the loop body. Consider using an underscore prefix to indicate it's intentionally unused.

♻️ Proposed fix
```diff
-    for index, ref in sorted_refs:
+    for _index, ref in sorted_refs:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/services/scrapers/zhihu/content_processing.py` around lines 39 -
44, The loop in sorted_refs = sorted(references.items(), key=lambda x:
int(x[0])) assigns the first element to index but never uses it; change the loop
header in the block that builds items (currently "for index, ref in
sorted_refs") to use an unused-variable name (e.g., "for _, ref in sorted_refs")
so it's clear index is intentionally ignored while preserving the sorted_refs,
references, items, and URL/html construction logic.
tests/test_zhihu_content_processing.py (1)

1-11: Consider using a proper import path instead of sys.path manipulation.

The sys.path manipulation is fragile and can break if the directory structure changes. Consider:

  1. Running tests from project root with proper PYTHONPATH
  2. Using relative imports if this is a package
  3. Adding a conftest.py that sets up the path once

However, the comment explains the rationale (avoiding heavy dependencies), so this is acceptable for now if the test infrastructure requires it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_zhihu_content_processing.py` around lines 1 - 11, The test
currently mutates sys.path to import content_processing which is fragile;
replace this inline sys.path manipulation by one of: (a) configure PYTHONPATH or
run tests from project root so tests import the module with a normal import, (b)
convert the tests folder into a package and use a relative import to import
content_processing, or (c) create a conftest.py that adjusts sys.path once for
the test suite; ensure tests still import fix_images_and_links,
extract_references, and unmask_zhihu_links from content_processing and remove
the sys.path.insert and os.path.dirname usage from
tests/test_zhihu_content_processing.py.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9954da61-4b3d-4755-9e5a-0d3388cc1e05

📥 Commits

Reviewing files that changed from the base of the PR and between 7d4e441 and d86e926.

📒 Files selected for processing (6)
  • apps/api/src/config.py
  • apps/api/src/services/scrapers/zhihu/__init__.py
  • apps/api/src/services/scrapers/zhihu/config.py
  • apps/api/src/services/scrapers/zhihu/content_processing.py
  • template.env
  • tests/test_zhihu_content_processing.py

Comment on lines +56 to +63
```python
try:
    parsed = urlparse(href)
    qs = parse_qs(parsed.query)
    target = qs.get("target", [None])[0]
    if target:
        a_tag["href"] = unquote(target)
except Exception:
    pass
```
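The decoding chain above can be exercised standalone with a made-up masked URL (Zhihu wraps outbound links as link.zhihu.com redirects carrying a percent-encoded `target` query parameter). Note that `parse_qs` already percent-decodes values once, so the trailing `unquote` is a no-op for singly-encoded targets and only matters for double-encoded ones:

```python
from urllib.parse import urlparse, parse_qs, unquote

# hypothetical masked link for illustration
masked = "https://link.zhihu.com/?target=https%3A//example.com/page%3Fid%3D1"
parsed = urlparse(masked)
qs = parse_qs(parsed.query)           # parse_qs percent-decodes the value once
target = qs.get("target", [None])[0]  # 'https://example.com/page?id=1'
unmasked = unquote(target)            # idempotent here; no encoded chars left
print(unmasked)  # -> https://example.com/page?id=1
```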

⚠️ Potential issue | 🟡 Minor

Consider logging failed URL unmasking attempts.

The silent except: pass makes debugging difficult when URL parsing fails unexpectedly. Per coding guidelines, Loguru should be used for logging. Consider logging at debug level to aid troubleshooting without cluttering normal output.

🛠️ Proposed fix
```diff
+from fastfetchbot_shared.utils.logger import logger
 from urllib.parse import urlparse, parse_qs, unquote
```

```diff
             try:
                 parsed = urlparse(href)
                 qs = parse_qs(parsed.query)
                 target = qs.get("target", [None])[0]
                 if target:
                     a_tag["href"] = unquote(target)
-            except Exception:
-                pass
+            except Exception as e:
+                logger.debug(f"Failed to unmask Zhihu link {href}: {e}")
```
🧰 Tools
🪛 Ruff (0.15.6)

[error] 62-63: try-except-pass detected, consider logging the exception (S110)

[warning] 62-62: Do not catch blind exception: Exception (BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/services/scrapers/zhihu/content_processing.py` around lines 56 -
63, The try/except in the URL unmasking block silently swallows errors; update
the block in content_processing.py that parses href (uses urlparse, parse_qs,
and unquote to set a_tag["href"]) to catch Exception as e and log a debug-level
message with the original href and exception details using Loguru (ensure logger
is imported/available), then proceed without raising; include contextual text
like "failed to unmask href" in the log so failures are discoverable during
debugging.

@aturret merged commit 3ae90ec into main on Mar 20, 2026
2 checks passed
@coderabbitai bot mentioned this pull request on Mar 21, 2026