
feat: integrate FxZhihu implementation #66

Merged

aturret merged 1 commit into main from zhihu-local-scraper on Mar 20, 2026
Conversation

@aturret (Owner) commented Mar 20, 2026

Summary by CodeRabbit

  • New Features

    • Added API-based scraping method for Zhihu content as an alternative to existing methods
    • Introduced z_c0 authentication cookie configuration option for Zhihu API access
  • Improvements

    • Enhanced content processing for images, links, and reference extraction
  • Tests

    • Added test coverage for content processing functions

@coderabbitai (Contributor, bot) commented Mar 20, 2026

📝 Walkthrough

This PR extends the Zhihu scraper with a new "api" method alongside the existing "fxzhihu" approach: it introduces cookie-based direct API authentication via the ZHIHU_Z_C0 configuration, adds HTML post-processing utilities for image and link handling, and hardens API field extraction with safe defaults.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Application Configuration**<br>`apps/api/src/config.py`, `template.env` | Added the `ZHIHU_Z_C0` environment variable for Zhihu API authentication, with documentation describing its precedence over config-file-based cookies. |
| **Zhihu Scraper Configuration**<br>`apps/api/src/services/scrapers/zhihu/config.py` | Updated `ZHIHU_API_ANSWER_PARAMS` to unencoded format; expanded `ALL_METHODS` to include the "api" method; introduced a `ZHIHU_API_COOKIE` constant with fallback logic prioritizing `ZHIHU_Z_C0` over config-file-based cookies. |
| **Zhihu Scraper Core Logic**<br>`apps/api/src/services/scrapers/zhihu/__init__.py` | Switched the default method from "fxzhihu" to "api"; replaced the randomized User-Agent with a static "node"; set the Cookie header conditionally from `ZHIHU_API_COOKIE`; changed the API host for articles; added a post-processing pipeline (`fix_images_and_links`, `unmask_zhihu_links`) for API responses; hardened field extraction with `.get()` defaults and enhanced status/video URL parsing logic. |
| **Content Processing Module**<br>`apps/api/src/services/scrapers/zhihu/content_processing.py` | New utility module providing `fix_images_and_links()` to replace image sources and unwrap underline tags, `extract_references()` to aggregate and format reference metadata from HTML, and `unmask_zhihu_links()` to decode Zhihu redirect links. |
| **Tests**<br>`tests/test_zhihu_content_processing.py` | New test module validating the content processing functions: image source replacement, reference extraction, underline tag removal, and Zhihu link unmasking. |
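The cookie fallback described for `config.py` can be sketched as a small standalone function. This is a hypothetical illustration: the function name, the `env` parameter, and the `z_c0` key inside the config-file cookie dict are assumptions pieced together from this summary, not the actual code.

```python
def build_zhihu_api_cookie(env, config_cookies=None):
    """Build the Cookie header value for direct Zhihu API calls.

    Precedence (per the PR summary): the ZHIHU_Z_C0 environment variable
    wins over any z_c0 value found in config-file-based cookies.
    """
    z_c0 = env.get("ZHIHU_Z_C0")
    if not z_c0 and config_cookies:
        z_c0 = config_cookies.get("z_c0")  # hypothetical config-file cookie dict
    if not z_c0:
        return None  # no cookie available; caller can fall back to other methods
    return f"z_c0={z_c0}"
```

In the real module the first argument would presumably be `os.environ`; passing a plain dict here keeps the sketch testable in isolation.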

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Config as Configuration
    participant Scraper as Zhihu Scraper
    participant API as Zhihu API
    participant Processor as Content Processor

    Config->>Scraper: Load ZHIHU_Z_C0 or ZHIHU_COOKIES
    Note over Scraper: Build ZHIHU_API_COOKIE
    Scraper->>Scraper: Set method="api" if ZHIHU_API_COOKIE present
    Scraper->>API: GET request with Cookie header
    API-->>Scraper: Return JSON response
    Scraper->>Processor: Pass raw_content to fix_images_and_links()
    Processor-->>Scraper: Fixed HTML (images, links)
    Scraper->>Processor: Pass HTML to unmask_zhihu_links()
    Processor-->>Scraper: Unmasked links (decoded targets)
    Scraper->>Scraper: Extract fields with .get() defaults
    Scraper-->>Scraper: Return parsed content
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A new path emerges for Zhihu's tales so bright,
With cookies and magic, the API takes flight,
Images unmasked and links set free,
Content flows pure as can be! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 35.29%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main change: integrating the FxZhihu implementation by adding API support, content processing utilities, and updated configuration across multiple Zhihu-related files. |


@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
apps/api/src/services/scrapers/zhihu/__init__.py (1)

796-798: Simplify origin_pin check using .get().

As flagged by static analysis, the key check before dictionary access can be simplified: the pattern `"origin_pin" in data and data["origin_pin"]` can be replaced with `data.get("origin_pin")`.

♻️ Proposed fix
```diff
-        if "origin_pin" in data and data["origin_pin"]:
-            result["origin_pin_id"] = str(data["origin_pin"]["id"])
-            result["origin_pin_data"] = Zhihu._resolve_status_api_data(data["origin_pin"])
+        origin_pin = data.get("origin_pin")
+        if origin_pin:
+            result["origin_pin_id"] = str(origin_pin["id"])
+            result["origin_pin_data"] = Zhihu._resolve_status_api_data(origin_pin)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/services/scrapers/zhihu/__init__.py` around lines 796 - 798,
Replace the explicit key-and-truthy check "origin_pin" in data and
data["origin_pin"] with a single get call: retrieve origin_pin =
data.get("origin_pin") and if origin_pin: set result["origin_pin_id"] =
str(origin_pin["id"]) and result["origin_pin_data"] =
Zhihu._resolve_status_api_data(origin_pin); this simplifies the conditional and
avoids double dictionary lookup while preserving behavior.
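As a standalone sanity check (with made-up pin data, not the scraper's real structures), both forms return the same result across missing, falsy, and populated cases:

```python
def origin_pin_id_old(data):
    # original pattern: membership test plus truthiness check, three lookups
    if "origin_pin" in data and data["origin_pin"]:
        return str(data["origin_pin"]["id"])
    return None

def origin_pin_id_new(data):
    # refactored pattern: a single .get() lookup bound to a local name
    origin_pin = data.get("origin_pin")
    if origin_pin:
        return str(origin_pin["id"])
    return None

# missing key, explicit None, empty (falsy) dict, and a populated pin
cases = [{}, {"origin_pin": None}, {"origin_pin": {}}, {"origin_pin": {"id": 42}}]
assert all(origin_pin_id_old(c) == origin_pin_id_new(c) for c in cases)
```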
apps/api/src/services/scrapers/zhihu/content_processing.py (1)

39-44: Unused loop variable index.

The loop variable index is not used in the loop body. Consider using an underscore prefix to indicate it's intentionally unused.

♻️ Proposed fix
```diff
-    for index, ref in sorted_refs:
+    for _index, ref in sorted_refs:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/services/scrapers/zhihu/content_processing.py` around lines 39 -
44, The loop in sorted_refs = sorted(references.items(), key=lambda x:
int(x[0])) assigns the first element to index but never uses it; change the loop
header in the block that builds items (currently "for index, ref in
sorted_refs") to use an unused-variable name (e.g., "for _, ref in sorted_refs")
so it's clear index is intentionally ignored while preserving the sorted_refs,
references, items, and URL/html construction logic.
tests/test_zhihu_content_processing.py (1)

1-11: Consider using a proper import path instead of sys.path manipulation.

The sys.path manipulation is fragile and can break if the directory structure changes. Consider:

  1. Running tests from project root with proper PYTHONPATH
  2. Using relative imports if this is a package
  3. Adding a conftest.py that sets up the path once

However, the comment explains the rationale (avoiding heavy dependencies), so this is acceptable for now if the test infrastructure requires it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_zhihu_content_processing.py` around lines 1 - 11, The test
currently mutates sys.path to import content_processing which is fragile;
replace this inline sys.path manipulation by one of: (a) configure PYTHONPATH or
run tests from project root so tests import the module with a normal import, (b)
convert the tests folder into a package and use a relative import to import
content_processing, or (c) create a conftest.py that adjusts sys.path once for
the test suite; ensure tests still import fix_images_and_links,
extract_references, and unmask_zhihu_links from content_processing and remove
the sys.path.insert and os.path.dirname usage from
tests/test_zhihu_content_processing.py.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9954da61-4b3d-4755-9e5a-0d3388cc1e05

📥 Commits

Reviewing files that changed from the base of the PR and between 7d4e441 and d86e926.

📒 Files selected for processing (6)
  • apps/api/src/config.py
  • apps/api/src/services/scrapers/zhihu/__init__.py
  • apps/api/src/services/scrapers/zhihu/config.py
  • apps/api/src/services/scrapers/zhihu/content_processing.py
  • template.env
  • tests/test_zhihu_content_processing.py

Comment on lines +56 to +63
```python
try:
    parsed = urlparse(href)
    qs = parse_qs(parsed.query)
    target = qs.get("target", [None])[0]
    if target:
        a_tag["href"] = unquote(target)
except Exception:
    pass
```
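The decoding chain above can be exercised standalone with a made-up masked URL (Zhihu wraps outbound links as link.zhihu.com redirects carrying a percent-encoded `target` query parameter). Note that `parse_qs` already percent-decodes values once, so the trailing `unquote` is a no-op for singly-encoded targets and only matters for double-encoded ones:

```python
from urllib.parse import urlparse, parse_qs, unquote

# hypothetical masked link for illustration
masked = "https://link.zhihu.com/?target=https%3A//example.com/page%3Fid%3D1"
parsed = urlparse(masked)
qs = parse_qs(parsed.query)           # parse_qs percent-decodes the value once
target = qs.get("target", [None])[0]  # 'https://example.com/page?id=1'
unmasked = unquote(target)            # idempotent here; no encoded chars left
print(unmasked)  # -> https://example.com/page?id=1
```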

⚠️ Potential issue | 🟡 Minor

Consider logging failed URL unmasking attempts.

The silent except: pass makes debugging difficult when URL parsing fails unexpectedly. Per coding guidelines, Loguru should be used for logging. Consider logging at debug level to aid troubleshooting without cluttering normal output.

🛠️ Proposed fix
```diff
+from fastfetchbot_shared.utils.logger import logger
 from urllib.parse import urlparse, parse_qs, unquote
```

```diff
             try:
                 parsed = urlparse(href)
                 qs = parse_qs(parsed.query)
                 target = qs.get("target", [None])[0]
                 if target:
                     a_tag["href"] = unquote(target)
-            except Exception:
-                pass
+            except Exception as e:
+                logger.debug(f"Failed to unmask Zhihu link {href}: {e}")
```
🧰 Tools
🪛 Ruff (0.15.6)

[error] 62-63: try-except-pass detected, consider logging the exception (S110)

[warning] 62-62: Do not catch blind exception: Exception (BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/services/scrapers/zhihu/content_processing.py` around lines 56 -
63, The try/except in the URL unmasking block silently swallows errors; update
the block in content_processing.py that parses href (uses urlparse, parse_qs,
and unquote to set a_tag["href"]) to catch Exception as e and log a debug-level
message with the original href and exception details using Loguru (ensure logger
is imported/available), then proceed without raising; include contextual text
like "failed to unmask href" in the log so failures are discoverable during
debugging.

@aturret merged commit 3ae90ec into main on Mar 20, 2026
2 checks passed
@coderabbitai bot mentioned this pull request on Mar 21, 2026