AlphaLab-USTC · Edenwu617 · Apr 14, 2026
diff --git a/README.md b/README.md
@@ -100,6 +100,8 @@ raw/new/  ──→  LLM (Claude Code)  ──→  kb/  ──→  Obsidian
 
 Three operations: **Ingest** (PDF → analysis page), **Query** (ask questions → synthesize → write back), **Lint** (25+ auto-checks).
 
+**Figure extraction:** For arXiv papers, figures are extracted directly from the TeX source (`\includegraphics` + `\caption`), yielding original-quality images and clean captions. Non-arXiv sources fall back to region-based PDF extraction.
+
 ---
 
 ## What Makes It Different
@@ -210,6 +212,8 @@ tags: [memory-evolution, year/2026, venue/arXiv]
 
 **Why not RAG?** At personal KB scale (~100s of sources), a well-maintained index + grep outperforms vector search. No embedding pipeline needed.
 
+**Why TeX-first figure extraction?** For arXiv papers, the TeX source provides the original image files (vector PDFs, high-res PNGs) and structured `\caption{}` text, so no heuristic cropping is needed and captions carry no hyphenation artifacts. PDF extraction remains as a fallback for non-arXiv sources.
+
 **Why a skill?** `SKILL.md` IS the architecture — 390 lines encoding quality standards, anti-patterns, and workflow rules. No servers, no infra.
 
 ---

diff --git a/README_CN.md b/README_CN.md
@@ -82,6 +82,8 @@ raw/new/  ──→  LLM (Claude Code)  ──→  kb/  ──→  Obsidian
 
 Three operations: **Ingest** (PDF → analysis page), **Query** (ask questions → synthesize → write back), **Lint** (25+ auto-checks).
 
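+**Figure extraction:** For arXiv papers, figures are extracted directly from the TeX source (`\includegraphics` + `\caption`), yielding original-quality images and clean caption text. Non-arXiv sources fall back to region-based PDF cropping.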
+
 ---
 
 ## What Makes It Different
@@ -192,6 +194,8 @@ tags: [memory-evolution, year/2026, venue/arXiv]
 
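 **Why not RAG?** At personal knowledge-base scale (~a few hundred sources), a well-maintained index + grep works better than vector search, and no embedding pipeline is needed.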
 
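+**Why TeX-first figure extraction?** For arXiv papers, the TeX source gives direct access to the original image files (vector PDFs, high-res PNGs) and structured `\caption{}` text, with no heuristic cropping and no hyphenation artifacts. PDF extraction is kept as a fallback for non-arXiv sources.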
+
 **Why a skill?** `SKILL.md` itself is the architecture: 390 lines encoding quality standards, anti-patterns, and workflow rules. No servers, no infrastructure.
 
 ---
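
Review note: the TeX-first extraction described in this change can be illustrated with a minimal sketch. This is not the repository's implementation. The names `extract_figures` and `read_braced` are hypothetical, and the sketch assumes figures live in `figure` / `figure*` environments and that TeX comments have already been stripped.

```python
import re

# Matches the body of a figure or figure* environment (non-greedy).
FIGURE_ENV = re.compile(r"\\begin\{figure\*?\}(.*?)\\end\{figure\*?\}", re.DOTALL)
# Matches \includegraphics with an optional [options] block; captures the path.
INCLUDEGRAPHICS = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]*)\}")
# Matches \caption, \caption* and \caption[short]{ up to the opening brace.
CAPTION_OPEN = re.compile(r"\\caption\*?\s*(?:\[[^\]]*\])?\s*\{")


def read_braced(text, open_idx):
    """Return the contents of the brace group that opens at text[open_idx].

    Tracks nesting depth so captions containing macros such as
    \\textbf{...} are returned whole. Returns None if the group never closes.
    """
    depth = 0
    i = open_idx
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # skip the escaped character, e.g. \{ or \%
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[open_idx + 1 : i]
        i += 1
    return None


def extract_figures(tex):
    """Return a list of {"images": [...], "caption": str | None} dicts.

    An empty list signals that the caller should fall back to
    PDF-based extraction (hypothetical convention for this sketch).
    """
    figures = []
    for env in FIGURE_ENV.finditer(tex):
        body = env.group(1)
        images = [m.group(1).strip() for m in INCLUDEGRAPHICS.finditer(body)]
        caption = None
        cap = CAPTION_OPEN.search(body)
        if cap:
            raw = read_braced(body, cap.end() - 1)
            if raw is not None:
                caption = " ".join(raw.split())  # collapse line breaks
        if images:
            figures.append({"images": images, "caption": caption})
    return figures
```

Returning an empty list when no `figure` environment contains an `\includegraphics` gives the caller a simple signal to switch to the PDF fallback mentioned in the README text. Resolving image paths against the source tree and handling subfigures are left out of this sketch.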