4 changes: 4 additions & 0 deletions README.md
@@ -100,6 +100,8 @@ raw/new/ ──→ LLM (Claude Code) ──→ kb/ ──→ Obsidian

Three operations: **Ingest** (PDF → analysis page), **Query** (ask questions → synthesize → write back), **Lint** (25+ auto-checks).

**Figure extraction:** For arXiv papers, figures are extracted directly from the TeX source (`\includegraphics` + `\caption`), yielding original-quality images and clean captions. For non-arXiv sources, extraction falls back to region-based cropping of the PDF.
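
As a rough illustration of the TeX-first path, the sketch below pairs each `\includegraphics` path with the `\caption` from the same `figure` environment. The names `extract_figures` and `read_braced` are hypothetical and not taken from this repository; the sketch assumes figures live in `figure` or `figure*` environments and does not handle escaped braces (`\{`, `\}`) inside captions.

```python
import re

# Hypothetical helpers for illustration only; the project's actual
# extraction logic may differ.
FIGURE_RE = re.compile(r"\\begin\{figure\*?\}(.*?)\\end\{figure\*?\}", re.DOTALL)
GRAPHICS_RE = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]+)\}")
CAPTION_RE = re.compile(r"\\caption(?:\[[^\]]*\])?\s*\{")


def read_braced(text, start):
    """Return the contents of the brace group that opens at text[start]."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start + 1 : i]
    return text[start + 1 :]  # unbalanced input: return the remainder


def extract_figures(tex):
    """Collect image paths and a whitespace-normalized caption per figure."""
    figures = []
    for match in FIGURE_RE.finditer(tex):
        body = match.group(1)
        paths = GRAPHICS_RE.findall(body)
        cap = CAPTION_RE.search(body)
        # cap.end() - 1 is the index of the caption's opening brace.
        caption = read_braced(body, cap.end() - 1) if cap else ""
        figures.append({"paths": paths, "caption": " ".join(caption.split())})
    return figures
```

Nested groups such as `\textbf{...}` inside a caption are kept intact because the caption is read by brace matching rather than by a single regular expression.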

---

## What Makes It Different
@@ -210,6 +212,8 @@ tags: [memory-evolution, year/2026, venue/arXiv]

**Why not RAG?** At personal KB scale (~100s of sources), a well-maintained index + grep outperforms vector search. No embedding pipeline needed.
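
To make the "index + grep" idea concrete, here is a minimal sketch of a plain-text search over a folder of Markdown pages. The function name `grep_kb` and the assumption that pages are `*.md` files under one root directory are illustrative, not taken from this repository.

```python
import re
from pathlib import Path


def grep_kb(root, pattern):
    """Yield (relative path, line number, line) for matches in *.md files under root."""
    regex = re.compile(pattern, re.IGNORECASE)
    root = Path(root)
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield path.relative_to(root).as_posix(), lineno, line.strip()
```

A call such as `list(grep_kb("kb", r"memory.evolution"))` would return every matching line with its file and line number, with no embedding step involved.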

**Why TeX-first figure extraction?** For arXiv papers, TeX source gives you the original image files (vector PDFs, high-res PNGs) and structured `\caption{}` text — no heuristic cropping, no hyphenation artifacts. PDF extraction remains as a fallback for non-arXiv sources.

**Why a skill?** `SKILL.md` IS the architecture — 390 lines encoding quality standards, anti-patterns, and workflow rules. No servers, no infra.

---
4 changes: 4 additions & 0 deletions README_CN.md
@@ -82,6 +82,8 @@ raw/new/ ──→ LLM (Claude Code) ──→ kb/ ──→ Obsidian

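Three operations: **Ingest** (PDF → analysis page), **Query** (ask questions → synthesize → write back), **Lint** (25+ automatic checks).

**Figure extraction:** For arXiv papers, figures are extracted directly from the TeX source (`\includegraphics` + `\caption`), yielding original-quality images and clean caption text. Non-arXiv sources fall back to region-based cropping of the PDF.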

---

## What Makes It Different
@@ -192,6 +194,8 @@ tags: [memory-evolution, year/2026, venue/arXiv]

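**Why not RAG?** At personal knowledge-base scale (~100s of sources), a well-maintained index + grep works better than vector search, and no embedding pipeline is needed.

**Why TeX-first figure extraction?** For arXiv papers, the TeX source gives direct access to the original image files (vector PDFs, high-resolution PNGs) and structured `\caption{}` text: no heuristic cropping, no hyphenation artifacts. PDF extraction is kept as a fallback for non-arXiv sources.

**Why a skill?** `SKILL.md` itself is the architecture: 390 lines encoding quality standards, anti-patterns, and workflow rules. No servers, no infrastructure.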

---