feat: add TeX-based figure extraction for arXiv papers by Edenwu617 · Pull Request #1 · AlphaLab-USTC/AutoWiki-skill

Edenwu617 · 2026-04-14T09:34:35Z

Summary

- Add `scripts/paper_extract_figures_tex.py`, which extracts figures from arXiv TeX source instead of relying on heuristic PDF region cropping
- Update README (EN + CN) to document the TeX-first extraction strategy

Why

PDF-based figure extraction uses heuristics to guess figure boundaries, which causes:

- Hyphenation artifacts in captions ("hand- designed")
- Caption truncation at 600 chars
- Low-res page-region screenshots instead of original images
- Missed appendix figures (not in main PDF text flow)

TeX source gives us structured \caption{} text and original image files directly.
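As a rough illustration of why TeX source is easier to work with, the sketch below pulls the caption and image paths out of each `figure` environment. It is not the code in `scripts/paper_extract_figures_tex.py`; the names `extract_figures` and `read_braced` and the returned dict keys are hypothetical, and the regexes ignore TeX comments and custom figure macros.

```python
import re

# Assumed patterns for illustration only; real papers may use custom macros.
FIGURE_RE = re.compile(r"\\begin\{figure\*?\}(.*?)\\end\{figure\*?\}", re.DOTALL)
GRAPHICS_RE = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]+)\}")
CAPTION_RE = re.compile(r"\\caption\s*(?:\[[^\]]*\])?\s*\{")


def read_braced(text, start):
    """Return the contents of the brace group opening at text[start], or None."""
    depth, i = 0, start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # skip the escaped character, e.g. \{ or \}
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start + 1 : i]
        i += 1
    return None  # unbalanced braces


def extract_figures(tex):
    """Return one dict per figure environment with its caption and image paths."""
    figures = []
    for match in FIGURE_RE.finditer(tex):
        body = match.group(1)
        caption = None
        cap = CAPTION_RE.search(body)
        if cap:
            # cap.end() - 1 is the index of the opening brace of the caption.
            caption = read_braced(body, cap.end() - 1)
        if caption is not None:
            caption = " ".join(caption.split())  # collapse line breaks and indentation
        figures.append({"caption": caption, "images": GRAPHICS_RE.findall(body)})
    return figures
```

Because the caption is read as a balanced brace group, nested commands such as `\textbf{...}` survive intact and no length cap is needed.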

What changed

| File | Change |
| --- | --- |
| `scripts/paper_extract_figures_tex.py` | New TeX-based extractor (drop-in replacement, same manifest format) |
| `README.md` | Add TeX extraction docs |
| `README_CN.md` | Add TeX extraction docs (Chinese) |

Test plan

- Tested on 13 arXiv papers (Part 2: Harness Engineering)
- Verified recursive `\input{}` expansion (appendix figures captured)
- Verified subfigure splitting (`\begin{figure}` with multiple `\includegraphics`)
- Verified PDF/EPS → PNG conversion via PyMuPDF
- Confirmed output manifest format matches `paper_extract_figures.py`
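The recursive `\input{}` expansion mentioned in the test plan could look roughly like the sketch below. It is an assumption about the approach, not the PR's actual code: `expand_inputs` is a hypothetical name, it also follows `\include{}`, and it does not skip commented-out lines.

```python
import re
from pathlib import Path

# Matches \input{name} and \include{name}; \includegraphics is not matched
# because the brace must follow the command name directly.
INPUT_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")


def expand_inputs(path, root, _stack=()):
    """Inline \\input/\\include files recursively, relative to the project root."""
    path = Path(path).resolve()
    if path in _stack or not path.is_file():
        return ""  # circular include or missing file: drop it silently
    text = path.read_text(encoding="utf-8", errors="replace")

    def substitute(match):
        target = Path(root) / match.group(1).strip()
        if target.suffix != ".tex":
            # TeX lets authors omit the extension.
            target = target.with_name(target.name + ".tex")
        return expand_inputs(target, root, _stack + (path,))

    return INPUT_RE.sub(substitute, text)
```

Running a figure parser over the expanded text, rather than over the main file alone, is what lets figures defined in appendix files show up.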

🤖 Generated with Claude Code

For arXiv papers, extract figures directly from TeX source instead of heuristic PDF region cropping. Parses \begin{figure} environments, resolves \includegraphics paths from e-print tars, and outputs the same manifest format for drop-in compatibility.

Benefits over PDF extraction:
- Original image files (vector PDF, high-res PNG) instead of screenshots
- Structured \caption{} text without hyphenation artifacts or truncation
- Subfigure splitting (composite figures become individually referenceable)
- Appendix figures captured via recursive \input{} expansion

Falls back to PDF extraction for non-arXiv sources.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
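Resolving `\includegraphics` paths against an e-print archive usually means coping with omitted file extensions and `\graphicspath` directories. The following is a minimal sketch under those assumptions; `resolve_graphic`, `CANDIDATE_EXTS`, and the `graphics_dirs` parameter are hypothetical names, and the extension order is a guess rather than the behaviour of the script in this PR.

```python
from pathlib import PurePosixPath

# Assumed lookup order when the TeX source omits the extension.
CANDIDATE_EXTS = (".pdf", ".png", ".jpg", ".jpeg", ".eps")


def resolve_graphic(ref, members, graphics_dirs=("",)):
    """Map an \\includegraphics argument to an archive member name, or None."""
    # Normalise member names so "./figs/a.pdf" and "figs/a.pdf" compare equal.
    names = {PurePosixPath(m).as_posix(): m for m in members}
    for base in graphics_dirs:
        stem = (PurePosixPath(base) / ref).as_posix()
        # Try the reference as written first, then with each known extension.
        for candidate in [stem] + [stem + ext for ext in CANDIDATE_EXTS]:
            if candidate in names:
                return names[candidate]
    return None
```

For example, with members `["main.tex", "figs/overview.pdf"]`, the reference `figs/overview` would resolve to `figs/overview.pdf`. The member list could come from `tarfile.TarFile.getnames()` on the downloaded e-print.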

Edenwu617 wants to merge 1 commit into AlphaLab-USTC:main from Edenwu617:feat/tex-figure-extraction
