feat: add TeX-based figure extraction for arXiv papers #1

Open
Edenwu617 wants to merge 1 commit into AlphaLab-USTC:main from Edenwu617:feat/tex-figure-extraction

@Edenwu617

Summary

  • Add scripts/paper_extract_figures_tex.py, which extracts figures from arXiv TeX source instead of cropping PDF regions heuristically
  • Update README (EN + CN) to document TeX-first extraction strategy

Why

PDF-based figure extraction uses heuristics to guess figure boundaries, which causes:

  • Hyphenation artifacts in captions ("hand- designed")
  • Caption truncation at 600 chars
  • Low-res page-region screenshots instead of original images
  • Missed appendix figures (not in main PDF text flow)

TeX source gives us structured \caption{} text and original image files directly.
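
To illustrate why the TeX route avoids caption artifacts, here is a minimal sketch of reading `\caption{}` text and `\includegraphics` paths from a figure environment. It is not the code in scripts/paper_extract_figures_tex.py; the names `read_braced` and `extract_figures` and the returned dict layout are hypothetical, and the sketch ignores commented-out figures and custom caption macros.

```python
import re

# Matches figure and figure* environments (non-greedy, spanning lines).
FIGURE_RE = re.compile(r"\\begin\{figure\*?\}(.*?)\\end\{figure\*?\}", re.DOTALL)
# Captures the path argument of \includegraphics, skipping optional [..] options.
GRAPHICS_RE = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]*)\}")
# Finds the opening brace of \caption, allowing an optional short caption.
CAPTION_RE = re.compile(r"\\caption(?:\[[^\]]*\])?\s*\{")


def read_braced(text, open_idx):
    # Return the contents of the brace group opening at text[open_idx],
    # honouring nested braces; None if the braces never balance.
    depth = 0
    i = open_idx
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[open_idx + 1:i]
        i += 1
    return None


def extract_figures(tex):
    # Hypothetical output shape: one dict per figure environment.
    figures = []
    for match in FIGURE_RE.finditer(tex):
        body = match.group(1)
        caption = None
        cap = CAPTION_RE.search(body)
        if cap:
            raw = read_braced(body, cap.end() - 1)
            if raw is not None:
                caption = " ".join(raw.split())  # collapse source line breaks
        figures.append({"caption": caption, "images": GRAPHICS_RE.findall(body)})
    return figures
```

Because the caption is read from the source as one balanced brace group, there is no PDF line-break hyphenation to undo and no fixed length cutoff.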

What changed

| File | Change |
| --- | --- |
| scripts/paper_extract_figures_tex.py | New TeX-based extractor (drop-in replacement, same manifest format) |
| README.md | Add TeX extraction docs |
| README_CN.md | Add TeX extraction docs (Chinese) |

Test plan

  • Tested on 13 arXiv papers (Part 2: Harness Engineering)
  • Verified recursive \input{} expansion (appendix figures captured)
  • Verified subfigure splitting (\begin{figure} with multiple \includegraphics)
  • Verified PDF/EPS → PNG conversion via PyMuPDF
  • Confirmed output manifest format matches paper_extract_figures.py
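
For reviewers who want to see what recursive `\input{}` expansion involves, the following is a rough sketch under stated assumptions, not the extractor's actual implementation. The function name `expand_inputs` and the `files` mapping (archive member name to TeX text) are hypothetical. Paths are looked up relative to the archive root, which matches how LaTeX resolves `\input` from the main file's directory.

```python
import re

# Matches \input{path} and \include{path}.
INPUT_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")


def expand_inputs(name, files, _stack=()):
    # Inline the referenced files recursively.
    # `files` is a hypothetical dict: member name -> TeX source text.
    key = name if name in files else name + ".tex"  # LaTeX allows omitting .tex
    if key not in files or key in _stack:
        return ""  # missing file or circular include: contribute nothing

    def inline(match):
        return expand_inputs(match.group(1).strip(), files, _stack + (key,))

    return INPUT_RE.sub(inline, files[key])
```

Running the figure parser on the expanded text, rather than on the main file alone, is what lets figures defined in separately included appendix files be picked up.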

🤖 Generated with Claude Code

For arXiv papers, extract figures directly from TeX source instead of
heuristic PDF region cropping. Parses \begin{figure} environments,
resolves \includegraphics paths from e-print tars, and outputs the
same manifest format for drop-in compatibility.
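
As an illustration of the path-resolution step, here is a hedged sketch of matching an `\includegraphics` argument against the members of an e-print tar. The helper names `open_eprint` and `resolve_graphics_path` and the extension search order are assumptions made for this example, not taken from the script.

```python
import io
import tarfile

# Assumed search order; \includegraphics arguments often omit the extension.
CANDIDATE_EXTS = ("", ".pdf", ".png", ".jpg", ".jpeg", ".eps")


def open_eprint(data):
    # Open an e-print archive held in memory; "r:*" auto-detects compression.
    return tarfile.open(fileobj=io.BytesIO(data), mode="r:*")


def _normalize(path):
    # Treat "./figs/a.pdf" and "figs/a.pdf" as the same member.
    return path[2:] if path.startswith("./") else path


def resolve_graphics_path(ref, tar):
    # Return the tar member name matching `ref`, or None if nothing matches.
    members = {_normalize(name): name for name in tar.getnames()}
    base = _normalize(ref.strip())
    for ext in CANDIDATE_EXTS:
        hit = members.get(base + ext)
        if hit is not None:
            return hit
    return None
```

A real extractor would likely also honour `\graphicspath{}` declarations, which this sketch does not attempt.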

Benefits over PDF extraction:
- Original image files (vector PDF, high-res PNG) instead of screenshots
- Structured \caption{} text without hyphenation artifacts or truncation
- Subfigure splitting (composite figures become individually referenceable)
- Appendix figures captured via recursive \input{} expansion

Falls back to PDF extraction for non-arXiv sources.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>