A minimal paper fetching & saving tool for:
- arXiv search (with optional HuggingFace Papers HOT marking)
- OpenReview notes fetching by invitation (conference/journal configurable)
- Optional LLM analysis (OpenAI-compatible API)
NOTE:
- OpenReview support is not fully stable yet.
- LLM features are still in testing.
Install dependencies:

```shell
python -m pip install -r requirements.txt
```

Basic arXiv fetch:

```shell
python -m paper_fetch \
  --out ./papers \
  --collection '<your collection>' \
  --arxiv-max 20 \
  --keyword '"<your keyword>"' \
  --category cs.CV \
  --category cs.LG \
  --pdf-dir _pdf
```

Parameters:
- `--out`: output base directory.
- `--collection`: outer folder name under `--out`.
- `--arxiv-max`: number of arXiv papers to fetch.
- `--keyword`: arXiv query snippet (arXiv query syntax; can use field prefixes like `ti:`/`abs:` and boolean operators).
- `--category`: arXiv category filter.
- `--pdf-dir _pdf`: write all PDFs into `./papers/<collection>/_pdf/` (flat directory), which is convenient for Zotero "Linked Attachments".
arXiv field prefixes (arXiv query syntax; this tool passes them through):
- `ti:` title
- `abs:` abstract
- `au:` author
- `co:` comment
- `jr:` journal reference
- `cat:` subject category
- `rn:` report number
- `id:` id (arXiv identifier)
- `all:` all fields
Common AI categories for --category:
- `cs.AI`: Artificial Intelligence
- `cs.CL`: Computation and Language
- `cs.CV`: Computer Vision and Pattern Recognition
- `cs.GR`: Computer Graphics
- `cs.LG`: Machine Learning
- `cs.MA`: Multiagent Systems
- `cs.NE`: Neural and Evolutionary Computing
- `cs.RO`: Robotics
- `eess.AS`: Audio and Speech Processing
- `eess.IV`: Image and Video Processing
- `eess.SP`: Signal Processing
- `stat.ML`: Machine Learning
Multiple --keyword / --category:
- All `--keyword` values are OR'ed: `(kw1 OR kw2 OR ...)`
- All `--category` values are OR'ed: `(cat1 OR cat2 OR ...)`
- Final query is: `(keywords_part) AND (categories_part)`
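As a sketch, the combination rule above could be written like this (illustrative Python, not the tool's actual code; the `cat:` prefix for categories is an assumption):

```python
# Illustrative sketch of the query-combination rule described above.
# Not the tool's actual code; the "cat:" prefix for categories is assumed.
def build_arxiv_query(keywords, categories):
    parts = []
    if keywords:
        # all --keyword values are OR'ed together
        parts.append("(" + " OR ".join(keywords) + ")")
    if categories:
        # all --category values are OR'ed together
        parts.append("(" + " OR ".join(f"cat:{c}" for c in categories) + ")")
    # keywords part and categories part are AND'ed
    return " AND ".join(parts)

print(build_arxiv_query(['ti:"diffusion"', 'abs:"flow matching"'], ["cs.CV", "cs.LG"]))
# (ti:"diffusion" OR abs:"flow matching") AND (cat:cs.CV OR cat:cs.LG)
```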
Tip (AND search):
- Multiple `--keyword` flags are OR'ed.
- To require multiple terms at the same time, put an explicit `AND` expression into a single `--keyword`, for example:
  `--keyword '(all:"image restoration" AND all:"diffusion")'`

How many can you pass?
- There is no hard-coded max in this tool (the flags use `action="append"`).
- Practical limits come from your shell/OS command length and overly long arXiv query strings.
Example:

```shell
python -m paper_fetch \
  --out ./papers \
  --collection '<your collection>' \
  --arxiv-max 20 \
  --keyword 'ti:"diffusion"' \
  --keyword 'abs:"flow matching"' \
  --category cs.CV \
  --category cs.LG \
  --pdf-dir _pdf
```

Zotero (manual, recommended / most stable):
- Set `Settings -> Advanced -> Files and Folders -> Linked Attachment Base Directory` to the parent folder (e.g. `.../papers`).
- In Finder, open `./papers/<collection>/_pdf/`, select PDFs and drag them into Zotero.
- Choose `Link to File`, then (optional) right-click PDFs: `Retrieve Metadata for PDF`.
HuggingFace trending:

```shell
python -m paper_fetch \
  --out ./papers \
  --collection '<your collection>' \
  --hf-max 30
```

Parameters:
- `--hf-max`: fetch the daily HuggingFace Papers trending list and save those papers under `_Huggingface`.
HuggingFace Papers search:

```shell
python -m paper_fetch \
  --out ./papers \
  --collection '<your collection>' \
  --hf-search '<your query>' \
  --hf-search-max 30
```

Parameters:
- `--hf-search`: HuggingFace Papers search query (plain text query for https://huggingface.co/papers?q=...; not arXiv query syntax).
- `--hf-search-max`: how many HuggingFace search results to fetch in total.

Notes:
- `--hf-search` is a best-effort scraper over the HuggingFace Papers search page. The matching behavior depends on HuggingFace and may change.
- Author-name queries (e.g. searching a person's name) are not guaranteed to behave like an exact author filter.
- If you need author-precise search, use arXiv query syntax with `--keyword`, for example: `--keyword 'au:"<author name>"'`
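For reference, the search-page URL shape can be sketched as follows (illustrative only; the tool's internal scraper may construct requests differently):

```python
from urllib.parse import quote_plus

# Build the HuggingFace Papers search URL for a plain-text query.
# Illustrative only; the actual scraper may add headers or pagination.
def hf_search_url(query):
    return f"https://huggingface.co/papers?q={quote_plus(query)}"

print(hf_search_url("flow matching"))
# https://huggingface.co/papers?q=flow+matching
```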
Optional LLM analysis (disabled by default):
- If you do nothing, LLM analysis is not used and no `analysis.md` is generated.
- To enable LLM analysis and generate `analysis.md`:

```shell
python -m paper_fetch \
  --out ./papers \
  --collection '<your collection>' \
  --arxiv-max 20 \
  --enable-llm
```

LLM behavior notes:
- The LLM analyzes local paper JSON metadata files only (title/authors/summary/etc.). It does not read PDFs.
- The LLM runs after fetched results have been saved locally, to reduce token usage.
- You can type your own analysis instruction in the terminal (multi-line, end with an empty line), or provide it non-interactively:

```shell
python -m paper_fetch \
  --out ./papers \
  --collection '<your collection>' \
  --arxiv-max 20 \
  --enable-llm \
  --llm-instruction 'Write a short digest of the fetched papers.'
```

Outputs:
- `./papers/<collection>/index.jsonl`
- `./papers/<collection>/index.csv`
- `./papers/<collection>/_arXiv/<YYYY>/<paper_id>.json`
- `./papers/<collection>/_arXiv/<YYYY>/<paper_id>.pdf` (only if `--no-pdf` is not set)
- `./papers/<collection>/_Huggingface/trending_arxiv_ids.txt`
- `./papers/<collection>/_Huggingface/search_arxiv_ids.txt`
- `./papers/<collection>/_Huggingface/<YYYY>/<paper_id>.json`
- `./papers/<collection>/_Huggingface/<YYYY>/<paper_id>.pdf` (only if `--no-pdf` is not set)
- `./papers/<collection>/_OpenReview/<YYYY>/<paper_id>.json`

When `--pdf-dir _pdf` is used:
- `./papers/<collection>/_pdf/<paper_id>.pdf`
No report is generated by default.
Notes:
- If arXiv returns 0 results, the program stops with a warning and exit code `2`.
Folder naming:
- Outputs are written under `./papers/<collection>/` (controlled by `--collection`).
- Under that folder, outputs are separated by source: `_arXiv`, `_Huggingface`, `_OpenReview`.
- Under each source, papers are grouped by year: `<YYYY>/`.
- Each paper is stored as flat files: `<paper_id>.json` and (if enabled) `<paper_id>.pdf`.
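Since each line of `index.jsonl` is one JSON record, a minimal reader could look like this (hypothetical helper; the exact field names depend on what the tool writes):

```python
import json
from pathlib import Path

# Hypothetical reader for the per-collection index.jsonl described above.
# One JSON object per line; blank lines are skipped.
def load_index(collection_dir):
    records = []
    text = (Path(collection_dir) / "index.jsonl").read_text(encoding="utf-8")
    for line in text.splitlines():
        if line.strip():
            records.append(json.loads(line))
    return records
```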
You need an OpenReview invitation id.
How to find an invitation id (generic):
- Go to https://openreview.net
- Open the venue/group page you want to fetch (conference/journal/workshop/track).
- Find a submission invitation on that page (often contains `/-/Submission` or similar).
- Use the full invitation id string as the value of `--openreview-invitation`.
Invitation id examples (reference only):

```
<VenueID>/<Year>/<Track>/-/Submission
<VenueID>/<Year>/<Track>/-/Blind_Submission
<VenueID>/<Year>/<Track>/-/Paper
```
Example:

```shell
python -m paper_fetch \
  --out ./papers \
  --collection '<your collection>' \
  --openreview-max 200 \
  --openreview-invitation '<OpenReview invitation id>' \
  --no-pdf
```

Note:
- OpenReview records usually do not include a direct `pdf_url` in this implementation, so `--no-pdf` is recommended.
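OpenReview's public REST API serves notes from a `notes` endpoint filtered by invitation; a request URL could be sketched as follows (illustrative only; this tool may use the official `openreview-py` client instead, and API versions differ per venue):

```python
from urllib.parse import urlencode

# Sketch: build a notes-by-invitation request URL for the OpenReview API.
# Illustrative only; endpoint/version details vary by venue and client library.
def openreview_notes_url(invitation, limit=200, offset=0):
    params = urlencode({"invitation": invitation, "limit": limit, "offset": offset})
    return f"https://api.openreview.net/notes?{params}"
```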
Set environment variables:

```shell
export LLM_BASE_URL='https://api.openai.com/v1'
export LLM_API_KEY='YOUR_API_KEY'
export LLM_MODEL='<your model id>'
```

The report will be written to `./analysis.md` with an extra LLM Analysis section.

Supported / usable models:
- This project sends `LLM_MODEL` directly to the OpenAI-compatible provider specified by `LLM_BASE_URL`.
- Therefore, any chat/completions model that your provider exposes can be used.
- Examples (provider-dependent): `gpt-4o-mini`, `gpt-4o`, `gpt-4.1-mini`.
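The environment variables above map onto a standard OpenAI-compatible chat request. A minimal payload builder might look like this (illustrative sketch using the documented defaults; the tool's real request may differ):

```python
import os

# Sketch: assemble an OpenAI-compatible /chat/completions request from the
# environment variables documented above. Illustrative only.
def build_chat_request(instruction, metadata_digest):
    base_url = os.environ.get("LLM_BASE_URL", "https://api.openai.com/v1")
    return {
        "url": base_url.rstrip("/") + "/chat/completions",
        "headers": {"Authorization": "Bearer " + os.environ.get("LLM_API_KEY", "")},
        "json": {
            "model": os.environ.get("LLM_MODEL", "gpt-4o-mini"),
            "messages": [
                {"role": "system", "content": instruction},
                {"role": "user", "content": metadata_digest},
            ],
        },
    }
```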
This project does not hardcode a topic list. What you can fetch depends on the upstream data sources.
- arXiv fetch is query-based. `--keyword` is passed into the arXiv search query (see examples below).
- Therefore, any topic that arXiv indexes can be searched, including common AI topics (LLM / RL / diffusion / flow matching) and rare topics.
- However, for rare topics or very strict queries, arXiv may return 0 results. In that case this tool will stop and warn.
Practical guidance:
- If a term is rare, try:
  - removing quotes
  - searching in title/abstract only: `ti:` / `abs:`
  - using multiple synonyms as separate `--keyword` values (they are OR'ed)
Examples (query templates; not a full command):

```shell
# simple phrase
--keyword '"<your phrase>"'

# field-specific query: ti=title, abs=abstract, au=author, all=all fields
--keyword 'ti:"<term>"'
--keyword 'abs:"<term>"'
--keyword 'au:"<name>"'

# boolean composition
--keyword '(ti:"<term1>" AND abs:"<term2>")'
```
Keyword examples (concrete examples; not a full command):

```shell
--keyword '"large language model"'
--keyword 'ti:"diffusion"'
--keyword 'abs:"reinforcement learning"'
--keyword '(ti:"transformer" AND abs:"efficient")'
```
Recommended arXiv categories for AI (you can combine multiple):
- `cs.LG` (Machine Learning)
- `cs.AI` (Artificial Intelligence)
- `cs.CV` (Computer Vision)
- `cs.CL` (Computation and Language / NLP)
- `cs.RO` (Robotics)
- `stat.ML` (Machine Learning)
- `eess.IV` (Image and Video Processing)
- OpenReview fetch is venue/invitation-based.
- If a conference/journal/workshop uses OpenReview, you can fetch it by providing `--openreview-invitation`.
- This means the coverage depends on whether that venue is hosted on OpenReview (many top ML venues are, but not all AI conferences are).
All options are available from:

```shell
python -m paper_fetch --help
```

- `--out`
  - Output directory for all fetched records (required).
- `--analysis-out`
  - Path to write the markdown report.
  - Default: `analysis.md` (in current working directory).
- `--arxiv-max`
  - Number of arXiv papers to fetch.
  - Default: `0`.
- `--keyword`
  - arXiv query snippet (passed into the arXiv query string).
  - Supports field prefixes (e.g. `ti:`, `abs:`, `au:`, `all:`) and boolean operators.
  - Can be repeated:
    - `--keyword '"<phrase 1>"' --keyword '"<phrase 2>"'`
    - `--keyword 'ti:"<term>"' --keyword 'abs:"<term>"'`
    - `--keyword '(ti:"<term1>" AND abs:"<term2>")'`
  - How this tool combines inputs:
    - All `--keyword` values are combined as: `(kw1 OR kw2 OR kw3 ...)`
    - All `--category` values are combined as: `(cat1 OR cat2 OR cat3 ...)`
    - Final query is: `(keywords_part) AND (categories_part)`
  - Shell quoting tip:
    - Use single quotes `'...'` around each `--keyword` to avoid your shell interpreting parentheses and quotes.
- `--category`
  - arXiv category (e.g. `cs.CV`, `cs.LG`).
  - Can be repeated.
- `--allow-empty-arxiv-query`
  - By default, this tool refuses to run arXiv fetching when both `--keyword` and `--category` are empty.
  - Use this flag to explicitly allow an empty query.
- `--hf-max`
  - How many HuggingFace Papers ids to fetch from https://huggingface.co/papers.
  - Used only for HOT marking for arXiv IDs.
  - Default: `0`.
HuggingFace modes:
- Trending mode (download the daily trending list and save those papers under `_Huggingface`):

```shell
python -m paper_fetch \
  --out ./papers \
  --hf-max 30
```

- Keyword search mode (search HuggingFace Papers and save matched papers under `_Huggingface`):

```shell
python -m paper_fetch \
  --out ./papers \
  --hf-search '<your query>' \
  --hf-search-max 30
```

- `--no-hf-trending`
  - Disable HuggingFace HOT marking.
- `--hf-search`
  - HuggingFace Papers search query string.
  - Can be repeated.
  - Notes:
    - This is a plain search query for HuggingFace Papers (the `/papers?q=...` page).
    - This is different from arXiv `--keyword`, which follows arXiv query syntax.
- `--hf-search-max`
  - Number of HuggingFace search results to fetch.
  - Default: `0`.
- `--no-pdf`
  - Do not download PDFs. Only save local paper JSON metadata files.
- `--pdf-dir`
  - Override PDF output directory.
  - If relative, it is resolved under `--out`/`--collection`.
  - Example: `--pdf-dir _pdf`.
- `--openreview-max`
  - Number of OpenReview notes to fetch.
  - Default: `0` (disabled).
- `--openreview-invitation`
  - OpenReview invitation id.
  - Required when `--openreview-max > 0`.
- `--enable-llm`
  - Enable LLM analysis and append results to the markdown report.
  - Default: disabled.
- `--llm-instruction`
  - Provide instruction text for LLM analysis.
  - If empty and `--enable-llm` is used, the CLI will prompt you interactively in the terminal.
- `--llm-no-interactive`
  - Disable the interactive prompt.
  - If `--llm-instruction` is empty, a default instruction will be used.
Environment variables (OpenAI-compatible):
- `LLM_BASE_URL`
  - Default: `https://api.openai.com/v1`
- `LLM_API_KEY`
  - Required when `--enable-llm` is used.
- `LLM_MODEL`
  - Default: `gpt-4o-mini`
For the full documentation in Chinese, see README_zh.md.
- arXiv PDF downloads include retry + backoff; per-paper failures will not stop the whole run.
- OpenReview fetching requires a correct `invitation` id; different venues/years can differ.
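The retry-plus-backoff behavior mentioned above follows a common pattern; a generic sketch (not the tool's actual implementation):

```python
import time

# Generic retry-with-exponential-backoff sketch, illustrating the download
# behavior described above. Not the tool's actual implementation.
def retry_with_backoff(fetch, retries=3, base_delay=1.0):
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # surface the per-paper failure after the last attempt
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

A caller can catch the final exception per paper so one failed PDF does not abort the whole run.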