GitHub - Hayden727/CiteBot: An Intelligent LaTeX Citation Assistant.

CiteBot

An intelligent citation assistant that analyzes your LaTeX document, searches academic databases, and generates a complete BibTeX file.

Report Bug · Request Feature · Chinese documentation

Table of Contents

About The Project
Features
Getting Started
Usage
How It Works
Data Sources
Testing
Repository Structure
Contributing
License
Acknowledgments

About The Project

CiteBot automates the tedious process of finding and formatting references for academic papers. Give it your .tex file and a target number of references — it handles the rest: parsing your document, understanding what you're writing about, searching multiple academic databases in parallel, ranking results by relevance, and generating a ready-to-use .bib file.

Features

Multi-File Project Support — Pass your main .tex file or a project directory and CiteBot automatically follows \input{}/\include{} to parse the entire project. It generates one unified .bib and, when citation insertion is enabled, inserts citations into each chapter file.
LaTeX Parsing — Extracts title, abstract, sections, and existing citations from .tex files (supports \chapter, \section, Chinese documents)
LLM-Powered Keyword Extraction — Uses DeepSeek/OpenAI to understand document semantics and extract precise English academic terms. Large projects are processed chapter by chapter, which can yield 100+ keywords. Falls back to an NLP ensemble (KeyBERT + YAKE + spaCy) when no LLM API is configured
Multi-Source Search — Queries OpenAlex, Semantic Scholar, PubMed, arXiv, and bioRxiv in parallel via OpenCite
Smart Ranking — Composite scoring: keyword overlap (40%), citation count (25%), recency (20%), abstract similarity (15%)
Deduplication — DOI-based and fuzzy title matching to eliminate duplicates
BibTeX Generation — Fetches authoritative BibTeX via DOI content negotiation with metadata fallback
Citation Insertion — Optionally inserts \cite{} commands into your document (writes to .cited.tex, never overwrites the original). For multi-file projects, each chapter gets its own .cited.tex
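
The BibTeX Generation feature above relies on DOI content negotiation: the DOI resolver is asked for a BibTeX rendering through the `Accept: application/x-bibtex` header. The sketch below shows that request using only the standard library. The function names `build_bibtex_request` and `fetch_bibtex` are hypothetical and are not taken from CiteBot's `bib_generator.py`, whose actual implementation is not shown here.

```python
import urllib.request
from typing import Optional
from urllib.parse import quote


def build_bibtex_request(doi: str) -> urllib.request.Request:
    """Build a request that asks the DOI resolver for a BibTeX rendering."""
    url = "https://doi.org/" + quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": "application/x-bibtex"})


def fetch_bibtex(doi: str, timeout: float = 10.0) -> Optional[str]:
    """Return the BibTeX entry for a DOI, or None if the lookup fails.

    Returning None lets the caller fall back to building an entry
    from search metadata (hypothetical fallback path).
    """
    try:
        with urllib.request.urlopen(build_bibtex_request(doi), timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except OSError:  # URLError, HTTPError, and timeouts are all OSError subclasses
        return None
```

A caller would try `fetch_bibtex(doi)` first and generate an entry from metadata only when it returns `None`.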

Getting Started

Prerequisites

Anaconda or Miniconda
Python 3.11+

Installation

conda create -n citebot python=3.11 -y
conda activate citebot

git clone https://github.com/Hayden727/CiteBot.git
cd CiteBot
pip install -e .

Configuration

Copy the example environment file and fill in your API keys:

cp .env.example .env

| Variable | Purpose | Required |
|---|---|---|
| `DEEPSEEK_API_KEY` | LLM keyword extraction (great for non-English docs) | Recommended |
| `OPENAI_API_KEY` | Alternative LLM (set `OPENAI_BASE_URL` + `OPENAI_MODEL` for compatible APIs) | Optional |
| `SEMANTIC_SCHOLAR_API_KEY` | Semantic Scholar API (free, recommended for CS) | Recommended |
| `OPENCITE_EMAIL` | OpenAlex polite pool (higher rate limits) | Recommended |
| `CROSSREF_EMAIL` | CrossRef polite pool | Optional |
| `PUBMED_API_KEY` | PubMed/NCBI access | Optional |

CiteBot works without API keys, but keyword quality and search rate limits will be degraded.
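
To see which settings are missing before a run, a small check along these lines may help. It is a sketch, not part of CiteBot: the variable names come from the configuration table, while `missing_settings` and `has_llm_key` are hypothetical helpers. It reads the process environment only, so values stored in `.env` must already be exported or loaded.

```python
import os

# Names taken from the configuration table; the grouping mirrors its "Required" column.
RECOMMENDED = ("DEEPSEEK_API_KEY", "SEMANTIC_SCHOLAR_API_KEY", "OPENCITE_EMAIL")
OPTIONAL = ("OPENAI_API_KEY", "CROSSREF_EMAIL", "PUBMED_API_KEY")


def missing_settings(env=None):
    """Return the unset (or empty) variables, grouped by importance."""
    env = os.environ if env is None else env
    return {
        "recommended": [name for name in RECOMMENDED if not env.get(name)],
        "optional": [name for name in OPTIONAL if not env.get(name)],
    }


def has_llm_key(env=None):
    """True if either LLM key is set; otherwise the NLP fallback would apply."""
    env = os.environ if env is None else env
    return bool(env.get("DEEPSEEK_API_KEY") or env.get("OPENAI_API_KEY"))


if __name__ == "__main__":
    report = missing_settings()
    print("Missing recommended:", ", ".join(report["recommended"]) or "none")
    print("Missing optional:", ", ".join(report["optional"]) or "none")
```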

Usage

Basic Usage

# Single-file paper: generate 30 references
citebot paper.tex --num-refs 30 --output references.bib

# Multi-file thesis: pass the main file, auto-tracks \input/\include
citebot main.tex -n 100 -o references.bib -k 50
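
The multi-file behavior can be pictured as a recursive walk over \input{} and \include{} commands starting from the main file. The following is a minimal sketch of that idea, not CiteBot's `latex_parser.py`: `collect_tex_files` is a hypothetical name, paths are resolved relative to the main file's directory, and commented-out includes are not filtered.

```python
import re
from pathlib import Path

# Matches \input{...} and \include{...}; \includegraphics{...} does not match
# because the opening brace must follow the command name directly.
INCLUDE_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")


def collect_tex_files(main_file):
    r"""Return the main file plus every file reachable via \input / \include."""
    main = Path(main_file).resolve()
    root = main.parent  # assumption: include paths are relative to the main file
    seen = set()
    ordered = []

    def visit(path):
        if path in seen or not path.is_file():
            return  # skip cycles and missing files
        seen.add(path)
        ordered.append(path)
        text = path.read_text(encoding="utf-8", errors="replace")
        for match in INCLUDE_RE.finditer(text):
            target = root / match.group(1).strip()
            if target.suffix != ".tex":
                target = target.with_name(target.name + ".tex")
            visit(target.resolve())

    visit(main)
    return ordered
```

Files are returned in the order they are first encountered, so chapter order is preserved for per-chapter processing.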

Advanced Options

# Pass a directory — auto-finds main.tex / thesis.tex inside
citebot thesis/ -n 100 -o refs.bib

# Insert \cite{} into each chapter file (writes .cited.tex copies)
citebot main.tex -n 100 -o refs.bib --insert-cites

# Filter by year range
citebot paper.tex --year-from 2020 --year-to 2025

# Select specific data sources (CS recommended)
citebot paper.tex --sources s2,openalex,arxiv

# Verbose output with reference table
citebot paper.tex -n 20 -o refs.bib -v

All Options

| Option | Short | Default | Description |
|---|---|---|---|
| `--num-refs` | `-n` | 30 | Number of references to find |
| `--output` | `-o` | `references.bib` | Output `.bib` file path |
| `--insert-cites` | | off | Insert `\cite{}` commands, written to `.cited.tex` copies (originals are never overwritten) |
| `--year-from` | | none | Minimum publication year |
| `--year-to` | | none | Maximum publication year |
| `--sources` | | all | Comma-separated: `openalex,s2,pubmed,arxiv,biorxiv` |
| `--keywords` | `-k` | 15 | Number of keywords to extract |
| `--verbose` | `-v` | off | Show detailed output |
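
For readers who want to script around the CLI, the options table maps naturally onto an `argparse` declaration. The sketch below only mirrors the documented flags and defaults. It is not the code in `citebot/main.py`, which may use a different library, and `build_parser`, `parse_sources`, and the positional name `tex_path` are assumptions.

```python
import argparse

ALL_SOURCES = ("openalex", "s2", "pubmed", "arxiv", "biorxiv")


def parse_sources(value):
    """Split a comma-separated source list and reject unknown names."""
    sources = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [s for s in sources if s not in ALL_SOURCES]
    if unknown:
        raise argparse.ArgumentTypeError("unknown source(s): " + ", ".join(unknown))
    return sources


def build_parser():
    parser = argparse.ArgumentParser(prog="citebot")
    parser.add_argument("tex_path", help="main .tex file or project directory")
    parser.add_argument("-n", "--num-refs", type=int, default=30)
    parser.add_argument("-o", "--output", default="references.bib")
    parser.add_argument("--insert-cites", action="store_true")
    parser.add_argument("--year-from", type=int, default=None)
    parser.add_argument("--year-to", type=int, default=None)
    parser.add_argument("--sources", type=parse_sources, default=list(ALL_SOURCES))
    parser.add_argument("-k", "--keywords", type=int, default=15)
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```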

How It Works

┌──────────┐    ┌───────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  Parse   │───>│  Extract  │───>│  Search  │───>│  Rank &  │───>│ Generate │
│  .tex    │    │ Keywords  │    │ Papers   │    │  Filter  │    │  .bib    │
└──────────┘    └───────────┘    └──────────┘    └──────────┘    └──────────┘
                                                                       │
                                                                       v
                                                                ┌──────────┐
                                                                │ (Insert  │
                                                                │  cites)  │
                                                                └──────────┘

Parse — Reads your .tex file (or directory), follows \input{}/\include{} for multi-file projects, and extracts the title, abstract, and sections from all files
Extract Keywords — Uses an LLM (DeepSeek/OpenAI) for semantic keyword extraction; for multi-file projects, extracts keywords per chapter and then merges them (100+ keywords). Falls back to an NLP ensemble (KeyBERT + YAKE + spaCy) when no LLM API is configured
Search — Builds queries scaled to keyword count and searches academic databases in parallel via OpenCite
Rank & Filter — Deduplicates results and scores each paper on keyword overlap (40%), citation count (25%), recency (20%), and abstract similarity (15%)
Generate — Fetches authoritative BibTeX entries via DOI, falling back to metadata-based generation
Insert (optional) — Adds \cite{} commands at relevant positions in each file (.cited.tex copies)
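
The Rank & Filter step can be illustrated with a short sketch. Only the four weights (40/25/20/15) and the use of DOI plus fuzzy title matching come from this README. Everything else is an assumption made for illustration: the `Paper` dataclass, the 0.9 title-similarity threshold, log-scaled citation counts, a 10-year recency window, and Jaccard overlap as the abstract-similarity measure. CiteBot's `filter_ranker.py` may compute each component differently.

```python
import math
import re
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Paper:  # hypothetical record, not CiteBot's own type
    title: str
    abstract: str = ""
    year: int = 0
    citations: int = 0
    doi: str = ""


def _norm_title(title):
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()


def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def deduplicate(papers, threshold=0.9):
    """Keep the first paper of each group sharing a DOI or a near-identical title."""
    kept = []
    for paper in papers:
        is_dup = any(
            (paper.doi and paper.doi.lower() == other.doi.lower())
            or SequenceMatcher(None, _norm_title(paper.title),
                               _norm_title(other.title)).ratio() >= threshold
            for other in kept
        )
        if not is_dup:
            kept.append(paper)
    return kept


def score(paper, keywords, doc_abstract, current_year, max_citations):
    """Weighted sum of four components, each normalized to [0, 1]."""
    kw = set().union(*(_tokens(k) for k in keywords))
    text = _tokens(paper.title + " " + paper.abstract)
    overlap = len(kw & text) / len(kw) if kw else 0.0

    cites = (math.log1p(paper.citations) / math.log1p(max_citations)
             if max_citations > 0 else 0.0)

    age = max(0, current_year - paper.year)
    recency = max(0.0, 1.0 - age / 10.0) if paper.year > 0 else 0.0

    a, b = _tokens(paper.abstract), _tokens(doc_abstract)
    similarity = len(a & b) / len(a | b) if a | b else 0.0

    return 0.40 * overlap + 0.25 * cites + 0.20 * recency + 0.15 * similarity
```

Sorting the deduplicated papers by `score` in descending order and keeping the first `--num-refs` entries would then give the final reference list.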

Data Sources

| Source | Coverage | Access |
|---|---|---|
| OpenAlex | 250M+ works across all disciplines | Open, no key required |
| Semantic Scholar | 200M+ papers, CS/biomedical focus | Free API key recommended |
| PubMed | 36M+ biomedical citations | Free API key recommended |
| arXiv | 2M+ preprints in STEM fields | Open |
| bioRxiv | Biology preprints | Open |

Configurable via --sources. For CS papers, --sources s2,openalex,arxiv is recommended.
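
Querying the selected sources in parallel can be sketched with `asyncio.gather`. OpenCite's actual API is not shown in this README, so the example uses a stub in its place: `_stub_search` and `search_all` are hypothetical names, and the simulated PubMed failure only demonstrates that one failing source need not abort the others.

```python
import asyncio


async def _stub_search(source, query):
    """Stand-in for a real per-source client; returns fake results."""
    await asyncio.sleep(0)  # yield control, as a network call would
    if source == "pubmed":
        raise RuntimeError("simulated outage")
    return [f"{source}: result for {query!r}"]


async def search_all(query, sources, search=_stub_search):
    """Run one search per source concurrently and collect results by source."""
    results = await asyncio.gather(
        *(search(source, query) for source in sources),
        return_exceptions=True,  # a failing source yields its exception instead of raising
    )
    return {
        source: [] if isinstance(result, Exception) else result
        for source, result in zip(sources, results)
    }


if __name__ == "__main__":
    found = asyncio.run(search_all("graph neural networks", ["s2", "openalex", "arxiv"]))
    for source, items in found.items():
        print(source, items)
```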

Testing

conda activate citebot
python -m pytest tests/ -v --cov=citebot --cov-report=term-missing

Repository Structure

CiteBot/
├── citebot/
│   ├── __init__.py              Package init
│   ├── types.py                 Frozen dataclasses + exception hierarchy
│   ├── config.py                Configuration (OpenCite + CLI params)
│   ├── latex_parser.py          .tex file parsing
│   ├── keyword_extractor.py     LLM-first keyword extraction + NLP fallback
│   ├── literature_searcher.py   Async multi-source search
│   ├── filter_ranker.py         Deduplication + composite scoring
│   ├── bib_generator.py         BibTeX generation + validation
│   ├── cite_inserter.py         Optional \cite{} insertion
│   ├── pipeline.py              Pipeline orchestration
│   └── main.py                  CLI entry point
├── tests/                       Unit + integration tests
├── pyproject.toml               Build configuration
├── requirements.txt             Pinned dependencies
└── .env.example                 API key template

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

Fork the Project
Create your Feature Branch (git checkout -b feature/amazing-feature)
Commit your Changes (git commit -m 'feat: add amazing feature')
Push to the Branch (git push origin feature/amazing-feature)
Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Acknowledgments

OpenCite — Multi-source academic search engine
KeyBERT — Keyword extraction with BERT embeddings
