Hayden727/CiteBot

CiteBot

An intelligent citation assistant that analyzes your LaTeX document, searches academic databases, and generates a complete BibTeX file.

Report Bug · Request Feature · Chinese Documentation

Table of Contents
  1. About The Project
  2. Features
  3. Getting Started
  4. Usage
  5. How It Works
  6. Data Sources
  7. Testing
  8. Repository Structure
  9. Contributing
  10. License
  11. Acknowledgments

About The Project

CiteBot automates the tedious process of finding and formatting references for academic papers. Give it your .tex file and a target number of references — it handles the rest: parsing your document, understanding what you're writing about, searching multiple academic databases in parallel, ranking results by relevance, and generating a ready-to-use .bib file.

Built With

Python OpenCite DeepSeek Click

Features

  • Multi-File Project Support: Pass your main .tex file or a project directory, and CiteBot follows \input{}/\include{} to parse the entire project. It generates one unified .bib file and, with --insert-cites, inserts citations into each chapter file.
  • LaTeX Parsing: Extracts the title, abstract, sections, and existing citations from .tex files. Supports \chapter, \section, and Chinese-language documents.
  • LLM-Powered Keyword Extraction: Uses DeepSeek or OpenAI to understand document semantics and extract precise English academic terms. Large projects are processed chapter by chapter and can yield 100+ keywords. When no LLM API is configured, CiteBot falls back to an NLP ensemble (KeyBERT + YAKE + spaCy).
  • Multi-Source Search: Queries OpenAlex, Semantic Scholar, PubMed, arXiv, and BioRxiv in parallel via OpenCite.
  • Smart Ranking: Scores each paper with a weighted composite of keyword overlap (40%), citation count (25%), recency (20%), and abstract similarity (15%).
  • Deduplication: Removes duplicates using DOI matching and fuzzy title matching.
  • BibTeX Generation: Fetches authoritative BibTeX via DOI content negotiation, falling back to entries built from metadata.
  • Citation Insertion: Optionally inserts \cite{} commands into your document. Output goes to .cited.tex; the original is never overwritten. For multi-file projects, each chapter gets its own .cited.tex.
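
The deduplication feature above combines exact DOI matching with fuzzy title matching. The following is a minimal sketch of that idea, not CiteBot's actual code: the function names, the dict-based paper records, and the 0.9 similarity threshold are all assumptions made for illustration, and the standard library's difflib stands in for whatever fuzzy matcher the project uses.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(papers: list[dict], threshold: float = 0.9) -> list[dict]:
    """Keep the first occurrence of each paper (hypothetical helper).

    A paper is a duplicate if its DOI was already seen (case-insensitive)
    or if its normalized title is at least `threshold` similar to a kept title.
    """
    seen_dois: set[str] = set()
    kept_titles: list[str] = []
    unique: list[dict] = []
    for paper in papers:
        doi = (paper.get("doi") or "").strip().lower()
        title = normalize_title(paper.get("title") or "")
        if doi and doi in seen_dois:
            continue
        # Skip the fuzzy check for empty titles so untitled records are not merged.
        if title and any(
            SequenceMatcher(None, title, kept).ratio() >= threshold
            for kept in kept_titles
        ):
            continue
        if doi:
            seen_dois.add(doi)
        if title:
            kept_titles.append(title)
        unique.append(paper)
    return unique


papers = [
    {"doi": "10.1000/a", "title": "Attention Is All You Need"},
    {"doi": "10.1000/A", "title": "Attention is all you need."},
    {"doi": None, "title": "Attention Is All You Need!"},
    {"doi": "10.1000/b", "title": "Deep Residual Learning for Image Recognition"},
]
unique = deduplicate(papers)
```

The pairwise title comparison is quadratic in the number of kept papers, which is acceptable for a few hundred search results but would need indexing for much larger sets.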

Getting Started

Prerequisites

Installation

conda create -n citebot python=3.11 -y
conda activate citebot

git clone https://github.com/Hayden727/CiteBot.git
cd CiteBot
pip install -e .

Configuration

Copy the example environment file and fill in your API keys:

cp .env.example .env

Variable                   Purpose                                                                    Required
DEEPSEEK_API_KEY           LLM keyword extraction (great for non-English docs)                        Recommended
OPENAI_API_KEY             Alternative LLM (set OPENAI_BASE_URL + OPENAI_MODEL for compatible APIs)   Optional
SEMANTIC_SCHOLAR_API_KEY   Semantic Scholar API (free, recommended for CS)                            Recommended
OPENCITE_EMAIL             OpenAlex polite pool (higher rate limits)                                  Recommended
CROSSREF_EMAIL             CrossRef polite pool                                                       Optional
PUBMED_API_KEY             PubMed/NCBI access                                                         Optional

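CiteBot runs without any API keys, but with reduced quality: without an LLM key, keyword extraction falls back to the NLP ensemble, and without the source keys and contact emails, searches are subject to lower rate limits.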

Usage

Basic Usage

# Single-file paper: generate 30 references
citebot paper.tex --num-refs 30 --output references.bib

# Multi-file thesis: pass the main file, auto-tracks \input/\include
citebot main.tex -n 100 -o references.bib -k 50
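
For readers curious what following \input/\include involves, the sketch below collects the files of a multi-file project starting from the main file. It is an illustration under stated assumptions rather than CiteBot's parser: the function name and regular expressions are hypothetical, paths are resolved relative to the main file's directory, and commented-out includes are ignored.

```python
import re
from pathlib import Path

# Matches \input{...} and \include{...}, but not \includegraphics{...}.
INCLUDE_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")
# Strips LaTeX comments: an unescaped % through the end of the line.
COMMENT_RE = re.compile(r"(?<!\\)%.*")


def collect_tex_files(main_file: Path) -> list[Path]:
    """Return the main file and every file it pulls in, in document order."""
    root = main_file.resolve().parent
    ordered: list[Path] = []
    seen: set[Path] = set()

    def visit(path: Path) -> None:
        path = path.resolve()
        if path in seen or not path.is_file():
            return  # skip cycles and missing files
        seen.add(path)
        ordered.append(path)
        text = COMMENT_RE.sub("", path.read_text(encoding="utf-8", errors="replace"))
        for name in INCLUDE_RE.findall(text):
            child = root / name.strip()
            if child.suffix != ".tex":
                # LaTeX lets authors omit the .tex extension.
                child = child.with_name(child.name + ".tex")
            visit(child)

    visit(main_file)
    return ordered
```

Calling collect_tex_files(Path("main.tex")) would return main.tex followed by each chapter file in the order it is referenced.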

Advanced Options

# Pass a directory — auto-finds main.tex / thesis.tex inside
citebot thesis/ -n 100 -o refs.bib

# Insert \cite{} into each chapter file (writes .cited.tex copies)
citebot main.tex -n 100 -o refs.bib --insert-cites

# Filter by year range
citebot paper.tex --year-from 2020 --year-to 2025

# Select specific data sources (CS recommended)
citebot paper.tex --sources s2,openalex,arxiv

# Verbose output with reference table
citebot paper.tex -n 20 -o refs.bib -v

All Options

Option          Short   Default          Description
--num-refs      -n      30               Number of references to find
--output        -o      references.bib   Output .bib file path
--insert-cites          off              Insert \cite{} commands (written to .cited.tex copies)
--year-from             none             Minimum publication year
--year-to               none             Maximum publication year
--sources               all              Comma-separated list: openalex,s2,pubmed,arxiv,biorxiv
--keywords      -k      15               Number of keywords to extract
--verbose       -v      off              Show detailed output

How It Works

┌──────────┐    ┌───────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  Parse   │───>│  Extract  │───>│  Search  │───>│  Rank &  │───>│ Generate │
│  .tex    │    │ Keywords  │    │ Papers   │    │  Filter  │    │  .bib    │
└──────────┘    └───────────┘    └──────────┘    └──────────┘    └──────────┘
                                                                       │
                                                                       v
                                                                ┌──────────┐
                                                                │ (Insert  │
                                                                │  cites)  │
                                                                └──────────┘
  1. Parse: Reads your .tex file (or directory), follows \input{}/\include{} for multi-file projects, and extracts the title, abstract, and sections from all files.
  2. Extract Keywords: Uses an LLM (DeepSeek/OpenAI) for semantic keyword extraction. For multi-file projects, keywords are extracted per chapter and then merged (100+ keywords). Without an LLM, the NLP ensemble (KeyBERT + YAKE + spaCy) is used instead.
  3. Search: Builds queries scaled to the keyword count and searches academic databases in parallel via OpenCite.
  4. Rank & Filter: Deduplicates results and scores each paper on keyword overlap (40%), citation count (25%), recency (20%), and abstract similarity (15%).
  5. Generate: Fetches authoritative BibTeX entries via DOI, falling back to metadata-based generation.
  6. Insert (optional): Adds \cite{} commands at relevant positions in each file, writing .cited.tex copies.
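
The ranking step is a weighted sum of four components. The weights below come from this README; everything else is an assumption made for illustration, including how each component is normalized to the range 0 to 1 (log-scaled citations, a linear 10-year recency window, and token-set Jaccard overlap as a stand-in for abstract similarity) and all function and field names.

```python
import math
import re

# Weights as stated in the README; they sum to 1.0.
WEIGHTS = {"keyword": 0.40, "citations": 0.25, "recency": 0.20, "abstract": 0.15}


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def composite_score(
    paper: dict,
    keywords: list[str],
    doc_abstract: str,
    max_citations: int,
    current_year: int,
    horizon: int = 10,  # assumed: papers older than this score 0 on recency
) -> float:
    """Score one paper in [0, 1] (hypothetical normalization choices)."""
    text = _tokens(paper["title"] + " " + paper["abstract"])
    # Fraction of keywords whose words all appear in the title or abstract.
    hits = sum(1 for kw in keywords if _tokens(kw) <= text)
    keyword = hits / len(keywords) if keywords else 0.0
    # Log scaling keeps a few highly cited papers from dominating.
    citations = (
        math.log1p(paper["citations"]) / math.log1p(max_citations)
        if max_citations > 0
        else 0.0
    )
    age = max(current_year - paper["year"], 0)
    recency = max(1.0 - age / horizon, 0.0)
    a, b = _tokens(paper["abstract"]), _tokens(doc_abstract)
    abstract = len(a & b) / len(a | b) if (a | b) else 0.0
    return (
        WEIGHTS["keyword"] * keyword
        + WEIGHTS["citations"] * citations
        + WEIGHTS["recency"] * recency
        + WEIGHTS["abstract"] * abstract
    )


best = composite_score(
    {"title": "Graph neural network survey", "abstract": "graph neural network",
     "citations": 100, "year": 2025},
    keywords=["graph neural network"],
    doc_abstract="graph neural network",
    max_citations=100,
    current_year=2025,
)
```

Because every component is normalized before weighting, a paper that maximizes all four scores close to 1.0 and one that matches nothing scores 0.0, which makes scores comparable across searches.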

Data Sources

Source             Coverage                              Access
OpenAlex           250M+ works across all disciplines    Open, no key required
Semantic Scholar   200M+ papers, CS/biomedical focus     Free API key recommended
PubMed             36M+ biomedical citations             Free API key recommended
arXiv              2M+ preprints in STEM fields          Open
BioRxiv            Biology preprints                     Open

Configurable via --sources. For CS papers, --sources s2,openalex,arxiv is recommended.
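
As a rough picture of source selection and parallel querying, the sketch below parses a --sources value and fans one query out to the selected sources with asyncio. It does not use OpenCite's API: search_source is a hypothetical stub that returns canned data, and parse_sources and search_all are illustrative names. Only the source identifiers and the "all" default are taken from the options table.

```python
import asyncio

# Source identifiers as listed for the --sources option.
KNOWN_SOURCES = ("openalex", "s2", "pubmed", "arxiv", "biorxiv")


def parse_sources(value: str) -> list[str]:
    """Turn 's2,openalex,arxiv' (or 'all') into a validated list."""
    if value.strip().lower() == "all":
        return list(KNOWN_SOURCES)
    sources = [s.strip().lower() for s in value.split(",") if s.strip()]
    unknown = [s for s in sources if s not in KNOWN_SOURCES]
    if unknown:
        raise ValueError(f"unknown sources: {', '.join(unknown)}")
    return sources


async def search_source(source: str, query: str) -> list[dict]:
    """Hypothetical stand-in for a real per-source search call."""
    await asyncio.sleep(0)  # yield control, as a network request would
    return [{"source": source, "title": f"{query} ({source})"}]


async def search_all(sources: list[str], query: str) -> list[dict]:
    """Query every source concurrently; a failing source is skipped."""
    results = await asyncio.gather(
        *(search_source(s, query) for s in sources), return_exceptions=True
    )
    papers: list[dict] = []
    for result in results:
        if isinstance(result, Exception):
            continue  # one source failing should not abort the whole search
        papers.extend(result)
    return papers


papers = asyncio.run(search_all(parse_sources("s2,openalex,arxiv"), "citation graphs"))
```

Using return_exceptions=True means a rate-limited or unreachable source degrades the result set instead of raising, which suits a tool that treats most API keys as optional.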

Testing

conda activate citebot
python -m pytest tests/ -v --cov=citebot --cov-report=term-missing

Repository Structure

CiteBot/
├── citebot/
│   ├── __init__.py              Package init
│   ├── types.py                 Frozen dataclasses + exception hierarchy
│   ├── config.py                Configuration (OpenCite + CLI params)
│   ├── latex_parser.py          .tex file parsing
│   ├── keyword_extractor.py     LLM-first keyword extraction + NLP fallback
│   ├── literature_searcher.py   Async multi-source search
│   ├── filter_ranker.py         Deduplication + composite scoring
│   ├── bib_generator.py         BibTeX generation + validation
│   ├── cite_inserter.py         Optional \cite{} insertion
│   ├── pipeline.py              Pipeline orchestration
│   └── main.py                  CLI entry point
├── tests/                       Unit + integration tests
├── pyproject.toml               Build configuration
├── requirements.txt             Pinned dependencies
└── .env.example                 API key template

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/amazing-feature)
  3. Commit your Changes (git commit -m 'feat: add amazing feature')
  4. Push to the Branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Acknowledgments

  • OpenCite — Multi-source academic search engine
  • KeyBERT — Keyword extraction with BERT embeddings
