Convert a curated BibTeX bibliography and local Zotero PDF library into a hierarchically-condensed, queryable scientific knowledge base.
Ships with an MKID (Microwave Kinetic Inductance Detector) domain profile, but is designed to be redirected to any scientific corpus by editing domain_profile.yaml alone.
Given MyLibrary.bib plus PDFs resolved from Zotero, DeepResearch builds:
| Artifact | Path | Key |
|---|---|---|
| Manifest | `output/manifest.db` | citekey |
| Per-paper markdown | `output/extracted/<citekey>.md` | citekey |
| Figure / table PNGs | `output/extracted/figures/` | citekey, index |
| Figure / table store (SQLite) | `output/db/figures.db` | (citekey, kind, index) |
| Chunk store (SQLite + JSONL) | `output/db/chunks.db` | (citekey, chunk_index) |
| Extraction cache | `output/db/extraction_cache.db` | (citekey, pdf_sha256) |
| Extraction quality reports | `output/db/extraction_quality.db` | citekey |
| Re-extraction queue | `output/reextract_queue.txt` | — |
| Re-extraction BibTeX subset | `output/reextract.bib` | citekey |
| L1 per-paper digests (Sonnet) | `output/db/digests.db` | citekey |
| L2a cluster assignments (Haiku) | `output/db/cluster_assignments.db` | citekey |
| L2b cluster syntheses (Opus) | `output/db/syntheses.db` | cluster_id |
| L3 field map (Opus) | `output/field_map.md` | — |
| Vector index | `output/db/lance/` (LanceDB) | (citekey, chunk_index) |
| BM25 index | built in-process from `chunks.db` | — |
All artifacts are keyed by BibTeX citekey. Citekey is the invariant — nothing in the pipeline identifies a paper by title.
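A minimal sketch of that filtering step (the `[@citekey]` marker syntax and the `filter_citations` name are assumptions for illustration; the shipped filter works against whatever citation syntax each mode emits):

```python
import re

def filter_citations(text: str, known_citekeys: set[str]) -> str:
    """Drop any citation marker whose citekey is not in the manifest.

    Scans for [@citekey] markers and removes unknown ones, so model
    output cannot introduce citations that don't exist in the corpus.
    """
    def keep(match: re.Match) -> str:
        return match.group(0) if match.group(1) in known_citekeys else ""

    return re.sub(r"\[@([A-Za-z0-9_:-]+)\]", keep, text)

answer = "Energy resolution improved [@zobrist2022] beyond [@madeup2024]."
print(filter_citations(answer, {"zobrist2022", "day2003"}))
```

The allowlist comes from `manifest.db`, so the filter is exact-match against known citekeys rather than a heuristic.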
```
BibTeX + PDFs
      │
      ▼
Stage 0 Ingest ─────────────► manifest.db (citekey, pdf_sha256)
      │
      ▼
Stage 1 Extract ────────────► markdown + figures + tables + citation graph
      │   (docling — layout, tables, OCR, formula → LaTeX)
      ▼
Stage 2 Chunk ──────────────► chunks.db (retrieval-sized segments)
      │
      ▼
Stage 2b Triage (Haiku) ────► extraction_quality.db ┐ optional;
      │   (score readability, propose cleanup,      │ re-run ingest
      │   flag severe papers for re-extraction)     │ --force-reextract
      │                       reextract.bib ────────┘ to repair
      ▼
Stage 3 Embed + Index ──────► LanceDB (vectors) + BM25 (keywords)
      │   (hybrid retriever = weighted RRF fusion)
      ▼
Stage 4 Digests (Sonnet) ───► digests.db           ┐
      │                                            │
Stage 5 Cluster (Haiku) ────► cluster_assignments  │ Tier 2
      │                                            │ (hierarchical
Stage 6 Syntheses (Opus) ───► syntheses.db         │  synthesis)
      │                                            │
Stage 7 Field map (Opus) ───► field_map.md         ┘
      │
      ▼
Stage 8 Query ──────────────► quick (Sonnet) | medium (Sonnet) |
      │                       deep (Opus + LaTeX PDF via Tectonic)
      ▼
Stage 9 Evaluate ───────────► LLM-as-judge report (Opus)
```
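The hybrid retriever built at Stage 3 fuses the vector and BM25 rankings. A weighted reciprocal-rank-fusion step might look like this (an illustrative sketch: the weights and the `k` constant are assumptions, not the project's tuned values):

```python
def weighted_rrf(rankings: dict[str, list[str]],
                 weights: dict[str, float],
                 k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs via weighted Reciprocal Rank Fusion.

    Each retriever contributes weight / (k + rank) per document; documents
    come back sorted by their summed fused score, best first.
    """
    scores: dict[str, float] = {}
    for name, ranked in rankings.items():
        w = weights.get(name, 1.0)
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = weighted_rrf(
    {"vector": ["day2003:4", "zobrist2022:1", "gao2008:7"],
     "bm25":   ["zobrist2022:1", "gao2008:7", "day2003:4"]},
    weights={"vector": 0.6, "bm25": 0.4},
)
print(fused[0])  # the chunk ranked well by both retrievers wins
```

RRF needs only ranks, not comparable raw scores, which is why it is a common choice for fusing cosine-similarity and BM25 lists.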
- Citekey invariant. Every chunk, digest, figure description, table, synthesis, and cited claim traces back to a BibTeX citekey. LLMs never invent citations — model output is filtered against the manifest.
- Incremental execution. Every artifact is keyed by `(citekey, pdf_sha256)`. Re-runs skip unchanged work. Replace a PDF, and only that paper's downstream artifacts are recomputed.
- Budget enforcement. Every Anthropic API call flows through `CostAccumulator` (`infrastructure/cost.py`), which tracks cumulative spend and raises `BudgetExceeded` when a stage's configured cap is hit. Stages catch the exception, log partial progress, and exit cleanly so the next run resumes.
- Model tiering. Haiku for classification and repair, Sonnet for structured extraction and drafting, Opus for synthesis and judgment. Substituting a cheaper model in an Opus slot without instruction is a bug.
- Domain-neutral core. All domain knowledge lives in `domain_profile.yaml` — classification prompts, subtopic labels, evaluation questions, synthesis guidance. Code is domain-agnostic.
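The budget-enforcement pattern can be sketched roughly like this (a simplified stand-in for `infrastructure/cost.py`; the class internals and per-call costs here are illustrative assumptions):

```python
class BudgetExceeded(Exception):
    """Raised when cumulative spend would pass the stage's cap."""

class CostAccumulator:
    def __init__(self, budget_usd: float) -> None:
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record one API call's cost; refuse to start past-budget work."""
        if self.spent_usd + cost_usd > self.budget_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.budget_usd:.2f}")
        self.spent_usd += cost_usd

acc = CostAccumulator(budget_usd=0.10)
processed = []
for paper, cost in [("day2003", 0.04), ("gao2008", 0.04), ("zobrist2022", 0.04)]:
    try:
        acc.charge(cost)
    except BudgetExceeded:
        break          # stage logs partial progress and exits cleanly
    processed.append(paper)
print(processed)       # the third call would exceed the cap
```

Because artifacts are keyed and cached, re-running the stage with a fresh budget resumes from the unfinished papers.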
- Python 3.12+ (the project is developed and tested on 3.13). The `py313` conda environment is the recommended target.
- Zotero with a local PDF library (the ingest stage resolves `file:` entries in BibTeX against Zotero's storage directory).
- Claude Max subscription or Anthropic API key. DeepResearch ships with a `claude_agent_sdk` adapter that routes model calls through the user's Claude Code OAuth session (Claude Max); fall back to `ANTHROPIC_API_KEY` for direct API use.
- Tectonic (optional, required only for `--mode deep` LaTeX PDF output): `brew install tectonic` on macOS, or `cargo install tectonic`.
```shell
git clone <repo-url> && cd DeepResearch
conda activate py313   # or: python3.13 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
```

This registers the `deep-research` console script in the active environment's `bin/`.
deep-research auto-ingest discovers newly published papers relevant to your corpus, fetches their
PDFs via Unpaywall and your institutional EZProxy session, and feeds them into ingest --full +
synthesize automatically. Designed to be invoked by cron once per month.
- API keys — export required vars before running (or source from a file):

  ```shell
  export DEEP_RESEARCH_ADS_TOKEN="your-ads-api-token"
  export DEEP_RESEARCH_UNPAYWALL_EMAIL="you@example.edu"
  export DEEP_RESEARCH_S2_TOKEN=""   # optional; empty = unauthenticated low-volume
  ```
- EZProxy cookies (optional) — log into your library's EZProxy in a browser, export session cookies to JSON (see `examples/ezproxy_cookies.json.example`), then:

  ```shell
  export DEEP_RESEARCH_EZPROXY_COOKIES="/path/to/ezproxy_cookies.json"
  ```

  Cookies expire after a few weeks; refresh them when the EZProxy tier stops returning PDFs.
- Domain bibstems (optional) — add an `auto_ingest` section to `domain_profile.yaml` to restrict ADS queries to specific journals:

  ```yaml
  auto_ingest:
    ads_bibstems: ["ApJ", "MNRAS", "A&A", "PhRvB"]
    ezproxy_base_url: "https://ezproxy.library.ucsb.edu"
  ```
```shell
deep-research auto-ingest \
    --bib-path corpus.bib --output-dir ./output \
    --dry-run --corpus-cap 5
```

Dry-run writes `output/auto_ingest/<YYYY-MM>/delta.bib` and `candidates.json` without touching the corpus or fetching PDFs. Check the scored candidates before enabling the live run.
See `examples/auto_ingest.env.example` for the env file template and `examples/crontab.example` for the exact crontab block to paste into `crontab -e`.
For the full design rationale, fetch chain details, and exit code reference, see autoupdate.md.
DeepResearch does not require Zotero. It resolves each citekey to a PDF by trying the following in order, short-circuiting on the first hit:
1. BibTeX `file` field — if the `file = {...}` field on the BibTeX entry points at an existing PDF, that wins. Exports from many reference managers populate this field automatically.
2. `--pdf-map <path>` — a YAML file mapping citekey → PDF path. See `examples/pdf_map.yaml`:

   ```yaml
   day2003: ~/papers/day-broadband-2003.pdf
   zobrist2022: ~/papers/zobrist-membraneless-2022.pdf
   ```

   Paths that don't exist are logged and skipped, so a single `pdf_map.yaml` can be committed and reused across machines.
3. `--pdf-dir <dir>`, citekey-named file — `<pdf-dir>/<citekey>.pdf`. Lowest-friction option if you're willing to rename files. No extra config.
4. `--pdf-dir <dir>`, fuzzy title match — if nothing else hit, the resolver fuzzy-matches the paper's BibTeX title against PDF filenames in `--pdf-dir`. Threshold is conservative (Jaccard ≥ 0.6); misses are logged. Zotero's `~/Zotero/storage` tree works here out of the box.
--zotero-dir is retained as a silent alias for --pdf-dir for backward
compatibility. Papers that no strategy resolves are kept as metadata-only
records — the pipeline continues.
From the project root:
```shell
# Tier 1 — parse bibtex, resolve PDFs, extract, chunk, build image assets
deep-research ingest --bib-path MyLibrary.bib --output-dir ./output --full

# Tier 2 — digests → clusters → syntheses → field map → embedding index
deep-research synthesize --bib-path MyLibrary.bib --output-dir ./output --budget 15

# Ask questions
deep-research query --bib-path MyLibrary.bib \
    "How does the energy resolution of InHf compare to previous MKIDs?"

# Deep research with LaTeX PDF output (needs Tectonic)
deep-research query --bib-path MyLibrary.bib --mode deep \
    --title "Noise Sources in MKIDs" \
    "What are the dominant noise sources across MKID architectures?"
```

`ingest --full` and `synthesize` are the two expensive stages. Everything downstream (query, eval) reads from disk.
```shell
deep-research ingest --bib-path <bib> --output-dir <dir>
                     [--pdf-dir <path>] [--pdf-map <path>]
                     [--zotero-dir <path>]
                     [--full] [--images-dir <path>] [--no-images]
                     [--accelerator {auto,cpu,cuda,xpu}]
                     [--batch-size N]
```
- Without `--full`: Stage 0 only — parse BibTeX, resolve PDFs, write `manifest.db`.
- With `--full`: also runs extraction and chunking. Produces `output/extracted/<citekey>.md`, `output/extracted/figures/`, and `output/db/chunks.db`.
- PDF location flags (`--pdf-dir`, `--pdf-map`, `--zotero-dir`) are documented in Pointing DeepResearch at your PDFs.
- `--accelerator cuda` forces docling's vision models onto the GPU (CPU and CUDA only; MPS is unsupported). The default auto-selects and silently falls back to CPU — pass `cuda` explicitly on a GPU box to fail loudly instead.
- `--batch-size N` stops after N new (non-cached) extractions complete. Re-run to continue — the extraction cache makes resumed runs free. Useful for corpus-scale ingests wrapped in a shell loop.
- `--force-reextract` deletes extraction-cache entries for every citekey in the BibTeX file before extracting, bypassing the `(citekey, pdf_sha256)` cache. Intended for use after `triage` writes `output/reextract.bib` — pass that subset bib with `--full --ocr-full-page --force-reextract` to re-extract only the flagged papers.
Formula enrichment is on by default — docling runs its CodeFormulaV2 vision model on every detected equation and emits real LaTeX. This is expensive on CPU (~20 s / formula on Apple Silicon, ~0.5 s on a CUDA GPU). Results are cached by (citekey, pdf_sha256), so re-runs are free.
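The `(citekey, pdf_sha256)` cache key is what makes re-runs free and PDF replacement targeted. A minimal sketch of that check (hypothetical table layout, not the shipped schema):

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path

def pdf_sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def needs_extraction(db: sqlite3.Connection, citekey: str, pdf: Path) -> bool:
    """True unless this exact (citekey, pdf bytes) pair was already extracted."""
    row = db.execute(
        "SELECT 1 FROM extraction_cache WHERE citekey = ? AND pdf_sha256 = ?",
        (citekey, pdf_sha256(pdf)),
    ).fetchone()
    return row is None

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE extraction_cache (citekey TEXT, pdf_sha256 TEXT)")

pdf = Path(tempfile.mkdtemp()) / "day2003.pdf"
pdf.write_bytes(b"%PDF-1.4 original")
first = needs_extraction(db, "day2003", pdf)       # never seen: extract
db.execute("INSERT INTO extraction_cache VALUES (?, ?)",
           ("day2003", pdf_sha256(pdf)))
cached = needs_extraction(db, "day2003", pdf)      # cache hit: skip, free
pdf.write_bytes(b"%PDF-1.4 replaced")
replaced = needs_extraction(db, "day2003", pdf)    # hash changed: re-extract
```

Hashing the bytes rather than the path means renaming or moving a PDF does not invalidate its cache entry, but replacing its content always does.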
```shell
deep-research synthesize --bib-path <bib> --output-dir <dir>
                         [--budget <USD>] [--zotero-dir <path>]
                         [--skip-index]
                         [--cost-summary-path <path>]
```
Chains Tier 2 end-to-end: digests (Sonnet) → cluster assignments (Haiku) → cluster syntheses (Opus) → field map (Opus) → LanceDB embedding index. Requires ingest --full to have run first.
Typical cost for a 7-paper corpus: ~$1–2, ~10 min wall. Scales roughly linearly with paper count for digests, with cluster count for syntheses. Pass --skip-index to skip embedding if you only need the synthesis artifacts.
```shell
deep-research triage --bib-path <bib> --output-dir <dir>
                     [--threshold N] [--max-review-chars N]
                     [--apply] [--budget <USD>]
```
Runs Haiku over every <citekey>.md under <output-dir>/extracted/, scoring readability (0–10) and proposing boilerplate cleanup. Results are stored in <output-dir>/db/extraction_quality.db. Papers scored below --threshold (default 6) or explicitly flagged by Haiku as needing full-page OCR are written to:
- `<output-dir>/reextract.bib` — BibTeX subset (feed back to `ingest` as `--bib-path`)
- `<output-dir>/reextract_queue.txt` — citekeys, one per line
Without --apply this is a dry run: proposed line-range deletions are stored in the DB but the .md files are not touched. With --apply, the deletions are applied in place (original saved as <citekey>.original.md) and the affected papers are re-chunked in chunks.db with the existing pdf_sha256 preserved.
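The apply step boils down to deleting the proposed line ranges from the markdown while keeping a backup. Roughly (an illustrative sketch, not the shipped code):

```python
import tempfile
from pathlib import Path

def apply_deletions(md_path: Path, ranges: list[tuple[int, int]]) -> Path:
    """Delete 1-indexed inclusive line ranges from a paper's markdown
    in place, saving the untouched original as <citekey>.original.md."""
    text = md_path.read_text()
    backup = md_path.with_suffix(".original.md")
    backup.write_text(text)
    drop = {i for start, end in ranges for i in range(start, end + 1)}
    kept = [ln for i, ln in enumerate(text.splitlines(keepends=True), start=1)
            if i not in drop]
    md_path.write_text("".join(kept))
    return backup

md = Path(tempfile.mkdtemp()) / "day2003.md"
md.write_text("# Title\nrunning header junk\nrunning header junk\nReal content.\n")
backup = apply_deletions(md, [(2, 3)])   # triage proposed dropping lines 2-3
```

After the in-place edit, the affected paper is re-chunked so `chunks.db` reflects the cleaned text while the `.original.md` backup preserves what docling emitted.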
Typical repair workflow after triage:
```shell
# Step 1 — triage (dry run)
deep-research triage --bib-path MyLibrary.bib --output-dir ./output

# Step 2 — apply boilerplate cleanup in place
deep-research triage --bib-path MyLibrary.bib --output-dir ./output --apply

# Step 3 — re-extract severely damaged papers with full-page OCR
deep-research ingest --bib-path ./output/reextract.bib --output-dir ./output \
    --full --ocr-full-page --force-reextract
```

Triage results are cached by `(citekey, markdown_sha256)`, so re-runs after re-extraction automatically skip already-assessed papers whose content has not changed.
```shell
deep-research query --bib-path <bib> --output-dir <dir>
                    [--mode {quick,medium,deep}] [--top-k N]
                    [--title "..."] [--budget <USD>]
                    [--citation-boost | --no-citation-boost]
                    "your question"
```
| Mode | Context | Model | Output |
|---|---|---|---|
| `quick` (default) | Retrieved chunks only | Sonnet | Text answer with citations |
| `medium` | Chunks + per-paper digests | Sonnet | Longer synthesized answer |
| `deep` | Chunks + digests + cluster syntheses + field map | Opus | LaTeX source + compiled PDF (needs Tectonic) |
--citation-boost / --no-citation-boost: boost retrieval ranks by citation in-degree using <output-dir>/db/citation_edges.db. Default is auto-detect — enabled when the DB exists, silently disabled otherwise. Pass --no-citation-boost to disable even when the DB is present.
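Citation boosting can be sketched as multiplying each retrieval score by a damped function of in-degree (illustrative only; the project's actual weighting may differ):

```python
import math

def boost_by_citations(scored: list[tuple[str, float]],
                       in_degree: dict[str, int],
                       strength: float = 0.1) -> list[tuple[str, float]]:
    """Re-rank (citekey, score) pairs, nudging highly-cited papers upward.

    The log damping keeps a 1000-citation classic from drowning out
    relevance; strength = 0 reproduces the original ranking.
    """
    boosted = [
        (key, score * (1.0 + strength * math.log1p(in_degree.get(key, 0))))
        for key, score in scored
    ]
    return sorted(boosted, key=lambda kv: kv[1], reverse=True)

ranked = boost_by_citations(
    [("niche2024", 0.80), ("day2003", 0.78)],
    in_degree={"day2003": 1500, "niche2024": 2},
)
print(ranked[0][0])  # the heavily-cited paper overtakes the marginal lead
```

Papers absent from `citation_edges.db` simply get a multiplier of 1.0, so the boost never penalizes uncited work below its retrieval score.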
The standalone deep-research deep … subcommand is retained as a back-compat alias for query --mode deep. Prefer the unified form.
Deep-mode reports can embed extracted figures and tables inline, with caption attribution via \citet{...}:
- Figure metadata (captions + PNG paths) is persisted to `<output-dir>/db/figures.db` automatically during `ingest --full --images-dir <path>`. Without `--images-dir`, figures are parsed but not rasterised — the hallucination filter will later drop any rows with missing image paths.
- `query --mode deep` auto-detects `figures.db`: when present, Opus is shown an `AVAILABLE FIGURES` section and may include up to 6 figures inline. No CLI flag controls this — presence of `figures.db` is the switch.
- Missing `figures.db`, missing PNG on disk, or hallucinated `\includegraphics` paths all degrade silently to prose-only: the hallucination filter strips any `\begin{figure}...\end{figure}` block whose path was not in the allowlist.
- Backfilling an existing corpus that was ingested before `figures.db` existed: re-run `ingest --full --force-reextract --images-dir <path>` on the same bib. The docling extraction cache is bypassed for those citekeys and `figures.db` is populated as a side effect.
```shell
deep-research eval --bib-path <bib> --output-dir <dir> [--report-path <path>]
                   [--citation-boost | --no-citation-boost]
```
Runs the evaluation questions declared in domain_profile.yaml through the quick pipeline, then has Opus score each answer (LLM-as-judge). Writes a JSON report summarizing accuracy, citation faithfulness, and per-question scores. --citation-boost / --no-citation-boost controls retrieval-rank boosting by citation in-degree (same semantics as query; default: auto-detect).
```shell
deep-research smoke-test --bib-path <bib>
```
Runs every stage on a tiny fixture corpus without making a single API call — catches import errors, schema drift, and pipeline wiring bugs before you spend money on a real run. Use after pulling new code.
domain_profile.yaml is the single file you edit to redirect the pipeline at a different corpus. It holds:
```yaml
domain_name: "Microwave Kinetic Inductance Detectors (MKIDs)"
key_topics: [...]                    # names of subtopics for clustering
classification_prompt_fragments:     # domain text injected into LLM prompts
  relevance_check: ...
  subtopic_assignment: ...
  digest_focus: ...
synthesis_guidance:                  # what "a good synthesis" means here
  cluster_theme: ...
  field_map: ...
evaluation_questions:                # ground-truth Q&A set
  - question: "..."
    expected_citations: [...]
```

No code changes are required when switching domains.
Each stage's budget is set via the --budget flag (USD). CostAccumulator enforces it; a stage that hits its cap raises BudgetExceeded, logs partial progress, and exits. Re-run to resume from where you stopped.
Priority order for LLM calls:
1. `claude_agent_sdk` (Claude Max OAuth) — used when the `claude` CLI is on `PATH` or the bundled CLI ships with the package. No API key required.
2. `ANTHROPIC_API_KEY` — direct Anthropic API fallback.
The adapter lives in deep_research/infrastructure/llm_client.py. Input tokens are read as input_tokens + cache_read_input_tokens + cache_creation_input_tokens (the agent SDK routes most input through prompt caching, so reading only input_tokens under-reports by orders of magnitude).
```
deep_research/
├── infrastructure/       # schemas (Pydantic), cost tracker, cache, LLM client, logging
├── ingest/               # BibTeX parsing (bibtexparser), Zotero PDF resolution, manifest SQLite
├── extraction/           # docling PDF → markdown (formula enrichment), figures, tables, citation graph
├── chunking/             # markdown → retrieval chunks with overlap
├── knowledge_synthesis/  # L1 digests, L2 cluster assignment + syntheses, L3 field map
├── retrieval_index/      # LanceDB vector store, BM25, hybrid fusion, sentence-transformers embeddings
├── query_engine/         # quick / medium / deep mode handlers + router
├── latex_output/         # Jinja templates, .bib subset generation, Tectonic compiler
├── evaluation/           # question sets, LLM-as-judge, aggregate reporting
├── auto_ingest/          # monthly discovery, relevance scoring, PDF fetch, bib update
└── scripts/              # CLI (argparse), smoke test
tests/                    # 982 tests; run via pytest
domain_profile.yaml       # the only file to edit when switching domains
.claude/spec/             # SPEC.md, CHANGELOG.md, DATA_CONTRACTS.yaml, FINDINGS.md
```
```shell
pytest tests/                         # full suite (~1.5s, no external calls)
pytest tests/test_integration.py -v   # integration tests only
pytest tests/ --cov=deep_research --cov-report=term-missing
```

The test suite stubs all external services — no network, no API, no disk state in other repos.
From the reference 7-paper corpus (Apple Silicon, CPU for docling):
| Stage | Wall time | Cost |
|---|---|---|
| `ingest --full` (with formula enrichment) | ~45 min | $0 |
| `synthesize` (Tier 2, Opus for syntheses + field map) | ~10 min | ~$1.20 |
| `query --mode quick` | ~10 s | ~$0.005 |
| `query --mode deep` (full report + PDF) | ~60 s | ~$0.10 |
Formula enrichment dominates ingest --full time. A CUDA box (e.g., RTX 4080) would cut per-paper extraction from minutes to ~10–30 s. Re-runs on the same PDFs are free due to the extraction cache.
See .claude/spec/FINDINGS.md for the full list. Highlights:
- `sentence-transformers` emits `HF_TOKEN` unauthenticated-request warnings on first load. Set `HF_TOKEN` or ignore.
- Cost tracker intentionally over-estimates: it treats `cache_read_input_tokens` at full-rate pricing rather than the ~10% cached rate, which slightly inflates the budget number. Actual billing comes from Anthropic.
- `docling` does not support MPS for its vision-language models; CPU or CUDA only on Apple hardware.
BSD 3-Clause. See LICENSE.