MazinLab/deep_research
DeepResearch

Convert a curated BibTeX bibliography and local Zotero PDF library into a hierarchically-condensed, queryable scientific knowledge base.

Ships with an MKID (Microwave Kinetic Inductance Detector) domain profile, but is designed to be redirected to any scientific corpus by editing domain_profile.yaml alone.


What the pipeline produces

Given MyLibrary.bib plus PDFs resolved from Zotero, DeepResearch builds:

Artifact                         Path                               Key
Manifest                         output/manifest.db                 citekey
Per-paper markdown               output/extracted/<citekey>.md      citekey
Figure / table PNGs              output/extracted/figures/          citekey, index
Figure / table store (SQLite)    output/db/figures.db               (citekey, kind, index)
Chunk store (SQLite + JSONL)     output/db/chunks.db                (citekey, chunk_index)
Extraction cache                 output/db/extraction_cache.db      (citekey, pdf_sha256)
Extraction quality reports       output/db/extraction_quality.db    citekey
Re-extraction queue              output/reextract_queue.txt
Re-extraction BibTeX subset      output/reextract.bib               citekey
L1 per-paper digests (Sonnet)    output/db/digests.db               citekey
L2a cluster assignments (Haiku)  output/db/cluster_assignments.db   citekey
L2b cluster syntheses (Opus)     output/db/syntheses.db             cluster_id
L3 field map (Opus)              output/field_map.md
Vector index                     output/db/lance/ (LanceDB)         (citekey, chunk_index)
BM25 index                       built in-process from chunks.db

All artifacts are keyed by BibTeX citekey. Citekey is the invariant — nothing in the pipeline identifies a paper by title.


Architecture

BibTeX + PDFs
     │
     ▼
 Stage 0  Ingest ─────────────► manifest.db (citekey, pdf_sha256)
     │
     ▼
 Stage 1  Extract ────────────► markdown + figures + tables + citation graph
     │        (docling — layout, tables, OCR, formula → LaTeX)
     ▼
 Stage 2  Chunk ──────────────► chunks.db (retrieval-sized segments)
     │
     ▼
 Stage 2b Triage (Haiku) ───► extraction_quality.db   ┐  optional;
     │        (score readability, propose cleanup,      │  re-run ingest
     │         flag severe papers for re-extraction)    │  --force-reextract
     │                         reextract.bib ──────────┘  to repair
     ▼
 Stage 3  Embed + Index ──────► LanceDB (vectors) + BM25 (keywords)
     │        (hybrid retriever = weighted RRF fusion)
     ▼
 Stage 4  Digests (Sonnet) ───► digests.db          ┐
                                                    │
 Stage 5  Cluster (Haiku) ────► cluster_assignments │  Tier 2
                                                    │  (hierarchical
 Stage 6  Syntheses (Opus) ──► syntheses.db         │   synthesis)
                                                    │
 Stage 7  Field map (Opus) ──► field_map.md         ┘
     │
     ▼
 Stage 8  Query ──────────────► quick (Sonnet) | medium (Sonnet) |
                                 deep (Opus + LaTeX PDF via Tectonic)
     │
     ▼
 Stage 9  Evaluate ───────────► LLM-as-judge report (Opus)
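The weighted RRF fusion used by the Stage 3 hybrid retriever can be sketched as follows; the weights and the k constant here are illustrative, not the project's actual settings:

```python
def weighted_rrf(rankings, weights, k=60):
    """Fuse several ranked hit lists into one.

    rankings: list of id lists, best-first (e.g. vector hits, BM25 hits).
    weights:  one weight per ranking.
    Score(id) = sum_i w_i / (k + rank_i(id)); ids absent from a ranking
    simply contribute nothing for that ranking.
    """
    scores = {}
    for ranking, w in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Vector search favours A, BM25 favours B; the fused order depends
# on the per-retriever weights.
vector_hits = ["A", "B", "C"]
bm25_hits = ["B", "D", "A"]
fused = weighted_rrf([vector_hits, bm25_hits], weights=[0.6, 0.4])
```

RRF only needs ranks, never raw scores, which is why it fuses cosine similarities and BM25 scores without any calibration step.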

Design principles

  • Citekey invariant. Every chunk, digest, figure description, table, synthesis, and cited claim traces back to a BibTeX citekey. LLMs never invent citations — we filter model output against the manifest.
  • Incremental execution. Every artifact is keyed by (citekey, pdf_sha256). Re-runs skip unchanged work. Replace a PDF, and only that paper's downstream artifacts are recomputed.
  • Budget enforcement. Every Anthropic API call flows through CostAccumulator (infrastructure/cost.py), which tracks cumulative spend and raises BudgetExceeded when a stage's configured cap is hit. Stages catch the exception, log partial progress, and exit cleanly so the next run resumes.
  • Model tiering. Haiku for classification and repair, Sonnet for structured extraction and drafting, Opus for synthesis and judgment. Substituting a cheaper model in an Opus slot without instruction is a bug.
  • Domain-neutral core. All domain knowledge lives in domain_profile.yaml — classification prompts, subtopic labels, evaluation questions, synthesis guidance. Code is domain-agnostic.
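The manifest filtering behind the citekey invariant could look roughly like this minimal sketch; the helper name and the \cite-style citation syntax are assumptions for illustration:

```python
import re

def filter_citations(answer_text, manifest_citekeys):
    r"""Drop any \cite-style reference whose citekey is not in the
    manifest, so model output can never smuggle in an invented paper."""
    def keep_known(match):
        keys = [k.strip() for k in match.group(1).split(",")]
        kept = [k for k in keys if k in manifest_citekeys]
        return r"\cite{%s}" % ",".join(kept) if kept else ""
    return re.sub(r"\\cite\{([^}]*)\}", keep_known, answer_text)

manifest = {"day2003", "zobrist2022"}
text = r"MKIDs were introduced by \cite{day2003} and refined \cite{fake2099}."
clean = filter_citations(text, manifest)
```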

Installation

Prerequisites

  • Python 3.12+ (the project is developed and tested on 3.13). The py313 conda environment is the recommended target.
  • Zotero with a local PDF library (optional — the ingest stage resolves file: entries in BibTeX against Zotero's storage directory; see Pointing DeepResearch at your PDFs for the non-Zotero options).
  • Claude Max subscription or Anthropic API key. DeepResearch ships with a claude_agent_sdk adapter that routes model calls through the user's Claude Code OAuth session (Claude Max); fall back to ANTHROPIC_API_KEY for direct API use.
  • Tectonic (optional, required only for --mode deep LaTeX PDF output): brew install tectonic on macOS, or cargo install tectonic.

Install

git clone <repo-url> && cd DeepResearch
conda activate py313                    # or: python3.13 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

This registers the deep-research console script in the active environment's bin/.


Auto-ingest (optional)

deep-research auto-ingest discovers newly published papers relevant to your corpus, fetches their PDFs via Unpaywall and your institutional EZProxy session, and feeds them into ingest --full + synthesize automatically. Designed to be invoked by cron once per month.

Setup

  1. API keys — export required vars before running (or source from a file):

    export DEEP_RESEARCH_ADS_TOKEN="your-ads-api-token"
    export DEEP_RESEARCH_UNPAYWALL_EMAIL="you@example.edu"
    export DEEP_RESEARCH_S2_TOKEN=""        # optional; empty = unauthenticated low-volume
  2. EZProxy cookies (optional) — log into your library's EZProxy in a browser, export session cookies to JSON (see examples/ezproxy_cookies.json.example), then:

    export DEEP_RESEARCH_EZPROXY_COOKIES="/path/to/ezproxy_cookies.json"

    Cookies expire after a few weeks; refresh them when the EZProxy tier stops returning PDFs.

  3. Domain bibstems (optional) — add an auto_ingest section to domain_profile.yaml to restrict ADS queries to specific journals:

    auto_ingest:
      ads_bibstems: ["ApJ", "MNRAS", "A&A", "PhRvB"]
      ezproxy_base_url: "https://ezproxy.library.ucsb.edu"

Manual dry run

deep-research auto-ingest \
    --bib-path corpus.bib --output-dir ./output \
    --dry-run --corpus-cap 5

Dry-run writes output/auto_ingest/<YYYY-MM>/delta.bib and candidates.json without touching the corpus or fetching PDFs. Check the scored candidates before enabling the live run.

Cron setup

See examples/auto_ingest.env.example for the env file template and examples/crontab.example for the exact crontab block to paste into crontab -e.

For the full design rationale, fetch chain details, and exit code reference, see autoupdate.md.


Pointing DeepResearch at your PDFs

DeepResearch does not require Zotero. It resolves each citekey to a PDF by trying the following in order, short-circuiting on the first hit:

  1. BibTeX file field — if the file = {...} field on the BibTeX entry points at an existing PDF, that wins. Exports from many reference managers populate this field automatically.
  2. --pdf-map <path> — a YAML file mapping citekey → PDF path. See examples/pdf_map.yaml:
    day2003: ~/papers/day-broadband-2003.pdf
    zobrist2022: ~/papers/zobrist-membraneless-2022.pdf
    Paths that don't exist are logged and skipped, so a single pdf_map.yaml can be committed and reused across machines.
  3. --pdf-dir <dir>, citekey-named file — <pdf-dir>/<citekey>.pdf. Lowest-friction option if you're willing to rename files. No extra config.
  4. --pdf-dir <dir>, fuzzy title match — if nothing else hit, the resolver fuzzy-matches the paper's BibTeX title against PDF filenames in --pdf-dir. Threshold is conservative (Jaccard ≥ 0.6); misses are logged. Zotero's ~/Zotero/storage tree works here out of the box.

--zotero-dir is retained as a silent alias for --pdf-dir for backward compatibility. Papers that no strategy resolves are kept as metadata-only records — the pipeline continues.
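The four-strategy resolution order can be sketched as follows; the function names and the token-set Jaccard measure are illustrative assumptions, not the project's actual implementation:

```python
from pathlib import Path

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two title-like strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def resolve_pdf(citekey, title, file_field=None, pdf_map=None, pdf_dir=None):
    """Try the four strategies in order, short-circuiting on the first hit."""
    # 1. BibTeX file = {...} field
    if file_field and Path(file_field).expanduser().exists():
        return Path(file_field).expanduser()
    # 2. --pdf-map YAML entry (citekey -> path)
    if pdf_map and citekey in pdf_map:
        p = Path(pdf_map[citekey]).expanduser()
        if p.exists():
            return p
    if pdf_dir:
        # 3. citekey-named file
        named = Path(pdf_dir) / f"{citekey}.pdf"
        if named.exists():
            return named
        # 4. conservative fuzzy title match against filenames
        best = max(Path(pdf_dir).rglob("*.pdf"),
                   key=lambda p: jaccard(title, p.stem.replace("-", " ")),
                   default=None)
        if best and jaccard(title, best.stem.replace("-", " ")) >= 0.6:
            return best
    return None  # metadata-only record; the pipeline continues
```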

Quick start

From the project root:

# Tier 1 — parse bibtex, resolve PDFs, extract, chunk, build image assets
deep-research ingest --bib-path MyLibrary.bib --output-dir ./output --full

# Tier 2 — digests → clusters → syntheses → field map → embedding index
deep-research synthesize --bib-path MyLibrary.bib --output-dir ./output --budget 15

# Ask questions
deep-research query --bib-path MyLibrary.bib \
    "How does the energy resolution of InHf compare to previous MKIDs?"

# Deep research with LaTeX PDF output (needs Tectonic)
deep-research query --bib-path MyLibrary.bib --mode deep \
    --title "Noise Sources in MKIDs" \
    "What are the dominant noise sources across MKID architectures?"

ingest --full and synthesize are the two expensive stages. Everything downstream (query, eval) reads from disk.


CLI reference

deep-research ingest

deep-research ingest --bib-path <bib> --output-dir <dir>
                     [--pdf-dir <path>] [--pdf-map <path>]
                     [--zotero-dir <path>]
                     [--full] [--images-dir <path>] [--no-images]
                     [--ocr-full-page] [--force-reextract]
                     [--accelerator {auto,cpu,cuda,xpu}]
                     [--batch-size N]
  • Without --full: Stage 0 only — parse BibTeX, resolve PDFs, write manifest.db.
  • With --full: also runs extraction and chunking. Produces output/extracted/<citekey>.md, output/extracted/figures/, and output/db/chunks.db.
  • PDF location flags (--pdf-dir, --pdf-map, --zotero-dir) are documented in Pointing DeepResearch at your PDFs.
  • --accelerator cuda forces docling's vision models onto the GPU (CPU and CUDA only; MPS is unsupported). Default auto-selects and silently falls back to CPU — pass cuda explicitly on a GPU box to fail loudly instead.
  • --batch-size N stops after N new (non-cached) extractions complete. Re-run to continue — the extraction cache makes resumed runs free. Useful for corpus-scale ingests wrapped in a shell loop.
  • --force-reextract deletes extraction-cache entries for every citekey in the BibTeX file before extracting, bypassing the (citekey, pdf_sha256) cache. Intended for use after triage writes output/reextract.bib — pass that subset bib with --full --ocr-full-page --force-reextract to re-extract only the flagged papers.

Formula enrichment is on by default — docling runs its CodeFormulaV2 vision model on every detected equation and emits real LaTeX. This is expensive on CPU (~20 s / formula on Apple Silicon, ~0.5 s on a CUDA GPU). Results are cached by (citekey, pdf_sha256), so re-runs are free.

deep-research synthesize

deep-research synthesize --bib-path <bib> --output-dir <dir>
                         [--budget <USD>] [--zotero-dir <path>]
                         [--skip-index]
                         [--cost-summary-path <path>]

Chains Tier 2 end-to-end: digests (Sonnet) → cluster assignments (Haiku) → cluster syntheses (Opus) → field map (Opus) → LanceDB embedding index. Requires ingest --full to have run first.

Typical cost for a 7-paper corpus: ~$1–2, ~10 min wall. Scales roughly linearly with paper count for digests, with cluster count for syntheses. Pass --skip-index to skip embedding if you only need the synthesis artifacts.

deep-research triage

deep-research triage --bib-path <bib> --output-dir <dir>
                     [--threshold N] [--max-review-chars N]
                     [--apply] [--budget <USD>]

Runs Haiku over every <citekey>.md under <output-dir>/extracted/, scoring readability (0–10) and proposing boilerplate cleanup. Results are stored in <output-dir>/db/extraction_quality.db. Papers scored below --threshold (default 6) or explicitly flagged by Haiku as needing full-page OCR are written to:

  • <output-dir>/reextract.bib — BibTeX subset (feed back to ingest as --bib-path)
  • <output-dir>/reextract_queue.txt — citekeys, one per line

Without --apply this is a dry run: proposed line-range deletions are stored in the DB but the .md files are not touched. With --apply, the deletions are applied in place (original saved as <citekey>.original.md) and the affected papers are re-chunked in chunks.db with the existing pdf_sha256 preserved.

Typical repair workflow after triage:

# Step 1 — triage (dry run)
deep-research triage --bib-path MyLibrary.bib --output-dir ./output

# Step 2 — apply boilerplate cleanup in place
deep-research triage --bib-path MyLibrary.bib --output-dir ./output --apply

# Step 3 — re-extract severely damaged papers with full-page OCR
deep-research ingest --bib-path ./output/reextract.bib --output-dir ./output \
    --full --ocr-full-page --force-reextract

Triage results are cached by (citekey, markdown_sha256), so re-runs after re-extraction automatically skip already-assessed papers whose content has not changed.
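The (citekey, markdown_sha256) caching rule can be illustrated with a minimal SQLite sketch; the table and helper names here are hypothetical:

```python
import hashlib
import sqlite3

def markdown_sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_triage(conn, citekey, md_text):
    """True unless this exact (citekey, markdown content) pair was
    already assessed; any content change produces a new hash."""
    row = conn.execute(
        "SELECT 1 FROM triage_cache WHERE citekey = ? AND md_sha256 = ?",
        (citekey, markdown_sha256(md_text)),
    ).fetchone()
    return row is None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triage_cache (citekey TEXT, md_sha256 TEXT)")
conn.execute("INSERT INTO triage_cache VALUES (?, ?)",
             ("day2003", markdown_sha256("# Paper text")))
```

Because the key is the content hash rather than a timestamp, re-extraction that actually changes the markdown automatically invalidates the cached assessment.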

deep-research query

deep-research query --bib-path <bib> --output-dir <dir>
                    [--mode {quick,medium,deep}] [--top-k N]
                    [--title "..."] [--budget <USD>]
                    [--citation-boost | --no-citation-boost]
                    "your question"
Mode             Context                                           Model   Output
quick (default)  Retrieved chunks only                             Sonnet  Text answer with citations
medium           Chunks + per-paper digests                        Sonnet  Longer synthesized answer
deep             Chunks + digests + cluster syntheses + field map  Opus    LaTeX source + compiled PDF (needs Tectonic)

--citation-boost / --no-citation-boost: boost retrieval ranks by citation in-degree using <output-dir>/db/citation_edges.db. Default is auto-detect — enabled when the DB exists, silently disabled otherwise. Pass --no-citation-boost to disable even when the DB is present.
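One plausible shape for an in-degree boost, shown purely as an illustration (the project's actual ranking formula is not documented here, and the strength knob is an assumption):

```python
import math

def boost_by_indegree(ranked_citekeys, in_degree, strength=0.1):
    """Re-rank retrieval hits, nudging heavily cited papers upward.

    Base score comes from the original rank; the boost grows with
    log(1 + in-degree) so a few citations matter but thousands
    cannot dominate relevance entirely.
    """
    def score(pair):
        rank, citekey = pair
        base = 1.0 / (1 + rank)
        return base * (1.0 + strength * math.log1p(in_degree.get(citekey, 0)))
    pairs = list(enumerate(ranked_citekeys))
    pairs.sort(key=score, reverse=True)
    return [c for _, c in pairs]

hits = ["obscure2021", "day2003", "zobrist2022"]
citations = {"day2003": 900, "zobrist2022": 40, "obscure2021": 0}
reranked = boost_by_indegree(hits, citations, strength=0.5)
```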

The standalone deep-research deep … subcommand is retained as a back-compat alias for query --mode deep. Prefer the unified form.

Figures in deep reports

Deep-mode reports can embed extracted figures and tables inline, with caption attribution via \citet{...}:

  • Figure metadata (captions + PNG paths) is persisted to <output-dir>/db/figures.db automatically during ingest --full --images-dir <path>. Without --images-dir, figures are parsed but not rasterised — the hallucination filter will later drop any rows with missing image paths.
  • query --mode deep auto-detects figures.db: when present, Opus is shown an AVAILABLE FIGURES section and may include up to 6 figures inline. No CLI flag controls this — presence of figures.db is the switch.
  • Missing figures.db, missing PNG on disk, or hallucinated \includegraphics paths all degrade silently to prose-only: the hallucination filter strips any \begin{figure}...\end{figure} block whose path was not in the allowlist.
  • Backfilling an existing corpus that was ingested before figures.db existed: re-run ingest --full --force-reextract --images-dir <path> on the same bib. The docling extraction cache is bypassed for those citekeys and figures.db is populated as a side effect.
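A minimal sketch of such an allowlist filter; the regexes and names are illustrative, not the project's code:

```python
import re

FIGURE_BLOCK = re.compile(r"\\begin\{figure\}.*?\\end\{figure\}", re.DOTALL)
GRAPHICS_PATH = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]*)\}")

def strip_unknown_figures(latex_src, allowed_paths):
    r"""Remove any figure environment whose \includegraphics path is
    not allowlisted (including path-less blocks), degrading that spot
    of the report to prose-only."""
    def check(match):
        block = match.group(0)
        paths = GRAPHICS_PATH.findall(block)
        return block if paths and all(p in allowed_paths for p in paths) else ""
    return FIGURE_BLOCK.sub(check, latex_src)

allowed = {"figures/day2003_fig1.png"}
src = (r"\begin{figure}\includegraphics{figures/day2003_fig1.png}\end{figure}"
       r" prose "
       r"\begin{figure}\includegraphics{figures/made_up.png}\end{figure}")
clean = strip_unknown_figures(src, allowed)
```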

deep-research eval

deep-research eval --bib-path <bib> --output-dir <dir> [--report-path <path>]
                   [--citation-boost | --no-citation-boost]

Runs the evaluation questions declared in domain_profile.yaml through the quick pipeline, then has Opus score each answer (LLM-as-judge). Writes a JSON report summarizing accuracy, citation faithfulness, and per-question scores. --citation-boost / --no-citation-boost controls retrieval-rank boosting by citation in-degree (same semantics as query; default: auto-detect).

deep-research smoke-test

deep-research smoke-test --bib-path <bib>

Runs every stage on a tiny fixture corpus without making a single API call — catches import errors, schema drift, and pipeline wiring bugs before you spend money on a real run. Use after pulling new code.


Configuration

Domain profile

domain_profile.yaml is the single file you edit to redirect the pipeline at a different corpus. It holds:

domain_name: "Microwave Kinetic Inductance Detectors (MKIDs)"

key_topics: [...]                          # names of subtopics for clustering

classification_prompt_fragments:           # domain text injected into LLM prompts
  relevance_check: ...
  subtopic_assignment: ...
  digest_focus: ...

synthesis_guidance:                        # what "a good synthesis" means here
  cluster_theme: ...
  field_map: ...

evaluation_questions:                      # ground-truth Q&A set
  - question: "..."
    expected_citations: [...]

No code changes are required when switching domains.

Budget caps

Each stage's budget is set via the --budget flag (USD). CostAccumulator enforces it; a stage that hits its cap raises BudgetExceeded, logs partial progress, and exits. Re-run to resume from where you stopped.
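A minimal sketch of the cap-and-raise behavior; the real CostAccumulator in infrastructure/cost.py carries more state, and these internals are assumptions:

```python
class BudgetExceeded(Exception):
    pass

class CostAccumulator:
    """Track cumulative USD spend against a hard per-stage cap."""
    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        # Refuse the call that would cross the cap, before it is made.
        if self.spent_usd + cost_usd > self.budget_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f}; next call ${cost_usd:.2f} "
                f"would exceed the ${self.budget_usd:.2f} cap")
        self.spent_usd += cost_usd

acc = CostAccumulator(budget_usd=1.00)
acc.charge(0.40)
acc.charge(0.40)
hit_cap = False
try:
    acc.charge(0.40)   # would push spend to $1.20; the stage exits cleanly
except BudgetExceeded:
    hit_cap = True
```

Because completed work is persisted per (citekey, pdf_sha256), re-running after a cap hit resumes from the unfinished items rather than repeating spend.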

Auth

Priority order for LLM calls:

  1. claude_agent_sdk (Claude Max OAuth) — used when the claude CLI is on PATH or the bundled CLI ships with the package. No API key required.
  2. ANTHROPIC_API_KEY — direct Anthropic API fallback.

The adapter lives in deep_research/infrastructure/llm_client.py. Input tokens are read as input_tokens + cache_read_input_tokens + cache_creation_input_tokens (the agent SDK routes most input through prompt caching, so reading only input_tokens under-reports by orders of magnitude).
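The token-accounting rule can be shown directly; the field names follow the usage fields named above, and the helper itself is hypothetical:

```python
def effective_input_tokens(usage: dict) -> int:
    """Sum all three input-token buckets the agent SDK reports; with
    prompt caching, most input lands in the two cache fields."""
    return (usage.get("input_tokens", 0)
            + usage.get("cache_read_input_tokens", 0)
            + usage.get("cache_creation_input_tokens", 0))

usage = {"input_tokens": 31,
         "cache_read_input_tokens": 18_000,
         "cache_creation_input_tokens": 2_400}
total = effective_input_tokens(usage)  # reading only input_tokens would report 31
```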


Project layout

deep_research/
├── infrastructure/       # schemas (Pydantic), cost tracker, cache, LLM client, logging
├── ingest/               # BibTeX parsing (bibtexparser), Zotero PDF resolution, manifest SQLite
├── extraction/           # docling PDF → markdown (formula enrichment), figures, tables, citation graph
├── chunking/             # markdown → retrieval chunks with overlap
├── knowledge_synthesis/  # L1 digests, L2 cluster assignment + syntheses, L3 field map
├── retrieval_index/      # LanceDB vector store, BM25, hybrid fusion, sentence-transformers embeddings
├── query_engine/         # quick / medium / deep mode handlers + router
├── latex_output/         # Jinja templates, .bib subset generation, Tectonic compiler
├── evaluation/           # question sets, LLM-as-judge, aggregate reporting
├── auto_ingest/          # monthly discovery, relevance scoring, PDF fetch, bib update
└── scripts/              # CLI (argparse), smoke test

tests/                    # 982 tests; run via pytest
domain_profile.yaml       # the only file to edit when switching domains
.claude/spec/             # SPEC.md, CHANGELOG.md, DATA_CONTRACTS.yaml, FINDINGS.md

Testing

pytest tests/                                # full suite (~1.5s, no external calls)
pytest tests/test_integration.py -v          # integration tests only
pytest tests/ --cov=deep_research --cov-report=term-missing

The test suite stubs all external services — no network, no API, no disk state in other repos.


Performance notes

From the reference 7-paper corpus (Apple Silicon, CPU for docling):

Stage                                                Wall time  Cost
ingest --full (with formula enrichment)              ~45 min    $0
synthesize (Tier 2, Opus for syntheses + field map)  ~10 min    ~$1.20
query --mode quick                                   ~10 s      ~$0.005
query --mode deep (full report + PDF)                ~60 s      ~$0.10

Formula enrichment dominates ingest --full time. A CUDA box (e.g., RTX 4080) would cut per-paper extraction from minutes to ~10–30 s. Re-runs on the same PDFs are free due to the extraction cache.


Known issues / follow-ups

See .claude/spec/FINDINGS.md for the full list. Highlights:

  • sentence-transformers emits HF_TOKEN unauthenticated-request warnings on first load. Set HF_TOKEN or ignore.
  • Cost tracker intentionally over-estimates: it treats cache_read_input_tokens at full-rate pricing rather than the ~10% cached rate, which slightly inflates the budget number. Actual billing comes from Anthropic.
  • docling does not support MPS for its vision-language models; CPU or CUDA only on Apple hardware.

License

BSD 3-Clause. See LICENSE.

About

A literature assimilation and knowledge engine.
