MazinLab/deep_research
DeepResearch

Convert a curated BibTeX bibliography and local Zotero PDF library into a hierarchically-condensed, queryable scientific knowledge base.

Ships with an MKID (Microwave Kinetic Inductance Detector) domain profile, but is designed to be redirected to any scientific corpus by editing domain_profile.yaml alone.


What the pipeline produces

Given MyLibrary.bib plus PDFs resolved from Zotero, DeepResearch builds:

Artifact                         Path                               Key
Manifest                         output/manifest.db                 citekey
Per-paper markdown               output/extracted/<citekey>.md      citekey
Figure / table PNGs              output/extracted/figures/          citekey, index
Figure / table store (SQLite)    output/db/figures.db               (citekey, kind, index)
Chunk store (SQLite + JSONL)     output/db/chunks.db                (citekey, chunk_index)
Extraction cache                 output/db/extraction_cache.db      (citekey, pdf_sha256)
Extraction quality reports       output/db/extraction_quality.db    citekey
Re-extraction queue              output/reextract_queue.txt
Re-extraction BibTeX subset      output/reextract.bib               citekey
L1 per-paper digests (Sonnet)    output/db/digests.db               citekey
L2a cluster assignments (Haiku)  output/db/cluster_assignments.db   citekey
L2b cluster syntheses (Opus)     output/db/syntheses.db             cluster_id
L3 field map (Opus)              output/field_map.md
Vector index                     output/db/lance/ (LanceDB)         (citekey, chunk_index)
BM25 index                       built in-process from chunks.db

All artifacts are keyed by BibTeX citekey. Citekey is the invariant — nothing in the pipeline identifies a paper by title.


Architecture

BibTeX + PDFs
     │
     ▼
 Stage 0  Ingest ─────────────► manifest.db (citekey, pdf_sha256)
     │
     ▼
 Stage 1  Extract ────────────► markdown + figures + tables + citation graph
     │        (docling — layout, tables, OCR, formula → LaTeX)
     ▼
 Stage 2  Chunk ──────────────► chunks.db (retrieval-sized segments)
     │
     ▼
 Stage 2b Triage (Haiku) ───► extraction_quality.db   ┐  optional;
     │        (score readability, propose cleanup,      │  re-run ingest
     │         flag severe papers for re-extraction)    │  --force-reextract
     │                         reextract.bib ──────────┘  to repair
     ▼
 Stage 3  Embed + Index ──────► LanceDB (vectors) + BM25 (keywords)
     │        (hybrid retriever = weighted RRF fusion)
     ▼
 Stage 4  Digests (Sonnet) ───► digests.db          ┐
                                                    │
 Stage 5  Cluster (Haiku) ────► cluster_assignments │  Tier 2
                                                    │  (hierarchical
 Stage 6  Syntheses (Opus) ──► syntheses.db         │   synthesis)
                                                    │
 Stage 7  Field map (Opus) ──► field_map.md         ┘
     │
     ▼
 Stage 8  Query ──────────────► quick (Sonnet) | medium (Sonnet) |
                                 deep (Opus + LaTeX PDF via Tectonic)
     │
     ▼
 Stage 9  Evaluate ───────────► LLM-as-judge report (Opus)
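The weighted RRF fusion used by the Stage 3 hybrid retriever can be sketched as follows; the weights and the k constant here are illustrative, not the project's actual settings:

```python
def weighted_rrf(rankings, weights, k=60):
    """Fuse several ranked hit lists into one.

    rankings: list of id lists, best-first (e.g. vector hits, BM25 hits).
    weights:  one weight per ranking.
    Score(id) = sum_i w_i / (k + rank_i(id)); ids absent from a ranking
    simply contribute nothing for that ranking.
    """
    scores = {}
    for ranking, w in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Vector search favours A, BM25 favours B; the fused order depends
# on the per-retriever weights.
vector_hits = ["A", "B", "C"]
bm25_hits = ["B", "D", "A"]
fused = weighted_rrf([vector_hits, bm25_hits], weights=[0.6, 0.4])
```

RRF only needs ranks, never raw scores, which is why it fuses cosine similarities and BM25 scores without any calibration step.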

Design principles

  • Citekey invariant. Every chunk, digest, figure description, table, synthesis, and cited claim traces back to a BibTeX citekey. LLMs never invent citations — we filter model output against the manifest.
  • Incremental execution. Every artifact is keyed by (citekey, pdf_sha256). Re-runs skip unchanged work. Replace a PDF, and only that paper's downstream artifacts are recomputed.
  • Budget enforcement. Every Anthropic API call flows through CostAccumulator (infrastructure/cost.py), which tracks cumulative spend and raises BudgetExceeded when a stage's configured cap is hit. Stages catch the exception, log partial progress, and exit cleanly so the next run resumes.
  • Model tiering. Haiku for classification and repair, Sonnet for structured extraction and drafting, Opus for synthesis and judgment. Substituting a cheaper model in an Opus slot without instruction is a bug.
  • Domain-neutral core. All domain knowledge lives in domain_profile.yaml — classification prompts, subtopic labels, evaluation questions, synthesis guidance. Code is domain-agnostic.
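The manifest filtering behind the citekey invariant could look roughly like this minimal sketch; the helper name and the \cite-style citation syntax are assumptions for illustration:

```python
import re

def filter_citations(answer_text, manifest_citekeys):
    r"""Drop any \cite-style reference whose citekey is not in the
    manifest, so model output can never smuggle in an invented paper."""
    def keep_known(match):
        keys = [k.strip() for k in match.group(1).split(",")]
        kept = [k for k in keys if k in manifest_citekeys]
        return r"\cite{%s}" % ",".join(kept) if kept else ""
    return re.sub(r"\\cite\{([^}]*)\}", keep_known, answer_text)

manifest = {"day2003", "zobrist2022"}
text = r"MKIDs were introduced by \cite{day2003} and refined \cite{fake2099}."
clean = filter_citations(text, manifest)
```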

Installation

Prerequisites

  • Python 3.12+ (the project is developed and tested on 3.13). The py313 conda environment is the recommended target.
  • Zotero with a local PDF library (optional — the ingest stage resolves file: entries in BibTeX against Zotero's storage directory; see Pointing DeepResearch at your PDFs for the non-Zotero options).
  • Claude Max subscription or Anthropic API key. DeepResearch ships with a claude_agent_sdk adapter that routes model calls through the user's Claude Code OAuth session (Claude Max); fall back to ANTHROPIC_API_KEY for direct API use.
  • Tectonic (optional, required only for --mode deep LaTeX PDF output): brew install tectonic on macOS, or cargo install tectonic.

Install

git clone <repo-url> && cd DeepResearch
conda activate py313                    # or: python3.13 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

This registers the deep-research console script in the active environment's bin/.


Auto-ingest (optional)

deep-research auto-ingest discovers newly published papers relevant to your corpus, fetches their PDFs via Unpaywall and your institutional EZProxy session, and feeds them into ingest --full + synthesize automatically. Designed to be invoked by cron once per month.

Setup

  1. API keys — export required vars before running (or source from a file):

    export DEEP_RESEARCH_ADS_TOKEN="your-ads-api-token"
    export DEEP_RESEARCH_UNPAYWALL_EMAIL="you@example.edu"
    export DEEP_RESEARCH_S2_TOKEN=""        # optional; empty = unauthenticated low-volume
  2. EZProxy cookies (optional) — log into your library's EZProxy in a browser, export session cookies to JSON (see examples/ezproxy_cookies.json.example), then:

    export DEEP_RESEARCH_EZPROXY_COOKIES="/path/to/ezproxy_cookies.json"

    Cookies expire after a few weeks; refresh them when the EZProxy tier stops returning PDFs.

  3. Domain bibstems (optional) — add an auto_ingest section to domain_profile.yaml to restrict ADS queries to specific journals:

    auto_ingest:
      ads_bibstems: ["ApJ", "MNRAS", "A&A", "PhRvB"]
      ezproxy_base_url: "https://ezproxy.library.ucsb.edu"

Manual dry run

deep-research auto-ingest \
    --bib-path corpus.bib --output-dir ./output \
    --dry-run --corpus-cap 5

Dry-run writes output/auto_ingest/<YYYY-MM>/delta.bib and candidates.json without touching the corpus or fetching PDFs. Check the scored candidates before enabling the live run.

Cron setup

See examples/auto_ingest.env.example for the env file template and examples/crontab.example for the exact crontab block to paste into crontab -e.

For the full design rationale, fetch chain details, and exit code reference, see autoupdate.md.


Pointing DeepResearch at your PDFs

DeepResearch does not require Zotero. It resolves each citekey to a PDF by trying the following in order, short-circuiting on the first hit:

  1. BibTeX file field — if the file = {...} field on the BibTeX entry points at an existing PDF, that wins. Exports from many reference managers populate this field automatically.
  2. --pdf-map <path> — a YAML file mapping citekey → PDF path. See examples/pdf_map.yaml:
    day2003: ~/papers/day-broadband-2003.pdf
    zobrist2022: ~/papers/zobrist-membraneless-2022.pdf
    Paths that don't exist are logged and skipped, so a single pdf_map.yaml can be committed and reused across machines.
  3. --pdf-dir <dir>, citekey-named file — <pdf-dir>/<citekey>.pdf. Lowest-friction option if you're willing to rename files. No extra config.
  4. --pdf-dir <dir>, fuzzy title match — if nothing else hit, the resolver fuzzy-matches the paper's BibTeX title against PDF filenames in --pdf-dir. Threshold is conservative (Jaccard ≥ 0.6); misses are logged. Zotero's ~/Zotero/storage tree works here out of the box.

--zotero-dir is retained as a silent alias for --pdf-dir for backward compatibility. Papers that no strategy resolves are kept as metadata-only records — the pipeline continues.
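The four-strategy resolution order can be sketched as follows; the function names and the token-set Jaccard measure are illustrative assumptions, not the project's actual implementation:

```python
from pathlib import Path

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two title-like strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def resolve_pdf(citekey, title, file_field=None, pdf_map=None, pdf_dir=None):
    """Try the four strategies in order, short-circuiting on the first hit."""
    # 1. BibTeX file = {...} field
    if file_field and Path(file_field).expanduser().exists():
        return Path(file_field).expanduser()
    # 2. --pdf-map YAML entry (citekey -> path)
    if pdf_map and citekey in pdf_map:
        p = Path(pdf_map[citekey]).expanduser()
        if p.exists():
            return p
    if pdf_dir:
        # 3. citekey-named file
        named = Path(pdf_dir) / f"{citekey}.pdf"
        if named.exists():
            return named
        # 4. conservative fuzzy title match against filenames
        best = max(Path(pdf_dir).rglob("*.pdf"),
                   key=lambda p: jaccard(title, p.stem.replace("-", " ")),
                   default=None)
        if best and jaccard(title, best.stem.replace("-", " ")) >= 0.6:
            return best
    return None  # metadata-only record; the pipeline continues
```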

Quick start

From the project root:

# Tier 1 — parse bibtex, resolve PDFs, extract, chunk, build image assets
deep-research ingest --bib-path MyLibrary.bib --output-dir ./output --full

# Tier 2 — digests → clusters → syntheses → field map → embedding index
deep-research synthesize --bib-path MyLibrary.bib --output-dir ./output --budget 15

# Ask questions
deep-research query --bib-path MyLibrary.bib \
    "How does the energy resolution of InHf compare to previous MKIDs?"

# Deep research with LaTeX PDF output (needs Tectonic)
deep-research query --bib-path MyLibrary.bib --mode deep \
    --title "Noise Sources in MKIDs" \
    "What are the dominant noise sources across MKID architectures?"

ingest --full and synthesize are the two expensive stages. Everything downstream (query, eval) reads from disk.


CLI reference

deep-research ingest

deep-research ingest --bib-path <bib> --output-dir <dir>
                     [--pdf-dir <path>] [--pdf-map <path>]
                     [--zotero-dir <path>]
                     [--full] [--images-dir <path>] [--no-images]
                     [--ocr-full-page] [--force-reextract]
                     [--accelerator {auto,cpu,cuda,xpu}]
                     [--batch-size N]
  • Without --full: Stage 0 only — parse BibTeX, resolve PDFs, write manifest.db.
  • With --full: also runs extraction and chunking. Produces output/extracted/<citekey>.md, output/extracted/figures/, and output/db/chunks.db.
  • PDF location flags (--pdf-dir, --pdf-map, --zotero-dir) are documented in Pointing DeepResearch at your PDFs.
  • --accelerator cuda forces docling's vision models onto the GPU (CPU and CUDA only; MPS is unsupported). Default auto-selects and silently falls back to CPU — pass cuda explicitly on a GPU box to fail loudly instead.
  • --batch-size N stops after N new (non-cached) extractions complete. Re-run to continue — the extraction cache makes resumed runs free. Useful for corpus-scale ingests wrapped in a shell loop.
  • --force-reextract deletes extraction-cache entries for every citekey in the BibTeX file before extracting, bypassing the (citekey, pdf_sha256) cache. Intended for use after triage writes output/reextract.bib — pass that subset bib with --full --ocr-full-page --force-reextract to re-extract only the flagged papers.

Formula enrichment is on by default — docling runs its CodeFormulaV2 vision model on every detected equation and emits real LaTeX. This is expensive on CPU (~20 s / formula on Apple Silicon, ~0.5 s on a CUDA GPU). Results are cached by (citekey, pdf_sha256), so re-runs are free.

deep-research synthesize

deep-research synthesize --bib-path <bib> --output-dir <dir>
                         [--budget <USD>] [--zotero-dir <path>]
                         [--skip-index]
                         [--cost-summary-path <path>]

Chains Tier 2 end-to-end: digests (Sonnet) → cluster assignments (Haiku) → cluster syntheses (Opus) → field map (Opus) → LanceDB embedding index. Requires ingest --full to have run first.

Typical cost for a 7-paper corpus: ~$1–2, ~10 min wall. Scales roughly linearly with paper count for digests, with cluster count for syntheses. Pass --skip-index to skip embedding if you only need the synthesis artifacts.

deep-research triage

deep-research triage --bib-path <bib> --output-dir <dir>
                     [--threshold N] [--max-review-chars N]
                     [--apply] [--budget <USD>]

Runs Haiku over every <citekey>.md under <output-dir>/extracted/, scoring readability (0–10) and proposing boilerplate cleanup. Results are stored in <output-dir>/db/extraction_quality.db. Papers scored below --threshold (default 6) or explicitly flagged by Haiku as needing full-page OCR are written to:

  • <output-dir>/reextract.bib — BibTeX subset (feed back to ingest as --bib-path)
  • <output-dir>/reextract_queue.txt — citekeys, one per line

Without --apply this is a dry run: proposed line-range deletions are stored in the DB but the .md files are not touched. With --apply, the deletions are applied in place (original saved as <citekey>.original.md) and the affected papers are re-chunked in chunks.db with the existing pdf_sha256 preserved.

Typical repair workflow after triage:

# Step 1 — triage (dry run)
deep-research triage --bib-path MyLibrary.bib --output-dir ./output

# Step 2 — apply boilerplate cleanup in place
deep-research triage --bib-path MyLibrary.bib --output-dir ./output --apply

# Step 3 — re-extract severely damaged papers with full-page OCR
deep-research ingest --bib-path ./output/reextract.bib --output-dir ./output \
    --full --ocr-full-page --force-reextract

Triage results are cached by (citekey, markdown_sha256), so re-runs after re-extraction automatically skip already-assessed papers whose content has not changed.
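The (citekey, markdown_sha256) caching rule can be illustrated with a minimal SQLite sketch; the table and helper names here are hypothetical:

```python
import hashlib
import sqlite3

def markdown_sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_triage(conn, citekey, md_text):
    """True unless this exact (citekey, markdown content) pair was
    already assessed; any content change produces a new hash."""
    row = conn.execute(
        "SELECT 1 FROM triage_cache WHERE citekey = ? AND md_sha256 = ?",
        (citekey, markdown_sha256(md_text)),
    ).fetchone()
    return row is None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triage_cache (citekey TEXT, md_sha256 TEXT)")
conn.execute("INSERT INTO triage_cache VALUES (?, ?)",
             ("day2003", markdown_sha256("# Paper text")))
```

Because the key is the content hash rather than a timestamp, re-extraction that actually changes the markdown automatically invalidates the cached assessment.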

deep-research query

deep-research query --bib-path <bib> --output-dir <dir>
                    [--mode {quick,medium,deep}] [--top-k N]
                    [--title "..."] [--budget <USD>]
                    [--citation-boost | --no-citation-boost]
                    "your question"
Mode             Context                                           Model   Output
quick (default)  Retrieved chunks only                             Sonnet  Text answer with citations
medium           Chunks + per-paper digests                        Sonnet  Longer synthesized answer
deep             Chunks + digests + cluster syntheses + field map  Opus    LaTeX source + compiled PDF (needs Tectonic)

--citation-boost / --no-citation-boost: boost retrieval ranks by citation in-degree using <output-dir>/db/citation_edges.db. Default is auto-detect — enabled when the DB exists, silently disabled otherwise. Pass --no-citation-boost to disable even when the DB is present.
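One plausible shape for an in-degree boost, shown purely as an illustration (the project's actual ranking formula is not documented here, and the strength knob is an assumption):

```python
import math

def boost_by_indegree(ranked_citekeys, in_degree, strength=0.1):
    """Re-rank retrieval hits, nudging heavily cited papers upward.

    Base score comes from the original rank; the boost grows with
    log(1 + in-degree) so a few citations matter but thousands
    cannot dominate relevance entirely.
    """
    def score(pair):
        rank, citekey = pair
        base = 1.0 / (1 + rank)
        return base * (1.0 + strength * math.log1p(in_degree.get(citekey, 0)))
    pairs = list(enumerate(ranked_citekeys))
    pairs.sort(key=score, reverse=True)
    return [c for _, c in pairs]

hits = ["obscure2021", "day2003", "zobrist2022"]
citations = {"day2003": 900, "zobrist2022": 40, "obscure2021": 0}
reranked = boost_by_indegree(hits, citations, strength=0.5)
```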

The standalone deep-research deep … subcommand is retained as a back-compat alias for query --mode deep. Prefer the unified form.

Figures in deep reports

Deep-mode reports can embed extracted figures and tables inline, with caption attribution via \citet{...}:

  • Figure metadata (captions + PNG paths) is persisted to <output-dir>/db/figures.db automatically during ingest --full --images-dir <path>. Without --images-dir, figures are parsed but not rasterised — the hallucination filter will later drop any rows with missing image paths.
  • query --mode deep auto-detects figures.db: when present, Opus is shown an AVAILABLE FIGURES section and may include up to 6 figures inline. No CLI flag controls this — presence of figures.db is the switch.
  • Missing figures.db, missing PNG on disk, or hallucinated \includegraphics paths all degrade silently to prose-only: the hallucination filter strips any \begin{figure}...\end{figure} block whose path was not in the allowlist.
  • Backfilling an existing corpus that was ingested before figures.db existed: re-run ingest --full --force-reextract --images-dir <path> on the same bib. The docling extraction cache is bypassed for those citekeys and figures.db is populated as a side effect.
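A minimal sketch of such an allowlist filter; the regexes and names are illustrative, not the project's code:

```python
import re

FIGURE_BLOCK = re.compile(r"\\begin\{figure\}.*?\\end\{figure\}", re.DOTALL)
GRAPHICS_PATH = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]*)\}")

def strip_unknown_figures(latex_src, allowed_paths):
    r"""Remove any figure environment whose \includegraphics path is
    not allowlisted (including path-less blocks), degrading that spot
    of the report to prose-only."""
    def check(match):
        block = match.group(0)
        paths = GRAPHICS_PATH.findall(block)
        return block if paths and all(p in allowed_paths for p in paths) else ""
    return FIGURE_BLOCK.sub(check, latex_src)

allowed = {"figures/day2003_fig1.png"}
src = (r"\begin{figure}\includegraphics{figures/day2003_fig1.png}\end{figure}"
       r" prose "
       r"\begin{figure}\includegraphics{figures/made_up.png}\end{figure}")
clean = strip_unknown_figures(src, allowed)
```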

deep-research eval

deep-research eval --bib-path <bib> --output-dir <dir> [--report-path <path>]
                   [--citation-boost | --no-citation-boost]

Runs the evaluation questions declared in domain_profile.yaml through the quick pipeline, then has Opus score each answer (LLM-as-judge). Writes a JSON report summarizing accuracy, citation faithfulness, and per-question scores. --citation-boost / --no-citation-boost controls retrieval-rank boosting by citation in-degree (same semantics as query; default: auto-detect).

deep-research smoke-test

deep-research smoke-test --bib-path <bib>

Runs every stage on a tiny fixture corpus without making a single API call — catches import errors, schema drift, and pipeline wiring bugs before you spend money on a real run. Use after pulling new code.


Configuration

Domain profile

domain_profile.yaml is the single file you edit to redirect the pipeline at a different corpus. It holds:

domain_name: "Microwave Kinetic Inductance Detectors (MKIDs)"

key_topics: [...]                          # names of subtopics for clustering

classification_prompt_fragments:           # domain text injected into LLM prompts
  relevance_check: ...
  subtopic_assignment: ...
  digest_focus: ...

synthesis_guidance:                        # what "a good synthesis" means here
  cluster_theme: ...
  field_map: ...

evaluation_questions:                      # ground-truth Q&A set
  - question: "..."
    expected_citations: [...]

No code changes are required when switching domains.

Budget caps

Each stage's budget is set via the --budget flag (USD). CostAccumulator enforces it; a stage that hits its cap raises BudgetExceeded, logs partial progress, and exits. Re-run to resume from where you stopped.
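A minimal sketch of the cap-and-raise behavior; the real CostAccumulator in infrastructure/cost.py carries more state, and these internals are assumptions:

```python
class BudgetExceeded(Exception):
    pass

class CostAccumulator:
    """Track cumulative USD spend against a hard per-stage cap."""
    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        # Refuse the call that would cross the cap, before it is made.
        if self.spent_usd + cost_usd > self.budget_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f}; next call ${cost_usd:.2f} "
                f"would exceed the ${self.budget_usd:.2f} cap")
        self.spent_usd += cost_usd

acc = CostAccumulator(budget_usd=1.00)
acc.charge(0.40)
acc.charge(0.40)
hit_cap = False
try:
    acc.charge(0.40)   # would push spend to $1.20; the stage exits cleanly
except BudgetExceeded:
    hit_cap = True
```

Because completed work is persisted per (citekey, pdf_sha256), re-running after a cap hit resumes from the unfinished items rather than repeating spend.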

Auth

Priority order for LLM calls:

  1. claude_agent_sdk (Claude Max OAuth) — used when the claude CLI is on PATH or the bundled CLI ships with the package. No API key required.
  2. ANTHROPIC_API_KEY — direct Anthropic API fallback.

The adapter lives in deep_research/infrastructure/llm_client.py. Input tokens are read as input_tokens + cache_read_input_tokens + cache_creation_input_tokens (the agent SDK routes most input through prompt caching, so reading only input_tokens under-reports by orders of magnitude).
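The token-accounting rule can be shown directly; the field names follow the usage fields named above, and the helper itself is hypothetical:

```python
def effective_input_tokens(usage: dict) -> int:
    """Sum all three input-token buckets the agent SDK reports; with
    prompt caching, most input lands in the two cache fields."""
    return (usage.get("input_tokens", 0)
            + usage.get("cache_read_input_tokens", 0)
            + usage.get("cache_creation_input_tokens", 0))

usage = {"input_tokens": 31,
         "cache_read_input_tokens": 18_000,
         "cache_creation_input_tokens": 2_400}
total = effective_input_tokens(usage)  # reading only input_tokens would report 31
```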


Project layout

deep_research/
├── infrastructure/       # schemas (Pydantic), cost tracker, cache, LLM client, logging
├── ingest/               # BibTeX parsing (bibtexparser), Zotero PDF resolution, manifest SQLite
├── extraction/           # docling PDF → markdown (formula enrichment), figures, tables, citation graph
├── chunking/             # markdown → retrieval chunks with overlap
├── knowledge_synthesis/  # L1 digests, L2 cluster assignment + syntheses, L3 field map
├── retrieval_index/      # LanceDB vector store, BM25, hybrid fusion, sentence-transformers embeddings
├── query_engine/         # quick / medium / deep mode handlers + router
├── latex_output/         # Jinja templates, .bib subset generation, Tectonic compiler
├── evaluation/           # question sets, LLM-as-judge, aggregate reporting
├── auto_ingest/          # monthly discovery, relevance scoring, PDF fetch, bib update
└── scripts/              # CLI (argparse), smoke test

tests/                    # 982 tests; run via pytest
domain_profile.yaml       # the only file to edit when switching domains
.claude/spec/             # SPEC.md, CHANGELOG.md, DATA_CONTRACTS.yaml, FINDINGS.md

Testing

pytest tests/                                # full suite (~1.5s, no external calls)
pytest tests/test_integration.py -v          # integration tests only
pytest tests/ --cov=deep_research --cov-report=term-missing

The test suite stubs all external services — no network, no API, no disk state in other repos.


Performance notes

From the reference 7-paper corpus (Apple Silicon, CPU for docling):

Stage                                                Wall time  Cost
ingest --full (with formula enrichment)              ~45 min    $0
synthesize (Tier 2, Opus for syntheses + field map)  ~10 min    ~$1.20
query --mode quick                                   ~10 s      ~$0.005
query --mode deep (full report + PDF)                ~60 s      ~$0.10

Formula enrichment dominates ingest --full time. A CUDA box (e.g., RTX 4080) would cut per-paper extraction from minutes to ~10–30 s. Re-runs on the same PDFs are free due to the extraction cache.


Known issues / follow-ups

See .claude/spec/FINDINGS.md for the full list. Highlights:

  • sentence-transformers emits HF_TOKEN unauthenticated-request warnings on first load. Set HF_TOKEN or ignore.
  • Cost tracker intentionally over-estimates: it treats cache_read_input_tokens at full-rate pricing rather than the ~10% cached rate, which slightly inflates the budget number. Actual billing comes from Anthropic.
  • docling does not support MPS for its vision-language models; CPU or CUDA only on Apple hardware.

License

BSD 3-Clause. See LICENSE.

About

A literature assimilation and knowledge engine.
