Pre-deployment domain difficulty diagnostic for RAG. Know if your benchmark transfers — before you deploy.
RAG pipelines get benchmarked on one domain and deployed on another. A system that scores 95% recall on financial documents scores 28% on legal text — same architecture, same embedding model, same settings. The failure is invisible until production users report wrong answers.
There is no standard way to predict, before deploying, whether your benchmark results will hold on your actual domain.
Measure the domain, not the pipeline. Vocabulary specificity — how uniquely query terms identify target passages — predicts retrieval difficulty in seconds, without running a single embedding. A corpus where query terms appear in 5 passages is trivially retrievable. A corpus where query terms appear in 500 passages will defeat any embedding model.
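The intuition fits in a few lines of stdlib Python. This is a toy illustration, not ragprobe's implementation: count how many passages contain each query term (its document frequency), and treat a term as specific when it appears in only a handful of them.

```python
# Toy corpus: three passages sharing generic vocabulary, one with unique terms.
passages = [
    "the controller shall process personal data lawfully",
    "the processor acts on behalf of the controller",
    "data subjects may request erasure of personal data",
    "fines under article 83 may reach 20 million euros",
]
query = "what fines apply under article 83"

# Crude tokenization with a tiny ad-hoc stopword set (illustrative only).
terms = set(query.lower().split()) - {"what", "under", "the", "may"}

# Document frequency: how many passages contain each query term.
df = {t: sum(t in p.split() for p in passages) for t in terms}
print(df)  # "fines", "article", "83" each hit exactly one passage: highly specific
```

A query whose terms each match one passage is trivially retrievable; a query built from "data" and "processing" matches nearly everything.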
ragprobe quantifies this gap before you deploy, not after.
```shell
pip install ragprobe
```

```shell
ragprobe score --corpus ./docs --queries queries.json
```

```
Domain Difficulty Report
========================
Overall specificity: 0.177 (HARD)
Reference match: closest to GDPR regulatory text
Expect NeedleCoverage@5 in the 15-35% range.

WARNING: If your benchmarks used HotpotQA (0.95)
or FinanceBench (0.98), results will NOT transfer.

Per-query breakdown:
  EASY (3 queries)   specificity > 0.7
  HARD (17 queries)  specificity < 0.3

Top ambiguous terms (appear in 100+ passages):
  "data" (838), "processing" (412), "controller" (389),
  "subject" (301), "personal" (287)
```
No embeddings. No vector store. No API keys. Runs in seconds.
```python
from ragprobe import DomainProbe

probe = DomainProbe(
    corpus=["path/to/docs/"],
    queries=["What are the controller's obligations?", "What fines apply?"],
)

report = probe.score()
print(report.specificity)            # 0.177
print(report.difficulty)             # "hard"
print(report.closest_reference)      # "GDPR regulatory text"
print(report.expected_recall_range)  # (0.15, 0.35), reference value
print(report.hardest_queries[:3])    # per-query breakdown
print(report.ambiguous_terms[:5])    # terms causing collisions
```

| Specificity | Difficulty | Expected Recall@5 | What to expect |
|---|---|---|---|
| > 0.8 | Easy | 85–99% | Your benchmark probably transfers. Any decent embedder will work. |
| 0.3 – 0.8 | Medium | 35–80% | Benchmark may partially transfer. Build 10-20 domain-specific test queries before deploying. |
| < 0.3 | Hard | 5–35% | Your benchmark is lying to you. Build domain-specific needle annotations. Don't deploy without domain-specific evaluation. |
Expected Recall@5 ranges are reference values from measured benchmarks at similar specificity levels, not predictions for your specific corpus.
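The tier thresholds above reduce to a simple mapping. A minimal sketch (illustrative names, not ragprobe's internal code):

```python
def difficulty_tier(specificity: float) -> str:
    """Map an overall specificity score to the difficulty tiers in the table above."""
    if specificity > 0.8:
        return "easy"    # benchmark results probably transfer
    if specificity >= 0.3:
        return "medium"  # build 10-20 domain-specific test queries first
    return "hard"        # domain-specific evaluation required before deploying

print(difficulty_tier(0.177))  # "hard", matching the GDPR example report
```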
ragprobe ships with measured profiles from real corpora so you can see where your domain sits:
| Domain | Specificity | Difficulty | Recall@5 | Source |
|---|---|---|---|---|
| CaseHOLD (legal holdings) | 0.985 | Easy | 90–99% | Case names, statute numbers act as unique identifiers |
| HotpotQA (Wikipedia) | 0.946 | Easy | 85–95% | Named entities, dates, specific facts |
| Financial (SEC filings) | ~0.95 | Easy | 85–95% | Company names, dates, figures |
| Technical docs (product-specific) | 0.70–0.90 | Medium | 55–80% | Product terms provide moderate specificity |
| Medical (clinical) | 0.40–0.70 | Medium | 35–60% | Clinical terminology helps but overlaps |
| GDPR (regulatory) | 0.177 | Hard | 15–35% | Generic vocabulary: "data," "processing," "controller" everywhere |
| RFC (technical standards) | 0.024 | Hard | 5–20% | "client," "server," "request," "response" in every passage |
```shell
ragprobe score --corpus ./my-docs --queries my-queries.json --compare-references
```

ragprobe ships with three sample domains so you can see the tiers in action immediately:
```shell
# Easy domain — Wikipedia factoid passages (specificity ~0.97)
ragprobe score --corpus examples/wiki/corpus --queries examples/wiki/queries.txt --compare-references

# Hard domain — GDPR regulatory text (specificity ~0.28)
ragprobe score --corpus examples/regulatory/corpus --queries examples/regulatory/queries.txt --compare-references

# Medium domain — technical product docs (specificity ~0.70)
ragprobe score --corpus examples/technical/corpus --queries examples/technical/queries.txt --compare-references
```

- Before choosing a benchmark. "Should I trust my HotpotQA results for this legal corpus?" Run ragprobe. 5 seconds. Answer: no.
- Before deploying to a new domain. You built a RAG system on product docs (medium difficulty). Now the team wants to add compliance policies (hard). How much will retrieval degrade? Measure it.
- In CI/CD. Gate deployment if domain difficulty exceeds a threshold without domain-specific evaluation queries.
- For mixed corpora. Real knowledge bases aren't single-domain. ragprobe tells you which parts of your corpus are easy and which will fail silently.
```shell
ragprobe score --corpus ./docs --queries queries.json --ci --max-difficulty hard
```

- Single well-known domain. If you know you're deploying on GDPR and you've already built domain-specific evaluation, you don't need ragprobe to tell you it's hard.
- Predicting exact recall numbers. ragprobe predicts a difficulty tier, not a precise metric. It tells you "this is hard" not "you'll get 23.7% recall."
- Comparing retrieval architectures. ragprobe measures domain difficulty, not pipeline quality. Use ragtune for retrieval benchmarking.
- Tokenize each query into non-stopword terms
- Build an inverted index of the corpus (which terms appear in which passages)
- Compute specificity per query: fraction of terms appearing in fewer than 5 passages
- Compute IDF statistics: average and max inverse document frequency per query
- Identify ambiguous terms: terms with highest document frequency
- Compare against built-in reference profiles
- Report difficulty tier, per-query breakdown, and actionable recommendations
The core insight: if the text is lexically ambiguous (many passages share the same vocabulary), no retrieval method — keyword, dense, or hybrid — will have an easy time. Embeddings compress text into vectors; they don't invent semantic distinctions that aren't in the text. ragprobe measures a difficulty floor that applies regardless of architecture.
```shell
# Score a corpus against queries
ragprobe score --corpus ./docs --queries queries.json

# JSON output for CI/CD
ragprobe score --corpus ./docs --queries queries.json --format json

# Compare against built-in reference profiles
ragprobe score --corpus ./docs --queries queries.json --compare-references

# CI mode: exit 1 if difficulty exceeds threshold without domain-specific eval
ragprobe score --corpus ./docs --queries queries.json --ci --max-difficulty hard

# Score pre-chunked text (one file per chunk)
ragprobe score --corpus ./chunks/ --queries queries.json --pre-chunked

# Read queries from a plain text file (one per line)
ragprobe score --corpus ./docs --queries questions.txt
```

Part of a RAG tools suite for retrieval quality:
| Tool | Layer | Question it answers |
|---|---|---|
| chunkweaver | Ingestion | "Are my chunks structurally coherent?" |
| ragtune | Evaluation | "How does my retrieval actually perform?" |
| ragprobe | Pre-deployment | "Will my benchmark results transfer to this domain?" |
They compose through standard formats (text files, JSON), not shared dependencies:
```shell
chunkweaver legal_doc.txt --preset legal-eu --format jsonl > chunks.jsonl
ragprobe score --corpus ./chunks/ --queries queries.json --pre-chunked
ragtune ingest ./chunks/ --collection test --pre-chunked
ragtune simulate --collection test --queries queries.json
```

Vocabulary specificity is a pre-retrieval difficulty metric rooted in Query Performance Prediction (QPP), an established area of information retrieval research. The core insight — that retrieval difficulty is predictable from corpus statistics before any embedding is computed — has been validated across TREC benchmarks since the early 2000s.
Key references:
- Hauff, Hiemstra & de Jong, "A Survey of Pre-Retrieval Query Performance Predictors" (CIKM 2008) — foundational survey of pre-retrieval QPP methods
- Thakur et al., "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models" (NeurIPS 2021) — demonstrates the benchmark transfer problem across 18 domains
- Alexapolsky, "Your Benchmark Doesn't Generalize" — cross-domain RAG experiments showing vocabulary specificity predicts NeedleCoverage collapse from 95% to 28%
ragprobe is, to our knowledge, the first pip-installable tool that makes pre-retrieval domain difficulty metrics accessible to RAG practitioners. The QPP research community produced 20 years of validated metrics; ragprobe packages the most actionable ones for modern retrieval pipelines.
```
ragprobe/
├── __init__.py   # Public API: DomainProbe
├── scorer.py     # Core: inverted index, specificity, IDF, difficulty tiers
├── models.py     # DomainReport, QueryDifficulty dataclasses
├── profiles.py   # Built-in reference profiles (GDPR, RFC, HotpotQA, etc.)
├── loaders.py    # Corpus and query loaders (files, JSON, directories)
└── cli.py        # CLI entry point
```
Design principles:
- Zero dependencies for core — stdlib only, no heavy ML frameworks
- CLI requires only `click` (`pip install ragprobe[cli]`)
- All scores are deterministic and reproducible
- JSON output for CI/CD integration
- Lexical only. ragprobe measures word-level specificity, not semantic similarity. Two passages with identical vocabulary but different meaning (e.g., "shall erase" vs "may erase") will appear equally specific. This means ragprobe predicts a difficulty floor — actual retrieval may perform slightly better with strong embedding models.
- Correlation, not causation. Vocabulary specificity correlates with retrieval difficulty (validated on GDPR, RFC, HotpotQA, CaseHOLD) but is one of several factors. Answer dispersion (how many passages contain the answer) and semantic role diversity also matter.
- English-centric stopword list. The default stopword list is English. For other languages, pass a custom stopword set.
MIT
Oleksii Alexapolsky (𝕏) — building retrieval quality tools: chunkweaver, ragtune, ragprobe.