Get JSON for each translation
curl https://bolls.life/static/translations/NASB.json > nasb.json
curl https://bolls.life/static/translations/NIV.json > niv.json
curl https://bolls.life/static/translations/ESV.json > esv.json
Generate each csv dataset from JSON
Example.Dataset.gen_bible("nasb")
Example.Dataset.gen_bible("niv")
Example.Dataset.gen_bible("esv")
Clean up the nasb csv, then the niv and esv csvs (vim find/replace)
%s/\[\([^]]*\)\]/\1/g
%s/’/'/g
%s/‘/'/g
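For anyone who would rather script the clean-up than run it in vim, here is a minimal Python sketch of the same three substitutions. The helper name `clean_line` is hypothetical and is not part of the `Example` modules. It assumes the csv can be processed one line at a time.

```python
import re

# Matches [text] and captures the text, like the vim pattern \[\([^]]*\)\]
BRACKETED = re.compile(r"\[([^\]]*)\]")

def clean_line(line: str) -> str:
    """Hypothetical helper: apply the three vim substitutions to one line."""
    line = BRACKETED.sub(r"\1", line)   # drop the square brackets, keep the words
    line = line.replace("\u2019", "'")  # right curly quote -> straight quote
    line = line.replace("\u2018", "'")  # left curly quote -> straight quote
    return line
```

Applying `clean_line` to every line of the nasb, niv, and esv csv files should give the same result as the vim commands.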
Generate the context-length-100 chunks for all 3
Example.Prep.generate("nasb", encoder: true)
Example.Prep.generate("niv", encoder: true)
Example.Prep.generate("esv", encoder: true)
Combine them all into a single csv
mkdir combined
cd combined
cp ../nasbtraining.csv .
cp ../nivtraining.csv .
cp ../esvtraining.csv .
cat *.csv > pretraining.csv
Shuffle and deduplicate (uniq) the pretraining data
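The notes do not record the command used for this step. One possible way to do it is sketched below in Python. The function names, the seed parameter, and the output file name are all hypothetical. It assumes one csv row per line, with no header row and no quoted fields that span lines.

```python
import random

def shuffle_unique(lines, seed=None):
    """Drop exact duplicate lines, then shuffle what is left."""
    unique = list(dict.fromkeys(lines))  # dict preserves first-seen order
    random.Random(seed).shuffle(unique)  # seeded RNG makes the order reproducible
    return unique

def shuffle_file(src, dst, seed=None):
    """Read src line by line, skip blank lines, write the shuffled unique rows to dst."""
    with open(src, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    with open(dst, "w", encoding="utf-8") as f:
        for line in shuffle_unique(lines, seed):
            f.write(line + "\n")
```

For example, `shuffle_file("pretraining.csv", "pretraining_shuffled.csv", seed=42)` would write a deduplicated, shuffled copy next to the combined file.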
Example.Encoder.scheduled(280, 0.00017)
Example.Scoring.evaluate()
Example.Scoring.comprehensive_eval()
curl https://bolls.life/static/translations/NLT.json > nlt.json
Example.Dataset.gen_bible("nlt")
## vim find/replace
%s/<br>/ /g
%s/<i>\(.\{-}\)<\/i>/\1/g
%s/’/'/g
%s/‘/'/g
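The NLT substitutions can also be scripted. The sketch below is a hypothetical helper, not part of the `Example` modules. Its main purpose is to show that vim's non-greedy `\{-}` corresponds to `*?` in Python's `re`, so each `<i>...</i>` pair is unwrapped separately even when several appear on one line.

```python
import re

# Non-greedy match, like vim's <i>\(.\{-}\)<\/i>
ITALIC = re.compile(r"<i>(.*?)</i>")

def clean_nlt_line(line: str) -> str:
    """Hypothetical helper: apply the four NLT substitutions to one line."""
    line = line.replace("<br>", " ")    # line-break tag -> single space
    line = ITALIC.sub(r"\1", line)      # unwrap italics, keep the inner text
    line = line.replace("\u2019", "'")  # right curly quote -> straight quote
    line = line.replace("\u2018", "'")  # left curly quote -> straight quote
    return line
```

A greedy `.*` here would match from the first `<i>` to the last `</i>` on the line and leave the inner tags in place, which is why the non-greedy form matters.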
Seed the database from a given text
Example.Utils.seed("nlt")
Example.Utils.add_verse_embeddings()
Index verses for BM25 search
Example.Verse.index_verses()
# uncomment last index for ilike perf
vim priv/repo/migrations/20250718010521_add_bm25_stats.exs
mix ecto.reset
# setup python venv
uv python install 3.13
uv venv
source .venv/bin/activate
uv sync
python beir.py
================================================================================
BENCHMARK RESULTS COMPARISON
================================================================================
Dataset: nfcorpus | Total Queries: 324
--------------------------------------------------------------------------------
Metric | BM25 | ILIKE | Improvement
--------------------------------------------------------------------------------
NDCG@10 (relevance) | 0.2940 | 0.1105 | +166.1%
QPS (throughput) | 4.86 | 225.36 | -97.8%
Avg Latency (ms) | 205.94 | 4.44 | -4541.1%
Precision@10 | 0.2806 | 0.1747 | +60.6%
Recall@10 | 0.1302 | 0.0454 | +0.0849pp
Recall % (found relevant) | 66.7 | 28.1 | +38.6pp
--------------------------------------------------------------------------------
Total Time | 66.73 | 1.44 | (seconds)
================================================================================
KEY TAKEAWAYS:
✓ BM25 provides 166.1% better relevance ranking
✓ BM25 retrieves more relevant documents within the top 10 (Precision@10)
✓ BM25 finds a larger proportion of all relevant documents (Recall@10)
⚠ BM25 is 97.8% slower, but much more relevant
✓ BM25 finds relevant results for 38.6% more queries
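To make the metric names in the report concrete, here is a minimal Python sketch of Precision@k, Recall@k, and NDCG@k under their usual textbook definitions, plus the relative-change formula that reproduces the NDCG, QPS, and Precision percentages in the "Improvement" column. This is an illustration only. It is not taken from `beir.py`, which may compute these values differently, and all function names are hypothetical.

```python
import math

def precision_at_k(ranked_ids, relevant, k=10):
    """Fraction of the top-k slots filled by relevant documents."""
    return sum(1 for d in ranked_ids[:k] if d in relevant) / k

def recall_at_k(ranked_ids, relevant, k=10):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked_ids[:k] if d in relevant) / len(relevant)

def ndcg_at_k(ranked_ids, gains, k=10):
    """DCG of the ranking divided by the DCG of an ideal ranking (linear gain)."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def relative_change_pct(value, baseline):
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100
```

With the reported numbers, `relative_change_pct(0.2940, 0.1105)` rounds to 166.1 and `relative_change_pct(4.86, 225.36)` rounds to -97.8, matching the NDCG@10 and QPS rows.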