This repo provides a simple toolchain to evaluate Dify Knowledge Base retrieval quality across different configurations (chunking strategy, TopK, reranking on/off).
Chinese docs: `README.zh-CN.md`, `docs/FAQ.zh-CN.md`
Core scripts (pipeline):
- `build_evaluation_set.py`: build candidate questions from an existing Dify dataset (Knowledge Base)
- Manual review: filter candidates and save as `evaluation_set.xlsx`
- `rag_evaluator.py`: run evaluation for one dataset
- `batch_evaluation.py`: compare multiple datasets/configs in batch
- `visualization.py`: generate charts/reports from summary JSON
- `run_evaluation.py`: one-click batch evaluation + visualization
This project is pure Python. Install the usual data stack:
```bash
python3 -m pip install -U pandas numpy openpyxl requests python-dotenv tqdm matplotlib seaborn jieba
```

Required:

```
DIFY_API_KEY=...
```

Optional:

```
DIFY_API_BASE=https://api.dify.ai/v1   # default
```

For batch comparison (3 chunking strategies):

```
DATASET_ID_GENERAL=...
DATASET_ID_PARENT_CHILD=...
DATASET_ID_QA=...
```

For reranking (must match Dify "System Model Settings"):

```
RERANK_PROVIDER_NAME=local            # or siliconflow, ...
RERANK_MODEL_NAME=bge-reranker-base   # or BAAI/bge-reranker-v2-m3, ...
```

For multi-dataset evaluation correctness:

```
GOLD_MATCH_MODE=doc_name   # recommended when comparing different datasets
```
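As a quick sanity check, the settings above can be loaded from a `.env` file with python-dotenv (already in the install list) and printed. This is only an illustrative sketch of how such a check might look, not code from this repo's scripts:

```python
# check_env.py -- illustrative sketch, not part of this repo's scripts.
# Loads a .env file via python-dotenv and reports which settings are present.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory, if present

required = ["DIFY_API_KEY"]
optional = [
    "DIFY_API_BASE",
    "DATASET_ID_GENERAL", "DATASET_ID_PARENT_CHILD", "DATASET_ID_QA",
    "RERANK_PROVIDER_NAME", "RERANK_MODEL_NAME",
    "GOLD_MATCH_MODE",
]

for name in required:
    print(f"{name}: {'set' if os.getenv(name) else 'MISSING (required)'}")
for name in optional:
    print(f"{name}: {os.getenv(name) or '(not set)'}")
```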
```bash
python3 build_evaluation_set.py --action build --dataset-id <ONE_DATASET_ID> --output candidates.xlsx
```

Then manually review `candidates.xlsx`:

- mark `is_valid=Y` for good rows
- fill `category`/`difficulty` if you want grouped analysis
- save as `evaluation_set.xlsx` (a scripted version of this filter step is sketched below)
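If you prefer to script that last step, here is a minimal sketch with pandas; the only column it assumes is `is_valid`, as described above:

```python
# filter_candidates.py -- illustrative sketch of turning reviewed candidates
# into the final evaluation set; not part of this repo's scripts.
import pandas as pd

df = pd.read_excel("candidates.xlsx")
# Keep only rows marked is_valid=Y during manual review.
kept = df[df["is_valid"].astype(str).str.strip().str.upper() == "Y"]
kept.to_excel("evaluation_set.xlsx", index=False)
print(f"Kept {len(kept)} of {len(df)} candidate rows -> evaluation_set.xlsx")
```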
Recommended for comparing multiple datasets: match gold by document name.
```bash
python3 rag_evaluator.py \
  --dataset-id <DATASET_ID> \
  --eval-set evaluation_set.xlsx \
  --top-k 5 \
  --gold-match doc_name
```

With reranking:
```bash
python3 rag_evaluator.py \
  --dataset-id <DATASET_ID> \
  --eval-set evaluation_set.xlsx \
  --top-k 5 \
  --gold-match doc_name \
  --use-rerank \
  --rerank-provider siliconflow \
  --rerank-model BAAI/bge-reranker-v2-m3
```

One-click batch evaluation + visualization:

```bash
python3 run_evaluation.py
```

Outputs:
- `results_<timestamp>/summary_*.json|.xlsx`
- charts: `*.png`
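For reference, each evaluated question boils down to one retrieval request against the Knowledge Base. The sketch below assumes Dify's Knowledge API endpoint `POST /datasets/{dataset_id}/retrieve` and its `retrieval_model` payload; field names can differ between Dify versions, so check your instance's API reference rather than treating this as the evaluator's exact implementation:

```python
# retrieve_once.py -- illustrative sketch of a single retrieval request.
# Endpoint and payload shape are assumptions based on the Dify Knowledge API;
# verify them against your Dify version.
import os

import requests

BASE = os.getenv("DIFY_API_BASE", "https://api.dify.ai/v1")
HEADERS = {"Authorization": f"Bearer {os.environ['DIFY_API_KEY']}"}

def retrieve(dataset_id: str, query: str, top_k: int = 5, use_rerank: bool = False) -> list[dict]:
    """Query one Knowledge Base and return the raw retrieval records."""
    retrieval_model = {
        "search_method": "semantic_search",
        "reranking_enable": use_rerank,
        "top_k": top_k,
        "score_threshold_enabled": False,
    }
    if use_rerank:
        retrieval_model["reranking_model"] = {
            "reranking_provider_name": os.getenv("RERANK_PROVIDER_NAME", ""),
            "reranking_model_name": os.getenv("RERANK_MODEL_NAME", ""),
        }
    resp = requests.post(
        f"{BASE}/datasets/{dataset_id}/retrieve",
        headers=HEADERS,
        json={"query": query, "retrieval_model": retrieval_model},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("records", [])
```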
`gold_doc_id` is dataset-scoped in Dify: the same file uploaded to different Knowledge Bases usually gets a different `document_id`.
So for comparing chunk strategies (general vs parent-child vs QA), use `gold_doc_name` in the evaluation set and `--gold-match doc_name` when evaluating.
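To make the two matching modes concrete, here is a small sketch of a hit@k check over the records returned by the retrieval call above (the nested `segment.document` field access is an assumption about the response shape, not the exact logic in `rag_evaluator.py`):

```python
# Illustrative sketch of gold matching by doc_name vs doc_id.
def hit_at_k(records: list[dict], gold: str, mode: str = "doc_name", k: int = 5) -> bool:
    """True if the gold document appears among the top-k retrieved records."""
    for record in records[:k]:
        doc = record.get("segment", {}).get("document", {}) or {}
        key = doc.get("name") if mode == "doc_name" else doc.get("id")
        if key == gold:
            return True
    return False
```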
See `examples/` for a tiny demo corpus and a sample `evaluation_set_example.xlsx` you can use for smoke-testing.
CRUD_RAG/ is included as a git submodule pointing to https://github.com/IAAR-Shanghai/CRUD_RAG.git.
It is not required for running the Dify evaluation scripts.