🗺️ Overview | 📏 Evaluation | 🔀 Routers | 🚀 Quickstart | 📚 Datasets
RouterXBench is an LLM routing / selective prediction framework. Given a weak model (cheap) and a strong model (expensive), it trains/evaluates different routers (confidence/difficulty predictors) to decide when to escalate to the strong model, and reports metrics such as recovery rate, call rate, and AUROC.
The main entrypoint is src/main.py with three modes:
prepare: generate artifacts per dataset (scores, logits/hidden states, query embeddings, ...)train: train routers (probe / embedding_mlp / trained_deberta / ...)eval: evaluate routers on datasets and write results tosrc/metric_results/
The figure summarizes two things:
- Methodology (right): a router predicts whether the SLM/weak model can answer correctly; if not, we call the LLM/strong model.
- Evaluation framework (left): we evaluate cross-domain robustness, router ability (AUROC), and scenario alignment under different call-rate regimes (LPM / MPM / HCR).
- Cross-domain robustness: evaluate in-domain and out-of-domain datasets (see Datasets).
- In-domain examples in our setting: Alpaca, BigMath, MMLU
- Out-of-domain examples in our setting: Magpie, Math, MMLU-Pro
- Router ability (AUROC): how well the router predicts whether the SLM answers correctly (binary label from correctness/LLM-judge).
- Scenario alignment (call-rate regimes):
- LPM (Low-band Performance Mean): low call-rate band (strict budget)
- MPM (Mid-band Performance Mean): medium call-rate band
- HCR (High-band Call-Rate): high call-rate regime
These bands are controlled in config by recovery_rate_band and lpm_call_rate_band (see config.yaml and src/config.py).
All routers are implemented in src/router.py (and wired into the pipeline via src/pipeline.py / src/main.py).
Select the router via router.router_type in config.yaml:
- Verb-based (
router_type=self_based): routers that operate on model-generated text (verbalized signals), e.g.semantic_entropy,self_questioning. - Logits-based (
router_type=logits_based_routers): routers that operate on weak-model logits (and derived statistics), e.g.max_logits,top10_variance,entropy,confidence_margin,logits_margin(andcoe). - Embedding-based (
router_type=embedding_mlp): route based on pre-computed query embeddings + an MLP head (EmbeddingMLPRouter). - Probe-based (ours) (
router_type=probe): route based on hidden states + a learned probe (ProbeRouter). In our setup we useprobe_type=meanandprobe_type=dynamic_dirichlet.
Batch modes (convenience):
router_type=self_basedwill evaluatesemantic_entropyandself_questioning.router_type=logits_based_routerswill evaluatemax_logits,top10_variance,coe,entropy, andconfidence_margin.
RouterXBench supports two probe types used in our setup:
-
probe_type=mean: uniform (deterministic) fusion,$\hat{z}(x)=\frac{1}{L}\sum_{l=1}^{L} z^{(l)}(x)$ . -
probe_type=dynamic_dirichlet: DynamicDirichlet with a Dirichlet distribution over layer weights.
Given per-layer hidden states
Then a probe head maps
Notes (matching current RouterXBench code):
-
Training: sample
$\alpha$ from Dirichlet for regularization (stochastic fusion). -
Inference: use the Dirichlet expected weights (deterministic fusion). The concentration mass (e.g.
$\beta_0=\sum_l \beta_l$ ) can serve as an uncertainty proxy.
At a high level you will:
- (0) Setup environment from
requirements.txt - (1) Configure models + evaluation knobs in
config.yaml - (2) Start services: vLLM (weak/strong as needed) and xVerify (for math/mmlu/qa)
- (3) Prepare artifacts:
scores/logits/embeddings - (4) Train routers (probes / embedding_mlp / deberta)
- (5) Eval routers and write metrics
Clone the repo:
git clone https://github.com/zhuchichi56/RouterXBench.git && cd RouterXBenchIf you need xVerify (for math / mmlu / qa scoring), clone it into external/xverify:
git clone https://github.com/IAAR-Shanghai/xVerify.git external/xverifyInstall dependencies (recommended in a fresh venv/conda env):
pip install -r requirements.txtIf you run xVerify locally (via vllm serve ...), it may require extra deps (from external/xverify):
pip install -r external/xverify/requirements.txtDefault config is config.yaml (see src/config.py:PipelineConfig.from_yaml). You typically update:
inference.weak_model_path/inference.strong_model_pathinference.base_port,inference.{weak,strong}_gpu_ids,inference.cuda_visible_devices- (if using GPT/Judge) set
OPENAI_API_KEY/OPENAI_API_BASEenv vars or fill them in YAML - (for math/mmlu scoring) configure xVerify (
inference.xverify_model_url, etc.)
RouterXBench queries models via HTTP (src/inference/vllm_client.py).
- When do I need vLLM?
- If your
prepare_stepsincludesscores(default), you need inference access for weak and strong. - If your
evaluses self-based routers (semantic_entropy,self_questioning), you need inference access for the weak model. - If
inference.strong_model_path=false(use remote judge / API), you do not need a local strong-model vLLM.
- If your
- When do I need xVerify?
- For
math / mmlu / qadatasets, scoring uses xVerify to judge correctness (src/loss_calculator.py).
- For
Example: start xVerify server (from src/scripts/run.sh):
CUDA_VISIBLE_DEVICES=3 \
vllm serve /path/to/xVerify-9B-C \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--served-model-name xVerify \
--trust-remote-codeStart server(s) first:
cd src/inference
python start.py --model_path "/path/to/your/model" --base_port 8001 --gpu_list "0,1"Scripts are under src/scripts/ (they cd src and temporarily patch config, restored on exit):
# prepare artifacts
bash src/scripts/prepare_all.sh
# train probes (datasets before -- ; probe types after --)
bash src/scripts/train_probes.sh alpaca_5k_train big_math_5k_train mmlu_train -- mean
# train DynamicDirichlet probe
bash src/scripts/train_probes.sh alpaca_5k_train big_math_5k_train mmlu_train -- dynamic_dirichlet
# evaluate routers
bash src/scripts/eval.shprepare can run three steps (controlled by prepare_steps in config / src/scripts/prepare_all.sh):
- scores: runs weak/strong inference and writes per-sample correctness labels under
results/.- For
generaldatasets, scoring uses LLM-as-a-judge (llm_judge_generalinsrc/data.py). - For
math / mmlu / qadatasets, scoring uses xVerify (src/loss_calculator.py).
- For
- logits: generates weak-model logits + hidden states (used by logits-based routers and probes), written under
logits_output/andhs/. - embeddings: generates query embeddings (used by
embedding_mlp), written underquery_embeddings_output/.
Built-in dataset names are defined in src/data.py:DatasetRegistry. Commonly used ones:
- In-domain:
alpaca_5k_{train,test},big_math_5k_{train,test},mmlu_{train,test} - Out-of-domain:
magpie_5k_{train,test},math,mmlu_pro_* - QA:
hotpotqa_500,hotpotqa_4k,med_qa_1k
If you need to generate some JSONL under src/data/, helper scripts exist:
python src/utils/download_alpaca_data.py 10
python src/utils/download_hotpotqa.py -n 500
python src/utils/download_med_qa.py -n 1000- This repo may reference
external/xverifyfor certain correctness evaluation workflows. - Logs are written under
src/logs/.
- Wanxing Wu (Institut Polytechnique de Paris): Led the project, proposed the initial idea, implemented and improved the code, conducted major experiments, and wrote the paper.
- He Zhu (Peking University): Developed the code architecture and implemented core methods, substantially enhanced the methodology, contributed to experimental design, and co-wrote and revised the manuscript with critical insights.
- Yixia Li (Southern University of Science and Technology): Conducted experiments, participated in idea development and experimental design, and co-wrote and extensively revised the manuscript.
- Guanhua Chen (Southern University of Science and Technology): Supervised and guided the project, and provided resources.
- Hongru Wang (University of Edinburgh): Contributed to manuscript revision.
- Lei Yang, Zhao Jiehui (Deepxi Co.), Jian Yang (Beihang University), Benyou Wang, Bingyi Jing (Chinese University of Hong Kong, Shenzhen): Contributed through discussions.
