RouterXBench: Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems

🗺️ Overview | 📏 Evaluation | 🔀 Routers | 🚀 Quickstart | 📚 Datasets

RouterXBench is a framework for LLM routing / selective prediction. Given a cheap weak model and an expensive strong model, it trains and evaluates routers (confidence/difficulty predictors) that decide when to escalate a query to the strong model, and reports metrics such as recovery rate, call rate, and AUROC.
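
As a rough illustration of the routing objective (not the repo's actual implementation), the sketch below thresholds a router score to decide when to escalate and computes call rate plus one common recovery-rate definition; the score convention, threshold, and exact recovery-rate formula used by RouterXBench are assumptions, so check src/pipeline.py and src/metric_results/ for the real definitions.

import numpy as np

def route_and_score(router_scores, weak_correct, strong_correct, threshold=0.5):
    """Escalate to the strong model when the router score is below `threshold`.
    router_scores: higher = more confident the weak model is correct (assumed convention).
    weak_correct / strong_correct: 0/1 correctness labels per sample."""
    router_scores = np.asarray(router_scores)
    weak_correct, strong_correct = np.asarray(weak_correct), np.asarray(strong_correct)
    escalate = router_scores < threshold                       # queries routed to the strong model
    final_correct = np.where(escalate, strong_correct, weak_correct)
    call_rate = escalate.mean()                                # fraction of queries escalated
    # One common recovery-rate definition: fraction of the weak->strong accuracy gap recovered.
    gap = strong_correct.mean() - weak_correct.mean()
    recovery_rate = (final_correct.mean() - weak_correct.mean()) / gap if gap > 0 else 1.0
    return call_rate, recovery_rate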

The main entrypoint is src/main.py with three modes:

  • prepare: generate artifacts per dataset (scores, logits/hidden states, query embeddings, ...)
  • train: train routers (probe / embedding_mlp / trained_deberta / ...)
  • eval: evaluate routers on datasets and write results to src/metric_results/

🗺️ Overview

RouterXBench overview: collaboration architecture & evaluation protocol

The figure summarizes two things:

  • Methodology (right): a router predicts whether the SLM/weak model can answer correctly; if not, we call the LLM/strong model.
  • Evaluation framework (left): we evaluate cross-domain robustness, router ability (AUROC), and scenario alignment under different call-rate regimes (LPM / MPM / HCR).

📏 Evaluation

What we measure

  • Cross-domain robustness: evaluate on both in-domain and out-of-domain datasets (see Datasets).
    • In-domain examples in our setting: Alpaca, BigMath, MMLU
    • Out-of-domain examples in our setting: Magpie, Math, MMLU-Pro
  • Router ability (AUROC): how well the router predicts whether the SLM answers correctly (binary label from correctness/LLM-judge).
  • Scenario alignment (call-rate regimes):
    • LPM (Low-band Performance Mean): low call-rate band (strict budget)
    • MPM (Mid-band Performance Mean): medium call-rate band
    • HCR (High-band Call-Rate): high call-rate regime

These bands are controlled in config by recovery_rate_band and lpm_call_rate_band (see config.yaml and src/config.py).
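As a minimal sketch of how these quantities could be computed from router scores and correctness labels: AUROC comes straight from scikit-learn, and a band mean averages performance over the sweep points whose call rate falls inside a band. The band boundaries below are placeholders; the real ones come from recovery_rate_band / lpm_call_rate_band in config.yaml.

import numpy as np
from sklearn.metrics import roc_auc_score

def router_auroc(router_scores, weak_correct):
    # Router ability: how well the score separates "weak model correct" from "incorrect"
    return roc_auc_score(weak_correct, router_scores)

def band_mean(call_rates, performances, low, high):
    # Mean performance over sweep points whose call rate lies in [low, high],
    # e.g. a low band for LPM and a mid band for MPM (boundaries are illustrative)
    call_rates, performances = np.asarray(call_rates), np.asarray(performances)
    mask = (call_rates >= low) & (call_rates <= high)
    return performances[mask].mean() if mask.any() else float("nan")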

🔀 Routers

All routers are implemented in src/router.py (and wired into the pipeline via src/pipeline.py / src/main.py).

Methodology: router families

Select the router via router.router_type in config.yaml:

  • Verb-based (router_type=self_based): routers that operate on model-generated text (verbalized signals), e.g. semantic_entropy, self_questioning.
  • Logits-based (router_type=logits_based_routers): routers that operate on weak-model logits (and derived statistics), e.g. max_logits, top10_variance, entropy, confidence_margin, logits_margin (and coe).
  • Embedding-based (router_type=embedding_mlp): route based on pre-computed query embeddings + an MLP head (EmbeddingMLPRouter).
  • Probe-based (ours) (router_type=probe): route based on hidden states + a learned probe (ProbeRouter). In our setup we use probe_type=mean and probe_type=dynamic_dirichlet.

Batch modes (convenience):

  • router_type=self_based will evaluate semantic_entropy and self_questioning.
  • router_type=logits_based_routers will evaluate max_logits, top10_variance, coe, entropy, and confidence_margin.
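
To make the logits-based family concrete, here is a hedged sketch of a few of the listed signals computed from a weak-model token-logit matrix; the exact aggregation RouterXBench uses (which tokens, top-k size, sign conventions) lives in src/router.py and may differ.

import torch
import torch.nn.functional as F

def logits_signals(logits: torch.Tensor) -> dict:
    """logits: [num_tokens, vocab_size] weak-model logits for one response (assumed shape).
    Returns per-response scalars; higher = more confident for max/margin signals,
    lower = more confident for entropy/variance (check src/router.py for the real conventions)."""
    probs = F.softmax(logits, dim=-1)
    top2 = probs.topk(2, dim=-1).values
    return {
        "max_logits": logits.max(dim=-1).values.mean().item(),            # mean max logit per token
        "entropy": (-(probs * probs.clamp_min(1e-12).log()).sum(-1)).mean().item(),
        "confidence_margin": (top2[:, 0] - top2[:, 1]).mean().item(),     # top-1 vs top-2 probability gap
        "top10_variance": logits.topk(10, dim=-1).values.var(dim=-1).mean().item(),
    }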

Probe-based (ours): Mean / DynamicDirichlet

RouterXBench supports two probe types used in our setup:

  • probe_type=mean: uniform (deterministic) fusion, $\hat{z}(x)=\frac{1}{L}\sum_{l=1}^{L} z^{(l)}(x)$.
  • probe_type=dynamic_dirichlet: DynamicDirichlet with a Dirichlet distribution over layer weights.

Given per-layer hidden states $z^{(1)}(x),\dots,z^{(L)}(x)$, we learn a distribution over layer-weights:

$$ \alpha(x) \sim \mathrm{Dir}(\beta(x)), \quad \hat{z}(x) = \sum_{l=1}^{L}\alpha_l(x)\, z^{(l)}(x) $$

Then a probe head maps $\hat{z}(x)$ to a probe score (sigmoid) used for routing.

Notes (matching current RouterXBench code):

  • Training: sample $\alpha$ from Dirichlet for regularization (stochastic fusion).
  • Inference: use the Dirichlet expected weights (deterministic fusion). The concentration mass (e.g. $\beta_0=\sum_l \beta_l$) can serve as an uncertainty proxy.
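
A minimal sketch of the two fusion strategies described above (uniform mean vs. a Dirichlet over layer weights). The gating network, layer count, and hidden size are illustrative assumptions, not the exact architecture in src/router.py.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DirichletProbe(nn.Module):
    def __init__(self, num_layers: int, hidden_dim: int):
        super().__init__()
        self.beta_net = nn.Linear(num_layers * hidden_dim, num_layers)   # concentration beta(x)
        self.head = nn.Linear(hidden_dim, 1)                             # probe head -> routing score

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: [batch, num_layers, hidden_dim] per-layer hidden states z^(l)(x)
        beta = F.softplus(self.beta_net(layer_states.flatten(1))) + 1e-4
        if self.training:
            # Training: sample alpha ~ Dir(beta(x)) for stochastic fusion (regularization)
            alpha = torch.distributions.Dirichlet(beta).rsample()
        else:
            # Inference: deterministic fusion with the Dirichlet mean;
            # beta.sum(-1) (total concentration) can serve as an uncertainty proxy
            alpha = beta / beta.sum(dim=-1, keepdim=True)
        fused = (alpha.unsqueeze(-1) * layer_states).sum(dim=1)           # z_hat(x) = sum_l alpha_l z^(l)(x)
        return torch.sigmoid(self.head(fused)).squeeze(-1)                # probe score in (0, 1)

For probe_type=mean, the fused representation is simply layer_states.mean(dim=1) fed to the same head.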

🚀 Quickstart

TL;DR: run RouterXBench end-to-end

At a high level you will:

  • (0) Setup environment from requirements.txt
  • (1) Configure models + evaluation knobs in config.yaml
  • (2) Start services: vLLM (weak/strong as needed) and xVerify (for math/mmlu/qa)
  • (3) Prepare artifacts: scores / logits / embeddings
  • (4) Train routers (probes / embedding_mlp / deberta)
  • (5) Eval routers and write metrics

Step 0: Setup environment

Clone the repo:

git clone https://github.com/zhuchichi56/RouterXBench.git && cd RouterXBench

If you need xVerify (for math / mmlu / qa scoring), clone it into external/xverify:

git clone https://github.com/IAAR-Shanghai/xVerify.git external/xverify

Install dependencies (recommended in a fresh venv/conda env):

pip install -r requirements.txt

If you run xVerify locally (via vllm serve ...), it may require extra deps (from external/xverify):

pip install -r external/xverify/requirements.txt

Step 1: Configure

The default config is config.yaml (see src/config.py:PipelineConfig.from_yaml). You typically update:

  • inference.weak_model_path / inference.strong_model_path
  • inference.base_port, inference.{weak,strong}_gpu_ids, inference.cuda_visible_devices
  • (if using GPT/Judge) set OPENAI_API_KEY / OPENAI_API_BASE env vars or fill them in YAML
  • (for math/mmlu scoring) configure xVerify (inference.xverify_model_url, etc.)
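
For example, a small sanity-check script that reads the YAML and prints the knobs above before launching anything; the key names are taken from this README and the nested layout is an assumption, so adapt it to the actual config.yaml / PipelineConfig.from_yaml schema.

import os
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

inference = cfg.get("inference", {})
print("weak model:", inference.get("weak_model_path"))
print("strong model:", inference.get("strong_model_path"))
print("base port:", inference.get("base_port"))
print("xVerify URL:", inference.get("xverify_model_url"))
print("router type:", cfg.get("router", {}).get("router_type"))

# Judge/API credentials can also come from the environment instead of the YAML
print("OPENAI_API_KEY set:", bool(os.environ.get("OPENAI_API_KEY")))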

Step 2: Start services (vLLM + optional xVerify)

RouterXBench queries models via HTTP (src/inference/vllm_client.py).

  • When do I need vLLM?
    • If your prepare_steps includes scores (default), you need inference access for weak and strong.
    • If your eval uses self-based routers (semantic_entropy, self_questioning), you need inference access for the weak model.
    • If inference.strong_model_path=false (use remote judge / API), you do not need a local strong-model vLLM.
  • When do I need xVerify?
    • For math / mmlu / qa datasets, scoring uses xVerify to judge correctness (src/loss_calculator.py).

Example: start xVerify server (from src/scripts/run.sh):

CUDA_VISIBLE_DEVICES=3 \
vllm serve /path/to/xVerify-9B-C \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --served-model-name xVerify \
  --trust-remote-code

Then start the weak/strong model vLLM server(s) before running the pipeline:

cd src/inference
python start.py --model_path "/path/to/your/model" --base_port 8001 --gpu_list "0,1"
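
Once a server is up, you can sanity-check it with a plain OpenAI-compatible HTTP request (vllm serve exposes /v1/chat/completions); the host, port, and served model name below must match whatever you passed to vllm serve / start.py.

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",        # adjust host/port (e.g. base_port for weak/strong)
    json={
        "model": "xVerify",                              # must match --served-model-name (or the model path)
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 16,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])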

Step 3–5: Run the pipeline

Scripts live under src/scripts/ (they cd into src/ and temporarily patch the config, restoring it on exit):

# prepare artifacts
bash src/scripts/prepare_all.sh

# train probes (datasets before -- ; probe types after --)
bash src/scripts/train_probes.sh alpaca_5k_train big_math_5k_train mmlu_train -- mean

# train DynamicDirichlet probe
bash src/scripts/train_probes.sh alpaca_5k_train big_math_5k_train mmlu_train -- dynamic_dirichlet

# evaluate routers
bash src/scripts/eval.sh

What does prepare produce? (artifacts & outputs)

prepare can run three steps (controlled by prepare_steps in config / src/scripts/prepare_all.sh):

  • scores: runs weak/strong inference and writes per-sample correctness labels under results/.
    • For general datasets, scoring uses LLM-as-a-judge (llm_judge_general in src/data.py).
    • For math / mmlu / qa datasets, scoring uses xVerify (src/loss_calculator.py).
  • logits: generates weak-model logits + hidden states (used by logits-based routers and probes), written under logits_output/ and hs/.
  • embeddings: generates query embeddings (used by embedding_mlp), written under query_embeddings_output/.
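
A quick way to see which artifacts exist after prepare is to walk the output directories named above; only the directory names come from this section, and whether they sit under the repo root or under src/ is an assumption to verify locally.

from pathlib import Path

# Output directories produced by the three prepare steps (location relative to src/ is assumed)
for name in ["results", "logits_output", "hs", "query_embeddings_output"]:
    d = Path("src") / name
    files = list(d.rglob("*")) if d.exists() else []
    print(f"{d}: {sum(1 for p in files if p.is_file())} files")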

📚 Datasets

Built-in dataset names are defined in src/data.py:DatasetRegistry. Commonly used ones:

  • In-domain: alpaca_5k_{train,test}, big_math_5k_{train,test}, mmlu_{train,test}
  • Out-of-domain: magpie_5k_{train,test}, math, mmlu_pro_*
  • QA: hotpotqa_500, hotpotqa_4k, med_qa_1k

Helper scripts are provided to generate the JSONL files under src/data/:

python src/utils/download_alpaca_data.py 10
python src/utils/download_hotpotqa.py -n 500
python src/utils/download_med_qa.py -n 1000
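
The downloaded files are JSONL, so a quick peek at one record looks like this; the file name below is a guess based on the helper script, so substitute whatever it actually writes under src/data/.

import json
from pathlib import Path

path = Path("src/data/hotpotqa_500.jsonl")   # hypothetical file name; check src/data/ for the real one
with path.open() as f:
    first = json.loads(f.readline())
print("fields:", sorted(first.keys()))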

Notes

  • This repo may reference external/xverify for certain correctness evaluation workflows.
  • Logs are written under src/logs/.

👥 Author Contributions

  • Wanxing Wu (Institut Polytechnique de Paris): Led the project, proposed the initial idea, implemented and improved the code, conducted major experiments, and wrote the paper.
  • He Zhu (Peking University): Developed the code architecture and implemented core methods, substantially enhanced the methodology, contributed to experimental design, and co-wrote and revised the manuscript with critical insights.
  • Yixia Li (Southern University of Science and Technology): Conducted experiments, participated in idea development and experimental design, and co-wrote and extensively revised the manuscript.
  • Guanhua Chen (Southern University of Science and Technology): Supervised and guided the project, and provided resources.
  • Hongru Wang (University of Edinburgh): Contributed to manuscript revision.
  • Lei Yang, Zhao Jiehui (Deepxi Co.), Jian Yang (Beihang University), Benyou Wang, Bingyi Jing (Chinese University of Hong Kong, Shenzhen): Contributed through discussions.

About

Official implementation of the paper "RouterXBench: Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems".
