Replication package for the paper "STARouter: Internal State based LLM Router for Software Testing Tasks"
We include notebooks under the `notebooks` directory that quickly walk you through our experimental results. For each Research Question, refer to:
- RQ1. Effectiveness: To what extent does our router approximate the optimal routing scenario? (notebook)
- RQ2. Generalizability: Does our approach generalize across different contexts? (notebook)
- Additional: Sensitivity analysis on input variations and a router optimality demo.
Note: We bundle all extracted internal states, prompt embeddings, and final predictions within this artifact. You can reproduce all tables and figures in the paper using the notebooks above without running any data processing or experiment scripts; you only need to update `REPO_PATH` in `config.py`.
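As an illustration, the edit to `config.py` might look like the following sketch. Only `REPO_PATH` is documented as requiring a change; the derived directory names below are our assumptions, not necessarily what the actual file defines.

```python
# Illustrative excerpt of config.py (the real file also lists models and
# benchmarks). Replace REPO_PATH with the path to your copy of the artifact.
REPO_PATH = "/root/starouter"

# Hypothetical derived locations, shown only to illustrate how the scripts
# could resolve paths relative to REPO_PATH:
DATA_DIR = f"{REPO_PATH}/data"
RESULTS_DIR = f"{REPO_PATH}/results"
```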
Cost-performance curves for individual runs (model pair/benchmark/input configurations) are stored under the `results/{BENCHMARK}/preset` directories.
The `data` directory contains all scripts required to:
- Label the win model: `construct_pairwise_data.py`
- Extract internal states from SLMs: `extract_internal_state.py`
- Embed prompts: `embed.py`

along with the resulting data files under each benchmark directory.
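To give an intuition for the win-labelling step, here is a minimal sketch of pairwise labelling. The actual logic in `construct_pairwise_data.py` may differ; the tie-breaking rule and names below are our assumptions.

```python
def label_win(weak_score, strong_score):
    """Label which model 'wins' a task. Assumption: ties favour the
    cheaper weak model, since routing to it saves cost at no quality loss."""
    return "weak" if weak_score >= strong_score else "strong"

# Hypothetical per-task scores: (weak model, strong model)
task_scores = {"task_1": (1.0, 1.0), "task_2": (0.0, 1.0), "task_3": (1.0, 0.0)}
labels = {t: label_win(w, s) for t, (w, s) in task_scores.items()}
```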
Note that we exclude the benchmark implementations for brevity; please refer to the original implementations:
- LIBRO: https://github.com/coinse/libro
- TestEval: https://github.com/LLM4SoftwareTesting/TestEval
- APPS: https://github.com/hendrycks/apps
- HumanEval: https://github.com/openai/human-eval
The main scripts are provided in the repository:
- `experiment.py`: Train and test routers based on internal states and prompt embeddings. Since the hyperparameter tuning process takes long, we ran it on a subset of tasks as a preliminary exploration and set the preset values. We recommend adding the `--skip_hyperparameter_tuning` option to speed up.
- `visualize.py`: Draw the cost-performance (proportion of strong-model calls) curve for individual runs. As hyperparameter tuning runs do not store the resulting probabilities, by default we plot results for experiments on the preset.
- `generalize.py`: Test cross-benchmark generalization by training a router on one benchmark and testing on all others.
Configuration and helper functions are included in:
- `config.py`: the main configuration file, containing the list of models and benchmarks. You must set the value of `REPO_PATH` for your machine; it is currently set to `/root/starouter`.
- `data_utils.py` & `metric.py`: Load model performance metrics on each benchmark and compute RO/CPT based on the trapezoidal rule.
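For readers unfamiliar with the metric computation, here is a minimal sketch of the trapezoidal rule applied to a cost-performance curve. The function name and the sample points are illustrative; `metric.py` may compute RO/CPT differently in detail.

```python
def trapezoid_area(xs, ys):
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    return sum((ys[i] + ys[i + 1]) / 2.0 * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))

# Hypothetical curve: x = proportion of strong-model calls, y = performance.
strong_call_ratio = [0.0, 0.5, 1.0]
performance = [0.4, 0.8, 1.0]
cpt = trapezoid_area(strong_call_ratio, performance)  # area under the curve
```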
To reproduce the experimental results, we recommend Python >= 3.9; we specifically used Python 3.9.19.
All required modules are listed in `requirements.txt` along with their versions. You may install them via `pip install -r requirements.txt`.
- Label win model
  - Vanilla results of all models are included under `data/{BENCHMARK}/combined_results`.
  - Labelled data is stored under the `data/{BENCHMARK}/route_data/pairwise` directory.
- Extract internal state
- Downloading open-source language models from HuggingFace requires your HF token along with proper access to each language model.
- Note that downloading all five SLMs may take up ~200GB of disk storage.
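To illustrate what an "internal state" feature might look like, here is a minimal, self-contained sketch of pooling per-token hidden states into a single vector. The pooling choice (last token of the final layer) and the toy data are our assumptions; `extract_internal_state.py` may extract states differently.

```python
def last_token_state(hidden_states):
    """Pick the final layer's last-token hidden state as the feature vector.
    hidden_states is a nested list indexed as [layer][token][dim]."""
    return hidden_states[-1][-1]

# Toy example: 2 layers, 2 tokens, hidden dimension 2.
layers = [
    [[0.1, 0.2], [0.3, 0.4]],  # layer 0
    [[0.5, 0.6], [0.7, 0.8]],  # final layer
]
feature = last_token_state(layers)
```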
- Embed prompts
  - Requires Ollama, with the `nomic-embed-text` model pulled via the `ollama pull nomic-embed-text` command. We assume the Ollama endpoint is set to its default value (`http://localhost:11434/api/embeddings`); you may modify this value within `data/embed.py`.
  - Requires an OpenAI API key, which we load from the default environment variable `OPENAI_API_KEY`.
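For reference, a request to the Ollama embeddings endpoint takes a JSON body with `model` and `prompt` fields. The sketch below only builds that payload (no live server is contacted); the function name is illustrative and may not match `data/embed.py`.

```python
import json

# Default Ollama embeddings endpoint, as assumed by the artifact.
OLLAMA_ENDPOINT = "http://localhost:11434/api/embeddings"

def build_payload(prompt, model="nomic-embed-text"):
    """Serialize the JSON body expected by Ollama's /api/embeddings route."""
    return json.dumps({"model": model, "prompt": prompt})
```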
For all three steps, we provide bash scripts (`data/*.sh`) that iterate over models/benchmarks for the corresponding Python scripts.
Currently, the scripts for internal state extraction and embedding check for the existence of the corresponding `*.pt` files before executing (these files are already bundled in this artifact) to avoid redundant processing cost.
We provide a single script `run.sh` that executes all routing experiments along with the visualization of individual runs:

    python experiment.py --skip_hyperparameter_tuning -p entire
    python experiment.py --skip_hyperparameter_tuning --code_generation -p entire
    python visualize.py -p entire
    python visualize.py --code_generation -p entire
    python generalize.py -p entire

All of the above scripts also check whether the resulting files already exist to avoid redundant executions.
- Experiment routing (`experiment.py`)
  - As noted above, we recommend adding the `--skip_hyperparameter_tuning` option when running the experiments yourself.
  - The `--code_generation` flag tests routing on the code generation benchmarks; otherwise, the default behavior is to experiment on the testing benchmarks.
- Visualize individual runs (`visualize.py`)
- Conduct cross-benchmark generalization experiments (`generalize.py`)
  - By default, we iterate over all benchmarks, including both testing and code generation tasks.
  - For RQ2-1, where we analyze the transferability of router predictions, no extra experiments are needed, only the resulting predictions; all such processing is done within the Python notebook.
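The cross-benchmark iteration can be pictured as the sketch below: train on one benchmark and evaluate on every other one. This is only an illustration of the loop structure, not the actual code in `generalize.py`.

```python
# The four benchmarks named in this artifact.
benchmarks = ["LIBRO", "TestEval", "APPS", "HumanEval"]

# Every ordered (train, test) pair with train != test: a router trained on
# one benchmark is evaluated on all of the others.
pairs = [(train, test)
         for train in benchmarks
         for test in benchmarks
         if test != train]
```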