CtrlBench-Rec is an evolutionary multi-agent framework with three modules: Initialization, Dynamic Interaction, and Collaborative Fusion. Operating as a closed-loop system, it iterates through initialization, policy alignment, and agent fusion to accelerate group exploration and cultivate elite agents. The central objective is to transform novice agents into a refined set of high-capability Super Probes that serve as a standardized benchmark for system controllability. The framework operates in two sequential phases: (1) a training phase, which refines the Super Probes through interaction and fusion; and (2) an inference and evaluation phase, which deploys the probes for multi-dimensional controllability assessments.
├── data/ # Datasets (ML-1M, preprocessed Amazon Toys & Games)
├── model/ # Recommendation model definitions (e.g., SASRec, Narm, Qwen)
├── encoder/ # Textual Encoder for Embedding Generation (e.g., twhin-bert)
├── generated_user_profile/ # User profiles generated at different stages
├── tool/ # Data loaders and embedding processors
├── runner/ # Scripts for training, inference, and evaluation (e.g., epoch.py, evaluation.py)
└── requirements.txt # Project dependencies
Phase I: Evolutionary Training
- Multi-Agent Initialization: Extract static attributes and dynamic trajectories from raw datasets such as ML-1M, then instantiate agents with a profile expert, an LLM-based decision engine, and tool-calling modules.
- Environment Interaction & Behavior Alignment: Synchronize the Black-Box system's state with the agent's persona by injecting a continuous stream of profile-aligned interaction behaviors.
- Multi-Agent Strategy Fusion: Group agents via K-means clustering to facilitate intra-cluster discussions; a fusion expert then integrates the discussion records and profiles into new Super Probes.
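The fusion step above groups similar agents before the intra-cluster discussion. As a minimal illustration of that grouping (the embedding source, dimensionality, and cluster count here are our own toy assumptions, not the repo's actual pipeline, which presumably clusters profile embeddings from the textual encoder), a bare-bones K-means over agent profile vectors might look like:

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Minimal K-means: assign each agent profile vector to its nearest centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    assignment = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignment = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment

# Toy agent profile embeddings: two visually obvious groups in 2-D.
profiles = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9]]
clusters = kmeans(profiles, k=2)
```

Each cluster's members would then hold a discussion whose records feed the fusion expert.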
Phase II: Inference & Evaluation
- Interaction & Behavior Acquisition: Execute multi-turn interactions with the Black-Box recommender to generate a profile-aligned behavioral stream for the Super Probes.
- Systematic Evaluation
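The Phase II acquisition step can be pictured as a probe repeatedly querying the black-box recommender and feeding profile-consistent choices back as history. A minimal sketch (all names are illustrative; in the real system an LLM-based decision engine, not a keyword match, decides which items fit the persona):

```python
def run_probe(recommend, profile_keywords, rounds=3, top_k=5):
    """Collect a profile-aligned behavior stream from a black-box recommender.

    `recommend(history, top_k)` stands in for the black-box system; the probe
    accepts only items matching its profile and records them as history.
    """
    history = []
    for _ in range(rounds):
        slate = recommend(history, top_k)
        # Decision step: keep items consistent with the probe's persona.
        accepted = [item for item in slate if any(k in item for k in profile_keywords)]
        history.extend(accepted)
    return history

# Toy black-box recommender: serves unseen items from a fixed catalog.
catalog = ["action_1", "romance_1", "action_2", "romance_2", "action_3"]
def toy_recommend(history, top_k):
    unseen = [i for i in catalog if i not in history]
    return unseen[:top_k]

stream = run_probe(toy_recommend, profile_keywords=["action"])
```

The resulting `stream` is the behavioral record the evaluation stage consumes.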
| Tool | Version | Description | Check Installation |
|---|---|---|---|
| Python | 3.10 | Backend runtime | python --version |
1. Environment Configuration
Run the following command in your terminal to install the necessary dependencies:
```bash
pip install -r requirements.txt
```
2. Load BERT Encoder
Execute the script to load the twhin-bert encoder:
```bash
python runner/load_twhin_bert.py
```
3. Configuration (API Key)
To use the DeepSeek LLM features, you need to provide your API key from https://platform.deepseek.com/api_keys. You can pass it as an environment variable at runtime without permanently modifying your system settings.
For Linux / macOS / WSL, prefix your command with the variable:
```bash
DEEPSEEK_API_KEY="your_api_key_here" python ../runner/user_profile_initialize.py
```
For Windows (PowerShell), set the variable for the current session before running the script:
```powershell
$env:DEEPSEEK_API_KEY="your_api_key_here"; python ../runner/user_profile_initialize.py
```
You can also download the twhin-bert model from https://huggingface.co/Twitter/twhin-bert-base; after downloading, place it in the `../rec_models/` directory.
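If you wrap your own scripts around these entry points, reading the key in Python is just an environment lookup. A small helper along these lines (the function name and error message are our own, not part of the repo) fails fast when the key is missing:

```python
import os

def require_api_key(name: str = "DEEPSEEK_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the runner scripts.")
    return key
```

Failing at startup with an explicit message is friendlier than an opaque authentication error from the LLM client mid-run.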
We provide experiments using the SASRec recommendation model on the ML-1M dataset, centered on the Task 1 Target Content Discovery analysis.
Phase I: Evolutionary Training
- Multi-Agent Initialization: Initialize the agent metadata.
```bash
python runner/user_profile_initialize.py
```
- Interaction & Fusion: Update the entry point in `epoch.py` to call `runner.epoch.sasrec_ml1m_merge`, then run the script.
```bash
python runner/epoch.py
```
Phase II: Inference & Evaluation
- Interaction & Behavior Acquisition: Update the entry point in `epoch.py` to call `runner.epoch.sasrec_ml1m_debate_epoch20`, then run the script.
```bash
python runner/epoch.py
```
- Systematic Evaluation: Invoke `runner.evaluation.compare_two_profile`, set the paths to the original and evaluation profiles, and run `evaluation.py` to obtain the results.
```bash
python runner/evaluation.py
```
Following the experimental setup and evaluation metrics detailed in Section 5.2, "Controllability across different recommender system architectures," we conducted a series of experiments on the SASRec model. The results are presented below:
Results on MovieLens-1M:

| Interaction Rounds (t) | Coverage ↑ base (100) | Coverage ↑ base_small (27) | Coverage ↑ CtrlBench-Rec | Exploration Efficiency ↓ base (100) | Exploration Efficiency ↓ base_small (27) | Exploration Efficiency ↓ CtrlBench-Rec |
|---|---|---|---|---|---|---|
| t=5 | 5.6% | 2.05% | 2.33% | 2.89 | 2.06 | 1.31 |
| t=10 | 9.68% | 3.91% | 4.71% | 2.98 | 1.84 | 1.39 |
| t=15 | 12.20% | 5.10% | 7.23% | 3.54 | 2.02 | 1.45 |
| t=20 | 15.46% | 6.56% | 8.95% | 3.78 | 2.16 | 1.58 |
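Coverage is not formally defined in this excerpt. Under one common reading, assumed here for illustration only (the fraction of the item catalog surfaced at least once across all probe interaction streams; Section 5.2's exact metric may differ), it could be computed as:

```python
def coverage(recommended_streams, catalog_size):
    """Percentage of the catalog surfaced at least once across all probes.

    NOTE: this is an assumed definition for illustration; the paper's
    Section 5.2 metric may be defined differently.
    """
    unique_items = set()
    for stream in recommended_streams:
        unique_items.update(stream)
    return 100.0 * len(unique_items) / catalog_size

# Three toy probe streams covering 4 distinct items of a 100-item catalog.
streams = [["m1", "m2"], ["m2", "m3"], ["m4"]]
pct = coverage(streams, catalog_size=100)
```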
