Developed on top of the tau-bench project, with special thanks to Sierra Inc. for their open-source contribution.
tau-bench is released under the MIT license.
🛡️ Training on real user data is strictly prohibited; fully compliant with China's Personal Information Protection Law (PIPL)
Ecom-Bench is a comprehensive evaluation benchmark for e-commerce customer service dialogue systems. It simulates real-world scenarios to provide a standardized validation pipeline before such systems are deployed in production.
🎯 Realistic Scenario Recreation
- Supports reproduction of customer service scenarios from mainstream platforms like JD.com
- User profile-driven based on real user behavior characteristics
- Complex multi-turn dialogue interaction verification
🤖 Flexible Architecture Design
- User strategies: Rule-based, Chain-of-Thought (CoT), Human-in-the-loop
- Agent strategies: LLM-driven/Dual-mode (human + AI)
- Model-agnostic: Compatible with mainstream large models like Qwen, DeepSeek-V3
📊 Multi-dimensional Evaluation System
| Dimension | Core Metrics |
|---|---|
| Action Accuracy | Operation command execution accuracy |
| Search Efficiency | Information retrieval precision |
| Output Quality | Response content relevance |
| Response Timeliness | System real-time responsiveness |
🛠️ Engineering Support
- Native support for MCP (Model Context Protocol) tool calling protocol
- Asynchronous high-concurrency dialogue processing engine (sketched below)
- Smart result caching for accelerated evaluation
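To make the concurrency model concrete, here is a minimal sketch of bounding parallel dialogue trials with an `asyncio` semaphore; `run_trial` and its payload are hypothetical stand-ins for the framework's own entry points, not its actual API:

```python
import asyncio

async def run_trial(task_id: int) -> dict:
    """Hypothetical stand-in for one user-agent dialogue trial."""
    await asyncio.sleep(0.1)  # placeholder for the real dialogue loop
    return {"task_id": task_id, "reward": 1.0}

async def run_all(task_ids: list[int], max_concurrency: int = 5) -> list[dict]:
    sem = asyncio.Semaphore(max_concurrency)  # mirrors --max-concurrency

    async def bounded(tid: int) -> dict:
        async with sem:  # at most max_concurrency dialogues in flight
            return await run_trial(tid)

    return await asyncio.gather(*(bounded(t) for t in task_ids))

results = asyncio.run(run_all(list(range(10))))
```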
Install dependencies:

```bash
pip install -r requirements.txt
```

Key dependencies:
- langchain-community>=0.3.23
- langchain-openai>=0.3.7
- langgraph>=0.3.2
- openai>=1.71.0
- rich (for beautiful command-line output)
- pydantic (for data validation)
```bash
# Run a single trial
python run.py --env story --user-model qwen --agent-model qwen

# Run multiple trials
python run.py --num-trials 5 --env story

# Specify a task range
python run.py --start-index 0 --end-index 10

# Run specific tasks
python run.py --task-ids 1 2 3

# Custom user and agent strategies
python run.py --user-strategy cot --agent-strategy llm

# Set concurrency and the log directory
python run.py --max-concurrency 5 --log-dir ./custom_results

# Enable verbose output
python run.py --verbose
```

Project structure:

```
Ecom-Bench/
├── agent/ # Agent implementation
│ ├── agents_list/ # Different agent types
│ │ ├── agent_human.py # Human agent
│ │ ├── agent_langchain.py # LangChain agent
│ │ └── agent_sdk.py # SDK agent
│ └── servers/ # Server components
├── envs/ # Environment implementation
│ ├── base.py # Base environment class
│ └── story/ # Story scenario environment
│ ├── env.py # Environment logic
│ ├── tasks.py # Task definitions
│ ├── wiki.md # User guide
│ └── wiki.py # Wiki processing
├── user/ # User simulator
│ ├── memory.py # Memory management
│ └── user.py # User implementation
├── main.py # Main execution logic
├── run.py # Command-line interface
├── utils.py # Utility functions
└── requirements.txt # Dependency list
```
The environment module (`envs/`) implements a complete e-commerce customer service dialogue environment, including:
- Task loading and management
- User-agent interaction loop (sketched after this list)
- Tool call tracking
- Performance metric calculation
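As a rough, hypothetical sketch of the interaction loop (class and method names here are illustrative, not the framework's actual interfaces):

```python
def run_dialogue(env, user, agent, max_turns: int = 30) -> dict:
    """Illustrative user-agent loop with tool-call tracking."""
    tool_calls = []
    message = user.first_message()
    for _ in range(max_turns):
        reply, calls = agent.respond(message)  # agent may invoke tools
        tool_calls.extend(calls)               # track every tool call
        message = user.respond(reply)
        if user.is_done():                     # user ends the conversation
            break
    return {"tool_calls": tool_calls, "reward": env.score(tool_calls)}
```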
The user simulator (`user/`) provides multiple user simulation strategies (strategy selection is sketched after this list):
- `UserBased`: rule-driven user behavior
- `UserCoT`: Chain-of-Thought reasoning user
- `UserHuman`: human interaction interface
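For orientation, a minimal sketch of how the `--user-strategy` flag might map onto these classes; the import path, registry dict, and `select_user` helper are assumptions, not the framework's actual code:

```python
from user.user import UserBased, UserCoT, UserHuman  # assumed location (user/user.py)

# Hypothetical mapping from --user-strategy values to simulator classes.
USER_STRATEGIES = {
    "based": UserBased,  # rule-driven
    "cot": UserCoT,      # Chain-of-Thought
    "human": UserHuman,  # human-in-the-loop
}

def select_user(strategy: str, **kwargs):
    try:
        return USER_STRATEGIES[strategy](**kwargs)
    except KeyError:
        raise ValueError(f"Unknown user strategy: {strategy}")
```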
The agent module (`agent/`) supports multiple customer service agent implementations:
- Integration with multiple LLM providers (Qwen, DeepSeek, etc.)
- Support for MCP tool calling (see the sketch after this list)
- Configurable model parameters
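As an illustration of LLM-driven tool calling through an OpenAI-compatible client (the `search_orders` tool schema and the model name are placeholders, not the framework's actual definitions):

```python
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint works via base_url/api_key

# Placeholder tool schema; the real tools are defined by the environment.
tools = [{
    "type": "function",
    "function": {
        "name": "search_orders",
        "description": "Search a user's orders by keyword.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen-plus",  # placeholder model name
    messages=[{"role": "user", "content": "Where is my headphone order?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```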
The framework provides four main evaluation dimensions (a scoring sketch follows the list):
- Action Accuracy (`reward_actions`): evaluates whether agent-executed operations meet user requirements
- Search Quality (`reward_searches`): evaluates accuracy and relevance of information retrieval
- Output Quality (`reward_outputs`): evaluates quality and usefulness of responses
- Time Efficiency (`reward_time`): evaluates system response time and overall efficiency
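A hypothetical illustration of how the per-dimension scores could be combined into one trial score; the equal weighting here is an assumption for illustration, not the framework's actual formula:

```python
def combine_rewards(reward_actions: float, reward_searches: float,
                    reward_outputs: float, reward_time: float) -> float:
    # Assumed equal weighting, purely for illustration.
    parts = [reward_actions, reward_searches, reward_outputs, reward_time]
    return sum(parts) / len(parts)

print(combine_rewards(1.0, 0.8, 0.9, 0.7))  # -> 0.85
```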
The task set (`envs/story/tasks.py`) contains rich test tasks, each including the following (illustrated by the example after this list):
📌 User profile (consumption habits/personality traits)
🎯 Interaction intent and goals
🏬 Platform/store context information
✅ Expected action acceptance criteria
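Purely as an illustration of this structure (the field names are hypothetical, not the framework's schema), a task might look like:

```python
example_task = {
    "user_profile": {  # 📌 consumption habits / personality traits
        "persona": "impatient frequent buyer",
        "habits": ["shops late at night", "prefers refunds over exchanges"],
    },
    "intent": "request a refund for a damaged item",  # 🎯 interaction goal
    "context": {"platform": "JD.com", "store": "electronics flagship"},  # 🏬
    "expected_actions": [  # ✅ acceptance criteria
        {"name": "create_refund", "arguments": {"order_id": "..."}},
    ],
}
```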
Supported model backends (client configuration is sketched after this list):
- Qwen series: calls the AliCloud DashScope API
- DeepSeek-V3: calls the Volcano Engine API
- OpenAI series: calls the standard OpenAI API
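Both DashScope and Volcano Engine expose OpenAI-compatible endpoints, so routing by base URL can look like the sketch below; the exact base URLs and environment-variable names are assumptions to verify against each provider's documentation:

```python
import os
from openai import OpenAI

# Assumed OpenAI-compatible endpoints; confirm against provider docs.
BACKENDS = {
    "qwen": ("https://dashscope.aliyuncs.com/compatible-mode/v1", "DASHSCOPE_API_KEY"),
    "deepseek-v3": ("https://ark.cn-beijing.volces.com/api/v3", "ARK_API_KEY"),
    "openai": (None, "OPENAI_API_KEY"),  # None -> default OpenAI endpoint
}

def make_client(backend: str) -> OpenAI:
    base_url, key_env = BACKENDS[backend]
    return OpenAI(base_url=base_url, api_key=os.environ[key_env])
```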
The `RunConfig` class defines all configurable parameters (a dataclass sketch follows the list):
- Model selection and strategy configuration
- Task range and concurrency settings
- Logging and output configuration
- Performance tuning parameters
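Judging from the CLI flags above, `RunConfig` plausibly resembles the dataclass below; the field names and defaults are inferred from the command-line examples, not copied from the source:

```python
from dataclasses import dataclass

@dataclass
class RunConfig:
    # Model selection and strategy configuration
    env: str = "story"
    user_model: str = "qwen"
    agent_model: str = "qwen"
    user_strategy: str = "cot"
    agent_strategy: str = "llm"
    # Task range and concurrency settings
    num_trials: int = 1
    start_index: int = 0
    end_index: int = -1
    task_ids: list[int] | None = None
    max_concurrency: int = 5
    # Logging and output configuration
    log_dir: str = "./results"
    verbose: bool = False
```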
After execution, results are saved in JSON format, containing:
- Detailed dialogue trajectories
- Dimension-specific scores
- Tool call records
- Performance statistics
These result files can be used for further analysis and visualization, for example:
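A minimal analysis pass over the saved JSON; the file path and layout assumed here (a list of trial records with per-dimension scores) are illustrative:

```python
import json
from statistics import mean

with open("custom_results/results.json") as f:  # path is illustrative
    trials = json.load(f)

# Average each evaluation dimension across trials.
for dim in ("reward_actions", "reward_searches", "reward_outputs", "reward_time"):
    print(dim, mean(t[dim] for t in trials))
```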
To add a new environment:
- Create a new environment module under `envs/`
- Inherit the `Env` base class and implement the required methods (see the skeleton after this list)
- Register the new environment in `envs/__init__.py`
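A hypothetical skeleton for these steps; the abstract method names are guesses at what `Env` requires, not the actual interface:

```python
from envs.base import Env  # base environment class

class MyShopEnv(Env):
    """Skeleton for a new scenario environment (illustrative)."""

    def load_tasks(self):
        # Return this scenario's task definitions.
        return []

    def step(self, action):
        # Apply one agent action and return the resulting observation.
        raise NotImplementedError

    def score(self, trajectory) -> dict:
        # Compute per-dimension rewards for a finished dialogue.
        return {"reward_actions": 0.0}
```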
To add a new agent:
- Implement a new agent class under `agent/agents_list/`
- Follow the existing interface specifications (sketched after this list)
- Update the agent selection logic
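A minimal illustrative skeleton; the `respond` method name is an assumption about the shared agent interface:

```python
class MyAgent:
    """Illustrative customer-service agent following the assumed interface."""

    def __init__(self, model: str, **params):
        self.model = model
        self.params = params  # configurable model parameters

    def respond(self, message: str) -> str:
        # Call the underlying LLM here; an echo keeps this sketch runnable.
        return f"[{self.model}] received: {message}"
```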
To add a new user strategy:
- Implement a new user class under `user/`
- Implement the call method and system prompt loading (see the skeleton after this list)
- Register the new strategy in the environment
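An illustrative skeleton of such a user class, under the assumption that strategies expose a call-style method and load a system prompt; the class name and prompt path are invented for this sketch:

```python
class UserPolite:
    """Illustrative user strategy with a call method and system prompt loading."""

    def __init__(self, prompt_path: str = "user/prompts/polite.md"):
        # Load the system prompt that conditions the simulated user
        # (the path is illustrative, not an actual project file).
        with open(prompt_path) as f:
            self.system_prompt = f.read()

    def __call__(self, agent_message: str) -> str:
        # Produce the simulated user's next utterance; a canned reply
        # keeps the sketch self-contained.
        return "Thanks! Could you check my order status, please?"
```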
This project is released under the Apache 2.0 license. See LICENSE.
If you use Ecom-Bench in your research, please cite:
```bibtex
@misc{wang2025ecombenchllmagentresolve,
  title={ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?},
  author={Haoxin Wang and Xianhan Peng and Xucheng Huang and Yizhe Huang and Ming Gong and Chenghan Yang and Yang Liu and Ling Jiang},
  year={2025},
  eprint={2507.05639},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.05639},
}
```

For questions or suggestions:
- Submit a GitHub Issue
- Email: huangyizhe@xiaoduotech.com

