Open-source implementation of the framework from the paper:
MATEval: A Multi-agent Discussion Framework for Advancing Open-Ended Text Evaluation
MATEval is the first multi-agent framework that simulates human-like collaborative discussion for evaluating open-ended text generated by LLMs. Our framework:
✅ Detects 5 types of text errors with human-level accuracy
✅ Generates explainable evaluation reports
✅ Achieves 25% higher correlation with human judgments than existing methods
✅ Successfully deployed in Alipay's business scenarios
| Feature | Description |
|---|---|
| 🤖 Multi-Agent Collaboration | 3 specialized agents (Evaluator/Feedback/Summarizer) simulate human discussion dynamics |
| 🧠 Hybrid Reasoning | Combines Chain-of-Thought (CoT) and Self-Reflection strategies for deeper analysis |
| 📊 Dual-Format Reports | Generates both Q&A summaries for researchers and detailed business reports |
| ⚙️ Auto-Consensus Mechanism | Intelligent feedback loop resolves disagreements between agents |
| 🔍 Error Localization | Pinpoints exact error locations with contextual explanations |
| Category | Method | REP (ρ/τ) | LINC (ρ/τ) | DCONT (ρ/τ) | ILC (ρ/τ) | FER (ρ/τ) |
|---|---|---|---|---|---|---|
| Baselines | BLEU | 0.318/0.260 | 0.193/0.153 | 0.156/0.128 | 0.037/0.031 | -0.010/-0.008 |
| ROUGE-L | -0.017/-0.014 | 0.129/0.102 | 0.202/0.165 | 0.056/0.045 | 0.104/0.084 | |
| RUBERr | 0.036/0.035 | 0.054/0.049 | 0.315/0.297 | -0.018/-0.017 | -0.176/-0.166 | |
| RUBERu | -0.111/-0.091 | 0.038/0.031 | 0.131/0.107 | 0.134/0.110 | 0.180/0.146 | |
| UNION | -0.093/-0.076 | 0.091/0.071 | -0.018/-0.015 | 0.057/0.046 | 0.072/0.059 | |
| Agents | SA | 0.699/0.694▲ | 0.268/0.253 | 0.318/0.312 | 0.240/0.236 | 0.545/0.538 |
| ObO | 0.698/0.692 | 0.170/0.160 | 0.356/0.349 | 0.259/0.248 | 0.484/0.473 | |
| SR | 0.691/0.680 | 0.169/0.154 | 0.354/0.339 | 0.144/0.138 | 0.498/0.478 | |
| CoT | 0.743/0.737 | 0.189/0.180 | 0.288/0.282 | 0.213/0.205 | 0.502/0.491 | |
| Proposed | SR+CoT | 0.735/0.728↑ | 0.281/0.264↑ | 0.391/0.382↑ | 0.263/0.256↑ | 0.575/0.561↑ |
| Δ vs SA | +5.2%/+6.2% | +4.9%/+4.3% | +22.8%/+20.9% | +9.6%/+9.7% | +5.5%/+4.3% |
| Category | Method | REP (ρ/τ) | LINC (ρ/τ) | DCONT (ρ/τ) | ILC (ρ/τ) | FER (ρ/τ) |
|---|---|---|---|---|---|---|
| Baselines | BLEU | 0.087/0.071 | 0.096/0.073 | 0.039/0.033 | -0.114/-0.091 | 0.009/0.007 |
| ROUGE-L | 0.092/0.074 | 0.127/0.096 | 0.083/0.068 | -0.046/-0.037 | 0.049/0.040 | |
| RUBERr | 0.038/0.036 | -0.020/-0.018 | -0.081/-0.076 | 0.035/0.033 | 0.076/0.071 | |
| RUBERu | -0.102/-0.084 | 0.054/0.041 | -0.006/-0.005 | -0.006/-0.007 | 0.111/0.089 | |
| UNION | 0.048/0.039 | 0.010/0.008 | -0.110/-0.090 | -0.038/-0.031 | 0.052/0.042 | |
| Agents | SA | 0.258/0.246▲ | 0.107/0.095 | 0.111/0.105 | 0.192/0.180 | 0.176/0.171 |
| ObO | 0.386/0.380 | 0.183/0.166 | 0.081/0.075 | 0.089/0.082 | 0.299/0.286 | |
| SR | 0.491/0.483 | 0.120/0.107 | 0.224/0.209 | 0.057/0.051 | 0.214/0.208 | |
| CoT | 0.132/0.129 | 0.159/0.139 | 0.203/0.191 | 0.002/0.001 | 0.218/0.211 | |
| Proposed | SR+CoT | 0.430/0.417↑ | 0.215/0.188↑ | 0.265/0.248↑ | 0.290/0.266↑ | 0.299/0.286↑ |
| Δ vs SA | +66.7%/+69.5% | +101%/+97.9% | +138%/+136% | +51%/+47.8% | +69.9%/+67.3% |
Legend:
▲ Baseline comparison | ↑ Best performance | Δ: Percentage improvement over SA | ρ: Spearman's correlation | τ: Kendall's tau
| Dataset | Strategy | REP | LINC | DCONT | ILC | FER |
|---|---|---|---|---|---|---|
| ROCStories | SA | 0.699 | 0.268 | 0.318 | 0.240 | 0.545 |
| SR+CoT | 0.735↑ | 0.281↑ | 0.391↑ | 0.263↑ | 0.575↑ | |
| Δ | +5.2% | +4.9% | +22.8% | +9.6% | +5.5% | |
| WritingPrompts | SA | 0.258 | 0.107 | 0.111 | 0.192 | 0.176 |
| SR+CoT | 0.430↑ | 0.215↑ | 0.265↑ | 0.290↑ | 0.299↑ | |
| Δ | +66.7% | +101% | +138% | +51% | +69.9% |
✅ SR+CoT Dominance
- Outperforms single-agent baseline on 5/5 dimensions for ROCStories
- Achieves 0.290 ρ on ILC (Illogical Content) detection for WritingPrompts, representing 354% improvement over BLEU (-0.114)
✅ Domain Adaptation
- 0.391 ρ on DCONT (Discontinuity) analysis for stories (ROC), +23% over single-agent
- 0.215 ρ on LINC (Logical Inconsistency) detection for creative writing (WP), doubling baseline performance
✅ Multi-Dimensional Superiority
| Metric | Peak Performance | VS Traditional Methods |
|---|---|---|
| REP | 0.735 ρ | +131% (vs BLEU) |
| FER | 0.575 ρ | +5850% (vs BLEU) |
| ILC | 0.290 ρ | +884% (vs ROUGE) |
| Metric | ROCStories | WritingPrompts | Variance |
|---|---|---|---|
| LINC | 0.281 | 0.215 | <15% |
| DCONT | 0.391 | 0.265 | <25% |
| Avg. Score | 0.449 | 0.300 | 33% |
| Dataset | Language | Domain | Access |
|---|---|---|---|
| ROCStories | English | Daily Stories | Public |
| WritingPrompts | English | Creative Writing | Public |
| LOT | Chinese | Long-form Stories | Public |
| Ant (Alipay) | Chinese | Business Cases | Private |
If you find our paper and resource useful in your research, please consider giving a star ⭐ and citation 📝.
@inproceedings{li2024mateval,
title={MATEval: A Multi-Agent Discussion Framework for Advancing Open-Ended Text Evaluation},
author={Li, Yu and Zhang, Shenyu and Wu, Rui and Huang, Xiutian and Chen, Yongrui and Xu, Wenhao and Qi, Guilin and Min, Dehai},
booktitle={International Conference on Database Systems for Advanced Applications},
pages={415--426},
year={2024},
organization={Springer}
}For technical inquiries:
Yu Li - Southeast University