Skip to content

ElevenLiy/MATEval

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

34 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MATEval: Multi-Agent Text Evaluation Framework

License: MIT arXiv GitHub Stars

Open-source implementation of the framework from the paper:
MATEval: A Multi-agent Discussion Framework for Advancing Open-Ended Text Evaluation

🔍 Overview

MATEval is the first multi-agent framework that simulates human-like collaborative discussion for evaluating open-ended text generated by LLMs. Our framework:

✅ Detects 5 types of text errors with human-level accuracy
✅ Generates explainable evaluation reports
✅ Achieves 25% higher correlation with human judgments than existing methods
✅ Successfully deployed in Alipay's business scenarios

image image

🚀 Key Features

Feature Description
🤖 Multi-Agent Collaboration 3 specialized agents (Evaluator/Feedback/Summarizer) simulate human discussion dynamics
🧠 Hybrid Reasoning Combines Chain-of-Thought (CoT) and Self-Reflection strategies for deeper analysis
📊 Dual-Format Reports Generates both Q&A summaries for researchers and detailed business reports
⚙️ Auto-Consensus Mechanism Intelligent feedback loop resolves disagreements between agents
🔍 Error Localization Pinpoints exact error locations with contextual explanations

📈 Complete Experimental Results

ROCStories Dataset

Category Method REP (ρ/τ) LINC (ρ/τ) DCONT (ρ/τ) ILC (ρ/τ) FER (ρ/τ)
Baselines BLEU 0.318/0.260 0.193/0.153 0.156/0.128 0.037/0.031 -0.010/-0.008
ROUGE-L -0.017/-0.014 0.129/0.102 0.202/0.165 0.056/0.045 0.104/0.084
RUBERr 0.036/0.035 0.054/0.049 0.315/0.297 -0.018/-0.017 -0.176/-0.166
RUBERu -0.111/-0.091 0.038/0.031 0.131/0.107 0.134/0.110 0.180/0.146
UNION -0.093/-0.076 0.091/0.071 -0.018/-0.015 0.057/0.046 0.072/0.059
Agents SA 0.699/0.694▲ 0.268/0.253 0.318/0.312 0.240/0.236 0.545/0.538
ObO 0.698/0.692 0.170/0.160 0.356/0.349 0.259/0.248 0.484/0.473
SR 0.691/0.680 0.169/0.154 0.354/0.339 0.144/0.138 0.498/0.478
CoT 0.743/0.737 0.189/0.180 0.288/0.282 0.213/0.205 0.502/0.491
Proposed SR+CoT 0.735/0.728 0.281/0.264 0.391/0.382 0.263/0.256 0.575/0.561
Δ vs SA +5.2%/+6.2% +4.9%/+4.3% +22.8%/+20.9% +9.6%/+9.7% +5.5%/+4.3%

WritingPrompts Dataset

Category Method REP (ρ/τ) LINC (ρ/τ) DCONT (ρ/τ) ILC (ρ/τ) FER (ρ/τ)
Baselines BLEU 0.087/0.071 0.096/0.073 0.039/0.033 -0.114/-0.091 0.009/0.007
ROUGE-L 0.092/0.074 0.127/0.096 0.083/0.068 -0.046/-0.037 0.049/0.040
RUBERr 0.038/0.036 -0.020/-0.018 -0.081/-0.076 0.035/0.033 0.076/0.071
RUBERu -0.102/-0.084 0.054/0.041 -0.006/-0.005 -0.006/-0.007 0.111/0.089
UNION 0.048/0.039 0.010/0.008 -0.110/-0.090 -0.038/-0.031 0.052/0.042
Agents SA 0.258/0.246▲ 0.107/0.095 0.111/0.105 0.192/0.180 0.176/0.171
ObO 0.386/0.380 0.183/0.166 0.081/0.075 0.089/0.082 0.299/0.286
SR 0.491/0.483 0.120/0.107 0.224/0.209 0.057/0.051 0.214/0.208
CoT 0.132/0.129 0.159/0.139 0.203/0.191 0.002/0.001 0.218/0.211
Proposed SR+CoT 0.430/0.417 0.215/0.188 0.265/0.248 0.290/0.266 0.299/0.286
Δ vs SA +66.7%/+69.5% +101%/+97.9% +138%/+136% +51%/+47.8% +69.9%/+67.3%

Legend:

▲ Baseline comparison | ↑ Best performance | Δ: Percentage improvement over SA | ρ: Spearman's correlation | τ: Kendall's tau

📊 Performance Highlights

Human Correlation Analysis (Spearman's ρ)

Dataset Strategy REP LINC DCONT ILC FER
ROCStories SA 0.699 0.268 0.318 0.240 0.545
SR+CoT 0.735↑ 0.281↑ 0.391↑ 0.263↑ 0.575↑
Δ +5.2% +4.9% +22.8% +9.6% +5.5%
WritingPrompts SA 0.258 0.107 0.111 0.192 0.176
SR+CoT 0.430↑ 0.215↑ 0.265↑ 0.290↑ 0.299↑
Δ +66.7% +101% +138% +51% +69.9%

Key Breakthroughs

SR+CoT Dominance

  • Outperforms single-agent baseline on 5/5 dimensions for ROCStories
  • Achieves 0.290 ρ on ILC (Illogical Content) detection for WritingPrompts, representing 354% improvement over BLEU (-0.114)

Domain Adaptation

  • 0.391 ρ on DCONT (Discontinuity) analysis for stories (ROC), +23% over single-agent
  • 0.215 ρ on LINC (Logical Inconsistency) detection for creative writing (WP), doubling baseline performance

Multi-Dimensional Superiority

Metric Peak Performance VS Traditional Methods
REP 0.735 ρ +131% (vs BLEU)
FER 0.575 ρ +5850% (vs BLEU)
ILC 0.290 ρ +884% (vs ROUGE)

Cross-Dataset Stability

Metric ROCStories WritingPrompts Variance
LINC 0.281 0.215 <15%
DCONT 0.391 0.265 <25%
Avg. Score 0.449 0.300 33%

📁 Dataset Support

Dataset Language Domain Access
ROCStories English Daily Stories Public
WritingPrompts English Creative Writing Public
LOT Chinese Long-form Stories Public
Ant (Alipay) Chinese Business Cases Private

📜 Citation

If you find our paper and resource useful in your research, please consider giving a star ⭐ and citation 📝.

@inproceedings{li2024mateval,
  title={MATEval: A Multi-Agent Discussion Framework for Advancing Open-Ended Text Evaluation},
  author={Li, Yu and Zhang, Shenyu and Wu, Rui and Huang, Xiutian and Chen, Yongrui and Xu, Wenhao and Qi, Guilin and Min, Dehai},
  booktitle={International Conference on Database Systems for Advanced Applications},
  pages={415--426},
  year={2024},
  organization={Springer}
}

📧 Contact

For technical inquiries:
Yu Li - Southeast University


Back to Top

About

MATEval is the first multi-agent framework simulating human collaborative discussion for open-ended text evaluation.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages