MATEval: Multi-Agent Text Evaluation Framework

Open-source implementation of the framework from the paper:
MATEval: A Multi-agent Discussion Framework for Advancing Open-Ended Text Evaluation

🔍 Overview

MATEval is the first multi-agent framework that simulates human-like collaborative discussion for evaluating open-ended text generated by LLMs. Our framework:

✅ Detects 5 types of text errors with human-level accuracy
✅ Generates explainable evaluation reports
✅ Achieves 25% higher correlation with human judgments than existing methods
✅ Successfully deployed in Alipay's business scenarios

🚀 Key Features

Feature	Description
🤖 Multi-Agent Collaboration	3 specialized agents (Evaluator/Feedback/Summarizer) simulate human discussion dynamics
🧠 Hybrid Reasoning	Combines Chain-of-Thought (CoT) and Self-Reflection strategies for deeper analysis
📊 Dual-Format Reports	Generates both Q&A summaries for researchers and detailed business reports
⚙️ Auto-Consensus Mechanism	Intelligent feedback loop resolves disagreements between agents
🔍 Error Localization	Pinpoints exact error locations with contextual explanations

📈 Complete Experimental Results

ROCStories Dataset

Category	Method	REP (ρ/τ)	LINC (ρ/τ)	DCONT (ρ/τ)	ILC (ρ/τ)	FER (ρ/τ)
Baselines	BLEU	0.318/0.260	0.193/0.153	0.156/0.128	0.037/0.031	-0.010/-0.008
	ROUGE-L	-0.017/-0.014	0.129/0.102	0.202/0.165	0.056/0.045	0.104/0.084
	RUBERr	0.036/0.035	0.054/0.049	0.315/0.297	-0.018/-0.017	-0.176/-0.166
	RUBERu	-0.111/-0.091	0.038/0.031	0.131/0.107	0.134/0.110	0.180/0.146
	UNION	-0.093/-0.076	0.091/0.071	-0.018/-0.015	0.057/0.046	0.072/0.059
Agents	SA	0.699/0.694▲	0.268/0.253	0.318/0.312	0.240/0.236	0.545/0.538
	ObO	0.698/0.692	0.170/0.160	0.356/0.349	0.259/0.248	0.484/0.473
	SR	0.691/0.680	0.169/0.154	0.354/0.339	0.144/0.138	0.498/0.478
	CoT	0.743/0.737	0.189/0.180	0.288/0.282	0.213/0.205	0.502/0.491
Proposed	SR+CoT	0.735/0.728↑	0.281/0.264↑	0.391/0.382↑	0.263/0.256↑	0.575/0.561↑
	Δ vs SA	+5.2%/+6.2%	+4.9%/+4.3%	+22.8%/+20.9%	+9.6%/+9.7%	+5.5%/+4.3%

WritingPrompts Dataset

Category	Method	REP (ρ/τ)	LINC (ρ/τ)	DCONT (ρ/τ)	ILC (ρ/τ)	FER (ρ/τ)
Baselines	BLEU	0.087/0.071	0.096/0.073	0.039/0.033	-0.114/-0.091	0.009/0.007
	ROUGE-L	0.092/0.074	0.127/0.096	0.083/0.068	-0.046/-0.037	0.049/0.040
	RUBERr	0.038/0.036	-0.020/-0.018	-0.081/-0.076	0.035/0.033	0.076/0.071
	RUBERu	-0.102/-0.084	0.054/0.041	-0.006/-0.005	-0.006/-0.007	0.111/0.089
	UNION	0.048/0.039	0.010/0.008	-0.110/-0.090	-0.038/-0.031	0.052/0.042
Agents	SA	0.258/0.246▲	0.107/0.095	0.111/0.105	0.192/0.180	0.176/0.171
	ObO	0.386/0.380	0.183/0.166	0.081/0.075	0.089/0.082	0.299/0.286
	SR	0.491/0.483	0.120/0.107	0.224/0.209	0.057/0.051	0.214/0.208
	CoT	0.132/0.129	0.159/0.139	0.203/0.191	0.002/0.001	0.218/0.211
Proposed	SR+CoT	0.430/0.417↑	0.215/0.188↑	0.265/0.248↑	0.290/0.266↑	0.299/0.286↑
	Δ vs SA	+66.7%/+69.5%	+101%/+97.9%	+138%/+136%	+51%/+47.8%	+69.9%/+67.3%

Legend:

▲ Baseline comparison | ↑ Best performance | Δ: Percentage improvement over SA | ρ: Spearman's correlation | τ: Kendall's tau

📊 Performance Highlights

Human Correlation Analysis (Spearman's ρ)

Dataset	Strategy	REP	LINC	DCONT	ILC	FER
ROCStories	SA	0.699	0.268	0.318	0.240	0.545
	SR+CoT	0.735↑	0.281↑	0.391↑	0.263↑	0.575↑
	Δ	+5.2%	+4.9%	+22.8%	+9.6%	+5.5%
WritingPrompts	SA	0.258	0.107	0.111	0.192	0.176
	SR+CoT	0.430↑	0.215↑	0.265↑	0.290↑	0.299↑
	Δ	+66.7%	+101%	+138%	+51%	+69.9%

Key Breakthroughs

✅ SR+CoT Dominance

Outperforms single-agent baseline on 5/5 dimensions for ROCStories
Achieves 0.290 ρ on ILC (Illogical Content) detection for WritingPrompts, representing 354% improvement over BLEU (-0.114)

✅ Domain Adaptation

0.391 ρ on DCONT (Discontinuity) analysis for stories (ROC), +23% over single-agent
0.215 ρ on LINC (Logical Inconsistency) detection for creative writing (WP), doubling baseline performance

✅ Multi-Dimensional Superiority

Metric	Peak Performance	VS Traditional Methods
REP	0.735 ρ	+131% (vs BLEU)
FER	0.575 ρ	+5850% (vs BLEU)
ILC	0.290 ρ	+884% (vs ROUGE)

Cross-Dataset Stability

Metric	ROCStories	WritingPrompts	Variance
LINC	0.281	0.215	<15%
DCONT	0.391	0.265	<25%
Avg. Score	0.449	0.300	33%

📁 Dataset Support

Dataset	Language	Domain	Access
ROCStories	English	Daily Stories	Public
WritingPrompts	English	Creative Writing	Public
LOT	Chinese	Long-form Stories	Public
Ant (Alipay)	Chinese	Business Cases	Private

📜 Citation

If you find our paper and resource useful in your research, please consider giving a star ⭐ and citation 📝.

@inproceedings{li2024mateval,
  title={MATEval: A Multi-Agent Discussion Framework for Advancing Open-Ended Text Evaluation},
  author={Li, Yu and Zhang, Shenyu and Wu, Rui and Huang, Xiutian and Chen, Yongrui and Xu, Wenhao and Qi, Guilin and Min, Dehai},
  booktitle={International Conference on Database Systems for Advanced Applications},
  pages={415--426},
  year={2024},
  organization={Springer}
}

📧 Contact

For technical inquiries:
Yu Li - Southeast University

▲ Back to Top

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
LOT		LOT
ROC		ROC
WP		WP
LICENSE.md		LICENSE.md
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MATEval: Multi-Agent Text Evaluation Framework

🔍 Overview

🚀 Key Features

📈 Complete Experimental Results

ROCStories Dataset

WritingPrompts Dataset

📊 Performance Highlights

Human Correlation Analysis (Spearman's ρ)

Key Breakthroughs

Cross-Dataset Stability

📁 Dataset Support

📜 Citation

📧 Contact

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

MATEval: Multi-Agent Text Evaluation Framework

🔍 Overview

🚀 Key Features

📈 Complete Experimental Results

ROCStories Dataset

WritingPrompts Dataset

📊 Performance Highlights

Human Correlation Analysis (Spearman's ρ)

Key Breakthroughs

Cross-Dataset Stability

📁 Dataset Support

📜 Citation

📧 Contact

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages