This repository contains the code and data for our EMNLP 2025 paper: Argument Summarization and its Evaluation in the Era of Large Language Models.
Abstract: Large Language Models (LLMs) have revolutionized various Natural Language Generation (NLG) tasks, including Argument Summarization (ArgSum), a key subfield of Argument Mining. This paper investigates the integration of state-of-the-art LLMs into ArgSum systems and their evaluation. In particular, we propose a novel prompt-based evaluation scheme, and validate it through a novel human benchmark dataset. Our work makes three main contributions: (i) the integration of LLMs into existing ArgSum systems, (ii) the development of two new LLM-based ArgSum systems, benchmarked against prior methods, and (iii) the introduction of an advanced LLM-based evaluation scheme. We demonstrate that the use of LLMs substantially improves both the generation and evaluation of argument summaries, achieving state-of-the-art results and advancing the field of ArgSum. We also show that among the four LLMs integrated in (i) and (ii), Qwen-3-32B, despite having the fewest parameters, performs best, even surpassing GPT-4o.
- Replace the models folder with the contents of the following Google Drive folder: https://drive.google.com/drive/folders/1GUzNhU6DK3KRUV-f4cX2xEb8ifTJKhm6
- Insert your username and password for the Summetix API service into argsum/___summetix_login.json
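The login file holds plain JSON credentials; the exact field names below are an assumption for illustration, so adjust them to match what the code in the argsum folder actually reads:

```json
{
  "username": "your-summetix-username",
  "password": "your-summetix-password"
}
```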
- data folder: Datasets
- models folder: Language models (LMs), divided into Match Scorers, Quality Scorers, Metrics, and ArgSum Generators
- argsum folder: Python code for the functions and classes used in the investigations (plus the BLEURT code and a JSON file holding the Summetix API login information)
- investigations folder: Data resulting from the investigations
- Jupyter notebooks: Conducted investigations and results
- data_processing: Preparation of the raw data for the investigations
- explorative_data_analysis: Exploratory data analysis
- quality_scorer: Fine-tuning of LMs for argument quality scoring (+ their evaluation)
- match_scorer: Fine-tuning of LMs for determining a match score between an argument and argument summary (+ their evaluation)
- flan_t5_sum: Fine-tuning of FLAN T5 for argument summary generation (given a cluster of similar arguments)
- human_eval: Examination of inter-rater reliability and the correlation between human judgments and automatic evaluation metrics
- arg_seperation_capability: Examination of the ability of clustering-based ArgSum systems to separate arguments
- get_cluster_sums: Generation of argument summaries with clustering-based ArgSum systems
- get_classification_sums: Generation of argument summaries with classification-based ArgSum systems
- eval_sums: Automatic evaluation of the generated argument summaries
If you use the code or data from this work, please cite it as follows:
@inproceedings{altemeyer-etal-2025-argument,
title = "Argument Summarization and its Evaluation in the Era of Large Language Models",
author = "Altemeyer, Moritz and
Eger, Steffen and
Daxenberger, Johannes and
Chen, Yanran and
Altendorf, Tim and
Cimiano, Philipp and
Schiller, Benjamin",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.1797/",
doi = "10.18653/v1/2025.emnlp-main.1797",
pages = "35490--35511",
ISBN = "979-8-89176-332-6"
}
