From Prompts to Agents: A Multi-Perspective Evaluation of Large Language Models for File-Level Software Vulnerability Detection and Repair
Large Language Models (LLMs) have emerged as promising tools for addressing the growing complexity of software engineering tasks, offering the ability to generate, test, understand, and analyze code at scale. Their capacity to reason about code semantics and detect patterns has led researchers and practitioners to explore their application in automated vulnerability detection. However, most existing work focuses on function- or line-level analysis, overlooking the broader file-level context that LLMs are well-suited to reason about.
This study investigates the effectiveness of LLMs in detecting, classifying, and repairing file-level software vulnerabilities in Python across three high-risk CWE categories: CWE-22, CWE-79, and CWE-89. We benchmark both open- and closed-source models using zero-shot prompting and compare them against static analysis tools. Building on these initial findings, we introduce two enhancements: a structured prompting strategy that improves reasoning and consistency, and a multi-agent framework that addresses context limitations through chunking, memory retention, and aggregation.
Experimental results show that closed-source models consistently outperform open-source alternatives across all tasks. Structured prompting increases detection accuracy by up to 5.4% and CWE-79 classification accuracy by over 15%. Our agent-based approach achieves perfect CWE classification and yields the highest patch quality with a ROUGE-L score of 0.382. This work provides a comprehensive and reproducible evaluation framework, highlighting how LLMs—especially when guided by prompt engineering and agentic orchestration—can advance automated vulnerability analysis in real-world settings.
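Patch quality above is reported as a ROUGE-L score between the generated and reference fixes. The exact evaluation code is not shown in this summary; the snippet below is a minimal sketch of how such a score can be computed with the `rouge-score` package, which is an assumed tool for illustration rather than the study's actual evaluation pipeline.

```python
# Minimal sketch: scoring a generated patch against a reference fix with ROUGE-L.
# Assumes the `rouge-score` package (pip install rouge-score); illustrative tooling,
# not necessarily the evaluation code used in the study.
from rouge_score import rouge_scorer

reference_patch = 'cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))'
generated_patch = 'cursor.execute("SELECT * FROM users WHERE id = %s", [user_id])'

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
score = scorer.score(reference_patch, generated_patch)  # score(target, prediction)

# fmeasure is the ROUGE-L F1 value comparable to the 0.382 reported above.
print(f"ROUGE-L F1: {score['rougeL'].fmeasure:.3f}")
```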
This project investigates the effectiveness of open- and closed-source LLMs and an agentic approach, combined with different prompting techniques, in detecting, classifying, and fixing file-level vulnerabilities. Traditional static analysis tools can miss or incorrectly flag complex vulnerabilities that span multiple lines or files, and existing LLM-based tools mainly analyze code at the function or line level while neglecting the file or repository level. In this project, we explore the capabilities of closed-source and open-source LLMs for:
- Detecting vulnerabilities at the file level (e.g., CWE-22, CWE-79, CWE-89).
- Classifying vulnerability types and providing human-readable explanations.
- Generating automated fixes using prompt engineering and specialized multi-agent systems.
- Testing the performance of Python static analysis tools in vulnerability detection.
- Comparing LLMs (GPT-3.5, GPT-4.5, Claude 3.7 Sonnet, Claude 3 Haiku, CodeLlama, Mistral) on detection and classification tasks.
- Demonstrating how structured prompts improve results over generic prompts (see the sketch after this list).
- Showing how agent-based chunking, memory, and judge-fixer solutions further enhance detection and fix generation (also illustrated in the sketch after this list).
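The structured prompting strategy and the agent-based chunking, memory, and judge-fixer pipeline are described above only at a high level. The sketch below illustrates one way these pieces could fit together for file-level analysis; the prompt wording, function names, and the `llm()` helper are assumptions made for illustration, not the project's actual implementation.

```python
# Illustrative sketch only: a structured file-level prompt plus a chunked,
# memory-carrying detect -> judge -> fix loop. The llm() helper is a hypothetical
# stand-in for whichever model API (GPT, Claude, CodeLlama, Mistral) is queried.
from typing import Callable

STRUCTURED_PROMPT = """You are a security auditor reviewing a Python file.
Focus on CWE-22 (path traversal), CWE-79 (XSS), and CWE-89 (SQL injection).
For the code chunk below:
1. State whether it is vulnerable (yes/no).
2. If vulnerable, name the most likely CWE and the affected lines.
3. Explain your reasoning in one short paragraph.

Context from earlier chunks of the same file:
{memory}

Code chunk:
{chunk}
"""

def chunk_file(source: str, max_lines: int = 80) -> list[str]:
    """Split a file into line-based chunks so each fits in the model's context."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]

def analyze_file(source: str, llm: Callable[[str], str]) -> dict:
    """Run detection per chunk, carry memory across chunks, then judge and fix."""
    memory: list[str] = []    # short summaries retained across chunks
    findings: list[str] = []  # per-chunk vulnerability reports

    for chunk in chunk_file(source):
        report = llm(STRUCTURED_PROMPT.format(
            memory="\n".join(memory) or "none", chunk=chunk))
        findings.append(report)
        memory.append(report[:300])  # keep a truncated summary for later chunks

    # Judge agent: aggregate per-chunk findings into one file-level verdict.
    verdict = llm("Aggregate these chunk-level reports into a single file-level "
                  "verdict (vulnerable yes/no, CWE id):\n" + "\n---\n".join(findings))

    # Fixer agent: propose a patched version of the file guided by the verdict.
    patch = llm("Given this verdict:\n" + verdict +
                "\n\nRewrite the file below to remove the vulnerability, "
                "changing as little code as possible:\n" + source)

    return {"verdict": verdict, "patch": patch}
```

By contrast, a generic prompt would simply ask whether the code is vulnerable; the structured version constrains the output format and reasoning steps, which is the kind of consistency gain the evaluation measures.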
Chaima Jaziri, Concordia University, Canada. Chaima.jaziri@live.concordia.ca
Caren Rizk, Concordia University, Canada. c_izk@live.concordia.ca