From Prompts to Agents: A Multi-Perspective Evaluation of Large Language Models for File-Level Software Vulnerability Detection and Repair
Large Language Models (LLMs) have emerged as promising tools for addressing the growing complexity of software engineering tasks, offering the ability to generate, test, understand, and analyze code at scale. Their capacity to reason about code semantics and detect patterns has led researchers and practitioners to explore their application in automated vulnerability detection. However, most existing work focuses on function- or line-level analysis, overlooking the broader file-level context that LLMs are well-suited to reason about.
This study investigates the effectiveness of LLMs in detecting, classifying, and repairing file-level software vulnerabilities in Python across three high-risk CWE categories: CWE-22, CWE-79, and CWE-89. We benchmark both open- and closed-source models using zero-shot prompting and compare them against static analysis tools. Building on these initial findings, we introduce two enhancements: a structured prompting strategy that improves reasoning and consistency, and a multi-agent framework that addresses context limitations through chunking, memory retention, and aggregation.
Experimental results show that closed-source models consistently outperform open-source alternatives across all tasks. Structured prompting increases detection accuracy by up to 5.4% and CWE-79 classification accuracy by over 15%. Our agent-based approach achieves perfect CWE classification and yields the highest patch quality with a ROUGE-L score of 0.382. This work provides a comprehensive and reproducible evaluation framework, highlighting how LLMs—especially when guided by prompt engineering and agentic orchestration—can advance automated vulnerability analysis in real-world settings.
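Patch quality above is reported as a ROUGE-L score between the generated and reference fixes. The exact evaluation code is not shown in this summary; the snippet below is a minimal sketch of how such a score can be computed with the `rouge-score` package, which is an assumed tool for illustration rather than the study's actual evaluation pipeline.

```python
# Minimal sketch: scoring a generated patch against a reference fix with ROUGE-L.
# Assumes the `rouge-score` package (pip install rouge-score); illustrative tooling,
# not necessarily the evaluation code used in the study.
from rouge_score import rouge_scorer

reference_patch = 'cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))'
generated_patch = 'cursor.execute("SELECT * FROM users WHERE id = %s", [user_id])'

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
score = scorer.score(reference_patch, generated_patch)  # score(target, prediction)

# fmeasure is the ROUGE-L F1 value comparable to the 0.382 reported above.
print(f"ROUGE-L F1: {score['rougeL'].fmeasure:.3f}")
```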
This project investigates the effectiveness of open- and closed-source LLMs and an agentic approach, combined with different prompting techniques, in detecting, classifying, and fixing file-level vulnerabilities. Traditional static analysis tools can miss or incorrectly flag complex vulnerabilities that span multiple lines or files, and existing LLM-based tools mainly analyze code at the function or line level while neglecting the file or repository level. In this project, we explore the capabilities of closed-source and open-source LLMs for:
- Detecting vulnerabilities at the file level (e.g., CWE-22, CWE-79, CWE-89).
- Classifying vulnerability types and providing human-readable explanations.
- Generating automated fixes using prompt engineering and specialized multi-agent systems.
- Testing the performance of Python static analysis tools in vulnerability detection.
- Comparing LLMs (GPT-3.5, GPT-4.5, Claude 3.7 Sonnet, Claude 3 Haiku, CodeLlama, Mistral) on detection and classification tasks.
- Demonstrating how structured prompts improve results over generic prompts (see the sketch after this list).
- Showing how agent-based chunking, memory, and judge-fixer solutions further enhance detection and fix generation (also illustrated in the sketch after this list).
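The structured prompting strategy and the agent-based chunking, memory, and judge-fixer pipeline are described above only at a high level. The sketch below illustrates one way these pieces could fit together for file-level analysis; the prompt wording, function names, and the `llm()` helper are assumptions made for illustration, not the project's actual implementation.

```python
# Illustrative sketch only: a structured file-level prompt plus a chunked,
# memory-carrying detect -> judge -> fix loop. The llm() helper is a hypothetical
# stand-in for whichever model API (GPT, Claude, CodeLlama, Mistral) is queried.
from typing import Callable

STRUCTURED_PROMPT = """You are a security auditor reviewing a Python file.
Focus on CWE-22 (path traversal), CWE-79 (XSS), and CWE-89 (SQL injection).
For the code chunk below:
1. State whether it is vulnerable (yes/no).
2. If vulnerable, name the most likely CWE and the affected lines.
3. Explain your reasoning in one short paragraph.

Context from earlier chunks of the same file:
{memory}

Code chunk:
{chunk}
"""

def chunk_file(source: str, max_lines: int = 80) -> list[str]:
    """Split a file into line-based chunks so each fits in the model's context."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]

def analyze_file(source: str, llm: Callable[[str], str]) -> dict:
    """Run detection per chunk, carry memory across chunks, then judge and fix."""
    memory: list[str] = []    # short summaries retained across chunks
    findings: list[str] = []  # per-chunk vulnerability reports

    for chunk in chunk_file(source):
        report = llm(STRUCTURED_PROMPT.format(
            memory="\n".join(memory) or "none", chunk=chunk))
        findings.append(report)
        memory.append(report[:300])  # keep a truncated summary for later chunks

    # Judge agent: aggregate per-chunk findings into one file-level verdict.
    verdict = llm("Aggregate these chunk-level reports into a single file-level "
                  "verdict (vulnerable yes/no, CWE id):\n" + "\n---\n".join(findings))

    # Fixer agent: propose a patched version of the file guided by the verdict.
    patch = llm("Given this verdict:\n" + verdict +
                "\n\nRewrite the file below to remove the vulnerability, "
                "changing as little code as possible:\n" + source)

    return {"verdict": verdict, "patch": patch}
```

By contrast, a generic prompt would simply ask whether the code is vulnerable; the structured version constrains the output format and reasoning steps, which is the kind of consistency gain the evaluation measures.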
Chaima Jaziri, Concordia University, Canada. Chaima.jaziri@live.concordia.ca
Caren Rizk, Concordia University, Canada. c_izk@live.concordia.ca