LoCoBench is a comprehensive benchmark specifically designed to evaluate long-context Large Language Models (LLMs) in complex software development scenarios. It provides 8,000 evaluation scenarios across 10 programming languages with context lengths spanning 10K to 1M tokens.
- Python 3.8 or higher
- Git
# Clone the repository
git clone https://github.com/SalesforceAIResearch/LoCoBench.git
cd LoCoBench
# Install dependencies
pip install -r requirements.txt
# Install LoCoBench package
pip install -e .
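If you want to sanity-check the installation before moving on, you can confirm the locobench CLI is on your PATH. A minimal sketch, assuming the CLI accepts a standard --help flag (not documented above):

```python
# Quick post-install check: is the `locobench` CLI visible on PATH?
# (`--help` is an assumed flag used only for illustration.)
import shutil
import subprocess

assert shutil.which("locobench") is not None, "locobench CLI not found; re-run `pip install -e .`"
subprocess.run(["locobench", "--help"], check=True)
```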
Download the complete evaluation dataset (data.zip):
# Download data.zip from Google Drive
# Visit: https://drive.google.com/file/d/1pK1M1sRrVZUDMKYcwh49CdXug0UzStvl/view?usp=sharing
# Or use gdown (install with: pip install gdown)
gdown https://drive.google.com/uc?id=1pK1M1sRrVZUDMKYcwh49CdXug0UzStvl
# Extract the data
unzip data.zip
# This will create the data/ directory with all evaluation scenarios
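After extraction, a quick way to confirm the dataset landed in place is to count what is under data/. A minimal sketch that assumes only that data/ exists, not any particular internal layout:

```python
# Rough sanity check of the extracted dataset; no assumptions are made
# about the internal layout of data/ beyond the directory existing.
from pathlib import Path

data_dir = Path("data")
assert data_dir.is_dir(), "data/ not found - did you unzip in the repository root?"

files = [p for p in data_dir.rglob("*") if p.is_file()]
print(f"data/ contains {len(files)} files")
```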
Configure API Keys
Create an api.sh file (gitignored) with your LLM API credentials:
# Copy the template
cp api.sh.template api.sh
# Edit api.sh with your API keys
export OPENAI_API_KEY="your_openai_key_here"
export OPENAI_BASE_URL="https://your-openai-compatible-endpoint/v1" # Optional
export ANTHROPIC_API_KEY="your_anthropic_key_here"
export CLAUDE_PROVIDER="anthropic" # Optional: set only if you need to force Anthropic vs Bedrock
export GOOGLE_API_KEY="your_google_key_here"
# Source the file
source api.sh
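Before launching long evaluation runs, it can help to confirm the sourced keys are actually visible to child processes. A minimal sketch; adjust the list to the providers you actually plan to evaluate:

```python
# Confirm the API keys exported in api.sh are visible in the current environment.
# Only the providers you actually evaluate need to be set.
import os

required = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All expected API keys are set.")
```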
Run evaluation on all LoCoBench scenarios:
# Evaluate a single model on all scenarios
locobench evaluate --model gpt-4o --config-path config.yaml
# Evaluate a custom OpenAI-compatible endpoint/model
locobench evaluate --model your-model-name --config-path config.yaml
# Evaluate specific task categories
locobench evaluate --model claude-sonnet-4 --task-category architectural_understanding --difficulty hard
# Evaluate multiple models in parallel
locobench evaluate --model gpt-4o,claude-sonnet-4,gemini-2.5-pro --config-path config.yaml
# Evaluate on specific programming languages
locobench evaluate --model gpt-4o --languages python,java,cpp
# Evaluate specific domains
locobench evaluate --model gemini-2.5-pro --domains web_applications,ml_systems
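To sweep several models sequentially instead of using a single comma-separated invocation, one option is to drive the CLI from a short script. A minimal sketch that reuses only the flags shown above; the model and language lists are illustrative:

```python
# Drive `locobench evaluate` over several models sequentially from Python.
# Flags mirror the CLI examples above; the model list is illustrative.
import subprocess

models = ["gpt-4o", "claude-sonnet-4", "gemini-2.5-pro"]

for model in models:
    subprocess.run(
        ["locobench", "evaluate",
         "--model", model,
         "--config-path", "config.yaml",
         "--languages", "python,java,cpp"],
        check=True,  # stop the sweep if any run fails
    )
```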
Results are saved in the evaluation_results/ directory:
evaluation_results/
├── gpt4o_evaluation_results.json          # Detailed results
└── gpt4o_evaluation_results_summary.md    # Human-readable summary
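The detailed JSON can also be inspected programmatically. A minimal sketch that assumes only that the file is valid JSON; no particular schema or field names are assumed:

```python
# Load a detailed results file and show its top-level structure.
# No particular schema is assumed beyond valid JSON.
import json
from pathlib import Path

results_path = Path("evaluation_results/gpt4o_evaluation_results.json")
results = json.loads(results_path.read_text())

if isinstance(results, dict):
    print("Top-level keys:", list(results.keys()))
else:
    print(f"Top-level list with {len(results)} entries")
```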
The unified score (0-5 scale) combines 17 metrics across 4 dimensions (a sketch of the weighted combination follows the metric definitions below):
- Software Engineering Excellence (40%): ACS, DTA, CFRD, STS, RS, CS, IS, SES
- Functional Correctness (30%): Compilation, Unit Tests, Integration Tests, IDC
- Code Quality Assessment (20%): Security Analysis, Code Issues, Style Adherence
- Long-Context Utilization (10%): ICU, MMR
- ACS (Architectural Coherence Score): System-level design consistency
- DTA (Dependency Traversal Accuracy): Cross-file reasoning ability
- CFRD (Cross-File Reasoning Depth): Multi-file understanding
- ICU (Information Coverage Utilization): Effective use of long context
- MMR (Multi-Session Memory Retention): Context persistence across sessions
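For intuition, the weighting works out as a simple weighted average. A minimal sketch, assuming each dimension has already been aggregated to its own 0-5 score (the per-metric aggregation inside each dimension is not shown):

```python
# Combine the four dimension scores (each assumed pre-aggregated to a 0-5 scale)
# into a unified score using the weights listed above.
WEIGHTS = {
    "software_engineering": 0.40,    # ACS, DTA, CFRD, STS, RS, CS, IS, SES
    "functional_correctness": 0.30,  # compilation, unit/integration tests, IDC
    "code_quality": 0.20,            # security analysis, code issues, style adherence
    "long_context": 0.10,            # ICU, MMR
}

def unified_score(dimension_scores: dict) -> float:
    """Weighted average of the four dimension scores; stays on the 0-5 scale."""
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)

# Hypothetical dimension scores for one model
print(round(unified_score({
    "software_engineering": 3.8,
    "functional_correctness": 4.1,
    "code_quality": 3.5,
    "long_context": 4.4,
}), 2))  # -> 3.89
```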
- Generation Guide: How to generate custom scenarios (Phases 1-4)
- Contributing: How to contribute to LoCoBench
@article{Qiu2025LoCoBenchAB,
title={LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering},
author={Jielin Qiu and Zuxin Liu and Zhiwei Liu and Rithesh Murthy and Jianguo Zhang and Haolin Chen and Shiyu Wang and Ming Zhu and Liangwei Yang and Juntao Tan and Zhepeng Cen and Cheng Qian and Shelby Heinecke and Weiran Yao and Silvio Savarese and Caiming Xiong and Huan Wang},
journal={ArXiv},
year={2025},
volume={abs/2509.09614}
}
We welcome contributions! Please see our Contributing Guide for details.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Salesforce AI Research for supporting this research
- The open-source community for various tools and libraries used in this project