LoCoBench is a comprehensive benchmark specifically designed to evaluate long-context Large Language Models (LLMs) in complex software development scenarios. It provides 8,000 evaluation scenarios across 10 programming languages with context lengths spanning 10K to 1M tokens.
- Python 3.8 or higher
- Git
# Clone the repository
git clone https://github.com/SalesforceAIResearch/LoCoBench.git
cd LoCoBench
# Install dependencies
pip install -r requirements.txt
# Install LoCoBench package
pip install -e .
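If you want to sanity-check the installation before moving on, you can confirm the locobench CLI is on your PATH. A minimal sketch, assuming the CLI accepts a standard --help flag (not documented above):

```python
# Quick post-install check: is the `locobench` CLI visible on PATH?
# (`--help` is an assumed flag used only for illustration.)
import shutil
import subprocess

assert shutil.which("locobench") is not None, "locobench CLI not found; re-run `pip install -e .`"
subprocess.run(["locobench", "--help"], check=True)
```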
Download the complete evaluation dataset (data.zip):
# Download data.zip from Google Drive
# Visit: https://drive.google.com/file/d/1pK1M1sRrVZUDMKYcwh49CdXug0UzStvl/view?usp=sharing
# Or use gdown (install with: pip install gdown)
gdown https://drive.google.com/uc?id=1pK1M1sRrVZUDMKYcwh49CdXug0UzStvl
# Extract the data
unzip data.zip
# This will create the data/ directory with all evaluation scenarios
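After extraction, a quick way to confirm the dataset landed in place is to count what is under data/. A minimal sketch that assumes only that data/ exists, not any particular internal layout:

```python
# Rough sanity check of the extracted dataset; no assumptions are made
# about the internal layout of data/ beyond the directory existing.
from pathlib import Path

data_dir = Path("data")
assert data_dir.is_dir(), "data/ not found - did you unzip in the repository root?"

files = [p for p in data_dir.rglob("*") if p.is_file()]
print(f"data/ contains {len(files)} files")
```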
Configure API Keys
Create an api.sh file (gitignored) with your LLM API credentials:
# Copy the template
cp api.sh.template api.sh
# Edit api.sh with your API keys
export OPENAI_API_KEY="your_openai_key_here"
export OPENAI_BASE_URL="https://your-openai-compatible-endpoint/v1" # Optional
export ANTHROPIC_API_KEY="your_anthropic_key_here"
export CLAUDE_PROVIDER="anthropic" # Optional: set only if you need to force Anthropic vs Bedrock
export GOOGLE_API_KEY="your_google_key_here"
# Source the file
source api.sh
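Before launching long evaluation runs, it can help to confirm the sourced keys are actually visible to child processes. A minimal sketch; adjust the list to the providers you actually plan to evaluate:

```python
# Confirm the API keys exported in api.sh are visible in the current environment.
# Only the providers you actually evaluate need to be set.
import os

required = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All expected API keys are set.")
```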
Run evaluation on all LoCoBench scenarios:
# Evaluate a single model on all scenarios
locobench evaluate --model gpt-4o --config-path config.yaml
# Evaluate a custom OpenAI-compatible endpoint/model
locobench evaluate --model your-model-name --config-path config.yaml
# Evaluate specific task categories
locobench evaluate --model claude-sonnet-4 --task-category architectural_understanding --difficulty hard
# Evaluate multiple models in parallel
locobench evaluate --model gpt-4o,claude-sonnet-4,gemini-2.5-pro --config-path config.yaml
# Evaluate on specific programming languages
locobench evaluate --model gpt-4o --languages python,java,cpp
# Evaluate specific domains
locobench evaluate --model gemini-2.5-pro --domains web_applications,ml_systems
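To sweep several models sequentially instead of using a single comma-separated invocation, one option is to drive the CLI from a short script. A minimal sketch that reuses only the flags shown above; the model and language lists are illustrative:

```python
# Drive `locobench evaluate` over several models sequentially from Python.
# Flags mirror the CLI examples above; the model list is illustrative.
import subprocess

models = ["gpt-4o", "claude-sonnet-4", "gemini-2.5-pro"]

for model in models:
    subprocess.run(
        ["locobench", "evaluate",
         "--model", model,
         "--config-path", "config.yaml",
         "--languages", "python,java,cpp"],
        check=True,  # stop the sweep if any run fails
    )
```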
Results are saved in the evaluation_results/ directory:
evaluation_results/
├── gpt4o_evaluation_results.json          # Detailed results
└── gpt4o_evaluation_results_summary.md    # Human-readable summary
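The detailed JSON can also be inspected programmatically. A minimal sketch that assumes only that the file is valid JSON; no particular schema or field names are assumed:

```python
# Load a detailed results file and show its top-level structure.
# No particular schema is assumed beyond valid JSON.
import json
from pathlib import Path

results_path = Path("evaluation_results/gpt4o_evaluation_results.json")
results = json.loads(results_path.read_text())

if isinstance(results, dict):
    print("Top-level keys:", list(results.keys()))
else:
    print(f"Top-level list with {len(results)} entries")
```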
The unified score (0-5 scale) combines 17 metrics across 4 dimensions (a sketch of the weighted combination follows the metric definitions below):
- Software Engineering Excellence (40%): ACS, DTA, CFRD, STS, RS, CS, IS, SES
- Functional Correctness (30%): Compilation, Unit Tests, Integration Tests, IDC
- Code Quality Assessment (20%): Security Analysis, Code Issues, Style Adherence
- Long-Context Utilization (10%): ICU, MMR
- ACS (Architectural Coherence Score): System-level design consistency
- DTA (Dependency Traversal Accuracy): Cross-file reasoning ability
- CFRD (Cross-File Reasoning Depth): Multi-file understanding
- ICU (Information Coverage Utilization): Effective use of long context
- MMR (Multi-Session Memory Retention): Context persistence across sessions
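For intuition, the weighting works out as a simple weighted average. A minimal sketch, assuming each dimension has already been aggregated to its own 0-5 score (the per-metric aggregation inside each dimension is not shown):

```python
# Combine the four dimension scores (each assumed pre-aggregated to a 0-5 scale)
# into a unified score using the weights listed above.
WEIGHTS = {
    "software_engineering": 0.40,    # ACS, DTA, CFRD, STS, RS, CS, IS, SES
    "functional_correctness": 0.30,  # compilation, unit/integration tests, IDC
    "code_quality": 0.20,            # security analysis, code issues, style adherence
    "long_context": 0.10,            # ICU, MMR
}

def unified_score(dimension_scores: dict) -> float:
    """Weighted average of the four dimension scores; stays on the 0-5 scale."""
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)

# Hypothetical dimension scores for one model
print(round(unified_score({
    "software_engineering": 3.8,
    "functional_correctness": 4.1,
    "code_quality": 3.5,
    "long_context": 4.4,
}), 2))  # -> 3.89
```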
- Generation Guide: How to generate custom scenarios (Phases 1-4)
- Contributing: How to contribute to LoCoBench
@article{Qiu2025LoCoBenchAB,
title={LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering},
author={Jielin Qiu and Zuxin Liu and Zhiwei Liu and Rithesh Murthy and Jianguo Zhang and Haolin Chen and Shiyu Wang and Ming Zhu and Liangwei Yang and Juntao Tan and Zhepeng Cen and Cheng Qian and Shelby Heinecke and Weiran Yao and Silvio Savarese and Caiming Xiong and Huan Wang},
journal={ArXiv},
year={2025},
volume={abs/2509.09614}
}
We welcome contributions! Please see our Contributing Guide for details.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Salesforce AI Research for supporting this research
- The open-source community for various tools and libraries used in this project