LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering


LoCoBench is a comprehensive benchmark specifically designed to evaluate long-context Large Language Models (LLMs) in complex software development scenarios. It provides 8,000 evaluation scenarios across 10 programming languages with context lengths spanning 10K to 1M tokens.

🚀 Quick Start

Prerequisites

  • Python 3.8 or higher
  • Git

Installation

# Clone the repository
git clone https://github.com/SalesforceAIResearch/LoCoBench.git
cd LoCoBench

# Install dependencies
pip install -r requirements.txt

# Install LoCoBench package
pip install -e .
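
If the editable install succeeded, the package should be importable. A minimal sanity check (the locobench module name is inferred from the CLI command used later in this README):

import importlib.util

# Fails loudly if `pip install -e .` did not register the package.
if importlib.util.find_spec("locobench") is None:
    raise SystemExit("locobench not found; re-run `pip install -e .`")
print("locobench is importable")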

Download Evaluation Data

Download the complete evaluation dataset (data.zip):

# Download data.zip from Google Drive
# Visit: https://drive.google.com/file/d/1pK1M1sRrVZUDMKYcwh49CdXug0UzStvl/view?usp=sharing
# Or use gdown (install with: pip install gdown)
gdown https://drive.google.com/uc?id=1pK1M1sRrVZUDMKYcwh49CdXug0UzStvl

# Extract the data
unzip data.zip

# This will create the data/ directory with all evaluation scenarios
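
A quick way to confirm the extraction before running anything. This sketch only checks that the directory exists and counts files; the internal layout of data/ is not documented here, so no structure is assumed:

from pathlib import Path

# Verify that unzip produced the expected directory.
data_dir = Path("data")
if not data_dir.is_dir():
    raise SystemExit("data/ not found; re-check the download and unzip steps")

# Count files as a rough completeness check.
n_files = sum(1 for p in data_dir.rglob("*") if p.is_file())
print(f"{n_files} files under data/")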

Environment Setup

  1. Configure API Keys

Create an api.sh file (gitignored) with your LLM API credentials:

# Copy the template
cp api.sh.template api.sh

# Edit api.sh with your API keys
export OPENAI_API_KEY="your_openai_key_here"
export OPENAI_BASE_URL="https://your-openai-compatible-endpoint/v1"  # Optional
export ANTHROPIC_API_KEY="your_anthropic_key_here"
export CLAUDE_PROVIDER="anthropic"  # Optional: set only if you need to force Anthropic vs Bedrock
export GOOGLE_API_KEY="your_google_key_here"

# Source the file
source api.sh
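
Once api.sh is sourced, OpenAI-compatible clients pick the credentials up from the environment; nothing needs to be hard-coded. A minimal sketch using the official openai Python package, which reads OPENAI_API_KEY and OPENAI_BASE_URL automatically (the model name is a placeholder, and the call sends a real request to your endpoint):

from openai import OpenAI

# The client reads OPENAI_API_KEY and OPENAI_BASE_URL from the environment,
# so the same code works against the default API or a custom compatible endpoint.
client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",  # substitute any model your endpoint serves
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)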

📊 Running Evaluations

Option 1: Quick Evaluation (Recommended)

Run evaluation on all LoCoBench scenarios:

# Evaluate a single model on all scenarios
locobench evaluate --model gpt-4o --config-path config.yaml

# Evaluate a custom OpenAI-compatible endpoint/model
locobench evaluate --model your-model-name --config-path config.yaml

# Evaluate specific task categories
locobench evaluate --model claude-sonnet-4 --task-category architectural_understanding --difficulty hard

# Evaluate multiple models in parallel
locobench evaluate --model gpt-4o,claude-sonnet-4,gemini-2.5-pro --config-path config.yaml

Option 2: Custom Evaluation

# Evaluate on specific programming languages
locobench evaluate --model gpt-4o --languages python,java,cpp

# Evaluate specific domains
locobench evaluate --model gemini-2.5-pro --domains web_applications,ml_systems

Evaluation Results

Results are saved in the evaluation_results/ directory:

evaluation_results/
├── gpt4o_evaluation_results.json          # Detailed results
└── gpt4o_evaluation_results_summary.md    # Human-readable summary
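
The JSON file is plain and can be post-processed directly. Since the result schema is not documented here, a cautious first step is to inspect the structure rather than assume field names:

import json
from pathlib import Path

results_path = Path("evaluation_results/gpt4o_evaluation_results.json")
with results_path.open() as f:
    results = json.load(f)

# Inspect the top level before writing any schema-specific analysis.
if isinstance(results, dict):
    print("Top-level keys:", sorted(results))
else:
    print(f"Top-level list with {len(results)} records")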

📈 Understanding Results

LoCoBench Score (LCBS)

The unified score (0-5 scale) combines 17 metrics across 4 dimensions (a worked sketch of the weighting follows the list):

  • Software Engineering Excellence (40%): ACS, DTA, CFRD, STS, RS, CS, IS, SES
  • Functional Correctness (30%): Compilation, Unit Tests, Integration Tests, IDC
  • Code Quality Assessment (20%): Security Analysis, Code Issues, Style Adherence
  • Long-Context Utilization (10%): ICU, MMR
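
As a back-of-the-envelope illustration, the dimension weights above combine as a weighted sum. The dimension scores below are invented, and how LoCoBench aggregates the 17 individual metrics within each dimension is not shown here:

# Illustrative only: the dimension scores (0-5) are made-up example values.
weights = {
    "software_engineering": 0.40,
    "functional_correctness": 0.30,
    "code_quality": 0.20,
    "long_context": 0.10,
}
dimension_scores = {
    "software_engineering": 4.1,
    "functional_correctness": 3.8,
    "code_quality": 4.3,
    "long_context": 3.5,
}

lcbs = sum(weights[d] * dimension_scores[d] for d in weights)
print(f"LCBS = {lcbs:.2f} / 5")  # 3.99 with these example values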

Key Metrics Explained

  • ACS (Architectural Coherence Score): System-level design consistency
  • DTA (Dependency Traversal Accuracy): Cross-file reasoning ability
  • CFRD (Cross-File Reasoning Depth): Multi-file understanding
  • ICU (Information Coverage Utilization): Effective use of long context
  • MMR (Multi-Session Memory Retention): Context persistence across sessions

📚 Documentation

📄 Citation

@article{Qiu2025LoCoBenchAB,
  title={LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering},
  author={Jielin Qiu and Zuxin Liu and Zhiwei Liu and Rithesh Murthy and Jianguo Zhang and Haolin Chen and Shiyu Wang and Ming Zhu and Liangwei Yang and Juntao Tan and Zhepeng Cen and Cheng Qian and Shelby Heinecke and Weiran Yao and Silvio Savarese and Caiming Xiong and Huan Wang},
  journal={ArXiv},
  year={2025},
  volume={abs/2509.09614}
}

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

📜 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Salesforce AI Research for supporting this research
  • The open-source community for various tools and libraries used in this project
