GMCQ-benchmark 🧪

GitHub Multiple Choice Questions (GMCQ) is a benchmark developed by Rootly AI Labs to evaluate a language model's ability to identify, among several candidates, the pull request that closed a bug-fix issue in a real-world GitHub repository.

We took closed issues with the "bug" label from leading open-source GitHub repositories, along with the pull requests that closed them.

Benchmarking Methodology

To measure performance, Rootly AI Labs fellow Laurence Liang developed a multiple-choice benchmark built from leading open-source GitHub repositories. Here is our methodology:

  • We sourced issues labeled "bug" from the leading open-source GitHub repositories.
  • For each issue, we collected the description and the associated pull request (PR) that solved it.
  • For benchmarking, we presented models with each bug description and four candidate PRs, exactly one of which solved the issue. No codebase context was included.
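One way to pair a closed bug issue with the PR that closed it is GitHub's REST timeline API, which records "cross-referenced" events when a pull request mentions the issue. The sketch below shows the extraction step as a pure function; the event shape follows GitHub's documented API, but the heuristic of collecting cross-referenced PRs is our assumption, not necessarily the exact pipeline the authors used.

```python
# Hypothetical helper for the pairing step: given the JSON events returned by
# GET /repos/{owner}/{repo}/issues/{number}/timeline, collect the numbers of
# pull requests that cross-referenced the issue. (Assumption: the closing PR
# appears among these cross-references.)

def closing_pr_candidates(timeline_events):
    """Return PR numbers cross-referenced in an issue's timeline events."""
    prs = []
    for event in timeline_events:
        if event.get("event") != "cross-referenced":
            continue
        source_issue = event.get("source", {}).get("issue", {})
        # A cross-reference that came from a PR carries a "pull_request" key.
        if "pull_request" in source_issue:
            prs.append(source_issue["number"])
    return prs
```

A real pipeline would still need to verify which candidate actually closed the issue (e.g. via the merge commit on the "closed" event) before treating it as the ground-truth answer.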

Task Format

Given a GitHub issue title and description, without any additional context, the model must determine the correct pull request that closed the issue.

There is one correct choice and three distractors: pull requests from the same repository that closed different issues.

Each pull request choice contains the filenames that were changed and the corresponding code patch.

Task description:
<GitHub issue title and description>

---

Choice A:

<filenames and code patches>

---

Choice B:

<filenames and code patches>

---

Choice C:

<filenames and code patches>

---

Choice D:

<filenames and code patches>

Evaluation Results

We obtained the following results on version 0.1 of GMCQ using the OpenAI/evals framework.

| Model Name | Accuracy |
| --- | --- |
| o4-mini | 0.927 ± 0.029 |
| o3 | 0.915 ± 0.032 |
| grok-3-beta | 0.915 ± 0.032 |
| Qwen-2.5-Coder-32B (Groq) | 0.902 ± 0.034 |
| grok-3-mini-beta | 0.902 ± 0.032 |
| o3-mini | 0.893 ± 0.034 |
| Gemini-2.5-Flash (Google) | 0.878 ± 0.036 |
| GPT-4o | 0.866 ± 0.039 |
| GPT-4.1 | 0.841 ± 0.039 |
| Gemini-2.0-Flash (Google) | 0.841 ± 0.042 |
| GPT-4o mini | 0.829 ± 0.042 |
| Qwen-2.5-32B (Groq) | 0.793 ± 0.044 |
| Claude 3.5 Sonnet | 0.780 ± 0.048 |
| DeepSeek V3.1 (0324) (Together AI) | 0.756 ± 0.049 |
| Llama-3.3 70B-versatile (Groq) | 0.720 ± 0.050 |
| Llama-4-Maverick (Groq) | 0.695 ± 0.051 |
| Llama-4 Scout (Groq) | 0.598 ± 0.053 |
| Llama-3.1 8B-instant (Groq) | 0.341 ± 0.052 |
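One common way to report "accuracy ± margin" figures like those above is the binomial standard error of the observed accuracy. The sketch below shows that computation; whether the table uses one standard error or some other interval is our assumption, as is the example split in the test.

```python
import math

# Accuracy with a binomial standard error: se = sqrt(p * (1 - p) / n).
# This is a generic sketch, not the benchmark's actual scoring code.
def accuracy_with_se(n_correct, n_total):
    p = n_correct / n_total
    se = math.sqrt(p * (1 - p) / n_total)
    return p, se
```

For illustration, a hypothetical split of 76 correct out of 82 questions yields roughly 0.927 ± 0.029 under this convention.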

About the Rootly AI Labs

This project was developed by the Rootly AI Labs. The AI Labs is building the future of system reliability and operational excellence. We operate as an open-source incubator, sharing ideas, experimenting, and rapidly prototyping. We're committed to ensuring our research benefits the entire community.
