A new benchmark for measuring LLM's capability to detect bugs in large codebase.
- Similar to the Needle In The Haystack benchmark, the Bug In The Haystack benchmark utilizes Python source code (randomly assembled) as the background noise and syntactic bug as the needle.
- This allows measurement of LLM's capability to retrieve code-related information at a very large context window, which is useful for SWE agent and co-pilot applications.
1 | def fahrenheit_to_celsius(fahrenheit):
2 | return (fahrenheit - 32) * 5.0/9.0
3 |
4 | def is_prime(num:
5 | if num <= 1:
6 | return False
7 | for i in range(2, int(num**0.5) + 1):
8 | if num % i == 0:
9 | return False
10| return True
Answer: 4, missing_parenthesis- *All models were evaluated on their latest versions.
Average Accuracy for Each Model (Exc. 16k Results)
- GPT-4o
- GPT-4o Mini
- GPT-4-Turbo
- Claude-3.5 Sonnet
- Claude-3 Opus
- Gemini 1.5 Pro
- Gemini 1.5 Flash
- GPT-3.5-Turbo
- Codestral
- Llama3-70B
- Command-R+
- Gemini-1.0-Pro
notebooks/bug_in_the_code_stack_python_source_code_preprocessing.ipynbcontains Colab notebook for data processing.notebooks/bug_in_the_code_stack_experiment_openai.ipynbcontains Colab notebook for running the experiment on OpenAI models.notebooks/bug_in_the_code_stack_experiment_litellm_openai_gpt4_turbo.ipynbcontains Colab notebook for running the experiment on GPT-4-Turbo w/t LiteLLM.notebooks/bug_in_the_code_stack_experiment_litellm_openai_gpt4o.ipynbcontains Colab notebook for running the experiment on GPT-4o w/t LiteLLM.notebooks/bug_in_the_code_stack_experiment_litellm_openai_gpt4o_mini.ipynbcontains Colab notebook for running the experiment on GPT-4o Mini w/t LiteLLM.notebooks/bug_in_the_code_stack_experiment_litellm_openai_gpt35.ipynbcontains Colab notebook for running the experiment on GPT-3.5 w/t LiteLLM.notebooks/bug_in_the_code_stack_experiment_litellm_anthropic_claude_35_sonnet.ipynbcontains Colab notebook for running the experiment on Claude-3.5 Sonnet w/t LiteLLM.notebooks/bug_in_the_code_stack_experiment_litellm_anthropic_claude_3_opus.ipynbcontains Colab notebook for running the experiment on Claude-3 Opus w/t LiteLLM.notebooks/bug_in_the_code_stack_experiment_litellm_cohere_commandr.ipynbcontains Colab notebook for running the experiment on Command-R+ w/t LiteLLM.notebooks/bug_in_the_code_stack_experiment_litellm_meta_llama3.ipynbcontains Colab notebook for running the experiment on Llama3 70B w/t LiteLLM.notebooks/bug_in_the_code_stack_experiment_mistral_codestral.ipynbcontains Colab notebook for running the experiment on Mistral Codestral w/t LiteLLMnotebooks/bug_in_the_code_stack_experiment_qwen_codeqwen_local.ipynbcontains Colab notebook for running the experiment on CodeQwen1.5 locally. Make sure to run this on Colab with A100 GPU.notebooks/bug_in_the_code_stack_experiment_genai_gemini10.pycontains Python script for running the experiment on Gemini-1.0-Pro w/t Generative AI package. Make sure to run this locally (doesn't work on Colab).notebooks/bics_helper_result_graphs.ipynbcontains helper functions to create beautiful graphs.notebooks/bics_result_analysis_graphs.ipynbcontains code to analyze the properties of codegen-focused models compared to larger, general-purpose models.
datasets/bug_in_the_code_stack_alpaca_dataset.csvis the preprocessed dataset used for the experiment.
- All notebooks and datasets can also be found at Bug In The Code Stack Google Drive. If you don't have access, please request access to
techandy42@gmail.com.













