Artifact for "Planning a Large Language Model for Static Detection of Runtime Errors in Code Snippets"
ORCA is a novel approach that guides a Large Language Model (LLM) to autonomously plan and navigate control flow graphs (CFGs) for predictive execution of (in)complete code snippets, enabling static detection of runtime errors efficiently and cost-effectively.
The artifact is archived on a public repository (Zenodo), qualifying it for the Available badge. It includes well-documented source code, datasets, and LLM outputs necessary to replicate all experiments, fulfilling the requirements for the Functional badge. Our implementation has been tested primarily with OpenAI’s API. However, the framework can be extended to support other APIs, such as Gemini, Claude, and others, with minimal modifications. See the Extending the Framework to Other APIs section for details. This extensibility supports the Reusable badge.
The source code, data, and model outputs are publicly available on GitHub and Zenodo.

This section describes the prerequisites and provides instructions for getting the project up and running.
Baseline B0: The baseline CodeExecutor transformer model requires a GPU machine (NVIDIA RTX A6000) for execution.

ORCA requires OpenAI API credentials to use gpt-3.5-turbo (gpt-35-turbo-0613). However, we also provide all LLM responses for the dataset (see output), so this step can be skipped when replicating the experiments.
Currently, ORCA works well on Ubuntu, and can be set up easily with all the prerequisite packages by following these instructions (if conda is already installed, update it to the latest version with `conda update conda`, and skip steps 1-3):

1. Download the latest, appropriate version of conda for your machine (tested with conda 24.1.2).
2. Install it by running the `conda_install.sh` file, with the command: `$ bash conda_install.sh`
3. Add conda to your bash profile: `$ source ~/.bashrc`
4. Navigate to `ORCA` (top-level directory) and create a conda virtual environment with the included `environment.yml` file using the following command: `$ conda env create -f environment.yml`. To test successful installation, make sure `orca` appears in the list of conda environments returned by `conda env list`.
5. Activate the virtual environment with the following command: `$ conda activate orca`
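As an optional sanity check (assuming the `openai` package is pinned in `environment.yml`, which the LLM pipelines require), confirm that it imports cleanly inside the activated environment:

```
$ python -c "import openai; print(openai.__version__)"
```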
ORCA traverses the control flow graph with Observation, Reasoning, and Action steps, while carefully tracking the symbol table after each block. Since this process requires a higher token limit, we use the gpt-3.5-turbo (gpt-35-turbo-0613) model with the 2023-05-15 API version to ensure successful graph traversal.

To get started, fill in your credentials in the `.env` file:
```
AZURE_OPENAI_ENDPOINT = ""
AZURE_OPENAI_KEY = ""
AZURE_API_VERSION = ""
```

- API Usage: Experiments for ORCA, Baseline B1, and Baseline B2 use the OpenAI API with the model `gpt-3.5-turbo-0613`.
- Estimated API Cost: The total estimated cost for reproducing all ORCA experiments on both buggy and non-buggy datasets is approximately $4.50.
- Hardware for Baseline B0: An NVIDIA RTX A6000 GPU (48 GB VRAM) was used for running Baseline B0 experiments.
- Estimated Time for Baseline B0: Approximately 34 minutes for processing both buggy and non-buggy datasets.
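For reference, here is a minimal sketch of how these credentials might be consumed (assuming the `openai` v1 SDK and `python-dotenv`; the actual client setup lives in `src/orca/model.py` and may differ):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv
from openai import AzureOpenAI  # pip install "openai>=1.0"

load_dotenv()  # reads the three AZURE_* variables from .env

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version=os.environ["AZURE_API_VERSION"],  # e.g., "2023-05-15"
)

# On Azure, `model` refers to the deployment name, e.g. "gpt-35-turbo-0613".
response = client.chat.completions.create(
    model="gpt-35-turbo-0613",
    messages=[{"role": "user", "content": "Say hello."}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```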
- `dataset` - Contains the dataset files.
  - `baseline` - Dataset files for the baseline CodeExecutor (B0).
  - `fixeval_cfg_b0.json` and `fixeval_incom_cfg_b0.json` - Dataset files for ORCA and the other baselines (B1 and B2).
- `dataset_builder` - Includes all the modules required to rebuild the dataset.
  - `Input_Variable_location` - Filters the main dataset based on the location of input-variable lines.
  - `Hunter` - Collects the ground-truth data by running the instances.
  - `CFG` - Builds the control flow graph for the instances.
  - `Sampling` - Randomly selects submissions (buggy & non-buggy) from all possible problem IDs and merges them.
  - `Incomplete_Script` - Randomly selects submissions (buggy & non-buggy) that use built-in or external libraries and removes the `import` statements from them.
  - `temp_dataset` - Caches dataset files from all the modules.
- `output` - Contains the output files for all baselines (B0, B1, B2) and ORCA.
- `src` - Contains the source files for all baselines (B0, B1, B2) and ORCA.
  - `baselines` - b0, b1, b2
  - `orca`
To evaluate the accuracy of all baselines and the ORCA model for the research questions (RQs), refer to the table below.
| Approach | Table # & RQ # in Paper | Directory Location | Run Command(s) |
|---|---|---|---|
| ORCA Results | Tables 1 to 9, RQs 1 to 5 | `orca/src/orca/` | `python show_results.py` |
| Baseline B0 Results | Tables 5, 6, 8, RQs 3 & 4 | `orca/src/baselines/b0` | `python show_results.py` |
| Baseline B1 Results | Tables 1 to 6, RQs 1 to 3 | `orca/src/baselines/b1` | `python show_results.py` |
| Baseline B2 Results | Tables 1 to 6, RQs 1 to 3 | `orca/src/baselines/b2` | `python show_results.py` |
Follow the steps below to replicate the results for baselines and the ORCA model.
- Navigate to the `orca/src/baselines/b0` directory.
- Run the following commands:

```
python run.py
python show_results.py
```
- Navigate to the respective directory:
  - B1: `orca/src/baselines/b1/`
  - B2: `orca/src/baselines/b2/`
- Run the pipeline: Use the following command to execute the pipeline. You can either replace the parameters with custom values or stick to the defaults (`temperature`: 0.7, `seed`: 42, `timeout`: 120); a concrete example invocation is shown after this list:

```
python pipeline.py --model <LLM_MODEL> --temperature <FLOAT> --seed <INT> --timeout <INT>
```

- View the results: After running the pipeline, check the results using the following command:

```
python show_results.py
```
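For example, a run with the default parameters and the model used in the paper might look like this (the exact `--model` string must match your OpenAI/Azure deployment name):

```
python pipeline.py --model gpt-3.5-turbo-0613 --temperature 0.7 --seed 42 --timeout 120
```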
- Navigate to the `orca/src/orca/` directory.
- Run the pipeline: Use the following command to execute the pipeline. You can either replace the parameters with custom values or stick to the defaults (`temperature`: 0.7, `seed`: 42, `timeout`: 120):

```
python pipeline.py --model <LLM_MODEL> --temperature <FLOAT> --seed <INT> --timeout <INT>
```

- View the results: After running the pipeline, check the results using the following command:

```
python show_results.py
```
- Navigate to the `orca/src/orca/inference` directory.
- Run the pipeline: Use the following command to execute the pipeline. You can either replace the parameters with custom values or stick to the defaults (`temperature`: 0.7, `seed`: 42, `timeout`: 120, `input_dir`: `../../../dataset/dataset.json`, `output_dir`: `../../../output/orca`); a concrete example follows this list:

```
python pipeline.py --model <LLM_MODEL> --temperature <FLOAT> --seed <INT> --timeout <INT> --input_dir <Dataset Directory Path> --output_dir <Output Directory Path>
```
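For example, to run inference on the shipped dataset with the defaults (again, adjust `--model` to your deployment name):

```
python pipeline.py --model gpt-3.5-turbo-0613 --temperature 0.7 --seed 42 --timeout 120 --input_dir ../../../dataset/dataset.json --output_dir ../../../output/orca
```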
- CFG Tool Limitation: The CFG (control flow graph) tool works on a single method only, because it cannot map block connections across method calls. Ensure that each data point in your dataset contains exactly one method, as illustrated below.
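For illustration, the first snippet below is a valid data point, while the second is not (hypothetical examples, not taken from the dataset):

```python
# Supported: a single, self-contained method; the CFG tool can build
# one complete graph for it.
def divide(a, b):
    return a / b  # raises ZeroDivisionError when b == 0


# Not supported: `main` calls the user-defined `helper`, and the CFG tool
# cannot map block connections across that call.
def helper(x):
    return x * 2


def main():
    return helper(5)
```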
Our implementation has been tested primarily with OpenAI’s API. However, the framework can be extended to support other APIs, such as Gemini, Claude, and others, with minimal modifications.
To adapt the framework for a different API, follow these steps:
- Update the `.env` file:
  - Add the new API key. For example, for Google’s Gemini, add `GOOGLE_GEMINI_API_KEY=<your-key>`.
- Modify the `model.py` file:
  - In `src/orca/model.py`, the `AgentInteraction` class manages API interactions. To support a new API:
    - Install and import the new LLM library (e.g., run `pip install google-generativeai` for Gemini and import it).
    - Update the `__init__` method to initialize the new API client using the key from `.env`.
    - Adjust the `api_call` method to match the new API’s request/response format (e.g., refer to the Gemini API documentation).
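As an illustrative (untested) sketch of these steps for Gemini, assuming the `AgentInteraction` method names described above (the actual signatures in `src/orca/model.py` may differ):

```python
import os

import google.generativeai as genai  # pip install google-generativeai
from dotenv import load_dotenv       # pip install python-dotenv


class AgentInteraction:
    def __init__(self, model_name: str = "gemini-1.5-flash"):
        load_dotenv()  # reads GOOGLE_GEMINI_API_KEY from .env
        genai.configure(api_key=os.environ["GOOGLE_GEMINI_API_KEY"])
        self.model = genai.GenerativeModel(model_name)

    def api_call(self, prompt: str, temperature: float = 0.7) -> str:
        # Gemini returns a response object; .text holds the generated string.
        response = self.model.generate_content(
            prompt,
            generation_config=genai.types.GenerationConfig(temperature=temperature),
        )
        return response.text
```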
- Download the FixEval dataset and move it to the `orca/dataset` directory.
- Download the test cases zip file and extract it to the `orca/dataset_builder/Hunter/` directory.
- Go to the `orca/dataset_builder` directory and run `python -W ignore script.py` to build the dataset.
- Go to the `orca/src/baselines/b0/dataset_building` directory and run `python -W ignore script.py` to build the dataset for the CodeExecutor baseline (B0).
- Code should carry appropriate comments wherever necessary and follow the docstring convention in the repository.
- If you see something that could be improved, send a pull request! We are always happy to look at improvements, to ensure that `orca`, as a project, is the best version of itself. If you think something should be done differently (or is just plain broken), please create an issue.
See the LICENSE file for more details.