Artifact for "Planning a Large Language Model for Static Detection of Runtime Errors in Code Snippets"
ORCA is a novel approach that guides a Large Language Model (LLM) to autonomously plan and navigate control flow graphs (CFGs) for predictive execution of (in)complete code snippets, enabling static detection of runtime errors efficiently and cost-effectively.
The artifact is archived on a public repository (Zenodo), qualifying it for the Available badge. It includes well-documented source code, datasets, and LLM outputs necessary to replicate all experiments, fulfilling the requirements for the Functional badge. Our implementation has been tested primarily with OpenAI’s API. However, the framework can be extended to support other APIs, such as Gemini, Claude, and others, with minimal modifications. See the Extending the Framework to Other APIs section for details. This extensibility supports the Reusable badge.
The source code, data, and model outputs are publicly available on GitHub and Zenodo.

This section describes the prerequisites and provides instructions for getting the project up and running.
Baseline B0: The baseline CodeExecutor transformer model requires a GPU machine (NVIDIA RTX A6000) for execution.

ORCA requires OpenAI API credentials to use gpt-3.5-turbo (gpt-35-turbo-0613). However, we also provide all LLM responses for the dataset (see output), so this step can be skipped when replicating the experiments.
Currently, ORCA works well on Ubuntu, and can be set up easily with all the prerequisite packages by following these instructions (if conda is already installed, update it to the latest version with `conda update conda`, and skip steps 1-3):

1. Download the latest, appropriate version of conda for your machine (tested with conda 24.1.2).
2. Install it by running the `conda_install.sh` file, with the command: `$ bash conda_install.sh`
3. Add conda to your bash profile: `$ source ~/.bashrc`
4. Navigate to `ORCA` (top-level directory) and create a conda virtual environment with the included `environment.yml` file using the following command: `$ conda env create -f environment.yml`. To test successful installation, make sure `orca` appears in the list of conda environments returned by `conda env list`.
5. Activate the virtual environment with the following command: `$ conda activate orca`
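As an optional sanity check (assuming the `openai` package is pinned in `environment.yml`, which the LLM pipelines require), confirm that it imports cleanly inside the activated environment:

```
$ python -c "import openai; print(openai.__version__)"
```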
ORCA traverses the control flow graph with Observation, Reasoning, and Action steps, while carefully tracking the symbol table after each block. Since this process requires a higher token limit, we use the gpt-3.5-turbo (gpt-35-turbo-0613) model with the 2023-05-15 API version to ensure successful graph traversal.

To get started, fill in your credentials in the `.env` file:
```
AZURE_OPENAI_ENDPOINT = ""
AZURE_OPENAI_KEY = ""
AZURE_API_VERSION = ""
```

- API Usage: Experiments for ORCA, Baseline B1, and Baseline B2 use the OpenAI API with the model `gpt-3.5-turbo-0613`.
- Estimated API Cost: The total estimated cost for reproducing all ORCA experiments on both buggy and non-buggy datasets is approximately $4.50.
- Hardware for Baseline B0: An NVIDIA RTX A6000 GPU (48 GB VRAM) was used for running Baseline B0 experiments.
- Estimated Time for Baseline B0: Approximately 34 minutes for processing both buggy and non-buggy datasets.
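For reference, here is a minimal sketch of how these credentials might be consumed (assuming the `openai` v1 SDK and `python-dotenv`; the actual client setup lives in `src/orca/model.py` and may differ):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv
from openai import AzureOpenAI  # pip install "openai>=1.0"

load_dotenv()  # reads the three AZURE_* variables from .env

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version=os.environ["AZURE_API_VERSION"],  # e.g., "2023-05-15"
)

# On Azure, `model` refers to the deployment name, e.g. "gpt-35-turbo-0613".
response = client.chat.completions.create(
    model="gpt-35-turbo-0613",
    messages=[{"role": "user", "content": "Say hello."}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```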
- `dataset` - Contains the dataset files.
  - `baseline` - Dataset files for the baseline CodeExecutor (B0).
  - `fixeval_cfg_b0.json` and `fixeval_incom_cfg_b0.json` - Dataset files for ORCA and the other baselines (B1 and B2).
- `dataset_builder` - Includes all the modules required to rebuild the dataset.
  - `Input_Variable_location` - Filters the main dataset based on the location of input-variable lines.
  - `Hunter` - Collects the ground-truth data by running the instances.
  - `CFG` - Builds the control flow graph for the instances.
  - `Sampling` - Randomly selects submissions (buggy & non-buggy) from all possible problem IDs and merges them.
  - `Incomplete_Script` - Randomly selects submissions (buggy & non-buggy) that use built-in or external libraries and removes the `import` statements from them.
  - `temp_dataset` - Caches dataset files from all the modules.
- `output` - Contains the output files for all baselines (B0, B1, B2) and ORCA.
- `src` - Contains the source files for all baselines (B0, B1, B2) and ORCA.
  - `baselines` - b0, b1, b2
  - `orca`
To evaluate the accuracy of all baselines and the ORCA model for the research questions (RQs), refer to the table below.
| Approach | Table # & RQ # in Paper | Directory Location | Run Command(s) |
|---|---|---|---|
| ORCA Results | Tables 1 to 9, RQs 1 to 5 | `orca/src/orca/` | `python show_results.py` |
| Baseline B0 Results | Tables 5, 6, 8, RQs 3 & 4 | `orca/src/baselines/b0` | `python show_results.py` |
| Baseline B1 Results | Tables 1 to 6, RQs 1 to 3 | `orca/src/baselines/b1` | `python show_results.py` |
| Baseline B2 Results | Tables 1 to 6, RQs 1 to 3 | `orca/src/baselines/b2` | `python show_results.py` |
Follow the steps below to replicate the results for baselines and the ORCA model.
- Navigate to the `orca/src/baselines/b0` directory.
- Run the following commands:

```
python run.py
python show_results.py
```
- Navigate to the respective directory:
  - B1: `orca/src/baselines/b1/`
  - B2: `orca/src/baselines/b2/`
- Run the pipeline: Use the following command to execute the pipeline. You can either replace the parameters with custom values or stick to the defaults (`temperature`: 0.7, `seed`: 42, `timeout`: 120); a concrete example invocation is shown after this list:

```
python pipeline.py --model <LLM_MODEL> --temperature <FLOAT> --seed <INT> --timeout <INT>
```

- View the results: After running the pipeline, check the results using the following command:

```
python show_results.py
```
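For example, a run with the default parameters and the model used in the paper might look like this (the exact `--model` string must match your OpenAI/Azure deployment name):

```
python pipeline.py --model gpt-3.5-turbo-0613 --temperature 0.7 --seed 42 --timeout 120
```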
- Navigate to the `orca/src/orca/` directory.
- Run the pipeline: Use the following command to execute the pipeline. You can either replace the parameters with custom values or stick to the defaults (`temperature`: 0.7, `seed`: 42, `timeout`: 120):

```
python pipeline.py --model <LLM_MODEL> --temperature <FLOAT> --seed <INT> --timeout <INT>
```

- View the results: After running the pipeline, check the results using the following command:

```
python show_results.py
```
- Navigate to the `orca/src/orca/inference` directory.
- Run the pipeline: Use the following command to execute the pipeline. You can either replace the parameters with custom values or stick to the defaults (`temperature`: 0.7, `seed`: 42, `timeout`: 120, `input_dir`: `../../../dataset/dataset.json`, `output_dir`: `../../../output/orca`); a concrete example follows this list:

```
python pipeline.py --model <LLM_MODEL> --temperature <FLOAT> --seed <INT> --timeout <INT> --input_dir <Dataset Directory Path> --output_dir <Output Directory Path>
```
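For example, to run inference on the shipped dataset with the defaults (again, adjust `--model` to your deployment name):

```
python pipeline.py --model gpt-3.5-turbo-0613 --temperature 0.7 --seed 42 --timeout 120 --input_dir ../../../dataset/dataset.json --output_dir ../../../output/orca
```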
- CFG Tool Limitation: The CFG (control flow graph) tool works on a single method only, because it cannot map block connections across method calls. Ensure that each data point in your dataset contains exactly one method, as illustrated below.
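For illustration, the first snippet below is a valid data point, while the second is not (hypothetical examples, not taken from the dataset):

```python
# Supported: a single, self-contained method; the CFG tool can build
# one complete graph for it.
def divide(a, b):
    return a / b  # raises ZeroDivisionError when b == 0


# Not supported: `main` calls the user-defined `helper`, and the CFG tool
# cannot map block connections across that call.
def helper(x):
    return x * 2


def main():
    return helper(5)
```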
Our implementation has been tested primarily with OpenAI’s API. However, the framework can be extended to support other APIs, such as Gemini, Claude, and others, with minimal modifications.
To adapt the framework for a different API, follow these steps:
- Update the `.env` file:
  - Add the new API key. For example, for Google’s Gemini, add `GOOGLE_GEMINI_API_KEY=<your-key>`.
- Modify the `model.py` file:
  - In `src/orca/model.py`, the `AgentInteraction` class manages API interactions. To support a new API:
    - Install and import the new LLM library (e.g., run `pip install google-generativeai` for Gemini and import it).
    - Update the `__init__` method to initialize the new API client using the key from `.env`.
    - Adjust the `api_call` method to match the new API’s request/response format (e.g., refer to the Gemini API documentation).
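As an illustrative (untested) sketch of these steps for Gemini, assuming the `AgentInteraction` method names described above (the actual signatures in `src/orca/model.py` may differ):

```python
import os

import google.generativeai as genai  # pip install google-generativeai
from dotenv import load_dotenv       # pip install python-dotenv


class AgentInteraction:
    def __init__(self, model_name: str = "gemini-1.5-flash"):
        load_dotenv()  # reads GOOGLE_GEMINI_API_KEY from .env
        genai.configure(api_key=os.environ["GOOGLE_GEMINI_API_KEY"])
        self.model = genai.GenerativeModel(model_name)

    def api_call(self, prompt: str, temperature: float = 0.7) -> str:
        # Gemini returns a response object; .text holds the generated string.
        response = self.model.generate_content(
            prompt,
            generation_config=genai.types.GenerationConfig(temperature=temperature),
        )
        return response.text
```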
- Download the FixEval dataset and move it to the `orca/dataset` directory.
- Download the test cases zip file and extract it to the `orca/dataset_builder/Hunter/` directory.
- Go to the `orca/dataset_builder` directory and run `python -W ignore script.py` to build the dataset.
- Go to the `orca/src/baselines/b0/dataset_building` directory and run `python -W ignore script.py` to build the dataset for the CodeExecutor baseline (B0).
- Code should carry appropriate comments wherever necessary and follow the docstring convention in the repository.
- If you see something that could be improved, send a pull request! We are always happy to look at improvements, to ensure that `orca`, as a project, is the best version of itself. If you think something should be done differently (or is just plain broken), please create an issue.
See the LICENSE file for more details.