Transform Evals

A comprehensive evaluation framework for testing code generation capabilities of LLMs on data transformation tasks.

Overview

This project evaluates how well different LLMs can generate Python code for data transformation tasks. It supports multiple LLM providers (Anthropic, Google, OpenAI) and includes two main datasets:

  • ETL Code Tasks: Standard data transformation tasks
  • LLM Embedding Tasks: Tasks involving API calls to embedding and LLM services

Features

  • Multi-Provider Support: Evaluate models from Anthropic, Google, and OpenAI
  • Web Search Integration: Optional web search tool for API documentation lookup
  • Comprehensive Evaluation: Uses Claude Sonnet as a judge to evaluate code quality, correctness, efficiency, and robustness
  • Concurrent Execution: Parallel evaluation of multiple models and test cases
  • Detailed Results: Generates pass/fail matrices, score tables, and latency metrics

Setup

Prerequisites

  • Python 3.8+
  • API keys for at least one LLM provider (Anthropic, Google, OpenAI)

Installation

  1. Clone the repository:
git clone <repository-url>
cd transform-evals
  2. Install dependencies:
pip install -r requirements.txt
  3. Set up environment variables:
cp env.example .env
# Edit .env and add your API keys

Required environment variables:

  • ANTHROPIC_API_KEY - Required for judge model and Anthropic generator models
  • GOOGLE_API_KEY - Required for Google generator models
  • OPENAI_API_KEY - Required for OpenAI generator models

Optional (for specific test cases):

  • JINA_API_KEY - For Jina AI embedding tasks
  • VOYAGE_API_KEY - For Voyage AI embedding tasks
  • COHERE_API_KEY - For Cohere API tasks
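
Before running an evaluation, you can quickly confirm that the keys you intend to use are actually set. The snippet below is only an illustrative sketch (it is not part of the repository) and assumes the variable names listed above; if your keys live only in .env, load that file into your environment first (for example with python-dotenv, or by sourcing it in your shell):

import os

# Keys listed above; ANTHROPIC_API_KEY is always needed for the judge model.
required = ["ANTHROPIC_API_KEY"]
optional = ["GOOGLE_API_KEY", "OPENAI_API_KEY", "JINA_API_KEY", "VOYAGE_API_KEY", "COHERE_API_KEY"]

missing_required = [key for key in required if not os.environ.get(key)]
missing_optional = [key for key in optional if not os.environ.get(key)]

if missing_required:
    print(f"Missing required keys: {missing_required}")
if missing_optional:
    print(f"Missing optional keys (some test cases may be skipped): {missing_optional}")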

Usage

Basic Usage

Run evaluations on all datasets with default models:

python src/evaluate_transforms.py

Running Specific Datasets

Run only the ETL dataset:

python src/evaluate_transforms.py --dataset etl

Run only the LLM embedding dataset:

python src/evaluate_transforms.py --dataset llm_api

Run multiple datasets:

python src/evaluate_transforms.py --dataset etl --dataset llm_api

Using Custom Models

Specify custom models for each provider:

# Use a specific OpenAI model
python src/evaluate_transforms.py --openai-model gpt-5.1-codex-mini --dataset etl

# Use a specific Google model
python src/evaluate_transforms.py --google-model gemini-3-flash-preview --dataset llm_api

# Use a specific Anthropic model
python src/evaluate_transforms.py --anthropic-model claude-haiku-4-5 --dataset etl

# Combine multiple models
python src/evaluate_transforms.py --openai-model gpt-5.1-codex-mini --google-model gemini-3-flash-preview --dataset etl --dataset llm_api

Web Search Configuration

Enable web search (default):

python src/evaluate_transforms.py --web-search yes

Disable web search:

python src/evaluate_transforms.py --web-search no

Or use the legacy flag:

python src/evaluate_transforms.py --no-web-search

Note: Web search is automatically enabled for llm_api dataset tasks that require API documentation lookup, even if disabled globally.

Example Commands

# Run both datasets with no web search for a specific OpenAI model
python src/evaluate_transforms.py --openai-model gpt-5.1-codex-mini --dataset etl --dataset llm_api --web-search no

# Run llm_api dataset with web search for a specific OpenAI model
python src/evaluate_transforms.py --openai-model gpt-5.2-codex-mini --dataset llm_api --web-search yes

# Run with all three providers using custom models
python src/evaluate_transforms.py \
  --openai-model gpt-5.1-codex-mini \
  --google-model gemini-3-flash-preview \
  --anthropic-model claude-haiku-4-5 \
  --dataset etl \
  --dataset llm_api

Project Structure

transform-evals/
├── src/
│   ├── config.py                     # Configuration constants
│   ├── prompts.py                    # Prompt templates
│   ├── evaluate_transforms.py        # Main entry point
│   ├── tools/                        # Tools package
│   │   ├── __init__.py
│   │   └── web_search_tool.py        # Web search tool definitions
│   ├── evaluation/                   # Evaluation modules
│   │   ├── __init__.py
│   │   ├── judge.py                  # Code judging functions (using Claude Sonnet)
│   │   ├── test_case.py              # Test case evaluation logic
│   │   └── dataset.py                # Dataset-level evaluation
│   ├── generation/                   # Code generation modules
│   │   ├── __init__.py
│   │   ├── code_generation.py        # Code generation functions
│   │   └── code_execution.py         # Code execution utilities
│   └── utils/                        # Utility modules
│       ├── __init__.py
│       ├── clients.py                # LLM client initialization
│       └── results.py                # Results processing and display
├── data/
│   ├── elt_code_eval_dataset.json    # ETL dataset
│   └── llm_api_dataset.json          # LLM API dataset
├── results/                          # Evaluation results (JSON files)
├── env.example                       # Environment variables template
└── README.md                         # This file

Evaluation Process

  1. Code Generation: Each model generates Python code for a given transformation task
  2. Code Execution: The generated code is executed with test input data
  3. Judging: Claude Sonnet evaluates the code on multiple dimensions:
    • Correctness (0-10)
    • Code Quality (0-10)
    • Efficiency (0-10)
    • Robustness (0-10)
    • Similarity to Ground Truth (0-10)
    • Overall Score (average of all scores)
  4. Results: Results are saved as JSON files and displayed in tables
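
The overall score in step 3 is the mean of the five judged dimensions. A minimal sketch of that aggregation, using illustrative dimension names that may not match the judge's exact output fields:

def overall_score(scores):
    """Average the per-dimension judge scores (each on a 0-10 scale)."""
    dimensions = [
        "correctness",
        "code_quality",
        "efficiency",
        "robustness",
        "similarity_to_ground_truth",
    ]
    return sum(scores[d] for d in dimensions) / len(dimensions)

# Hypothetical judge output
print(overall_score({
    "correctness": 9.0,
    "code_quality": 8.0,
    "efficiency": 7.0,
    "robustness": 8.0,
    "similarity_to_ground_truth": 9.0,
}))  # -> 8.2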

Output

Results are saved in the results/ directory with timestamps:

  • Individual dataset results: code_gen_eval_{dataset_name}_{timestamp}.json
  • Combined results: code_gen_eval_combined_{timestamp}.json

Each result file includes:

  • Metadata (models used, timestamp, configuration)
  • Test case results with generated code, execution results, and evaluations
  • Summary statistics (average scores, pass rates, latency metrics)
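
A result file can be inspected like any other JSON document. The snippet below is only a convenience sketch; print the top-level keys before relying on specific field names, since the exact schema is defined by the framework:

import glob
import json

# Pick the most recent combined results file (filename pattern described above).
paths = sorted(glob.glob("results/code_gen_eval_combined_*.json"))
with open(paths[-1]) as f:
    results = json.load(f)

# Inspect the structure before drilling into metadata, per-test-case results, or summaries.
print(list(results.keys()))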

Evaluation Metrics

  • Pass Rate: Percentage of test cases that pass evaluation
  • Average Score: Mean overall score across all test cases
  • Latency Metrics:
    • Code generation latency
    • Execution latency
    • Judging latency
    • Total latency
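
These metrics are plain aggregations over the per-test-case records. A hypothetical example (field names are invented for illustration and will not match the framework's schema exactly):

# Hypothetical per-test-case records for illustration only.
test_cases = [
    {"passed": True,  "overall_score": 8.2, "total_latency_s": 14.1},
    {"passed": False, "overall_score": 4.5, "total_latency_s": 9.7},
    {"passed": True,  "overall_score": 7.0, "total_latency_s": 11.3},
]

pass_rate = 100.0 * sum(tc["passed"] for tc in test_cases) / len(test_cases)
average_score = sum(tc["overall_score"] for tc in test_cases) / len(test_cases)
average_latency = sum(tc["total_latency_s"] for tc in test_cases) / len(test_cases)

print(f"Pass rate: {pass_rate:.1f}%")                     # 66.7%
print(f"Average score: {average_score:.2f}")              # 6.57
print(f"Average total latency: {average_latency:.1f}s")   # 11.7s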

Notes

  • The judge model (Claude Sonnet) requires an Anthropic API key
  • Web search is particularly useful for LLM embedding tasks that require current API documentation
  • Some test cases may require additional API keys (Jina, Voyage, Cohere); these keys are optional, and failures caused by missing keys are judged leniently
  • The evaluation framework supports concurrent execution for faster evaluation
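
The concurrent execution noted above follows the standard pattern of fanning test cases out to a worker pool. A generic sketch of that pattern (not the repository's actual code):

from concurrent.futures import ThreadPoolExecutor, as_completed

def evaluate_one(test_case):
    # Placeholder standing in for the real generate -> execute -> judge pipeline on one test case.
    return {"id": test_case["id"], "passed": True}

def evaluate_all(test_cases, max_workers=4):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(evaluate_one, tc): tc for tc in test_cases}
        for future in as_completed(futures):
            results.append(future.result())
    return results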

Troubleshooting

Missing API Keys

If you see errors about missing API keys, make sure your .env file contains ANTHROPIC_API_KEY (needed for the judge model) plus a key for each provider you are evaluating.

Rate Limiting

The framework includes automatic retry logic with exponential backoff for Google API calls, so transient rate limit errors are usually handled without intervention.
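
Exponential backoff simply means waiting progressively longer between retries. A generic sketch of the pattern (the framework's own implementation may differ):

import random
import time

def call_with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry a callable, doubling the wait after each failure (e.g. a rate limit error)."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:  # in practice, catch the provider's specific rate limit error
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)  # add jitter
            time.sleep(delay)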

Import Errors

Make sure all dependencies are installed:

pip install -r requirements.txt

License

See LICENSE file for details.
