A comprehensive evaluation framework for testing code generation capabilities of LLMs on data transformation tasks.
This project evaluates how well different LLM models can generate Python code for data transformation tasks. It supports multiple LLM providers (Anthropic, Google, OpenAI) and includes two main datasets:
- ETL Code Tasks: Standard data transformation tasks
- LLM Embedding Tasks: Tasks involving API calls to embedding and LLM services
- Multi-Provider Support: Evaluate models from Anthropic, Google, and OpenAI
- Web Search Integration: Optional web search tool for API documentation lookup
- Comprehensive Evaluation: Uses Claude Sonnet as a judge to evaluate code quality, correctness, efficiency, and robustness
- Concurrent Execution: Parallel evaluation of multiple models and test cases
- Detailed Results: Generates pass/fail matrices, score tables, and latency metrics
- Python 3.8+
- API keys for at least one LLM provider (Anthropic, Google, OpenAI)
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd transform-evals
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables:

  ```bash
  cp env.example .env
  # Edit .env and add your API keys
  ```

Required environment variables:

- `ANTHROPIC_API_KEY` - Required for the judge model and Anthropic generator models
- `GOOGLE_API_KEY` - Required for Google generator models
- `OPENAI_API_KEY` - Required for OpenAI generator models
Optional (for specific test cases):
- `JINA_API_KEY` - For Jina AI embedding tasks
- `VOYAGE_API_KEY` - For Voyage AI embedding tasks
- `COHERE_API_KEY` - For Cohere API tasks
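A filled-in `.env` might look like the following; the values shown are placeholders, not real keys:

```
ANTHROPIC_API_KEY=sk-ant-your-key-here
GOOGLE_API_KEY=your-key-here
OPENAI_API_KEY=sk-your-key-here

# Optional, only needed for specific test cases
JINA_API_KEY=your-key-here
VOYAGE_API_KEY=your-key-here
COHERE_API_KEY=your-key-here
```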
Run evaluations on all datasets with default models:
```bash
python src/evaluate_transforms.py
```

Run only the ETL dataset:

```bash
python src/evaluate_transforms.py --dataset etl
```

Run only the LLM embedding dataset:

```bash
python src/evaluate_transforms.py --dataset llm_api
```

Run multiple datasets:

```bash
python src/evaluate_transforms.py --dataset etl --dataset llm_api
```

Specify custom models for each provider:
```bash
# Use a specific OpenAI model
python src/evaluate_transforms.py --openai-model gpt-5.1-codex-mini --dataset etl

# Use a specific Google model
python src/evaluate_transforms.py --google-model gemini-3-flash-preview --dataset llm_api

# Use a specific Anthropic model
python src/evaluate_transforms.py --anthropic-model claude-haiku-4-5 --dataset etl

# Combine multiple models
python src/evaluate_transforms.py --openai-model gpt-5.1-codex-mini --google-model gemini-3-flash-preview --dataset etl --dataset llm_api
```

Enable web search (default):

```bash
python src/evaluate_transforms.py --web-search yes
```

Disable web search:

```bash
python src/evaluate_transforms.py --web-search no
```

Or use the legacy flag:

```bash
python src/evaluate_transforms.py --no-web-search
```

Note: Web search is automatically enabled for llm_api dataset tasks that require API documentation lookup, even if disabled globally.
```bash
# Run both datasets with no web search for a specific OpenAI model
python src/evaluate_transforms.py --openai-model gpt-5.1-codex-mini --dataset etl --dataset llm_api --web-search no

# Run llm_api dataset with web search for a specific OpenAI model
python src/evaluate_transforms.py --openai-model gpt-5.2-codex-mini --dataset llm_api --web-search yes

# Run with all three providers using custom models
python src/evaluate_transforms.py \
  --openai-model gpt-5.1-codex-mini \
  --google-model gemini-3-flash-preview \
  --anthropic-model claude-haiku-4-5 \
  --dataset etl \
  --dataset llm_api
```

```
transform-evals/
├── src/
│   ├── config.py                  # Configuration constants
│   ├── prompts.py                 # Prompt templates
│   ├── evaluate_transforms.py     # Main entry point
│   ├── tools/                     # Tools package
│   │   ├── __init__.py
│   │   └── web_search_tool.py     # Web search tool definitions
│   ├── evaluation/                # Evaluation modules
│   │   ├── __init__.py
│   │   ├── judge.py               # Code judging functions (using Claude Sonnet)
│   │   ├── test_case.py           # Test case evaluation logic
│   │   └── dataset.py             # Dataset-level evaluation
│   ├── generation/                # Code generation modules
│   │   ├── __init__.py
│   │   ├── code_generation.py     # Code generation functions
│   │   └── code_execution.py      # Code execution utilities
│   └── utils/                     # Utility modules
│       ├── __init__.py
│       ├── clients.py             # LLM client initialization
│       └── results.py             # Results processing and display
├── data/
│   ├── elt_code_eval_dataset.json # ETL dataset
│   └── llm_api_dataset.json       # LLM API dataset
├── results/                       # Evaluation results (JSON files)
├── env.example                    # Environment variables template
└── README.md                      # This file
```
- Code Generation: Each model generates Python code for a given transformation task
- Code Execution: The generated code is executed with test input data
- Judging: Claude Sonnet evaluates the code on multiple dimensions:
  - Correctness (0-10)
  - Code Quality (0-10)
  - Efficiency (0-10)
  - Robustness (0-10)
  - Similarity to Ground Truth (0-10)
  - Overall Score (average of the five dimension scores)
- Results: outputs are saved as JSON files and displayed in summary tables
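The generate → execute → judge loop described above can be sketched as follows. This is illustrative only: `generate_code`, `run_code`, and `judge_code` are hypothetical stand-ins, not the project's actual function names, and the scores are made-up sample values.

```python
# Hypothetical stand-in for a model call that returns Python source.
def generate_code(model: str, task: str) -> str:
    return "def transform(rows):\n    return [r for r in rows if r]"

def run_code(code: str, test_input):
    """Execute the generated module and call its transform() entry point."""
    namespace = {}
    exec(code, namespace)
    return namespace["transform"](test_input)

def judge_code(code: str, output) -> dict:
    """Stand-in for the judge model; returns fixed sample scores."""
    scores = {"correctness": 8, "quality": 7, "efficiency": 8,
              "robustness": 6, "similarity": 9}
    scores["overall"] = sum(scores.values()) / len(scores)
    return scores

code = generate_code("model-a", "drop empty rows")
output = run_code(code, [{"a": 1}, {}, None])
print(judge_code(code, output)["overall"])  # → 7.6
```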
Results are saved in the results/ directory with timestamps:
- Individual dataset results: `code_gen_eval_{dataset_name}_{timestamp}.json`
- Combined results: `code_gen_eval_combined_{timestamp}.json`
Each result file includes:
- Metadata (models used, timestamp, configuration)
- Test case results with generated code, execution results, and evaluations
- Summary statistics (average scores, pass rates, latency metrics)
- Pass Rate: Percentage of test cases that pass evaluation
- Average Score: Mean overall score across all test cases
- Latency Metrics:
  - Code generation latency
  - Execution latency
  - Judging latency
  - Total latency
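The latency breakdown can be modeled as a simple record; the sketch below assumes total latency is just the sum of the three stage latencies, and the field names are illustrative rather than the framework's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Latency:
    """Per-test-case timing breakdown, in seconds (illustrative fields)."""
    generation: float  # time spent generating code
    execution: float   # time spent running the generated code
    judging: float     # time spent in the judge model

    @property
    def total(self) -> float:
        return self.generation + self.execution + self.judging

lat = Latency(generation=4.2, execution=0.3, judging=2.1)
print(f"{lat.total:.1f}s")  # → 6.6s
```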
- The judge model (Claude Sonnet) requires an Anthropic API key
- Web search is particularly useful for LLM and embedding API tasks that require current API documentation
- Some test cases may require additional API keys (Jina, Voyage, Cohere) - these are optional and failures due to missing keys are evaluated leniently
- The evaluation framework supports concurrent execution for faster evaluation
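Concurrent evaluation of (model, test case) pairs can be pictured with Python's standard `concurrent.futures`; this is a minimal sketch, and `evaluate_test_case` is a hypothetical stand-in for the framework's real per-case logic:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-in for generating, executing, and judging one case.
def evaluate_test_case(model: str, case_id: int) -> dict:
    return {"model": model, "case": case_id, "passed": True}

def run_concurrently(models, case_ids, max_workers=8):
    """Fan out every (model, test case) pair across a thread pool."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(evaluate_test_case, m, c)
                   for m in models for c in case_ids]
        for fut in as_completed(futures):
            results.append(fut.result())
    return results

results = run_concurrently(["model-a", "model-b"], [1, 2, 3])
print(len(results))  # → 6, one result per (model, case) pair
```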
If you see errors about missing API keys, ensure your .env file is properly configured with all required keys.
If you encounter rate limit errors, the framework includes automatic retry logic with exponential backoff for Google API calls.
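Retry with exponential backoff typically looks like the sketch below; the error type, attempt count, and delays here are illustrative assumptions, not the framework's actual retry parameters:

```python
import random
import time

def with_backoff(call, max_attempts=5, base_delay=1.0):
    """Retry `call` on failure, doubling the delay after each attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RuntimeError:  # stand-in for a rate-limit exception
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # Exponential delay plus a little jitter to avoid thundering herd.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_backoff(flaky, base_delay=0.01))  # → ok, after two retries
```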
Make sure all dependencies are installed:

```bash
pip install -r requirements.txt
```

See LICENSE file for details.