2 changes: 2 additions & 0 deletions .gitignore
@@ -65,3 +65,5 @@ venv.bak/

# Project specific
**/.mplconfig/

results/*
163 changes: 138 additions & 25 deletions README.md
@@ -1,55 +1,168 @@
# PowerAgentBench

A benchmark suite for evaluating AI agents on power system operational tasks.
PowerAgentBench is a benchmark suite for evaluating AI agents on power system operational and planning tasks. The current release focuses on steady-state studies and includes both conventional scripted baselines and LLM/tool-agent evaluation.

The benchmark is designed around a public/hidden split. Agents see public case data, action spaces, and tool APIs. A hidden evaluator recomputes physical validity and returns discovery, evidence, mitigation, efficiency, and reliability metrics.

## Repository Structure

```
```text
PowerAgentBench/
├── cases/ # Network case data in multiple formats
├── cases/ # Network case data in multiple formats
│ └── case39/
│ ├── pypsa/case39.nc # PyPSA netCDF format
│ ├── matpower/case39.m # MATPOWER .m format
│ └── pandapower/case39.json # PandaPower JSON format
├── benchmarks/ # Benchmark definitions and configs
│ ├── pypsa/case39.nc # PyPSA netCDF format
│ ├── matpower/case39.m # MATPOWER .m format
│ └── pandapower/case39.json # PandaPower JSON format
├── benchmarks/ # Benchmark definitions and task configs
│ └── steady/
│ └── level_1/
│ ├── README.md # Full benchmark specification
│ ├── actionspace.json # Action contract and operating limits
│ ├── actioncost.json # Per-step action costs
│ ├── baseline_summary.json
│ └── solution_template.json
├── scripts/ # Runnable entry points
│ ├── build_case.py # Rebuild the stressed scenario
│ ├── convert_case.py # Export to MATPOWER and PandaPower
│ └── evaluate_solution.py # Score a solution
└── poweragentbench/ # Shared library code
└── benchmark_utils.py # Case construction and scoring
│ ├── level_1/ # N-1 steady-state audit and mitigation
│ │ ├── README.md # Full Level 1 benchmark specification
│ │ ├── actionspace.json # Action contract and operating limits
│ │ ├── actioncost.json # Per-step action costs
│ │ ├── baseline_summary.json
│ │ └── solution_template.json
│ └── level_2/ # Agentic N-2 search and mitigation
│ ├── README.md # Full Level 2 benchmark specification
│ ├── .env.example # Template for private Ollama configuration
│ └── prompts/
│ └── steady_n2_llm_prompt.json # Shared LLM tool-use prompt template
├── scripts/ # Runnable entry points
│ ├── build_case.py # Rebuild the stressed Level 1 scenario
│ ├── convert_case.py # Export case39 to MATPOWER and PandaPower
│ ├── evaluate_solution.py # Score a Level 1 solution
│ ├── run_steady_n2_baselines.py # Run Level 2 scripted baselines
│ └── run_steady_n2_ollama_eval.py # Run Level 2 Ollama-hosted LLM agents
└── poweragentbench/ # Shared library code
├── benchmark_utils.py # Level 1 case construction and scoring
├── steady_state_agentic.py # Level 2 DC N-2 evaluator and baselines
├── llm_agent_adapter.py # JSON-command LLM adapter
└── ollama_client.py # Ollama generate/chat client
```

## Quick Start
## Installation

```bash
pip install -e .
```

## Quick Start

### Level 1: N-1 steady-state audit and mitigation

```bash
# Rebuild the benchmark case from source
python scripts/build_case.py

# Export to MATPOWER and PandaPower formats
python scripts/convert_case.py

# Evaluate a solution
python scripts/evaluate_solution.py --solution benchmarks/steady/level_1/solution_template.json
python scripts/evaluate_solution.py \
--solution benchmarks/steady/level_1/solution_template.json
```

### Level 2: Agentic N-2 search and mitigation

Run scripted baselines on deterministic variants of the existing IEEE 39-bus case:

```bash
python scripts/run_steady_n2_baselines.py \
--case-source case39 \
--cases 8 \
--budget 80 \
--report-k 20
```

Run deployed Ollama LLM agents:

```bash
python scripts/run_steady_n2_ollama_eval.py \
--case-source case39 \
--cases 8 \
--budget 80 \
--report-k 20 \
--max-turns 12 \
--prompt-template benchmarks/steady/level_2/prompts/steady_n2_llm_prompt.json
```

Outputs are written under `results/steady_n2/` as per-case CSVs, aggregate CSVs, tool logs, API debug files, and LaTeX table rows.

## Case Formats

The IEEE 39-bus stressed scenario is provided in three formats so that agents and solvers are not tied to a single tool:

- **PyPSA** (`cases/case39/pypsa/case39.nc`): the primary format used by the evaluator.
- **PandaPower** (`cases/case39/pandapower/case39.json`): for use with PandaPower-based tools.
- **MATPOWER** (`cases/case39/matpower/case39.m`): for use with MATPOWER or MATPOWER-compatible solvers.
- **PyPSA** (`cases/case39/pypsa/case39.nc`): primary format used by the Level 1 evaluator and by the Level 2 case39 converter.
- **PandaPower** (`cases/case39/pandapower/case39.json`): for PandaPower-based tools.
- **MATPOWER** (`cases/case39/matpower/case39.m`): for MATPOWER or MATPOWER-compatible solvers.

## Benchmarks

See `benchmarks/steady/level_1/README.md` for the full specification of the first benchmark task.
### Steady Level 1

`benchmarks/steady/level_1/` evaluates N-1 steady-state audit and mitigation on a stressed IEEE 39-bus case. The agent receives a case, a published contingency list, and a bounded action space. The evaluator checks base-case and contingency violations after the submitted actions.

See:

```text
benchmarks/steady/level_1/README.md
```

### Steady Level 2

`benchmarks/steady/level_2/` evaluates agentic N-2 contingency search and optional mitigation. The agent must spend a limited validation budget, submit evidence-backed ranked contingencies, and optionally improve the hidden post-action violation score.

The default case source is the existing IEEE 39-bus case distributed in this repository. The runner converts it to a lightweight DC representation and creates deterministic operating-point variants from fixed seeds. A synthetic fallback is also available for development.

See:

```text
benchmarks/steady/level_2/README.md
```

## Ollama Configuration

Private or internal Ollama endpoints should not be committed to the repository. Configure them through a local `.env` file.

```bash
cp benchmarks/steady/level_2/.env.example benchmarks/steady/level_2/.env
```

Example local settings:

```bash
POWERAGENTBENCH_OLLAMA_URL=http://localhost:11434/api/generate
POWERAGENTBENCH_OLLAMA_MODELS=qwen3.5:latest mistral-nemo:12b command-r:35b
POWERAGENTBENCH_OLLAMA_TEMPERATURE=0.0
POWERAGENTBENCH_OLLAMA_NUM_CTX=16384
POWERAGENTBENCH_OLLAMA_API_MODE=generate
POWERAGENTBENCH_OLLAMA_THINK=false
POWERAGENTBENCH_OLLAMA_SCHEMA_FORMAT=true
```

The local `.env` file is ignored by Git. You may also pass the same settings through command-line flags or process environment variables.
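
As a rough illustration of how such a file can be consumed, the sketch below parses `KEY=VALUE` lines and splits the model list on spaces or commas. The helper names (`load_env_file`, `resolve_setting`, `parse_model_list`) are hypothetical. So is the precedence, where the process environment overrides the file. Neither is taken from the benchmark runner.

```python
import os
import re
from pathlib import Path


def load_env_file(path):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings = {}
    env_path = Path(path)
    if not env_path.exists():
        return settings
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)  # keep any '=' inside the value
        settings[key.strip()] = value.strip()
    return settings


def resolve_setting(name, file_settings, default=None):
    """Assumed precedence: process environment first, then the .env file."""
    return os.environ.get(name, file_settings.get(name, default))


def parse_model_list(value):
    """Split a space- or comma-separated model list into model names."""
    return [name for name in re.split(r"[,\s]+", value or "") if name]
```

For example, `parse_model_list(resolve_setting("POWERAGENTBENCH_OLLAMA_MODELS", load_env_file("benchmarks/steady/level_2/.env"), ""))` would yield one entry per configured model.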

## Metrics

PowerAgentBench returns per-case and aggregate metrics, including:

- submitted and evidence-backed top-20 recall,
- found top-20 recall,
- evidence rate,
- best severity capture,
- severity regret,
- post-action violation and violation reduction,
- action cost,
- invalid tool calls,
- schema repairs and type coercions,
- duplicate validation requests,
- explicit submission and auto-finalization indicators,
- validation budget use.

These metrics distinguish answer quality, tool evidence, search quality, mitigation quality, and workflow compliance.

## Development Notes

- Use Level 1 to test basic steady-state action submission and physical validation.
- Use Level 2 to test agentic behavior, tool use, validation-budget allocation, and LLM workflows.
- Keep hidden oracle quantities and private endpoint URLs outside the public repository.
- Regenerate results after modifying prompts, adapters, scoring rules, or case-generation settings.
18 changes: 18 additions & 0 deletions benchmarks/steady/level_2/.env.example
@@ -0,0 +1,18 @@
# Copy this file to benchmarks/steady/level_2/.env and edit locally.
# Do not commit .env. The benchmark runner loads this file automatically.

# Use your local or internal Ollama endpoint. The internal UBI endpoint used in
# experiments is intentionally not committed to the public repository.
POWERAGENTBENCH_OLLAMA_URL=http://localhost:11434/api/generate

# Space-separated or comma-separated list of models to evaluate.
POWERAGENTBENCH_OLLAMA_MODELS=qwen3.5:latest mistral-nemo:12b gpt-oss:20b command-r:35b

# Deterministic generation and extended context for tool transcripts.
POWERAGENTBENCH_OLLAMA_TEMPERATURE=0.0
POWERAGENTBENCH_OLLAMA_NUM_CTX=16384

# Ollama API options.
POWERAGENTBENCH_OLLAMA_API_MODE=generate
POWERAGENTBENCH_OLLAMA_THINK=false
POWERAGENTBENCH_OLLAMA_SCHEMA_FORMAT=true
1 change: 1 addition & 0 deletions benchmarks/steady/level_2/.gitignore
@@ -0,0 +1 @@
.env
170 changes: 170 additions & 0 deletions benchmarks/steady/level_2/README.md
@@ -0,0 +1,170 @@
# Level 2: Agentic N-2 Steady-State Evaluation

This task evaluates whether scripted or LLM agents can search N-2 contingencies, spend a limited validation budget, submit evidence-backed rankings, and optionally mitigate hidden top-20 violations.

The default case source is the existing IEEE 39-bus case distributed in this repository. The runner converts it to a lightweight DC representation and creates deterministic operating-point variants from fixed seeds. A synthetic fallback is also available for development.
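
The idea of seeded operating-point variants can be illustrated with a short sketch. This is not the runner's implementation: the function name `make_variants`, the per-bus load scaling, and the `spread` parameter are all assumptions chosen for illustration. It only shows that a fixed seed per case makes the variants reproducible.

```python
import random


def make_variants(base_loads_mw, n_cases, base_seed=0, spread=0.1):
    """Return n_cases load vectors, each bus scaled by a seeded random factor.

    Hypothetical scheme: case k uses seed base_seed + k, so rerunning with the
    same arguments reproduces exactly the same operating points.
    """
    variants = []
    for case_index in range(n_cases):
        rng = random.Random(base_seed + case_index)  # one fixed seed per case
        variants.append(
            [load * rng.uniform(1 - spread, 1 + spread) for load in base_loads_mw]
        )
    return variants
```

Calling `make_variants(loads, 8)` twice returns identical lists, which is the property that lets results from different agents be compared case by case.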

## Goal

For each case, an agent should:

1. inspect the public case summary,
2. use public screening tools to prioritize N-2 candidates,
3. spend the limited validation budget on promising candidates,
4. submit a ranked list supported by validation evidence,
5. optionally call redispatch to reduce hidden post-action violations.

## Public and Hidden Split

The agent sees the public case summary, candidate count, tool API, validation budget, prompt template, and tool observations. The evaluator separately computes the hidden oracle over all N-2 candidates and returns discovery, evidence, mitigation, efficiency, and reliability metrics.

The agent never sees hidden oracle labels during the run.

## Public Tools

The Level 2 LLM interface exposes the following tools:

- `case_summary`: return network size, base severity, candidate count, and remaining budget.
- `rank_base_loading`: rank candidates by public base-flow stress.
- `rank_lodf`: rank candidates using an LODF-style approximate screen.
- `validate`: run exact public validation for selected candidates and consume budget.
- `redispatch`: apply bounded preventive redispatch on a focus set.
- `submit`: submit ranked contingencies, mitigation, and diagnosis.
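
Because `validate` consumes budget, an agent or harness needs some bookkeeping for it. The sketch below is one plausible way to do this. The class name `ValidationBudget` and its policy are assumptions, not the evaluator's actual behavior. In this sketch, duplicates are counted but not charged, and requests beyond the remaining budget are dropped.

```python
class ValidationBudget:
    """Track which N-2 pairs were validated and how much budget is left."""

    def __init__(self, budget):
        self.remaining = budget
        self.validated = set()
        self.duplicate_requests = 0

    def admit(self, pairs):
        """Return the pairs that may be validated now, charging one unit each."""
        admitted = []
        for i, j in pairs:
            key = (min(i, j), max(i, j))  # [i, j] and [j, i] are the same outage
            if key in self.validated:
                self.duplicate_requests += 1  # assumed: counted, not charged
                continue
            if self.remaining == 0:
                break  # assumed: requests beyond the budget are dropped
            self.validated.add(key)
            self.remaining -= 1
            admitted.append(key)
        return admitted
```

With `--budget 80`, a tracker like this would be created once per case as `ValidationBudget(80)` and consulted before every `validate` call.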

## Run Scripted Baselines

```bash
python scripts/run_steady_n2_baselines.py \
--case-source case39 \
--cases 8 \
--budget 80 \
--report-k 20
```

Outputs are written to:

```text
results/steady_n2/
```

The baseline runner writes per-case CSVs, aggregate CSVs, and LaTeX table rows.

## Configure Ollama Without Committing Private Endpoints

Copy the example environment file and edit it locally:

```bash
cp benchmarks/steady/level_2/.env.example benchmarks/steady/level_2/.env
```

The local `.env` file is ignored by Git. Use it for private or internal network endpoints such as a non-public Ollama server.

Required setting:

```bash
POWERAGENTBENCH_OLLAMA_URL=http://localhost:11434/api/generate
```

Recommended model list:

```bash
POWERAGENTBENCH_OLLAMA_MODELS=qwen3.5:latest mistral-nemo:12b command-r:35b
```

Recommended generation settings used in the paper experiments:

```bash
POWERAGENTBENCH_OLLAMA_TEMPERATURE=0.0
POWERAGENTBENCH_OLLAMA_NUM_CTX=16384
POWERAGENTBENCH_OLLAMA_API_MODE=generate
POWERAGENTBENCH_OLLAMA_THINK=false
POWERAGENTBENCH_OLLAMA_SCHEMA_FORMAT=true
```

The corresponding Ollama request includes:

```json
"options": {
"temperature": 0.0,
"num_ctx": 16384
}
```
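
A minimal sketch of assembling such a request body in Python follows. The function name `build_generate_payload` is hypothetical, and so is the mapping from `POWERAGENTBENCH_OLLAMA_SCHEMA_FORMAT` to Ollama's `format` field. The `model`, `prompt`, `stream`, `options`, and `format` keys are standard fields of Ollama's generate endpoint.

```python
import json


def build_generate_payload(model, prompt, temperature=0.0, num_ctx=16384,
                           json_format=True):
    """Build a request body for Ollama's /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one complete response instead of chunks
        "options": {"temperature": temperature, "num_ctx": num_ctx},
    }
    if json_format:
        # Assumption: schema-format mode is approximated here by JSON mode.
        payload["format"] = "json"
    return payload


if __name__ == "__main__":
    body = build_generate_payload("mistral-nemo:12b", "Return one JSON command.")
    print(json.dumps(body, indent=2))
```

The printed body can be sent with any HTTP client to the URL configured in `POWERAGENTBENCH_OLLAMA_URL`.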

## Run Deployed Ollama Agents

After configuring `.env`, run:

```bash
python scripts/run_steady_n2_ollama_eval.py \
--case-source case39 \
--cases 8 \
--budget 80 \
--report-k 20 \
--max-turns 12 \
--prompt-template benchmarks/steady/level_2/prompts/steady_n2_llm_prompt.json
```

You can override the model list from the command line:

```bash
python scripts/run_steady_n2_ollama_eval.py \
--models qwen3.5:latest mistral-nemo:12b gpt-oss:20b command-r:35b \
--case-source case39 \
--cases 8 \
--budget 80 \
--report-k 20 \
--max-turns 12 \
--prompt-template benchmarks/steady/level_2/prompts/steady_n2_llm_prompt.json
```

The runner writes per-case CSVs, aggregate CSVs, tool logs, API debug files, and LaTeX table rows under:

```text
results/steady_n2/
```

## Prompt Template

The default prompt template is:

```text
benchmarks/steady/level_2/prompts/steady_n2_llm_prompt.json
```

The prompt specifies the JSON command schema, allowed tools, validation budget, canonical contingency representation, and expected workflow. Keeping the prompt template in the repository makes multi-model LLM comparisons reproducible.
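
To make the command schema concrete, the sketch below parses one model turn and checks it against the rules stated in the prompt: one JSON object, an allowed tool name, and contingencies given as pairs of distinct branch ids in range. The function names and error strings are illustrative assumptions. The repository's `llm_agent_adapter.py` may behave differently, for example by repairing near-valid output instead of rejecting it.

```python
import json

ALLOWED_TOOLS = {"case_summary", "rank_base_loading", "rank_lodf",
                 "validate", "redispatch", "submit"}
PAIR_FIELDS = ("contingencies", "focus", "reported")  # names used in the prompt


def canonical_pair(pair, n_branch):
    """Return a sorted (i, j) tuple, or None if the pair is not valid."""
    ok = (
        isinstance(pair, list)
        and len(pair) == 2
        and all(isinstance(b, int) and not isinstance(b, bool) for b in pair)
        and pair[0] != pair[1]
        and all(0 <= b < n_branch for b in pair)
    )
    return tuple(sorted(pair)) if ok else None


def parse_command(text, n_branch):
    """Return (command, None) on success or (None, error message) on failure."""
    try:
        cmd = json.loads(text)
    except json.JSONDecodeError:
        return None, "not valid JSON"
    tool = cmd.get("tool") if isinstance(cmd, dict) else None
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        return None, "unknown or missing tool"
    args = cmd.get("args", {})
    if not isinstance(args, dict):
        return None, "args must be an object"
    clean = dict(args)
    for field in PAIR_FIELDS:
        if field in clean:
            if not isinstance(clean[field], list):
                return None, f"{field} must be a list of pairs"
            pairs = [canonical_pair(p, n_branch) for p in clean[field]]
            if None in pairs:
                return None, f"{field} must be pairs of distinct branch ids"
            clean[field] = pairs
    return {"tool": tool, "args": clean}, None
```

A rejected turn would plausibly be recorded under `invalid_tool_calls`, although the exact accounting is defined by the evaluator and not by this sketch.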

## Output Metrics

The runner returns per-case and aggregate CSV files with:

- `validated_calls`,
- `reported_top20_recall`,
- `validated_top20_recall`,
- `found_top20_recall`,
- `evidence_rate`,
- `best_capture_validated`,
- `severity_regret`,
- `post_top20_violation`,
- `violation_reduction`,
- `action_cost`,
- `invalid_tool_calls`,
- `schema_repairs`,
- `type_coercions`,
- `duplicate_validation_requests`,
- `submitted_explicitly`,
- `auto_finalized`,
- `validation_budget_used`.

These fields separate search quality, evidence quality, tool compliance, budget use, mitigation, and workflow completion.
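
This README does not spell out the formulas behind these fields. The sketch below therefore shows only one plausible reading of the recall and evidence metrics. The function names and definitions are assumptions, not the evaluator's code: recall is taken as the fraction of hidden top-k pairs covered, and evidence rate as the fraction of reported pairs that were validated.

```python
def top_k_recall(candidate_pairs, oracle_top_k):
    """Fraction of the hidden top-k pairs that appear among the candidates."""
    oracle = set(oracle_top_k)
    if not oracle:
        return 0.0
    return len(oracle & set(candidate_pairs)) / len(oracle)


def evidence_rate(reported, validated):
    """Fraction of reported pairs that are backed by a validation call."""
    if not reported:
        return 0.0
    validated_set = set(validated)
    return sum(pair in validated_set for pair in reported) / len(reported)
```

Under this reading, `reported_top20_recall` would be `top_k_recall(reported, oracle_top20)`, and `validated_top20_recall` would apply the same function to the reported pairs that were also validated.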

## Evaluation Regimes

- **Open**: users can inspect public files and debug agents locally.
- **Sealed**: agents interact only with the public tool server while oracle labels and evaluator scripts remain private.
- **Stress**: agents are rerun across seeds, prompt variants, or scenario variants to measure reliability.

## Notes

- The Level 2 evaluator is a lightweight DC approximation intended to test the benchmark mechanics.
- It is not a replacement for AC security analysis.
- The same public-agent and hidden-evaluator protocol can be connected to AC power flow, SCOPF, voltage-security studies, or commercial simulators.
3 changes: 3 additions & 0 deletions benchmarks/steady/level_2/prompts/steady_n2_llm_prompt.json
@@ -0,0 +1,3 @@
{
"system": "You are an engineering agent for PowerAgentBench-SS. Your task is to find high-severity N-2 branch-outage contingencies under a limited validation budget, support submitted claims with tool evidence, optionally mitigate, and submit a ranked list. Return exactly one JSON object per turn and no prose. Use this schema: {\\\"tool\\\":\\\"<tool_name>\\\",\\\"args\\\":{...}}. Allowed commands: {\\\"tool\\\":\\\"case_summary\\\",\\\"args\\\":{}}, {\\\"tool\\\":\\\"rank_base_loading\\\",\\\"args\\\":{\\\"top_n\\\":80}}, {\\\"tool\\\":\\\"rank_lodf\\\",\\\"args\\\":{\\\"top_n\\\":80}}, {\\\"tool\\\":\\\"validate\\\",\\\"args\\\":{\\\"contingencies\\\":[[2,11],[11,37],[11,43],[11,21],[11,41],[11,45],[11,26],[11,15],[11,16],[11,29]]}}, {\\\"tool\\\":\\\"redispatch\\\",\\\"args\\\":{\\\"focus\\\":[[2,11],[11,37],[11,43],[11,21],[11,41]]}}, {\\\"tool\\\":\\\"submit\\\",\\\"args\\\":{\\\"reported\\\":[[2,11],[11,37],[11,43],[11,21],[11,41]],\\\"diagnosis\\\":\\\"brief evidence-backed summary\\\"}}. Important rules: a contingency is a pair of branch ids [i,j], not bus ids. Branch ids are integers from 0 to n_branch-1. Use the canonical field name contingencies for validate and reported for submit. Use ranking tools first, then validate large batches of promising pairs up to the remaining validation budget. Do not validate only two example pairs unless the budget is nearly exhausted. Do not repeat already validated pairs. Always call submit before the final turn. Do not use case_name, bus, branch/bus dictionaries, natural language, or markdown."
}