Power-Agent · kosmylo · May 17, 2026
diff --git a/.gitignore b/.gitignore
@@ -65,3 +65,5 @@ venv.bak/
 
 # Project specific
 **/.mplconfig/
+
+results/*
diff --git a/README.md b/README.md
@@ -1,55 +1,168 @@
 # PowerAgentBench
 
-A benchmark suite for evaluating AI agents on power system operational tasks.
+PowerAgentBench is a benchmark suite for evaluating AI agents on power system operational and planning tasks. The current release focuses on steady-state studies and includes both conventional scripted baselines and LLM/tool-agent evaluation.
+
+The benchmark is designed around a public/hidden split. Agents see public case data, action spaces, and tool APIs. A hidden evaluator recomputes physical validity and returns discovery, evidence, mitigation, efficiency, and reliability metrics.
 
 ## Repository Structure
 
-```
+```text
 PowerAgentBench/
-├── cases/                          # Network case data in multiple formats
+├── cases/                                      # Network case data in multiple formats
 │   └── case39/
-│       ├── pypsa/case39.nc         # PyPSA netCDF format
-│       ├── matpower/case39.m       # MATPOWER .m format
-│       └── pandapower/case39.json  # PandaPower JSON format
-├── benchmarks/                     # Benchmark definitions and configs
+│       ├── pypsa/case39.nc                     # PyPSA netCDF format
+│       ├── matpower/case39.m                   # MATPOWER .m format
+│       └── pandapower/case39.json              # PandaPower JSON format
+├── benchmarks/                                 # Benchmark definitions and task configs
 │   └── steady/
-│       └── level_1/
-│           ├── README.md           # Full benchmark specification
-│           ├── actionspace.json    # Action contract and operating limits
-│           ├── actioncost.json     # Per-step action costs
-│           ├── baseline_summary.json
-│           └── solution_template.json
-├── scripts/                        # Runnable entry points
-│   ├── build_case.py               # Rebuild the stressed scenario
-│   ├── convert_case.py             # Export to MATPOWER and PandaPower
-│   └── evaluate_solution.py        # Score a solution
-└── poweragentbench/                # Shared library code
-    └── benchmark_utils.py          # Case construction and scoring
+│       ├── level_1/                            # N-1 steady-state audit and mitigation
+│       │   ├── README.md                       # Full Level 1 benchmark specification
+│       │   ├── actionspace.json                # Action contract and operating limits
+│       │   ├── actioncost.json                 # Per-step action costs
+│       │   ├── baseline_summary.json
+│       │   └── solution_template.json
+│       └── level_2/                            # Agentic N-2 search and mitigation
+│           ├── README.md                       # Full Level 2 benchmark specification
+│           ├── .env.example                    # Template for private Ollama configuration
+│           └── prompts/
+│               └── steady_n2_llm_prompt.json   # Shared LLM tool-use prompt template
+├── scripts/                                    # Runnable entry points
+│   ├── build_case.py                           # Rebuild the stressed Level 1 scenario
+│   ├── convert_case.py                         # Export case39 to MATPOWER and PandaPower
+│   ├── evaluate_solution.py                    # Score a Level 1 solution
+│   ├── run_steady_n2_baselines.py              # Run Level 2 scripted baselines
+│   └── run_steady_n2_ollama_eval.py            # Run Level 2 Ollama-hosted LLM agents
+└── poweragentbench/                            # Shared library code
+    ├── benchmark_utils.py                      # Level 1 case construction and scoring
+    ├── steady_state_agentic.py                 # Level 2 DC N-2 evaluator and baselines
+    ├── llm_agent_adapter.py                    # JSON-command LLM adapter
+    └── ollama_client.py                        # Ollama generate/chat client
 ```
 
-## Quick Start
+## Installation
 
 ```bash
 pip install -e .
+```
+
+## Quick Start
 
+### Level 1: N-1 steady-state audit and mitigation
+
+```bash
 # Rebuild the benchmark case from source
 python scripts/build_case.py
 
 # Export to MATPOWER and PandaPower formats
 python scripts/convert_case.py
 
 # Evaluate a solution
-python scripts/evaluate_solution.py --solution benchmarks/steady/level_1/solution_template.json
+python scripts/evaluate_solution.py \
+  --solution benchmarks/steady/level_1/solution_template.json
+```
+
+### Level 2: Agentic N-2 search and mitigation
+
+Run scripted baselines on deterministic variants of the existing IEEE 39-bus case:
+
+```bash
+python scripts/run_steady_n2_baselines.py \
+  --case-source case39 \
+  --cases 8 \
+  --budget 80 \
+  --report-k 20
+```
+
+Run deployed Ollama LLM agents:
+
+```bash
+python scripts/run_steady_n2_ollama_eval.py \
+  --case-source case39 \
+  --cases 8 \
+  --budget 80 \
+  --report-k 20 \
+  --max-turns 12 \
+  --prompt-template benchmarks/steady/level_2/prompts/steady_n2_llm_prompt.json
 ```
 
+Outputs are written under `results/steady_n2/` as per-case CSVs, aggregate CSVs, tool logs, API debug files, and LaTeX table rows.
+
 ## Case Formats
 
 The IEEE 39-bus stressed scenario is provided in three formats so that agents and solvers are not tied to a single tool:
 
-- **PyPSA** (`cases/case39/pypsa/case39.nc`): the primary format used by the evaluator.
-- **PandaPower** (`cases/case39/pandapower/case39.json`): for use with PandaPower-based tools.
-- **MATPOWER** (`cases/case39/matpower/case39.m`): for use with MATPOWER or MATPOWER-compatible solvers.
+- **PyPSA** (`cases/case39/pypsa/case39.nc`): primary format used by the Level 1 evaluator and by the Level 2 case39 converter.
+- **PandaPower** (`cases/case39/pandapower/case39.json`): for PandaPower-based tools.
+- **MATPOWER** (`cases/case39/matpower/case39.m`): for MATPOWER or MATPOWER-compatible solvers.
 
 ## Benchmarks
 
-See `benchmarks/steady/level_1/README.md` for the full specification of the first benchmark task.
+### Steady Level 1
+
+`benchmarks/steady/level_1/` evaluates N-1 steady-state audit and mitigation on a stressed IEEE 39-bus case. The agent receives a case, a published contingency list, and a bounded action space. The evaluator applies the submitted actions and then checks for base-case and contingency violations.
+
+See:
+
+```text
+benchmarks/steady/level_1/README.md
+```
+
+### Steady Level 2
+
+`benchmarks/steady/level_2/` evaluates agentic N-2 contingency search and optional mitigation. The agent must spend a limited validation budget, submit evidence-backed ranked contingencies, and optionally improve the hidden post-action violation score.
+
+The default case source is the existing IEEE 39-bus case distributed in this repository. The runner converts it to a lightweight DC representation and creates deterministic operating-point variants from fixed seeds. A synthetic fallback is also available for development.
+
+See:
+
+```text
+benchmarks/steady/level_2/README.md
+```
+
+## Ollama Configuration
+
+Private or internal Ollama endpoints should not be committed to the repository. Configure them through a local `.env` file.
+
+```bash
+cp benchmarks/steady/level_2/.env.example benchmarks/steady/level_2/.env
+```
+
+Example local settings:
+
+```bash
+POWERAGENTBENCH_OLLAMA_URL=http://localhost:11434/api/generate
+POWERAGENTBENCH_OLLAMA_MODELS=qwen3.5:latest mistral-nemo:12b command-r:35b
+POWERAGENTBENCH_OLLAMA_TEMPERATURE=0.0
+POWERAGENTBENCH_OLLAMA_NUM_CTX=16384
+POWERAGENTBENCH_OLLAMA_API_MODE=generate
+POWERAGENTBENCH_OLLAMA_THINK=false
+POWERAGENTBENCH_OLLAMA_SCHEMA_FORMAT=true
+```
+
+The local `.env` file is ignored by Git. You may also pass the same settings through command-line flags or process environment variables.
+
+## Metrics
+
+PowerAgentBench returns per-case and aggregate metrics, including:
+
+- reported (submitted) and validated (evidence-backed) top-20 recall,
+- found top-20 recall,
+- evidence rate,
+- best severity capture,
+- severity regret,
+- post-action violation and violation reduction,
+- action cost,
+- invalid tool calls,
+- schema repairs and type coercions,
+- duplicate validation requests,
+- explicit submission and auto-finalization indicators,
+- validation budget use.
+
+These metrics distinguish answer quality, tool evidence, search quality, mitigation quality, and workflow compliance.
+
+## Development Notes
+
+- Use Level 1 to test basic steady-state action submission and physical validation.
+- Use Level 2 to test agentic behavior, tool use, validation-budget allocation, and LLM workflows.
+- Keep hidden oracle quantities and private endpoint URLs outside the public repository.
+- Regenerate results after modifying prompts, adapters, scoring rules, or case-generation settings.
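Note on the Metrics section of README.md: the diff lists the recall and evidence metrics but does not define them. The sketch below shows one plausible reading, not the evaluator's actual scoring code. It assumes that reported recall is the share of hidden top-k contingencies present in the submitted list, that validated recall additionally requires the pair to have been validated, and that evidence rate is the share of submitted pairs backed by a validation call. All function names are hypothetical.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[int, int]


def canon(pair: Iterable[int]) -> Pair:
    """Order a branch-id pair so that [11, 2] and [2, 11] compare equal."""
    i, j = pair
    return (i, j) if i <= j else (j, i)


def canon_set(pairs: Iterable[Iterable[int]]) -> Set[Pair]:
    return {canon(p) for p in pairs}


def reported_topk_recall(oracle_topk, reported) -> float:
    """Assumed definition: share of oracle top-k pairs that were submitted."""
    oracle = canon_set(oracle_topk)
    if not oracle:
        return 0.0
    return len(oracle & canon_set(reported)) / len(oracle)


def validated_topk_recall(oracle_topk, reported, validated) -> float:
    """Assumed definition: share of oracle top-k pairs submitted AND validated."""
    oracle = canon_set(oracle_topk)
    if not oracle:
        return 0.0
    backed = canon_set(reported) & canon_set(validated)
    return len(oracle & backed) / len(oracle)


def evidence_rate(reported, validated) -> float:
    """Assumed definition: share of submitted pairs backed by validation."""
    rep = canon_set(reported)
    if not rep:
        return 0.0
    return len(rep & canon_set(validated)) / len(rep)


# Illustrative data only; these are not benchmark results.
oracle = [(2, 11), (11, 37), (11, 43), (11, 21)]
reported = [(11, 2), (11, 37), (5, 6)]
validated = [(2, 11), (5, 6)]
print(reported_topk_recall(oracle, reported))
print(validated_topk_recall(oracle, reported, validated))
print(evidence_rate(reported, validated))
```

Canonicalizing each pair before comparison keeps an agent from being penalized for submitting `[11, 2]` where the oracle stores `[2, 11]`. Whether the hidden evaluator does the same is not stated in the diff.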
diff --git a/benchmarks/steady/level_2/.env.example b/benchmarks/steady/level_2/.env.example
@@ -0,0 +1,18 @@
+# Copy this file to benchmarks/steady/level_2/.env and edit locally.
+# Do not commit .env. The benchmark runner loads this file automatically.
+
+# Use your local or internal Ollama endpoint. The internal UBI endpoint used in
+# experiments is intentionally not committed to the public repository.
+POWERAGENTBENCH_OLLAMA_URL=http://localhost:11434/api/generate
+
+# Space-separated or comma-separated list of models to evaluate.
+POWERAGENTBENCH_OLLAMA_MODELS=qwen3.5:latest mistral-nemo:12b gpt-oss:20b command-r:35b
+
+# Deterministic generation and extended context for tool transcripts.
+POWERAGENTBENCH_OLLAMA_TEMPERATURE=0.0
+POWERAGENTBENCH_OLLAMA_NUM_CTX=16384
+
+# Ollama API options.
+POWERAGENTBENCH_OLLAMA_API_MODE=generate
+POWERAGENTBENCH_OLLAMA_THINK=false
+POWERAGENTBENCH_OLLAMA_SCHEMA_FORMAT=true
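Note on `.env.example`: the template says the model list may be space-separated or comma-separated and that the runner loads the file automatically, but the loader itself is not part of this diff. A minimal sketch of how such a file could be parsed is shown below. The helpers `parse_env_text` and `split_models` are hypothetical and may differ from what the runner does.

```python
import re
from typing import Dict, List


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # Not a KEY=VALUE line; ignored in this sketch.
        settings[key.strip()] = value.strip()
    return settings


def split_models(value: str) -> List[str]:
    """Split a model list on commas and/or whitespace, dropping empty items."""
    return [m for m in re.split(r"[,\s]+", value.strip()) if m]


example = """
# Local settings
POWERAGENTBENCH_OLLAMA_URL=http://localhost:11434/api/generate
POWERAGENTBENCH_OLLAMA_MODELS=qwen3.5:latest, mistral-nemo:12b gpt-oss:20b
"""
settings = parse_env_text(example)
print(split_models(settings["POWERAGENTBENCH_OLLAMA_MODELS"]))
```

Splitting on both separators in one pattern accepts mixed input such as `a, b c`, which matches the "space-separated or comma-separated" wording in the template.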
diff --git a/benchmarks/steady/level_2/.gitignore b/benchmarks/steady/level_2/.gitignore
@@ -0,0 +1 @@
+.env
diff --git a/benchmarks/steady/level_2/README.md b/benchmarks/steady/level_2/README.md
@@ -0,0 +1,170 @@
+# Level 2: Agentic N-2 Steady-State Evaluation
+
+This task evaluates whether scripted or LLM agents can search N-2 contingencies, spend a limited validation budget, submit evidence-backed rankings, and optionally mitigate hidden top-20 violations.
+
+The default case source is the existing IEEE 39-bus case distributed in this repository. The runner converts it to a lightweight DC representation and creates deterministic operating-point variants from fixed seeds. A synthetic fallback is also available for development.
+
+## Goal
+
+For each case, an agent should:
+
+1. inspect the public case summary,
+2. use public screening tools to prioritize N-2 candidates,
+3. spend the limited validation budget on promising candidates,
+4. submit a ranked list supported by validation evidence,
+5. optionally call redispatch to reduce hidden post-action violations.
+
+## Public and Hidden Split
+
+The agent sees the public case summary, candidate count, tool API, validation budget, prompt template, and tool observations. The evaluator separately computes the hidden oracle over all N-2 candidates and returns discovery, evidence, mitigation, efficiency, and reliability metrics.
+
+The agent never sees hidden oracle labels during the run.
+
+## Public Tools
+
+The Level 2 LLM interface exposes the following tools:
+
+- `case_summary`: return network size, base severity, candidate count, and remaining budget.
+- `rank_base_loading`: rank candidates by public base-flow stress.
+- `rank_lodf`: rank candidates using an LODF-style approximate screen.
+- `validate`: run exact public validation for selected candidates and consume budget.
+- `redispatch`: apply bounded preventive redispatch on a focus set.
+- `submit`: submit ranked contingencies, mitigation, and diagnosis.
+
+## Run Scripted Baselines
+
+```bash
+python scripts/run_steady_n2_baselines.py \
+  --case-source case39 \
+  --cases 8 \
+  --budget 80 \
+  --report-k 20
+```
+
+Outputs are written to:
+
+```text
+results/steady_n2/
+```
+
+The baseline runner writes per-case CSVs, aggregate CSVs, and LaTeX table rows.
+
+## Configure Ollama Without Committing Private Endpoints
+
+Copy the example environment file and edit it locally:
+
+```bash
+cp benchmarks/steady/level_2/.env.example benchmarks/steady/level_2/.env
+```
+
+The local `.env` file is ignored by Git. Use it for private or internal network endpoints such as a non-public Ollama server.
+
+Required setting:
+
+```bash
+POWERAGENTBENCH_OLLAMA_URL=http://localhost:11434/api/generate
+```
+
+Recommended model list:
+
+```bash
+POWERAGENTBENCH_OLLAMA_MODELS=qwen3.5:latest mistral-nemo:12b command-r:35b
+```
+
+Recommended generation settings used in the paper experiments:
+
+```bash
+POWERAGENTBENCH_OLLAMA_TEMPERATURE=0.0
+POWERAGENTBENCH_OLLAMA_NUM_CTX=16384
+POWERAGENTBENCH_OLLAMA_API_MODE=generate
+POWERAGENTBENCH_OLLAMA_THINK=false
+POWERAGENTBENCH_OLLAMA_SCHEMA_FORMAT=true
+```
+
+The corresponding Ollama request includes:
+
+```json
+"options": {
+  "temperature": 0.0,
+  "num_ctx": 16384
+}
+```
+
+## Run Deployed Ollama Agents
+
+After configuring `.env`, run:
+
+```bash
+python scripts/run_steady_n2_ollama_eval.py \
+  --case-source case39 \
+  --cases 8 \
+  --budget 80 \
+  --report-k 20 \
+  --max-turns 12 \
+  --prompt-template benchmarks/steady/level_2/prompts/steady_n2_llm_prompt.json
+```
+
+You can override the model list from the command line:
+
+```bash
+python scripts/run_steady_n2_ollama_eval.py \
+  --models qwen3.5:latest mistral-nemo:12b gpt-oss:20b command-r:35b \
+  --case-source case39 \
+  --cases 8 \
+  --budget 80 \
+  --report-k 20 \
+  --max-turns 12 \
+  --prompt-template benchmarks/steady/level_2/prompts/steady_n2_llm_prompt.json
+```
+
+The runner writes per-case CSVs, aggregate CSVs, tool logs, API debug files, and LaTeX table rows under:
+
+```text
+results/steady_n2/
+```
+
+## Prompt Template
+
+The default prompt template is:
+
+```text
+benchmarks/steady/level_2/prompts/steady_n2_llm_prompt.json
+```
+
+The prompt specifies the JSON command schema, allowed tools, validation budget, canonical contingency representation, and expected workflow. Keeping the prompt template in the repository makes multi-model LLM comparisons reproducible.
+
+## Output Metrics
+
+The runner writes per-case and aggregate CSV files with the following columns:
+
+- `validated_calls`,
+- `reported_top20_recall`,
+- `validated_top20_recall`,
+- `found_top20_recall`,
+- `evidence_rate`,
+- `best_capture_validated`,
+- `severity_regret`,
+- `post_top20_violation`,
+- `violation_reduction`,
+- `action_cost`,
+- `invalid_tool_calls`,
+- `schema_repairs`,
+- `type_coercions`,
+- `duplicate_validation_requests`,
+- `submitted_explicitly`,
+- `auto_finalized`,
+- `validation_budget_used`.
+
+These fields separate search quality, evidence quality, tool compliance, budget use, mitigation, and workflow completion.
+
+## Evaluation Regimes
+
+- **Open**: users can inspect public files and debug agents locally.
+- **Sealed**: agents interact only with the public tool server while oracle labels and evaluator scripts remain private.
+- **Stress**: agents are rerun across seeds, prompt variants, or scenario variants to measure reliability.
+
+## Notes
+
+- The Level 2 evaluator is a lightweight DC approximation intended to test the benchmark mechanics.
+- It is not a replacement for AC security analysis.
+- The same public-agent and hidden-evaluator protocol can be connected to AC power flow, SCOPF, voltage-security studies, or commercial simulators.
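Note on the Level 2 README's Ollama section: the README shows only the `options` fragment of the request. The sketch below shows how a full `generate`-mode request body might be assembled from the documented settings. `model`, `system`, `prompt`, `stream`, and `options` are standard fields of Ollama's `/api/generate` request. The function name and the fallback defaults are assumptions, and the `THINK` and `SCHEMA_FORMAT` settings are left out because the diff does not show how the runner maps them.

```python
import json
import os
from typing import Mapping, Optional


def build_generate_payload(
    model: str,
    system: str,
    prompt: str,
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Build a request body for POST /api/generate (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        "model": model,
        "system": system,
        "prompt": prompt,
        "stream": False,  # Ask for one complete response instead of chunks.
        "options": {
            # Fallback values mirror the documented settings; whether the
            # runner uses the same defaults is an assumption.
            "temperature": float(env.get("POWERAGENTBENCH_OLLAMA_TEMPERATURE", "0.0")),
            "num_ctx": int(env.get("POWERAGENTBENCH_OLLAMA_NUM_CTX", "16384")),
        },
    }


payload = build_generate_payload(
    model="mistral-nemo:12b",
    system="Return exactly one JSON object per turn.",
    prompt='{"observation": "start"}',
    env={"POWERAGENTBENCH_OLLAMA_TEMPERATURE": "0.0",
         "POWERAGENTBENCH_OLLAMA_NUM_CTX": "16384"},
)
print(json.dumps(payload["options"]))
```

The payload is only built and printed here. Sending it requires a reachable Ollama endpoint, which is configured through `POWERAGENTBENCH_OLLAMA_URL`.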
diff --git a/benchmarks/steady/level_2/prompts/steady_n2_llm_prompt.json b/benchmarks/steady/level_2/prompts/steady_n2_llm_prompt.json
@@ -0,0 +1,3 @@
+{
+  "system": "You are an engineering agent for PowerAgentBench-SS. Your task is to find high-severity N-2 branch-outage contingencies under a limited validation budget, support submitted claims with tool evidence, optionally mitigate, and submit a ranked list. Return exactly one JSON object per turn and no prose. Use this schema: {\\\"tool\\\":\\\"<tool_name>\\\",\\\"args\\\":{...}}. Allowed commands: {\\\"tool\\\":\\\"case_summary\\\",\\\"args\\\":{}}, {\\\"tool\\\":\\\"rank_base_loading\\\",\\\"args\\\":{\\\"top_n\\\":80}}, {\\\"tool\\\":\\\"rank_lodf\\\",\\\"args\\\":{\\\"top_n\\\":80}}, {\\\"tool\\\":\\\"validate\\\",\\\"args\\\":{\\\"contingencies\\\":[[2,11],[11,37],[11,43],[11,21],[11,41],[11,45],[11,26],[11,15],[11,16],[11,29]]}}, {\\\"tool\\\":\\\"redispatch\\\",\\\"args\\\":{\\\"focus\\\":[[2,11],[11,37],[11,43],[11,21],[11,41]]}}, {\\\"tool\\\":\\\"submit\\\",\\\"args\\\":{\\\"reported\\\":[[2,11],[11,37],[11,43],[11,21],[11,41]],\\\"diagnosis\\\":\\\"brief evidence-backed summary\\\"}}. Important rules: a contingency is a pair of branch ids [i,j], not bus ids. Branch ids are integers from 0 to n_branch-1. Use the canonical field name contingencies for validate and reported for submit. Use ranking tools first, then validate large batches of promising pairs up to the remaining validation budget. Do not validate only two example pairs unless the budget is nearly exhausted. Do not repeat already validated pairs. Always call submit before the final turn. Do not use case_name, bus, branch/bus dictionaries, natural language, or markdown."
+}
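Note on the prompt template: the system prompt fixes a one-object-per-turn command schema, the canonical field names (`contingencies` for `validate`, `reported` for `submit`, and `focus` in the `redispatch` example), and the rule that contingencies are pairs of integer branch ids in `0..n_branch-1`. The sketch below shows how an adapter could check a model reply against those rules. It is not the code in `llm_agent_adapter.py`. The function names are hypothetical, and rejecting pairs with `i == j` is an added assumption about what an N-2 outage means.

```python
import json
from typing import List, Tuple

ALLOWED_TOOLS = {"case_summary", "rank_base_loading", "rank_lodf",
                 "validate", "redispatch", "submit"}
# Field that carries the branch-id pairs for each pair-taking tool.
PAIR_FIELDS = {"validate": "contingencies", "redispatch": "focus",
               "submit": "reported"}


def check_pairs(pairs, n_branch: int) -> List[Tuple[int, int]]:
    """Validate [i, j] branch-id pairs and return them in sorted order."""
    if not isinstance(pairs, list):
        raise ValueError("expected a list of [i, j] pairs")
    out = []
    for pair in pairs:
        is_pair = (isinstance(pair, list) and len(pair) == 2 and
                   all(isinstance(b, int) and not isinstance(b, bool)
                       for b in pair))
        if not is_pair:
            raise ValueError(f"not a pair of integer branch ids: {pair!r}")
        i, j = pair
        if not (0 <= i < n_branch and 0 <= j < n_branch):
            raise ValueError(f"branch id out of range: {pair!r}")
        if i == j:  # Assumption: an N-2 outage removes two distinct branches.
            raise ValueError(f"pair repeats one branch: {pair!r}")
        out.append((min(i, j), max(i, j)))
    return out


def parse_command(text: str, n_branch: int):
    """Parse one model reply into (tool, args) or raise ValueError."""
    try:
        cmd = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(cmd, dict):
        raise ValueError("reply must be a single JSON object")
    tool = cmd.get("tool")
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    args = cmd.get("args", {})
    if not isinstance(args, dict):
        raise ValueError("args must be a JSON object")
    field = PAIR_FIELDS.get(tool)
    if field is not None:
        if field not in args:
            raise ValueError(f"{tool} requires args.{field}")
        args = dict(args)
        args[field] = check_pairs(args[field], n_branch)
    return tool, args


# n_branch=46 is an illustrative value, not read from the case file.
reply = '{"tool":"validate","args":{"contingencies":[[11,2],[11,37]]}}'
print(parse_command(reply, n_branch=46))
```

Rejected replies would presumably count toward metrics such as `invalid_tool_calls`, though the diff does not show how the adapter records them.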