A deterministic local LLM execution engine: not an AI agent, but a reliability engineering system.
Forge Runner is a controlled execution engine for unreliable LLM outputs. It takes a user task, asks a local model to produce a structured plan, executes that plan in a sandbox, verifies the result against pre-defined assertions, and retries with failure-specific context when the output does not satisfy the contract.
This is not a chatbot. It does not optimize for open-ended conversation.
This is not an AI agent framework. It does not build multi-agent workflows, autonomous tool ecosystems, or unconstrained planning loops.
The problem it solves is simple: LLMs are probabilistic and unreliable. Most systems call the LLM and hope the output is good. Forge Runner does not. It wraps generation in typed schemas, explicit state transitions, sandboxed execution, assertion-based critique, bounded retries, and replayable traces.
Forge Runner demonstrates how to wrap nondeterministic model behavior in deterministic software engineering controls.
Core idea: turn probabilistic model output into a deterministic, inspectable execution pipeline.
Task → Planner → Executor → Critic ──→ DONE
          ↑                    ↓
          └── Retry (informed, bounded) ◄─ FAIL
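In code, the loop above reduces to something like the following minimal sketch. The plan/execute/check functions here are inline stand-ins for illustration, not the project's real agent interfaces in app/agents/.

from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    retry_guidance: str | None = None

# Inline stand-ins for the planner, executor, and critic agents.
def plan(task: str, guidance: str | None) -> str:
    base = "print('hello')"
    return f"{base}  # revised per: {guidance}" if guidance else base

def execute(code: str) -> str:
    return "hello\n"  # pretend sandboxed stdout

def check(stdout: str) -> Verdict:
    if stdout.strip() == "hello":
        return Verdict(True)
    return Verdict(False, "stdout did not match the expected output")

def run(task: str, max_retries: int = 3) -> str:
    guidance = None
    for _ in range(max_retries + 1):
        stdout = execute(plan(task, guidance))
        verdict = check(stdout)
        if verdict.passed:
            return stdout
        guidance = verdict.retry_guidance  # informed retry, not a blind re-roll
    raise RuntimeError("exhausted retries without satisfying assertions")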
- Assertion-first critic: The planner defines machine-checkable success assertions before execution; the critic mechanically verifies them instead of asking the LLM to judge itself.
- Informed retry loop: Failed assertions and retry guidance are injected into the next planner prompt so retries target the actual failure, not a random re-roll.
- RunContext as single state object: All orchestration state flows through one typed Pydantic model, making transitions explicit and trace snapshots complete (see the sketch after this list).
- Full replayable traces: Every run emits JSONL events with context snapshots, allowing failures to be replayed, inspected, and debugged after the fact.
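As a loose illustration, a single typed state object along these lines could look as follows. The field names are assumptions made for this sketch, not the project's actual schema in app/models/.

from enum import Enum
from pydantic import BaseModel, Field

class RunState(str, Enum):
    PLANNING = "planning"
    EXECUTING = "executing"
    CRITIQUING = "critiquing"
    DONE = "done"
    FAILED = "failed"

class RunContext(BaseModel):
    # Hypothetical fields -- illustrative only, not the real schema.
    run_id: str
    task: str
    state: RunState = RunState.PLANNING
    attempt: int = 0
    max_retries: int = 3
    failed_assertions: list[str] = Field(default_factory=list)
    retry_guidance: str | None = None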
Requirements: Python 3.11+ and Ollama. Docker is recommended for sandboxed execution.
# Clone the repository
git clone https://github.com/Absolemzz/forge-runner
cd forge-runner
# Create and activate a virtual environment
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt

# Pull the default local model used by Forge Runner
ollama pull qwen2.5-coder:7b
# Confirm Ollama and project configuration are reachable
python main.py check

# Run the test suite without requiring Ollama
pytest tests/ -v

# Check local configuration and Ollama connectivity
python main.py check

# Run a task through the planner, executor, critic, and retry loop
python main.py run "Write a Python function that converts Celsius to Fahrenheit"

# Replay a previously recorded run trace
python main.py replay <run_id>

=== Forge Runner Replay: 88a03719 ===
[attempt 0] RUN_STARTED
task: "celsius_to_fahrenheit..."
[attempt 0] PLAN_GENERATED
steps: 1 | assertions: 1
[attempt 0] EXECUTION_COMPLETE
exit_code: 0 | duration: 0.026s | sandbox: subprocess
[attempt 0] CRITIQUE_RESULT
verdict: FAIL
FAIL a1: Expected stdout to match output format
[attempt 0] RETRY_INITIATED
guidance: Revise plan to address failed assertions
[attempt 1] EXECUTION_COMPLETE
exit_code: 0 | duration: 0.024s | sandbox: subprocess
...
[attempt 3] CRITIQUE_RESULT
verdict: PASS
PASS a1: stdout matched expected output
[attempt 3] RUN_COMPLETE
succeeded in 4 attempts
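Each rendered line above is derived from a raw JSONL event in the trace store. A single event might look roughly like this; the field names are assumptions for illustration, not the real envelope defined in app/models/events.py:

{"run_id": "88a03719", "attempt": 0, "event": "CRITIQUE_RESULT", "verdict": "FAIL", "failed_assertions": ["a1"], "context": {"state": "critiquing", "task": "celsius_to_fahrenheit"}}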
Forge Runner is organized around a small deterministic state machine: plan, execute, critique, retry, and terminate. The orchestrator owns state transitions, the agents own isolated responsibilities, and every component communicates through typed contracts in app/models/.
Read the full system overview in docs/architecture.md.
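To make "deterministic state machine" concrete, a transition table in the spirit of state.py could look like the sketch below. The state names are taken from the loop above; the code is illustrative, not the module's actual contents.

# Hypothetical transition table; illegal transitions fail loudly.
VALID_TRANSITIONS: dict[str, set[str]] = {
    "PLAN": {"EXECUTE"},
    "EXECUTE": {"CRITIQUE"},
    "CRITIQUE": {"RETRY", "DONE"},
    "RETRY": {"PLAN", "FAILED"},  # bounded: exceeding max retries terminates
}

def transition(current: str, nxt: str) -> str:
    if nxt not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {nxt}")
    return nxt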
app/
  orchestrator/
    engine.py          # run() entry point and main loop
    state.py           # valid state transitions
    transitions.py     # RunContext transition helpers
  agents/
    planner.py         # local LLM planning into PlanSchema
    executor.py        # sandboxed step execution
    critic.py          # assertion-based verification
  llm/
    ollama_client.py   # single Ollama API boundary
  tools/
    sandbox.py         # Docker / RestrictedPython execution
    file_ops.py        # workspace-safe file operations
  memory/
    store.py           # JSONL trace writer
    replay.py          # trace replay renderer
  models/
    schemas.py         # domain contracts
    events.py          # trace event envelope
  config/
    settings.py        # FORGE_ settings
docs/
  architecture.md
  design-decisions.md
  execution-flow.md
  agents.md
tests/
  test_planner.py
  test_executor.py
  test_critic.py
  test_orchestrator.py
  test_replay.py
examples/
  sample_tasks.md
traces/
main.py
pytest tests/ -v

The LLM is mocked in tests, so no Ollama server or local model is needed for the test suite.
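A hedged sketch of the mocking pattern, using inline stand-ins rather than the project's real fixtures:

import json

# Canned planner output: one step, one machine-checkable assertion.
CANNED_PLAN = json.dumps({
    "steps": [{"code": "print(round(25 * 9 / 5 + 32, 1))"}],
    "assertions": [{"id": "a1", "expect_stdout": "77.0"}],
})

def fake_generate(prompt: str) -> str:
    # Stands in for the single Ollama API boundary during tests.
    return CANNED_PLAN

def test_plan_contains_assertions():
    plan = json.loads(fake_generate("convert 25 C to F"))
    assert plan["assertions"], "planner output must define assertions"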
See docs/design-decisions.md for the complete engineering rationale.
- Local Ollama model instead of a cloud API for reproducible, offline execution.
- JSONL trace storage instead of a database for append-only, human-readable replay artifacts.
- Docker sandbox by default, with RestrictedPython documented as a limited fallback.
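As a rough illustration of what Docker-level isolation buys, a sandboxed step typically runs under constraints along these lines. The flags are generic Docker options, not the project's exact invocation:

docker run --rm --network=none --memory=256m --cpus=0.5 \
  -v "$PWD/workspace:/workspace" python:3.11-slim \
  python /workspace/step.py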
Configuration is driven by FORGE_ environment variables. Copy .env.example to .env and adjust values such as model name, Ollama base URL, retry count, timeout, sandbox type, Docker limits, trace directory, and workspace directory.
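A hypothetical .env sketch follows; only the FORGE_ prefix is confirmed by the project, so check .env.example for the real variable names and defaults:

# Illustrative values only -- variable names are assumptions.
FORGE_MODEL=qwen2.5-coder:7b
FORGE_OLLAMA_BASE_URL=http://localhost:11434
FORGE_MAX_RETRIES=3
FORGE_TIMEOUT_SECONDS=60
FORGE_SANDBOX=docker
FORGE_TRACE_DIR=traces
FORGE_WORKSPACE_DIR=workspace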
MIT License. See LICENSE.