
Forge Runner


A deterministic local LLM execution engine — not an AI agent, a reliability engineering system.

What This Is

Forge Runner is a controlled execution engine for unreliable LLM outputs. It takes a user task, asks a local model to produce a structured plan, executes that plan in a sandbox, verifies the result against pre-defined assertions, and retries with failure-specific context when the output does not satisfy the contract.

This is not a chatbot. It does not optimize for open-ended conversation.

This is not an AI agent framework. It does not build multi-agent workflows, autonomous tool ecosystems, or unconstrained planning loops.

The problem it solves is simple: LLMs are probabilistic and unreliable. Most systems call the LLM and hope the output is good. Forge Runner does not. It wraps generation in typed schemas, explicit state transitions, sandboxed execution, assertion-based critique, bounded retries, and replayable traces.

Forge Runner demonstrates how to wrap nondeterministic model behavior in deterministic software engineering controls.

Core idea: turn probabilistic model output into a deterministic, inspectable execution pipeline.

How It Works

Task → Planner → Executor → Critic ──→ DONE
         ↑                     │
         └── Retry (informed, bounded) ◄── FAIL
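The loop above can be sketched in a few lines. This is an illustrative reduction, not the actual Forge Runner API: `RunContext`, `run`, and the callback names are hypothetical stand-ins for the real orchestrator in app/orchestrator/.

```python
# Minimal sketch of the bounded, informed retry loop (illustrative names,
# not the real Forge Runner interfaces).
from dataclasses import dataclass, field

@dataclass
class RunContext:
    task: str
    attempt: int = 0
    max_attempts: int = 4
    history: list = field(default_factory=list)

def run(ctx, plan, execute, critique):
    """Plan, execute, critique; retry with failure context until PASS or budget exhausted."""
    guidance = None
    while ctx.attempt < ctx.max_attempts:
        p = plan(ctx.task, guidance)             # planner sees prior failure guidance
        result = execute(p)                       # sandboxed execution
        verdict, guidance = critique(p, result)   # mechanical assertion check
        ctx.history.append((ctx.attempt, verdict))
        if verdict == "PASS":
            return "DONE"
        ctx.attempt += 1
    return "FAIL"
```

The point of the sketch is that the retry is informed: the critic's guidance feeds the next planner call instead of re-rolling blindly.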

Key Engineering Decisions

  • Assertion-first critic: The planner defines machine-checkable success assertions before execution; the critic mechanically verifies them instead of asking the LLM to judge itself.
  • Informed retry loop: Failed assertions and retry guidance are injected into the next planner prompt so retries target the actual failure, not a random re-roll.
  • RunContext as single state object: All orchestration state flows through one typed Pydantic model, making transitions explicit and trace snapshots complete.
  • Full replayable traces: Every run emits JSONL events with context snapshots, allowing failures to be replayed, inspected, and debugged after the fact.
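The assertion-first critic can be illustrated with a toy checker. The `Assertion` shape and `stdout_matches` kind below are assumptions for illustration; the real contracts live in app/models/schemas.py.

```python
# Hedged sketch of an assertion-first critic: assertions are declared before
# execution and verified mechanically, never by asking the LLM to self-judge.
import re
from dataclasses import dataclass

@dataclass
class Assertion:
    id: str
    kind: str      # e.g. "stdout_matches" (illustrative assertion kind)
    pattern: str

def critique(assertions, result):
    """Return (verdict, failures) by checking each assertion against the result."""
    failures = []
    for a in assertions:
        if a.kind == "stdout_matches" and not re.search(a.pattern, result["stdout"]):
            failures.append(f"FAIL {a.id}: expected stdout to match {a.pattern!r}")
    return ("PASS" if not failures else "FAIL", failures)
```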

Quick Start

Requirements: Python 3.11+ and Ollama; Docker is recommended for sandboxed execution.

# Clone the repository
git clone https://github.com/Absolemzz/forge-runner
cd forge-runner

# Create and activate a virtual environment
python -m venv .venv

# Windows
.venv\Scripts\activate

# macOS/Linux
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
# Pull the default local model used by Forge Runner
ollama pull qwen2.5-coder:7b

# Confirm Ollama and project configuration are reachable
python main.py check
# Run the test suite without requiring Ollama
pytest tests/ -v

CLI Usage

# Check local configuration and Ollama connectivity
python main.py check
# Run a task through the planner, executor, critic, and retry loop
python main.py run "Write a Python function that converts Celsius to Fahrenheit"
# Replay a previously recorded run trace
python main.py replay <run_id>

Example Trace Output

=== Forge Runner Replay: 88a03719 ===
[attempt 0] RUN_STARTED
  task: "celsius_to_fahrenheit..."
[attempt 0] PLAN_GENERATED
  steps: 1 | assertions: 1
[attempt 0] EXECUTION_COMPLETE
  exit_code: 0 | duration: 0.026s | sandbox: subprocess
[attempt 0] CRITIQUE_RESULT
  verdict: FAIL
  FAIL a1: Expected stdout to match output format
[attempt 0] RETRY_INITIATED
  guidance: Revise plan to address failed assertions
[attempt 1] EXECUTION_COMPLETE
  exit_code: 0 | duration: 0.024s | sandbox: subprocess
...
[attempt 3] CRITIQUE_RESULT
  verdict: PASS
  PASS a1: stdout matched expected output
[attempt 3] RUN_COMPLETE
  succeeded in 4 attempts

Architecture

Forge Runner is organized around a small deterministic state machine: plan, execute, critique, retry, and terminate. The orchestrator owns state transitions, the agents own isolated responsibilities, and every component communicates through typed contracts in app/models/.
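An explicit transition table makes illegal state changes impossible rather than merely unlikely. The table and state names below are an illustrative sketch; the authoritative version lives in app/orchestrator/state.py and may differ.

```python
# Illustrative state-transition table for the plan/execute/critique/retry loop.
# State names are assumptions, not the project's actual enum.
VALID_TRANSITIONS = {
    "INIT": {"PLANNING"},
    "PLANNING": {"EXECUTING"},
    "EXECUTING": {"CRITIQUING"},
    "CRITIQUING": {"DONE", "RETRYING", "FAILED"},
    "RETRYING": {"PLANNING"},
}

def transition(state: str, new_state: str) -> str:
    """Allow only whitelisted transitions; anything else is a hard error."""
    if new_state not in VALID_TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```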

Read the full system overview in docs/architecture.md.

Project Structure

app/
  orchestrator/
    engine.py          # run() entry point and main loop
    state.py           # valid state transitions
    transitions.py     # RunContext transition helpers
  agents/
    planner.py         # local LLM planning into PlanSchema
    executor.py        # sandboxed step execution
    critic.py          # assertion-based verification
  llm/
    ollama_client.py   # single Ollama API boundary
  tools/
    sandbox.py         # Docker / RestrictedPython execution
    file_ops.py        # workspace-safe file operations
  memory/
    store.py           # JSONL trace writer
    replay.py          # trace replay renderer
  models/
    schemas.py         # domain contracts
    events.py          # trace event envelope
  config/
    settings.py        # FORGE_ settings
docs/
  architecture.md
  design-decisions.md
  execution-flow.md
  agents.md
tests/
  test_planner.py
  test_executor.py
  test_critic.py
  test_orchestrator.py
  test_replay.py
examples/
  sample_tasks.md
  traces/
main.py

Running Tests

pytest tests/ -v

The LLM is mocked in tests, so no Ollama server or local model is needed for the test suite.
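One common way to achieve this, shown here as a sketch rather than the project's actual test code, is to stub the client boundary with `unittest.mock` so the planner never touches the network:

```python
# Hedged sketch: mock the LLM client boundary so tests run without Ollama.
# `make_mock_llm` and `plan_task` are illustrative names, not the real API.
from unittest.mock import Mock

def make_mock_llm(canned_plan: str) -> Mock:
    """Stand-in for the Ollama client: returns a fixed plan string."""
    client = Mock()
    client.generate.return_value = canned_plan
    return client

def plan_task(client, task: str) -> str:
    # The real planner would prompt for a structured PlanSchema here.
    return client.generate(f"Plan: {task}")
```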

Design Decisions

See docs/design-decisions.md for the complete engineering rationale.

  • Local Ollama model instead of a cloud API for reproducible, offline execution.
  • JSONL trace storage instead of a database for append-only, human-readable replay artifacts.
  • Docker sandbox by default, with RestrictedPython documented as a limited fallback.

Configuration

Configuration is driven by FORGE_ environment variables. Copy .env.example to .env and adjust values such as model name, Ollama base URL, retry count, timeout, sandbox type, Docker limits, trace directory, and workspace directory.
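The loader below sketches the prefix-with-defaults pattern using the standard library; the variable names are assumptions for illustration, and .env.example is the authoritative list of real keys:

```python
# Illustrative FORGE_-prefixed settings loader with defaults.
# Key names here are hypothetical; see .env.example for the real ones.
import os
from dataclasses import dataclass

@dataclass
class ForgeSettings:
    model: str
    max_retries: int
    sandbox: str

def load_settings(env=os.environ) -> ForgeSettings:
    """Read FORGE_* variables, falling back to sensible defaults."""
    return ForgeSettings(
        model=env.get("FORGE_MODEL", "qwen2.5-coder:7b"),
        max_retries=int(env.get("FORGE_MAX_RETRIES", "3")),
        sandbox=env.get("FORGE_SANDBOX", "docker"),
    )
```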

License

MIT License. See LICENSE.
