Production-Grade AI Agent Execution on Kubernetes
The AgentTask Operator provides a robust, secure, and K8s-native abstraction for running ephemeral AI agents and code snippets. It bridges the gap between raw Pods and high-level agent frameworks, offering built-in lifecycle management, result capture, and security hardening.
- Secure Isolation:
- Rootless Execution: Tasks run as UID 1000 by default.
- Network Sandbox: Default-deny NetworkPolicies block unauthorized traffic.
- Runtime Profiles: Whitelist specific images and capabilities (e.g.,
python3.11,node18).
- Lifecycle Management:
- TTL Cleanup: Automatically delete finished tasks after a configurable duration (
ttlSecondsAfterFinished). - Finalizers: Ensure external resources (Pods, Sandboxes) are strictly cleaned up before API objects are removed.
- Timeouts & Cancellation: strictly enforce execution limits.
- TTL Cleanup: Automatically delete finished tasks after a configurable duration (
- Observability:
- Standardized Failures: Explicit reasons for failure (
ImagePullFailed,Timeout,PodFailed) in the status. - Prometheus Metrics: Built-in metrics for task counts and duration (
agenttask_total_tasks,agenttask_duration_seconds). - Structured Results: Automatically captures
result.jsonfrom the task execution.
- Standardized Failures: Explicit reasons for failure (
- Developer Experience:
agentctlCLI: A specialized tool to run, inspect, and debug agents without writing YAML.
graph TD
User["User / CLI"] -->|Creates| CR["AgentTask CR"]
Controller["AgentTask Controller"] -->|Watches| CR
Controller -->|Creates| CM["ConfigMap (Code)"]
Controller -->|Creates| Pod["Worker Pod"]
Controller -->|Enforces| NP["NetworkPolicy"]
subgraph "Execution Environment"
Pod -->|Mounts| CM
Pod -->|Runs| Container["Task Container"]
Container -->|Writes| Result["result.json"]
end
Controller -- Collects --> Status["Task Status"]
Status -- Exposes --> Result
- Kubernetes v1.25+
kubectlconfiguredcert-manager(for Webhook validation)
# 1. Install cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.16.0/cert-manager.yaml
# 2. Deploy Operator
make deploy IMG=ghcr.io/agenttask/controller:latestThe agentctl CLI is the recommended way to interact with the operator.
cd cmd/agentctl
go build -o ../../agentctl main.go
cd ../..Create a script agent.py:
import json
print("Generating insight...")
print("INSIGHT: {'value': 42}") # Standard output log
# Write structured result
with open("/workspace/artifacts/result.json", "w") as f:
json.dump({"status": "success", "insight": 42}, f)Run it:
./agentctl run agent.py --profile python3.11 --timeout 60# List tasks with standardized reasons
./agentctl list
# Output:
# NAME PHASE REASON AGE POD
# agent-sample-x8dj2 Succeeded 12s agent-sample-x8dj2-pod
# agent-fail-9k2ls Failed Timeout 5m agent-fail-9k2ls-pod
# Get detailed info & results
./agentctl describe agent-sample-x8dj2For GitOps workflows, use the CRD directly:
apiVersion: execution.agenttask.io/v1alpha1
kind: AgentTask
metadata:
name: periodic-agent
spec:
runtimeProfile: "python3.11"
timeoutSeconds: 300
ttlSecondsAfterFinished: 600 # Delete 10m after finish
code:
source: |
import os
print(f"Running secure task in K8s!")
resources:
limits:
cpu: "1"
memory: "512Mi"| Issue | Cause | Fix |
|---|---|---|
ImagePullFailed |
The RuntimeProfile maps to an invalid image or registry auth is missing. | Check controller logs and runtimeProfile config. |
Timeout |
Task took longer than timeoutSeconds. |
Optimize code or increase timeout in Spec. |
Webhook Denied |
The request violated security policies (e.g. invalid profile). | Check kubectl describe events or admission controller logs. |
See Development Guide for local setup instructions.
Apache 2.0