From da58c1a9e8fe32df8cbe55cf5a91dc20e7ba953c Mon Sep 17 00:00:00 2001 From: MK Date: Wed, 4 Mar 2026 04:07:23 -0500 Subject: [PATCH 1/5] feat: add file_create tool with disk persistence and k8s-pod-rightsizer skill - Add file_create builtin tool that writes files to disk and returns structured JSON with path for channel upload and cross-tool reference - Files are written to the agent's .forge/files/ directory via FilesDir context value, with fallback to $TMPDIR/forge-files/ - Add FilesDir to LLMExecutorConfig and inject into execution context - Fix Slack file extraction to preserve raw content for typed files - Add k8s-pod-rightsizer embedded skill with apply workflow instructions - Update docs for tools, runtime, and skills --- docs/runtime.md | 14 + docs/skills.md | 29 + docs/tools.md | 31 + forge-cli/runtime/runner.go | 1 + forge-core/runtime/audit.go | 17 +- forge-core/runtime/loop.go | 50 +- forge-core/tools/builtins/builtins_test.go | 193 +++ forge-core/tools/builtins/file_create.go | 120 ++ forge-core/tools/builtins/register.go | 1 + forge-plugins/channels/slack/slack.go | 8 +- .../embedded/k8s-pod-rightsizer/SKILL.md | 516 ++++++++ .../scripts/k8s-pod-rightsizer.sh | 1082 +++++++++++++++++ forge-skills/local/registry_embedded_test.go | 5 +- 13 files changed, 2059 insertions(+), 8 deletions(-) create mode 100644 forge-core/tools/builtins/file_create.go create mode 100644 forge-skills/local/embedded/k8s-pod-rightsizer/SKILL.md create mode 100644 forge-skills/local/embedded/k8s-pod-rightsizer/scripts/k8s-pod-rightsizer.sh diff --git a/docs/runtime.md b/docs/runtime.md index d08fd13..230ddd8 100644 --- a/docs/runtime.md +++ b/docs/runtime.md @@ -161,6 +161,20 @@ forge serve logs The daemon forks `forge run` in the background with `setsid`, writes state to `.forge/serve.json`, and redirects output to `.forge/serve.log`. Passphrase prompting for encrypted secrets happens in the parent process (which has TTY access) before forking. 
+## File Output Directory + +The runtime configures a `FilesDir` for tool-generated files (e.g., from `file_create`). This directory defaults to `<WorkDir>/.forge/files/` and is injected into the execution context so tools can write files that other tools can reference by path. + +``` +<WorkDir>/ + .forge/ + files/ ← file_create output (patches.yaml, reports, etc.) + sessions/ ← conversation persistence + memory/ ← long-term memory +``` + +The `FilesDir` is set via `LLMExecutorConfig.FilesDir` and made available to tools through `runtime.FilesDirFromContext(ctx)`. See [Tools — File Create](tools.md#file-create) for details. + ## Conversation Memory For details on session persistence, context window management, compaction, and long-term memory, see [Memory](memory.md). diff --git a/docs/skills.md b/docs/skills.md index ca3de1f..48c0dbf 100644 --- a/docs/skills.md +++ b/docs/skills.md @@ -150,6 +150,7 @@ forge skills list --tags kubernetes,incident-response | `tavily-search` | — | Search the web using Tavily AI search API | `tavily-search.sh` | | `tavily-research` | — | Deep multi-source research via Tavily API | `tavily-research.sh`, `tavily-research-poll.sh` | | `k8s-incident-triage` | sre | Read-only Kubernetes incident triage using kubectl | — (binary-backed) | +| `k8s-pod-rightsizer` | sre | Analyze workload metrics and produce CPU/memory rightsizing recommendations with optional apply | — (binary-backed) | | `code-review` | developer | AI-powered code review for diffs and files | `code-review-diff.sh`, `code-review-file.sh` | | `code-review-standards` | developer | Initialize and manage code review standards | — (template-based) | | `code-review-github` | developer | Post code review results to GitHub PRs | — (binary-backed) | @@ -218,6 +219,34 @@ The skill accepts two input modes: Requires: `kubectl`, optional `KUBECONFIG`, `K8S_API_DOMAIN`, `DEFAULT_NAMESPACE` environment variables. 
+ +### Kubernetes Pod Rightsizer Skill + +The `k8s-pod-rightsizer` skill analyzes real workload metrics (Prometheus or metrics-server fallback) and produces policy-constrained CPU/memory rightsizing recommendations: + +```bash +forge skills add k8s-pod-rightsizer +``` + +This skill operates in three modes: + +| Mode | Purpose | Mutates Cluster | +|------|---------|-----------------| +| `dry-run` | Report recommendations only (default) | No | +| `plan` | Generate strategic merge patch YAMLs | No | +| `apply` | Execute patches with rollback bundle | Yes (requires `i_accept_risk: true`) | + +**Key features:** + +- Deterministic formulas — no LLM-based guessing for recommendations +- Policy model with per-namespace and per-workload overrides (safety factors, min/max bounds, step constraints) +- Prometheus p95 metrics with metrics-server fallback +- Automatic rollback bundle generation in apply mode +- Workload classification: over-provisioned, under-provisioned, right-sized, limit-bound, insufficient-data + +**Apply workflow:** The skill's built-in `mode=apply` handles rollback bundles, strategic merge patches via `kubectl patch`, and rollout verification. Do not manually run `kubectl apply -f <file>` — use `mode=apply` with `i_accept_risk: true` instead. + +Requires: `bash`, `kubectl`, `jq`, `curl`. Optional: `KUBECONFIG`, `K8S_API_DOMAIN`, `PROMETHEUS_URL`, `PROMETHEUS_TOKEN`, `POLICY_FILE`, `DEFAULT_NAMESPACE`. + ### Codegen React Skill The `codegen-react` skill scaffolds and iterates on **Vite + React** applications with Tailwind CSS: diff --git a/docs/tools.md b/docs/tools.md index 030cfd9..3c91835 100644 --- a/docs/tools.md +++ b/docs/tools.md @@ -24,6 +24,7 @@ Tools are capabilities that an LLM agent can invoke during execution. 
Forge prov | `uuid_generate` | Generate UUID v4 identifiers | | `math_calculate` | Evaluate mathematical expressions | | `web_search` | Search the web for quick lookups and recent information | +| `file_create` | Create a downloadable file, written to the agent's `.forge/files/` directory | | `read_skill` | Load full instructions for an available skill on demand | | `memory_search` | Search long-term memory (when enabled) | | `memory_get` | Read memory files (when enabled) | @@ -80,6 +81,36 @@ tools: | 6 | **Environment isolation** | Only `PATH`, `HOME`, `LANG`, explicit passthrough vars, and proxy vars | | 7 | **Output limits** | Configurable max output size (default: 1MB) to prevent memory exhaustion | +## File Create + +The `file_create` tool generates downloadable files that are both written to disk and uploaded to the user's channel (Slack/Telegram). + +| Field | Description | +|-------|-------------| +| `filename` | Name with extension (e.g., `patches.yaml`, `report.json`) | +| `content` | Full file content as text | + +**Output JSON** includes `filename`, `content`, `mime_type`, and `path`. The `path` field contains the absolute disk location, allowing other tools (e.g., `kubectl apply -f <path>`) to reference the file. + +**File location:** Files are written to the agent's `.forge/files/` directory (under `WorkDir`). The runtime injects this path via `FilesDir` in the executor context. When running outside the full runtime (e.g., tests), the tool falls back to `$TMPDIR/forge-files/`. + +**Allowed extensions:** + +| Extension | MIME Type | +|-----------|-----------| +| `.md` | `text/markdown` | +| `.json` | `application/json` | +| `.yaml`, `.yml` | `text/yaml` | +| `.txt`, `.log` | `text/plain` | +| `.csv` | `text/csv` | +| `.sh` | `text/x-shellscript` | +| `.xml` | `text/xml` | +| `.html` | `text/html` | +| `.py` | `text/x-python` | +| `.ts` | `text/typescript` | + +Filenames with path separators (`/`, `\`) or traversal patterns (`..`) are rejected. 
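For illustration, a call such as `{"filename": "patches.yaml", "content": "---\napiVersion: apps/v1"}` returns JSON along these lines (the `path` shown assumes a hypothetical `WorkDir` of `/srv/agent`; field names and the MIME mapping follow the tables above):

```json
{
  "filename": "patches.yaml",
  "content": "---\napiVersion: apps/v1",
  "mime_type": "text/yaml",
  "path": "/srv/agent/.forge/files/patches.yaml"
}
```

Other tools can then reference the file by that absolute `path`.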
+ ## Memory Tools When [long-term memory](memory.md) is enabled, two additional tools are registered: diff --git a/forge-cli/runtime/runner.go b/forge-cli/runtime/runner.go index a4410b2..3b3f51a 100644 --- a/forge-cli/runtime/runner.go +++ b/forge-cli/runtime/runner.go @@ -383,6 +383,7 @@ func (r *Runner) Run(ctx context.Context) error { Logger: r.logger, ModelName: mc.Client.Model, CharBudget: charBudget, + FilesDir: filepath.Join(r.cfg.WorkDir, ".forge", "files"), } // Initialize memory persistence (enabled by default). diff --git a/forge-core/runtime/audit.go b/forge-core/runtime/audit.go index 2db26eb..f6af274 100644 --- a/forge-core/runtime/audit.go +++ b/forge-core/runtime/audit.go @@ -64,9 +64,10 @@ func (a *AuditLogger) Emit(event AuditEvent) { a.mu.Unlock() } -// Context key types for correlation and task IDs. +// Context key types for correlation IDs, task IDs, and file directories. type correlationIDKey struct{} type taskIDKey struct{} +type filesDirKey struct{} // WithCorrelationID stores a correlation ID in the context. func WithCorrelationID(ctx context.Context, id string) context.Context { @@ -96,6 +97,20 @@ func TaskIDFromContext(ctx context.Context) string { return "" } +// WithFilesDir stores a files directory path in the context. +func WithFilesDir(ctx context.Context, dir string) context.Context { + return context.WithValue(ctx, filesDirKey{}, dir) +} + +// FilesDirFromContext retrieves the files directory from the context. +// Returns "" if not set. +func FilesDirFromContext(ctx context.Context) string { + if dir, ok := ctx.Value(filesDirKey{}).(string); ok { + return dir + } + return "" +} + // GenerateID produces a 16-character hex random ID using crypto/rand. 
func GenerateID() string { b := make([]byte, 8) diff --git a/forge-core/runtime/loop.go b/forge-core/runtime/loop.go index 8de27a4..427f93b 100644 --- a/forge-core/runtime/loop.go +++ b/forge-core/runtime/loop.go @@ -31,6 +31,7 @@ type LLMExecutor struct { modelName string // resolved model name for context budget charBudget int // resolved character budget maxToolResultChars int // computed from char budget + filesDir string // directory for file_create output } // LLMExecutorConfig configures the LLM executor. @@ -45,6 +46,7 @@ type LLMExecutorConfig struct { Logger Logger ModelName string // model name for context-aware budgeting CharBudget int // explicit char budget override (0 = auto from model) + FilesDir string // directory for file_create output (default: $TMPDIR/forge-files) } // NewLLMExecutor creates a new LLMExecutor with the given configuration. @@ -93,11 +95,16 @@ func NewLLMExecutor(cfg LLMExecutorConfig) *LLMExecutor { modelName: cfg.ModelName, charBudget: budget, maxToolResultChars: toolLimit, + filesDir: cfg.FilesDir, } } // Execute processes a message through the LLM agent loop. func (e *LLMExecutor) Execute(ctx context.Context, task *a2a.Task, msg *a2a.Message) (*a2a.Message, error) { + if e.filesDir != "" { + ctx = WithFilesDir(ctx, e.filesDir) + } + mem := NewMemory(e.systemPrompt, e.charBudget, e.modelName) // Try to recover session from disk. If found, the disk snapshot @@ -239,13 +246,31 @@ func (e *LLMExecutor) Execute(ctx context.Context, task *a2a.Task, msg *a2a.Mess return nil, fmt.Errorf("after tool exec hook: %w", err) } - // Track large tool outputs for pass-through in the response. - if len(result) > largeToolOutputThreshold { + // Handle file_create tool: always create a file part. + // For other tools with large output, detect content type. 
+ if tc.Function.Name == "file_create" { + var fc struct { + Filename string `json:"filename"` + Content string `json:"content"` + MimeType string `json:"mime_type"` + } + if err := json.Unmarshal([]byte(result), &fc); err == nil && fc.Filename != "" { + largeToolOutputs = append(largeToolOutputs, a2a.Part{ + Kind: a2a.PartKindFile, + File: &a2a.FileContent{ + Name: fc.Filename, + MimeType: fc.MimeType, + Bytes: []byte(fc.Content), + }, + }) + } + } else if len(result) > largeToolOutputThreshold { + name, mime := detectFileType(result, tc.Function.Name) largeToolOutputs = append(largeToolOutputs, a2a.Part{ Kind: a2a.PartKindFile, File: &a2a.FileContent{ - Name: tc.Function.Name + "-output.md", - MimeType: "text/markdown", + Name: name, + MimeType: mime, Bytes: []byte(result), }, }) @@ -327,6 +352,23 @@ func a2aMessageToLLM(msg a2a.Message) llm.ChatMessage { } } +// detectFileType inspects tool output content and returns an appropriate +// filename and MIME type. JSON and YAML content gets typed extensions; +// everything else defaults to markdown. +func detectFileType(content, toolName string) (filename, mimeType string) { + trimmed := strings.TrimSpace(content) + if len(trimmed) > 0 && (trimmed[0] == '{' || trimmed[0] == '[') { + // Quick check: try to parse as JSON. + if json.Valid([]byte(trimmed)) { + return toolName + "-output.json", "application/json" + } + } + if strings.HasPrefix(trimmed, "---") { + return toolName + "-output.yaml", "text/yaml" + } + return toolName + "-output.md", "text/markdown" +} + // llmMessageToA2A converts an LLM chat message to an A2A message. // Any extra parts (e.g. large tool output files) are appended after the text part. 
func llmMessageToA2A(msg llm.ChatMessage, extraParts ...a2a.Part) *a2a.Message { diff --git a/forge-core/tools/builtins/builtins_test.go b/forge-core/tools/builtins/builtins_test.go index 9add8be..624e237 100644 --- a/forge-core/tools/builtins/builtins_test.go +++ b/forge-core/tools/builtins/builtins_test.go @@ -6,7 +6,10 @@ import ( "net/http" "net/http/httptest" "os" + "path/filepath" "strings" + + "github.com/initializ/forge/forge-core/runtime" "testing" "github.com/initializ/forge/forge-core/tools" @@ -21,6 +24,7 @@ func TestRegisterAll(t *testing.T) { expected := []string{ "http_request", "json_parse", "csv_parse", "datetime_now", "uuid_generate", "math_calculate", "web_search", + "file_create", } for _, name := range expected { if reg.Get(name) == nil { @@ -374,6 +378,195 @@ func TestWebSearchTool_ExplicitPerplexity(t *testing.T) { } } +func TestFileCreateTool(t *testing.T) { + tool := GetByName("file_create") + if tool == nil { + t.Fatal("expected file_create tool to exist") + } + + // Clean up temp files after all subtests. + defer func() { _ = os.RemoveAll(filepath.Join(os.TempDir(), "forge-files")) }() + + t.Run("valid YAML file", func(t *testing.T) { + args, _ := json.Marshal(map[string]string{ + "filename": "patches.yaml", + "content": "---\napiVersion: apps/v1\nkind: Deployment", + }) + result, err := tool.Execute(context.Background(), args) + if err != nil { + t.Fatalf("Execute error: %v", err) + } + var out map[string]string + if err := json.Unmarshal([]byte(result), &out); err != nil { + t.Fatalf("output is not valid JSON: %v", err) + } + if out["filename"] != "patches.yaml" { + t.Errorf("filename: got %q, want %q", out["filename"], "patches.yaml") + } + if out["mime_type"] != "text/yaml" { + t.Errorf("mime_type: got %q, want %q", out["mime_type"], "text/yaml") + } + if out["content"] != "---\napiVersion: apps/v1\nkind: Deployment" { + t.Errorf("content mismatch: got %q", out["content"]) + } + // Verify path field and disk persistence. 
+ if out["path"] == "" { + t.Fatal("expected non-empty path field") + } + diskContent, err := os.ReadFile(out["path"]) + if err != nil { + t.Fatalf("file not found at path %q: %v", out["path"], err) + } + if string(diskContent) != out["content"] { + t.Errorf("disk content mismatch: got %q, want %q", string(diskContent), out["content"]) + } + }) + + t.Run("valid JSON file", func(t *testing.T) { + args, _ := json.Marshal(map[string]string{ + "filename": "report.json", + "content": `{"key":"value"}`, + }) + result, err := tool.Execute(context.Background(), args) + if err != nil { + t.Fatalf("Execute error: %v", err) + } + var out map[string]string + if err := json.Unmarshal([]byte(result), &out); err != nil { + t.Fatalf("output is not valid JSON: %v", err) + } + if out["mime_type"] != "application/json" { + t.Errorf("mime_type: got %q, want %q", out["mime_type"], "application/json") + } + if out["path"] == "" { + t.Fatal("expected non-empty path field") + } + }) + + t.Run("valid Python file", func(t *testing.T) { + args, _ := json.Marshal(map[string]string{ + "filename": "script.py", + "content": "print('hello')", + }) + result, err := tool.Execute(context.Background(), args) + if err != nil { + t.Fatalf("Execute error: %v", err) + } + var out map[string]string + if err := json.Unmarshal([]byte(result), &out); err != nil { + t.Fatalf("output is not valid JSON: %v", err) + } + if out["mime_type"] != "text/x-python" { + t.Errorf("mime_type: got %q, want %q", out["mime_type"], "text/x-python") + } + }) + + t.Run("valid TypeScript file", func(t *testing.T) { + args, _ := json.Marshal(map[string]string{ + "filename": "index.ts", + "content": "const x: number = 1;", + }) + result, err := tool.Execute(context.Background(), args) + if err != nil { + t.Fatalf("Execute error: %v", err) + } + var out map[string]string + if err := json.Unmarshal([]byte(result), &out); err != nil { + t.Fatalf("output is not valid JSON: %v", err) + } + if out["mime_type"] != "text/typescript" { + 
t.Errorf("mime_type: got %q, want %q", out["mime_type"], "text/typescript") + } + }) + + t.Run("path traversal rejected", func(t *testing.T) { + args, _ := json.Marshal(map[string]string{ + "filename": "../evil.sh", + "content": "rm -rf /", + }) + _, err := tool.Execute(context.Background(), args) + if err == nil { + t.Error("expected error for path traversal") + } + }) + + t.Run("unsupported extension rejected", func(t *testing.T) { + args, _ := json.Marshal(map[string]string{ + "filename": "malware.exe", + "content": "bad", + }) + _, err := tool.Execute(context.Background(), args) + if err == nil { + t.Error("expected error for unsupported extension") + } + }) + + t.Run("empty filename rejected", func(t *testing.T) { + args, _ := json.Marshal(map[string]string{ + "filename": "", + "content": "hello", + }) + _, err := tool.Execute(context.Background(), args) + if err == nil { + t.Error("expected error for empty filename") + } + }) + + t.Run("empty content succeeds", func(t *testing.T) { + args, _ := json.Marshal(map[string]string{ + "filename": "empty.txt", + "content": "", + }) + result, err := tool.Execute(context.Background(), args) + if err != nil { + t.Fatalf("Execute error: %v", err) + } + var out map[string]string + if err := json.Unmarshal([]byte(result), &out); err != nil { + t.Fatalf("output is not valid JSON: %v", err) + } + if out["content"] != "" { + t.Errorf("expected empty content, got %q", out["content"]) + } + // Verify empty file exists on disk. 
+ diskContent, err := os.ReadFile(out["path"]) + if err != nil { + t.Fatalf("file not found at path %q: %v", out["path"], err) + } + if len(diskContent) != 0 { + t.Errorf("expected empty file on disk, got %d bytes", len(diskContent)) + } + }) + + t.Run("uses FilesDir from context", func(t *testing.T) { + customDir := filepath.Join(t.TempDir(), ".forge", "files") + ctx := runtime.WithFilesDir(context.Background(), customDir) + args, _ := json.Marshal(map[string]string{ + "filename": "ctx-test.yaml", + "content": "hello: world", + }) + result, err := tool.Execute(ctx, args) + if err != nil { + t.Fatalf("Execute error: %v", err) + } + var out map[string]string + if err := json.Unmarshal([]byte(result), &out); err != nil { + t.Fatalf("output is not valid JSON: %v", err) + } + wantPath := filepath.Join(customDir, "ctx-test.yaml") + if out["path"] != wantPath { + t.Errorf("path: got %q, want %q", out["path"], wantPath) + } + diskContent, err := os.ReadFile(wantPath) + if err != nil { + t.Fatalf("file not found at %q: %v", wantPath, err) + } + if string(diskContent) != "hello: world" { + t.Errorf("disk content: got %q, want %q", string(diskContent), "hello: world") + } + }) +} + func TestAllToolsHaveCategory(t *testing.T) { for _, tool := range All() { if tool.Category() != tools.CategoryBuiltin { diff --git a/forge-core/tools/builtins/file_create.go b/forge-core/tools/builtins/file_create.go new file mode 100644 index 0000000..a1f4895 --- /dev/null +++ b/forge-core/tools/builtins/file_create.go @@ -0,0 +1,120 @@ +package builtins + +import ( + "context" + "encoding/json" + "fmt" + "os" + "path/filepath" + "strings" + + "github.com/initializ/forge/forge-core/runtime" + "github.com/initializ/forge/forge-core/tools" +) + +// allowedExtensions maps file extensions to their MIME types. 
+var allowedExtensions = map[string]string{ + ".md": "text/markdown", + ".json": "application/json", + ".yaml": "text/yaml", + ".yml": "text/yaml", + ".txt": "text/plain", + ".log": "text/plain", + ".csv": "text/csv", + ".sh": "text/x-shellscript", + ".xml": "text/xml", + ".html": "text/html", + ".py": "text/x-python", + ".ts": "text/typescript", +} + +type fileCreateTool struct{} + +func (t *fileCreateTool) Name() string { return "file_create" } +func (t *fileCreateTool) Description() string { + return "Create a downloadable file. The file is written to the agent's files directory (or a temporary directory as fallback) and uploaded to the user's channel (Slack/Telegram). The result includes a 'path' field with the file's location on disk, which can be used with other tools like kubectl apply -f <path>." +} +func (t *fileCreateTool) Category() tools.Category { return tools.CategoryBuiltin } + +func (t *fileCreateTool) InputSchema() json.RawMessage { + return json.RawMessage(`{ + "type": "object", + "properties": { + "filename": { + "type": "string", + "description": "Filename with extension (e.g., patches.yaml, report.json, output.txt, script.py)" + }, + "content": { + "type": "string", + "description": "The full file content as text" + } + }, + "required": ["filename", "content"] + }`) +} + +func (t *fileCreateTool) Execute(ctx context.Context, args json.RawMessage) (string, error) { + var input struct { + Filename string `json:"filename"` + Content string `json:"content"` + } + if err := json.Unmarshal(args, &input); err != nil { + return "", fmt.Errorf("invalid arguments: %w", err) + } + + // Validate filename is not empty. + if strings.TrimSpace(input.Filename) == "" { + return "", fmt.Errorf("filename is required") + } + + // Reject path traversal and directory separators. + if strings.ContainsAny(input.Filename, "/\\") { + return "", fmt.Errorf("filename must not contain path separators") + } + if input.Filename == "." || input.Filename == ".." 
{ + return "", fmt.Errorf("invalid filename") + } + + // Validate extension against allowlist. + ext := strings.ToLower(filepath.Ext(input.Filename)) + mime, ok := allowedExtensions[ext] + if !ok { + supported := make([]string, 0, len(allowedExtensions)) + for k := range allowedExtensions { + supported = append(supported, k) + } + return "", fmt.Errorf("unsupported file extension %q; supported: %s", ext, strings.Join(supported, ", ")) + } + + // Write file to the agent's .forge/files directory if available, + // otherwise fall back to a system temp directory. + dir := runtime.FilesDirFromContext(ctx) + if dir == "" { + dir = filepath.Join(os.TempDir(), "forge-files") + } + if err := os.MkdirAll(dir, 0o755); err != nil { + return "", fmt.Errorf("creating temp directory: %w", err) + } + filePath := filepath.Join(dir, input.Filename) + if err := os.WriteFile(filePath, []byte(input.Content), 0o644); err != nil { + return "", fmt.Errorf("writing file: %w", err) + } + + // Return structured JSON for the runtime to parse. + out, err := json.Marshal(map[string]string{ + "filename": input.Filename, + "content": input.Content, + "mime_type": mime, + "path": filePath, + }) + if err != nil { + return "", fmt.Errorf("marshalling output: %w", err) + } + return string(out), nil +} + +// MimeFromExtension returns the MIME type for a given file extension. +// Returns empty string if the extension is not in the allowlist. 
+func MimeFromExtension(ext string) string { + return allowedExtensions[strings.ToLower(ext)] +} diff --git a/forge-core/tools/builtins/register.go b/forge-core/tools/builtins/register.go index fafae62..17e218b 100644 --- a/forge-core/tools/builtins/register.go +++ b/forge-core/tools/builtins/register.go @@ -12,6 +12,7 @@ func All() []tools.Tool { &uuidGenerateTool{}, &mathCalculateTool{}, &webSearchTool{}, + &fileCreateTool{}, } } diff --git a/forge-plugins/channels/slack/slack.go b/forge-plugins/channels/slack/slack.go index a2e983c..7ed59cf 100644 --- a/forge-plugins/channels/slack/slack.go +++ b/forge-plugins/channels/slack/slack.go @@ -838,7 +838,13 @@ func extractLargestFile(msg *a2a.Message) (content, filename string) { } for _, p := range msg.Parts { if p.Kind == a2a.PartKindFile && p.File != nil && len(p.File.Bytes) > len(content) { - content = unwrapJSONContent(string(p.File.Bytes)) + raw := string(p.File.Bytes) + // Only unwrap JSON content for markdown files. + // Preserve raw content for explicitly typed files (json, yaml, etc.) + if strings.HasSuffix(p.File.Name, ".md") { + raw = unwrapJSONContent(raw) + } + content = raw filename = p.File.Name } } diff --git a/forge-skills/local/embedded/k8s-pod-rightsizer/SKILL.md b/forge-skills/local/embedded/k8s-pod-rightsizer/SKILL.md new file mode 100644 index 0000000..4f5a9dc --- /dev/null +++ b/forge-skills/local/embedded/k8s-pod-rightsizer/SKILL.md @@ -0,0 +1,516 @@ +--- +name: k8s-pod-rightsizer +category: sre +tags: + - kubernetes + - rightsizing + - cost-optimization + - resource-management + - prometheus + - capacity-planning + - kubectl +description: Analyze Kubernetes workload metrics and produce policy-constrained CPU/memory rightsizing recommendations with optional patch generation and rollback-safe apply. 
+metadata: + forge: + requires: + bins: + - bash + - kubectl + - jq + - curl + env: + required: [] + one_of: [] + optional: + - KUBECONFIG + - K8S_API_DOMAIN + - PROMETHEUS_URL + - PROMETHEUS_TOKEN + - POLICY_FILE + - DEFAULT_NAMESPACE + egress_domains: + - "$K8S_API_DOMAIN" + - "$PROMETHEUS_URL" + denied_tools: + - http_request + - web_search + timeout_hint: 300 + trust_hints: + network: true + filesystem: read + shell: true +--- + +# Kubernetes Pod Rightsizer + +Analyzes real Kubernetes workload metrics (Prometheus or metrics-server fallback) and produces policy-constrained recommendations for CPU and memory request/limit adjustments. + +Supports three modes: + +- **dry-run** — Report recommendations only (default, read-only) +- **plan** — Generate strategic merge patch YAMLs +- **apply** — Execute patches with automatic rollback bundle generation + +This skill uses deterministic formulas, never LLM-based guessing. + +--- + +## Tool Usage + +This skill uses `cli_execute` with `kubectl` and `curl` commands. +NEVER use http_request or web_search to interact with Kubernetes or Prometheus. +All cluster operations MUST go through kubectl or the rightsizer script via cli_execute. + +--- + +## Applying Patches + +When the user asks to apply rightsizing patches, use the script's built-in `mode=apply` with `i_accept_risk: true`. + +**NEVER** manually run `kubectl apply -f <file>` — the script's apply mode provides: +- Automatic rollback bundle generation (backup of current specs) +- Strategic merge patches via `kubectl patch` +- Rollout verification after each patch +- Action logging + +**Correct workflow:** +1. First run with `mode=dry-run` to show recommendations +2. If user confirms, run with `mode=apply` and `i_accept_risk: true` +3. 
Use `file_create` to provide the user with a downloadable copy of the patches (optional) + +**Example:** +- User: "apply the rightsizing patches" → `{"namespace": "prod", "mode": "apply", "i_accept_risk": true}` + +--- + +## Tool: k8s_pod_rightsizer + +Analyze workload resource usage and recommend CPU/memory request and limit changes. + +**Input:** namespace (string), workload (string), label_selector (string), mode (string), i_accept_risk (boolean), policy_file (string), lookback (string), output_format (string) + +**Output format:** Markdown tables for recommendations. YAML code blocks for patches. JSON for machine-readable output. + +### CRITICAL: Mode Field Rules + +`mode` controls the **action**, NOT the analysis filter. There are ONLY three valid values: + +| mode | Purpose | +|------|---------| +| `dry-run` | Analyze and report recommendations (default) | +| `plan` | Generate patch YAMLs | +| `apply` | Execute patches (requires `i_accept_risk: true`) | + +**NEVER set mode to a classification like "overprovisioned", "underprovisioned", "rightsized", etc.** These are OUTPUT classifications the tool produces, not input modes. + +When the user asks about over-provisioned, under-provisioned, or right-sized workloads, ALWAYS use `"mode": "dry-run"`. The output will include a `classification` field for each workload (e.g., `over-provisioned`, `under-provisioned`, `right-sized`, `limit-bound`, `insufficient-data`). + +Examples: +- "which workloads are over-provisioned?" → `{"mode": "dry-run"}` — read classification from output +- "generate patches for over-provisioned pods" → `{"mode": "plan"}` — patches are generated only for workloads needing changes +- "find under-provisioned deployments" → `{"mode": "dry-run"}` — read classification from output + +--- + +## Input Modes + +### 1) Human Mode (Natural Language) + +Input is a plain string. 
+ +Examples: + +- `rightsize namespace payments-prod` → `{"namespace": "payments-prod", "mode": "dry-run"}` +- `which workloads are over-provisioned in prod?` → `{"namespace": "prod", "mode": "dry-run"}` +- `check resource usage for label app=checkout in prod` → `{"namespace": "prod", "label_selector": "app=checkout", "mode": "dry-run"}` +- `generate patches for over-provisioned workloads in staging` → `{"namespace": "staging", "mode": "plan"}` +- `apply rightsizing to deployment api-gateway in prod` → `{"namespace": "prod", "workload": "deployment/api-gateway", "mode": "apply", "i_accept_risk": true}` + +Behavior: + +- Parse namespace, workload, or selector intent. +- If namespace omitted, use `$DEFAULT_NAMESPACE` if set. +- Default mode is `dry-run`. ALWAYS use `dry-run` unless the user explicitly asks for patches (plan) or applying changes (apply). +- Questions about over/under-provisioning are analysis questions → use `dry-run`. +- Never require the user to remember JSON fields. + +--- + +### 2) Automation Mode (Structured JSON) + +Input JSON schema: + +```json +{ + "namespace": "payments-prod", + "workload": "deployment/payments-api", + "label_selector": "", + "mode": "dry-run", + "i_accept_risk": false, + "policy_file": "", + "lookback": "24h", + "output_format": "markdown" +} +``` + +Rules: + +- `namespace` is required (or `$DEFAULT_NAMESPACE` must be set). +- `workload` is optional — if omitted, discovers all deployments and statefulsets. +- `label_selector` is optional — filters discovered workloads. +- `mode` must be one of: `dry-run`, `plan`, `apply`. +- `i_accept_risk` must be `true` for `apply` mode. +- `output_format`: `markdown` (default), `json`, or `yaml`. + +--- + +## Execution Workflow + +### Step 0 — Preconditions + +Verify cluster access: + +```bash +kubectl cluster-info --request-timeout=5s +``` + +If RBAC denies access, report the error and stop. 
+ +Check Prometheus availability if `$PROMETHEUS_URL` is set: + +```bash +curl -s "$PROMETHEUS_URL/api/v1/status/buildinfo" +``` + +Fall back to metrics-server if Prometheus is unavailable. + +--- + +### Step 1 — Discover Workloads + +If a specific workload is provided, validate it exists: + +```bash +kubectl get <kind>/<name> -n <namespace> -o json +``` + +Otherwise, discover all deployments and statefulsets: + +```bash +kubectl get deploy,sts -n <namespace> -o json +``` + +Filter by `label_selector` if provided. Skip `kube-system` unless explicitly targeted. Extract container resource specs for each workload. + +--- + +### Step 2 — Collect Metrics + +**Prometheus (preferred):** + +Query p95 CPU and memory usage over the lookback window: + +```promql +quantile_over_time(0.95, rate(container_cpu_usage_seconds_total{namespace="NS",pod=~"WORKLOAD.*",container!="POD"}[5m])[LOOKBACK:1m]) +``` + +```promql +quantile_over_time(0.95, container_memory_working_set_bytes{namespace="NS",pod=~"WORKLOAD.*",container!="POD"}[LOOKBACK]) +``` + +Also collect throttle ratios and OOM kill counts. + +**Metrics-server fallback:** + +```bash +kubectl top pod -n <namespace> --containers +``` + +When using metrics-server fallback, recommendations are advisory-only. Apply mode is blocked. + +--- + +### Step 3 — Compute Recommendations + +All computations use deterministic formulas: + +- **Recommended request** = `p95_usage * safety_factor`, clamped to `[policy_min, policy_max]` +- **Recommended limit** = `recommended_request * burst_multiplier` +- **Step constraint** — changes smaller than `step_percent` of current value are suppressed (avoids churn) + +CPU values are rounded to nearest 10m. Memory values are rounded to nearest MiB. 
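The Step 3 formulas can be sketched with plain integer arithmetic (values are illustrative; the skill itself performs these computations via `jq`, and memory follows the same shape in MiB):

```bash
p95_cpu=250                                           # observed p95 CPU usage, millicores
req=$(( p95_cpu * 125 / 100 ))                        # cpu_safety_factor 1.25 -> 312
req=$(( (req + 5) / 10 * 10 ))                        # round to nearest 10m -> 310
req=$(( req < 50 ? 50 : (req > 8000 ? 8000 : req) ))  # clamp to [cpu_min, cpu_max]
lim=$(( req * 2 ))                                    # cpu_burst_multiplier 2.0 -> 620
cur=400                                               # current request, millicores
delta=$(( (cur - req) * 100 / cur ))                  # change magnitude -> 22%
[ "$delta" -ge 15 ] && echo "recommend request=${req}m limit=${lim}m (was ${cur}m)"
```

A change smaller than the 15% step constraint would produce no output here, which is exactly how recommendation churn is avoided.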
+
+---
+
+### Step 4 — Generate Report
+
+Output format depends on the `output_format` parameter:
+
+- **markdown** — Human-readable tables with workload, container, current vs recommended values, savings estimate, and classification
+- **json** — Machine-readable array of recommendation objects
+- **yaml** — Patch files (plan and apply modes only)
+
+---
+
+### Step 5 — Apply (if mode=apply)
+
+1. Generate rollback bundle (backup of current resource specs)
+2. Show diff preview of all patches
+3. Apply strategic merge patches via `kubectl patch`
+4. Verify rollout status after each patch
+5. Log all actions to `run.log` in the rollback bundle
+
+---
+
+## Policy Model
+
+Policy files define constraints for rightsizing recommendations. Specify one via the `policy_file` input field or the `$POLICY_FILE` environment variable.
+
+### Example Policy
+
+```json
+{
+  "defaults": {
+    "cpu_safety_factor": 1.25,
+    "memory_safety_factor": 1.35,
+    "cpu_burst_multiplier": 2.0,
+    "memory_burst_multiplier": 1.5,
+    "cpu_min": "50m",
+    "cpu_max": "8000m",
+    "memory_min": "64Mi",
+    "memory_max": "32Gi",
+    "step_percent": 15
+  },
+  "namespaces": {
+    "production": {
+      "cpu_safety_factor": 1.4,
+      "memory_safety_factor": 1.5,
+      "step_percent": 20
+    }
+  },
+  "workloads": {
+    "production/payments-api": {
+      "cpu_min": "500m",
+      "memory_min": "512Mi"
+    }
+  }
+}
+```
+
+### Field Reference
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `cpu_safety_factor` | float | 1.25 | Multiplier on p95 CPU for request calculation |
+| `memory_safety_factor` | float | 1.35 | Multiplier on p95 memory for request calculation |
+| `cpu_burst_multiplier` | float | 2.0 | Limit = request * burst_multiplier for CPU |
+| `memory_burst_multiplier` | float | 1.5 | Limit = request * burst_multiplier for memory |
+| `cpu_min` | string | 10m | Floor for CPU request recommendations |
+| `cpu_max` | string | 8000m | Ceiling for CPU request recommendations |
+| `memory_min` | string | 32Mi | Floor for memory request
recommendations |
+| `memory_max` | string | 32Gi | Ceiling for memory request recommendations |
+| `step_percent` | int | 15 | Minimum change percentage to trigger a recommendation |
+
+### Precedence
+
+Policy values resolve in three levels (highest priority first):
+
+1. **Workload override** — `workloads["namespace/name"]`
+2. **Namespace override** — `namespaces["namespace"]`
+3. **Defaults** — `defaults`
+
+Values merge via overlay: workload overrides namespace, which overrides defaults.
+
+---
+
+## Metrics Strategy
+
+### Prometheus (Preferred)
+
+When `$PROMETHEUS_URL` is set, the skill queries Prometheus for high-fidelity metrics:
+
+| Metric | PromQL Pattern |
+|--------|---------------|
+| p95 CPU | `quantile_over_time(0.95, rate(container_cpu_usage_seconds_total{...}[5m])[LOOKBACK:1m])` |
+| p95 Memory | `quantile_over_time(0.95, container_memory_working_set_bytes{...}[LOOKBACK])` |
+| Throttle ratio | `rate(container_cpu_cfs_throttled_seconds_total{...}[LOOKBACK]) / rate(container_cpu_cfs_periods_total{...}[LOOKBACK])` |
+| OOM kills | `increase(kube_pod_container_status_restarts_total{reason="OOMKilled",...}[LOOKBACK])` |
+
+Authentication via `$PROMETHEUS_TOKEN` (Bearer token) if set.
+
+### Metrics-Server Fallback
+
+When Prometheus is unavailable, the skill falls back to:
+
+```bash
+kubectl top pod -n <namespace> --containers
+```
+
+Limitations:
+
+- Point-in-time snapshot only (no percentile data)
+- Recommendations are advisory-only
+- Apply mode is blocked
+- Step constraint is doubled (30% minimum change)
+
+---
+
+## Decision Engine
+
+All computations are deterministic and performed via `jq` arithmetic.
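As one example of that jq-only approach, the three-level policy precedence reduces to jq's object addition, where right-hand keys win. A simplified sketch (the shipped script additionally normalizes unit strings such as `cpu_min` into integers):

```shell
# resolve_effective_policy: defaults <- namespace override <- workload override.
# Later (+) operands win, giving the documented precedence order.
resolve_effective_policy() {
  local policy_json="$1" ns="$2" workload_key="$3"
  echo "$policy_json" | jq -c --arg ns "$ns" --arg wk "$workload_key" '
    (.defaults // {})
    + (.namespaces[$ns] // {})
    + (.workloads[$wk] // {})'
}
```

Given the example policy above, `production/payments-api` ends up with `step_percent: 20` from the namespace layer and `cpu_min: "500m"` from the workload layer, while untouched fields fall through to the defaults.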
+ +### Request Calculation + +``` +raw_request = p95_usage * safety_factor +clamped_request = clamp(raw_request, policy_min, policy_max) +recommended_request = round(clamped_request) +``` + +### Limit Calculation + +``` +recommended_limit = recommended_request * burst_multiplier +clamped_limit = clamp(recommended_limit, recommended_request, policy_max) +``` + +### Step Constraint + +A recommendation is only emitted if: + +``` +abs(recommended - current) / current >= step_percent / 100 +``` + +This prevents churn from minor fluctuations. + +### Rounding + +- CPU: rounded to nearest 10m (e.g., 137m → 140m) +- Memory: rounded to nearest MiB (e.g., 127.3Mi → 128Mi) + +--- + +## Detection Heuristics + +Each container is classified into one of these patterns: + +| Pattern | Condition | +|---------|-----------| +| **Over-provisioned CPU** | CPU request > p95 CPU * safety_factor * 2 | +| **Under-provisioned CPU** | CPU request < p95 CPU * 0.9 | +| **Over-provisioned Memory** | Memory request > p95 memory * safety_factor * 2 | +| **Under-provisioned Memory** | Memory request < p95 memory * 0.9 | +| **Limit-bound (throttled)** | Throttle ratio > 0.1 or OOM kills > 0 | +| **Right-sized** | Within step_percent of recommended values | +| **Insufficient data** | Fewer than 10 data points in lookback window | + +--- + +## Output Formats + +### Markdown Report (default) + +```markdown +| Workload | Container | Resource | Current | Recommended | Change | Classification | +|----------|-----------|----------|---------|-------------|--------|----------------| +| deploy/api | app | CPU req | 1000m | 400m | -60% | Over-provisioned | +| deploy/api | app | CPU lim | 2000m | 800m | -60% | Over-provisioned | +| deploy/api | app | Mem req | 2Gi | 1Gi | -50% | Over-provisioned | +| deploy/api | app | Mem lim | 4Gi | 1536Mi | -63% | Over-provisioned | +``` + +### JSON Output + +```json +[ + { + "workload": "deployment/api", + "container": "app", + "cpu_request": {"current": "1000m", 
"recommended": "400m", "change_percent": -60}, + "cpu_limit": {"current": "2000m", "recommended": "800m", "change_percent": -60}, + "memory_request": {"current": "2Gi", "recommended": "1Gi", "change_percent": -50}, + "memory_limit": {"current": "4Gi", "recommended": "1536Mi", "change_percent": -63}, + "classification": "over-provisioned" + } +] +``` + +### Patch YAMLs (plan/apply modes) + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: api + namespace: payments-prod +spec: + template: + spec: + containers: + - name: app + resources: + requests: + cpu: "400m" + memory: "1Gi" + limits: + cpu: "800m" + memory: "1536Mi" +``` + +--- + +## Rollback + +When `mode=apply`, a rollback bundle is generated before any patches are applied: + +``` +rollback-/ + backup-.json # Current resource specs + patch-.json # Applied patches + rollback-.sh # kubectl patch commands to restore + run.log # Timestamped action log +``` + +To roll back: + +```bash +bash rollback-/rollback-.sh +``` + +--- + +## Safety Constraints + +This skill MUST: + +- Default to `dry-run` mode — never mutate without explicit mode selection. +- Require `i_accept_risk: true` for `apply` mode. +- Generate rollback bundles before applying any patch. +- Never delete workloads, pods, namespaces, or any Kubernetes resource. +- Never modify RBAC, NetworkPolicy, or Secret resources. +- Never scale replicas. +- Only patch `spec.template.spec.containers[].resources`. +- Block `apply` mode when using metrics-server fallback (insufficient data fidelity). +- Validate all policy values before use. +- Cap lookback window at 30 days. +- Skip `kube-system` namespace unless explicitly targeted. +- Respect step constraints to avoid recommendation churn. +- Log all mutations to the rollback bundle run.log. 
+ +--- + +## Autonomous Compatibility + +This skill is designed to be invoked by: + +- Humans via natural language CLI +- Automation pipelines via structured JSON +- Scheduled cost-optimization sweeps + +It must: + +- Be idempotent (repeated runs produce the same recommendations for the same data) +- Produce deterministic results (no LLM-based guessing) +- Be scope-limited (operates only on specified namespace/workload) +- Generate machine-parseable output for downstream processing diff --git a/forge-skills/local/embedded/k8s-pod-rightsizer/scripts/k8s-pod-rightsizer.sh b/forge-skills/local/embedded/k8s-pod-rightsizer/scripts/k8s-pod-rightsizer.sh new file mode 100644 index 0000000..9021cef --- /dev/null +++ b/forge-skills/local/embedded/k8s-pod-rightsizer/scripts/k8s-pod-rightsizer.sh @@ -0,0 +1,1082 @@ +#!/usr/bin/env bash +# k8s-pod-rightsizer.sh — Analyze Kubernetes workload metrics and produce +# policy-constrained CPU/memory rightsizing recommendations. +# +# Usage: ./k8s-pod-rightsizer.sh '{"namespace":"prod","mode":"dry-run"}' +# +# Requires: kubectl, jq, curl (for Prometheus), bash. 
+set -euo pipefail + +############################################################################### +# Constants & Defaults +############################################################################### + +# Default policy values (used when no POLICY_FILE is provided) +DEFAULT_CPU_SAFETY_FACTOR="1.25" +DEFAULT_MEMORY_SAFETY_FACTOR="1.35" +DEFAULT_CPU_BURST_MULTIPLIER="2.0" +DEFAULT_MEMORY_BURST_MULTIPLIER="1.5" +DEFAULT_CPU_MIN_MILLI="10" +DEFAULT_CPU_MAX_MILLI="8000" +DEFAULT_MEMORY_MIN_MI="32" +DEFAULT_MEMORY_MAX_MI="32768" +DEFAULT_STEP_PERCENT="15" +DEFAULT_LOOKBACK="24h" + +# Metrics source flag +METRICS_SOURCE="none" +ADVISORY_ONLY="false" + +# Temp directory with cleanup trap +TMPDIR_WORK=$(mktemp -d) +trap 'rm -rf "$TMPDIR_WORK"' EXIT + +############################################################################### +# Input Parsing & Validation +############################################################################### + +INPUT="${1:-}" +if [ -z "$INPUT" ]; then + echo '{"error":"usage: k8s-pod-rightsizer.sh {\"namespace\":\"...\",\"mode\":\"dry-run\"}"}' >&2 + exit 1 +fi + +if ! 
echo "$INPUT" | jq empty 2>/dev/null; then + echo '{"error":"invalid JSON input"}' >&2 + exit 1 +fi + +NAMESPACE=$(echo "$INPUT" | jq -r '.namespace // empty') +WORKLOAD=$(echo "$INPUT" | jq -r '.workload // empty') +LABEL_SELECTOR=$(echo "$INPUT" | jq -r '.label_selector // empty') +MODE=$(echo "$INPUT" | jq -r '.mode // "dry-run"') +I_ACCEPT_RISK=$(echo "$INPUT" | jq -r '.i_accept_risk // false') +POLICY_FILE_INPUT=$(echo "$INPUT" | jq -r '.policy_file // empty') +LOOKBACK=$(echo "$INPUT" | jq -r '.lookback // empty') +OUTPUT_FORMAT=$(echo "$INPUT" | jq -r '.output_format // "markdown"') + +# Resolve namespace +if [ -z "$NAMESPACE" ]; then + NAMESPACE="${DEFAULT_NAMESPACE:-}" +fi +if [ -z "$NAMESPACE" ]; then + echo '{"error":"namespace is required (provide in input or set DEFAULT_NAMESPACE)"}' >&2 + exit 1 +fi + +# Normalize mode — map common synonyms to canonical values +case "$MODE" in + dry-run|dryrun|dry_run|report|check|analyze|analysis|overprovisioned|over-provisioned|underprovisioned|under-provisioned) + MODE="dry-run" + ;; + plan|patch|patches|generate-patches|generate_patches|diff) + MODE="plan" + ;; + apply|execute|run) + MODE="apply" + ;; + *) + echo "{\"error\":\"invalid mode '$MODE': must be dry-run, plan, or apply\"}" >&2 + exit 1 + ;; +esac + +# Validate apply prerequisites +if [ "$MODE" = "apply" ] && [ "$I_ACCEPT_RISK" != "true" ]; then + echo '{"error":"apply mode requires i_accept_risk: true"}' >&2 + exit 1 +fi + +# Validate output format +case "$OUTPUT_FORMAT" in + markdown|json|yaml) ;; + *) + echo "{\"error\":\"invalid output_format '$OUTPUT_FORMAT': must be markdown, json, or yaml\"}" >&2 + exit 1 + ;; +esac + +# Set lookback with default +if [ -z "$LOOKBACK" ]; then + LOOKBACK="$DEFAULT_LOOKBACK" +fi + +# Validate lookback format and cap at 30d +LOOKBACK_HOURS=$(echo "$LOOKBACK" | jq -Rr ' + if test("^[0-9]+h$") then ltrimstr("") | rtrimstr("h") | tonumber + elif test("^[0-9]+d$") then ltrimstr("") | rtrimstr("d") | tonumber * 24 + else 
-1 + end +') +if [ "$LOOKBACK_HOURS" -lt 0 ] 2>/dev/null; then + echo '{"error":"invalid lookback format: use Nh or Nd (e.g., 24h, 7d)"}' >&2 + exit 1 +fi +if [ "$LOOKBACK_HOURS" -gt 720 ]; then + echo '{"error":"lookback cannot exceed 30d (720h)"}' >&2 + exit 1 +fi + +# Resolve policy file +POLICY_FILE="${POLICY_FILE_INPUT:-${POLICY_FILE:-}}" + +############################################################################### +# Policy Functions +############################################################################### + +policy_load() { + if [ -n "$POLICY_FILE" ] && [ -f "$POLICY_FILE" ]; then + if ! jq empty "$POLICY_FILE" 2>/dev/null; then + echo '{"error":"invalid JSON in policy file"}' >&2 + exit 1 + fi + cat "$POLICY_FILE" + else + # Return empty policy (will use defaults) + echo '{}' + fi +} + +resolve_policy() { + local policy_json="$1" + local ns="$2" + local workload_key="$3" # "namespace/name" or empty + + # Build effective policy: defaults → namespace override → workload override + jq -n --argjson policy "$policy_json" \ + --arg ns "$ns" \ + --arg wk "$workload_key" \ + --argjson d_csf "$DEFAULT_CPU_SAFETY_FACTOR" \ + --argjson d_msf "$DEFAULT_MEMORY_SAFETY_FACTOR" \ + --argjson d_cbm "$DEFAULT_CPU_BURST_MULTIPLIER" \ + --argjson d_mbm "$DEFAULT_MEMORY_BURST_MULTIPLIER" \ + --argjson d_cmin "$DEFAULT_CPU_MIN_MILLI" \ + --argjson d_cmax "$DEFAULT_CPU_MAX_MILLI" \ + --argjson d_mmin "$DEFAULT_MEMORY_MIN_MI" \ + --argjson d_mmax "$DEFAULT_MEMORY_MAX_MI" \ + --argjson d_step "$DEFAULT_STEP_PERCENT" ' + { + cpu_safety_factor: $d_csf, + memory_safety_factor: $d_msf, + cpu_burst_multiplier: $d_cbm, + memory_burst_multiplier: $d_mbm, + cpu_min_milli: $d_cmin, + cpu_max_milli: $d_cmax, + memory_min_mi: $d_mmin, + memory_max_mi: $d_mmax, + step_percent: $d_step + } as $builtin_defaults | + ($policy.defaults // {}) as $user_defaults | + ($policy.namespaces[$ns] // {}) as $ns_override | + (if $wk != "" then ($policy.workloads[$wk] // {}) else {} end) as 
$wk_override | + # Merge user defaults over builtin, converting cpu_min/memory_min string values + ($builtin_defaults + ($user_defaults | to_entries | map( + if .key == "cpu_min" then {key: "cpu_min_milli", value: (.value | tostring | gsub("m$";"") | tonumber)} + elif .key == "cpu_max" then {key: "cpu_max_milli", value: (.value | tostring | gsub("m$";"") | tonumber)} + elif .key == "memory_min" then {key: "memory_min_mi", value: (.value | tostring | gsub("Mi$";"") | tonumber)} + elif .key == "memory_max" then {key: "memory_max_mi", value: (.value | tostring | gsub("Gi$";"") | tonumber * 1024)} + else . + end + ) | from_entries)) as $merged_defaults | + # Apply namespace override + ($merged_defaults + ($ns_override | to_entries | map( + if .key == "cpu_min" then {key: "cpu_min_milli", value: (.value | tostring | gsub("m$";"") | tonumber)} + elif .key == "cpu_max" then {key: "cpu_max_milli", value: (.value | tostring | gsub("m$";"") | tonumber)} + elif .key == "memory_min" then {key: "memory_min_mi", value: (.value | tostring | gsub("Mi$";"") | tonumber)} + elif .key == "memory_max" then {key: "memory_max_mi", value: (.value | tostring | gsub("Gi$";"") | tonumber * 1024)} + else . + end + ) | from_entries)) as $after_ns | + # Apply workload override + ($after_ns + ($wk_override | to_entries | map( + if .key == "cpu_min" then {key: "cpu_min_milli", value: (.value | tostring | gsub("m$";"") | tonumber)} + elif .key == "cpu_max" then {key: "cpu_max_milli", value: (.value | tostring | gsub("m$";"") | tonumber)} + elif .key == "memory_min" then {key: "memory_min_mi", value: (.value | tostring | gsub("Mi$";"") | tonumber)} + elif .key == "memory_max" then {key: "memory_max_mi", value: (.value | tostring | gsub("Gi$";"") | tonumber * 1024)} + else . 
+ end + ) | from_entries)) + ' +} + +validate_policy() { + local eff_policy="$1" + local valid + valid=$(echo "$eff_policy" | jq ' + if .cpu_safety_factor < 1 then "cpu_safety_factor must be >= 1" + elif .memory_safety_factor < 1 then "memory_safety_factor must be >= 1" + elif .cpu_burst_multiplier < 1 then "cpu_burst_multiplier must be >= 1" + elif .memory_burst_multiplier < 1 then "memory_burst_multiplier must be >= 1" + elif .cpu_min_milli < 0 then "cpu_min_milli must be >= 0" + elif .cpu_max_milli < .cpu_min_milli then "cpu_max_milli must be >= cpu_min_milli" + elif .memory_min_mi < 0 then "memory_min_mi must be >= 0" + elif .memory_max_mi < .memory_min_mi then "memory_max_mi must be >= memory_min_mi" + elif .step_percent < 0 or .step_percent > 100 then "step_percent must be between 0 and 100" + else "ok" + end + ' -r) + if [ "$valid" != "ok" ]; then + echo "{\"error\":\"policy validation failed: $valid\"}" >&2 + exit 1 + fi +} + +############################################################################### +# Preflight +############################################################################### + +preflight() { + # Use the user's existing kubeconfig — kubectl reads $KUBECONFIG or ~/.kube/config by default + local kc="${KUBECONFIG:-${HOME}/.kube/config}" + if [ ! -f "$kc" ] && [ -z "${KUBECONFIG:-}" ]; then + echo "{\"error\":\"no kubeconfig found at ${kc} — set KUBECONFIG or configure kubectl\"}" >&2 + exit 1 + fi + + local cluster_err + if ! 
cluster_err=$(kubectl cluster-info --request-timeout=10s 2>&1); then + echo "{\"error\":\"cannot connect to Kubernetes cluster: $(echo "$cluster_err" | head -1 | tr '"' "'")\"}" >&2 + exit 1 + fi +} + +############################################################################### +# Discovery Functions +############################################################################### + +discover_workloads() { + local ns="$1" + local workload_filter="$2" + local label_sel="$3" + + if [ -n "$workload_filter" ]; then + # Specific workload — parse kind/name + local kind name + if echo "$workload_filter" | grep -q '/'; then + kind=$(echo "$workload_filter" | cut -d'/' -f1) + name=$(echo "$workload_filter" | cut -d'/' -f2) + else + # Assume deployment if no kind specified + kind="deployment" + name="$workload_filter" + fi + + local result + if ! result=$(kubectl get "$kind" "$name" -n "$ns" -o json 2>&1); then + echo "{\"error\":\"workload $kind/$name not found in namespace $ns: $result\"}" >&2 + exit 1 + fi + # Wrap single workload into items array + echo "$result" | jq '{items: [.]}' + else + # Discover all deployments and statefulsets + local selector_args="" + if [ -n "$label_sel" ]; then + selector_args="-l $label_sel" + fi + + local deploys sts + # shellcheck disable=SC2086 + deploys=$(kubectl get deploy -n "$ns" $selector_args -o json 2>/dev/null || echo '{"items":[]}') + # shellcheck disable=SC2086 + sts=$(kubectl get sts -n "$ns" $selector_args -o json 2>/dev/null || echo '{"items":[]}') + + # Merge items from both + jq -n --argjson d "$deploys" --argjson s "$sts" '{items: ($d.items + $s.items)}' + fi +} + +extract_containers() { + # Extract container resource specs from workload JSON + local workload_json="$1" + echo "$workload_json" | jq '[ + .items[] | + . 
as $wl | + { + kind: .kind, + name: .metadata.name, + namespace: .metadata.namespace + } as $meta | + .spec.template.spec.containers[] | + { + workload_kind: ($meta.kind | ascii_downcase), + workload_name: $meta.name, + namespace: $meta.namespace, + container: .name, + current_cpu_request: (.resources.requests.cpu // "0"), + current_cpu_limit: (.resources.limits.cpu // "0"), + current_memory_request: (.resources.requests.memory // "0"), + current_memory_limit: (.resources.limits.memory // "0") + } + ]' +} + +############################################################################### +# Unit Conversion Helpers (via jq) +############################################################################### + +# Convert CPU string (e.g., "500m", "1", "2.5") to millicores integer +cpu_to_milli() { + local val="$1" + echo "$val" | jq -Rr ' + if . == "0" or . == "" then 0 + elif test("m$") then rtrimstr("m") | tonumber + else tonumber * 1000 + end | floor + ' +} + +# Convert memory string (e.g., "512Mi", "1Gi", "1073741824") to MiB integer +memory_to_mi() { + local val="$1" + echo "$val" | jq -Rr ' + if . == "0" or . 
== "" then 0 + elif test("Gi$") then rtrimstr("Gi") | tonumber * 1024 + elif test("Mi$") then rtrimstr("Mi") | tonumber + elif test("Ki$") then rtrimstr("Ki") | tonumber / 1024 + elif test("G$") then rtrimstr("G") | tonumber * 1000000000 / 1048576 + elif test("M$") then rtrimstr("M") | tonumber * 1000000 / 1048576 + elif test("K$") then rtrimstr("K") | tonumber * 1000 / 1048576 + else tonumber / 1048576 + end | floor + ' +} + +############################################################################### +# Prometheus Metrics +############################################################################### + +query_prom() { + local promql="$1" + local prom_url="${PROMETHEUS_URL:-}" + + if [ -z "$prom_url" ]; then + return 1 + fi + + local auth_header="" + if [ -n "${PROMETHEUS_TOKEN:-}" ]; then + auth_header="Authorization: Bearer ${PROMETHEUS_TOKEN}" + fi + + local response http_code body + if [ -n "$auth_header" ]; then + response=$(curl -s -w "\n%{http_code}" --max-time 30 \ + -G "${prom_url}/api/v1/query" \ + --data-urlencode "query=${promql}" \ + -H "$auth_header") + else + response=$(curl -s -w "\n%{http_code}" --max-time 30 \ + -G "${prom_url}/api/v1/query" \ + --data-urlencode "query=${promql}") + fi + + http_code=$(echo "$response" | tail -1) + body=$(echo "$response" | sed '$d') + + if [ "$http_code" -ne 200 ]; then + echo "" # Return empty on failure + return 1 + fi + + echo "$body" +} + +get_metrics_prom() { + local ns="$1" + local pod_prefix="$2" + local container="$3" + local lookback_val="$4" + + # p95 CPU usage (cores) + local cpu_query="quantile_over_time(0.95, rate(container_cpu_usage_seconds_total{namespace=\"${ns}\",pod=~\"${pod_prefix}.*\",container=\"${container}\"}[5m])[${lookback_val}:1m])" + local cpu_result + cpu_result=$(query_prom "$cpu_query" 2>/dev/null || echo "") + + # p95 Memory usage (bytes) + local mem_query="quantile_over_time(0.95, 
container_memory_working_set_bytes{namespace=\"${ns}\",pod=~\"${pod_prefix}.*\",container=\"${container}\"}[${lookback_val}])" + local mem_result + mem_result=$(query_prom "$mem_query" 2>/dev/null || echo "") + + # Throttle ratio + local throttle_query="rate(container_cpu_cfs_throttled_seconds_total{namespace=\"${ns}\",pod=~\"${pod_prefix}.*\",container=\"${container}\"}[${lookback_val}]) / rate(container_cpu_cfs_periods_total{namespace=\"${ns}\",pod=~\"${pod_prefix}.*\",container=\"${container}\"}[${lookback_val}])" + local throttle_result + throttle_result=$(query_prom "$throttle_query" 2>/dev/null || echo "") + + # OOM kills + local oom_query="increase(kube_pod_container_status_restarts_total{namespace=\"${ns}\",pod=~\"${pod_prefix}.*\",container=\"${container}\",reason=\"OOMKilled\"}[${lookback_val}])" + local oom_result + oom_result=$(query_prom "$oom_query" 2>/dev/null || echo "") + + # Extract values, defaulting to empty on parse failure + local p95_cpu_cores p95_mem_bytes throttle_ratio oom_kills + + p95_cpu_cores=$(echo "${cpu_result:-}" | jq -r '.data.result[0].value[1] // empty' 2>/dev/null || echo "") + p95_mem_bytes=$(echo "${mem_result:-}" | jq -r '.data.result[0].value[1] // empty' 2>/dev/null || echo "") + throttle_ratio=$(echo "${throttle_result:-}" | jq -r '.data.result[0].value[1] // empty' 2>/dev/null || echo "") + oom_kills=$(echo "${oom_result:-}" | jq -r '.data.result[0].value[1] // empty' 2>/dev/null || echo "") + + # Convert to millicores and MiB + local p95_cpu_milli="0" + local p95_mem_mi="0" + if [ -n "$p95_cpu_cores" ]; then + p95_cpu_milli=$(echo "$p95_cpu_cores" | jq -r '. | tonumber * 1000 | floor') + fi + if [ -n "$p95_mem_bytes" ]; then + p95_mem_mi=$(echo "$p95_mem_bytes" | jq -r '. 
| tonumber / 1048576 | floor') + fi + + jq -n \ + --argjson cpu "$p95_cpu_milli" \ + --argjson mem "$p95_mem_mi" \ + --arg throttle "${throttle_ratio:-0}" \ + --arg oom "${oom_kills:-0}" \ + --arg source "prometheus" '{ + p95_cpu_milli: $cpu, + p95_memory_mi: $mem, + throttle_ratio: ($throttle | tonumber // 0), + oom_kills: ($oom | tonumber | floor // 0), + source: $source + }' +} + +############################################################################### +# kubectl top Fallback +############################################################################### + +get_metrics_top() { + local ns="$1" + local pod_prefix="$2" + local container="$3" + + ADVISORY_ONLY="true" + + local top_output + if ! top_output=$(kubectl top pod -n "$ns" --containers 2>&1); then + echo '{"error":"metrics-server unavailable: '"$(echo "$top_output" | head -1)"'"}' >&2 + return 1 + fi + + # Parse kubectl top output for matching pods/containers + # Format: POD_NAME CONTAINER CPU(cores) MEMORY(bytes) + local cpu_milli mem_mi + cpu_milli=$(echo "$top_output" | grep -E "^${pod_prefix}" | awk -v c="$container" '$2 == c {print $3}' | head -1 || echo "") + mem_mi=$(echo "$top_output" | grep -E "^${pod_prefix}" | awk -v c="$container" '$2 == c {print $4}' | head -1 || echo "") + + # Convert from kubectl top format + if [ -z "$cpu_milli" ]; then + cpu_milli="0" + else + cpu_milli=$(cpu_to_milli "$cpu_milli") + fi + if [ -z "$mem_mi" ]; then + mem_mi="0" + else + mem_mi=$(memory_to_mi "$mem_mi") + fi + + jq -n \ + --argjson cpu "$cpu_milli" \ + --argjson mem "$mem_mi" '{ + p95_cpu_milli: $cpu, + p95_memory_mi: $mem, + throttle_ratio: 0, + oom_kills: 0, + source: "metrics-server" + }' +} + +############################################################################### +# Compute Engine +############################################################################### + +compute_recommendation() { + local container_info="$1" + local metrics="$2" + local eff_policy="$3" + + jq -n --argjson c 
"$container_info" --argjson m "$metrics" --argjson p "$eff_policy" \ + --arg advisory "$ADVISORY_ONLY" ' + # Parse current values to millicores/MiB + def parse_cpu: + if . == "0" or . == "" or . == null then 0 + elif test("m$") then rtrimstr("m") | tonumber + else tonumber * 1000 + end | floor; + def parse_mem: + if . == "0" or . == "" or . == null then 0 + elif test("Gi$") then rtrimstr("Gi") | tonumber * 1024 + elif test("Mi$") then rtrimstr("Mi") | tonumber + elif test("Ki$") then rtrimstr("Ki") | tonumber / 1024 + else tonumber / 1048576 + end | floor; + + # Clamp helper + def clamp(lo; hi): if . < lo then lo elif . > hi then hi else . end; + + # Round CPU to nearest 10m + def round_cpu: ((. + 5) / 10 | floor) * 10 | if . < 10 then 10 else . end; + + # Round memory to nearest MiB (already integer) + def round_mem: if . < 1 then 1 else . | floor end; + + ($c.current_cpu_request | parse_cpu) as $cur_cpu_req | + ($c.current_cpu_limit | parse_cpu) as $cur_cpu_lim | + ($c.current_memory_request | parse_mem) as $cur_mem_req | + ($c.current_memory_limit | parse_mem) as $cur_mem_lim | + + $m.p95_cpu_milli as $p95_cpu | + $m.p95_memory_mi as $p95_mem | + + # Step percent (doubled for advisory mode) + (if $advisory == "true" then ($p.step_percent * 2) else $p.step_percent end) as $step | + + # Compute recommended CPU request + ($p95_cpu * $p.cpu_safety_factor | round_cpu | clamp($p.cpu_min_milli; $p.cpu_max_milli)) as $rec_cpu_req | + # Compute recommended CPU limit + ($rec_cpu_req * $p.cpu_burst_multiplier | round_cpu | clamp($rec_cpu_req; $p.cpu_max_milli)) as $rec_cpu_lim | + # Compute recommended memory request + ($p95_mem * $p.memory_safety_factor | round_mem | clamp($p.memory_min_mi; $p.memory_max_mi)) as $rec_mem_req | + # Compute recommended memory limit + ($rec_mem_req * $p.memory_burst_multiplier | round_mem | clamp($rec_mem_req; $p.memory_max_mi)) as $rec_mem_lim | + + # Compute change percentages + (if $cur_cpu_req > 0 then (($rec_cpu_req - $cur_cpu_req) / 
$cur_cpu_req * 100 | floor) else 100 end) as $cpu_req_pct | + (if $cur_cpu_lim > 0 then (($rec_cpu_lim - $cur_cpu_lim) / $cur_cpu_lim * 100 | floor) else 100 end) as $cpu_lim_pct | + (if $cur_mem_req > 0 then (($rec_mem_req - $cur_mem_req) / $cur_mem_req * 100 | floor) else 100 end) as $mem_req_pct | + (if $cur_mem_lim > 0 then (($rec_mem_lim - $cur_mem_lim) / $cur_mem_lim * 100 | floor) else 100 end) as $mem_lim_pct | + + # Check step constraint — suppress if change is too small + (if $cur_cpu_req > 0 then (($cpu_req_pct | fabs) >= $step) else true end) as $cpu_changed | + (if $cur_mem_req > 0 then (($mem_req_pct | fabs) >= $step) else true end) as $mem_changed | + ($cpu_changed or $mem_changed) as $has_recommendation | + + # Classification + (if $p95_cpu == 0 and $p95_mem == 0 then "insufficient-data" + elif $m.oom_kills > 0 then "limit-bound" + elif $m.throttle_ratio > 0.1 then "limit-bound" + elif ($cur_cpu_req > 0 and $cur_cpu_req > ($p95_cpu * $p.cpu_safety_factor * 2)) then "over-provisioned" + elif ($cur_mem_req > 0 and $cur_mem_req > ($p95_mem * $p.memory_safety_factor * 2)) then "over-provisioned" + elif ($cur_cpu_req > 0 and $cur_cpu_req < ($p95_cpu * 0.9)) then "under-provisioned" + elif ($cur_mem_req > 0 and $cur_mem_req < ($p95_mem * 0.9)) then "under-provisioned" + elif ($has_recommendation | not) then "right-sized" + else "adjust" + end) as $classification | + + { + workload_kind: $c.workload_kind, + workload_name: $c.workload_name, + namespace: $c.namespace, + container: $c.container, + metrics_source: $m.source, + classification: $classification, + has_recommendation: $has_recommendation, + cpu_request: { + current_milli: $cur_cpu_req, + recommended_milli: $rec_cpu_req, + change_percent: $cpu_req_pct + }, + cpu_limit: { + current_milli: $cur_cpu_lim, + recommended_milli: $rec_cpu_lim, + change_percent: $cpu_lim_pct + }, + memory_request: { + current_mi: $cur_mem_req, + recommended_mi: $rec_mem_req, + change_percent: $mem_req_pct + }, + 
memory_limit: {
+        current_mi: $cur_mem_lim,
+        recommended_mi: $rec_mem_lim,
+        change_percent: $mem_lim_pct
+      },
+      throttle_ratio: $m.throttle_ratio,
+      oom_kills: $m.oom_kills,
+      advisory_only: ($advisory == "true")
+    }
+  '
+}
+
+###############################################################################
+# Report Generation
+###############################################################################
+
+format_cpu() {
+  # Convert millicores to display string (values are always shown in millicores)
+  local milli="$1"
+  echo "${milli}m"
+}
+
+format_mem() {
+  # Convert MiB to display string
+  local mi="$1"
+  if [ "$mi" -ge 1024 ]; then
+    local gi
+    gi=$(echo "$mi" | jq -r '. / 1024 | . * 10 | floor / 10')
+    echo "${gi}Gi"
+  else
+    echo "${mi}Mi"
+  fi
+}
+
+generate_markdown_report() {
+  local recommendations_file="$1"
+
+  local count
+  count=$(jq 'length' "$recommendations_file")
+
+  if [ "$count" -eq 0 ]; then
+    echo "# Rightsizing Report"
+    echo ""
+    echo "No workloads found or no recommendations to make."
+    return
+  fi
+
+  local advisory_flag
+  advisory_flag=$(jq -r '.[0].advisory_only' "$recommendations_file")
+
+  echo "# Rightsizing Report"
+  echo ""
+  echo "**Namespace:** ${NAMESPACE}"
+  echo "**Mode:** ${MODE}"
+  echo "**Metrics source:** $(jq -r '.[0].metrics_source' "$recommendations_file")"
+  echo "**Lookback:** ${LOOKBACK}"
+  if [ "$advisory_flag" = "true" ]; then
+    echo ""
+    echo "> **Advisory only** — metrics-server provides point-in-time data only. Use Prometheus for production rightsizing."
+ fi + echo "" + echo "| Workload | Container | Resource | Current | Recommended | Change | Classification |" + echo "|----------|-----------|----------|---------|-------------|--------|----------------|" + + jq -r '.[] | select(.has_recommendation == true) | + "\(.workload_kind)/\(.workload_name)|\(.container)|\(.cpu_request.current_milli)|\(.cpu_request.recommended_milli)|\(.cpu_request.change_percent)|\(.cpu_limit.current_milli)|\(.cpu_limit.recommended_milli)|\(.cpu_limit.change_percent)|\(.memory_request.current_mi)|\(.memory_request.recommended_mi)|\(.memory_request.change_percent)|\(.memory_limit.current_mi)|\(.memory_limit.recommended_mi)|\(.memory_limit.change_percent)|\(.classification)" + ' "$recommendations_file" | while IFS='|' read -r wl ctr cur_cr rec_cr pct_cr cur_cl rec_cl pct_cl cur_mr rec_mr pct_mr cur_ml rec_ml pct_ml cls; do + echo "| ${wl} | ${ctr} | CPU req | $(format_cpu "$cur_cr") | $(format_cpu "$rec_cr") | ${pct_cr}% | ${cls} |" + echo "| ${wl} | ${ctr} | CPU lim | $(format_cpu "$cur_cl") | $(format_cpu "$rec_cl") | ${pct_cl}% | ${cls} |" + echo "| ${wl} | ${ctr} | Mem req | $(format_mem "$cur_mr") | $(format_mem "$rec_mr") | ${pct_mr}% | ${cls} |" + echo "| ${wl} | ${ctr} | Mem lim | $(format_mem "$cur_ml") | $(format_mem "$rec_ml") | ${pct_ml}% | ${cls} |" + done + + # Summary of right-sized / insufficient-data + local right_sized insufficient + right_sized=$(jq '[.[] | select(.classification == "right-sized")] | length' "$recommendations_file") + insufficient=$(jq '[.[] | select(.classification == "insufficient-data")] | length' "$recommendations_file") + if [ "$right_sized" -gt 0 ] || [ "$insufficient" -gt 0 ]; then + echo "" + echo "**Summary:**" + [ "$right_sized" -gt 0 ] && echo "- ${right_sized} container(s) are right-sized (no changes needed)" + [ "$insufficient" -gt 0 ] && echo "- ${insufficient} container(s) have insufficient data for recommendations" + fi +} + +generate_json_report() { + local recommendations_file="$1" + + jq 
'[.[] | { + workload: "\(.workload_kind)/\(.workload_name)", + container: .container, + classification: .classification, + advisory_only: .advisory_only, + cpu_request: { + current: "\(.cpu_request.current_milli)m", + recommended: "\(.cpu_request.recommended_milli)m", + change_percent: .cpu_request.change_percent + }, + cpu_limit: { + current: "\(.cpu_limit.current_milli)m", + recommended: "\(.cpu_limit.recommended_milli)m", + change_percent: .cpu_limit.change_percent + }, + memory_request: { + current: "\(.memory_request.current_mi)Mi", + recommended: "\(.memory_request.recommended_mi)Mi", + change_percent: .memory_request.change_percent + }, + memory_limit: { + current: "\(.memory_limit.current_mi)Mi", + recommended: "\(.memory_limit.recommended_mi)Mi", + change_percent: .memory_limit.change_percent + }, + throttle_ratio: .throttle_ratio, + oom_kills: .oom_kills + }]' "$recommendations_file" +} + +############################################################################### +# Patch Generation +############################################################################### + +generate_patches() { + local recommendations_file="$1" + local output_dir="$2" + + jq -c '.[] | select(.has_recommendation == true)' "$recommendations_file" | while IFS= read -r rec; do + local wl_kind wl_name ns ctr + wl_kind=$(echo "$rec" | jq -r '.workload_kind') + wl_name=$(echo "$rec" | jq -r '.workload_name') + ns=$(echo "$rec" | jq -r '.namespace') + ctr=$(echo "$rec" | jq -r '.container') + + local patch_file="${output_dir}/patch-${wl_kind}-${wl_name}.json" + + # Build strategic merge patch via jq + local rec_cpu_req rec_cpu_lim rec_mem_req rec_mem_lim + rec_cpu_req=$(echo "$rec" | jq -r '.cpu_request.recommended_milli') + rec_cpu_lim=$(echo "$rec" | jq -r '.cpu_limit.recommended_milli') + rec_mem_req=$(echo "$rec" | jq -r '.memory_request.recommended_mi') + rec_mem_lim=$(echo "$rec" | jq -r '.memory_limit.recommended_mi') + + # If patch file already exists (multiple containers), 
merge + if [ -f "$patch_file" ]; then + local existing + existing=$(cat "$patch_file") + echo "$existing" | jq --arg ctr "$ctr" \ + --arg cpu_req "${rec_cpu_req}m" \ + --arg cpu_lim "${rec_cpu_lim}m" \ + --arg mem_req "${rec_mem_req}Mi" \ + --arg mem_lim "${rec_mem_lim}Mi" ' + .spec.template.spec.containers += [{ + name: $ctr, + resources: { + requests: {cpu: $cpu_req, memory: $mem_req}, + limits: {cpu: $cpu_lim, memory: $mem_lim} + } + }] + ' > "$patch_file" + else + jq -n --arg ctr "$ctr" \ + --arg cpu_req "${rec_cpu_req}m" \ + --arg cpu_lim "${rec_cpu_lim}m" \ + --arg mem_req "${rec_mem_req}Mi" \ + --arg mem_lim "${rec_mem_lim}Mi" '{ + spec: { + template: { + spec: { + containers: [{ + name: $ctr, + resources: { + requests: {cpu: $cpu_req, memory: $mem_req}, + limits: {cpu: $cpu_lim, memory: $mem_lim} + } + }] + } + } + } + }' > "$patch_file" + fi + done +} + +generate_rollback() { + local ns="$1" + local recommendations_file="$2" + local rollback_dir="$3" + + mkdir -p "$rollback_dir" + + local timestamp + timestamp=$(date -u +"%Y-%m-%dT%H:%M:%SZ") + echo "[$timestamp] Rollback bundle created" > "${rollback_dir}/run.log" + + # Backup current specs for each unique workload + jq -r '.[] | select(.has_recommendation == true) | "\(.workload_kind)/\(.workload_name)"' "$recommendations_file" | sort -u | while IFS= read -r wl_ref; do + local kind name + kind=$(echo "$wl_ref" | cut -d'/' -f1) + name=$(echo "$wl_ref" | cut -d'/' -f2) + + local backup_file="${rollback_dir}/backup-${kind}-${name}.json" + kubectl get "$kind" "$name" -n "$ns" -o json | jq '{ + spec: { + template: { + spec: { + containers: [.spec.template.spec.containers[] | { + name: .name, + resources: .resources + }] + } + } + } + }' > "$backup_file" + + # Generate rollback script + local rollback_script="${rollback_dir}/rollback-${kind}-${name}.sh" + local patch_content + patch_content=$(cat "$backup_file") + jq -n --arg kind "$kind" --arg name "$name" --arg ns "$ns" \ + --arg patch "$patch_content" '{ + 
command: "kubectl patch \($kind) \($name) -n \($ns) --type=strategic -p",
+ patch: $patch
+ }' > /dev/null # Shape check only; the actual rollback script is written below
+
+ # Write the rollback script; values are expanded now, at generation time
+ {
+ echo '#!/usr/bin/env bash'
+ echo 'set -euo pipefail'
+ echo "kubectl patch ${kind} ${name} -n ${ns} --type=strategic -p '${patch_content}'"
+ } > "$rollback_script"
+ chmod +x "$rollback_script"
+
+ echo "[$timestamp] Backed up $kind/$name" >> "${rollback_dir}/run.log"
+ done
+}
+
+###############################################################################
+# Apply Mode
+###############################################################################
+
+apply_patches() {
+ local ns="$1"
+ local recommendations_file="$2"
+ local patch_dir="$3"
+ local rollback_dir="$4"
+
+ local timestamp
+ timestamp=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
+
+ jq -r '.[] | select(.has_recommendation == true) | "\(.workload_kind)/\(.workload_name)"' "$recommendations_file" | sort -u | while IFS= read -r wl_ref; do
+ local kind name
+ kind=$(echo "$wl_ref" | cut -d'/' -f1)
+ name=$(echo "$wl_ref" | cut -d'/' -f2)
+
+ local patch_file="${patch_dir}/patch-${kind}-${name}.json"
+ if [ ! -f "$patch_file" ]; then
+ echo "[$timestamp] SKIP: no patch file for $kind/$name" >> "${rollback_dir}/run.log"
+ continue
+ fi
+
+ local patch_content
+ patch_content=$(cat "$patch_file")
+
+ echo "[$timestamp] Applying patch to $kind/$name in $ns" >> "${rollback_dir}/run.log"
+
+ local apply_result
+ if ! apply_result=$(kubectl patch "$kind" "$name" -n "$ns" --type=strategic -p "$patch_content" 2>&1); then
+ echo "[$timestamp] FAILED: $kind/$name — $apply_result" >> "${rollback_dir}/run.log"
+ echo "{\"error\":\"failed to patch $kind/$name: $apply_result\"}" >&2
+ echo "Rollback available at: ${rollback_dir}/" >&2
+ exit 1
+ fi
+
+ echo "[$timestamp] SUCCESS: $apply_result" >> "${rollback_dir}/run.log"
+
+ # Verify rollout
+ if !
kubectl rollout status "$kind/$name" -n "$ns" --timeout=120s >> "${rollback_dir}/run.log" 2>&1; then
+ echo "[$timestamp] WARNING: rollout not yet complete for $kind/$name" >> "${rollback_dir}/run.log"
+ fi
+ done
+}
+
+###############################################################################
+# YAML Output (for plan/apply modes)
+###############################################################################
+
+generate_yaml_output() {
+ local patch_dir="$1"
+
+ for patch_file in "${patch_dir}"/patch-*.json; do
+ [ -f "$patch_file" ] || continue
+ local basename_
+ basename_=$(basename "$patch_file" .json)
+ # Extract kind and name from filename: patch-<kind>-<name>.json
+ local kind name
+ kind=$(echo "$basename_" | sed 's/^patch-//' | cut -d'-' -f1)
+ name=$(echo "$basename_" | sed 's/^patch-[^-]*-//')
+
+ echo "---"
+ # Convert patch JSON to YAML-like output via jq
+ jq -r --arg kind "$kind" --arg name "$name" --arg ns "$NAMESPACE" '
+ "apiVersion: apps/v1",
+ "kind: \($kind | gsub("deployment";"Deployment") | gsub("statefulset";"StatefulSet") | gsub("sts";"StatefulSet"))",
+ "metadata:",
+ "  name: \($name)",
+ "  namespace: \($ns)",
+ "spec:",
+ "  template:",
+ "    spec:",
+ "      containers:",
+ (.spec.template.spec.containers[] |
+ "      - name: \(.name)",
+ "        resources:",
+ "          requests:",
+ "            cpu: \"\(.resources.requests.cpu)\"",
+ "            memory: \"\(.resources.requests.memory)\"",
+ "          limits:",
+ "            cpu: \"\(.resources.limits.cpu)\"",
+ "            memory: \"\(.resources.limits.memory)\""
+ )
+ ' "$patch_file"
+ done
+}
+
+###############################################################################
+# Main Orchestration
+###############################################################################
+
+main() {
+ # Step 0: Preflight
+ preflight
+
+ # Load and validate policy
+ local policy_json eff_policy
+ policy_json=$(policy_load)
+
+ # Step 1: Discover workloads
+ local workloads_json containers_json
+ workloads_json=$(discover_workloads "$NAMESPACE" "$WORKLOAD" "$LABEL_SELECTOR")
+
+ local workload_count
+ 
workload_count=$(echo "$workloads_json" | jq '.items | length')
+ if [ "$workload_count" -eq 0 ]; then
+ echo '{"error":"no workloads found matching criteria"}' >&2
+ exit 1
+ fi
+
+ containers_json=$(extract_containers "$workloads_json")
+
+ local container_count
+ container_count=$(echo "$containers_json" | jq 'length')
+ if [ "$container_count" -eq 0 ]; then
+ echo '{"error":"no containers found in matched workloads"}' >&2
+ exit 1
+ fi
+
+ # Step 2: Collect metrics and compute recommendations
+ local recommendations="[]"
+
+ local i=0
+ while [ "$i" -lt "$container_count" ]; do
+ local container_info
+ container_info=$(echo "$containers_json" | jq ".[$i]")
+
+ local wl_kind wl_name ctr_name
+ wl_kind=$(echo "$container_info" | jq -r '.workload_kind')
+ wl_name=$(echo "$container_info" | jq -r '.workload_name')
+ ctr_name=$(echo "$container_info" | jq -r '.container')
+
+ # Resolve policy for this workload
+ local workload_key="${NAMESPACE}/${wl_name}"
+ eff_policy=$(resolve_policy "$policy_json" "$NAMESPACE" "$workload_key")
+ validate_policy "$eff_policy"
+
+ # Collect metrics
+ local metrics=""
+
+ # Try Prometheus first
+ if [ -n "${PROMETHEUS_URL:-}" ]; then
+ metrics=$(get_metrics_prom "$NAMESPACE" "$wl_name" "$ctr_name" "$LOOKBACK" 2>/dev/null || echo "")
+ fi
+
+ # Fallback to kubectl top
+ if [ -z "$metrics" ]; then
+ metrics=$(get_metrics_top "$NAMESPACE" "$wl_name" "$ctr_name" 2>/dev/null || echo "")
+ fi
+
+ if [ -z "$metrics" ]; then
+ # No metrics available — mark as insufficient data
+ metrics='{"p95_cpu_milli":0,"p95_memory_mi":0,"throttle_ratio":0,"oom_kills":0,"source":"none"}'
+ fi
+
+ # Step 3: Compute recommendation
+ local rec
+ rec=$(compute_recommendation "$container_info" "$metrics" "$eff_policy")
+ recommendations=$(echo "$recommendations" | jq --argjson r "$rec" '. 
+ [$r]') + + i=$((i + 1)) + done + + # Block apply mode with metrics-server fallback + if [ "$MODE" = "apply" ] && [ "$ADVISORY_ONLY" = "true" ]; then + echo '{"error":"apply mode is blocked when using metrics-server fallback (insufficient data fidelity). Use Prometheus for apply mode."}' >&2 + exit 1 + fi + + # Save recommendations to temp file + local rec_file="${TMPDIR_WORK}/recommendations.json" + echo "$recommendations" > "$rec_file" + + # Check if there are any actionable recommendations + local actionable_count + actionable_count=$(jq '[.[] | select(.has_recommendation == true)] | length' "$rec_file") + + # Step 4: Generate output + case "$MODE" in + dry-run) + case "$OUTPUT_FORMAT" in + markdown) generate_markdown_report "$rec_file" ;; + json) generate_json_report "$rec_file" ;; + yaml) + echo "# YAML output is only available in plan or apply mode" + echo "# Showing markdown report instead" + echo "" + generate_markdown_report "$rec_file" + ;; + esac + ;; + plan) + if [ "$actionable_count" -eq 0 ]; then + echo "No actionable recommendations — all workloads are right-sized or have insufficient data." + exit 0 + fi + + local patch_dir="${TMPDIR_WORK}/patches" + mkdir -p "$patch_dir" + generate_patches "$rec_file" "$patch_dir" + + case "$OUTPUT_FORMAT" in + markdown) + generate_markdown_report "$rec_file" + echo "" + echo "## Generated Patches" + echo "" + generate_yaml_output "$patch_dir" + ;; + json) generate_json_report "$rec_file" ;; + yaml) generate_yaml_output "$patch_dir" ;; + esac + ;; + apply) + if [ "$actionable_count" -eq 0 ]; then + echo "No actionable recommendations — all workloads are right-sized or have insufficient data." 
+ exit 0 + fi + + local ts + ts=$(date -u +"%Y%m%dT%H%M%SZ") + local rollback_dir="rollback-${ts}" + local patch_dir="${TMPDIR_WORK}/patches" + mkdir -p "$patch_dir" + + # Generate rollback bundle + generate_rollback "$NAMESPACE" "$rec_file" "$rollback_dir" + + # Generate patches + generate_patches "$rec_file" "$patch_dir" + + # Show report + generate_markdown_report "$rec_file" + echo "" + echo "## Applying Patches" + echo "" + echo "Rollback bundle: \`${rollback_dir}/\`" + echo "" + + # Apply + apply_patches "$NAMESPACE" "$rec_file" "$patch_dir" "$rollback_dir" + + echo "" + echo "Patches applied successfully. To rollback:" + echo "" + echo '```bash' + echo "ls ${rollback_dir}/rollback-*.sh" + echo '```' + ;; + esac +} + +main diff --git a/forge-skills/local/registry_embedded_test.go b/forge-skills/local/registry_embedded_test.go index 91c8eff..1c6b006 100644 --- a/forge-skills/local/registry_embedded_test.go +++ b/forge-skills/local/registry_embedded_test.go @@ -16,12 +16,12 @@ func TestEmbeddedRegistry_DiscoverAll(t *testing.T) { t.Fatalf("List error: %v", err) } - if len(skills) != 10 { + if len(skills) != 11 { names := make([]string, len(skills)) for i, s := range skills { names[i] = s.Name } - t.Fatalf("expected 10 skills, got %d: %v", len(skills), names) + t.Fatalf("expected 11 skills, got %d: %v", len(skills), names) } // Verify all expected skills are present @@ -41,6 +41,7 @@ func TestEmbeddedRegistry_DiscoverAll(t *testing.T) { "code-review-github": {displayName: "Code Review Github", hasEnv: true, hasBins: true, hasEgress: true}, "codegen-react": {displayName: "Codegen React", hasEnv: false, hasBins: true, hasEgress: true}, "codegen-html": {displayName: "Codegen Html", hasEnv: false, hasBins: true, hasEgress: true}, + "k8s-pod-rightsizer": {displayName: "K8s Pod Rightsizer", hasEnv: false, hasBins: true, hasEgress: false}, } for _, s := range skills { From 9ac31704054eb9794e0d767d84ad143b73cc2655 Mon Sep 17 00:00:00 2001 From: MK Date: Wed, 4 Mar 2026 
04:09:41 -0500 Subject: [PATCH 2/5] feat: add distinct icons for all embedded skills MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replace default cardboard box icon with skill-specific icons: ⚖️ k8s-pod-rightsizer, 🔬 tavily-research, 🔎 code-review, 📏 code-review-standards, ⚛️ codegen-react, 🌐 codegen-html --- forge-cli/internal/tui/steps/skills_step.go | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-) diff --git a/forge-cli/internal/tui/steps/skills_step.go b/forge-cli/internal/tui/steps/skills_step.go index ca0de1b..818f28f 100644 --- a/forge-cli/internal/tui/steps/skills_step.go +++ b/forge-cli/internal/tui/steps/skills_step.go @@ -413,11 +413,19 @@ func (s *SkillsStep) Apply(ctx *tui.WizardContext) { func skillIcon(name string) string { icons := map[string]string{ - "github": "🐙", - "weather": "🌤️", - "tavily-search": "🔍", - "k8s-incident-triage": "☸️", - "k8s_incident_triage": "☸️", + "github": "🐙", + "weather": "🌤️", + "tavily-search": "🔍", + "tavily-research": "🔬", + "k8s-incident-triage": "☸️", + "k8s_incident_triage": "☸️", + "k8s-pod-rightsizer": "⚖️", + "k8s_pod_rightsizer": "⚖️", + "code-review": "🔎", + "code-review-standards": "📏", + "code-review-github": "🐙", + "codegen-react": "⚛️", + "codegen-html": "🌐", } if icon, ok := icons[name]; ok { return icon From cc71418a57f3236167681bfcde4c81dd1f35f481 Mon Sep 17 00:00:00 2001 From: MK Date: Wed, 4 Mar 2026 04:38:25 -0500 Subject: [PATCH 3/5] refactor: move skill icons from hardcoded map to SKILL.md frontmatter MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add `icon` field to SkillMetadata and SkillDescriptor, flowing through the parser → scanner → registry → TUI pipeline. Icons are now declared in each skill's SKILL.md frontmatter (e.g. `icon: ⚖️`) and automatically picked up via go:embed — no TUI code changes needed when adding skills. 
The hardcoded skillIcon() map is replaced with a simple fallback that returns 📦 for skills missing the field. A test ensures all embedded skills declare an icon. --- forge-cli/cmd/init.go | 1 + forge-cli/internal/tui/steps/skills_step.go | 29 ++++++------------- forge-skills/contract/types.go | 2 ++ .../local/embedded/_template/SKILL.md | 1 + .../embedded/code-review-github/SKILL.md | 1 + .../embedded/code-review-standards/SKILL.md | 1 + .../local/embedded/code-review/SKILL.md | 1 + .../local/embedded/codegen-html/SKILL.md | 1 + .../local/embedded/codegen-react/SKILL.md | 1 + forge-skills/local/embedded/github/SKILL.md | 1 + .../embedded/k8s-incident-triage/SKILL.md | 1 + .../embedded/k8s-pod-rightsizer/SKILL.md | 1 + .../local/embedded/tavily-research/SKILL.md | 1 + .../local/embedded/tavily-search/SKILL.md | 1 + forge-skills/local/embedded/weather/SKILL.md | 1 + forge-skills/local/registry_embedded_test.go | 21 ++++++++++++++ forge-skills/local/scanner.go | 1 + 17 files changed, 46 insertions(+), 20 deletions(-) diff --git a/forge-cli/cmd/init.go b/forge-cli/cmd/init.go index 313fb25..2fd67bd 100644 --- a/forge-cli/cmd/init.go +++ b/forge-cli/cmd/init.go @@ -207,6 +207,7 @@ func collectInteractive(opts *initOptions) error { Name: s.Name, DisplayName: s.DisplayName, Description: s.Description, + Icon: s.Icon, RequiredEnv: s.RequiredEnv, OneOfEnv: s.OneOfEnv, OptionalEnv: s.OptionalEnv, diff --git a/forge-cli/internal/tui/steps/skills_step.go b/forge-cli/internal/tui/steps/skills_step.go index 818f28f..a6ef329 100644 --- a/forge-cli/internal/tui/steps/skills_step.go +++ b/forge-cli/internal/tui/steps/skills_step.go @@ -16,6 +16,7 @@ type SkillInfo struct { Name string DisplayName string Description string + Icon string RequiredEnv []string OneOfEnv []string OptionalEnv []string @@ -70,7 +71,10 @@ func NewSkillsStep(styles *tui.StyleSet, skills []SkillInfo) *SkillsStep { var items []components.MultiSelectItem for _, sk := range skills { - icon := skillIcon(sk.Name) 
+ icon := sk.Icon + if icon == "" { + icon = skillIcon(sk.Name) + } var reqs []string if len(sk.RequiredBins) > 0 { reqs = append(reqs, "bins: "+strings.Join(sk.RequiredBins, ", ")) @@ -411,24 +415,9 @@ func (s *SkillsStep) Apply(ctx *tui.WizardContext) { } } -func skillIcon(name string) string { - icons := map[string]string{ - "github": "🐙", - "weather": "🌤️", - "tavily-search": "🔍", - "tavily-research": "🔬", - "k8s-incident-triage": "☸️", - "k8s_incident_triage": "☸️", - "k8s-pod-rightsizer": "⚖️", - "k8s_pod_rightsizer": "⚖️", - "code-review": "🔎", - "code-review-standards": "📏", - "code-review-github": "🐙", - "codegen-react": "⚛️", - "codegen-html": "🌐", - } - if icon, ok := icons[name]; ok { - return icon - } +// skillIcon returns a default icon for skills that don't declare one +// in their SKILL.md frontmatter. Prefer adding "icon:" to frontmatter +// instead of extending this function. +func skillIcon(_ string) string { return "📦" } diff --git a/forge-skills/contract/types.go b/forge-skills/contract/types.go index 740f688..ea1597d 100644 --- a/forge-skills/contract/types.go +++ b/forge-skills/contract/types.go @@ -7,6 +7,7 @@ type SkillDescriptor struct { Description string Category string Tags []string + Icon string RequiredEnv []string OneOfEnv []string OptionalEnv []string @@ -36,6 +37,7 @@ type SkillMetadata struct { Description string `yaml:"description,omitempty"` Category string `yaml:"category,omitempty"` Tags []string `yaml:"tags,omitempty"` + Icon string `yaml:"icon,omitempty"` Metadata map[string]map[string]any `yaml:"metadata,omitempty"` } diff --git a/forge-skills/local/embedded/_template/SKILL.md b/forge-skills/local/embedded/_template/SKILL.md index ff449f1..de1744b 100644 --- a/forge-skills/local/embedded/_template/SKILL.md +++ b/forge-skills/local/embedded/_template/SKILL.md @@ -1,5 +1,6 @@ --- name: my-skill +# icon: 🔧 # Optional: emoji shown in TUI skill picker # category: ops # Optional: sre, research, ops, dev, security, etc. 
# tags: # Optional: discovery keywords # - example diff --git a/forge-skills/local/embedded/code-review-github/SKILL.md b/forge-skills/local/embedded/code-review-github/SKILL.md index e920a3a..2ac28c5 100644 --- a/forge-skills/local/embedded/code-review-github/SKILL.md +++ b/forge-skills/local/embedded/code-review-github/SKILL.md @@ -1,5 +1,6 @@ --- name: code-review-github +icon: 🐙 category: developer tags: - code-review diff --git a/forge-skills/local/embedded/code-review-standards/SKILL.md b/forge-skills/local/embedded/code-review-standards/SKILL.md index 2f68fb8..3937b57 100644 --- a/forge-skills/local/embedded/code-review-standards/SKILL.md +++ b/forge-skills/local/embedded/code-review-standards/SKILL.md @@ -1,5 +1,6 @@ --- name: code-review-standards +icon: 📏 category: developer tags: - code-review diff --git a/forge-skills/local/embedded/code-review/SKILL.md b/forge-skills/local/embedded/code-review/SKILL.md index 6bab179..4b30c7d 100644 --- a/forge-skills/local/embedded/code-review/SKILL.md +++ b/forge-skills/local/embedded/code-review/SKILL.md @@ -1,5 +1,6 @@ --- name: code-review +icon: 🔎 category: developer tags: - code-review diff --git a/forge-skills/local/embedded/codegen-html/SKILL.md b/forge-skills/local/embedded/codegen-html/SKILL.md index 4d8b8b4..5735a9a 100644 --- a/forge-skills/local/embedded/codegen-html/SKILL.md +++ b/forge-skills/local/embedded/codegen-html/SKILL.md @@ -1,5 +1,6 @@ --- name: codegen-html +icon: 🌐 category: developer tags: - code-generation diff --git a/forge-skills/local/embedded/codegen-react/SKILL.md b/forge-skills/local/embedded/codegen-react/SKILL.md index 68e8bf6..b16f54b 100644 --- a/forge-skills/local/embedded/codegen-react/SKILL.md +++ b/forge-skills/local/embedded/codegen-react/SKILL.md @@ -1,5 +1,6 @@ --- name: codegen-react +icon: ⚛️ category: developer tags: - code-generation diff --git a/forge-skills/local/embedded/github/SKILL.md b/forge-skills/local/embedded/github/SKILL.md index 2e2f8f6..a2f556d 100644 --- 
a/forge-skills/local/embedded/github/SKILL.md +++ b/forge-skills/local/embedded/github/SKILL.md @@ -1,5 +1,6 @@ --- name: github +icon: 🐙 description: Create issues, PRs, and query repositories metadata: forge: diff --git a/forge-skills/local/embedded/k8s-incident-triage/SKILL.md b/forge-skills/local/embedded/k8s-incident-triage/SKILL.md index ae81312..bc12a62 100644 --- a/forge-skills/local/embedded/k8s-incident-triage/SKILL.md +++ b/forge-skills/local/embedded/k8s-incident-triage/SKILL.md @@ -1,5 +1,6 @@ --- name: k8s-incident-triage +icon: ☸️ category: sre tags: - kubernetes diff --git a/forge-skills/local/embedded/k8s-pod-rightsizer/SKILL.md b/forge-skills/local/embedded/k8s-pod-rightsizer/SKILL.md index 4f5a9dc..ebde5ab 100644 --- a/forge-skills/local/embedded/k8s-pod-rightsizer/SKILL.md +++ b/forge-skills/local/embedded/k8s-pod-rightsizer/SKILL.md @@ -1,5 +1,6 @@ --- name: k8s-pod-rightsizer +icon: ⚖️ category: sre tags: - kubernetes diff --git a/forge-skills/local/embedded/tavily-research/SKILL.md b/forge-skills/local/embedded/tavily-research/SKILL.md index 9adcbc9..c5780f1 100644 --- a/forge-skills/local/embedded/tavily-research/SKILL.md +++ b/forge-skills/local/embedded/tavily-research/SKILL.md @@ -1,5 +1,6 @@ --- name: tavily-research +icon: 🔬 description: Deep multi-source research using Tavily Research API metadata: forge: diff --git a/forge-skills/local/embedded/tavily-search/SKILL.md b/forge-skills/local/embedded/tavily-search/SKILL.md index 9ca9813..222b25c 100644 --- a/forge-skills/local/embedded/tavily-search/SKILL.md +++ b/forge-skills/local/embedded/tavily-search/SKILL.md @@ -1,5 +1,6 @@ --- name: tavily-search +icon: 🔍 description: Search the web using Tavily AI search API metadata: forge: diff --git a/forge-skills/local/embedded/weather/SKILL.md b/forge-skills/local/embedded/weather/SKILL.md index 786ac21..926590a 100644 --- a/forge-skills/local/embedded/weather/SKILL.md +++ b/forge-skills/local/embedded/weather/SKILL.md @@ -1,5 +1,6 @@ --- 
name: weather +icon: 🌤️ description: Get current weather and forecasts metadata: forge: diff --git a/forge-skills/local/registry_embedded_test.go b/forge-skills/local/registry_embedded_test.go index 1c6b006..71bcfb4 100644 --- a/forge-skills/local/registry_embedded_test.go +++ b/forge-skills/local/registry_embedded_test.go @@ -81,6 +81,9 @@ func TestEmbeddedRegistry_GitHubDetails(t *testing.T) { if s.Description != "Create issues, PRs, and query repositories" { t.Errorf("Description = %q", s.Description) } + if s.Icon != "🐙" { + t.Errorf("Icon = %q, want 🐙", s.Icon) + } if len(s.RequiredEnv) != 1 || s.RequiredEnv[0] != "GH_TOKEN" { t.Errorf("RequiredEnv = %v", s.RequiredEnv) } @@ -190,6 +193,24 @@ func TestEmbeddedRegistry_TavilyResearchDetails(t *testing.T) { } } +func TestEmbeddedRegistry_AllSkillsHaveIcons(t *testing.T) { + reg, err := NewEmbeddedRegistry() + if err != nil { + t.Fatalf("NewEmbeddedRegistry error: %v", err) + } + + skills, err := reg.List() + if err != nil { + t.Fatalf("List error: %v", err) + } + + for _, s := range skills { + if s.Icon == "" { + t.Errorf("skill %q has no icon — add 'icon:' to its SKILL.md frontmatter", s.Name) + } + } +} + func TestEmbeddedRegistry_LoadContent(t *testing.T) { reg, err := NewEmbeddedRegistry() if err != nil { diff --git a/forge-skills/local/scanner.go b/forge-skills/local/scanner.go index 7c3355e..738dfe1 100644 --- a/forge-skills/local/scanner.go +++ b/forge-skills/local/scanner.go @@ -73,6 +73,7 @@ func Scan(fsys fs.FS) ([]contract.SkillDescriptor, error) { sd.Category = meta.Category sd.Tags = meta.Tags + sd.Icon = meta.Icon // Extract forge-specific fields if meta.Metadata != nil { From bd1d3440721b89d63869f5b8a4cac62284dc8bf5 Mon Sep 17 00:00:00 2001 From: MK Date: Wed, 4 Mar 2026 04:50:54 -0500 Subject: [PATCH 4/5] fix: add missing category and tags to all embedded skills - github: category=developer, tags: github, issues, pull-requests, repositories - tavily-research: category=research, tags: research, 
web-search, tavily, analysis - tavily-search: category=research, tags: web-search, tavily, search - weather: category=utilities, tags: weather, forecast, api Add test to enforce all embedded skills declare category and tags. --- forge-skills/local/embedded/github/SKILL.md | 6 ++++++ .../local/embedded/tavily-research/SKILL.md | 6 ++++++ .../local/embedded/tavily-search/SKILL.md | 5 +++++ forge-skills/local/embedded/weather/SKILL.md | 5 +++++ forge-skills/local/registry_embedded_test.go | 21 +++++++++++++++++++ 5 files changed, 43 insertions(+) diff --git a/forge-skills/local/embedded/github/SKILL.md b/forge-skills/local/embedded/github/SKILL.md index a2f556d..7dd0fed 100644 --- a/forge-skills/local/embedded/github/SKILL.md +++ b/forge-skills/local/embedded/github/SKILL.md @@ -1,6 +1,12 @@ --- name: github icon: 🐙 +category: developer +tags: + - github + - issues + - pull-requests + - repositories description: Create issues, PRs, and query repositories metadata: forge: diff --git a/forge-skills/local/embedded/tavily-research/SKILL.md b/forge-skills/local/embedded/tavily-research/SKILL.md index c5780f1..d0b0f76 100644 --- a/forge-skills/local/embedded/tavily-research/SKILL.md +++ b/forge-skills/local/embedded/tavily-research/SKILL.md @@ -1,6 +1,12 @@ --- name: tavily-research icon: 🔬 +category: research +tags: + - research + - web-search + - tavily + - analysis description: Deep multi-source research using Tavily Research API metadata: forge: diff --git a/forge-skills/local/embedded/tavily-search/SKILL.md b/forge-skills/local/embedded/tavily-search/SKILL.md index 222b25c..2f45ad5 100644 --- a/forge-skills/local/embedded/tavily-search/SKILL.md +++ b/forge-skills/local/embedded/tavily-search/SKILL.md @@ -1,6 +1,11 @@ --- name: tavily-search icon: 🔍 +category: research +tags: + - web-search + - tavily + - search description: Search the web using Tavily AI search API metadata: forge: diff --git a/forge-skills/local/embedded/weather/SKILL.md 
b/forge-skills/local/embedded/weather/SKILL.md index 926590a..d634df2 100644 --- a/forge-skills/local/embedded/weather/SKILL.md +++ b/forge-skills/local/embedded/weather/SKILL.md @@ -1,6 +1,11 @@ --- name: weather icon: 🌤️ +category: utilities +tags: + - weather + - forecast + - api description: Get current weather and forecasts metadata: forge: diff --git a/forge-skills/local/registry_embedded_test.go b/forge-skills/local/registry_embedded_test.go index 71bcfb4..d0bda32 100644 --- a/forge-skills/local/registry_embedded_test.go +++ b/forge-skills/local/registry_embedded_test.go @@ -193,6 +193,27 @@ func TestEmbeddedRegistry_TavilyResearchDetails(t *testing.T) { } } +func TestEmbeddedRegistry_AllSkillsHaveCategoryAndTags(t *testing.T) { + reg, err := NewEmbeddedRegistry() + if err != nil { + t.Fatalf("NewEmbeddedRegistry error: %v", err) + } + + skills, err := reg.List() + if err != nil { + t.Fatalf("List error: %v", err) + } + + for _, s := range skills { + if s.Category == "" { + t.Errorf("skill %q has no category — add 'category:' to its SKILL.md frontmatter", s.Name) + } + if len(s.Tags) == 0 { + t.Errorf("skill %q has no tags — add 'tags:' to its SKILL.md frontmatter", s.Name) + } + } +} + func TestEmbeddedRegistry_AllSkillsHaveIcons(t *testing.T) { reg, err := NewEmbeddedRegistry() if err != nil { From 14b50661ae03acd292066d91d996f4b868427e63 Mon Sep 17 00:00:00 2001 From: MK Date: Wed, 4 Mar 2026 04:53:27 -0500 Subject: [PATCH 5/5] docs: update skills.md with icon, category, and tags requirements - Add icon/category/tags to the SKILL.md format example - Document all frontmatter fields in a table (icon, category, tags required) - Add Icon column to Built-in Skills table, fill in all categories - Fix parser path reference (forge-skills/parser/parser.go) --- docs/skills.md | 52 +++++++++++++++++++++++++++++++++----------------- 1 file changed, 35 insertions(+), 17 deletions(-) diff --git a/docs/skills.md b/docs/skills.md index 48c0dbf..bbbc368 100644 --- 
a/docs/skills.md
+++ b/docs/skills.md
@@ -15,6 +15,12 @@ Skills are defined in a Markdown file (default: `SKILL.md`). The file supports o
 ```markdown
 ---
 name: weather
+icon: 🌤️
+category: utilities
+tags:
+  - weather
+  - forecast
+  - api
 description: Weather data skill
 metadata:
   forge:
@@ -45,13 +51,24 @@
 ### YAML Frontmatter
 
-The `metadata.forge.requires` block declares:
+Top-level fields:
+
+| Field | Required | Description |
+|-------|----------|-------------|
+| `name` | yes | Skill identifier (kebab-case) |
+| `icon` | yes (embedded) | Emoji shown in the TUI skill picker; defaults to 📦 when omitted |
+| `category` | yes (embedded) | Grouping for `forge skills list --category` (e.g., `sre`, `developer`, `research`, `utilities`) |
+| `tags` | yes (embedded) | Discovery keywords for `forge skills list --tags` (kebab-case) |
+| `description` | yes | One-line summary |
+
+The `metadata.forge.requires` block declares runtime dependencies:
+
 - **`bins`** — Binary dependencies that must be in `$PATH` at runtime
 - **`env.required`** — Environment variables that must be set
 - **`env.one_of`** — At least one of these environment variables must be set
 - **`env.optional`** — Optional environment variables for extended functionality
 
-Frontmatter is parsed by `ParseWithMetadata()` in `forge-core/skills/parser.go` and feeds into the compilation pipeline.
+Frontmatter is parsed by `ParseWithMetadata()` in `forge-skills/parser/parser.go` and feeds into the compilation pipeline.
 
 ### Legacy List Format
 
@@ -118,11 +135,12 @@ Skill scripts run in a restricted environment via `SkillCommandExecutor`:
 
 ## Skill Categories & Tags
 
-Skills can declare a `category` and `tags` in their frontmatter for organization and filtering:
+All embedded skills must declare `category`, `tags`, and `icon` in their frontmatter. Categories and tags must be lowercase kebab-case.
```markdown --- name: k8s-incident-triage +icon: ☸️ category: sre tags: - kubernetes @@ -131,7 +149,7 @@ tags: --- ``` -Categories and tags must be lowercase kebab-case. Use them to filter skills: +Use categories and tags to filter skills: ```bash # List skills by category @@ -143,19 +161,19 @@ forge skills list --tags kubernetes,incident-response ## Built-in Skills -| Skill | Category | Description | Scripts | -|-------|----------|-------------|---------| -| `github` | — | Create issues, PRs, and query repositories | — (binary-backed) | -| `weather` | — | Get weather data for a location | — (binary-backed) | -| `tavily-search` | — | Search the web using Tavily AI search API | `tavily-search.sh` | -| `tavily-research` | — | Deep multi-source research via Tavily API | `tavily-research.sh`, `tavily-research-poll.sh` | -| `k8s-incident-triage` | sre | Read-only Kubernetes incident triage using kubectl | — (binary-backed) | -| `k8s-pod-rightsizer` | sre | Analyze workload metrics and produce CPU/memory rightsizing recommendations with optional apply | — (binary-backed) | -| `code-review` | developer | AI-powered code review for diffs and files | `code-review-diff.sh`, `code-review-file.sh` | -| `code-review-standards` | developer | Initialize and manage code review standards | — (template-based) | -| `code-review-github` | developer | Post code review results to GitHub PRs | — (binary-backed) | -| `codegen-react` | developer | Scaffold and iterate on Vite + React apps | `codegen-react-scaffold.sh`, `codegen-react-read.sh`, `codegen-react-write.sh`, `codegen-react-run.sh` | -| `codegen-html` | developer | Scaffold standalone Preact + HTM apps (zero dependencies) | `codegen-html-scaffold.sh`, `codegen-html-read.sh`, `codegen-html-write.sh` | +| Skill | Icon | Category | Description | Scripts | +|-------|------|----------|-------------|---------| +| `github` | 🐙 | developer | Create issues, PRs, and query repositories | — (binary-backed) | +| `weather` | 🌤️ | utilities | 
Get weather data for a location | — (binary-backed) | +| `tavily-search` | 🔍 | research | Search the web using Tavily AI search API | `tavily-search.sh` | +| `tavily-research` | 🔬 | research | Deep multi-source research via Tavily API | `tavily-research.sh`, `tavily-research-poll.sh` | +| `k8s-incident-triage` | ☸️ | sre | Read-only Kubernetes incident triage using kubectl | — (binary-backed) | +| `k8s-pod-rightsizer` | ⚖️ | sre | Analyze workload metrics and produce CPU/memory rightsizing recommendations | — (binary-backed) | +| `code-review` | 🔎 | developer | AI-powered code review for diffs and files | `code-review-diff.sh`, `code-review-file.sh` | +| `code-review-standards` | 📏 | developer | Initialize and manage code review standards | — (template-based) | +| `code-review-github` | 🐙 | developer | Post code review results to GitHub PRs | — (binary-backed) | +| `codegen-react` | ⚛️ | developer | Scaffold and iterate on Vite + React apps | `codegen-react-scaffold.sh`, `codegen-react-read.sh`, `codegen-react-write.sh`, `codegen-react-run.sh` | +| `codegen-html` | 🌐 | developer | Scaffold standalone Preact + HTM apps (zero dependencies) | `codegen-html-scaffold.sh`, `codegen-html-read.sh`, `codegen-html-write.sh` | ### Tavily Research Skill