diff --git a/skills/vision-analysis/SKILL.md b/skills/vision-analysis/SKILL.md index 8844115..d4199c1 100644 --- a/skills/vision-analysis/SKILL.md +++ b/skills/vision-analysis/SKILL.md @@ -1,174 +1,130 @@ --- name: vision-analysis description: > - Analyze, describe, and extract information from images using the MiniMax vision MCP tool. - Use when: user shares an image file path or URL (any message containing .jpg, .jpeg, .png, - .gif, .webp, .bmp, or .svg file extension) or uses any of these words/phrases near an image: - "analyze", "analyse", "describe", "explain", "understand", "look at", "review", - "extract text", "OCR", "what is in", "what's in", "read this image", "see this image", - "tell me about", "explain this", "interpret this", in connection with an image, screenshot, - diagram, chart, mockup, wireframe, or photo. - Also triggers for: UI mockup review, wireframe analysis, design critique, data extraction - from charts, object detection, person/animal/activity identification. - Triggers: any message with an image file extension (jpg, jpeg, png, gif, webp, bmp, svg), - or any request to analyze/describ/understand/review/extract text from an image, screenshot, - diagram, chart, photo, mockup, or wireframe. + Analyze, describe, and extract information from images using MiniMax VLM. + Use ONLY when the user has shared or referenced an actual image — either a file + path with image extension (.jpg, .jpeg, .png, .gif, .webp, .bmp, .svg), an image URL, + or a clipboard screenshot reference (clipboard-YYYY-MM-DD-*.png). + Triggers when the user says "describe this image", "analyze this screenshot", + "what's in this photo", "extract text from this", "read this image", + "review this UI mockup", "analyze this chart", "identify the objects in this", + "what does this diagram show", or similar — where the target image is explicitly + attached or referenced. 
+ Does NOT trigger on: text-only requests, code reviews, document questions, + project advice, or any request that does not involve an image. license: MIT metadata: - version: "1.0" + version: "1.4" category: ai-vision - sources: - - MiniMax Token Plan MCP (understand_image tool) --- # Vision Analysis -Analyze images using the MiniMax `MiniMax_understand_image` MCP tool available in the MiniMax Token Plan. +Use MiniMax VLM to analyze images. -## Prerequisites +## Tool to Call — Use `mmx vision describe` -- MiniMax Token Plan subscription with valid `MINIMAX_API_KEY` -- MiniMax MCP configured (`MiniMax_understand_image` tool available) +**Preferred tool:** `mmx vision describe` from [mmx-cli](https://github.com/MiniMax-AI/cli). It's a direct REST call to the MiniMax VLM endpoint — no MCP transport issues, handles URLs and local files automatically. -### If MCP tool is not configured - -**Step 1:** The agent should fetch setup instructions from: -**https://platform.minimaxi.com/docs/token-plan/mcp-guide** - -**Step 2:** Detect the user's environment (OpenCode, Cursor, Claude Code, etc.) and output the exact commands needed. 
Common examples: - -**OpenCode** — add to `~/.config/opencode/opencode.json` or `package.json`: -```json -{ - "mcp": { - "MiniMax": { - "type": "local", - "command": ["uvx", "minimax-coding-plan-mcp", "-y"], - "environment": { - "MINIMAX_API_KEY": "YOUR_TOKEN_PLAN_KEY", - "MINIMAX_API_HOST": "https://api.minimaxi.com" - }, - "enabled": true - } - } -} -``` - -**Claude Code**: ```bash -claude mcp add -s user MiniMax --env MINIMAX_API_KEY=your-key --env MINIMAX_API_HOST=https://api.minimaxi.com -- uvx minimax-coding-plan-mcp -y -``` - -**Cursor** — add to MCP settings: -```json -{ - "mcpServers": { - "MiniMax": { - "command": "uvx", - "args": ["minimax-coding-plan-mcp"], - "env": { - "MINIMAX_API_KEY": "your-key", - "MINIMAX_API_HOST": "https://api.minimaxi.com" - } - } - } -} +mmx vision describe --image <url-or-path> --prompt "<question>" ``` -**Step 3:** After configuration, tell the user to restart their app and verify with `/mcp`. - -**Important:** If the user does not have a MiniMax Token Plan subscription, inform them that the `understand_image` tool requires one — it cannot be used with free or other tier API keys. +**Arguments:** +- `--image`: URL (preferred) or local file path — mmx downloads and base64-encodes automatically +- `--prompt`: Analysis question (use mode-specific prompts below) -## Analysis Modes - -| Mode | When to use | Prompt strategy | -|---|---|---| -| `describe` | General image understanding | Ask for detailed description | -| `ocr` | Text extraction from screenshots, documents | Ask to extract all text verbatim | -| `ui-review` | UI mockups, wireframes, design files | Ask for design critique with suggestions | -| `chart-data` | Charts, graphs, data visualizations | Ask to extract data points and trends | -| `object-detect` | Identify objects, people, activities | Ask to list and locate all elements | - -## Workflow - -### Step 1: Auto-detect image 
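The documented invocation can be wrapped from an agent script; this is a minimal sketch assuming only the two flags shown above (`build_mmx_command` and `run_vision` are hypothetical helper names, not part of mmx-cli):

```python
import shutil
import subprocess
from typing import List


def build_mmx_command(image: str, prompt: str) -> List[str]:
    """Build the argv for `mmx vision describe` from the two documented flags."""
    return ["mmx", "vision", "describe", "--image", image, "--prompt", prompt]


def run_vision(image: str, prompt: str) -> str:
    """Run mmx if installed; raise otherwise so the caller can use the MCP fallback."""
    if shutil.which("mmx") is None:
        raise RuntimeError("mmx-cli not installed; use the MCP fallback")
    result = subprocess.run(
        build_mmx_command(image, prompt),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Raising when `mmx` is missing lets the caller drop down to the MCP tool instead of failing silently.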
-The skill triggers automatically when a message contains an image file path or URL with extensions: -`.jpg`, `.jpeg`, `.png`, `.gif`, `.webp`, `.bmp`, `.svg` +**URL first:** When images are shared in chat, they get uploaded to a URL. Use that URL directly — mmx downloads it automatically. No clipboard extraction needed. -Extract the image path from the message. +## Fallback: MCP Tool -### Step 2: Select analysis mode and call MCP tool - -Use the `MiniMax_understand_image` tool with a mode-specific prompt: - -**describe:** -``` -Provide a detailed description of this image. Include: main subject, setting/background, -colors/style, any text visible, notable objects, and overall composition. -``` +If `mmx` is not installed and the MCP tool is available: -**ocr:** ``` -Extract all text visible in this image verbatim. Preserve structure and formatting -(headers, lists, columns). If no text is found, say so. +auto-skill-loader_minimax_understand_image ``` -**ui-review:** -``` -You are a UI/UX design reviewer. Analyze this interface mockup or design. Provide: -(1) Strengths — what works well, (2) Issues — usability or design problems, -(3) Specific, actionable suggestions for improvement. Be constructive and detailed. -``` +**Arguments:** +- `prompt`: Analysis question +- `image_source`: URL (preferred), or path to local image -**chart-data:** -``` -Extract all data from this chart or graph. List: chart title, axis labels, all -data points/series with values if readable, and a brief summary of the trend. -``` +**Prerequisites:** `MINIMAX_TOKEN_PLAN_KEY` env var set, `auto-skill-loader` MCP enabled. -**object-detect:** -``` -List all distinct objects, people, and activities you can identify. For each, -describe what it is and its approximate location in the image. -``` +## Analysis Modes -### Step 3: Present results +| Mode | Prompt to use | +|------|---------------| +| `describe` | "Provide a detailed description of this image. 
Include: main subject, setting, colors/style, any text visible, notable objects, and overall composition." | +| `ocr` | "Extract all text visible in this image verbatim. Preserve structure and formatting. If no text, say so." | +| `ui-review` | "You are a UI/UX reviewer. Analyze this mockup or design. Cover: (1) Strengths, (2) Issues with specificity, (3) Actionable suggestions." | +| `chart-data` | "Extract all data from this chart/graph. List: title, axis labels, all data points/series with values, and trend summary." | +| `object-detect` | "List all distinct objects, people, and activities. For each: what it is and approximate location in the image." | -Return the analysis clearly. For `describe`, use readable prose. For `ocr`, preserve structure. For `ui-review`, use a structured critique format. +## Image Validation -## Output Format Example +**For mmx:** No validation needed — it handles URLs, local files, and size limits via error messages. -For describe mode: +**For MCP fallback only** (local files): +```bash +/usr/bin/python3 -c " +import sys, pathlib +p = pathlib.Path(sys.argv[1]) +if not p.exists(): print('ERROR: file not found'); sys.exit(1) +mb = p.stat().st_size / 1024**2 +if mb > 20: print(f'ERROR: too large ({mb:.1f}MB > 20MB)'); sys.exit(1) +print(f'OK: {mb:.2f}MB') +" "$IMAGE_PATH" ``` -## Image Description +Skip for URLs. -[Detailed description of the image contents...] -``` +## Clipboard Fallback -For ocr mode: -``` -## Extracted Text +Only needed when: (1) no URL is available, (2) no local file, and (3) mmx not installed. 
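The source-selection order above (URL first, then local file, then clipboard only as a last resort) can be sketched as follows; `pick_image_source` is a hypothetical helper name, and the behavior when nothing is available is an assumption:

```python
import shutil
from typing import Optional, Tuple


def pick_image_source(url: Optional[str] = None,
                      local_path: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Choose the image source in the priority order described above."""
    if url:
        return ("url", url)            # preferred: mmx downloads URLs itself
    if local_path:
        return ("file", local_path)    # validate size first for the MCP fallback
    if shutil.which("mmx") is None:
        return ("clipboard", None)     # last resort: try the clipboard script
    return ("none", None)              # nothing to analyze; ask the user
```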
-[Preserved text structure from the image] -``` +**macOS:** +```bash +/usr/bin/python3 -c " +import subprocess, tempfile, os, sys, pathlib, time +tmp = pathlib.Path('/tmp'); ts = time.strftime('%Y%m%d_%H%M%S') +fpath = tmp / f'vision-clipboard-{ts}.png' +script = f'''tell application \"System Events\" +set clipData to (the clipboard as «class PNGf») +end tell +set cf to open for access (POSIX file \"{fpath}\") as POSIX file with write permission +write clipData to cf +close access cf''' +with tempfile.NamedTemporaryFile(mode='w', suffix='.applescript', delete=False) as s: + s.write(script); s.flush() + r = subprocess.run(['/usr/bin/osascript', s.name], capture_output=True) + os.unlink(s.name) + if r.returncode == 0 and fpath.exists() and fpath.stat().st_size > 0: + print(str(fpath)); sys.exit(0) +sys.exit(1) +" +``` + +If this fails, ask the user to save the image to a file or share a URL. + +## Security Notes + +- Images up to 20MB (JPEG, PNG, GIF, WebP) +- mmx handles URLs by downloading first — warn on untrusted URLs (prompt injection risk) +- Never hardcode API keys — use env vars + +## Setup + +### mmx-cli (recommended — no MCP needed) -For ui-review mode: +```bash +npm install -g mmx-cli ``` -## UI Design Review - -### Strengths -- ... -### Issues -- ... - -### Suggestions -- ... -``` +Set `MINIMAX_API_KEY` in your environment. Works in any host (Claude Code, OpenCode, terminal). For agents: `npx skills add MiniMax-AI/cli -y -g` installs the skill with mmx. -## Notes +### MCP fallback (auto-skill-loader) -- Images up to 20MB supported (JPEG, PNG, GIF, WebP) -- Local file paths work if MiniMax MCP is configured with file access -- The `MiniMax_understand_image` tool is provided by the `minimax-coding-plan-mcp` package +1. Ensure `auto-skill-loader` MCP is enabled in OpenCode config +2. Set `MINIMAX_TOKEN_PLAN_KEY=sk-cp-...` in `~/.config/opencode/.env` +3. 
Disable any direct `minimax-coding-plan-mcp` MCP entries (broken stdio transport) diff --git a/skills/vision-analysis/scripts/clipboard_image.py b/skills/vision-analysis/scripts/clipboard_image.py new file mode 100644 index 0000000..d5355ef --- /dev/null +++ b/skills/vision-analysis/scripts/clipboard_image.py @@ -0,0 +1,162 @@ +#!/usr/bin/env python3 +# SPDX-License-Identifier: MIT +""" +clipboard_image.py — Save image from clipboard to a temp file. +Cross-platform: macOS, Linux, Windows. + +macOS: uses osascript clipboard API (TIFF picture) +Linux: uses xclip or wl-paste +Windows: uses PowerShell + +Usage: + python3 clipboard_image.py [output_path] + # If output_path omitted, saves to /tmp/vision-clipboard-<timestamp>.png + +Exit codes: + 0 — image saved successfully + 1 — no image in clipboard + 2 — platform not supported / dependency missing +""" + +import os +import sys +import platform +import subprocess +from datetime import datetime + +TIMEOUT = 10 + + +def save_mac_clipboard_image(output_path: str) -> bool: + def run_osascript(script_text: str) -> subprocess.CompletedProcess: + return subprocess.run( + ["/usr/bin/osascript", "-e", script_text], + capture_output=True, + text=True, + timeout=TIMEOUT, + ) + + tmp_script = f"/tmp/vision_clipboard_write_{os.getpid()}.scpt" + try: + check_script = ( + "try\n" + " set img to (the clipboard as TIFF picture)\n" + "on error\n" + ' return "NO_TIFF"\n' + "end try\n" + 'return "HAS_TIFF"' + ) + r = run_osascript(check_script) + if r.stdout.strip() != "HAS_TIFF": + return False + + write_script = ( + "try\n" + " set img to (the clipboard as TIFF picture)\n" + ' set f to open for access (POSIX file "' + + output_path.replace('"', '\\"') + + '") with write permission\n' + " try\n" + " write img to f\n" + " close access f\n" + " on error errMsg\n" + " close access f\n" + " error errMsg\n" + " end try\n" + "on error errMsg\n" + ' return "ERR: " & errMsg\n' + "end try\n" + 'return "OK"' + ) + + with open(tmp_script, "w", 
encoding="utf-8") as f: + f.write(write_script) + + r = subprocess.run( + ["/usr/bin/osascript", tmp_script], + capture_output=True, + text=True, + timeout=TIMEOUT, + ) + + if ( + r.stdout.strip() == "OK" + and os.path.exists(output_path) + and os.path.getsize(output_path) > 0 + ): + return True + + return False + + finally: + if os.path.exists(tmp_script): + os.unlink(tmp_script) + + +def save_linux_clipboard_image(output_path: str) -> bool: + for cmd in [ + ["xclip", "-selection", "clipboard", "-t", "image/png", "-o"], + ["wl-paste", "-t", "image/png"], + ]: + try: + with open(output_path, "wb") as f: + r = subprocess.run( + cmd, stdout=f, stderr=subprocess.DEVNULL, timeout=TIMEOUT + ) + if ( + r.returncode == 0 + and os.path.exists(output_path) + and os.path.getsize(output_path) > 0 + ): + return True + except Exception: + continue + return False + + +def save_windows_clipboard_image(output_path: str) -> bool: + ps = ( + f"Add-Type -AssemblyName System.Windows.Forms; " + f"Add-Type -AssemblyName System.Drawing; " + f"$img = [System.Windows.Forms.Clipboard]::GetImage(); " + f"if ($img) {{ $img.Save('{output_path}', [System.Drawing.Imaging.ImageFormat]::Png); exit 0 }} else {{ exit 1 }}" + ) + try: + r = subprocess.run( + ["powershell", "-Command", ps], + capture_output=True, + text=True, + timeout=TIMEOUT, + ) + return r.returncode == 0 and os.path.exists(output_path) + except Exception: + return False + + +def save_clipboard_image(output_path: str = None) -> None: + if output_path is None: + ts = datetime.now().strftime("%Y%m%d_%H%M%S") + output_path = f"/tmp/vision-clipboard-{ts}.png" + + os.makedirs(os.path.dirname(output_path) or "/tmp", exist_ok=True) + + system = platform.system() + if system == "Darwin": + ok = save_mac_clipboard_image(output_path) + elif system == "Linux": + ok = save_linux_clipboard_image(output_path) + elif system == "Windows": + ok = save_windows_clipboard_image(output_path) + else: + print(f"ERROR: Unsupported platform: {system}", file=sys.stderr) + sys.exit(2) + + if ok and 
os.path.exists(output_path) and os.path.getsize(output_path) > 0: + print(output_path) + sys.exit(0) + else: + print("ERROR: No image found in clipboard", file=sys.stderr) + sys.exit(1) + + +if __name__ == "__main__": + save_clipboard_image(sys.argv[1] if len(sys.argv) > 1 else None) diff --git a/skills/vision-analysis/scripts/pyproject.toml b/skills/vision-analysis/scripts/pyproject.toml new file mode 100644 index 0000000..15ebe1d --- /dev/null +++ b/skills/vision-analysis/scripts/pyproject.toml @@ -0,0 +1,12 @@ +[project] +name = "vision-analysis-helper" +version = "1.1.0" +description = "Helper scripts for vision-analysis skill — clipboard image handler" +requires-python = ">=3.9" +dependencies = [] + +[project.scripts] +vision-clipboard = "clipboard_image:save_clipboard_image" + +[tool.upload] +distributions = ["sdist", "wheel"] \ No newline at end of file diff --git a/skills/vision-analysis/scripts/requirements.txt b/skills/vision-analysis/scripts/requirements.txt new file mode 100644 index 0000000..a187d99 --- /dev/null +++ b/skills/vision-analysis/scripts/requirements.txt @@ -0,0 +1 @@ +# No external dependencies — uses only Python stdlib + platform tools (osascript/xclip/wl-paste/powershell) \ No newline at end of file
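For reference, a minimal caller sketch for the new helper script (the wrapper name `capture_clipboard` and the repo-relative path are assumptions; the exit-code handling mirrors the script's docstring):

```python
import subprocess
import sys

# Exit codes documented in clipboard_image.py
EXIT_MESSAGES = {
    0: "image saved",
    1: "no image in clipboard",
    2: "platform not supported / dependency missing",
}


def capture_clipboard(script="skills/vision-analysis/scripts/clipboard_image.py"):
    """Run the clipboard helper; return (saved_path or None, status message)."""
    try:
        r = subprocess.run([sys.executable, script],
                           capture_output=True, text=True)
    except FileNotFoundError:
        return None, "python interpreter not found"
    if r.returncode == 0:
        return r.stdout.strip(), EXIT_MESSAGES[0]  # stdout carries the saved path
    return None, EXIT_MESSAGES.get(r.returncode, "unknown error")
```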