diff --git a/skills/minimax-multimodal-toolkit/SKILL.md b/skills/minimax-multimodal-toolkit/SKILL.md index b7e856f..ee313f0 100644 --- a/skills/minimax-multimodal-toolkit/SKILL.md +++ b/skills/minimax-multimodal-toolkit/SKILL.md @@ -1,23 +1,33 @@ --- name: minimax-multimodal-toolkit -description: > - MiniMax multimodal model skill — use MiniMax Multi-Modal models for speech, music, video, and image. - Create voice, music, video, and images with MiniMax AI: TTS (text-to-speech, voice cloning, voice design, - multi-segment), music (songs, instrumentals), video (text-to-video, image-to-video, start-end frame, - subject reference, templates, long-form multi-scene), image (text-to-image, image-to-image with character - reference), and media processing (convert, concat, trim, extract). - Use when the user mentions MiniMax, multimodal generation, or wants speech/music/video/image AI, - MiniMax APIs, or FFmpeg workflows alongside MiniMax outputs. -license: MIT -metadata: - version: "1.0" - category: media-generation +description: "MiniMax multimodal model skill — use MiniMax Multi-Modal models for speech, music, video, and image. Create voice, music, video, and images with MiniMax AI: TTS (text-to-speech, voice cloning, voice design, multi-segment), music (songs, instrumentals), video (text-to-video, image-to-video, start-end frame, subject reference, templates, long-form multi-scene), image (text-to-image, image-to-image with character reference), and media processing (convert, concat, trim, extract). Use when the user mentions MiniMax, multimodal generation, or wants speech/music/video/image AI, MiniMax APIs, or FFmpeg workflows alongside MiniMax outputs." --- # MiniMax Multi-Modal Toolkit Generate voice, music, video, and image content via MiniMax APIs — the unified entry for **MiniMax multimodal** use cases (audio + music + video + image).
Includes voice cloning & voice design for custom voices, image generation with character reference, and FFmpeg-based media tools for audio/video format conversion, concatenation, trimming, and extraction. +## Default Models + +When the user does not specify a model, always use the default model for each capability. Do NOT ask the user to choose a model unless they explicitly mention model selection. + +| Capability | Default Model | Notes | +|------------|---------------|-------| +| TTS | `speech-2.8-hd` | Auto emotion matching, recommended | +| Music | `music-2.5` | Only available model | +| Image | `image-01` | Only available model | +| Video | `MiniMax-Hailuo-2.3` | 6s + 768P, supports all modes (t2v/i2v/sef/ref) | + +Only switch to an alternative model (e.g. `speech-2.8-turbo`, `MiniMax-Hailuo-2.3-Fast`) when the user explicitly requests faster generation or names a specific model. + +### Error Handling + +When a default model call fails: + +1. **Always show the user the exact error message** returned by the API — do not silently retry or hide errors. +2. **Video generation quota exhausted**: If `MiniMax-Hailuo-2.3` returns a quota/limit error (e.g. `insufficient_quota`, `rate_limit`, `balance`), automatically retry with `MiniMax-Hailuo-2.3-Fast` and inform the user: "MiniMax-Hailuo-2.3 quota exhausted — automatically retrying with MiniMax-Hailuo-2.3-Fast." +3. **Other capabilities** (TTS, Music, Image): Show the error to the user and wait for their instructions. Do not auto-switch models. + ## Output Directory **All generated files MUST be saved to `minimax-output/` under the AGENT'S current working directory (NOT the skill directory).** Every script call MUST include an explicit `--output` / `-o` argument pointing to this location. Never omit the output argument or rely on script defaults. @@ -43,8 +53,8 @@ MiniMax provides two service endpoints for different regions. 
Set `MINIMAX_API_H | Region | Platform URL | API Host Value | |--------|-------------|----------------| -| China Mainland(中国大陆) | https://platform.minimaxi.com | `https://api.minimaxi.com` | -| Global(全球) | https://platform.minimax.io | `https://api.minimax.io` | +| China Mainland | https://platform.minimaxi.com | `https://api.minimaxi.com` | +| Global | https://platform.minimax.io | `https://api.minimax.io` | ```bash # China Mainland @@ -76,37 +86,6 @@ Before running any script, check if `MINIMAX_API_KEY` is set in the environment. 1. Ask the user to provide their MiniMax API key 2. Instruct and help user to set it via `export MINIMAX_API_KEY="sk-..."` in their terminal or add it to their shell profile (`~/.zshrc` / `~/.bashrc`) for persistence -## Plan Limits & Quotas - -**IMPORTANT — Always respect the user's plan limits before generating content.** If the user's quota is exhausted or insufficient, warn them before proceeding. - -### Standard Plans - -| Capability | Starter | Plus | Max | -|---|---|---|---| -| M2.7 (chat) | 600 req/5h | 1,500 req/5h | 4,500 req/5h | -| Speech 2.8 | — | 4,000 chars/day | 11,000 chars/day | -| image-01 | — | 50 images/day | 120 images/day | -| Hailuo-2.3-Fast 768P 6s | — | — | 2 videos/day | -| Hailuo-2.3 768P 6s | — | — | 2 videos/day | -| Music-2.5 | — | — | 4 songs/day (≤5 min each) | - -### High-Speed Plans - -| Capability | Plus-HS | Max-HS | Ultra-HS | -|---|---|---|---| -| M2.7-highspeed (chat) | 1,500 req/5h | 4,500 req/5h | 30,000 req/5h | -| Speech 2.8 | 9,000 chars/day | 19,000 chars/day | 50,000 chars/day | -| image-01 | 100 images/day | 200 images/day | 800 images/day | -| Hailuo-2.3-Fast 768P 6s | — | 3 videos/day | 5 videos/day | -| Hailuo-2.3 768P 6s | — | 3 videos/day | 5 videos/day | -| Music-2.5 | — | 7 songs/day (≤5 min each) | 15 songs/day (≤5 min each) | - -**Key quota constraints:** -- **Video resolution: 768P only** — 1080P is not available on any plan -- **Video duration: 6s** — all plan quotas are counted 
in 6-second units -- **Video quota is very limited** (2–5/day depending on plan) — always confirm with the user before generating video - ## Key Capabilities | Capability | Description | Entry point | @@ -190,8 +169,6 @@ bash scripts/tts/generate_voice.sh convert input.wav -o minimax-output/output.mp |-------|-------| | speech-2.8-hd | Recommended, auto emotion matching | | speech-2.8-turbo | Faster variant | -| speech-2.6-hd | Previous gen, manual emotion | -| speech-2.6-turbo | Previous gen, faster | ### segments.json Format @@ -204,7 +181,7 @@ Default crossfade between segments: **200ms** (`--crossfade 200`). ] ``` -Leave `emotion` empty for speech-2.8 models (auto-matched from text). +Leave `emotion` empty (auto-matched from text by speech-2.8 models). ### IMPORTANT: Multi-Segment Script Generation Rules (Audiobooks, Podcasts, etc.) @@ -303,24 +280,24 @@ Do NOT always default to `1:1`. Analyze the user's request and choose the most a | User intent / context | Recommended ratio | Resolution | |-----------------------|-------------------|------------| -| 头像、图标、社交媒体头像、avatar、icon、profile pic | `1:1` | 1024×1024 | -| 风景、横幅、桌面壁纸、landscape、banner、desktop wallpaper | `16:9` | 1280×720 | -| 传统照片、经典比例、classic photo | `4:3` | 1152×864 | -| 摄影作品、杂志封面、photography、magazine | `3:2` | 1248×832 | -| 人像竖图、海报、portrait photo、poster | `2:3` | 832×1248 | -| 竖版海报、书籍封面、tall poster、book cover | `3:4` | 864×1152 | -| 手机壁纸、社交媒体故事、phone wallpaper、story、reel | `9:16` | 720×1280 | -| 超宽全景、电影画幅、panoramic、cinematic ultrawide | `21:9` | 1344×576 | -| 未指定特定需求 / ambiguous | `1:1` | 1024×1024 | +| Avatar, icon, profile pic, social media avatar | `1:1` | 1024×1024 | +| Landscape, banner, desktop wallpaper | `16:9` | 1280×720 | +| Classic photo, traditional ratio | `4:3` | 1152×864 | +| Photography, magazine cover | `3:2` | 1248×832 | +| Portrait photo, poster | `2:3` | 832×1248 | +| Tall poster, book cover | `3:4` | 864×1152 | +| Phone wallpaper, social story, reel | `9:16` | 720×1280 | +| 
Ultra-wide panoramic, cinematic ultrawide | `21:9` | 1344×576 | +| Ambiguous / unspecified | `1:1` | 1024×1024 | ### IMPORTANT: Image Count — When to generate multiple images | User intent | Count (`-n`) | |-------------|--------------| | Default / single image request | `1` (default) | -| 用户说"几张"、"多张"、"一些" / "a few", "several" | `3` | -| 用户说"多种方案"、"备选" / "variations", "options" | `3`–`4` | -| 用户明确指定数量 | Use the specified number (1–9) | +| "a few", "several", "some" | `3` | +| "variations", "options", "alternatives" | `3`–`4` | +| User specifies an exact number | Use the specified number (1–9) | ### Text-to-Image Examples @@ -416,30 +393,33 @@ bash scripts/image/generate_image.sh \ | User intent | Script to use | |-------------|---------------| | Default / no special request | `scripts/video/generate_video.sh` (single segment, **6s, 768P**) | -| User explicitly asks for "long video", "multi-scene", "story", or duration > 10s | `scripts/video/generate_long_video.sh` (multi-segment) | +| User explicitly asks for "long video", "multi-scene", "story", or duration > 6s | `scripts/video/generate_long_video.sh` (multi-segment) | -**Default behavior:** Always use single-segment `generate_video.sh` with **duration 6s and resolution 768P** unless the user explicitly asks for a long video or multi-scene video. Do NOT automatically split into multiple segments — a single 6s video is the standard output. Only use `generate_long_video.sh` when the user clearly needs multi-scene or longer content. +**Default behavior:** Always use single-segment `generate_video.sh` with **duration 6s and resolution 768P** unless the user explicitly asks for a long video, multi-scene video, or specifies a total duration exceeding 6 seconds. Do NOT automatically split into multiple segments — a single 6s video is the standard output. Only use `generate_long_video.sh` when the user clearly needs multi-scene or longer content. 
Entry point (single video): `scripts/video/generate_video.sh` Entry point (long/multi-scene): `scripts/video/generate_long_video.sh` ### Video Model Constraints (MUST follow) -**Supported resolutions and durations by model:** +**Duration limits by model and resolution:** + +| Model | 768P | +|-------|------| +| MiniMax-Hailuo-2.3-Fast | 6s | +| MiniMax-Hailuo-2.3 | 6s | -| Model | Resolution | Duration | -|-------|-----------|----------| -| MiniMax-Hailuo-2.3 | 768P only | 6s or 10s | -| MiniMax-Hailuo-2.3-Fast | 768P only | 6s or 10s | -| MiniMax-Hailuo-02 | 512P, 768P (default) | 6s or 10s | -| T2V-01 / T2V-01-Director | 720P | 6s only | -| I2V-01 / I2V-01-Director / I2V-01-live | 720P | 6s only | -| S2V-01 (ref) | 720P | 6s only | +**Resolution options by model and duration:** + +| Model | 6s | +|-------|-----| +| MiniMax-Hailuo-2.3-Fast | 768P | +| MiniMax-Hailuo-2.3 | 768P | **Key rules:** -- **Default: 6s + 768P** — plan quotas are counted in 6-second units; use 6s unless user explicitly requests 10s -- **1080P is NOT supported** on any plan — always use 768P for Hailuo-2.3/2.3-Fast -- Older models (T2V-01, I2V-01, S2V-01) only support 6s at 720P +- **Default: `MiniMax-Hailuo-2.3` + 6s + 768P** +- `MiniMax-Hailuo-2.3-Fast` only supports `6s + 768P` +- `MiniMax-Hailuo-2.3` only supports `6s + 768P` ### IMPORTANT: Prompt Optimization (MUST follow before generating any video) @@ -449,17 +429,17 @@ Before calling any video generation script, you MUST optimize the user's prompt 1. 
**Apply the Professional Formula**: `Main subject + Scene + Movement + Camera motion + Aesthetic atmosphere` - BAD: `"A puppy in a park"` - - GOOD: `"A golden retriever puppy runs toward the camera on a sun-dappled grass path in a park, [跟随] smooth tracking shot, warm golden hour lighting, shallow depth of field, joyful atmosphere"` + - GOOD: `"A golden retriever puppy runs toward the camera on a sun-dappled grass path in a park, [Tracking shot] smooth tracking, warm golden hour lighting, shallow depth of field, joyful atmosphere"` -2. **Add camera instructions** using `[指令]` syntax: `[推进]`, `[拉远]`, `[跟随]`, `[固定]`, `[左摇]`, etc. +2. **Add camera instructions** using `[command]` syntax: `[Push in]`, `[Pull out]`, `[Tracking shot]`, `[Static shot]`, `[Pan left]`, etc. 3. **Include aesthetic details**: lighting (golden hour, dramatic side lighting), color grading (warm tones, cinematic), texture (dust particles, rain droplets), atmosphere (intimate, epic, peaceful) -4. **Keep to 1-2 key actions** for 6-10 second videos — do not overcrowd with events +4. **Keep to 1-2 key actions** for 6-second videos — do not overcrowd with events 5. **For i2v mode** (image-to-video): Focus prompt on **movement and change only**, since the image already establishes the visual. Do NOT re-describe what's in the image. - BAD: `"A lake with mountains"` (just repeating the image) - - GOOD: `"Gentle ripples spread across the water surface, a breeze rustles the distant trees, [固定] fixed camera, soft morning light, peaceful and serene"` + - GOOD: `"Gentle ripples spread across the water surface, a breeze rustles the distant trees, [Static shot] fixed camera, soft morning light, peaceful and serene"` 6. **For multi-segment long videos**: Each segment's prompt must be self-contained and optimized individually. The i2v segments (segment 2+) should describe motion/change relative to the previous segment's ending frame. 
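The Professional Formula above can be sketched as a small shell helper. This is an illustration only: `build_video_prompt` is a hypothetical function, not part of the toolkit scripts, and the comma-separated layout is an assumption modeled on the GOOD examples in this section.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the toolkit): assemble a video prompt from the
# five formula parts. The camera command is wrapped in [brackets].
build_video_prompt() {
  local subject="$1" scene="$2" movement="$3" camera="$4" aesthetic="$5"
  printf '%s, %s, %s, [%s] %s\n' \
    "$subject" "$scene" "$movement" "$camera" "$aesthetic"
}

# Example call; the resulting string would be passed to --prompt.
build_video_prompt \
  "A golden retriever puppy" \
  "on a sun-dappled grass path in a park" \
  "runs toward the camera" \
  "Tracking shot" \
  "warm golden hour lighting, shallow depth of field"
```

Keeping the five parts as separate arguments makes it easy to check that none of them was left out before calling `generate_video.sh`.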
@@ -467,28 +447,34 @@ Before calling any video generation script, you MUST optimize the user's prompt # Text-to-video (default: 6s, 768P) bash scripts/video/generate_video.sh \ --mode t2v \ - --prompt "A golden retriever puppy bounds toward the camera on a sunlit grass path, [跟随] tracking shot, warm golden hour, shallow depth of field, joyful" \ + --prompt "A golden retriever puppy bounds toward the camera on a sunlit grass path, [Tracking shot] warm golden hour, shallow depth of field, joyful" \ --output minimax-output/puppy.mp4 +# Text-to-video with MiniMax-Hailuo-2.3-Fast +bash scripts/video/generate_video.sh \ + --mode t2v \ + --prompt "A golden retriever puppy bounds toward the camera" \ + --model MiniMax-Hailuo-2.3-Fast \ + --output minimax-output/puppy_fast.mp4 + # Image-to-video (prompt focuses on MOTION, not image content) bash scripts/video/generate_video.sh \ --mode i2v \ - --prompt "The petals begin to sway gently in the breeze, soft light shifts across the surface, [固定] fixed framing, dreamy pastel tones" \ + --prompt "The petals begin to sway gently in the breeze, soft light shifts across the surface, [Static shot] dreamy pastel tones" \ --first-frame photo.jpg \ --output minimax-output/animated.mp4 -# Start-end frame interpolation (sef mode uses MiniMax-Hailuo-02) +# Start-end frame interpolation (sef mode) bash scripts/video/generate_video.sh \ --mode sef \ --first-frame start.jpg --last-frame end.jpg \ --output minimax-output/transition.mp4 -# Subject reference (face consistency, ref mode uses S2V-01, 6s only) +# Subject reference (face consistency) bash scripts/video/generate_video.sh \ --mode ref \ - --prompt "A young woman in a white dress walks slowly through a sunlit garden, [跟随] smooth tracking, warm natural lighting, cinematic depth of field" \ + --prompt "A young woman in a white dress walks slowly through a sunlit garden, [Tracking shot] warm natural lighting, cinematic depth of field" \ --subject-image face.jpg \ - --duration 6 \ --output 
minimax-output/person.mp4 ``` @@ -513,17 +499,15 @@ Multi-scene long videos chain segments together: the first segment generates via # Example: 3-segment story with optimized per-segment prompts (default: 6s/segment, 768P) bash scripts/video/generate_long_video.sh \ --scenes \ - "A lone astronaut stands on a red desert planet surface, wind blowing dust particles, [推进] slow push in toward the visor, dramatic rim lighting, cinematic sci-fi atmosphere" \ - "The astronaut turns and begins walking toward a distant glowing structure on the horizon, dust swirling around boots, [跟随] tracking from behind, vast desolate landscape, golden light from the structure" \ - "The astronaut reaches the structure entrance, a massive doorway pulses with blue energy, [推进] slow push in toward the doorway, light reflects off the visor, awe-inspiring epic scale" \ + "A lone astronaut stands on a red desert planet surface, wind blowing dust particles, [Push in] slow push in toward the visor, dramatic rim lighting, cinematic sci-fi atmosphere" \ + "The astronaut turns and begins walking toward a distant glowing structure on the horizon, dust swirling around boots, [Tracking shot] vast desolate landscape, golden light from the structure" \ + "The astronaut reaches the structure entrance, a massive doorway pulses with blue energy, [Push in] slow push in toward the doorway, light reflects off the visor, awe-inspiring epic scale" \ --music-prompt "cinematic orchestral ambient, slow build, sci-fi atmosphere" \ --output minimax-output/long_video.mp4 # With custom settings bash scripts/video/generate_long_video.sh \ --scenes "Scene 1 prompt" "Scene 2 prompt" \ - --segment-duration 6 \ - --resolution 768P \ --crossfade 0.5 \ --music-prompt "calm ambient background music" \ --output minimax-output/long_video.mp4 @@ -553,10 +537,10 @@ bash scripts/video/generate_template_video.sh \ | Mode | Default Model | Default Duration | Default Resolution | Notes | 
|------|--------------|-----------------|-------------------|-------| -| t2v | MiniMax-Hailuo-2.3 | 6s | 768P | Latest text-to-video | -| i2v | MiniMax-Hailuo-2.3 | 6s | 768P | Latest image-to-video | -| sef | MiniMax-Hailuo-02 | 6s | 768P | Start-end frame | -| ref | S2V-01 | 6s | 720P | Subject reference, 6s only | +| t2v | MiniMax-Hailuo-2.3 | 6s | 768P | Default supported combo | +| i2v | MiniMax-Hailuo-2.3 | 6s | 768P | Default supported combo | +| sef | MiniMax-Hailuo-2.3 | 6s | 768P | Start-end frame mode | +| ref | MiniMax-Hailuo-2.3 | 6s | 768P | Subject reference mode | ## Media Tools (Audio/Video Processing) diff --git a/skills/minimax-multimodal-toolkit/references/tts-guide.md b/skills/minimax-multimodal-toolkit/references/tts-guide.md index 600ab1b..7b7fe95 100644 --- a/skills/minimax-multimodal-toolkit/references/tts-guide.md +++ b/skills/minimax-multimodal-toolkit/references/tts-guide.md @@ -105,7 +105,7 @@ python scripts/tts/generate_voice.py generate segments.json -o output.mp3 --cros - **Endpoint**: `POST /v1/t2a_v2` - **Base URL**: `https://api.minimaxi.com` - **Auth**: `Authorization: Bearer {MINIMAX_API_KEY}` -- **Models**: speech-2.8-hd (recommended), speech-2.8-turbo, speech-2.6-hd, speech-2.6-turbo, speech-02-hd, speech-02-turbo, speech-01-hd, speech-01-turbo +- **Models**: speech-2.8-hd (recommended), speech-2.8-turbo - **Text limit**: 10,000 characters per request - **Pause marker**: `<#x#>` where x is seconds (0.01–99.99) - **Interjection tags** (speech-2.8 only): `(laughs)`, `(chuckle)`, `(coughs)`, `(sighs)`, `(breath)`, etc. 
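The pause marker syntax listed above can be illustrated with a short helper. This is a minimal sketch: `pause_marker` is a hypothetical function, not part of `generate_voice.py`, and the range check simply mirrors the 0.01 to 99.99 second limits stated in the list.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the toolkit): print a pause marker for the
# given number of seconds, rejecting values outside 0.01-99.99.
pause_marker() {
  local secs="$1"
  # awk handles the decimal comparison; bash arithmetic is integer-only.
  if awk -v s="$secs" 'BEGIN { exit !(s ~ /^[0-9]+(\.[0-9]+)?$/ && s >= 0.01 && s <= 99.99) }'; then
    printf '<#%s#>' "$secs"
  else
    echo "Error: pause must be between 0.01 and 99.99 seconds" >&2
    return 1
  fi
}

# Example: insert a 1.5-second pause between two sentences.
text="Welcome back.$(pause_marker 1.5)Let us continue."
echo "$text"
```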
diff --git a/skills/minimax-multimodal-toolkit/references/tts-voice-catalog.md b/skills/minimax-multimodal-toolkit/references/tts-voice-catalog.md index b8650a2..1f63541 100644 --- a/skills/minimax-multimodal-toolkit/references/tts-voice-catalog.md +++ b/skills/minimax-multimodal-toolkit/references/tts-voice-catalog.md @@ -521,8 +521,6 @@ voice = VoiceSetting( | `disgusted` | Repulsed | All | | `surprised` | Astonished | All | | `calm` | Neutral tone | All | -| `fluent` | Natural, lively | speech-2.6 only | -| `whisper` | Soft, gentle | speech-2.6 only | --- diff --git a/skills/minimax-multimodal-toolkit/references/video-api.md b/skills/minimax-multimodal-toolkit/references/video-api.md index 7a51efc..02db866 100644 --- a/skills/minimax-multimodal-toolkit/references/video-api.md +++ b/skills/minimax-multimodal-toolkit/references/video-api.md @@ -20,31 +20,24 @@ ### Text-to-Video (T2V) Models | Model | Resolution | Duration | Notes | |-------|-----------|----------|-------| -| MiniMax-Hailuo-2.3 | 768P (default), 1080P | 6s (1080P), 6/10s (768P) | Recommended, latest | -| MiniMax-Hailuo-2.3-Fast | 768P (default), 1080P | 6s (1080P), 6/10s (768P) | Fast variant | -| MiniMax-Hailuo-02 | 512P, 768P (default), 1080P | 6s (1080P), 6/10s (512P/768P) | Previous gen | -| T2V-01-Director | 720P | 6s | Director control | -| T2V-01 | 720P | 6s | Base model | +| MiniMax-Hailuo-2.3-Fast | 768P | 6s | Fixed combo: 6s + 768P | +| MiniMax-Hailuo-2.3 | 768P | 6s | Fixed combo: 6s + 768P | ### Image-to-Video (I2V) Models | Model | Resolution | Duration | Notes | |-------|-----------|----------|-------| -| MiniMax-Hailuo-2.3 | 768P, 1080P | 6s | Recommended | -| MiniMax-Hailuo-2.3-Fast | 768P, 1080P | 6s | Fast variant | -| MiniMax-Hailuo-02 | 512P, 768P, 1080P | 6/10s | Previous gen | -| I2V-01-Director | 720P | 6s | Director control | -| I2V-01-live | 720P | 6s | Live photo style | -| I2V-01 | 720P | 6s | Base model | +| MiniMax-Hailuo-2.3-Fast | 768P | 6s | Fixed combo: 6s + 768P | 
+| MiniMax-Hailuo-2.3 | 768P | 6s | Fixed combo: 6s + 768P | ### Start-End Frame Model | Model | Notes | |-------|-------| -| MiniMax-Hailuo-02 | Only model supporting start-end frame | +| MiniMax-Hailuo-2.3 | Supports start-end frame mode | ### Subject Reference Model | Model | Notes | |-------|-------| -| S2V-01 | Face consistency across video | +| MiniMax-Hailuo-2.3 | Use supported duration+resolution combos | --- @@ -56,7 +49,7 @@ | model | string | Yes | - | Model name | | prompt | string | Depends | - | Video description, max 2000 chars | | duration | int | No | 6 | Video length in seconds | -| resolution | string | No | 768P/720P | Video resolution | +| resolution | string | No | 768P | Video resolution | | prompt_optimizer | bool | No | true | Auto-optimize prompt | | fast_pretreatment | bool | No | false | Shorten optimizer duration | | callback_url | string | No | - | Webhook URL | @@ -89,19 +82,21 @@ Each object has `type` and `image` (array of image URLs): ## Camera Instructions -Supported in `[指令]` syntax for Hailuo-2.3, Hailuo-02, and Director models: +Supported in `[command]` syntax for Hailuo-2.3 models: | Category | Instructions | |----------|-------------| -| Pan | `[左移]`, `[右移]` | -| Rotation | `[左摇]`, `[右摇]` | -| Push/Pull | `[推进]`, `[拉远]` | -| Elevation | `[上升]`, `[下降]` | -| Tilt | `[上摇]`, `[下摇]` | -| Zoom | `[变焦推近]`, `[变焦拉远]` | -| Other | `[晃动]`, `[跟随]`, `[固定]` | - -Combine for simultaneous: `[左摇,上升]` (max 3). Sequential: `...[推进], then ...[拉远]` +| Truck (lateral) | `[Truck left]`, `[Truck right]` | +| Pan (horizontal rotation) | `[Pan left]`, `[Pan right]` | +| Push/Pull (depth) | `[Push in]`, `[Pull out]` | +| Pedestal (vertical) | `[Pedestal up]`, `[Pedestal down]` | +| Tilt (vertical rotation) | `[Tilt up]`, `[Tilt down]` | +| Zoom (focal length) | `[Zoom in]`, `[Zoom out]` | +| Shake | `[Shake]` | +| Tracking | `[Tracking shot]` | +| Static | `[Static shot]` | + +Combine for simultaneous: `[Pan left,Pedestal up]` (max 3). 
Sequential: `...[Push in], then ...[Pull out]` --- diff --git a/skills/minimax-multimodal-toolkit/references/video-prompt-guide.md b/skills/minimax-multimodal-toolkit/references/video-prompt-guide.md index 3145757..7763cea 100644 --- a/skills/minimax-multimodal-toolkit/references/video-prompt-guide.md +++ b/skills/minimax-multimodal-toolkit/references/video-prompt-guide.md @@ -14,9 +14,9 @@ Examples: **Main subject + Scene + Movement + Camera motion + Aesthetic atmosphere** Examples: -- "A couple sits on a park bench, warm golden hour lighting, [固定] framing, intimate and romantic atmosphere" -- "A young man in a suit eats noodles at a street stall, [拉远] revealing the busy night market, warm tones, cinematic" -- "A dancer performs contemporary dance in an empty studio, [跟随] smooth tracking, dramatic side lighting" +- "A couple sits on a park bench, warm golden hour lighting, [Static shot] intimate and romantic atmosphere" +- "A young man in a suit eats noodles at a street stall, [Pull out] revealing the busy night market, warm tones, cinematic" +- "A dancer performs contemporary dance in an empty studio, [Tracking shot] smooth tracking, dramatic side lighting" --- @@ -32,13 +32,13 @@ Examples: ## Camera Instructions Usage ### Simultaneous Camera Movement -Place multiple instructions in one bracket: -- `[左摇,上升]` — pan left while rising -- `[推进,下摇]` — push in while tilting down +Place multiple instructions in one bracket (max 3): +- `[Pan left,Pedestal up]` — pan left while rising +- `[Push in,Tilt down]` — push in while tilting down ### Sequential Camera Movement Place instructions at different points in the prompt: -- "The camera starts with [推进] toward the face, then [拉远] to reveal the full scene" +- "The camera starts with [Push in] toward the face, then [Pull out] to reveal the full scene" --- @@ -75,8 +75,8 @@ Place instructions at different points in the prompt: ## Image-to-Video Prompt Tips Focus on **movement and change** since the image establishes the 
visual: -- Image of still lake → "Gentle ripples spread across the water surface, a breeze rustles the trees, [固定] fixed camera, peaceful" -- Image of portrait → "The person slowly smiles and turns their head, natural blinking, [推进] subtle push in, warm lighting" +- Image of still lake → "Gentle ripples spread across the water surface, a breeze rustles the trees, [Static shot] peaceful" +- Image of portrait → "The person slowly smiles and turns their head, natural blinking, [Push in] subtle push in, warm lighting" --- @@ -85,7 +85,7 @@ Focus on **movement and change** since the image establishes the visual: 1. **Subject**: Appearance, clothing, color, expression, posture 2. **Action**: 1-2 key temporal actions ("first...then...") 3. **Scene**: Setting with foreground + background + atmosphere -4. **Camera**: `[运镜指令]` for precise control +4. **Camera**: `[Camera command]` for precise control (e.g. `[Push in]`, `[Tracking shot]`, `[Pan left]`) 5. **Aesthetic**: Lighting, color, texture, cinematic quality ## Common Mistakes diff --git a/skills/minimax-multimodal-toolkit/scripts/image/generate_image.sh b/skills/minimax-multimodal-toolkit/scripts/image/generate_image.sh index 04782b9..369a514 100755 --- a/skills/minimax-multimodal-toolkit/scripts/image/generate_image.sh +++ b/skills/minimax-multimodal-toolkit/scripts/image/generate_image.sh @@ -44,7 +44,7 @@ image_to_data_url() { local mime mime="$(file -b --mime-type "$path" 2>/dev/null)" || mime="image/jpeg" local b64 - b64="$(base64 -w 0 < "$path")" + b64="$(base64 < "$path" | tr -d '\n')" echo "data:${mime};base64,${b64}" } @@ -57,78 +57,6 @@ resolve_image() { esac } -# ============================================================================ -# Payload builder — avoids command-line length limits on Windows -# Uses temp files for jq when the payload may contain large base64 data.
-# ============================================================================ - -# Build JSON payload, writing large fields (base64 image data) to temp files -# to avoid Windows cmd.exe argument-length limits (~32KB). -build_payload() { - local model="$1" prompt="$2" response_format="$3" n="$4" - local prompt_optimizer="$5" aigc_watermark="$6" - local aspect_ratio="$7" width="$8" height="$9" seed="${10:-}" - local ref_image="${11:-}" - - # Start with base payload using temp file to avoid long command lines - local base_tmp - base_tmp="$(mktemp)" - trap "rm -f '$base_tmp'" EXIT INT TERM HUP - - jq -n \ - --arg model "$model" \ - --arg prompt "$prompt" \ - --arg rf "$response_format" \ - --argjson n "$n" \ - --argjson po "$prompt_optimizer" \ - --argjson aw "$aigc_watermark" \ - '{model: $model, prompt: $prompt, response_format: $rf, n: $n, prompt_optimizer: $po, aigc_watermark: $aw}' \ - > "$base_tmp" - - # Add optional fields, each via temp file to stay within Windows arg limits - if [[ -n "$aspect_ratio" ]]; then - local tmp2; tmp2="$(mktemp)"; trap "rm -f '$base_tmp' '$tmp2'" EXIT INT TERM HUP - jq --arg ar "$aspect_ratio" '. + {aspect_ratio: $ar}' "$base_tmp" > "$tmp2" - mv "$tmp2" "$base_tmp" - fi - if [[ -n "$width" ]]; then - local tmp2; tmp2="$(mktemp)"; trap "rm -f '$base_tmp' '$tmp2'" EXIT INT TERM HUP - jq --argjson w "$width" '. + {width: $w}' "$base_tmp" > "$tmp2" - mv "$tmp2" "$base_tmp" - fi - if [[ -n "$height" ]]; then - local tmp2; tmp2="$(mktemp)"; trap "rm -f '$base_tmp' '$tmp2'" EXIT INT TERM HUP - jq --argjson h "$height" '. + {height: $h}' "$base_tmp" > "$tmp2" - mv "$tmp2" "$base_tmp" - fi - if [[ -n "$seed" ]]; then - local tmp2; tmp2="$(mktemp)"; trap "rm -f '$base_tmp' '$tmp2'" EXIT INT TERM HUP - jq --argjson s "$seed" '. 
+ {seed: $s}' "$base_tmp" > "$tmp2" - mv "$tmp2" "$base_tmp" - fi - - # Subject reference (i2i mode) — build via temp file to avoid huge command-line args - if [[ -n "$ref_image" ]]; then - local img_url - img_url="$(resolve_image "$ref_image")" - # Create temp files and set traps separately to avoid set -u issues - local ref_tmp; ref_tmp="$(mktemp)" - trap "rm -f '$base_tmp' '$ref_tmp'" EXIT INT TERM HUP - local url_tmp; url_tmp="$(mktemp)"; trap "rm -f '$base_tmp' '$ref_tmp' '$url_tmp'" EXIT INT TERM HUP - # Write URL to temp file to avoid long-argument issues, then build JSON - echo -n "$img_url" > "$url_tmp" - # Use jq -s to collect all lines (handles base64 with embedded newlines), take first element - jq -Rs 'split("\n")[0] | {type: "character", image_file: .}' "$url_tmp" > "$ref_tmp" - local tmp2; tmp2="$(mktemp)"; trap "rm -f '$base_tmp' '$ref_tmp' '$url_tmp' '$tmp2'" EXIT INT TERM HUP - jq --slurpfile ref "$ref_tmp" '. + {subject_reference: $ref}' "$base_tmp" > "$tmp2" - mv "$tmp2" "$base_tmp" - fi - - cat "$base_tmp" - rm -f "$base_tmp" - trap - EXIT INT TERM HUP -} - # ============================================================================ # Main # ============================================================================ @@ -179,7 +107,7 @@ Options: -n, --count N Number of images to generate (1-9, default: 1) --seed N Random seed for reproducibility --prompt-optimizer Enable automatic prompt optimization - --aigc-watermark Add AIGC watermark to generated images + --aigc-watermark Add AIGC watermark to generated images --ref-image FILE Character reference image (local file or URL, i2i mode) --response-format FMT Response format: url (default), base64 --no-download Don't download, just print URL(s) @@ -216,13 +144,31 @@ USAGE echo "Error: -n must be between 1 and 9" >&2; exit 1 fi - # Build payload using temp-file method (avoids Windows cmd.exe arg-length limit) + # Build payload local payload - payload=$(build_payload \ - "$model" "$prompt" 
"$response_format" "$n" \ - "$prompt_optimizer" "$aigc_watermark" \ - "$aspect_ratio" "$width" "$height" "$seed" \ - "$ref_image") + payload=$(jq -n \ + --arg model "$model" \ + --arg prompt "$prompt" \ + --arg rf "$response_format" \ + --argjson n "$n" \ + --argjson po "$prompt_optimizer" \ + --argjson aw "$aigc_watermark" \ + '{model: $model, prompt: $prompt, response_format: $rf, n: $n, prompt_optimizer: $po, aigc_watermark: $aw}') + + [[ -n "$aspect_ratio" ]] && payload=$(echo "$payload" | jq --arg ar "$aspect_ratio" '. + {aspect_ratio: $ar}') + [[ -n "$width" ]] && payload=$(echo "$payload" | jq --argjson w "$width" '. + {width: $w}') + [[ -n "$height" ]] && payload=$(echo "$payload" | jq --argjson h "$height" '. + {height: $h}') + [[ -n "$seed" ]] && payload=$(echo "$payload" | jq --argjson s "$seed" '. + {seed: $s}') + + # Subject reference (i2i mode) + if [[ "$mode" == "i2i" ]]; then + if [[ -z "$ref_image" ]]; then + echo "Error: --ref-image is required for i2i mode" >&2; exit 1 + fi + local img_url + img_url="$(resolve_image "$ref_image")" + payload=$(echo "$payload" | jq --arg img "$img_url" '. + {subject_reference: [{type: "character", image_file: $img}]}') + fi local api_host="${MINIMAX_API_HOST:-https://api.minimaxi.com}" local api_url="${api_host}/v1/image_generation" @@ -231,18 +177,13 @@ USAGE echo "Model: $model" echo "Generating $n image(s)..." 
-    # Write payload to temp file to avoid command-line length limits
-    local payload_tmp; payload_tmp="$(mktemp)"
-    trap "rm -f '$payload_tmp'" EXIT INT TERM HUP
-    echo -n "$payload" > "$payload_tmp"
-
     local raw_output http_code response
     raw_output="$(curl -s -w "\n%{http_code}" \
         -X POST "$api_url" \
         -H "Authorization: Bearer ${MINIMAX_API_KEY}" \
         -H "Content-Type: application/json" \
         --max-time 120 \
-        -d "@$payload_tmp" 2>/dev/null)" || {
+        -d "$payload" 2>/dev/null)" || {
         echo "Error: curl request failed" >&2
         exit 1
     }
@@ -262,7 +203,6 @@ USAGE
         local status_msg
         status_msg="$(echo "$response" | jq -r '.base_resp.status_msg // "Unknown error"')"
         echo "Error: API error (code $status_code): $status_msg" >&2
-        echo "Full response: $response" >&2
         exit 1
     fi

diff --git a/skills/minimax-multimodal-toolkit/scripts/video/generate_long_video.sh b/skills/minimax-multimodal-toolkit/scripts/video/generate_long_video.sh
index 9bc253e..42e2a68 100755
--- a/skills/minimax-multimodal-toolkit/scripts/video/generate_long_video.sh
+++ b/skills/minimax-multimodal-toolkit/scripts/video/generate_long_video.sh
@@ -54,6 +54,28 @@ check_api_key() {
     fi
 }

+validate_model_constraints() {
+    local model="$1" duration="$2" resolution="$3"
+    case "$model" in
+        MiniMax-Hailuo-2.3-Fast)
+            if [[ "$duration" != "6" || "$resolution" != "768P" ]]; then
+                echo "Error: MiniMax-Hailuo-2.3-Fast only supports duration=6 and resolution=768P." >&2
+                exit 1
+            fi
+            ;;
+        MiniMax-Hailuo-2.3)
+            if [[ "$duration" != "6" || "$resolution" != "768P" ]]; then
+                echo "Error: MiniMax-Hailuo-2.3 only supports duration=6 and resolution=768P." >&2
+                exit 1
+            fi
+            ;;
+        *)
+            echo "Error: Unsupported model '$model'. Supported models: MiniMax-Hailuo-2.3-Fast, MiniMax-Hailuo-2.3." >&2
+            exit 1
+            ;;
+    esac
+}
+
 image_to_data_url() {
     local path="$1"
     [[ -f "$path" ]] || { echo "Error: Image not found: $path" >&2; exit 1; }
@@ -300,7 +322,7 @@ main() {
     load_env
     check_api_key

-    local scenes=() model="" segment_duration=10 resolution="768P"
+    local scenes=() model="" segment_duration=6 resolution="768P"
     local first_frame="" subject_reference="" crossfade=0.5
     local music_prompt="" bgm_volume=0.3 fade_in=0 fade_out=0
     local output=""
@@ -334,8 +356,8 @@ Usage:
 Options:
   --scenes TEXT...          Scene prompts (2+ required)
   --model MODEL             Model name (default: auto)
-  --segment-duration SECS   Duration per segment (default: 10)
-  --resolution RES          Resolution: 768P, 1080P (default: 768P)
+  --segment-duration SECS   Duration per segment (default: 6)
+  --resolution RES          Resolution: 512P, 768P (default: 768P)
   --first-frame FILE        First frame for scene 1 (local file or URL)
   --subject-reference FILE  Subject reference image
   --crossfade SECS          Crossfade duration between scenes (default: 0.5)
@@ -362,6 +384,11 @@ USAGE
         echo "Error: --output / -o is required" >&2; exit 1
     fi

+    if [[ -z "$model" ]]; then
+        model="MiniMax-Hailuo-2.3"
+    fi
+    validate_model_constraints "$model" "$segment_duration" "$resolution"
+
     local output_dir
     output_dir="$(dirname "$output")"
     mkdir -p "$output_dir"
@@ -389,12 +416,6 @@ USAGE

         # Determine model
         local seg_model="$model"
-        if [[ -z "$seg_model" ]]; then
-            case "$seg_mode" in
-                t2v|i2v) seg_model="MiniMax-Hailuo-2.3" ;;
-                ref) seg_model="S2V-01" ;;
-            esac
-        fi

         # Build payload
         local payload
diff --git a/skills/minimax-multimodal-toolkit/scripts/video/generate_video.sh b/skills/minimax-multimodal-toolkit/scripts/video/generate_video.sh
index 51842d7..a52eef4 100755
--- a/skills/minimax-multimodal-toolkit/scripts/video/generate_video.sh
+++ b/skills/minimax-multimodal-toolkit/scripts/video/generate_video.sh
@@ -45,6 +45,28 @@ check_api_key() {
     fi
 }

+validate_model_constraints() {
+    local model="$1" duration="$2" resolution="$3"
+    case "$model" in
+        MiniMax-Hailuo-2.3-Fast)
+            if [[ "$duration" != "6" || "$resolution" != "768P" ]]; then
+                echo "Error: MiniMax-Hailuo-2.3-Fast only supports duration=6 and resolution=768P." >&2
+                exit 1
+            fi
+            ;;
+        MiniMax-Hailuo-2.3)
+            if [[ "$duration" != "6" || "$resolution" != "768P" ]]; then
+                echo "Error: MiniMax-Hailuo-2.3 only supports duration=6 and resolution=768P." >&2
+                exit 1
+            fi
+            ;;
+        *)
+            echo "Error: Unsupported model '$model'. Supported models: MiniMax-Hailuo-2.3-Fast, MiniMax-Hailuo-2.3." >&2
+            exit 1
+            ;;
+    esac
+}
+
 image_to_data_url() {
     local path="$1"
     [[ -f "$path" ]] || { echo "Error: Image not found: $path" >&2; exit 1; }
@@ -192,7 +214,7 @@ main() {
     load_env
     check_api_key

-    local mode="" prompt="" model="" duration=10 resolution="768P"
+    local mode="" prompt="" model="" duration=6 resolution="768P"
     local first_frame="" last_frame="" subject_image=""
     local prompt_optimizer="" fast_pretreatment="" callback_url="" aigc_watermark=""
     local output=""
@@ -228,7 +250,9 @@ Modes:
 Options:
   --mode MODE            Generation mode: t2v, i2v, sef, ref (required)
   --prompt TEXT          Text prompt describing the video
-  --model MODEL          Model name (default: T2V-01)
+  --model MODEL          Model name (default: MiniMax-Hailuo-2.3)
+  --duration SECONDS     Duration in seconds (must match model constraints)
+  --resolution RES       Resolution: 512P or 768P (must match model constraints)
   --first-frame FILE     First frame image (local file or URL)
   --last-frame FILE      Last frame image (local file or URL)
   --subject-image FILE   Subject reference image (local file or URL)
@@ -255,14 +279,11 @@ USAGE

     # Default model per mode
     if [[ -z "$model" ]]; then
-        case "$mode" in
-            t2v) model="MiniMax-Hailuo-2.3" ;;
-            i2v) model="MiniMax-Hailuo-2.3" ;;
-            sef) model="MiniMax-Hailuo-02" ;;
-            ref) model="S2V-01" ;;
-        esac
+        model="MiniMax-Hailuo-2.3"
     fi

+    validate_model_constraints "$model" "$duration" "$resolution"
+
     # Build payload
     local payload
     payload=$(jq -n --arg m "$model" '{model: $m}')