[Feature] Scene Continuation & Video Concatenation - Seamless Multi-Clip Assembly #71

@vzeman

Description

Overview

Modern AI video generation models — Kling, Wan 2.1, SVD, LTX, Runway, and others — have a hard architectural constraint: they can only produce short clips, typically 2–10 seconds per generation. This is not an edge case or a temporary limitation to work around. It is the fundamental reality of how these models operate today, and designing around it is the primary challenge of building a usable AI video editor.

To produce a 1-minute video, a user needs approximately 10–30 individual AI-generated clips. A 5-minute video requires 50–150 clips. The editor must make this feel seamless, not like a technical workaround. The clip-chaining system described in this issue is therefore not a Phase 4 polish feature — it is the central architectural pillar of the entire editor. Every other feature (generation, timeline, export) must be built with this multi-clip reality in mind from day one.


Core Concepts

Clip vs Scene vs Project

| Term | Definition |
| --- | --- |
| Clip | A single AI-generated video segment. Raw model output. Typically 2–10 seconds long, depending on the model. |
| Scene | A logical story unit — e.g., "hero walks into the building". May be composed of one or many sequential clips to reach the desired duration. |
| Project | A complete video, organized as an ordered sequence of scenes, each composed of clips. |

A scene with a target duration of 15 seconds using a model capped at 5 seconds per generation requires at minimum 3 clips chained end-to-end. The editor must make this assembly transparent and effortless.
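This arithmetic is simple but worth pinning down, since the planner, the auto-extend button, and the cost estimator all depend on it. A minimal sketch (the function name is illustrative, not part of any existing API):

```python
import math

def required_clip_count(target_seconds: float, model_max_seconds: float) -> int:
    """Minimum number of chained clips needed to cover a scene's target duration."""
    if model_max_seconds <= 0:
        raise ValueError("model max duration must be positive")
    return math.ceil(target_seconds / model_max_seconds)

# A 15 s scene on a model capped at 5 s per generation needs 3 clips.
print(required_clip_count(15, 5))  # → 3
```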

Clip Chain Architecture

Each scene maintains an internal clip chain: an ordered list of clips played sequentially.

```
Scene: "Hero enters the building"  (target: 15s)
┌──────────────┬──────────────┬──────────────┐
│   Clip A     │   Clip B     │   Clip C     │
│   0–5s       │   5–10s      │   10–15s     │
│ [thumbnail]  │ [thumbnail]  │ [thumbnail]  │
└──────────────┴──────────────┴──────────────┘
      ↓ last frame         ↓ last frame
   → first frame of B   → first frame of C
```

Key invariant: the last frame of Clip N automatically becomes the first frame of Clip N+1, maintaining visual continuity across the chain.


Required Features

Clip Chain UI

  • Within each scene card in the timeline, show a horizontal filmstrip of all clips in the chain
  • Each clip thumbnail shows: preview image, duration, model name, generation status (pending / generating / done / error)
  • "+" button at the end of the filmstrip: adds a new clip to the chain, automatically extracting and using the last frame of the previous clip as the image-to-video starting frame
  • Drag to reorder clips within a scene
  • Right-click / long-press context menu on any clip:
    • Regenerate (re-run generation with same or modified parameters)
    • Delete (with continuity options — see Edge Cases below)
    • Replace (swap in a different local video file)
    • Set as Scene Start (promote this clip's first frame as the scene's canonical start frame)
    • Extract Frame (open frame extractor tool at this clip)
    • View generation parameters
  • Collapse / expand the clip chain strip — default collapsed when scene has only 1 clip, expanded when 2+

Auto-Extension Workflow

When a user sets a target duration for a scene that exceeds what a single model generation can produce, the system should offer automatic clip chaining:

  • "Target duration" field per scene (e.g., 15 seconds)
  • "Auto-extend" button: the system calculates the required number of clips (ceil(target ÷ model_max_duration)), generates them sequentially, with each clip using the extracted last frame of the previous clip as the continuity seed
  • Progress indicator throughout generation: "Generating clip 3 of 5 for Scene 2..."
  • Early stop option: user can halt auto-extension at any point if the result already looks satisfactory
  • Auto-extension defaults to the same model and parameters as the first clip in the chain, with an option to override per-extension
  • After auto-extension completes, user reviews the assembled scene and can regenerate individual clips that did not come out well
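The workflow above can be sketched as a sequential loop. This is a minimal illustration only: `generate_clip` and `extract_last_frame` are hypothetical callables standing in for the real generation and FFmpeg extraction steps, and `should_stop` models the early-stop option.

```python
import math

def auto_extend(scene_target_s, model_max_s, generate_clip, extract_last_frame,
                should_stop=lambda i: False):
    """Generate the clips a scene needs, one at a time, seeding each generation
    with the last frame of the previous clip (callables are hypothetical)."""
    total = math.ceil(scene_target_s / model_max_s)
    clips, seed = [], None
    for i in range(total):
        if should_stop(i):          # early stop: user is already satisfied
            break
        clip = generate_clip(seed)  # image-to-video when seed is set, else text-to-video
        clips.append(clip)
        seed = extract_last_frame(clip)  # continuity seed for the next clip
    return clips
```

Because each iteration depends on the previous clip's extracted frame, generation is inherently sequential; clips in a chain cannot be generated in parallel.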

Cross-Scene Continuity

  • "Continue from previous scene" toggle per scene
  • When enabled: the first clip of the current scene is generated using the last frame of the last clip of the previous scene as its first-frame seed — creating a continuous visual flow across scene boundaries
  • Visual continuity indicator in the timeline: a link icon or connector line between adjacent scenes that have continuity enabled
  • Break continuity: explicitly set a new first frame image to start a scene fresh (a new location, a time-cut, etc.)
  • Continuity state is saved in the project data model per scene

Frame Extraction and Continuity Tools

  • Frame extractor modal: scrubber over the video clip, frame-accurate preview, "Use this frame as continuity seed" button
  • Auto-extract last frame: automatically pull the final frame of any clip for use as the next clip's seed (default behavior for "+")
  • Quality heuristic for auto-extraction: prefer frames without obvious motion blur or mid-blink artifacts (simple pixel variance or sharpness check)
  • All extracted frames are saved to the project asset library with a reference to the source clip and timestamp
  • Extracted frames are displayed in the Asset Library ([Feature] Asset Library - Centralized Media Management for Projects #73) under a "Continuity Frames" category
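The quality heuristic mentioned above can be as simple as scoring candidate frames by mean gradient magnitude, a crude sharpness proxy: blurry or low-detail (mid-blink) frames have weak local contrast. A stdlib-only sketch; a real implementation would likely use variance-of-Laplacian on the decoded frame, and all names here are illustrative:

```python
def sharpness_score(gray_pixels):
    """Mean horizontal gradient magnitude over a 2-D list of 0-255 grayscale
    values. Higher means sharper; motion blur flattens local differences."""
    diffs = [abs(row[x + 1] - row[x])
             for row in gray_pixels for x in range(len(row) - 1)]
    return sum(diffs) / len(diffs)

def pick_seed_frame(candidate_frames):
    """From the last few decoded frames of a clip, pick the sharpest one
    to use as the continuity seed."""
    return max(candidate_frames, key=sharpness_score)
```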

Transition Options at Clip Boundaries (within a scene)

| Transition | Description |
| --- | --- |
| Seamless cut | Default. No transition applied. |
| Crossfade | Overlap the end of Clip N with the start of Clip N+1; configurable duration 0.1s–2.0s. |
| Motion blur blend | Blend frames with increasing/decreasing blur at the boundary. |
| Match cut | User manually marks a matching frame in each clip; the editor aligns them. |

Transition Options at Scene Boundaries

| Transition | Description |
| --- | --- |
| Hard cut | Instantaneous scene change. |
| Fade to black | Clip fades out to black before the next scene begins. |
| Fade from black | Next scene fades in from black. |
| Fade to black + from black | Combined: clip fades to black, brief pause, next scene fades in. |
| Dissolve | Overlapping dissolve between the last clip of scene N and the first clip of scene N+1; configurable duration. |
| Custom transition clip | User uploads a short video file (e.g., a logo sting or abstract wipe) inserted between scenes. |

Long Video Planning Tools

Before the user starts generating clips, the editor should help them plan the full project:

  • Duration planner: user enters a target total video duration → the editor shows a breakdown: how many scenes, estimated clips per scene, total clip count, and approximate generation cost
  • Model duration reference card: visible in the generation panel, showing the max output duration per model (e.g., Kling 1.6 = 5s or 10s, Wan 2.1 = 4s, SVD-XT = ~3s at 25 frames, LTX-Video = variable)
  • Clip count estimate: shown per-scene and for the whole project before any generation starts
  • Cost estimate: based on total clip count and per-clip cost for the selected model, pulled from the cost tracking system ([Feature] Cost Tracking & API Key Management - Monitor Spending Across All AI Services #75)
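The duration planner's breakdown follows directly from the clip-count arithmetic. A minimal sketch, assuming a uniform scene length and a flat per-clip cost for the selected model (both simplifications; real scenes vary in length and cost comes from the cost tracking system):

```python
import math

def plan_project(total_seconds, scene_seconds, model_max_seconds, cost_per_clip):
    """Break a target duration into scenes and clips and estimate generation
    cost. Parameters and the flat per-clip cost are illustrative assumptions."""
    scenes = math.ceil(total_seconds / scene_seconds)
    clips_per_scene = math.ceil(scene_seconds / model_max_seconds)
    total_clips = scenes * clips_per_scene
    return {
        "scenes": scenes,
        "clips_per_scene": clips_per_scene,
        "total_clips": total_clips,
        "estimated_cost": total_clips * cost_per_clip,
    }

# 60 s video in 15 s scenes on a 5 s model: 4 scenes × 3 clips = 12 clips.
```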

Concatenation Engine

  • FFmpeg-based, fully local — all concatenation happens on the user's device, no re-upload to any server
  • Fast path (stream copy): when all clips in the chain share the same resolution, codec, and frame rate, use FFmpeg's concat demuxer for a lossless, near-instant join
  • Full re-encode path: activated when clips differ in resolution or FPS, or when crossfade/dissolve transitions are applied; uses libx264 with configurable quality settings
  • Intermediate concatenation: user can render a partial video (e.g., scenes 1–3 only) to review pacing and continuity before committing to generating the remaining scenes
  • Incremental re-render: when a single clip is changed (regenerated or replaced), only re-concatenate the affected scene and downstream segments — unchanged segments are served from cache
  • Audio stitching: audio tracks (voice narration, background music) are stitched alongside video, with configurable crossfade durations at scene boundaries
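The fast-path/re-encode decision reduces to checking whether all clips share identical stream parameters. A sketch of that dispatch, building an FFmpeg argument list; the clip metadata shape and the libx264 quality settings (`-crf 18 -preset medium`) are illustrative assumptions, not settled choices:

```python
def build_concat_command(clips, out_path, has_transitions=False):
    """Choose between the lossless concat-demuxer fast path and a full
    re-encode. `clips` is a list of dicts with stream metadata (hypothetical
    shape); filelist.txt is the concat demuxer's input list."""
    uniform = len({(c["width"], c["height"], c["codec"], c["fps"])
                   for c in clips}) == 1
    if uniform and not has_transitions:
        # Fast path: stream copy, no re-encode, near-instant and lossless.
        return ["ffmpeg", "-f", "concat", "-safe", "0",
                "-i", "filelist.txt", "-c", "copy", out_path]
    # Full re-encode path: normalizes mixed resolutions/frame rates and is
    # required whenever transitions need frames rewritten.
    return ["ffmpeg", "-f", "concat", "-safe", "0", "-i", "filelist.txt",
            "-c:v", "libx264", "-crf", "18", "-preset", "medium", out_path]
```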

Clip Versioning

  • Every AI generation attempt for a clip position in the chain is saved as a version — never silently overwritten
  • The user selects which version is "active" for the chain; only the active version is included in concatenation
  • Switching the active version of Clip N triggers a prompt: "The next clip uses this clip's last frame as its seed. Regenerate downstream clips, or keep them as-is?"
  • "Pin first frame" option per clip position: even if the user swaps active versions, the pinned last-frame extraction for continuity purposes does not change — useful for keeping a stable downstream chain while exploring different generations of one clip
  • Version list is accessible from the right-click context menu on any clip thumbnail
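The interaction between version switching and pinning comes down to which frame is handed downstream. A small sketch of that lookup; field names mirror the data model below but the dict shape is illustrative:

```python
def continuity_frame(clip):
    """Return the frame the next clip's generation should be seeded with.
    When the pin is set, switching the active version leaves downstream
    clips undisturbed (dict shape is a hypothetical stand-in for Clip)."""
    if clip["pinned_last_frame"]:
        return clip["pinned_frame_path"]
    active = next(v for v in clip["versions"]
                  if v["id"] == clip["active_version_id"])
    return active["last_frame_path"]
```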

Edge Cases to Handle

| Scenario | Handling |
| --- | --- |
| Clip generation fails mid-chain | Offer retry from the failed clip; all previous clips in the chain are preserved and do not need to be regenerated. |
| Model output resolution changes mid-chain | Warn the user; offer to auto-scale or crop the differing clip to match the chain's established resolution. |
| Narration audio track is longer than the assembled video | Options: extend the last clip via auto-extension, trim the audio, or pad with a freeze-frame. |
| User deletes a middle clip from the chain | Offer two options: (1) Re-link — regenerate a replacement clip seeded from the left neighbor's last frame, or (2) Leave gap — remove the clip and let the user re-stitch or regenerate continuity manually. |
| Two adjacent clips show a visible jump cut despite continuity | Surface the frame extractor and offer to regenerate the right-side clip with a refined first-frame seed. |
| First clip of the chain has no seed image (pure text-to-video) | Supported; continuity extraction begins from its last frame for subsequent clips. |

Technical Notes

Data Model

```dart
enum ClipStatus { pending, generating, done, error }

class Project {
  List<Scene> scenes;
}

class Scene {
  String id;
  String title;
  int targetDurationSeconds;
  bool continueFromPreviousScene;
  String? overrideFirstFramePath;
  List<Clip> clipChain;
}

class Clip {
  String id;
  int chainIndex;
  GenerationParams generationParams;
  ClipStatus status;
  String? localPath;       // active version's video file
  String? firstFramePath;  // continuity seed this clip was generated from
  String? lastFramePath;   // extracted frame that seeds the next clip
  List<ClipVersion> versions;
  String? activeVersionId;
  bool pinnedLastFrame;    // keep continuity frame stable across version switches
}

class ClipVersion {
  String id;
  DateTime generatedAt;
  String localPath;
  String lastFramePath;
  Map<String, dynamic> generationParams;
}
```

FFmpeg Integration

  • Concatenation: `ffmpeg -f concat -safe 0 -i filelist.txt -c copy output.mp4`
  • Frame extraction (last frame): `ffmpeg -sseof -1 -i input.mp4 -update 1 last_frame.png` — seeks to 1 s before the end and keeps overwriting the image until the final decoded frame (`-vframes 1` would grab the first frame after the seek point, not the last frame)
  • Frame extraction (specific timestamp): `ffmpeg -ss TIMESTAMP -i input.mp4 -vframes 1 frame.png`
  • Crossfade transition: the `xfade` filter with configurable duration and offset
  • All FFmpeg calls go through `ffmpeg_kit_flutter`

File Naming Convention

```
{projectId}/
  scenes/
    {sceneId}/
      clips/
        {clipIndex}_{versionId}.mp4
        {clipIndex}_{versionId}_last_frame.png
      concatenated_scene.mp4
  concatenated_partial_{sceneRange}.mp4
  final_export.mp4
```

Caching Strategy

  • Each concatenated scene output is cached by a hash of its clip chain (clip IDs + active version IDs)
  • On change to any clip in the chain, invalidate only that scene's cache and any downstream partial or final concatenations
  • Final export is re-generated only from changed scenes; unchanged scene segments are reused from cache
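The cache key described above can be a digest over the ordered (clip ID, active version ID) pairs, so any reorder, regeneration, or version switch changes the key. A minimal sketch (the dict shape is a hypothetical stand-in for the clip chain):

```python
import hashlib

def scene_cache_key(clip_chain):
    """Cache key for a concatenated scene: SHA-256 over ordered
    (clip id, active version id) pairs. Any change to membership, order,
    or active version produces a different key."""
    payload = "|".join(f"{c['id']}:{c['active_version_id']}" for c in clip_chain)
    return hashlib.sha256(payload.encode()).hexdigest()
```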

Acceptance Criteria

  • Clip chain UI renders inside each scene card with correct filmstrip layout
  • "+" button auto-extracts last frame and passes it as first-frame seed to the next generation
  • Auto-extension workflow generates N clips sequentially with correct continuity
  • Cross-scene continuity toggle correctly seeds the first clip of a scene from the last clip of the previous scene
  • Frame extractor modal allows manual frame selection and saves to asset library
  • Crossfade and dissolve transitions render correctly via FFmpeg xfade
  • Stream-copy fast path is used when all clips share resolution/codec/FPS
  • Incremental re-render skips re-concatenation of unchanged scene segments
  • Clip versioning stores all attempts and allows active version switching
  • Deleting a middle clip presents re-link and leave-gap options
  • Duration planner estimates clip count and generation cost before any API calls

Labels
ai-video-editor (AI Video Editor Flutter app) · feature (New feature implementation) · flutter (Flutter/Dart implementation) · phase-4 (Phase 4: Polish & Export)
