Find what's missing in your video dataset before you waste time collecting more of what you already have.
A FiftyOne plugin that identifies coverage gaps in video datasets using Twelve Labs Marengo embeddings, KMeans clustering, and Pegasus descriptions.
Most teams improve their video models by collecting more data. But more data doesn't help if it's more of the same. The CV4Smalls workshop series demonstrated that data curation beats model complexity — the winning EAR 2025 submission used a model architecture from 2019 and won by picking better training data. Before you collect another 10,000 videos, you should know what your dataset is actually missing.
- Embeds your video dataset into 512-d vectors using Twelve Labs Marengo 3.0
- Clusters videos by semantic similarity with auto-tuned KMeans
- Describes each cluster in plain English via Twelve Labs Pegasus 1.2
- Detects gaps — sparse clusters, isolated clusters, missing expected categories
- Visualizes everything in an interactive scatter plot inside the FiftyOne App
- Python 3.9–3.12
- A Twelve Labs API key (free tier provides 600 minutes of indexing)
# Clone the repository
git clone https://github.com/rishimule/video-content-gap-analyzer.git
cd video-content-gap-analyzer
# Install Python dependencies
pip install -r requirements.txt
# Register as a FiftyOne plugin
fiftyone plugins create video-content-gap-analyzer --from-dir .

Or install directly from GitHub:
fiftyone plugins download https://github.com/rishimule/video-content-gap-analyzer
pip install twelvelabs scikit-learn umap-learn numpy huggingface_hub

Set your Twelve Labs API key:

export TWELVELABS_API_KEY="your_key_here"

The fastest way to see the plugin in action:
python demo.py

This loads 50 videos from the Voxel51/Safe_and_Unsafe_Behaviours dataset, launches the FiftyOne App, and prints a step-by-step walkthrough in your terminal.
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub
# Load a dataset from HuggingFace Hub...
dataset = load_from_hub("Voxel51/Safe_and_Unsafe_Behaviours", max_samples=40)
# ...or load your own video dataset
# dataset = fo.Dataset.from_dir("/path/to/videos", fo.types.VideoDirectory)
session = fo.launch_app(dataset)

- Press ` in the FiftyOne App to open the operator menu
- Search for Analyze Coverage and select it
- Configure parameters:
  - num_clusters: 0 for auto-selection, or set a specific number
  - expected_categories: comma-separated list (e.g., person falling, forklift moving, fire evacuation)
  - use_pegasus: True for natural language labels, False for faster runs
  - max_samples: 0 to process all, or set a limit
- Click Execute and watch the 4-stage progress bar
Coverage Panel — Click the + tab in the App panel bar and select Coverage Map to see the interactive UMAP scatter plot with clusters and gap markers.
Gap Report — Press ` again, search for Show Gap Report, and run it to see the coverage score, sparse clusters, and missing categories.
Explore the data — Filter by the sparse_cluster tag, sort by centroid_distance to find outliers, or group by cluster_label to browse by topic.
Video Dataset
│
▼
┌─────────────────────┐
│ 1. Embed │ Twelve Labs Marengo 3.0 → 512-d vectors per video
└─────────┬───────────┘
▼
┌─────────────────────┐
│ 2. Cluster │ KMeans (auto-k via silhouette score) + UMAP 2D projection
└─────────┬───────────┘
▼
┌─────────────────────┐
│ 3. Describe │ Pegasus 1.2 generates plain-English label per cluster
└─────────┬───────────┘
▼
┌─────────────────────┐
│ 4. Detect Gaps │ Sparse clusters, isolated clusters, missing categories
└─────────┬───────────┘
▼
┌─────────────────────┐
│ 5. Visualize │ Interactive scatter plot + gap summary in FiftyOne App
└─────────────────────┘
Embedding — Each video is uploaded to Twelve Labs and embedded using Marengo 3.0 (visual + audio fusion). Already-embedded samples are skipped on re-runs.
Clustering — L2-normalized embeddings are clustered with KMeans. Set num_clusters=0 to auto-select k via silhouette scoring. Outliers are flagged when their centroid distance exceeds mean + 2σ. UMAP reduces to 2D for visualization.
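The clustering stage described above can be sketched as follows. This is an illustrative outline, not the plugin's actual source: the variable names and the k search range (2–8) are assumptions, but the mechanics match the description — L2-normalize, pick k by silhouette score, then flag samples farther than mean + 2σ from their centroid.

```python
# Sketch of auto-k KMeans with silhouette scoring and outlier flagging.
# Names and the k range are hypothetical; the plugin's internals may differ.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
embeddings = normalize(rng.normal(size=(60, 512)))  # stand-in for Marengo vectors

# Auto-select k: try a small range, keep the k with the best silhouette score
best_k, best_score = None, -1.0
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    score = silhouette_score(embeddings, labels)
    if score > best_score:
        best_k, best_score = k, score

km = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(embeddings)
labels = km.labels_

# Flag outliers: distance to own centroid exceeds mean + 2 * std
dists = np.linalg.norm(embeddings - km.cluster_centers_[labels], axis=1)
is_outlier = dists > dists.mean() + 2.0 * dists.std()
```

Raising `outlier_threshold` above 2.0 makes this check stricter (fewer outliers); lowering it flags more.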
Describing — The two samples closest to each cluster centroid are sent to Pegasus 1.2, which generates a one-sentence description. These become human-readable cluster labels.
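Selecting the two most central videos per cluster is a small argsort over centroid distances. A minimal sketch (hypothetical names, toy data in place of real embeddings):

```python
# Pick the two samples nearest each cluster centroid; these are the videos
# the plugin would send to Pegasus for a cluster description.
import numpy as np

rng = np.random.default_rng(1)
embeddings = rng.normal(size=(20, 8))   # stand-in embeddings
labels = rng.integers(0, 3, size=20)    # stand-in cluster assignments

representatives = {}
for c in np.unique(labels):
    idx = np.flatnonzero(labels == c)
    centroid = embeddings[idx].mean(axis=0)
    dists = np.linalg.norm(embeddings[idx] - centroid, axis=1)
    representatives[c] = idx[np.argsort(dists)[:2]]  # two closest samples
```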
Gap Detection produces three signals:
- Sparse clusters — clusters with fewer than 3 samples (underrepresented content)
- Isolated clusters — clusters whose centroids are far from all others (unusual content)
- Category gaps — if you provide expected categories (e.g., "person falling, forklift moving"), the plugin embeds them with Marengo and measures cosine similarity against your data. Low-similarity categories are flagged as gaps.
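The category-gap check above boils down to a cosine-similarity comparison. A hedged sketch, assuming the category text embeddings land in the same space as the video embeddings (random vectors stand in for real Marengo output):

```python
# For each expected category, find its best cosine match among the video
# embeddings; anything below gap_threshold is flagged as missing.
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(2)
video_embs = rng.normal(size=(30, 16))                  # stand-in video vectors
category_embs = {"forklift moving": rng.normal(size=16)}  # stand-in text vector

gap_threshold = 0.3
gaps = {}
for name, cat_emb in category_embs.items():
    best = max(cosine_sim(cat_emb, v) for v in video_embs)
    if best < gap_threshold:
        gaps[name] = best  # category flagged as missing from the dataset
```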
Coverage score — A 0–1 score based on how much of the UMAP embedding space your dataset actually occupies (10×10 grid occupancy).
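One way to read "10×10 grid occupancy" is the fraction of grid cells over the 2D UMAP projection that contain at least one sample. A minimal sketch under that interpretation (the plugin's exact normalization may differ):

```python
# Coverage score as grid occupancy: normalize the 2D points to the unit
# square, bucket them into a 10x10 grid, and count occupied cells.
import numpy as np

def coverage_score(xy, grid=10):
    xy = np.asarray(xy, dtype=float)
    lo, hi = xy.min(axis=0), xy.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid divide-by-zero on flat axes
    cells = np.minimum(((xy - lo) / span * grid).astype(int), grid - 1)
    occupied = {tuple(c) for c in cells}
    return len(occupied) / (grid * grid)
```

A dataset bunched into one corner of embedding space scores near 0.01; one spread across every cell scores 1.0.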
Gap Report
──────────────────────────────────
Coverage Score: 0.42
Sparse Clusters (underrepresented):
• Cluster 3: "Person climbing ladder in warehouse" — 2 samples
• Cluster 7: "Emergency vehicle arriving at scene" — 1 sample
Isolated Clusters (unusual content):
• Cluster 7: "Emergency vehicle arriving at scene" — mean distance 0.8231
Category Gaps (missing from dataset):
• "forklift moving" — best match: "Person walking in warehouse" (sim: 0.18)
• "fire evacuation" — best match: "Emergency vehicle arriving" (sim: 0.24)
| Parameter | Default | Description |
|---|---|---|
| num_clusters | 5 | Number of KMeans clusters. Set to 0 for auto-selection via silhouette score. |
| expected_categories | "" | Comma-separated categories to check for (e.g., person falling, forklift moving). |
| use_pegasus | True | Generate natural language cluster labels with Pegasus. Disable for faster runs. |
| max_samples | 0 | Max samples to process (0 = all). Useful for limiting API cost on large datasets. |
| outlier_threshold | 2.0 | Std deviations above mean centroid distance to flag outliers. Lower = more outliers. |
| gap_threshold | 0.3 | Cosine similarity below which a category is flagged as missing. Higher = more gaps. |
| Field | Type | Description |
|---|---|---|
| embedding | list[float] | 512-d Marengo video embedding |
| cluster_id | int | KMeans cluster assignment |
| cluster_label | str | Pegasus-generated cluster description |
| centroid_distance | float | Cosine distance to cluster centroid |
| is_outlier | bool | Whether distance exceeds threshold |
| umap_x, umap_y | float | 2D UMAP coordinates for visualization |
- FiftyOne — Dataset management and visualization
- Twelve Labs Marengo 3.0 — Multimodal video/text embeddings
- Twelve Labs Pegasus 1.2 — Video-to-text generation
- scikit-learn — KMeans clustering and silhouette scoring
- UMAP — Dimensionality reduction
- NumPy — Distance calculations
Built at the Video Understanding AI Hackathon at Northeastern University, April 3, 2026. Inspired by the CV4Smalls workshop series and the insight that understanding your data distribution matters more than throwing bigger models at incomplete datasets.