Video podcast editing and carousel generation pipeline.

On macOS (Apple Silicon), each step compares as follows:

| Step | Local (host) | Docker |
|---|---|---|
| Transcribe | Metal GPU — ~15 min | CPU only in Linux VM — 2–3 h |
| Keyframe optimize | VideoToolbox hardware encode (~1–2 min/hr) | libx264 software encode (slower) |
| Align | `--device mps` for Metal acceleration | CPU only |
| Remotion render | Native filesystem I/O | Virtualized filesystem adds overhead |
| Diarize | CPU (no MPS path in script) | CPU — same |
| Cut preview | libx264 | libx264 — same |
| Sync, edit, merge, camera | — | — same either way |
Recommendation: run transcribe, keyframe optimize, align, and Remotion render on the host. Docker is convenient for onboarding and Linux CI but gives up GPU and hardware encode on macOS.
On Windows with an NVIDIA GPU, the tradeoffs differ:

| Step | Local (host) | Docker |
|---|---|---|
| Transcribe | CUDA via whisper.cpp (if a CUDA build is available) | CPU only — current Dockerfile installs CPU-only PyTorch |
| Keyframe optimize | libx264 — h264_nvenc path not yet implemented | libx264 — same |
| Align | Auto-detects CUDA — runs on the GPU automatically when CUDA PyTorch is installed | CPU only (Dockerfile uses CPU-only PyTorch) |
| Diarize | CUDA PyTorch possible — no --device flag in the script yet | CPU only (same limitation) |
| Cut preview | libx264 — h264_nvenc path not yet implemented | libx264 — same |
| Remotion render | Native filesystem I/O | WSL2 virtualized filesystem adds some overhead |
| Sync, edit, merge, camera | — | — same either way |
Unlike macOS, Docker on Windows can pass CUDA through to containers via the NVIDIA Container Toolkit. The current Dockerfile installs CPU-only PyTorch, so it won't use the GPU as-is — the Dockerfile would need CUDA PyTorch wheels (`whl/cu128` for RTX 50-series / CUDA 12.8) to take advantage of this.
Recommendation: on the host, npm run align auto-detects CUDA and runs on the GPU with no extra flags — provided CUDA PyTorch is installed (see Python setup below). Keyframe optimize, cut preview, and diarize would need code changes to add NVENC/CUDA paths before they benefit from the GPU.
Required:
- Node.js v18+
- ffmpeg — `brew install ffmpeg` / `apt-get install ffmpeg` / ffmpeg.org
- Python 3.9–3.12 (use 3.12) — diarization + forced alignment

Verify:
```
ffmpeg -version && python3 --version && node --version
```

Docker: all dependencies are included in the image.
```
docker-compose run --rm --service-ports wizard
docker-compose run --rm app npm run remotion
```
The caption alignment test (port 3001) and camera GUI (port 3000) both work in Docker when using `--service-ports`.
Transcription runs best on the host (not in Docker). Docker on macOS runs in a Linux VM with no Metal or GPU passthrough — transcription falls back to CPU. Run `npm run transcribe` directly on the host to use Metal GPU acceleration on Apple Silicon (~15 min vs 2–3 h).

```
npm run video:wizard
```
Guides you interactively through every step. Transcription + diarization run in parallel automatically.
| # | Mode | Description |
|---|---|---|
| 1 | Separate video + audio (need sync) | Aligns audio to video before transcribing. Supports multiple camera angles. |
| 2 | Separate video + audio (in sync) | Skips sync, uses audio directly |
| 3 | Single video file | Extracts audio from video |
| 4 | Audio only | Transcription only, no video output |
Multi-angle (mode 1): When prompted "how many camera angles?", enter 2+. Place each additional angle's video in public/input/video/angle2/, angle3/, etc. Each is synced independently to the same audio and assigned to speakers in the camera GUI.
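For example, a layout with two extra angles might look like this (filenames are illustrative; only the angle2/, angle3/ subfolders are prescribed — the primary angle's input follows whichever mode you chose):
```
public/input/video/
├── angle2/
│   └── cam-b.mp4     # second camera angle (any filename)
└── angle3/
    └── cam-c.mp4     # third camera angle, if used
```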
Python 3.12 required for diarize and align.
Activate a virtual environment for the specific Python version:
```
py -3.12 -m venv .venv
.\.venv\Scripts\activate
```
Install the pip requirements:
```
python3 -m pip install --upgrade pip setuptools wheel   # or pip3
python3 -m pip install -r scripts/diarize/requirements.txt
python3 -m pip install whisperx faster-whisper
python3 -m pip install -r scripts/camera/requirements.txt
python3 -m pip install -r scripts/thumbnail/requirements.txt
```
Troubleshooting — `module 'coverage' has no attribute 'types'`: an outdated `coverage` package conflicts with rembg's dependencies. Fix with `pip3 install --upgrade coverage` (or `pip3 uninstall coverage` if you don't use it for testing). Background removal will silently fall back to a plain copy until this is resolved.
NVIDIA GPU (RTX 50-series / RTX 5050) — extra step:
The default `pip install torch` is CPU-only. Replace it with the CUDA 12.8 wheel so `npm run align` can use the GPU:
```
pip install torch --index-url https://download.pytorch.org/whl/cu128
```
RTX 5050 is a Blackwell GPU (SM 120) and requires PyTorch 2.7+. Verify after installing:
```
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```
Expected output: `2.7.x True NVIDIA GeForce RTX 5050`. If `cuda.is_available()` is False, the CUDA wheel wasn't installed — re-run the `pip install torch` command above.
If `python` resolves to `WindowsApps\python` (permission denied), pass the path explicitly:
```
npm run diarize -- --num-speakers 2 --python .venv\Scripts\python.exe
npm run align -- --python .venv\Scripts\python.exe
```

Single angle:
```
npm run sync
```
Output: public/sync/output/synced-output.mp4
Multi-angle (run via wizard, or call directly from a script using AudioSyncer.syncMultiple):
Outputs: public/sync/output/synced-output-1.mp4, synced-output-2.mp4, etc.
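If you want to script the multi-angle sync instead of using the wizard, a minimal Node sketch could look like the following. Only the AudioSyncer.syncMultiple name comes from the pipeline; the module path, constructor, and argument shape below are assumptions — check the sync script's actual exports before copying:
```js
// Hypothetical sketch — import path and call signature are assumptions, not documented API.
const { AudioSyncer } = require('./scripts/sync/audio-syncer'); // adjust to the real module location

async function syncAllAngles() {
  const syncer = new AudioSyncer();
  // Sync every camera angle against the shared audio track; the pipeline
  // writes synced-output-1.mp4, synced-output-2.mp4, ... as noted above.
  await syncer.syncMultiple({
    audio: 'public/input/audio/session.wav',   // illustrative path
    videos: [
      'public/input/video/angle1.mp4',         // illustrative paths
      'public/input/video/angle2/angle2.mp4',
    ],
    outputDir: 'public/sync/output',
  });
}

syncAllAngles().catch((err) => {
  console.error(err);
  process.exit(1);
});
```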
```
npm run transcribe
npm run transcribe -- --model small.en          # faster, less accurate
npm run transcribe -- --timestamp-offset 0.5
```
Timings are for a ~36 min episode:
| Model | Host — Metal GPU (M3) | Host — CPU / Docker | Accuracy |
|---|---|---|---|
| `tiny.en` | ~1–2 min | ~5 min | Low |
| `small.en` | ~5 min | ~20–30 min | Good |
| `medium.en` (default) | ~15 min | ~60–120 min | High |
Docker on macOS has no Metal passthrough — it always falls into the CPU column. Run transcription on the host to use Metal GPU acceleration.
Model downloaded automatically on first use, cached in whisper.cpp/.
Whisper timestamps can lag 0.3–0.6 s. Measure the offset:
```
cd public/transcribe && npx serve . -p 3001
```
- Open http://localhost:3001/caption_test.html
- Scrub to a word onset, enter the word — the page calculates the offset
- Repeat with a word 5+ min later; the page shows the fix command
The wizard runs this automatically and carries the offset through subsequent steps.
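If you run the steps manually instead, apply the measured value yourself via the `--timestamp-offset` flag — for example, with an illustrative offset of 0.45 s: `npm run transcribe -- --timestamp-offset 0.45` or `npm run merge-doc -- --timestamp-offset 0.45`.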
```
npm run diarize -- --num-speakers 2
```
Output: public/transcribe/output/raw/diarization.json
```
npm run assign-speakers
```
Labels each segment with the detected speaker in transcript.raw.json.
```
npm run align
npm run align -- --python .venv\Scripts\python.exe
```
Refines segments[].start/end and tokens[].t_dtw in transcript.raw.json. Populates tokens[].t_end (word-end boundaries), enabling exact cut boundaries.
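For orientation, an aligned segment might look roughly like this — the values and the token text field are illustrative; only the field names listed above come from the pipeline:
```json
{
  "segments": [
    {
      "start": 83.12,
      "end": 86.4,
      "tokens": [
        { "text": "context", "t_dtw": 84.95, "t_end": 85.31 },
        { "text": "windows", "t_dtw": 85.31, "t_end": 85.78 }
      ]
    }
  ]
}
```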
```
npm run edit-transcript
```
Merges phrases into sentences. Outputs:
- public/transcribe/output/edit/transcript.json
- public/transcribe/output/edit/transcript.doc.txt
Open transcript.doc.txt. Follow the instructions at the top:
- Rename speakers in the `SPEAKERS` section
- Retype words to correct them
- Wrap words in `{curly braces}` to cut them
- Add `CUT` after a segment number to remove the whole segment
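For example (segment numbers, speaker name, and wording are illustrative):
```
[12] Priya: And I think the {um} real issue is {you know} context windows.
[13] CUT
```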
```
npm run merge-doc
npm run merge-doc:cut-pauses                            # also remove silences > 0.5 s
npm run merge-doc:cut-pauses -- --pause-threshold 0.3
npm run merge-doc -- --timestamp-offset 0.5
```
Applies the doc edits back to transcript.json. Re-running resets any previous pause cuts or offset — always pass the flags you want.
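For instance, to keep a 0.5 s offset and also cut pauses at 0.3 s in the same run, pass both flags together (assuming the flags combine, which the note above implies): `npm run merge-doc:cut-pauses -- --pause-threshold 0.3 --timestamp-offset 0.5`.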
Simulates multi-camera by digitally cropping to the active speaker's face on a pacing schedule. Supports multiple physical camera angles: each angle uses a separate synced video file, and each speaker is assigned to an angle.
Install MediaPipe:
```
pip3 install mediapipe pillow
# or:
pip3 install -r scripts/camera/requirements.txt
```
Single angle:
```
npm run setup-camera
```
Multi-angle:
```
node scripts/camera/setup-camera.js --videos path/to/angle1.mp4 path/to/angle2.mp4
```
Or let the wizard handle it (recommended).
The script:
- Extracts a reference frame from each video at transcript.meta.videoStart
- Runs MediaPipe BlazeFace face detection per angle (offline after the first run — model cached at scripts/camera/blaze_face_short_range.tflite)
- Starts the Next.js dev server
Open http://localhost:3000/camera in your browser:
- Use angle tabs to switch between camera angles
- Assign each detected face box to a speaker
- Click Save profiles
Output: public/transcribe/output/camera/camera-profiles.json
Flags:
```
npm run setup-camera -- --skip-detect            # skip auto-detection, draw manually
npm run setup-camera -- --video path/to/v.mp4    # specify video explicitly
npm run setup-camera -- --python python3         # override Python binary
```
```
npm run remotion
```
Plays the full recording with all cuts applied. If camera-profiles.json exists, punch-in/punch-out cuts are applied automatically (including multi-angle switching). Remove or rename the file to disable camera cuts.
```
npm run cut-preview
```
Generates a flat MP4 for quick review outside Remotion.
The wizard handles the full editing flow interactively.
```
npm run video:wizard
# Docker:
docker-compose run --rm --service-ports wizard npm run video:wizard
```
Resume behaviour: the wizard detects existing work and picks up where you left off. Choose Resume, Jump to a specific step, or Start fresh.
1. Build the transcript doc
After transcription and speaker assignment complete, the wizard generates:
public/edit/transcript.doc.txt
This plain-text file represents every segment of the recording as a numbered line. The wizard opens it automatically.
2. Edit the doc
Each line looks like:
[42] Natasha: And I think the real issue is context windows.
Make edits directly in the file:
| What you want | How to write it |
|---|---|
| Cut a word or phrase | Wrap in {curly braces} — {um}, {you know} |
| Cut an entire segment | Add CUT after the segment number: [42] CUT |
| Fix a transcript error | Retype the word inline |
| Rename a speaker | Edit the SPEAKERS block at the top of the file |
Save the file, return to the terminal, press Enter.
3. Apply edits
The wizard runs merge-doc to bake your changes back into transcript.json:
```
# Optional flags you can pass when jumping to this step manually:
npm run transcript:merge
npm run transcript:merge:cut-pauses   # also auto-cut silences > 0.5 s
```
4. Camera setup (optional)
Sets up digitally-simulated punch-in/punch-out cuts to the speaking face. The wizard:
- Detects faces via MediaPipe
- Opens http://localhost:3000/camera — assign each face box to a speaker, click Save profiles
Output: public/camera/camera-profiles.json
5. Preview in Remotion Studio
```
npm run remotion
```
Open the ragTechVodcast composition. Scrub through the timeline to review all cuts and overlays.
6. Render
```
npm run shorts:render   # renders the final MP4 based on outName in transcript meta
```
Or use Remotion's built-in render button in the Studio.
Short-form clips are vertical (9:16) cuts derived from the longform recording. Each clip lives in public/shorts/<clip-id>/.
Prerequisite: the longform pipeline must have run and produced public/edit/transcript.json.
```
npm run shorts:wizard
# Docker:
docker-compose run --rm --service-ports wizard npm run shorts:wizard
```
The wizard offers two paths:
| Path | When to use |
|---|---|
| A — Clip from longform | You recorded landscape and want to cut a vertical clip from it |
| B — Dedicated portrait recording | You recorded directly in portrait (phone, vertical camera) |
1. Pick a clip ID
Give the clip a short slug, e.g. mediocrity. Output goes to public/shorts/mediocrity/.
2. Edit the clip doc
The wizard copies a cleaned version of the longform doc to public/shorts/<clip-id>/transcript.doc.txt and opens it. You define the clip by adding directives:
Define the clip range — required:
```
> START
[42] This is the first segment to include.
[55] This is the last segment to include.
> END
```
For a precise sub-segment start or end, use at=:
```
> START at="the real issue is"
[42] And I think the real issue is context windows.
> END at="context windows"
```
Add a hook — optional teaser that plays before the main clip:
```
# Whole segment as hook:
[38] This is a great hook line.
> HOOK

# Specific phrase as hook:
[38] This segment has a {great soundbite} in the middle.
> HOOK "great soundbite"
```
The hook section plays first, then the main clip from START to END.
Word-level cuts — same as longform:
```
[42] Remove these {um} {you know} filler words.
```
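Putting the directives together, a complete clip doc might look like this (segment numbers and wording are illustrative):
```
[38] This segment has a {great soundbite} in the middle.
> HOOK "great soundbite"

> START at="the real issue is"
[42] And I think the {um} real issue is context windows.
[55] This is the last segment to include.
> END
```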
Save the file, return to the terminal, press Enter.
3. Apply edits
The wizard runs shorts:merge-doc, which writes public/shorts/<clip-id>/transcript.json including:
- `meta.videoStart` / `meta.videoEnd` — the source time range
- `meta.hookTitle` — derived from the first `> HOOK` phrase
- `meta.outName` — auto-named `<source-filename>_<clip-id>.mp4`
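As an illustration, the resulting meta block might contain something like the following (values are made up; other fields are omitted):
```json
{
  "meta": {
    "videoStart": 512.4,
    "videoEnd": 571.9,
    "hookTitle": "great soundbite",
    "outName": "synced-output_mediocrity.mp4"
  }
}
```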
4. Portrait camera setup
Uses the existing longform camera profiles and re-maps them for the 9:16 frame. Opens http://localhost:3000/camera — review face positions and click Save profiles.
Output: public/shorts/<clip-id>/camera-profiles.json
5. Preview in Remotion Studio
```
npm run remotion
```
Select the ShortFormClip composition. The studio reads from public/shorts/mediocrity/ by default; pass ?shortId=<clip-id> in the URL to switch clips.
6. Render
```
npm run shorts:render -- --id <clip-id>
```
The output MP4 is written to the path stored in transcript.meta.outName.
Carousel generation:
```
npm run generate
npm run generate:bulk
```