Video podcast editing and carousel generation pipeline.

On macOS (Apple Silicon), each step compares as follows:

| Step | Local (host) | Docker |
|---|---|---|
| Transcribe | Metal GPU — ~15 min | CPU only in Linux VM — 2–3 h |
| Keyframe optimize | VideoToolbox hardware encode (~1–2 min/hr) | libx264 software encode (slower) |
| Align | `--device mps` for Metal acceleration | CPU only |
| Remotion render | Native filesystem I/O | Virtualized filesystem adds overhead |
| Diarize | CPU (no MPS path in script) | CPU — same |
| Cut preview | libx264 | libx264 — same |
| Sync, edit, merge, camera | — | — same either way |
Recommendation: run transcribe, keyframe optimize, align, and Remotion render on the host. Docker is convenient for onboarding and Linux CI but gives up GPU and hardware encode on macOS.
On Windows with an NVIDIA GPU, the tradeoffs differ:

| Step | Local (host) | Docker |
|---|---|---|
| Transcribe | CUDA via whisper.cpp (if a CUDA build is available) | CPU only — current Dockerfile installs CPU-only PyTorch |
| Keyframe optimize | libx264 — h264_nvenc path not yet implemented | libx264 — same |
| Align | Auto-detects CUDA — runs on the GPU automatically when CUDA PyTorch is installed | CPU only (Dockerfile uses CPU-only PyTorch) |
| Diarize | CUDA PyTorch possible — no --device flag in the script yet | CPU only (same limitation) |
| Cut preview | libx264 — h264_nvenc path not yet implemented | libx264 — same |
| Remotion render | Native filesystem I/O | WSL2 virtualized filesystem adds some overhead |
| Sync, edit, merge, camera | — | — same either way |
Unlike macOS, Docker on Windows can pass CUDA through to containers via the NVIDIA Container Toolkit. The current Dockerfile installs CPU-only PyTorch, so it won't use the GPU as-is — the Dockerfile would need CUDA PyTorch wheels (`whl/cu128` for RTX 50-series / CUDA 12.8) to take advantage of this.
Recommendation: on the host, npm run align auto-detects CUDA and runs on the GPU with no extra flags — provided CUDA PyTorch is installed (see Python setup below). Keyframe optimize, cut preview, and diarize would need code changes to add NVENC/CUDA paths before they benefit from the GPU.
Required:
- Node.js v18+
- ffmpeg — `brew install ffmpeg` / `apt-get install ffmpeg` / ffmpeg.org
- Python 3.9–3.12 (use 3.12) — diarization + forced alignment

Verify:
```
ffmpeg -version && python3 --version && node --version
```

Docker: all dependencies are included in the image.
```
docker-compose run --rm --service-ports wizard
docker-compose run --rm app npm run remotion
```
The caption alignment test (port 3001) and camera GUI (port 3000) both work in Docker when using `--service-ports`.
Transcription runs best on the host (not in Docker). Docker on macOS runs in a Linux VM with no Metal or GPU passthrough — transcription falls back to CPU. Run `npm run transcribe` directly on the host to use Metal GPU acceleration on Apple Silicon (~15 min vs 2–3 h).

```
npm run video:wizard
```
Guides you interactively through every step. Transcription + diarization run in parallel automatically.
| # | Mode | Description |
|---|---|---|
| 1 | Separate video + audio (need sync) | Aligns audio to video before transcribing. Supports multiple camera angles. |
| 2 | Separate video + audio (in sync) | Skips sync, uses audio directly |
| 3 | Single video file | Extracts audio from video |
| 4 | Audio only | Transcription only, no video output |
Multi-angle (mode 1): When prompted "how many camera angles?", enter 2+. Place each additional angle's video in public/input/video/angle2/, angle3/, etc. Each is synced independently to the same audio and assigned to speakers in the camera GUI.
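For example, a layout with two extra angles might look like this (filenames are illustrative; only the angle2/, angle3/ subfolders are prescribed — the primary angle's input follows whichever mode you chose):
```
public/input/video/
├── angle2/
│   └── cam-b.mp4     # second camera angle (any filename)
└── angle3/
    └── cam-c.mp4     # third camera angle, if used
```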
Python 3.12 required for diarize and align.
Activate a virtual environment for the specific Python version:
```
py -3.12 -m venv .venv
.\.venv\Scripts\activate
```
Install the pip requirements:
```
python3 -m pip install --upgrade pip setuptools wheel   # or pip3
python3 -m pip install -r scripts/diarize/requirements.txt
python3 -m pip install whisperx faster-whisper
python3 -m pip install -r scripts/camera/requirements.txt
python3 -m pip install -r scripts/thumbnail/requirements.txt
```
Troubleshooting — `module 'coverage' has no attribute 'types'`: an outdated `coverage` package conflicts with rembg's dependencies. Fix with `pip3 install --upgrade coverage` (or `pip3 uninstall coverage` if you don't use it for testing). Background removal will silently fall back to a plain copy until this is resolved.
NVIDIA GPU (RTX 50-series / RTX 5050) — extra step:
The default `pip install torch` is CPU-only. Replace it with the CUDA 12.8 wheel so `npm run align` can use the GPU:
```
pip install torch --index-url https://download.pytorch.org/whl/cu128
```
RTX 5050 is a Blackwell GPU (SM 120) and requires PyTorch 2.7+. Verify after installing:
```
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```
Expected output: `2.7.x True NVIDIA GeForce RTX 5050`. If `cuda.is_available()` is False, the CUDA wheel wasn't installed — re-run the `pip install torch` command above.
If `python` resolves to `WindowsApps\python` (permission denied), pass the path explicitly:
```
npm run diarize -- --num-speakers 2 --python .venv\Scripts\python.exe
npm run align -- --python .venv\Scripts\python.exe
```

Single angle:
```
npm run sync
```
Output: public/sync/output/synced-output.mp4
Multi-angle (run via wizard, or call directly from a script using AudioSyncer.syncMultiple):
Outputs: public/sync/output/synced-output-1.mp4, synced-output-2.mp4, etc.
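If you want to script the multi-angle sync instead of using the wizard, a minimal Node sketch could look like the following. Only the AudioSyncer.syncMultiple name comes from the pipeline; the module path, constructor, and argument shape below are assumptions — check the sync script's actual exports before copying:
```js
// Hypothetical sketch — import path and call signature are assumptions, not documented API.
const { AudioSyncer } = require('./scripts/sync/audio-syncer'); // adjust to the real module location

async function syncAllAngles() {
  const syncer = new AudioSyncer();
  // Sync every camera angle against the shared audio track; the pipeline
  // writes synced-output-1.mp4, synced-output-2.mp4, ... as noted above.
  await syncer.syncMultiple({
    audio: 'public/input/audio/session.wav',   // illustrative path
    videos: [
      'public/input/video/angle1.mp4',         // illustrative paths
      'public/input/video/angle2/angle2.mp4',
    ],
    outputDir: 'public/sync/output',
  });
}

syncAllAngles().catch((err) => {
  console.error(err);
  process.exit(1);
});
```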
```
npm run transcribe
npm run transcribe -- --model small.en          # faster, less accurate
npm run transcribe -- --timestamp-offset 0.5
```
Timings are for a ~36 min episode:
| Model | Host — Metal GPU (M3) | Host — CPU / Docker | Accuracy |
|---|---|---|---|
| `tiny.en` | ~1–2 min | ~5 min | Low |
| `small.en` | ~5 min | ~20–30 min | Good |
| `medium.en` (default) | ~15 min | ~60–120 min | High |
Docker on macOS has no Metal passthrough — it always falls into the CPU column. Run transcription on the host to use Metal GPU acceleration.
Model downloaded automatically on first use, cached in whisper.cpp/.
Whisper timestamps can lag 0.3–0.6 s. Measure the offset:
```
cd public/transcribe && npx serve . -p 3001
```
- Open http://localhost:3001/caption_test.html
- Scrub to a word onset, enter the word — the page calculates the offset
- Repeat with a word 5+ min later; the page shows the fix command
The wizard runs this automatically and carries the offset through subsequent steps.
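If you run the steps manually instead, apply the measured value yourself via the `--timestamp-offset` flag — for example, with an illustrative offset of 0.45 s: `npm run transcribe -- --timestamp-offset 0.45` or `npm run merge-doc -- --timestamp-offset 0.45`.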
```
npm run diarize -- --num-speakers 2
```
Output: public/transcribe/output/raw/diarization.json
```
npm run assign-speakers
```
Labels each segment with the detected speaker in transcript.raw.json.
```
npm run align
npm run align -- --python .venv\Scripts\python.exe
```
Refines segments[].start/end and tokens[].t_dtw in transcript.raw.json. Populates tokens[].t_end (word-end boundaries), enabling exact cut boundaries.
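For orientation, an aligned segment might look roughly like this — the values and the token text field are illustrative; only the field names listed above come from the pipeline:
```json
{
  "segments": [
    {
      "start": 83.12,
      "end": 86.4,
      "tokens": [
        { "text": "context", "t_dtw": 84.95, "t_end": 85.31 },
        { "text": "windows", "t_dtw": 85.31, "t_end": 85.78 }
      ]
    }
  ]
}
```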
```
npm run edit-transcript
```
Merges phrases into sentences. Outputs:
- public/transcribe/output/edit/transcript.json
- public/transcribe/output/edit/transcript.doc.txt
Open transcript.doc.txt. Follow the instructions at the top:
- Rename speakers in the `SPEAKERS` section
- Retype words to correct them
- Wrap words in `{curly braces}` to cut them
- Add `CUT` after a segment number to remove the whole segment
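For example (segment numbers, speaker name, and wording are illustrative):
```
[12] Priya: And I think the {um} real issue is {you know} context windows.
[13] CUT
```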
```
npm run merge-doc
npm run merge-doc:cut-pauses                            # also remove silences > 0.5 s
npm run merge-doc:cut-pauses -- --pause-threshold 0.3
npm run merge-doc -- --timestamp-offset 0.5
```
Applies the doc edits back to transcript.json. Re-running resets any previous pause cuts or offset — always pass the flags you want.
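For instance, to keep a 0.5 s offset and also cut pauses at 0.3 s in the same run, pass both flags together (assuming the flags combine, which the note above implies): `npm run merge-doc:cut-pauses -- --pause-threshold 0.3 --timestamp-offset 0.5`.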
Simulates multi-camera by digitally cropping to the active speaker's face on a pacing schedule. Supports multiple physical camera angles: each angle uses a separate synced video file, and each speaker is assigned to an angle.
Install MediaPipe:
```
pip3 install mediapipe pillow
# or:
pip3 install -r scripts/camera/requirements.txt
```
Single angle:
```
npm run setup-camera
```
Multi-angle:
```
node scripts/camera/setup-camera.js --videos path/to/angle1.mp4 path/to/angle2.mp4
```
Or let the wizard handle it (recommended).
The script:
- Extracts a reference frame from each video at transcript.meta.videoStart
- Runs MediaPipe BlazeFace face detection per angle (offline after the first run — model cached at scripts/camera/blaze_face_short_range.tflite)
- Starts the Next.js dev server
Open http://localhost:3000/camera in your browser:
- Use angle tabs to switch between camera angles
- Assign each detected face box to a speaker
- Click Save profiles
Output: public/transcribe/output/camera/camera-profiles.json
Flags:
```
npm run setup-camera -- --skip-detect            # skip auto-detection, draw manually
npm run setup-camera -- --video path/to/v.mp4    # specify video explicitly
npm run setup-camera -- --python python3         # override Python binary
```
```
npm run remotion
```
Plays the full recording with all cuts applied. If camera-profiles.json exists, punch-in/punch-out cuts are applied automatically (including multi-angle switching). Remove or rename the file to disable camera cuts.
```
npm run cut-preview
```
Generates a flat MP4 for quick review outside Remotion.
The wizard handles the full editing flow interactively.
```
npm run video:wizard
# Docker:
docker-compose run --rm --service-ports wizard npm run video:wizard
```
Resume behaviour: the wizard detects existing work and picks up where you left off. Choose Resume, Jump to a specific step, or Start fresh.
1. Build the transcript doc
After transcription and speaker assignment complete, the wizard generates:
public/edit/transcript.doc.txt
This plain-text file represents every segment of the recording as a numbered line. The wizard opens it automatically.
2. Edit the doc
Each line looks like:
[42] Natasha: And I think the real issue is context windows.
Make edits directly in the file:
| What you want | How to write it |
|---|---|
| Cut a word or phrase | Wrap in {curly braces} — {um}, {you know} |
| Cut an entire segment | Add CUT after the segment number: [42] CUT |
| Fix a transcript error | Retype the word inline |
| Rename a speaker | Edit the SPEAKERS block at the top of the file |
Save the file, return to the terminal, press Enter.
3. Apply edits
The wizard runs merge-doc to bake your changes back into transcript.json:
```
# Optional flags you can pass when jumping to this step manually:
npm run transcript:merge
npm run transcript:merge:cut-pauses   # also auto-cut silences > 0.5 s
```
4. Camera setup (optional)
Sets up digitally-simulated punch-in/punch-out cuts to the speaking face. The wizard:
- Detects faces via MediaPipe
- Opens http://localhost:3000/camera — assign each face box to a speaker, click Save profiles
Output: public/camera/camera-profiles.json
5. Preview in Remotion Studio
```
npm run remotion
```
Open the ragTechVodcast composition. Scrub through the timeline to review all cuts and overlays.
6. Render
```
npm run shorts:render   # renders the final MP4 based on outName in transcript meta
```
Or use Remotion's built-in render button in the Studio.
Short-form clips are vertical (9:16) cuts derived from the longform recording. Each clip lives in public/shorts/<clip-id>/.
Prerequisite: the longform pipeline must have run and produced public/edit/transcript.json.
```
npm run shorts:wizard
# Docker:
docker-compose run --rm --service-ports wizard npm run shorts:wizard
```
The wizard offers two paths:
| Path | When to use |
|---|---|
| A — Clip from longform | You recorded landscape and want to cut a vertical clip from it |
| B — Dedicated portrait recording | You recorded directly in portrait (phone, vertical camera) |
1. Pick a clip ID
Give the clip a short slug, e.g. mediocrity. Output goes to public/shorts/mediocrity/.
2. Edit the clip doc
The wizard copies a cleaned version of the longform doc to public/shorts/<clip-id>/transcript.doc.txt and opens it. You define the clip by adding directives:
Define the clip range — required:
```
> START
[42] This is the first segment to include.
[55] This is the last segment to include.
> END
```
For a precise sub-segment start or end, use at=:
```
> START at="the real issue is"
[42] And I think the real issue is context windows.
> END at="context windows"
```
Add a hook — optional teaser that plays before the main clip:
```
# Whole segment as hook:
[38] This is a great hook line.
> HOOK

# Specific phrase as hook:
[38] This segment has a {great soundbite} in the middle.
> HOOK "great soundbite"
```
The hook section plays first, then the main clip from START to END.
Word-level cuts — same as longform:
```
[42] Remove these {um} {you know} filler words.
```
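Putting the directives together, a complete clip doc might look like this (segment numbers and wording are illustrative):
```
[38] This segment has a {great soundbite} in the middle.
> HOOK "great soundbite"

> START at="the real issue is"
[42] And I think the {um} real issue is context windows.
[55] This is the last segment to include.
> END
```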
Save the file, return to the terminal, press Enter.
3. Apply edits
The wizard runs shorts:merge-doc, which writes public/shorts/<clip-id>/transcript.json including:
- `meta.videoStart` / `meta.videoEnd` — the source time range
- `meta.hookTitle` — derived from the first `> HOOK` phrase
- `meta.outName` — auto-named `<source-filename>_<clip-id>.mp4`
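As an illustration, the resulting meta block might contain something like the following (values are made up; other fields are omitted):
```json
{
  "meta": {
    "videoStart": 512.4,
    "videoEnd": 571.9,
    "hookTitle": "great soundbite",
    "outName": "synced-output_mediocrity.mp4"
  }
}
```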
4. Portrait camera setup
Uses the existing longform camera profiles and re-maps them for the 9:16 frame. Opens http://localhost:3000/camera — review face positions and click Save profiles.
Output: public/shorts/<clip-id>/camera-profiles.json
5. Preview in Remotion Studio
```
npm run remotion
```
Select the ShortFormClip composition. The studio reads from public/shorts/mediocrity/ by default; pass ?shortId=<clip-id> in the URL to switch clips.
6. Render
```
npm run shorts:render -- --id <clip-id>
```
The output MP4 is written to the path stored in transcript.meta.outName.
Carousel generation:
```
npm run generate
npm run generate:bulk
```