This is a Python tool for quickly and simply transcribing a video or audio file into an SRT file using local models on Windows.
TEN VAD is used to aggressively detect speech segments and their timestamps. A custom-tuned ASR model then transcribes each speech segment into text.
That's it. No cloud services needed.
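To illustrate the last step of that pipeline, here is a minimal sketch of how timestamped segments can be rendered as SRT text. It is not taken from this repo: the `Segment` class and the `srt_timestamp` and `to_srt` helpers are hypothetical names chosen for the example, and it assumes the VAD stage yields start/end times in seconds and the ASR stage yields one string per segment.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical container: one speech segment with times in seconds."""
    start: float
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(cues)
```

Calling `to_srt([Segment(0.0, 1.25, "Hello"), Segment(2.0, 3.5, "world")])` produces two cues, the first spanning `00:00:00,000 --> 00:00:01,250`. The actual tool may order, merge, or post-process segments differently.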
- Clone the repo and install the dependencies.
- Download the latest llama.cpp Vulkan binaries and place the contents in the `src/binaries/llama` folder.
- Download my tuned ASR models here and place them in the `src/binaries` folder.
- Run with e.g.: `python cli.py "YOUR_SAMPLE.wav"`
- After some time, the processed SRT file will show up in the executable folder as `YOUR_SAMPLE.srt`
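If you have several files to transcribe, the run step can be wrapped in a small script. The sketch below is an assumption-laden example, not part of the repo: `build_command` and `transcribe_all` are hypothetical helpers, the list of file extensions is a guess (only `.wav` appears in the example above), and it assumes `cli.py` takes the media path as its single positional argument and is run from the repo folder.

```python
import subprocess
import sys
from pathlib import Path

# Assumption: which extensions the tool accepts is not documented here.
MEDIA_SUFFIXES = {".wav", ".mp3", ".mp4", ".mkv"}


def build_command(media_path: Path) -> list[str]:
    """Build the same command as the README example for one file."""
    return [sys.executable, "cli.py", str(media_path)]


def transcribe_all(folder: Path, run=subprocess.run) -> list[Path]:
    """Run cli.py once per media file in folder, in sorted order.

    `run` is injectable so the loop can be dry-run without launching anything.
    """
    processed = []
    for path in sorted(folder.iterdir()):
        if path.suffix.lower() in MEDIA_SUFFIXES:
            run(build_command(path), check=True)  # raises if cli.py fails
            processed.append(path)
    return processed
```

For example, `transcribe_all(Path("recordings"))` would call `cli.py` for each matching file in a `recordings` folder, one after another, which keeps only one model instance in VRAM at a time.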
Expect to need 4 GB of VRAM for fast results. Note that the ASR model could still be tuned further.