Topics covered in the course:
- Digital Signal Processing
- Automatic Speech Recognition (ASR)
- Keyword spotting (KWS)
- Text-to-Speech (TTS)
- Voice Conversion
- Self-supervised learning in audio
- Codec models
- LLM-based Audio Generation
- Music & Audio Generation
- Speaker verification

| # | Date | Description | Slides |
|---|---|---|---|
| 1 | Lecture: | Introduction and Digital Signal Processing | slides |
| | Seminar: | Introduction and Spectrograms, Griffin-Lim Algorithm | notebook |
| 2 | Lecture: | Automatic Speech Recognition 1: Introduction, WER, Datasets, CTC, LAS | slides |
| | Seminar: | WER, Levenshtein distance, CTC | notebook |
| 3 | Lecture: | Automatic Speech Recognition 2: RNN-T, Language models in ASR, BPE, Whisper | slides |
| | Seminar: | Automatic Speech Recognition 2: RNN-T, Whisper | notebook |
| 4 | Lecture: | Keyword spotting (KWS) | slides |
| | Seminar: | Keyword spotting | notebook |
| 5 | Lecture: | Text-to-speech 1: WaveNet, Tacotron, FastSpeech, Guided Attention | slides |
| | Seminar: | Text-to-speech: Tacotron2 | notebook |
| 6 | Lecture: | Text-to-speech 2: Neural Vocoders (PWGAN, DiffWave, Glow-TTS, HiFi-GAN, VITS) | slides |
| | Seminar: | WaveNet | notebook |
| 7 | Lecture: | Voice Conversion: CycleGAN-VC, StarGAN-VC, AutoVC, Singing Voice Conversion | slides |
| | Seminar: | VAE WaveNet Vocoder, Normalizing Flow | notebook |
| 8 | Lecture: | Self-supervised learning in Audio (wav2vec2, GigaAM, HuBERT, BEST-RQ) | slides |
| | Seminar: | HiFi-GAN | notebook |
| 9 | Lecture: | Speaker verification and identification | slides |
| | Seminar: | Speaker verification, Angular Softmax, Margin Softmax | notebook |
| 10 | Lecture: | Codec Models (RVQ, SoundStream, Encodec, Mimi), VQ-VAE, VALL-E | slides |
| | Seminar: | Encodec, SoundStream, Residual Vector Quantization | notebook |
| 11 | Lecture: | LLM-based audio models: SEED-ASR, Llama3, Phi4, SpeechGPT, Mini-Omni, Llama-Omni, Moshi | slides |
| | Seminar: | VITS, Normalizing flows | notebook |
| 12 | Lecture: | Audio & Music Generation: Jukebox, Diffusion models (Diffsound, Riffusion, Noise2Music), AudioLM & MusicLM, AudioGen & MusicGen, MeLoDy, YuE, Music Agents | slides |
| | Seminar: | | |
| TBD | Lecture: | FishSpeech, XTTS, SpearTTS, MQTTS | |

Grading:
- 5 homework assignments, 2 points each = 10 points
- Final test = 1 point
- Bonus points are available in the homework assignments
- Maximum score: 10 (homework) + 1 (test) + homework bonus points = 11 points + bonus points

Pavel Severilov
- telegram: @severilov
- e-mail: pseverilov@gmail.com
- Bio:
  - Education: MIPT
  - Experience: AI assistants (NLP, ASR, OCR), signals (Samokat+Kuper, Domclick, Dbrain, Gazpromneft, MIL-team)
  - Lecturer: AI Masters, MIPT, ex-Deep Learning School

Daniel Knyazev
- telegram: @daniel_knyazev
- e-mail: xmaximuskn@gmail.com
- Bio:
  - Education: MIPT
  - Experience: xlabs-ai, Sberdevices

Roman Vlasov
- telegram: @roman_studentin
- e-mail: vlasovroman2017@gmail.com
- Bio:
  - Education: MIPT, AI Masters
  - Experience: Computer Vision (Yandex), LLM NLP & TTS (SberDevices), LLM in e2e speech understanding+synthesis
