Deep Learning for Audio Course, 2026

Description

Topics covered in the course:

Digital Signal Processing
Automatic Speech Recognition (ASR)
Keyword spotting (KWS)
Text-to-Speech (TTS)
Voice Conversion
Self-supervised learning in audio
Codec models
LLM-based Audio Generation
Music & Audio Generation
Speaker verification

Course materials

Materials

#	Type	Description	Materials
1	Lecture:	Introduction and Digital Signal Processing	slides
	Seminar:	Introduction, spectrograms, Griffin-Lim algorithm	notebook
2	Lecture:	Automatic Speech Recognition 1: introduction, WER, Datasets, CTC, LAS	slides
	Seminar:	WER, Levenshtein distance, CTC	notebook
3	Lecture:	Automatic Speech Recognition 2: RNN-T, Language models in ASR, BPE, Whisper	slides
	Seminar:	Automatic Speech Recognition 2: RNN-T, Whisper	notebook
4	Lecture:	Keyword spotting (KWS)	slides
	Seminar:	Keyword spotting	notebook
5	Lecture:	Text-to-speech 1: WaveNet, Tacotron, FastSpeech, Guided Attention	slides
	Seminar:	Text-to-speech: Tacotron2	notebook
6	Lecture:	Text-to-speech 2: Neural Vocoders (PWGAN, DiffWave, Glow-TTS, HiFi-GAN, VITS)	slides
	Seminar:	WaveNet	notebook
7	Lecture:	Voice Conversion: CycleGAN-VC, StarGAN-VC, AutoVC, Singing Voice Conversion	slides
	Seminar:	VAE WaveNet Vocoder, Normalizing Flows	notebook
8	Lecture:	Self-supervised learning in Audio (wav2vec2, GigaAM, HuBERT, BEST-RQ)	slides
	Seminar:	HiFi-GAN	notebook
9	Lecture:	Speaker verification and identification	slides
	Seminar:	Speaker verification, Angular Softmax, Margin Softmax	notebook
10	Lecture:	Codec Models (RVQ, SoundStream, EnCodec, Mimi), VQ-VAE, VALL-E	slides
	Seminar:	EnCodec, SoundStream, Residual Vector Quantization	notebook
11	Lecture:	LLM-based audio models: Seed-ASR, Llama 3, Phi-4, SpeechGPT, Mini-Omni, LLaMA-Omni, Moshi	slides
	Seminar:	VITS, Normalizing flows	notebook
12	Lecture:	Audio & Music Generation: Jukebox, Diffusion models (Diffsound, Riffusion, Noise2Music), AudioLM & MusicLM, AudioGen & MusicGen, MeLoDy, YuE, Music Agents	slides
	Seminar:
TBD	Lecture:	FishSpeech, XTTS, SPEAR-TTS, MQTTS
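Seminar 2 covers WER and the Levenshtein distance. The following is a minimal sketch, not the seminar notebook's code (the function names `levenshtein` and `wer` are illustrative). It assumes plain whitespace tokenization. WER is computed as the word-level edit distance (substitutions + deletions + insertions) between the reference and the hypothesis, divided by the number of reference words.

```python
def levenshtein(ref: list[str], hyp: list[str]) -> int:
    """Minimum number of substitutions, deletions and insertions turning ref into hyp."""
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution (or match when cost == 0)
            )
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hyp_words) / len(ref_words)


# One substitution (sat -> sit) and one deletion (the): 2 edits / 6 reference words
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference. Practical evaluation pipelines usually normalize case and punctuation before computing it.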

Homeworks

Homework	Date	Deadline	Description
1 (2 points)	February 4	February 18 (23:59)	Audio classification, audio preprocessing
2 (2 points)	February 17	March 4 (23:59)	ASR-1: CTC
3 (2 points)	March 6	March 22 (23:59)	ASR-2: RNN-T
4 (2 points)			Speaker Verification
5 (2 points)			Text-to-speech: FastPitch
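Homework 2 is built around CTC. The sketch below covers only the decoding side and is not the homework's reference solution. It implements best-path (greedy) CTC decoding: take the most likely token in every frame, collapse consecutive repeats, then drop the blank token. The function name `ctc_greedy_decode` and the choice of blank index 0 are assumptions made for illustration.

```python
def ctc_greedy_decode(frame_scores: list[list[float]], blank: int = 0) -> list[int]:
    """Best-path CTC decoding.

    frame_scores: T rows, each holding V per-token scores (probabilities or log-probs).
    blank: index of the CTC blank token (assumed to be 0 here).
    """
    # Most likely token per frame
    best_path = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]
    decoded = []
    prev = None
    for token in best_path:
        # Keep a token only if it starts a new run and is not the blank
        if token != prev and token != blank:
            decoded.append(token)
        prev = token
    return decoded


# Frame-level path 0 1 1 0 1 2 2 0: the blank between the two runs of 1 keeps both
path = [0, 1, 1, 0, 1, 2, 2, 0]
scores = [[1.0 if v == t else 0.0 for v in range(3)] for t in path]
print(ctc_greedy_decode(scores))
```

The blank is what lets CTC emit the same token twice in a row. Without a blank between them, repeated frames collapse into a single output token. Beam search with a language model, covered in the ASR lectures, replaces the per-frame argmax when higher accuracy is needed.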

Game rules

5 homeworks, 2 points each = 10 points
final test = 1 point
bonus points are available in the homeworks
maximum: 10 (homeworks) + 1 (test) + homework bonus points = 11 points + bonus points

Authors

Pavel Severilov

telegram: @severilov
e-mail: pseverilov@gmail.com
BIO:
- Education: MIPT
- Experience: AI assistants (NLP, ASR, OCR), signal processing (Samokat+Kuper, Domclick, Dbrain, Gazpromneft, MIL-team)
- Lecturer: AI Masters, MIPT, ex-Deep Learning School

Daniel Knyazev

telegram: @daniel_knyazev
e-mail: xmaximuskn@gmail.com
BIO:
- Education: MIPT
- Experience: xlabs-ai, SberDevices

Roman Vlasov

telegram: @roman_studentin
e-mail: vlasovroman2017@gmail.com
BIO:
- Education: MIPT, AI Masters
- Experience: Computer Vision (Yandex), LLMs for NLP & TTS (SberDevices), LLMs for end-to-end speech understanding and synthesis
