Authors:
- Massyl Adjal (Project Lead)
- Yasser Bouhai
- Zineddine
- Mohamed Mouad Boularas
Institution: Sorbonne Université — Master Ingénierie des Systèmes Intelligents
Repository: https://github.com/2ineddine/MatchaTTS-Implementation-Analysis
This project focuses on the reproducibility and analysis of Matcha-TTS, a state-of-the-art non-autoregressive text-to-speech architecture based on Optimal-Transport Conditional Flow Matching (OT-CFM).
Matcha-TTS is designed to synthesize high-quality mel-spectrograms efficiently. Unlike diffusion models that require many iterative steps, Matcha-TTS uses an ODE-based decoder to transform noise into speech representations along a straight trajectory.
We present a complete re-implementation of the model, featuring a Transformer-based text encoder with Rotary Position Embeddings (RoPE) and a lightweight 1D U-Net decoder. We evaluated our implementation against the official pre-trained checkpoint on the LJ Speech dataset.
MatchaTTS Architecture
The system functions sequentially through two main subsystems:
- Text Encoder: Transforms raw text into contextualized embeddings using a stack of transformer blocks.
  - Preprocessing: Uses 1D convolutions (Prenet) for local feature extraction.
  - Positional Encoding: Implements RoPE (Rotary Position Embedding), applied to half of the embedding dimensions so the other half preserves the raw information.

The text encoder architecture is inspired by Glow-TTS and Grad-TTS: the text is first tokenized, then processed by a Prenet before being enriched by transformer layers with RoPE.
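The half-dimension RoPE scheme can be sketched as follows. This is an illustrative NumPy version, not the repository code; the function name `apply_rope_half` and the exact first-half/second-half split are our assumptions:

```python
import numpy as np

def apply_rope_half(x, base=10000.0):
    # x: (seq_len, dim). Rotate only the first half of the feature
    # dimensions, leaving the second half untouched so it keeps the
    # raw (position-free) information, as described above.
    t, d = x.shape
    half = d // 2
    x_rot, x_pass = x[:, :half], x[:, half:]

    # One rotation frequency per pair of rotated dimensions.
    freqs = base ** (-np.arange(0, half, 2) / half)   # (half // 2,)
    angles = np.outer(np.arange(t), freqs)            # (seq_len, half // 2)
    cos, sin = np.cos(angles), np.sin(angles)

    # Rotate each (even, odd) pair of dims by its position-dependent angle.
    x1, x2 = x_rot[:, 0::2], x_rot[:, 1::2]
    rotated = np.empty_like(x_rot)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return np.concatenate([rotated, x_pass], axis=1)
```

Because the rotation angle is zero at position 0, the first token passes through unchanged, and the untouched half is identical at every position.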
- Duration Predictor & Alignment: Predicts phoneme durations used to expand the encoder output to frame level; the alignment is learned with Monotonic Alignment Search (MAS).
- Flow Matching: Replaces standard diffusion with OT-CFM. It predicts a vector field $v_t$ that guides simple noise $x_0$ toward a complex mel-spectrogram $x_1$.

Overview of the Training (CFM) and Inference (ODE) algorithms implemented in our code.
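The two algorithms above can be sketched in a few lines. This is an illustrative NumPy version of OT-CFM's straight probability path and its regression target, plus a fixed-step Euler ODE solver for inference; `SIGMA_MIN` and the function names are our choices, not the repository's:

```python
import numpy as np

SIGMA_MIN = 1e-4  # small terminal noise level kept by OT-CFM

def cfm_sample_and_target(x0, x1, t):
    # Training: sample a point on the straight path from noise x0 to data x1,
    #   x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1,
    # and return the (constant-in-t) vector field the decoder must regress,
    #   u_t = x1 - (1 - sigma_min) * x0.
    xt = (1.0 - (1.0 - SIGMA_MIN) * t) * x0 + t * x1
    ut = x1 - (1.0 - SIGMA_MIN) * x0
    return xt, ut

def euler_ode_solve(v_field, x0, n_steps=10):
    # Inference: integrate dx/dt = v(x, t) from t = 0 to t = 1 with
    # fixed-step Euler, starting from pure noise x0.
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v_field(x, i * dt)
    return x
```

With the exact target field, Euler integration lands on $\sigma_{\min} x_0 + x_1 \approx x_1$ regardless of the step count, which is why so few ODE steps suffice compared to diffusion sampling.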
- 1D U-Net Hybrid: The decoder combines ResNet blocks (local features) with Transformer blocks (global context).

The 1D U-Net architecture as implemented in decoder.py.

- SnakeBeta Activation: A key choice adopted by the authors is the SnakeBeta activation in the feed-forward layers; it is periodic and better suited to audio generation than ReLU.
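SnakeBeta (introduced with BigVGAN) adds a periodic term to the identity, with a frequency and a magnitude parameter. A minimal scalar sketch; in the real model `alpha` and `beta` are learnable per-channel parameters, here they are fixed scalars for illustration:

```python
import numpy as np

def snake_beta(x, alpha=1.0, beta=1.0, eps=1e-9):
    # SnakeBeta activation:  f(x) = x + (1 / (beta + eps)) * sin^2(alpha * x)
    # alpha controls the frequency and beta the magnitude of the periodic
    # ripple; unlike ReLU, the function stays close to the identity while
    # injecting a periodic bias that suits audio signals.
    return x + (1.0 / (beta + eps)) * np.sin(alpha * x) ** 2
```

Note that `f(0) = 0` and the residual `f(x) - x` is periodic with period `pi / alpha`, so the activation never kills gradients the way ReLU does for negative inputs.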
- Dataset: LJ Speech (approx. 24 hours of single-speaker English audio).
- Text Processing: IPA phonemization via espeak-ng, with interspersing (blank tokens) to stabilize MAS.
- Audio Processing:
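The blank-token interspersing mentioned above follows the Glow-TTS convention of placing a blank between every phoneme and at both ends; a minimal sketch (the `blank_id` value is an assumption):

```python
def intersperse(tokens, blank_id=0):
    # Insert a blank token between every phoneme and at both ends,
    # e.g. [a, b, c] -> [_, a, _, b, _, c, _]. The extra states give
    # MAS room to absorb silences and transitions, stabilizing alignment.
    out = [blank_id] * (2 * len(tokens) + 1)
    out[1::2] = tokens
    return out
```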
We compared the Original Paper's reported results against our Re-implementation (Original) and a Custom Variation. All models were trained for 150 epochs on a 4-GPU cluster.
Quantitative Analysis
| Metric | Matcha (Paper) | Matcha (Retested) | Custom (Ours) |
|---|---|---|---|
| Parameters | 18.2M | 18.2M | 18.2M |
| RTF (GPU) | 0.03829 | ~0.019 | ~0.019 |
| WER (%) | 2.09 | | |
| Synthesis Time (s) | — | | |
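For reference, the real-time factor (RTF) reported above is the ratio of synthesis time to the duration of the produced audio; a one-line sketch (the function name is ours):

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    # RTF < 1 means faster than real time; e.g. RTF ~ 0.019 means an
    # utterance is synthesized in about 2% of its own playback duration.
    return synthesis_seconds / audio_seconds
```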
Subjective Evaluation (MOS)
Mean Opinion Score evaluated by 31 participants.
| Model | MOS (3 samples) |
|---|---|
| Matcha (Paper) | 3.84 |
| Original Re-implementation | 3.86 |
| Our Custom Model | |
Synthesis time as a function of character count. Points represent individual samples, while trend curves show the average behavior of models. A quasi-linear relationship is observed, with constant overhead dominating for short texts.
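The quasi-linear trend can be quantified with a least-squares line fit, where the intercept captures the fixed per-utterance overhead that dominates for short texts. The measurements below are hypothetical, for illustration only:

```python
import numpy as np

# Hypothetical (character count, synthesis time in seconds) measurements.
chars = np.array([20, 50, 100, 200, 400])
times = np.array([0.15, 0.195, 0.27, 0.42, 0.72])

# Fit time = slope * chars + intercept; the positive intercept is the
# constant overhead, the slope the marginal cost per character.
slope, intercept = np.polyfit(chars, times, 1)
```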
- Reproducibility: Our re-implementation achieves a MOS of 3.86, virtually identical to the original paper's 3.84, confirming successful reproduction of audio quality.
- Efficiency: Our implementations achieved an RTF of ~0.019, faster than the paper's reported 0.03829. Synthesis time scales quasi-linearly with text length.
- Critical Challenge: We identified that mel-spectrogram normalization is mandatory. An initial version without it failed to converge, resulting in extremely high loss and audible artifacts.
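The normalization that proved mandatory is a standard global standardization of the mel frames with dataset-level statistics; a minimal sketch, assuming the mean and standard deviation are computed once over the training set:

```python
import numpy as np

def normalize_mel(mel, mean, std, eps=1e-5):
    # Standardize mel frames with dataset-level statistics. Without this,
    # the flow-matching targets sit far from the unit-variance noise prior
    # and training fails to converge, as observed above.
    return (mel - mean) / (std + eps)

def denormalize_mel(mel_norm, mean, std, eps=1e-5):
    # Inverse transform, applied before passing the spectrogram to the vocoder.
    return mel_norm * (std + eps) + mean
```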
- 18M Parameter (Main): The primary version used for all final results; stable training and high quality. It is the current main branch.
- 16M Parameter (Simplified): A version with simplified transformer blocks in the decoder. It achieved training dynamics comparable to the 18M version.
- 18M Parameter (From Scratch): A version with our own implementation of all fundamental modules (for example, the attention blocks and other modules found in the diffusion library). It achieved worse results than the other versions and can be found in the ZedBranch2 branch.
Training curves (Validation Loss, Prior Loss, Duration Loss, Diffusion/Flow Loss) comparing the 16M (pink) and 18M (orange) implementations. All losses show similar convergence patterns.
- Original Paper: Mehta, S., et al. "Matcha-TTS: A fast TTS architecture with conditional flow matching." arXiv:2309.03199 (2023).
- Implementation Report: Adjal, M., Bouhai, Y., B., Z., Boularas, M. M. "MatchaTTS Implementation and Analysis." Sorbonne Université (2025-2026).






