thu-spmi/ASR-Benchmarks

ASR benchmarks

An effort to track benchmarking results over widely-used datasets for ASR (Automatic Speech Recognition). ASR results are affected by a number of factors, so it is important to report results along with those factors for fair comparison. In this way, we can measure progress more scientifically. Feel free to add and correct!

Nomenclature (alphabetical ordering)

Terms Explanations
AM Acoustic Model. Options: DNN-HMM / CTC / ATT / ATT+CTC / RNN-T / CTC-CRF. Note that we list some end-to-end (e2e) models (e.g., ATT, RNN-T) in this field, although those e2e models contain an implicit/internal LM through the encoder.
AM size (M) The number of parameters in millions in the Acoustic Model. For e2e models, this field reports the total number of parameters.
ATT Attention based Seq2Seq, including LAS (Listen Attend and Spell).
CER Character Error Rate
Data Aug. Whether any form of data augmentation is used, such as SP (3-fold Speed Perturbation from Kaldi) or SA (SpecAugment)
Ext. Res. Whether any external resources beyond the standard datasets are used, such as external speech (more transcribed speech or unlabeled speech), external text corpora, or pretrained models
L Number of layers, e.g., L24 denotes 24 layers
LM Language Model, explicitly used, word-level (by default). "---" denotes not using shallow fusion with explicit/external LMs, particularly for ATT, RNN-T.
LM size (M) The number of parameters in millions in the neural Language Model. For n-gram LMs, this field denotes the total number of n-gram features.
NAS Neural Architecture Search
Unit phone (namely monophone), biphone, triphone, wp (word-piece), character, chenone, BPE (byte-pair encoding)
WER Word Error Rate
--- not applied
? not known from the original paper
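
As a reminder of what the WER and CER columns measure, both are conventionally computed as the Levenshtein edit distance between the reference and the hypothesis, divided by the reference length. The following is a minimal sketch under that assumption; the function names are hypothetical, and it is not the scoring tool used by any of the listed papers, which may apply their own text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance divided by the reference length, in percent."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref, hyp):
    """Word Error Rate: tokens are whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character Error Rate: tokens are characters, spaces ignored (an assumption)."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))
```

For example, `wer("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words.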

WSJ

This dataset contains about 80 hours of training data, consisting of read sentences from the Wall Street Journal, recorded under clean conditions. Available from the LDC as WSJ0 under the catalog number LDC93S6B.

The evaluation dataset contains the simpler eval92 subset and the harder dev93 subset.

Results are sorted by eval92 WER.

| eval92 WER | dev93 WER | Unit | AM | AM size (M) | LM | LM size (M) | Data Aug. | Ext. Res. | Paper |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2.50 | 5.48 | mono-phone | CTC-CRF, deformable TDNN | 11.9 | 4-gram | 10.24 | SP | --- | Deformable TDNN |
| 2.7 | 5.3 | bi-phone | LF-MMI, TDNN-LSTM | ? | 4-gram | ? | SP | --- | LF-MMI TASLP2018 |
| 2.77 | 5.68 | mono-phone | CTC-CRF, TDNN NAS | 11.9 | 4-gram | 10.24 | SP | --- | NAS SLT2021 |
| 3.0 | 6.0 | bi-phone | EE-LF-MMI, TDNN-LSTM | ? | 4-gram | ? | SP | --- | EE-LF-MMI TASLP2018 |
| 3.2 | 5.7 | mono-phone | CTC-CRF, VGG-BLSTM | 16 | 4-gram | 10.24 | SP | --- | CAT IS2020 |
| 3.4 | 5.9 | sub-word | ATT, LSTM | 18 | RNN | 113 | --- | --- | ESPRESSO ASRU2019 |
| 3.79 | 6.23 | mono-phone | CTC-CRF, BLSTM | 13.5 | 4-gram | 10.24 | SP | --- | CTC-CRF ICASSP2019 |
| 4.9 | --- | mono-char | ATT+CTC, Transformers | ? | 4-gram | ? | SA | --- | phoneBPE-IS2020 |
| 5.0 | 8.1 | mono-char | CTC-CRF, VGG-BLSTM | 16 | 4-gram | 10.24 | SP | --- | CAT IS2020 |

Swbd

This dataset contains about 260 hours of English telephone conversations between two strangers on a preassigned topic (LDC97S62). The testing is commonly conducted on eval2000 (a.k.a. hub5'00 evaluation, LDC2002S09 for speech data and LDC2002T43 for transcripts), which consists of two test subsets - Switchboard (SW) and CallHome (CH).

Results in square brackets denote the weighted average over SW and CH based on our calculation when not reported in the original paper.
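
A weighted average of this kind can be sketched as below. This is only an illustration: the function name is hypothetical, and the weights are assumed to be the reference word counts of each subset, which must be taken from the actual eval2000 transcripts. The usage line uses equal placeholder weights purely to show the arithmetic.

```python
def weighted_average_wer(sw_wer, ch_wer, sw_words, ch_words):
    """Combine two subset WERs, weighting each by its reference word count."""
    total = sw_words + ch_words
    if total <= 0:
        raise ValueError("word counts must sum to a positive number")
    return (sw_wer * sw_words + ch_wer * ch_words) / total


# Placeholder weights (1, 1) are an assumption, not the real subset sizes.
print(round(weighted_average_wer(6.3, 13.3, 1, 1), 1))
```

With unequal subset sizes the result shifts toward the WER of the larger subset, which is why the word counts matter.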

Results are sorted by Sum WER.

| SW | CH | Sum | Unit | AM | AM size (M) | LM | LM size (M) | Data Aug. | Ext. Res. | Paper |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6.3 | 13.3 | [9.8] | charBPE & phoneBPE | ATT+CTC, Transformers, L24 enc, L12 dec | ? | multi-level RNNLM | ? | SA | Fisher transcripts | phoneBPE-IS2020 |
| 6.4 | 13.4 | 9.9 | char | RNN-T, BLSTM-LSTM, ivector | 57 | LSTM | 84 | SP, SA, etc. | Fisher transcripts | Advancing RNN-T ICASSP2021 |
| 6.5 | 13.9 | 10.2 | phone | LF-MMI, TDNN-f | ? | Transformer | 25 | SP | Fisher transcripts | P-Rescoring ICASSP2021 |
| 6.8 | 14.1 | [10.5] | wp 1k | ATT | ? | LSTM | ? | SA | Fisher transcripts | SpecAug IS2019 |
| 6.9 | 14.5 | 10.7 | phone | CTC-CRF Conformer | 51.82 | Transformer | 25 | SP, SA | Fisher transcripts | Advancing CTC-CRF |
| 7.2 | 14.8 | 11.1 | wp | CTC-CRF Conformer | 51.85 | Transformer | 25 | SP, SA | Fisher transcripts | Advancing CTC-CRF |
| 7.9 | 15.7 | 11.8 | char | RNN-T BLSTM-LSTM | 57 | LSTM | 5 | SP, SA, etc. | --- | Advancing RNN-T ICASSP2021 |
| 7.9 | 16.1 | 12.1 | phone | CTC-CRF Conformer | 51.82 | 4-gram | 4.71 | SP, SA | Fisher transcripts | Advancing CTC-CRF |
| 8.3 | 17.1 | [12.7] | bi-phone | LF-MMI, TDNN-LSTM | ? | LSTM | ? | SP | Fisher transcripts | LF-MMI TASLP2018 |
| 8.6 | 17.0 | 12.8 | phone | LF-MMI, TDNN-f | ? | 4-gram | ? | SP | Fisher transcripts | P-Rescoring ICASSP2021 |
| 8.5 | 17.4 | [13.0] | bi-phone | EE-LF-MMI, TDNN-LSTM | ? | LSTM | ? | SP | Fisher transcripts | EE-LF-MMI TASLP2018 |
| 8.8 | 17.4 | 13.1 | mono-phone | CTC-CRF, VGG-BLSTM | 39.2 | LSTM | ? | SP | Fisher transcripts | CAT IS2020 |
| 9.0 | 18.1 | [13.6] | BPE | ATT/CTC | ? | Transformer | ? | SP | Fisher transcripts | ESPnet-Transformer ASRU2019 |
| 9.7 | 18.4 | 14.1 | mono-phone | CTC-CRF, chunk-based VGG-BLSTM | 39.2 | 4-gram | 4.71 | SP | Fisher transcripts | CAT IS2020 |
| 9.8 | 18.8 | 14.3 | mono-phone | CTC-CRF, VGG-BLSTM | 39.2 | 4-gram | 4.71 | SP | Fisher transcripts | CAT IS2020 |
| 10.3 | 19.3 | [14.8] | mono-phone | CTC-CRF, BLSTM | 13.5 | 4-gram | 4.71 | SP | Fisher transcripts | CTC-CRF ICASSP2019 |

FisherSwbd

The Fisher dataset contains about 1600 hours of English conversational telephone speech (First part: LDC2004S13 for speech data, LDC2004T19 for transcripts; second part: LDC2005S13 for speech data, LDC2005T19 for transcripts).

FisherSwbd includes both the Fisher and Switchboard datasets, which total around 2000 hours. Evaluation is commonly conducted on the eval2000 and RT03 (LDC2007S10) datasets.

Results are sorted by Sum WER.

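| SW | CH | Sum | RT03 | Unit | AM | AM size (M) | LM | LM size (M) | Data Aug. | Ext. Res. | Paper |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7.5 | 14.3 | [10.9] | 10.7 | bi-phone | LF-MMI, TDNN-LSTM | ? | LSTM | ? | SP | --- | LF-MMI TASLP2018 |
| 7.6 | 14.5 | [11.1] | 11.0 | bi-phone | EE-LF-MMI, TDNN-LSTM | ? | LSTM | ? | SP | --- | EE-LF-MMI TASLP2018 |
| 7.3 | 15.0 | 11.2 | ? | mono-phone | CTC-CRF, VGG-BLSTM | 39.2 | LSTM | ? | SP | --- | CAT IS2020 |
| 8.3 | 15.5 | [11.9] | ? | char | ATT | ? | --- | --- | SP | --- | Tencent-IS2018 |
| 8.1 | 17.5 | [12.8] | ? | char | RNN-T | ? | 4-gram | ? | SP | --- | Baidu-ASRU2017 |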

Librispeech

The LibriSpeech dataset is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz. The dataset is freely available for download, along with a separately prepared LM training corpus and pre-built language models. Notably, the LM training corpus introduced in the original LibriSpeech task consists of an additional 800M words, which is 80 times larger than the 10M words in the transcriptions of the 1000-hour labeled speech.

There are four test sets: dev-clean, dev-other, test-clean and test-other. For the sake of display, the results are sorted by test-clean WER.

| test clean WER | test other WER | Unit/AM/AM size (M) | LM/LM size (M) | Data Aug./Ext. Res. | Paper |
| --- | --- | --- | --- | --- | --- |
| 1.4 | 2.5 | wp/RNN-T Conformer, Pre-training + Noisy Student Training Self-training/1017 | ---/--- | SA/Libri-Light unlab-60k hours | w2v-BERT |
| 1.5 | 2.8 | wp/RNN-T Conformer, Pre-training/1017 | ---/--- | SA/Libri-Light unlab-60k hours | w2v-BERT |
| 1.75 | 4.46 | triphone/LF-MMI multistream CNN/20.6M (footnote 1) | self-attentive simple recurrent unit (SRU) L24/139 | SA/--- | ASAPP-ASR |
| 1.8 | 3.6 | wp/CTC Conformer, wav2vec2.0/1017 | ---/--- | SA/Libri-Light unlab-60k hours | ConformerCTC |
| 1.9 | 3.9 | wp/RNN-T Conformer/119 | LSTM only on transcripts/~100M (footnote 1) | SA/--- | Conformer |
| 1.9 | 4.1 | wp/RNN-T ContextNet (L)/112.7 | LSTM only on transcripts/? | SA/--- | ContextNet |
| 2.1 | 4.2 | wp/CTC vggTransformer/81 | Transformer L42/338 (footnotes 1, 2) | SP, SA/--- | FB2020WPM |
| 2.1 | 4.3 | wp/RNN-T Conformer/119 | ---/--- | SA/--- | Conformer |
| 2.26 | 4.85 | chenone/DNN-HMM Transformer seq. disc./90 | Transformer/? | SP, SA/--- | TransHybrid |
| 2.3 | 5.0 | triphone/DNN-HMM BLSTM/? | Transformer/? | ---/--- | RWTH19ASR |
| 2.31 | 4.79 | wp/CTC vggTransformer/81 | 4-gram/145 (footnote 3) | SP, SA/--- | FB2020WPM |
| 2.5 | 5.8 | wp/ATT CNN-BLSTM/? | RNN/? | SA/--- | SpecAug IS2019 |
| 2.51 | 5.95 | phone/CTC-CRF Conformer/51.82 | Transformer L42/338 (footnote 2) | SA/--- | Advancing CTC-CRF |
| 2.54 | 6.33 | wp/CTC-CRF Conformer/51.85 | Transformer L42/338 (footnote 2) | SA/--- | Advancing CTC-CRF |
| 2.6 | 5.59 | chenone/DNN-HMM Transformer/90 | 4-gram/? | SP, SA/--- | TransHybrid |
| 2.7 | 5.9 | wp/CTC Conformer/116 | ---/--- | SA/--- | ConformerCTC |
| 2.8 | 6.8 | wp/ATT CNN-BLSTM/? | ---/? | SA/--- | SpecAug IS2019 |
| 2.8 | 9.3 | wp/DNN-HMM LSTM/? | transformer/? | ---/--- | RWTH19ASR |
| 3.61 | 8.10 | phone/CTC-CRF Conformer/51.82 | 4-gram/145 (footnote 3) | SA/--- | Advancing CTC-CRF |
| 4.09 | 10.65 | phone/CTC-CRF BLSTM/13 | 4-gram/145 (footnote 3) | ---/--- | CTC-CRF ICASSP2019 |
| 4.28 | --- | tri-phone/LF-MMI TDNN/? | 4-gram/? | SP/--- | LF-MMI Interspeech |

LLM-ASR results (Projector only trained)

We separate LLM-based ASR results into another table.

  • Unless otherwise stated, only the Projector is trained, while the Speech Encoder and the LLM are frozen.
  • AM indicates the Speech Encoder, not including the Projector.
  • LM indicates the LLM.

| test clean WER | test other WER | AM | AM size (M) | LM | LM size (M) | Paper |
| --- | --- | --- | --- | --- | --- | --- |
| 1.8 | 3.4 | HuBert-xlarge + LS-960 FT | 964M | Vicuna-7B | 7B | SLAM-ASR Table 8 |
| 2.0 | 4.2 | WavLM-large + LS-960 FT | 316.62M | Vicuna-7B | 7B | SLAM-ASR Table 8, namely 1.96, 4.18 in Table 5 |
| 2.58 | 6.47 | Whisper-large | 634.86M | Vicuna-7B | 7B | SLAM-ASR Table 5 |
| 2.72 | 6.79 | Whisper-medium | 305.68M | Vicuna-7B | 7B | SLAM-ASR Table 5 |
| 4.19 | 9.50 | Whisper-small | 87.00M | Vicuna-7B | 7B | SLAM-ASR Table 5 |
| 4.33 | 8.62 | Whisper-large | 634.86M | TinyLlama-Chat | 1.1B | SLAM-ASR Table 4 |
| 5.01 | 8.67 | Whisper-medium | 305.68M | TinyLlama-Chat | 1.1B | SLAM-ASR Table 2 |
| 5.94 | 11.5 | Whisper-small | 87.00M | TinyLlama-Chat | 1.1B | SLAM-ASR Table 2 |
| 6.73 | 9.13 | HuBert-xlarge | 964M | TinyLlama | 1.1B | SpeechLLM-2B, train projector and LLM-LoRA |
| 11.51 | 16.68 | WavLM-large | 316.62M | TinyLlama | 1.1B | SpeechLLM-1.5B, train projector and LLM-LoRA |

More LLM-ASR results

  • Learnable (bold text)

| test clean | test other | Method | Paper |
| --- | --- | --- | --- |
| 2.28 | 5.20 | Whisper large-v2 (634.86M) + 80-query Q-Former (24.5M connector) + Vicuna-13B | Q-Former Table 1 |
| 2.3 | 4.8 | NeMo Conformer encoder (110M) + Adapter + 2B Megatron LLM LoRA + nucleus sampling | SALM Table 1 |
| 2.10 | 4.26 | HuBERT Large (317M) LS-960h FT + Projector (18.88M) + Vicuna-7B | SLAM-ASR Table 5 |
| 2.29 | 5.67 | HuBERT-large frozen + Adapter + Vicuna-7B frozen (48M total learnable) | A Comprehensive..., Model S1 |
| 1.78 | 3.62 | HuBERT-large LoRA + Adapter + Vicuna-7B LoRA (65M total learnable) | A Comprehensive..., Model S4 |
| 1.85 | 3.77 | HuBERT-large LoRA + Adapter + Vicuna-7B LoRA (337M total learnable) | A Comprehensive..., Model S10 |
| 2.65 | 5.03 | WavLM-large (316.62M) + Projector + Llama3.1-8B (the SLAM recipe) | Measuring the Redundancy..., Table 1 |
| 1.70 | 3.56 | WavLM-large (316.62M) + Projector + Qwen2.5-7B LoRA (the SLAM recipe) | Measuring the Redundancy..., Table 1 |
| 1.97 | 3.78 | WavLM-large (316.62M) BPE-phoneme-100 FT + Projector + Qwen3-8B-Base LoRA-1.4B | Phoneme-informed projector |

Streaming results

Latency = Chunk-size of Encoder + Look-ahead of Encoder

| test clean WER | test other WER | Latency (s) | AM Encoder | LLM Decoder / Decoder left context | Paper |
| --- | --- | --- | --- | --- | --- |
| 3.4 | 5.5 | - | Conformer 110M | 92M | SpeechLLM-XL non-streaming baseline |
| 2.7 | 6.7 | 1.28+0.24 | Emformer 107M | 92M / 5.12s (left 4 chunks) | SpeechLLM-XL |
| 3.0 | 7.4 | 1.92+0.96 | 80M | llama-2-7b-hf LoRA rank-16 (27.3M) / 1s | ReaLLM |
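
The latency entries above are written as a sum of terms. Assuming, per the definition given for this table, that the first term is the encoder chunk size and the second is the encoder look-ahead, a small helper (hypothetical name) can turn an entry into a single number of seconds:

```python
def total_latency(entry):
    """Sum the '+'-separated terms of a latency entry such as '1.28+0.24'.

    Returns None for entries marked '-' (no streaming latency reported).
    """
    entry = entry.strip()
    if entry in {"-", ""}:
        return None
    return sum(float(term) for term in entry.split("+"))


for entry in ["-", "1.28+0.24", "1.92+0.96"]:
    print(entry, "->", total_latency(entry))
```

This only handles purely numeric terms; entries containing symbols would need separate handling.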

AISHELL-1

AISHELL-ASR0009-OS1 is a 178-hour open-source Mandarin speech dataset. It is a part of AISHELL-ASR0009, which contains utterances from 11 domains, including smart home, autonomous driving, and industrial production. All recordings were made in a quiet indoor environment, using 3 different devices at the same time: a high-fidelity microphone (44.1 kHz, 16-bit), an Android mobile phone (16 kHz, 16-bit), and an iOS mobile phone (16 kHz, 16-bit). The high-fidelity audio was re-sampled to 16 kHz to build AISHELL-ASR0009-OS1. 400 speakers from different accent areas in China were invited to participate in the recording. The corpus is divided into training, development and testing sets.

| test CER | Unit | AM | AM size (M) | LM | LM size (M) | Data Aug. | Ext. Res. | Paper |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4.18 | char | RNN-T+CTC, Conformer, LF-MMI | 89 | word 3-gram | ? | SA+SP | --- | e2e-word-ngram |
| 4.5 | char | ATT+CTC, Conformer | ? | LSTM | ? | SA+SP | --- | WNARS |
| 4.6 | char | ATT+CTC, Transformer | Enc(94.4)+Dec(61.0)+CTC branch(3.3)=158.7 | --- | --- | SP | wav2vec2.0, DistilGPT2 | preformer-ASRU2021 |
| 4.63 | char | ATT+CTC, Conformer | ? | bidirectional attention rescoring | ? | SA+SP | --- | U2++ |
| 4.7 | char | ATT+CTC, Conformer | ? | Transformer | ? | SA+SP | --- | ESPnet-2 |
| 4.72 | char | ATT+CTC, Conformer | ? | attention rescoring | ? | SA+SP | --- | U2 |
| 4.8 | char | RNN-T+CTC, Conformer | 84.3 | --- | --- | SA+SP | --- | ESPnet-1 |
| 5.2 | char | Conformer | ? | --- | --- | SA | --- | intermediate CTC loss |
| 6.34 | phone | CTC-CRF, VGG-BLSTM | 16 | word 4-gram | 0.7 | SP | --- | CAT IS2020 |

Streaming results

The latency is defined as the chunk size plus the right context (if any). Δ is the additional latency introduced by rescoring the first-pass hypotheses, which is typically less than 100 ms for an utterance.

CER Latency (ms) AM LM Method
4.79 400+2+Δ CUSIDE+NNLM rescoring
5.47 400+2 CUSIDE
5.05 640+Δ U2++ w/ rescoring
5.22 640+Δ WNARS w/ rescoring
5.81 640 U2++
6.6 1920 MMA wide chunk
6.8 1280 HS-DACS Transformer
7.39 600 SCAMA
7.5 960 MMA narrow chunk

CHiME-4

The 4th CHiME challenge sets a target for distant-talking automatic speech recognition using a read speech dataset. Two types of data are employed: 'Real data' - speech data that is recorded in real noisy environments (on a bus, cafe, pedestrian area, and street junction) uttered by actual talkers. 'Simulated data' - noisy utterances that have been generated by artificially mixing clean speech data with noisy backgrounds.

There are four test sets. For the sake of display, the results are sorted by eval real WER.

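| dev simu WER | dev real WER | eval simu WER | eval real WER | Unit | AM | AM size (M) | LM | LM size (M) | Data Aug. | Ext. Res. | Paper |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.15 | 1.50 | 1.45 | 1.99 | phone | wide-residual BLSTM | ? | LSTM | ? | --- | --- | Complex Spectral Mapping |
| 1.78 | 1.69 | 2.12 | 2.24 | phone | 6 DCNN ensemble | ? | LSTM | ? | --- | --- | USTC-iFlytek CHiME4 |
| 2.10 | 1.90 | 2.66 | 2.74 | phone | LF-MMI, TDNN | ? | LSTM | ? | --- | --- | Kaldi-CHiME4 |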

References

Short-hands Full references
Deformable TDNN Keyu An, Yi Zhang, Zhijian Ou. Deformable TDNN with adaptive receptive fields for speech recognition. Interspeech 2021.
CTC-CRF ICASSP2019 H. Xiang, Z. Ou. CRF-based Single-stage Acoustic Modeling with CTC Topology. ICASSP 2019.
CAT IS2020 K. An, H. Xiang, Z. Ou. CAT: A CTC-CRF based ASR Toolkit Bridging the Hybrid and the End-to-end Approaches towards Data Efficiency and Low Latency. INTERSPEECH 2020.
NAS SLT2021 H. Zheng, K. An, Z. Ou. Efficient Neural Architecture Search for End-to-end Speech Recognition via Straight-Through Gradients. SLT 2021.
Conformer Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang. Conformer: Convolution-augmented Transformer for Speech Recognition. INTERSPEECH 2020.
ContextNet Wei Han*, Zhengdong Zhang*, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, Yonghui Wu. ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context. INTERSPEECH 2020.
ASAPP-ASR Jing Pan, Joshua Shapiro, Jeremy Wohlwend, Kyu J. Han, Tao Lei, Tao Ma. ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition. INTERSPEECH 2020.
U2 Binbin Zhang, Di Wu, Zhuoyuan Yao, Xiong Wang, Fan Yu, Chao Yang, Liyong Guo, Yaguang Hu, Lei Xie, Xin Lei. Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition. arXiv:2012.05481.
Kaldi-CHiME4 Szu-Jui Chen, Aswin Shanmugam Subramanian, Hainan Xu, Shinji Watanabe. Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline. INTERSPEECH 2018.
USTC-iFlytek CHiME4 Jun Du, Yan-Hui Tu, Lei Sun, Feng Ma, Hai-Kun Wang, Jia Pan, Cong Liu, Jing-Dong Chen, Chin-Hui Lee. The USTC-iFlytek System for CHiME-4 Challenge.
Complex Spectral Mapping Zhong-Qiu Wang, Peidong Wang, DeLiang Wang. Complex Spectral Mapping for Single- and Multi-Channel Speech Enhancement and Robust ASR. TASLP 2020.
intermediate CTC loss Jaesong Lee, Shinji Watanabe. Intermediate Loss Regularization for CTC-based Speech Recognition. ICASSP 2021.
WNARS Zhichao Wang, Wenwen Yang, Pan Zhou, Wei Chen. WNARS: WFST based Non-autoregressive Streaming End-to-End Speech Recognition. arXiv:2104.03587.
LF-MMI H. Hadian, H. Sameti, D. Povey, and S. Khudanpur. Flat-start single-stage discriminatively trained HMM-based models for ASR. TASLP 2018.
LF-MMI Interspeech D. Povey, et al. Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI. Interspeech 2016.
ESPRESSO Yiming Wang, Tongfei Chen, Hainan Xu, Shuoyang Ding, Hang Lv, Yiwen Shao, Nanyun Peng, Lei Xie, Shinji Watanabe, and Sanjeev Khudanpur. Espresso: A fast end-to-end neural speech recognition toolkit. ASRU 2019.
Advancing RNN-T George Saon, Zoltan Tueske, Daniel Bolanos, Brian Kingsbury. Advancing RNN Transducer Technology for Speech Recognition. ICASSP 2021.
P-Rescoring Ke Li, Daniel Povey, Sanjeev Khudanpur. A Parallelizable Lattice Rescoring Strategy with Neural Language Models. ICASSP 2021.
SpecAug D. S. Park, W. Chan, Y. Zhang, et al. SpecAugment: A simple data augmentation method for automatic speech recognition. Interspeech 2019.
ESPnet-Transformer S. Karita, N. Chen, et al. A comparative study on transformer vs RNN in speech applications. ASRU 2019.
Baidu-ASRU2017 E. Battenberg, J. Chen, R. Child, A. Coates, Y. Li, H. Liu, S. Satheesh, A. Sriram, and Z. Zhu. Exploring neural transducers for end-to-end speech recognition. ASRU 2017.
Tencent-IS2018 C. Weng, J. Cui, G. Wang, J. Wang, C. Yu, D. Su, and D. Yu. Improving attention based sequence-to-sequence models for end-to-end English conversational speech recognition. Interspeech 2018.
phoneBPE-IS2020 Weiran Wang, Guangsen Wang, Aadyot Bhatnagar, Yingbo Zhou, Caiming Xiong, and Richard Socher. An investigation of phone-based subword units for end-to-end speech recognition. Interspeech 2020.
RWTH19ASR C. Luscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schluter, and H. Ney. RWTH ASR systems for LibriSpeech: Hybrid vs attention-w/o data augmentation. Interspeech 2019.
ConformerCTC Edwin G Ng, Chung-Cheng Chiu, Yu Zhang, and William Chan. Pushing the limits of non-autoregressive speech recognition. Interspeech 2021.
FB2020WPM F. Zhang, Y. Wang, X. Zhang, C. Liu, et al. Fast, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces. InterSpeech, 2020.
TransHybrid Yongqiang Wang, Abdelrahman Mohamed, Duc Le, Chunxi Liu, Alex Xiao, Jay Mahadeokar, Hongzhao Huang, Andros Tjandra, Xiaohui Zhang, Frank Zhang, Christian Fuegen, Geoffrey Zweig, and Michael L. Seltzer. Transformer based acoustic modeling for hybrid speech recognition. ICASSP 2020.
U2++ Di Wu, Binbin Zhang, et al. U2++: Unified Two-pass Bidirectional End-to-end Model for Speech Recognition. arXiv:2106.05642.
Advancing CTC-CRF Huahuan Zheng*, Wenjie Peng*, Zhijian Ou, Jinsong Zhang. Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers. arXiv:2107.03007.
e2e-word-ngram Jinchuan Tian, Jianwei Yu, et al. Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model. arXiv:2201.01995.
Transformer-LM K. Irie, A. Zeyer, R. Schluter, and H. Ney. Language Modeling with Deep Transformers. Interspeech, 2019.
w2v-BERT Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, Yonghui Wu. W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training. arXiv:2108.06209.

Footnotes

  1. from correspondence with the authors

  2. used the 42-layer transformer LM in this paper for LibriSpeech.

  3. used the 4-gram LM provided along with the LibriSpeech dataset, available here
