End-to-end local speech-analysis pipeline for two tasks:
- Listen & Repeat – compare a student’s utterance to a reference prompt
- Interview – evaluate a student’s answer to an interviewer’s question
The tool downloads (or opens) audio, transcribes it with faster-whisper, and produces JSON reports with fluency, pronunciation, grammar, and task-specific metrics. Mispronunciations and discourse checks leverage OpenAI GPT-4o; interview relevance uses OpenAI embeddings.
- Features
- Requirements
- Installation
- Configuration
- Usage (CLI)
- Programmatic Use
- Output (JSON) – What You Get
- How It Works (brief)
## Features

- Works with local files or URLs for both prompt and student audio
- Transcription with word-level timestamps (faster-whisper)
- Metrics (per task, aggregated across pairs):
  - Duration and speech rate (wpm)
  - Repeat accuracy (WER) + `incorrect_segments`
  - Pause stats → `pause_frequency_level`, `pause_appropriateness_level`
  - Pronunciation: mispronounced words via GPT-4o audio + 0–100 accuracy score
  - Grammar: issues + score via GPT-4o text
- Interview-only extras:
  - Relevance (question ↔ answer) via text-embedding cosine similarity → CEFR label
  - Discourse (coherence/organization) via GPT-4o text → CEFR label
  - Vocabulary block (complexity + diversity proxies)
  - Word repetition level
  - Grammar formatted into `errors[{original_sentence, corrected_sentence, fdiff[]}]`
- Single class handles both tasks; choose with `--task listen_repeat` or `--task interview`
- JSON written to the path you pass in (`--out`), one file per task
## Requirements

- Python 3.9+ (3.10/3.11 recommended)
- FFmpeg installed and on PATH (required by pydub)
- A GPU is optional; CPU works. (For GPU, install CUDA-capable PyTorch; faster-whisper will detect it.)
## Installation

```bash
pip install -r requirements.txt
```

- Ensure FFmpeg is installed; macOS (Homebrew): `brew install ffmpeg`
## Configuration

- Create a `.env` file in the project root with your OpenAI API key:
  `OPENAI_API_KEY=sk-...`
- The script reads this automatically via python-dotenv. You can also pass `api_key` to the class directly when using the programmatic API.
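A minimal sketch of that key loading, assuming the standard python-dotenv pattern (the exact variable handling in the script may differ):

```python
import os

from dotenv import load_dotenv  # python-dotenv
from openai import OpenAI

load_dotenv()  # pulls OPENAI_API_KEY from .env into the process environment
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))  # OpenAI() also reads the env var by default
```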
## Usage (CLI)

Listen & Repeat:

```bash
python local_listen_repeat.py \
  --task listen_repeat \
  --out out/listen_report.json \
  --pairs "data/p01_prompt.wav:data/p01_student.wav" \
          "data/p02_prompt.wav:data/p02_student.wav"
```
Interview:

```bash
python local_listen_repeat.py \
  --task interview \
  --out out/interview_report.json \
  --pairs "data/q1.wav:data/a1.wav" \
          "data/q2.wav:data/a2.wav"
```

- `--pairs` accepts one or more items, each in the form `prompt:student`
- Each side can be a local path or an HTTPS URL
- The output directory is created if it doesn't exist
## Programmatic Use

```python
from speaking_report import LocalSpeakingAssessmentReport, ListenRepeatPair

# Listen & Repeat
lr_pairs = [
    ListenRepeatPair("data/p01_prompt.wav", "data/p01_student.wav"),
    ListenRepeatPair("data/p02_prompt.wav", "data/p02_student.wav"),
]
lr = LocalSpeakingAssessmentReport(task="listen_repeat")
lr.generate_report(lr_pairs, out_path="out/listen_report.json")

# Interview
int_pairs = [
    ListenRepeatPair("data/q1.wav", "data/a1.wav"),
    ListenRepeatPair("data/q2.wav", "data/a2.wav"),
]
interview = LocalSpeakingAssessmentReport(task="interview")
interview.generate_report(int_pairs, out_path="out/interview_report.json")
```

## Output (JSON) – What You Get

All reports share common fields; Interview adds a few more.

Common fields (Listen & Repeat example):

```json
{
  "version": "1.0",
  "generation_failed": false,
  "errors": [],
  "overall_score": { "cefr": "B1", "toefl_score": "4", "old_toefl_score": "23" },
  "speech_rate": 123,
  "duration": "02:37",
  "repeat_accuracy": { "score": 76 },
  "incorrect_segments": ["..."],
  "mispronounced_words": [{ "word": "temperature" }],
  "fluency": {
    "speech_rate_level": "B1",
    "coherence_level": "B1",
    "pause_frequency_level": "B2",
    "pause_appropriateness_level": "A2",
    "repeat_accuracy_level": "B1",
    "description": "Speech is understandable ...",
    "description_cn": "整体可理解 ..."
  },
  "pronunciation": {
    "prosody_rhythm_level": "B1",
    "vowel_fullness_level": "B1",
    "intonation_level": "B1",
    "accuracy_score": 92,
    "description": "Pronunciation is generally intelligible ...",
    "description_cn": "发音整体清晰 ..."
  },
  "grammar": {
    "accuracy_level": "B1",
    "repeat_accuracy_level": "B1",
    "issues": [
      { "span": "there is many data", "explanation": "Agreement error", "suggestion": "there are many data" }
    ]
  }
}
```

Interview reports additionally include:

```json
{
  "relevance": { "score": "B2" },
  "discourse": { "score": "B1" },
  "vocabulary": {
    "complexity_level": "B1",
    "diversity_level": "B2",
    "description": "Lexical complexity and diversity ...",
    "description_cn": "..."
  },
  "fluency": { "word_repetition_level": "B2" },
  "grammar": {
    "accuracy_level": "B1",
    "errors": [
      {
        "original_sentence": "there is many data",
        "corrected_sentence": "there are many data",
        "fdiff": [
          {
            "has_error": true,
            "orig": "there is",
            "corr": "there are",
            "error_type_description": "Subject–verb agreement",
            "feedback": "Use plural verb with 'data'.",
            "feedback_cn": ""
          }
        ]
      }
    ],
    "description": ""
  }
}
```

## How It Works (brief)

- Load audio (path or URL).
- Transcribe with faster-whisper (`word_timestamps=True`); see the transcription sketch after this list.
- Compute metrics:
  - WER (via jiwer) → repeat accuracy; collect `incorrect_segments` via a word-level diff (sketch below).
  - Pause metrics from gaps between word timestamps (≥ 0.30 s counts as a pause; ≥ 1.0 s as a long pause).
  - Speech rate = words per minute over the aggregate duration.
- Mispronunciations – GPT-4o audio compares prompt audio + text vs. student audio + text and returns the unique mispronounced words (sketch below).
- Grammar – GPT-4o text returns `{"issues": [...], "score": <0–100>}` (JSON output enforced; sketch below).
- Interview extras:
  - Relevance – embed each (question, answer) pair with `text-embedding-3-small`; cosine similarity → CEFR label (the conservative minimum across pairs; sketch below).
  - Discourse – GPT-4o text returns a CEFR label from the full answer transcript.
  - Vocabulary – proxies: average word length + type/token ratio (sketch below).
  - Word repetition – repeated-token rate → CEFR band.
- Combine everything into the final JSON and write it to `--out`.
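A minimal sketch of the transcription and timing metrics, using faster-whisper's `WhisperModel` API; the model size, file name, and print formatting are illustrative, while the 0.30 s / 1.0 s thresholds come from the list above:

```python
from faster_whisper import WhisperModel

model = WhisperModel("small")  # model size is illustrative; device defaults to auto-detect
segments, info = model.transcribe("data/p01_student.wav", word_timestamps=True)

# Flatten word-level timestamps across segments (segments is a generator).
words = [w for seg in segments for w in (seg.words or [])]

# Pause metrics from the gaps between consecutive words.
gaps = [nxt.start - cur.end for cur, nxt in zip(words, words[1:])]
pauses = [g for g in gaps if g >= 0.30]      # >= 0.30 s counts as a pause
long_pauses = [g for g in gaps if g >= 1.0]  # >= 1.0 s counts as a long pause

# Speech rate: words per minute over the clip's duration.
wpm = len(words) / (info.duration / 60) if info.duration else 0.0
print(f"{len(pauses)} pauses ({len(long_pauses)} long), {wpm:.0f} wpm")
```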
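For the WER and `incorrect_segments` step, a sketch with jiwer plus the standard library's difflib; the 0–100 conversion is an assumed mapping, not necessarily the tool's exact formula:

```python
import difflib

import jiwer

reference = "the temperature dropped sharply overnight"  # prompt transcript
hypothesis = "the temperature drop sharply over night"   # student transcript

error_rate = jiwer.wer(reference, hypothesis)     # word error rate
accuracy = max(0, round((1 - error_rate) * 100))  # assumed 0-100 mapping

# Collect incorrect segments from a word-level diff of the two transcripts.
ref_words, hyp_words = reference.split(), hypothesis.split()
opcodes = difflib.SequenceMatcher(a=ref_words, b=hyp_words).get_opcodes()
incorrect_segments = [
    " ".join(ref_words[i1:i2])
    for tag, i1, i2, j1, j2 in opcodes
    if tag != "equal" and i2 > i1
]
print(f"WER={error_rate:.2f} -> accuracy {accuracy}, segments: {incorrect_segments}")
```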
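The mispronunciation check needs an audio-capable model; a hedged sketch assuming the chat-completions `input_audio` content format with `gpt-4o-audio-preview` (the model choice, prompt wording, and file names are assumptions, not the tool's exact call):

```python
import base64

from openai import OpenAI

client = OpenAI()

def b64_wav(path: str) -> str:
    # Chat completions take audio as base64-encoded bytes plus a format tag.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o-audio-preview",  # assumed audio-capable GPT-4o variant
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare the student's reading to the prompt and "
                                     "list the unique mispronounced words."},
            {"type": "input_audio", "input_audio": {"data": b64_wav("prompt.wav"), "format": "wav"}},
            {"type": "input_audio", "input_audio": {"data": b64_wav("student.wav"), "format": "wav"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```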
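Enforcing JSON for the grammar step can be done with the chat-completions `response_format` option; the prompt text below is illustrative, but the issues/score shape matches the step above:

```python
import json

from openai import OpenAI

client = OpenAI()
transcript = "there is many data in the report"

resp = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},  # model must emit valid JSON
    messages=[
        {"role": "system",
         "content": 'Review the grammar and return JSON: {"issues": [...], "score": <0-100>}.'},
        {"role": "user", "content": transcript},
    ],
)
report = json.loads(resp.choices[0].message.content)
print(report["score"], report["issues"])
```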
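Relevance is plain cosine similarity over `text-embedding-3-small` vectors; the similarity → CEFR cutoffs below are invented for illustration, since the tool's bands aren't documented here:

```python
import math

from openai import OpenAI

client = OpenAI()

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

question = "What did you enjoy most about the project?"
answer = "I enjoyed collaborating with the design team the most."

emb = client.embeddings.create(model="text-embedding-3-small", input=[question, answer])
sim = cosine(emb.data[0].embedding, emb.data[1].embedding)

# Hypothetical banding; the tool keeps the minimum label across all pairs.
label = "C1" if sim >= 0.6 else "B2" if sim >= 0.45 else "B1" if sim >= 0.3 else "A2"
print(f"similarity={sim:.2f} -> {label}")
```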
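The vocabulary and word-repetition proxies are simple token statistics; a sketch under those definitions (CEFR banding omitted, since the cutoffs aren't given):

```python
from collections import Counter

tokens = "i enjoyed collaborating with the design team the most".split()

avg_word_len = sum(len(t) for t in tokens) / len(tokens)  # complexity proxy
type_token_ratio = len(set(tokens)) / len(tokens)         # diversity proxy

# Repeated-token rate: share of tokens beyond each type's first occurrence.
counts = Counter(tokens)
repeated_rate = sum(c - 1 for c in counts.values()) / len(tokens)

print(f"avg_len={avg_word_len:.2f}, ttr={type_token_ratio:.2f}, repeated={repeated_rate:.2f}")
```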