End-to-end local speech-analysis pipeline for two tasks:
- Listen & Repeat – compare a student’s utterance to a reference prompt
- Interview – evaluate a student’s answer to an interviewer’s question
The tool downloads (or opens) audio, transcribes it with faster-whisper, and produces JSON reports with fluency, pronunciation, grammar, and task-specific metrics. Mispronunciations and discourse checks leverage OpenAI GPT-4o; interview relevance uses OpenAI embeddings.
- Features
- Requirements
- Installation
- Configuration
- Usage (CLI)
- Programmatic Use
- Output (JSON) – What You Get
- How It Works (brief)
## Features

- Works with local files or URLs for both prompt and student audio
- Transcription with word-level timestamps (faster-whisper)
- Metrics (per task, aggregated across pairs):
  - Duration and speech rate (wpm)
  - Repeat accuracy (WER) + `incorrect_segments`
  - Pause stats → `pause_frequency_level`, `pause_appropriateness_level`
  - Pronunciation: mispronounced words via GPT-4o audio + 0–100 accuracy score
  - Grammar: issues + score via GPT-4o text
- Interview-only extras:
  - Relevance (question ↔ answer) via text-embedding cosine similarity → CEFR label
  - Discourse (coherence/organization) via GPT-4o text → CEFR label
  - Vocabulary block (complexity + diversity proxies)
  - Word repetition level
  - Grammar formatted into `errors[{original_sentence, corrected_sentence, fdiff[]}]`
- Single class handles both tasks; choose with `--task listen_repeat` or `--task interview`
- JSON written to the path you pass in (`--out`), one file per task
## Requirements

- Python 3.9+ (3.10/3.11 recommended)
- FFmpeg installed and on PATH (required by pydub)
- A GPU is optional; CPU works. (For GPU, install CUDA-capable PyTorch; faster-whisper will detect it.)
## Installation

```bash
pip install -r requirements.txt
```

- Ensure FFmpeg is installed; macOS (Homebrew): `brew install ffmpeg`
## Configuration

- Create a `.env` file in the project root with your OpenAI API key:
  `OPENAI_API_KEY=sk-...`
- The script reads this automatically via python-dotenv. You can also pass `api_key` to the class directly when using the programmatic API.
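A minimal sketch of that key loading, assuming the standard python-dotenv pattern (the exact variable handling in the script may differ):

```python
import os

from dotenv import load_dotenv  # python-dotenv
from openai import OpenAI

load_dotenv()  # pulls OPENAI_API_KEY from .env into the process environment
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))  # OpenAI() also reads the env var by default
```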
## Usage (CLI)

Listen & Repeat:

```bash
python local_listen_repeat.py \
  --task listen_repeat \
  --out out/listen_report.json \
  --pairs "data/p01_prompt.wav:data/p01_student.wav" \
          "data/p02_prompt.wav:data/p02_student.wav"
```
Interview:

```bash
python local_listen_repeat.py \
  --task interview \
  --out out/interview_report.json \
  --pairs "data/q1.wav:data/a1.wav" \
          "data/q2.wav:data/a2.wav"
```

- `--pairs` accepts one or more items, each in the form `prompt:student`
- Each side can be a local path or an HTTPS URL
- The output directory is created if it doesn't exist
## Programmatic Use

```python
from speaking_report import LocalSpeakingAssessmentReport, ListenRepeatPair

# Listen & Repeat
lr_pairs = [
    ListenRepeatPair("data/p01_prompt.wav", "data/p01_student.wav"),
    ListenRepeatPair("data/p02_prompt.wav", "data/p02_student.wav"),
]
lr = LocalSpeakingAssessmentReport(task="listen_repeat")
lr.generate_report(lr_pairs, out_path="out/listen_report.json")

# Interview
int_pairs = [
    ListenRepeatPair("data/q1.wav", "data/a1.wav"),
    ListenRepeatPair("data/q2.wav", "data/a2.wav"),
]
interview = LocalSpeakingAssessmentReport(task="interview")
interview.generate_report(int_pairs, out_path="out/interview_report.json")
```

## Output (JSON) – What You Get

All reports share common fields; Interview adds a few more.

Common fields (Listen & Repeat example):

```json
{
  "version": "1.0",
  "generation_failed": false,
  "errors": [],
  "overall_score": { "cefr": "B1", "toefl_score": "4", "old_toefl_score": "23" },
  "speech_rate": 123,
  "duration": "02:37",
  "repeat_accuracy": { "score": 76 },
  "incorrect_segments": ["..."],
  "mispronounced_words": [{ "word": "temperature" }],
  "fluency": {
    "speech_rate_level": "B1",
    "coherence_level": "B1",
    "pause_frequency_level": "B2",
    "pause_appropriateness_level": "A2",
    "repeat_accuracy_level": "B1",
    "description": "Speech is understandable ...",
    "description_cn": "整体可理解 ..."
  },
  "pronunciation": {
    "prosody_rhythm_level": "B1",
    "vowel_fullness_level": "B1",
    "intonation_level": "B1",
    "accuracy_score": 92,
    "description": "Pronunciation is generally intelligible ...",
    "description_cn": "发音整体清晰 ..."
  },
  "grammar": {
    "accuracy_level": "B1",
    "repeat_accuracy_level": "B1",
    "issues": [
      { "span": "there is many data", "explanation": "Agreement error", "suggestion": "there are many data" }
    ]
  }
}
```

Interview reports additionally include:

```json
{
  "relevance": { "score": "B2" },
  "discourse": { "score": "B1" },
  "vocabulary": {
    "complexity_level": "B1",
    "diversity_level": "B2",
    "description": "Lexical complexity and diversity ...",
    "description_cn": "..."
  },
  "fluency": { "word_repetition_level": "B2" },
  "grammar": {
    "accuracy_level": "B1",
    "errors": [
      {
        "original_sentence": "there is many data",
        "corrected_sentence": "there are many data",
        "fdiff": [
          {
            "has_error": true,
            "orig": "there is",
            "corr": "there are",
            "error_type_description": "Subject–verb agreement",
            "feedback": "Use plural verb with 'data'.",
            "feedback_cn": ""
          }
        ]
      }
    ],
    "description": ""
  }
}
```

## How It Works (brief)

- Load audio (path or URL).
- Transcribe with faster-whisper (`word_timestamps=True`); see the transcription sketch after this list.
- Compute metrics:
  - WER (via jiwer) → repeat accuracy; collect `incorrect_segments` via a word-level diff (sketch below).
  - Pause metrics from gaps between word timestamps (≥ 0.30 s counts as a pause; ≥ 1.0 s as a long pause).
  - Speech rate = words per minute over the aggregate duration.
- Mispronunciations – GPT-4o audio compares prompt audio + text vs. student audio + text and returns the unique mispronounced words (sketch below).
- Grammar – GPT-4o text returns `{"issues": [...], "score": <0–100>}` (JSON output enforced; sketch below).
- Interview extras:
  - Relevance – embed each (question, answer) pair with `text-embedding-3-small`; cosine similarity → CEFR label (the conservative minimum across pairs; sketch below).
  - Discourse – GPT-4o text returns a CEFR label from the full answer transcript.
  - Vocabulary – proxies: average word length + type/token ratio (sketch below).
  - Word repetition – repeated-token rate → CEFR band.
- Combine everything into the final JSON and write it to `--out`.
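A minimal sketch of the transcription and timing metrics, using faster-whisper's `WhisperModel` API; the model size, file name, and print formatting are illustrative, while the 0.30 s / 1.0 s thresholds come from the list above:

```python
from faster_whisper import WhisperModel

model = WhisperModel("small")  # model size is illustrative; device defaults to auto-detect
segments, info = model.transcribe("data/p01_student.wav", word_timestamps=True)

# Flatten word-level timestamps across segments (segments is a generator).
words = [w for seg in segments for w in (seg.words or [])]

# Pause metrics from the gaps between consecutive words.
gaps = [nxt.start - cur.end for cur, nxt in zip(words, words[1:])]
pauses = [g for g in gaps if g >= 0.30]      # >= 0.30 s counts as a pause
long_pauses = [g for g in gaps if g >= 1.0]  # >= 1.0 s counts as a long pause

# Speech rate: words per minute over the clip's duration.
wpm = len(words) / (info.duration / 60) if info.duration else 0.0
print(f"{len(pauses)} pauses ({len(long_pauses)} long), {wpm:.0f} wpm")
```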
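For the WER and `incorrect_segments` step, a sketch with jiwer plus the standard library's difflib; the 0–100 conversion is an assumed mapping, not necessarily the tool's exact formula:

```python
import difflib

import jiwer

reference = "the temperature dropped sharply overnight"  # prompt transcript
hypothesis = "the temperature drop sharply over night"   # student transcript

error_rate = jiwer.wer(reference, hypothesis)     # word error rate
accuracy = max(0, round((1 - error_rate) * 100))  # assumed 0-100 mapping

# Collect incorrect segments from a word-level diff of the two transcripts.
ref_words, hyp_words = reference.split(), hypothesis.split()
opcodes = difflib.SequenceMatcher(a=ref_words, b=hyp_words).get_opcodes()
incorrect_segments = [
    " ".join(ref_words[i1:i2])
    for tag, i1, i2, j1, j2 in opcodes
    if tag != "equal" and i2 > i1
]
print(f"WER={error_rate:.2f} -> accuracy {accuracy}, segments: {incorrect_segments}")
```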
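The mispronunciation check needs an audio-capable model; a hedged sketch assuming the chat-completions `input_audio` content format with `gpt-4o-audio-preview` (the model choice, prompt wording, and file names are assumptions, not the tool's exact call):

```python
import base64

from openai import OpenAI

client = OpenAI()

def b64_wav(path: str) -> str:
    # Chat completions take audio as base64-encoded bytes plus a format tag.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o-audio-preview",  # assumed audio-capable GPT-4o variant
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare the student's reading to the prompt and "
                                     "list the unique mispronounced words."},
            {"type": "input_audio", "input_audio": {"data": b64_wav("prompt.wav"), "format": "wav"}},
            {"type": "input_audio", "input_audio": {"data": b64_wav("student.wav"), "format": "wav"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```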
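Enforcing JSON for the grammar step can be done with the chat-completions `response_format` option; the prompt text below is illustrative, but the issues/score shape matches the step above:

```python
import json

from openai import OpenAI

client = OpenAI()
transcript = "there is many data in the report"

resp = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},  # model must emit valid JSON
    messages=[
        {"role": "system",
         "content": 'Review the grammar and return JSON: {"issues": [...], "score": <0-100>}.'},
        {"role": "user", "content": transcript},
    ],
)
report = json.loads(resp.choices[0].message.content)
print(report["score"], report["issues"])
```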
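Relevance is plain cosine similarity over `text-embedding-3-small` vectors; the similarity → CEFR cutoffs below are invented for illustration, since the tool's bands aren't documented here:

```python
import math

from openai import OpenAI

client = OpenAI()

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

question = "What did you enjoy most about the project?"
answer = "I enjoyed collaborating with the design team the most."

emb = client.embeddings.create(model="text-embedding-3-small", input=[question, answer])
sim = cosine(emb.data[0].embedding, emb.data[1].embedding)

# Hypothetical banding; the tool keeps the minimum label across all pairs.
label = "C1" if sim >= 0.6 else "B2" if sim >= 0.45 else "B1" if sim >= 0.3 else "A2"
print(f"similarity={sim:.2f} -> {label}")
```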
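The vocabulary and word-repetition proxies are simple token statistics; a sketch under those definitions (CEFR banding omitted, since the cutoffs aren't given):

```python
from collections import Counter

tokens = "i enjoyed collaborating with the design team the most".split()

avg_word_len = sum(len(t) for t in tokens) / len(tokens)  # complexity proxy
type_token_ratio = len(set(tokens)) / len(tokens)         # diversity proxy

# Repeated-token rate: share of tokens beyond each type's first occurrence.
counts = Counter(tokens)
repeated_rate = sum(c - 1 for c in counts.values()) / len(tokens)

print(f"avg_len={avg_word_len:.2f}, ttr={type_token_ratio:.2f}, repeated={repeated_rate:.2f}")
```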