
VER-296: Fix transcription step not generating correct timestamp#56

Merged
quancao-ea merged 4 commits into main from fix/timestamp-transcription on Jan 28, 2026

Conversation

quancao-ea (Collaborator) commented Jan 27, 2026

Summary by CodeRabbit

  • New Features

    • Multi-segment batch transcription for processing large audio inputs
    • Transcription output redesigned to structured JSON with ordered segment entries
    • Inline annotations for non-speech elements (e.g., [silence], [music], [applause], [laughter]) and requirement to transcribe every segment
  • Documentation

    • New system instruction and output schema documenting segment handling, numbering, and annotation guidelines; prompt mapping updated to include them


Refactor timestamped transcription to process audio in segments with structured JSON output format for improved accuracy and handling of long audio files.
linear Bot commented Jan 27, 2026

coderabbitai Bot commented Jan 27, 2026

Walkthrough

Refactors transcription from a single-file flow to batch multi-segment processing, adds system instruction and JSON output schema for segment-indexed transcripts, and updates the pipeline to split audio, transcribe batches via Gemini, and aggregate per-segment transcripts.

Changes

  • Prompt & Schema — prompts/Gemini_timestamped_transcription_generation_prompt.md, prompts/Gemini_timestamped_transcription_output_schema.json, prompts/Gemini_timestamped_transcription_system_instruction.md
    Converted the prompt to expect multiple numbered audio segments and JSON output; removed phrase-level timestamps and 15s constraints; added inline non-speech annotations; introduced a JSON Schema requiring a segments array with segment_number and transcript.
  • Processing Pipeline — src/processing_pipeline/stage_1.py
    Replaced the single-upload flow with segmentation and batch transcription; added split_audio_into_segments(), transcribe_batch(), and format_final_transcription(); extended the GeminiTimestampTranscriptionGenerator.run() signature and updated the transcribe invocation; added exponential-backoff retry and temp-file cleanup.
  • Prompt Import Mapping — src/scripts/import_prompts_to_db.py
    Added system_instruction and output_schema entries to PROMPT_MAPPING["gemini_timestamped_transcription"].
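The retry behavior the pipeline change describes can be sketched as a small wrapper. This is illustrative only: the PR adds exponential-backoff retry inside stage_1.py, but the function name, attempt count, and delays below are assumptions, not the actual implementation.

```python
import random
import time

def with_exponential_backoff(fn, max_attempts=4, base_delay=1.0):
    """Call fn(); on ValueError, retry with exponentially growing delay plus jitter.

    Hypothetical helper mirroring the retry added around transcribe_batch().
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except ValueError:
            # Re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Jitter is included because synchronized retries against the same API endpoint tend to collide; the exact jitter shape is a design choice, not something specified by the PR.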

Sequence Diagram(s)

sequenceDiagram
    participant Pipeline as Processing Pipeline
    participant Storage as Temp Segment Files
    participant Gemini as Gemini API
    participant Aggregator as Result Aggregator

    Pipeline->>Storage: split_audio_into_segments(mp3_file)
    Storage-->>Pipeline: [segment_1, segment_2, ..., segment_N]

    loop for each batch
        Pipeline->>Gemini: transcribe_batch(batch_segment_paths, prompt_version)
        Gemini-->>Pipeline: [{segment_number, transcript}, ...]
    end

    Pipeline->>Aggregator: format_final_transcription(all_transcripts, segment_length)
    Aggregator-->>Pipeline: {"segments":[{segment_number,transcript},...]} 
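The final aggregation step in the diagram can be sketched as follows. This assumes 1-indexed segments of fixed length (the PR's commits mention 20s segments); the real format_final_transcription() in stage_1.py may format output differently.

```python
def format_final_transcription(all_transcripts, segment_length=20):
    """Assemble per-segment transcripts into one timestamped text.

    Sketch under assumptions: each segment is `segment_length` seconds long
    and segment_number is 1-indexed, so a segment's start time is
    (segment_number - 1) * segment_length.
    """
    lines = []
    for seg in sorted(all_transcripts, key=lambda s: s["segment_number"]):
        start = (seg["segment_number"] - 1) * segment_length
        minutes, seconds = divmod(start, 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['transcript']}")
    return "\n".join(lines)
```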

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • nhphong

Poem

🐰 I hopped through audio, slice by slice,

Batching whispers, laughter, and spice,
Numbered segments snug in JSON beds,
Clean transcripts nesting in tiny threads,
A rabbit's cheer for tidy threads and nice ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title directly addresses the main issue (fixing timestamp generation) and accurately reflects the core changes in the PR.
  • Linked Issues Check — ✅ Passed: the PR implements batch-based transcription processing with validation, error handling, and improved segmentation parameters to fix the timestamp generation issues.
  • Out of Scope Changes Check — ✅ Passed: all changes are directly related to fixing timestamp generation through prompt restructuring, schema definition, batch processing, and segmentation logic.



Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Pylint (4.0.4)
src/processing_pipeline/stage_1.py

************* Module .pylintrc
.pylintrc:1:0: F0011: error while parsing the configuration: File contains no section headers.
file: '.pylintrc', line: 1
'disable=C0116\n' (config-parse-error)
[
{
"type": "convention",
"module": "src.processing_pipeline.stage_1",
"obj": "",
"line": 21,
"column": 0,
"endLine": null,
"endColumn": null,
"path": "src/processing_pipeline/stage_1.py",
"symbol": "line-too-long",
"message": "Line too long (101/100)",
"message-id": "C0301"
},
{
"type": "convention",
"module": "src.processing_pipeline.stage_1",
"obj": "",
"line": 73,
"column": 0,
"endLine": null,
"endColumn": null,
"path": "src/processing_pipeline/stage_1.py",
"symbol": "line-too-long",
"message": "Line too long (149/100)",
"message-id": "C0301"
},
{
"type": "convention",
"module": "src.p

... [truncated 39534 characters] ...

src.processing_pipeline.stage_1",
"obj": "",
"line": 6,
"column": 0,
"endLine": 6,
"endColumn": 11,
"path": "src/processing_pipeline/stage_1.py",
"symbol": "wrong-import-order",
"message": "standard import "uuid" should be placed before third party import "boto3"",
"message-id": "C0411"
},
{
"type": "convention",
"module": "src.processing_pipeline.stage_1",
"obj": "",
"line": 7,
"column": 0,
"endLine": 7,
"endColumn": 14,
"path": "src/processing_pipeline/stage_1.py",
"symbol": "wrong-import-order",
"message": "standard import "pathlib" should be placed before third party import "boto3"",
"message-id": "C0411"
}
]



@coderabbitai coderabbitai Bot left a comment
Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/processing_pipeline/stage_1.py`:
- Around line 644-654: The code currently only prints a warning when the returned segment count doesn't match expected_count. Change this to fail fast by raising an exception so callers can retry: compare the segments from result.get("segments", []) against expected_count (len(batch_paths)), and if they differ, raise a descriptive exception (e.g., ValueError) that includes expected_count, actual_count, and batch_start. Additionally, validate each segment's segment_number (segment["segment_number"]) to ensure it falls within the expected range relative to batch_start before writing into all_transcripts, to avoid misnumbering or silent drops.
- Around line 666-700: transcribe_batch currently returns result.parsed without validation, which lets downstream code hit an AttributeError when result.parsed is None. Modify transcribe_batch (the method annotated with `@optional_task`) to validate result.parsed and its expected fields before returning: check that result.parsed is not None and that its "candidates" (or "segments") entry is a non-null list, using a null-safe access such as candidates = result.parsed.get("candidates") or []. If validation fails, raise a descriptive exception so the task's retry logic can trigger. Mirror the validation pattern used in Stage1Executor.run, but keep the null-safe handling of candidates to avoid the AttributeError.
🧹 Nitpick comments (2)
prompts/Gemini_timestamped_transcription_output_schema.json (1)

1-23: Tighten schema constraints to enforce 1‑indexed segments.

The schema describes 1‑indexed segments but doesn’t enforce segment_number >= 1 (or prevent extra fields). Strengthening constraints improves validation and protects timestamp accuracy.

♻️ Suggested schema tightening
 {
     "type": "object",
+    "additionalProperties": false,
     "required": ["segments"],
     "properties": {
         "segments": {
             "type": "array",
+            "minItems": 1,
             "description": "Array of transcribed segments in order",
             "items": {
                 "type": "object",
+                "additionalProperties": false,
                 "required": ["segment_number", "transcript"],
                 "properties": {
                     "segment_number": {
                         "type": "integer",
+                        "minimum": 1,
                         "description": "The segment number (1-indexed, matching the order provided)"
                     },
                     "transcript": {
                         "type": "string",
                         "description": "The transcript for this segment."
                     }
                 }
             }
         }
     }
 }
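To see what the tightened schema would reject, here is a minimal pure-Python check mirroring the added constraints (additionalProperties: false, minItems: 1, minimum: 1). It is a stand-in for a real JSON Schema validator, kept dependency-free for illustration.

```python
def matches_tightened_schema(payload):
    """Return True iff payload satisfies the tightened schema's constraints:
    a non-empty segments array, exactly the two required fields per item,
    integer segment_number >= 1, and string transcript."""
    if not isinstance(payload, dict) or set(payload) != {"segments"}:
        return False  # extra/missing top-level fields
    segments = payload["segments"]
    if not isinstance(segments, list) or len(segments) < 1:
        return False  # minItems: 1
    for item in segments:
        if not isinstance(item, dict) or set(item) != {"segment_number", "transcript"}:
            return False  # required fields only, no extras
        if not isinstance(item["segment_number"], int) or item["segment_number"] < 1:
            return False  # 1-indexed: minimum 1
        if not isinstance(item["transcript"], str):
            return False
    return True
```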
prompts/Gemini_timestamped_transcription_generation_prompt.md (1)

7-12: Explicitly require “JSON only” (no code fences/extra text).

The example uses a code fence, which can nudge the model to wrap output in Markdown. Add a clear rule to reduce parse failures.

✍️ Prompt tweak
 ## Output Requirements

-Return a JSON object with a `segments` array containing:
+Return a JSON object with a `segments` array containing:
+Output JSON only — no Markdown/code fences and no extra commentary.
 - **segment_number**: The segment number (1-indexed, matching the input order)
 - **transcript**: The transcribed text for that segment

Also applies to: 38-63
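Even with a "JSON only" rule in the prompt, a defensive parse on the consuming side reduces failures when the model wraps output in Markdown fences anyway. A possible sketch (hypothetical helper, not part of the PR):

```python
import json
import re

def parse_model_json(text):
    """Parse model output as JSON, stripping a surrounding ```json fence if present.

    Complements the prompt-side rule: instruction reduces fenced output,
    this fallback handles the cases that slip through.
    """
    cleaned = text.strip()
    # Remove a leading ``` or ```json fence and any trailing ``` fence.
    cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)
    cleaned = re.sub(r"\s*```$", "", cleaned)
    return json.loads(cleaned)
```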

Commits

  • Improve error handling for segment count mismatches and validate segment numbers are within the expected range to prevent indexing errors during batch processing.
  • Validate that Gemini returns a parsed response and handle cases where the response is missing or truncated due to token limits. This prevents silent failures when processing transcriptions.
  • Reduce segment length to 20s and increase batch size to 30 for improved transcription accuracy and processing efficiency.
@quancao-ea quancao-ea merged commit 700abfc into main Jan 28, 2026
2 checks passed
@quancao-ea quancao-ea deleted the fix/timestamp-transcription branch March 17, 2026 02:41
