VER-296: Fix transcription step not generating correct timestamp #56
quancao-ea merged 4 commits into main
Conversation
Refactor timestamped transcription to process audio in segments with structured JSON output format for improved accuracy and handling of long audio files.
**Walkthrough**

Refactors transcription from a single-file flow to batch multi-segment processing, adds a system instruction and JSON output schema for segment-indexed transcripts, and updates the pipeline to split audio, transcribe batches via Gemini, and aggregate per-segment transcripts.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    participant Pipeline as Processing Pipeline
    participant Storage as Temp Segment Files
    participant Gemini as Gemini API
    participant Aggregator as Result Aggregator
    Pipeline->>Storage: split_audio_into_segments(mp3_file)
    Storage-->>Pipeline: [segment_1, segment_2, ..., segment_N]
    loop for each batch
        Pipeline->>Gemini: transcribe_batch(batch_segment_paths, prompt_version)
        Gemini-->>Pipeline: [{segment_number, transcript}, ...]
    end
    Pipeline->>Aggregator: format_final_transcription(all_transcripts, segment_length)
    Aggregator-->>Pipeline: {"segments":[{segment_number,transcript},...]}
```
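The batch-and-aggregate flow in the diagram can be sketched in Python. This is illustrative only: `run_timestamped_transcription` is a hypothetical name, and the real `transcribe_batch` in the repository calls Gemini rather than being injected as a callable.

```python
def run_timestamped_transcription(segment_paths, transcribe_batch, batch_size=30):
    """Transcribe audio segments in batches and aggregate per-segment results."""
    all_transcripts = []
    for start in range(0, len(segment_paths), batch_size):
        batch = segment_paths[start:start + batch_size]
        # transcribe_batch returns [{"segment_number": n, "transcript": "..."}, ...]
        all_transcripts.extend(transcribe_batch(batch))
    return {"segments": all_transcripts}
```

Injecting the batch transcriber as a parameter here is just to keep the sketch self-contained and testable without a Gemini client.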
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/processing_pipeline/stage_1.py`:
- Around lines 644-654: The code currently only prints a warning when the returned segment count doesn't match `expected_count`. Change this to fail fast so callers can retry: compare the returned segments (from `result.get("segments", [])`) against `expected_count` (`len(batch_paths)`) and the actual count, and if they differ, raise a descriptive exception (e.g., `ValueError`) that includes `expected_count`, `actual_count`, and `batch_start`. Additionally, validate that each segment's `segment_number` falls within the expected range relative to `batch_start` before writing into `all_transcripts`, to avoid misnumbering or silent drops.
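A minimal sketch of the fail-fast validation the comment asks for. The helper name `validate_batch_result` is hypothetical; `batch_paths` and `batch_start` are the names used in the review comment, and segment numbers are assumed 1-indexed within the batch per the output schema.

```python
def validate_batch_result(result: dict, batch_paths: list, batch_start: int) -> list:
    """Raise instead of warning when the segment count or numbering is off."""
    segments = result.get("segments", [])
    expected_count = len(batch_paths)
    actual_count = len(segments)
    if actual_count != expected_count:
        raise ValueError(
            f"Segment count mismatch: expected {expected_count}, "
            f"got {actual_count} (batch_start={batch_start})"
        )
    for segment in segments:
        number = segment["segment_number"]
        # 1-indexed within the batch, matching the JSON output schema
        if not (1 <= number <= expected_count):
            raise ValueError(
                f"segment_number {number} outside 1..{expected_count} "
                f"(batch_start={batch_start})"
            )
    return segments
```

Raising here lets the task's retry logic re-request the batch instead of propagating a misnumbered transcript downstream.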
- Around lines 666-700: The `transcribe_batch` function currently returns `result.parsed` without validation, which lets downstream code hit an `AttributeError` when `result.parsed` is `None`. Modify `transcribe_batch` (the method annotated with `@optional_task`) to validate `result.parsed` and its expected fields before returning: check that `result.parsed` is not `None` and that `result.parsed.get("candidates")` (or `"segments"`) is a non-null list, using a null-safe check such as `candidates = result.parsed.get("candidates") or []`. If validation fails, raise a descriptive exception so the task's retry logic can trigger. Mirror the validation pattern used in `Stage1Executor.run`, but keep the null-safe handling of candidates to avoid the `AttributeError`.
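A sketch of the null-safe check, assuming the parsed payload is dict-like with a `segments` key; `ensure_parsed_segments` is a hypothetical helper name, not the repository's API.

```python
def ensure_parsed_segments(parsed) -> list:
    """Validate a structured-output payload before returning it.

    Raising here lets the task's retry logic trigger, instead of a None
    payload surfacing later as an AttributeError.
    """
    if parsed is None:
        raise ValueError("no parsed payload returned (response may be truncated)")
    # Null-safe: tolerate both a missing key and an explicit null value
    segments = parsed.get("segments") or []
    if not segments:
        raise ValueError("parsed payload has no non-empty 'segments' list")
    return segments
```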
🧹 Nitpick comments (2)
prompts/Gemini_timestamped_transcription_output_schema.json (1)

**1-23: Tighten schema constraints to enforce 1-indexed segments.**

The schema describes 1-indexed segments but doesn't enforce `segment_number >= 1` or prevent extra fields. Strengthening the constraints improves validation and protects timestamp accuracy.

♻️ Suggested schema tightening:

```diff
 {
   "type": "object",
+  "additionalProperties": false,
   "required": ["segments"],
   "properties": {
     "segments": {
       "type": "array",
+      "minItems": 1,
       "description": "Array of transcribed segments in order",
       "items": {
         "type": "object",
+        "additionalProperties": false,
         "required": ["segment_number", "transcript"],
         "properties": {
           "segment_number": {
             "type": "integer",
+            "minimum": 1,
             "description": "The segment number (1-indexed, matching the order provided)"
           },
           "transcript": {
             "type": "string",
             "description": "The transcript for this segment."
           }
         }
       }
     }
   }
 }
```

prompts/Gemini_timestamped_transcription_generation_prompt.md (1)
**7-12: Explicitly require "JSON only" (no code fences/extra text).**

The example uses a code fence, which can nudge the model to wrap its output in Markdown. Add a clear rule to reduce parse failures.

✍️ Prompt tweak:

```diff
 ## Output Requirements

-Return a JSON object with a `segments` array containing:
+Return a JSON object with a `segments` array containing:
+Output JSON only — no Markdown/code fences and no extra commentary.

 - **segment_number**: The segment number (1-indexed, matching the input order)
 - **transcript**: The transcribed text for that segment
```

Also applies to: 38-63
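Even with the prompt rule, defensive parsing helps. A small sketch (hypothetical helper, not part of the PR) that strips an accidental Markdown fence before parsing the model's JSON:

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Parse model output, tolerating a wrapping ```json ... ``` fence."""
    text = raw.strip()
    match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if match:
        text = match.group(1)  # keep only the fenced body
    return json.loads(text)
```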
Improve error handling for segment count mismatches and validate segment numbers are within expected range to prevent indexing errors during batch processing.
Validate that Gemini returns a parsed response and handle cases where the response is missing or truncated due to token limits. This prevents silent failures when processing transcriptions.
Reduce segment length to 20s and increase batch size to 30 for improved transcription accuracy and processing efficiency.
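With a fixed 20 s segment length, a segment's start timestamp follows directly from its 1-indexed segment number. An illustrative helper (not the repository's code) showing the arithmetic:

```python
def segment_timestamp(segment_number: int, segment_length_s: int = 20) -> str:
    """Start time of a 1-indexed, fixed-length segment, formatted MM:SS."""
    start = (segment_number - 1) * segment_length_s
    return f"{start // 60:02d}:{start % 60:02d}"
```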