VER-296: Fix transcription step not generating correct timestamp #56
quancao-ea merged 4 commits into main
Conversation
Refactor timestamped transcription to process audio in segments with structured JSON output format for improved accuracy and handling of long audio files.
**Walkthrough**

Refactors transcription from a single-file flow to batch multi-segment processing, adds a system instruction and JSON output schema for segment-indexed transcripts, and updates the pipeline to split audio, transcribe batches via Gemini, and aggregate per-segment transcripts.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    participant Pipeline as Processing Pipeline
    participant Storage as Temp Segment Files
    participant Gemini as Gemini API
    participant Aggregator as Result Aggregator
    Pipeline->>Storage: split_audio_into_segments(mp3_file)
    Storage-->>Pipeline: [segment_1, segment_2, ..., segment_N]
    loop for each batch
        Pipeline->>Gemini: transcribe_batch(batch_segment_paths, prompt_version)
        Gemini-->>Pipeline: [{segment_number, transcript}, ...]
    end
    Pipeline->>Aggregator: format_final_transcription(all_transcripts, segment_length)
    Aggregator-->>Pipeline: {"segments":[{segment_number,transcript},...]}
```
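The batch-and-aggregate flow in the diagram can be sketched in Python. This is illustrative only: `run_timestamped_transcription` is a hypothetical name, and the real `transcribe_batch` in the repository calls Gemini rather than being injected as a callable.

```python
def run_timestamped_transcription(segment_paths, transcribe_batch, batch_size=30):
    """Transcribe audio segments in batches and aggregate per-segment results."""
    all_transcripts = []
    for start in range(0, len(segment_paths), batch_size):
        batch = segment_paths[start:start + batch_size]
        # transcribe_batch returns [{"segment_number": n, "transcript": "..."}, ...]
        all_transcripts.extend(transcribe_batch(batch))
    return {"segments": all_transcripts}
```

Injecting the batch transcriber as a parameter here is just to keep the sketch self-contained and testable without a Gemini client.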
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/processing_pipeline/stage_1.py`:
- Around lines 644-654: The code currently only prints a warning when the returned segment count doesn't match `expected_count`. Change this to fail fast so callers can retry: compare the returned segments (from `result.get("segments", [])`) against `expected_count` (`len(batch_paths)`) and the actual count, and if they differ, raise a descriptive exception (e.g., `ValueError`) that includes `expected_count`, `actual_count`, and `batch_start`. Additionally, validate that each segment's `segment_number` falls within the expected range relative to `batch_start` before writing into `all_transcripts`, to avoid misnumbering or silent drops.
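A minimal sketch of the fail-fast validation the comment asks for. The helper name `validate_batch_result` is hypothetical; `batch_paths` and `batch_start` are the names used in the review comment, and segment numbers are assumed 1-indexed within the batch per the output schema.

```python
def validate_batch_result(result: dict, batch_paths: list, batch_start: int) -> list:
    """Raise instead of warning when the segment count or numbering is off."""
    segments = result.get("segments", [])
    expected_count = len(batch_paths)
    actual_count = len(segments)
    if actual_count != expected_count:
        raise ValueError(
            f"Segment count mismatch: expected {expected_count}, "
            f"got {actual_count} (batch_start={batch_start})"
        )
    for segment in segments:
        number = segment["segment_number"]
        # 1-indexed within the batch, matching the JSON output schema
        if not (1 <= number <= expected_count):
            raise ValueError(
                f"segment_number {number} outside 1..{expected_count} "
                f"(batch_start={batch_start})"
            )
    return segments
```

Raising here lets the task's retry logic re-request the batch instead of propagating a misnumbered transcript downstream.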
- Around lines 666-700: The `transcribe_batch` function currently returns `result.parsed` without validation, which lets downstream code hit an `AttributeError` when `result.parsed` is `None`. Modify `transcribe_batch` (the method annotated with `@optional_task`) to validate `result.parsed` and its expected fields before returning: check that `result.parsed` is not `None` and that `result.parsed.get("candidates")` (or `"segments"`) is a non-null list, using a null-safe check such as `candidates = result.parsed.get("candidates") or []`. If validation fails, raise a descriptive exception so the task's retry logic can trigger. Mirror the validation pattern used in `Stage1Executor.run`, but keep the null-safe handling of candidates to avoid the `AttributeError`.
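A sketch of the null-safe check, assuming the parsed payload is dict-like with a `segments` key; `ensure_parsed_segments` is a hypothetical helper name, not the repository's API.

```python
def ensure_parsed_segments(parsed) -> list:
    """Validate a structured-output payload before returning it.

    Raising here lets the task's retry logic trigger, instead of a None
    payload surfacing later as an AttributeError.
    """
    if parsed is None:
        raise ValueError("no parsed payload returned (response may be truncated)")
    # Null-safe: tolerate both a missing key and an explicit null value
    segments = parsed.get("segments") or []
    if not segments:
        raise ValueError("parsed payload has no non-empty 'segments' list")
    return segments
```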
🧹 Nitpick comments (2)
prompts/Gemini_timestamped_transcription_output_schema.json (1)

**1-23: Tighten schema constraints to enforce 1-indexed segments.**

The schema describes 1-indexed segments but doesn't enforce `segment_number >= 1` or prevent extra fields. Strengthening the constraints improves validation and protects timestamp accuracy.

♻️ Suggested schema tightening:

```diff
 {
   "type": "object",
+  "additionalProperties": false,
   "required": ["segments"],
   "properties": {
     "segments": {
       "type": "array",
+      "minItems": 1,
       "description": "Array of transcribed segments in order",
       "items": {
         "type": "object",
+        "additionalProperties": false,
         "required": ["segment_number", "transcript"],
         "properties": {
           "segment_number": {
             "type": "integer",
+            "minimum": 1,
             "description": "The segment number (1-indexed, matching the order provided)"
           },
           "transcript": {
             "type": "string",
             "description": "The transcript for this segment."
           }
         }
       }
     }
   }
 }
```

prompts/Gemini_timestamped_transcription_generation_prompt.md (1)
**7-12: Explicitly require "JSON only" (no code fences/extra text).**

The example uses a code fence, which can nudge the model to wrap its output in Markdown. Add a clear rule to reduce parse failures.

✍️ Prompt tweak:

```diff
 ## Output Requirements

-Return a JSON object with a `segments` array containing:
+Return a JSON object with a `segments` array containing:
+Output JSON only — no Markdown/code fences and no extra commentary.

 - **segment_number**: The segment number (1-indexed, matching the input order)
 - **transcript**: The transcribed text for that segment
```

Also applies to: 38-63
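Even with the prompt rule, defensive parsing helps. A small sketch (hypothetical helper, not part of the PR) that strips an accidental Markdown fence before parsing the model's JSON:

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Parse model output, tolerating a wrapping ```json ... ``` fence."""
    text = raw.strip()
    match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if match:
        text = match.group(1)  # keep only the fenced body
    return json.loads(text)
```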
Improve error handling for segment count mismatches and validate segment numbers are within expected range to prevent indexing errors during batch processing.
Validate that Gemini returns a parsed response and handle cases where the response is missing or truncated due to token limits. This prevents silent failures when processing transcriptions.
Reduce segment length to 20s and increase batch size to 30 for improved transcription accuracy and processing efficiency.
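With a fixed 20 s segment length, a segment's start timestamp follows directly from its 1-indexed segment number. An illustrative helper (not the repository's code) showing the arithmetic:

```python
def segment_timestamp(segment_number: int, segment_length_s: int = 20) -> str:
    """Start time of a 1-indexed, fixed-length segment, formatted MM:SS."""
    start = (segment_number - 1) * segment_length_s
    return f"{start // 60:02d}:{start % 60:02d}"
```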