Add earnings config #123

Closed
nithinraok wants to merge 4 commits into NVIDIA:main from nithinraok:earnings_pc

nithinraok (Collaborator) commented Jun 1, 2025

Add Earnings21/22 Dataset Processing Pipeline with Forced Alignment

Overview

This PR introduces a complete 7-step processing pipeline for converting Earnings21 and Earnings22 datasets to NeMo format with advanced forced alignment capabilities. The pipeline supports both full dataset processing and evaluation subsets with optional speaker segmentation.

High-Level Changelog

New Features

Core Pipeline Processors:

  • CreateInitialAudioAndManifest: Initial audio manifest creation with automatic audio conversion (MP3 → WAV, multi-channel → mono, any sample rate → 16kHz)
  • CreateFullAudioManifestEarnings21: Ground truth text reconstruction from NLP token files with punctuation/capitalization preservation
  • NeMoForcedAligner: Word-level forced alignment using NeMo ASR models with CTC heads
  • CreateSentenceSegmentedManifest: Intelligent sentence-level segmentation based on CTM files with punctuation-aware splitting
  • SpeakerSegmentedManifest: Speaker-change detection and segmentation with optional metadata mapping
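As a rough illustration of the ground-truth reconstruction step, the sketch below joins tokens from a pipe-delimited token file into a transcript. It is not the PR's `CreateFullAudioManifestEarnings21` implementation: the function name `reconstruct_text` and the column names `token` and `punctuation` are assumptions made for this example, and capitalization handling is left out.

```python
import csv
import io


def reconstruct_text(nlp_file, keep_punctuation=True):
    """Join tokens from a pipe-delimited token file into one transcript string.

    Assumes a header row with at least a `token` column and, optionally,
    a `punctuation` column (both names are assumptions for this sketch).
    """
    reader = csv.DictReader(nlp_file, delimiter="|", quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        token = row["token"]
        if keep_punctuation:
            # Attach trailing punctuation directly to the token it follows.
            token += row.get("punctuation") or ""
        words.append(token)
    return " ".join(words)


SAMPLE = (
    "token|speaker|punctuation\n"
    "Good|1|\n"
    "morning|1|,\n"
    "everyone|1|.\n"
)

transcript = reconstruct_text(io.StringIO(SAMPLE))
plain = reconstruct_text(io.StringIO(SAMPLE), keep_punctuation=False)
```

Keeping punctuation attached to the preceding token matters later in the pipeline, because sentence segmentation relies on sentence-final punctuation surviving into the aligned words.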

Dataset Support:

  • Earnings21 support (full dataset + eval10 subset)
  • Earnings22 support
  • Dual NLP file location handling for flexible dataset structures
  • Speaker metadata CSV integration for name mapping

Audio Processing:

  • Automatic audio format conversion (MP3/WAV → 16kHz mono WAV)
  • Accurate duration calculation from audio files
  • Batch processing with configurable test mode
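A minimal sketch of the two audio operations listed above, assuming `ffmpeg` is available for the conversion and that durations are read from the resulting WAV header. The helper names are hypothetical and the PR's processors may implement this differently.

```python
import os
import tempfile
import wave


def ffmpeg_convert_command(src, dst, sample_rate=16000):
    """Build an ffmpeg command that writes `dst` as mono audio at `sample_rate` Hz."""
    # -y overwrites dst, -ac 1 downmixes to mono, -ar sets the output sample rate.
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(sample_rate), dst]


def wav_duration_seconds(path):
    """Compute duration from the WAV header: frame count divided by frame rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# Demo: write two seconds of 16 kHz mono silence and measure it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "silence.wav")
    with wave.open(demo_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(16000)
        wav.writeframes(b"\x00\x00" * 32000)
    demo_duration = wav_duration_seconds(demo_path)
```

The command list can be passed to `subprocess.run(cmd, check=True)`; building it separately keeps the conversion settings easy to inspect and test without invoking ffmpeg.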

Pipeline Configuration

7-Step Processing Workflow:

  1. Initial Audio Manifest → Full audio files with duration
  2. Text Population → Add ground truth transcripts from NLP files
  3. Text Cleaning → Remove artifacts, brackets, special characters
  4. Forced Alignment → Generate word-level CTM files with timestamps
  5. Sentence Segmentation → Create sentence-level segments from CTM data
  6. Speaker Segmentation → Create speaker-level segments (optional)
  7. Field Filtering → Keep only required manifest fields
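Steps 3 and 7 are simple enough to sketch in a few lines. The example below is illustrative only: the regular expressions, the allowed character set, and the default field list are assumptions, not the rules used by the processors in this PR.

```python
import re

# Bracketed or tagged annotations such as "[laugh]" or "<inaudible>" (assumed artifact shapes).
_BRACKETED = re.compile(r"\[[^\]]*\]|<[^>]*>")
# Anything outside letters, digits, whitespace, and a small punctuation set (assumed).
_DISALLOWED = re.compile(r"[^A-Za-z0-9\s.,?!'$%-]")


def clean_text(text):
    """Step 3: drop bracketed artifacts and special characters, then collapse whitespace."""
    text = _BRACKETED.sub(" ", text)
    text = _DISALLOWED.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def keep_fields(entry, fields=("audio_filepath", "duration", "offset", "text")):
    """Step 7: keep only the listed manifest fields that are present in the entry."""
    return {key: entry[key] for key in fields if key in entry}


cleaned = clean_text("Thanks [laugh] for joining <inaudible> us, everyone.")
filtered = keep_fields(
    {"audio_filepath": "a.wav", "duration": 1.0, "text": "hi", "scratch": 1}
)
```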

Key Configuration Options:

  • dataset_type: "earnings21" | "earnings22"
  • subset: "full" | "eval10" (earnings21 only)
  • forced_alignment_model: Configurable NeMo ASR model
  • preserve_punctuation / preserve_capitalization: Text processing options
  • include_speaker_info / include_tags: Optional metadata inclusion
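The constraint that `eval10` applies only to Earnings21 can be checked up front. The dataclass below is a hypothetical validation sketch: the class name and all default values are assumptions, and the actual pipeline takes these options as YAML/CLI overrides rather than through this class.

```python
from dataclasses import dataclass

VALID_DATASET_TYPES = ("earnings21", "earnings22")
VALID_SUBSETS = ("full", "eval10")


@dataclass
class EarningsConfig:
    # Default values here are assumptions made for illustration.
    dataset_type: str = "earnings21"
    subset: str = "full"
    preserve_punctuation: bool = True
    preserve_capitalization: bool = True
    include_speaker_info: bool = True
    include_tags: bool = False

    def __post_init__(self):
        if self.dataset_type not in VALID_DATASET_TYPES:
            raise ValueError(f"unknown dataset_type: {self.dataset_type!r}")
        if self.subset not in VALID_SUBSETS:
            raise ValueError(f"unknown subset: {self.subset!r}")
        # eval10 is documented as an Earnings21-only subset.
        if self.subset == "eval10" and self.dataset_type != "earnings21":
            raise ValueError("subset 'eval10' is only available for earnings21")


ok_config = EarningsConfig(dataset_type="earnings21", subset="eval10")
```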

Output Formats

Sentence-Level Segments (Primary Output):

{
  "audio_filepath": "/path/to/audio.wav",
  "duration": 15.2,
  "offset": 45.3,
  "text": "This is a complete sentence with proper punctuation.",
  "alignment": [
    {"word": "This", "start": 45.3, "end": 45.6},
    {"word": "is", "start": 45.6, "end": 45.8}
  ]
}
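To show how a segment of this shape could be derived from word-level alignments, here is a minimal sketch of punctuation-aware splitting. It assumes each aligned word is a dict with `word`, `start`, and `end` keys, as in the example above; the function names are hypothetical, and the PR's `CreateSentenceSegmentedManifest` works from CTM files and may apply additional rules.

```python
SENTENCE_END = (".", "?", "!")


def _build_segment(words, audio_filepath):
    """Turn a run of aligned words into one manifest entry."""
    start, end = words[0]["start"], words[-1]["end"]
    return {
        "audio_filepath": audio_filepath,
        "duration": round(end - start, 3),
        "offset": start,
        "text": " ".join(w["word"] for w in words),
        "alignment": list(words),
    }


def segment_sentences(words, audio_filepath):
    """Close a segment whenever a word ends with sentence-final punctuation."""
    segments, current = [], []
    for word in words:
        current.append(word)
        if word["word"].endswith(SENTENCE_END):
            segments.append(_build_segment(current, audio_filepath))
            current = []
    if current:  # trailing words without final punctuation
        segments.append(_build_segment(current, audio_filepath))
    return segments


aligned = [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "there.", "start": 0.5, "end": 1.0},
    {"word": "Thanks!", "start": 1.5, "end": 2.25},
]
sentence_segments = segment_sentences(aligned, "/path/to/audio.wav")
```

A splitter this simple will also break on abbreviations such as "Mr." or "Inc.", so a production version would need an exception list or a minimum-duration rule.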

Speaker-Level Segments (Optional):

{
  "audio_filepath": "/path/to/audio.wav", 
  "duration": 0,
  "text": "Speaker segment text...",
  "speaker": "speaker_1",
  "segment_id": 0
}
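Speaker-change segmentation amounts to grouping consecutive tokens that share a speaker ID. The sketch below assumes tokens are dicts with `token` and `speaker` keys and that the optional name mapping is a plain dict; these shapes and the function name are assumptions, not the interface of `SpeakerSegmentedManifest`. The `duration` field is omitted because the tokens in this sketch carry no timestamps.

```python
from itertools import groupby


def segment_by_speaker(tokens, audio_filepath, speaker_names=None):
    """Start a new segment each time the speaker ID changes between consecutive tokens."""
    speaker_names = speaker_names or {}
    segments = []
    runs = groupby(tokens, key=lambda t: t["speaker"])
    for segment_id, (speaker, run) in enumerate(runs):
        segments.append({
            "audio_filepath": audio_filepath,
            "text": " ".join(t["token"] for t in run),
            # Fall back to a generic label when no name mapping is available.
            "speaker": speaker_names.get(speaker, f"speaker_{speaker}"),
            "segment_id": segment_id,
        })
    return segments


tokens = [
    {"token": "Hello", "speaker": "1"},
    {"token": "all.", "speaker": "1"},
    {"token": "Thanks.", "speaker": "2"},
    {"token": "Next", "speaker": "1"},
]
speaker_segments = segment_by_speaker(tokens, "/path/to/audio.wav")
named_segments = segment_by_speaker(tokens, "/path/to/audio.wav", {"2": "Jane Doe"})
```

Because `groupby` only merges adjacent items, a speaker who returns later in the call gets a new segment rather than being merged with their earlier one.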

Usage Examples

# Process Earnings21 full dataset
python main.py --config-path=dataset_configs/english/earnings21 --config-name=config \
  dataset_type=earnings21 \
  dataset_root=/path/to/earnings21 \
  output_directory=/path/to/output

# Process Earnings22 with custom model
python main.py --config-path=dataset_configs/english/earnings21 --config-name=config \
  dataset_type=earnings22 \
  forced_alignment_model=nvidia/parakeet-tdt_ctc-1.1b \
  dataset_root=/path/to/earnings22 \
  output_directory=/path/to/output
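Once a run finishes, the output manifest can be sanity-checked with a few lines of Python. NeMo manifests are JSON Lines files (one JSON object per line); the helper names below are hypothetical, and the output file name is not specified in this PR description, so the sketch reads from any iterable of lines.

```python
import json


def read_manifest(lines):
    """Parse JSON Lines manifest content, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def total_hours(entries):
    """Sum the `duration` field (seconds) across entries and convert to hours."""
    return sum(entry["duration"] for entry in entries) / 3600.0


sample_lines = [
    '{"audio_filepath": "a.wav", "duration": 1800.0, "text": "first"}\n',
    "\n",
    '{"audio_filepath": "b.wav", "duration": 1800.0, "text": "second"}\n',
]
entries = read_manifest(sample_lines)
hours = total_hours(entries)
```

With a real manifest, pass an open file handle instead of the sample list, for example `read_manifest(open(manifest_path, encoding="utf-8"))`.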

Jorjeous (Collaborator) commented Jun 3, 2025

Good day. Could you please sign the commits?
[The branch currently needs a rebase and signed commits.]

nithinraok force-pushed the earnings_pc branch 2 times, most recently from 4ac9463 to aa2b67c, June 3, 2025 18:39
Signed-off-by: Nithin Rao Koluguri <nithinrao.koluguri@gmail.com>
Jorjeous and others added 3 commits June 5, 2025 06:16
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Jorjeous (Collaborator) commented Jun 5, 2025

Docs fixed. The remaining error comes from a non-existent link; that's OK.

Jorjeous (Collaborator) left a comment

LGTM, waiting for env vars to merge

Jorjeous (Collaborator), Jun 3, 2025

Env vars are missing (in the test config file).

Jorjeous (Collaborator) left a comment

Let's cover the changes with tests.

nithinraok (Collaborator, Author) commented

Canceling this.

Opened PR #130 in the origin branch to avoid test failures from cert issues.

nithinraok closed this Jun 17, 2025