Add earnings config by nithinraok · Pull Request #123 · NVIDIA/NeMo-speech-data-processor

nithinraok · 2025-06-01T16:40:37Z

Add Earnings21/22 Dataset Processing Pipeline with Forced Alignment

Overview

This PR introduces a complete 7-step processing pipeline for converting Earnings21 and Earnings22 datasets to NeMo format with advanced forced alignment capabilities. The pipeline supports both full dataset processing and evaluation subsets with optional speaker segmentation.

High-Level Changelog

New Features

Core Pipeline Processors:

CreateInitialAudioAndManifest: Initial audio manifest creation with automatic audio conversion (MP3 → WAV, multi-channel → mono, any sample rate → 16kHz)
CreateFullAudioManifestEarnings21: Ground truth text reconstruction from NLP token files with punctuation/capitalization preservation
NeMoForcedAligner: Word-level forced alignment using NeMo ASR models with CTC heads
CreateSentenceSegmentedManifest: Intelligent sentence-level segmentation based on CTM files with punctuation-aware splitting
SpeakerSegmentedManifest: Speaker-change detection and segmentation with optional metadata mapping

Dataset Support:

Earnings21 support (full dataset + eval10 subset)
Earnings22 support
Dual NLP file location handling for flexible dataset structures
Speaker metadata CSV integration for name mapping

Audio Processing:

Automatic audio format conversion (MP3/WAV → 16kHz mono WAV)
Accurate duration calculation from audio files
Batch processing with configurable test mode
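
The conversion and duration steps listed above could be sketched roughly as follows. This is not the PR's implementation: it assumes the `ffmpeg` CLI is available, and the helper names `build_ffmpeg_command` and `wav_duration_seconds` are hypothetical. Only the target format (16 kHz, mono, WAV) comes from the description.

```python
import wave
from pathlib import Path
from typing import List


def build_ffmpeg_command(src: str, dst_dir: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg argv that converts `src` to a mono WAV at `sample_rate`.

    Hypothetical helper: the output name reuses the input stem with a .wav suffix.
    """
    dst = Path(dst_dir) / (Path(src).stem + ".wav")
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(src),           # input file (MP3 or WAV)
        "-ac", "1",               # downmix to a single channel
        "-ar", str(sample_rate),  # resample to the target rate
        str(dst),
    ]


def wav_duration_seconds(source) -> float:
    """Compute duration from the WAV header: frame count divided by frame rate.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()
```

The command list can be passed to `subprocess.run(cmd, check=True)`. Reading the duration from the converted file rather than from dataset metadata matches the "accurate duration calculation from audio files" item.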

Pipeline Configuration

7-Step Processing Workflow:

Initial Audio Manifest → Full audio files with duration
Text Population → Add ground truth transcripts from NLP files
Text Cleaning → Remove artifacts, brackets, special characters
Forced Alignment → Generate word-level CTM files with timestamps
Sentence Segmentation → Create sentence-level segments from CTM data
Speaker Segmentation → Create speaker-level segments (optional)
Field Filtering → Keep only required manifest fields
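
As an illustration of the text-cleaning step (step 3), here is a minimal sketch. The exact rules used by the PR are not shown in this description, so the patterns below are assumptions: bracketed annotations such as `[laughter]` or `<noise>` are dropped, and characters outside a small allow-list are replaced with spaces. The function name `clean_transcript` is hypothetical.

```python
import re

# Assumed artifact patterns: square- or angle-bracketed annotations.
_BRACKETED = re.compile(r"\[[^\]]*\]|<[^>]*>")
# Assumed allow-list: letters, digits, whitespace, and basic punctuation.
_DISALLOWED = re.compile(r"[^A-Za-z0-9\s.,?!'$%-]")
_SPACES = re.compile(r"\s+")
_SPACE_BEFORE_PUNCT = re.compile(r"\s+([.,?!])")


def clean_transcript(text: str) -> str:
    """Remove bracketed artifacts and special characters, then tidy whitespace."""
    text = _BRACKETED.sub(" ", text)
    text = _DISALLOWED.sub(" ", text)
    text = _SPACES.sub(" ", text).strip()
    # Removing an annotation can leave a space before punctuation; close it up.
    return _SPACE_BEFORE_PUNCT.sub(r"\1", text)
```

Punctuation and capitalization are left intact here, in line with the `preserve_punctuation` and `preserve_capitalization` options.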

Key Configuration Options:

dataset_type: "earnings21" | "earnings22"
subset: "full" | "eval10" (earnings21 only)
forced_alignment_model: Configurable NeMo ASR model
preserve_punctuation / preserve_capitalization: Text processing options
include_speaker_info / include_tags: Optional metadata inclusion
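
The options above could be validated before a run with something like the sketch below. It is illustrative only: the class name `EarningsPipelineOptions` and the boolean defaults are assumptions, and it reads "(earnings21 only)" as meaning that `eval10` is valid only for `earnings21`.

```python
from dataclasses import dataclass

# Assumed reading of the option list: eval10 exists only for earnings21.
ALLOWED_SUBSETS = {
    "earnings21": {"full", "eval10"},
    "earnings22": {"full"},
}


@dataclass
class EarningsPipelineOptions:
    dataset_type: str
    forced_alignment_model: str
    subset: str = "full"
    # The defaults below are placeholders, not the PR's actual defaults.
    preserve_punctuation: bool = True
    preserve_capitalization: bool = True
    include_speaker_info: bool = True
    include_tags: bool = False

    def __post_init__(self) -> None:
        if self.dataset_type not in ALLOWED_SUBSETS:
            raise ValueError(f"unknown dataset_type: {self.dataset_type!r}")
        if self.subset not in ALLOWED_SUBSETS[self.dataset_type]:
            raise ValueError(
                f"subset {self.subset!r} is not available for {self.dataset_type}"
            )
```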

Output Formats

Sentence-Level Segments (Primary Output):

{
  "audio_filepath": "/path/to/audio.wav",
  "duration": 15.2,
  "offset": 45.3,
  "text": "This is a complete sentence with proper punctuation.",
  "alignment": [
    {"word": "This", "start": 45.3, "end": 45.6},
    {"word": "is", "start": 45.6, "end": 45.8}
  ]
}
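
To show how word-level alignments might be grouped into segments of this shape, here is a simplified sketch. It assumes a sentence ends at any word whose text ends in `.`, `?`, or `!`; the actual CreateSentenceSegmentedManifest logic is not shown in this description and may differ. The function names are hypothetical.

```python
from typing import Dict, List

SENTENCE_END = (".", "?", "!")


def _to_segment(words: List[Dict], audio_filepath: str) -> Dict:
    """Build one manifest entry from a run of aligned words."""
    start = words[0]["start"]
    end = words[-1]["end"]
    return {
        "audio_filepath": audio_filepath,
        "duration": round(end - start, 3),  # rounded to avoid float noise
        "offset": start,
        "text": " ".join(w["word"] for w in words),
        "alignment": list(words),
    }


def segment_sentences(words: List[Dict], audio_filepath: str) -> List[Dict]:
    """Split aligned words into sentence segments at sentence-final punctuation."""
    segments: List[Dict] = []
    current: List[Dict] = []
    for word in words:
        current.append(word)
        if word["word"].endswith(SENTENCE_END):
            segments.append(_to_segment(current, audio_filepath))
            current = []
    if current:  # trailing words without closing punctuation
        segments.append(_to_segment(current, audio_filepath))
    return segments
```

A rule this simple would also split after abbreviations such as "Inc.", so a real implementation would likely need additional handling.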

Speaker-Level Segments (Optional):

{
  "audio_filepath": "/path/to/audio.wav", 
  "duration": 0,
  "text": "Speaker segment text...",
  "speaker": "speaker_1",
  "segment_id": 0
}
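
Speaker-change segmentation can be pictured as grouping consecutive tokens that share a speaker ID. The sketch below is an assumption about the approach, not the SpeakerSegmentedManifest code: the token fields `token` and `speaker` and the `speaker_names` mapping are hypothetical, and `duration` is omitted because this input carries no timing.

```python
from itertools import groupby
from typing import Dict, List, Optional


def segment_by_speaker(
    tokens: List[Dict],
    audio_filepath: str,
    speaker_names: Optional[Dict[str, str]] = None,
) -> List[Dict]:
    """Start a new segment each time the speaker ID changes between tokens."""
    speaker_names = speaker_names or {}
    segments = []
    runs = groupby(tokens, key=lambda t: t["speaker"])
    for segment_id, (speaker, run) in enumerate(runs):
        segments.append({
            "audio_filepath": audio_filepath,
            "text": " ".join(t["token"] for t in run),
            # Map the ID to a display name when metadata is available.
            "speaker": speaker_names.get(speaker, speaker),
            "segment_id": segment_id,
        })
    return segments
```

Because `groupby` only merges adjacent items, a speaker who returns later in the call gets a new segment rather than being merged with an earlier one.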

Usage Examples

# Process Earnings21 full dataset
python main.py --config-path=dataset_configs/english/earnings21 --config-name=config \
  dataset_type=earnings21 \
  dataset_root=/path/to/earnings21 \
  output_directory=/path/to/output

# Process Earnings22 with custom model
python main.py --config-path=dataset_configs/english/earnings21 --config-name=config \
  dataset_type=earnings22 \
  forced_alignment_model=nvidia/parakeet-tdt_ctc-1.1b \
  dataset_root=/path/to/earnings22 \
  output_directory=/path/to/output
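
After a run, the output can be inspected with a few lines of Python. NeMo manifests store one JSON object per line; the manifest file name inside `output_directory` is not stated in this description, so any path passed to the hypothetical helper below is an assumption.

```python
import json
from typing import Dict, List


def load_manifest(path: str) -> List[Dict]:
    """Read a JSON-lines manifest, skipping blank lines."""
    entries = []
    with open(path, encoding="utf-8") as manifest:
        for line in manifest:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


def total_hours(entries: List[Dict]) -> float:
    """Sum the `duration` field (seconds) and convert to hours."""
    return sum(entry["duration"] for entry in entries) / 3600.0
```

Comparing the total hours of the sentence-level manifest with the full-audio manifest is a quick sanity check on how much audio segmentation kept.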

Jorjeous · 2025-06-03T11:38:10Z

Good day, could you please sign the commits?
[The branch currently needs a rebase and the commits need signing.]

Signed-off-by: Nithin Rao Koluguri <nithinrao.koluguri@gmail.com>

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

Jorjeous · 2025-06-05T13:53:42Z

Docs fixed. The remaining error comes from a non-existent link; that's OK.

Jorjeous

LGTM, waiting for env vars to merge

Jorjeous · 2025-06-03T11:34:05Z

dataset_configs/english/earnings21/config.yaml

Env vars are missing (in the test config file).

Jorjeous

Let's cover the changes with tests.

nithinraok · 2025-06-17T17:36:35Z

Canceling this.

Opened PR #130 in the origin branch to avoid test failures from cert issues.

nithinraok force-pushed the earnings_pc branch from 86e5b24 to fa86bd5 on June 1, 2025 16:42

nithinraok force-pushed the earnings_pc branch 2 times, most recently from 4ac9463 to aa2b67c on June 3, 2025 18:39

Add earnings config

fd81561

Signed-off-by: Nithin Rao Koluguri <nithinrao.koluguri@gmail.com>

nithinraok force-pushed the earnings_pc branch from aa2b67c to fd81561 on June 3, 2025 18:43

karpnv requested a review from Jorjeous June 5, 2025 08:23

Jorjeous and others added 3 commits June 5, 2025 06:16

add docs in existing con

6bd3da4

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

fixconflict

5e5c978

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

Merge branch 'main' into earnings_pc

c818007

Jorjeous requested changes Jun 5, 2025


Jorjeous reviewed Jun 7, 2025


nithinraok closed this Jun 17, 2025

nithinraok wants to merge 4 commits into NVIDIA:main from nithinraok:earnings_pc
