Data Transforms, Padding, and Cropping

This guide covers all data transformation options in DeepFense, including padding, cropping, resampling, and augmentations.

Overview

DeepFense applies transforms to audio data in two stages:

Base Transforms: Always applied (train, val, and test). These are preprocessing steps such as loading, resampling, and padding.
Augmentations: Applied only during training, each with a configurable probability. These are data augmentations such as noise, RIR, and codec simulation.

Base Transforms

Base transforms are preprocessing steps applied to all data. They are deterministic unless random cropping is enabled (random_pad: True or RandomCrop).

1. Audio Loading (`load_audio`)

Purpose: Load audio files from disk and perform initial preprocessing

Parameters:

target_sr (int, default: 16000): Target sample rate (audio is resampled if needed)
mono (bool, default: True): Convert to mono (averages channels if multi-channel)

Example:

data:
  train:
    base_transform:
      - type: "load_audio"
        args:
          target_sr: 16000
          mono: True

How to Check: If a file's sample rate differs from target_sr, the audio is resampled automatically. Set mono: False to keep multi-channel audio.
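
As a rough illustration of what these two parameters imply, the sketch below averages channels into mono and computes the sample count a clip would have after resampling. The helper names `to_mono` and `resampled_length` are hypothetical, not DeepFense functions, and the resampling filter that `load_audio` actually uses is not modeled here.

```python
def to_mono(channels):
    """Average equally long channels sample-by-sample into one mono channel."""
    if not channels:
        return []
    n_channels = len(channels)
    return [sum(frame) / n_channels for frame in zip(*channels)]


def resampled_length(num_samples, orig_sr, target_sr=16000):
    """Approximate number of samples after resampling orig_sr -> target_sr."""
    return round(num_samples * target_sr / orig_sr)


stereo = [[0.2, 0.4, -0.6], [0.0, 0.2, -0.2]]   # two channels, three samples
mono = to_mono(stereo)                           # one channel, three samples
ten_seconds_at_44k = resampled_length(441000, orig_sr=44100)  # 160000
```

The second helper is handy when choosing max_len: a 10-second file recorded at 44.1 kHz still ends up as 160000 samples once it has been resampled to 16 kHz.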

2. Padding/Cropping (`pad`)

Purpose: Ensure all audio has the same length for batching

Parameters:

max_len (int, required): Target length in samples
- Example: 160000 = 10 seconds at 16kHz
- Example: 64000 = 4 seconds at 16kHz
random_pad (bool, default: False):
- False: Crop from start if audio > max_len
- True: Randomly crop (random start position) if audio > max_len
pad_type (str, default: "repeat"):
- "repeat": Repeat the waveform until it fills max_len if audio < max_len
- "repeat" is currently the only supported value

Example:

data:
  train:
    base_transform:
      - type: "pad"
        args:
          max_len: 160000        # 10 seconds at 16kHz
          random_pad: True       # Random crop if longer
          pad_type: "repeat"     # Repeat if shorter
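
The behavior described by these parameters can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the DeepFense implementation: the function name `pad_or_crop` is hypothetical, and it assumes that random_pad only affects where long audio is cropped, not how short audio is repeated.

```python
import random


def pad_or_crop(wave, max_len, random_pad=False, rng=random):
    """Return exactly max_len samples from wave (a non-empty list)."""
    n = len(wave)
    if n == 0:
        raise ValueError("cannot pad an empty waveform")
    if n >= max_len:
        # Too long: crop from a random start if random_pad, else from the start.
        start = rng.randint(0, n - max_len) if random_pad else 0
        return wave[start:start + max_len]
    # Too short ("repeat" padding): tile the waveform, then trim the excess.
    repeats = -(-max_len // n)  # ceiling division
    return (wave * repeats)[:max_len]


short = pad_or_crop([1, 2, 3], max_len=8)              # [1, 2, 3, 1, 2, 3, 1, 2]
long_head = pad_or_crop(list(range(10)), max_len=4)    # [0, 1, 2, 3]
long_rand = pad_or_crop(list(range(10)), max_len=4, random_pad=True)
```

Whatever the input length, the output always has max_len samples, which is what makes batching possible.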

Common Lengths:

160000 samples = 10 seconds @ 16kHz
64000 samples = 4 seconds @ 16kHz
32000 samples = 2 seconds @ 16kHz

How to Change:

# For longer audio (e.g., 15 seconds)
max_len: 240000  # 15 * 16000

# For shorter audio (e.g., 5 seconds)
max_len: 80000   # 5 * 16000

# To always crop from start (no randomness)
random_pad: False
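
The arithmetic behind these values is simply duration times sample rate. A small helper (hypothetical, not part of DeepFense) makes it easy to derive max_len for any duration:

```python
def seconds_to_samples(seconds, sample_rate=16000):
    """Convert a duration in seconds to a whole number of samples."""
    return int(round(seconds * sample_rate))


for seconds in (2, 4, 5, 10, 15):
    print(f"{seconds:>2} s -> max_len: {seconds_to_samples(seconds)}")
```

If target_sr is changed from 16000, max_len needs to be recomputed with the new rate to keep the same duration.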

3. Random Crop (`RandomCrop`)

Purpose: Randomly crop audio to a fixed length (alternative to pad with random_pad: True)

Parameters:

output_size (int, required): Target length in samples

Example:

data:
  train:
    base_transform:
      - type: "RandomCrop"
        args:
          output_size: 160000  # 10 seconds at 16kHz

Note: RandomCrop is similar to pad with random_pad: True, but it only truncates audio longer than output_size; it does not pad shorter audio.

Augmentation Transforms

Augmentations are probabilistic transforms applied only during training to improve robustness.

1. RawBoost

Purpose: Advanced audio augmentation suite (noise, filtering, etc.)

Parameters:

noise_ratio (float, 0.0-1.0, default: 1.0): Probability of applying augmentation
algo (int, default: 5): Algorithm variant (1-5)
* (various): Additional RawBoost-specific parameters

Example:

data:
  train:
    augment_transform:
      - type: "rawboost"
        args:
          noise_ratio: 0.5      # Apply 50% of the time
          algo: 5

2. Room Impulse Response (RIR)

Purpose: Simulate room acoustics using impulse responses

Parameters:

noise_ratio (float, 0.0-1.0): Probability of applying
csv_file (str, required): Path to CSV file with RIR paths

Example:

data:
  train:
    augment_transform:
      - type: "rir"
        args:
          noise_ratio: 0.3
          csv_file: "/path/to/rir_files.csv"
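
Conceptually, applying a room impulse response means convolving the dry signal with the RIR. The sketch below shows that operation on plain Python lists. It is an illustration only: the function names are hypothetical, truncating the result to the input length is an assumption, and a practical implementation would use FFT-based convolution rather than this direct double loop.

```python
def convolve(signal, impulse_response):
    """Full discrete convolution of two sequences of floats."""
    if not signal or not impulse_response:
        return []
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def apply_rir(signal, impulse_response):
    """Convolve with the RIR and keep the original length (reverb tail dropped)."""
    return convolve(signal, impulse_response)[:len(signal)]


dry = [1.0, 0.0, 0.0, 0.0]
rir = [1.0, 0.5, 0.25]          # direct path plus two decaying reflections
wet = apply_rir(dry, rir)       # [1.0, 0.5, 0.25, 0.0]
```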

3. Codec Compression

Purpose: Simulate audio codec compression artifacts

Parameters:

noise_ratio (float, 0.0-1.0): Probability of applying
* (various): Codec-specific parameters

Example:

data:
  train:
    augment_transform:
      - type: "codec"
        args:
          noise_ratio: 0.2

4. Additive Noise

Purpose: Add Gaussian noise to audio

Parameters:

noise_ratio (float, 0.0-1.0): Probability of applying
snr_range (list, default: [5, 15]): Signal-to-noise ratio range [min, max] in dB

Example:

data:
  train:
    augment_transform:
      - type: "add_noise"  # or "AdditiveNoise"
        args:
          noise_ratio: 0.3
          snr_range: [10, 20]  # SNR between 10-20 dB
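
To make the meaning of snr_range concrete, the sketch below adds Gaussian noise at a chosen SNR using the relation SNR_dB = 10 * log10(P_signal / P_noise). It is a minimal pure-Python illustration, not DeepFense code: the function names are hypothetical, and drawing the SNR uniformly from the range is an assumption.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, rng=random):
    """Add Gaussian noise scaled so the signal-to-noise ratio equals snr_db."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    # P_noise = P_signal / 10^(SNR_dB / 10), from the definition of SNR in dB
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB of noisy relative to clean."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((y - c) ** 2 for c, y in zip(clean, noisy))
    return 10 * math.log10(signal_power / noise_power)


rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
snr_db = rng.uniform(10, 20)               # draw from snr_range: [10, 20]
noisy = add_noise_at_snr(clean, snr_db, rng)
```

Every 10 dB decrease in SNR multiplies the noise power by ten, which is why a range like [0, 10] is far more aggressive than [15, 25].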

5. Speed Perturbation

Purpose: Vary playback speed (time stretching)

Parameters:

noise_ratio (float, 0.0-1.0): Probability of applying
speed_range (list): Speed variation range [min, max]
- Example: [0.9, 1.1] = 90% to 110% speed

Example:

data:
  train:
    augment_transform:
      - type: "speed_perturb"
        args:
          noise_ratio: 0.5
          speed_range: [0.95, 1.05]
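
One common way to change playback speed is to resample the waveform, which the sketch below does with linear interpolation. This is an assumption about the technique, not a description of DeepFense's `speed_perturb`: the function name `change_speed` is hypothetical, and this approach shifts pitch along with tempo, whereas pitch-preserving time stretching needs a different algorithm.

```python
def change_speed(wave, speed):
    """Resample wave by linear interpolation so it plays `speed` times faster."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not wave:
        return []
    out_len = max(1, round(len(wave) / speed))
    out = []
    for i in range(out_len):
        pos = i * speed                      # fractional position in the input
        left = min(int(pos), len(wave) - 1)
        right = min(left + 1, len(wave) - 1)
        frac = pos - int(pos)
        out.append(wave[left] * (1 - frac) + wave[right] * frac)
    return out


one_second = [float(i) for i in range(16000)]
faster = change_speed(one_second, 1.1)   # 14545 samples
slower = change_speed(one_second, 0.9)   # 17778 samples
```

Note that in this sketch the output length scales by 1/speed, so a clip that was exactly max_len samples long before the change no longer is afterwards.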

6. Other Augmentations

morph: Audio morphing
add_babble: Add babble noise
drop_freq: Frequency dropout
drop_chunk: Time dropout
do_clip: Clipping augmentation

See Augmentations Documentation for the complete list.

Complete Transform Pipeline Example

data:
  sampling_rate: 16000  # Global sample rate
  
  train:
    dataset_type: "DetectionDataset"
    parquet_files: ["/path/to/train.parquet"]
    
    # Base transforms (always applied)
    base_transform:
      - type: "load_audio"
        args:
          target_sr: 16000
          mono: True
      
      - type: "pad"
        args:
          max_len: 160000        # 10 seconds
          random_pad: True       # Random crop if longer
          pad_type: "repeat"     # Repeat if shorter
    
    # Augmentations (probabilistic, training only)
    augment_transform:
      - type: "rawboost"
        args:
          noise_ratio: 0.5
          algo: 5
      
      - type: "rir"
        args:
          noise_ratio: 0.3
          csv_file: "/path/to/rir.csv"
      
      - type: "add_noise"
        args:
          noise_ratio: 0.2
          snr_range: [5, 15]
      
      - type: "speed_perturb"
        args:
          noise_ratio: 0.3
          speed_range: [0.9, 1.1]
  
  val:
    # Validation: only base transforms, no augmentations
    base_transform:
      - type: "load_audio"
        args:
          target_sr: 16000
          mono: True
      
      - type: "pad"
        args:
          max_len: 160000
          random_pad: False      # No random crop for validation
          pad_type: "repeat"
    
    augment_transform: []  # No augmentations

Checking Current Transform Configuration

Method 1: Inspect Config File

# View your config file
grep -A 20 "base_transform\|augment_transform" config/train.yaml

Method 2: Check Saved Config

After training, check the saved config:

grep -A 20 "base_transform\|augment_transform" outputs/your_experiment/config.yaml

Method 3: Programmatic Check

import yaml

with open("config/train.yaml", "r") as f:
    config = yaml.safe_load(f)

# Check base transforms
print("Base Transforms:")
for transform in config["data"]["train"]["base_transform"]:
    print(f"  - {transform['type']}: {transform.get('args', {})}")

# Check augmentations
print("\nAugmentations:")
for aug in config["data"]["train"].get("augment_transform", []):
    print(f"  - {aug['type']}: {aug.get('args', {})}")

Common Transform Scenarios

Scenario 1: Fixed-Length Audio (10 seconds)

base_transform:
  - type: "pad"
    args:
      max_len: 160000        # 10 seconds @ 16kHz
      random_pad: True       # Random crop if > 10s
      pad_type: "repeat"     # Repeat if < 10s

Scenario 2: Variable-Length Audio (No Padding)

For variable-length batches, you can skip the pad transform and handle length differences in a custom collate function (advanced):

base_transform:
  - type: "load_audio"
    args:
      target_sr: 16000
      mono: True
# No pad transform - handled in DataLoader
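
A collate function for this scenario usually pads each batch to its longest item and returns the true lengths alongside. The sketch below shows the idea on plain Python lists. It is framework-agnostic and hypothetical: with a PyTorch DataLoader the same logic would operate on tensors and be passed as `collate_fn`, and whether DeepFense exposes that hook in its config is not covered here. It pads with zeros rather than repeating the waveform.

```python
def pad_collate(batch, pad_value=0.0):
    """Right-pad every waveform in batch to the longest one.

    Returns (padded, lengths) so a model can mask out the padded samples.
    """
    lengths = [len(wave) for wave in batch]
    longest = max(lengths, default=0)
    padded = [list(wave) + [pad_value] * (longest - len(wave)) for wave in batch]
    return padded, lengths


batch = [[0.1, 0.2, 0.3], [0.5], [0.7, 0.8]]
padded, lengths = pad_collate(batch)
# padded  -> [[0.1, 0.2, 0.3], [0.5, 0.0, 0.0], [0.7, 0.8, 0.0]]
# lengths -> [3, 1, 2]
```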

Scenario 3: Shorter Audio (4 seconds)

base_transform:
  - type: "pad"
    args:
      max_len: 64000         # 4 seconds @ 16kHz
      random_pad: False      # Always crop from start
      pad_type: "repeat"

Scenario 4: Longer Audio (15 seconds)

base_transform:
  - type: "pad"
    args:
      max_len: 240000        # 15 seconds @ 16kHz
      random_pad: True
      pad_type: "repeat"

Scenario 5: Aggressive Augmentation

augment_transform:
  - type: "rawboost"
    args:
      noise_ratio: 0.8       # Apply 80% of the time
  - type: "rir"
    args:
      noise_ratio: 0.6
  - type: "add_noise"
    args:
      noise_ratio: 0.5
      snr_range: [0, 10]     # Lower SNR = more noise
  - type: "speed_perturb"
    args:
      noise_ratio: 0.4
      speed_range: [0.85, 1.15]  # Wider range

Scenario 6: Minimal Augmentation

augment_transform:
  - type: "add_noise"
    args:
      noise_ratio: 0.2       # Apply 20% of the time
      snr_range: [15, 25]    # Higher SNR = less noise

Transform Order

Transforms are applied in the order specified:

Base transforms are applied first (in order)
Augmentations are applied after base transforms (in order)
Each augmentation is applied independently with its noise_ratio probability

Example:

base_transform:
  - type: "load_audio"   # 1. Load audio
  - type: "pad"          # 2. Pad/crop

augment_transform:
  - type: "rawboost"     # 3. Apply RawBoost (50% chance)
  - type: "add_noise"    # 4. Apply noise (30% chance, independent)
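
The ordering rules above can be sketched as a small function. This is an illustration of the described behavior, not DeepFense's actual pipeline code; `run_pipeline` and its arguments are hypothetical names.

```python
import random


def run_pipeline(sample, base_transforms, augmentations, training, rng=random):
    """Apply base transforms in order, then each augmentation with its own probability.

    base_transforms: list of callables
    augmentations:   list of (callable, noise_ratio) pairs
    """
    for transform in base_transforms:
        sample = transform(sample)
    if training:
        for augment, noise_ratio in augmentations:
            # Independent draw per augmentation: several, one, or none may fire.
            if rng.random() < noise_ratio:
                sample = augment(sample)
    return sample


# Each stand-in transform just records its name so the order is visible.
base = [lambda x: x + ["load_audio"], lambda x: x + ["pad"]]
augs = [(lambda x: x + ["rawboost"], 0.5), (lambda x: x + ["add_noise"], 0.3)]

val_steps = run_pipeline([], base, augs, training=False)   # ["load_audio", "pad"]
train_steps = run_pipeline([], base, augs, training=True, rng=random.Random(0))
```

Because the draws are independent, with ratios of 0.5 and 0.3 a training sample receives both augmentations 15% of the time and neither 35% of the time.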

Troubleshooting

Issue: Audio length mismatch

Problem: "RuntimeError: Expected input batch_size (X) to match target batch_size (Y)"

Solution: Ensure all audio is padded to the same length:

base_transform:
  - type: "pad"
    args:
      max_len: 160000  # Must match your target length

Issue: Out of memory

Problem: GPU out of memory during training

Solutions:

Reduce max_len (shorter audio):

max_len: 80000  # 5 seconds instead of 10

Reduce batch_size
Reduce number of augmentations

Issue: Augmentations not applying

Problem: Augmentations seem to have no effect

Check:

Verify noise_ratio > 0.0
Ensure augmentations are in train section, not val
Check that transform is registered: deepfense list --component-type transforms
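
The first two checks can be automated against a parsed config dictionary (for example, the result of yaml.safe_load shown in Method 3). The helper below is a hypothetical sketch, not a DeepFense utility, and it assumes the config layout used throughout this guide.

```python
def find_inactive_augmentations(config, split="train"):
    """Return human-readable warnings about augmentations that can never fire."""
    warnings = []
    augmentations = config.get("data", {}).get(split, {}).get("augment_transform") or []
    if not augmentations:
        warnings.append(f"no augment_transform entries under data.{split}")
    for aug in augmentations:
        ratio = (aug.get("args") or {}).get("noise_ratio")
        if ratio is None:
            continue  # the transform's own default applies; nothing to flag here
        if not 0.0 <= ratio <= 1.0:
            warnings.append(f"{aug['type']}: noise_ratio {ratio} is outside [0, 1]")
        elif ratio == 0.0:
            warnings.append(f"{aug['type']}: noise_ratio is 0, so it never applies")
    return warnings


config = {
    "data": {
        "train": {
            "augment_transform": [
                {"type": "rawboost", "args": {"noise_ratio": 0.5, "algo": 5}},
                {"type": "add_noise", "args": {"noise_ratio": 0.0}},
            ]
        }
    }
}
for warning in find_inactive_augmentations(config):
    print(warning)
```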

Issue: Audio quality degradation

Problem: Overly aggressive augmentation degrades the audio and hurts training

Solution: Reduce augmentation probabilities:

augment_transform:
  - type: "rawboost"
    args:
      noise_ratio: 0.3  # Reduce from 0.5
  - type: "add_noise"
    args:
      noise_ratio: 0.1  # Reduce from 0.3
      snr_range: [15, 25]  # Increase SNR (less noise)

Summary Table

Transform	Type	Purpose	Key Parameters	When Applied
`load_audio`	Base	Load & resample	`target_sr`, `mono`	Always
`pad`	Base	Pad/crop to fixed length	`max_len`, `random_pad`, `pad_type`	Always
`RandomCrop`	Base	Random crop	`output_size`	Always
`rawboost`	Aug	Advanced augmentation	`noise_ratio`, `algo`	Training (probabilistic)
`rir`	Aug	Room simulation	`noise_ratio`, `csv_file`	Training (probabilistic)
`codec`	Aug	Codec simulation	`noise_ratio`	Training (probabilistic)
`add_noise`	Aug	Add noise	`noise_ratio`, `snr_range`	Training (probabilistic)
`speed_perturb`	Aug	Speed variation	`noise_ratio`, `speed_range`	Training (probabilistic)

Next Steps

See Configuration Reference for all parameters
See Augmentations Documentation for the complete augmentation list
See Adding Augmentations to create custom transforms
See Data Preparation in the README for the parquet format
