This is a personal project of mine on sound source detection (audio classification) of environmental sounds. It is a culmination of what I've learned so far in deep learning and PyTorch.
In this project, I used a causal depthwise separable CNN with subspectral normalization for environmental sound classification. I chose this architecture for the following reasons:
- Causal convolutions ensure that the model does not violate the temporal ordering of the audio inputs.
- Depthwise separable convolutions reduce the total number of parameters and, with it, model complexity.
- Subspectral normalization was used because it gave better results than batch normalization in my experiments.
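As a sketch of the idea (not the exact implementation used here): subspectral normalization splits the mel-frequency axis into sub-bands and batch-normalizes each sub-band separately, since different frequency regions of a spectrogram have different statistics. The group count of 4 below is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SubSpectralNorm(nn.Module):
    """Split the frequency axis into `groups` sub-bands and
    batch-normalize each sub-band with its own statistics."""
    def __init__(self, channels, groups=4):
        super().__init__()
        self.groups = groups
        # One BatchNorm2d over channels * groups "virtual" channels,
        # i.e. separate statistics per (channel, sub-band) pair.
        self.bn = nn.BatchNorm2d(channels * groups)

    def forward(self, x):
        # x: (batch, channels, freq, time); freq must be divisible by groups
        n, c, f, t = x.shape
        x = x.view(n, c * self.groups, f // self.groups, t)
        x = self.bn(x)
        return x.view(n, c, f, t)

# Usage: 8 channels, 16 mel bins split into 4 sub-bands of 4 bins each
ssn = SubSpectralNorm(channels=8, groups=4)
out = ssn(torch.randn(2, 8, 16, 10))
```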
The overall structure is:
- 7x7 standard conv layer (32 kernels, 2x2 stride)
- 3x3 depthwise conv layer (32 kernels, 1x1 stride)
- 1x1 pointwise conv layer (64 kernels, 1x1 stride)
- 3x3 depthwise conv layer (64 kernels, 2x2 stride)
- 1x1 pointwise conv layer (64 kernels, 1x1 stride)
- 3x3 depthwise conv layer (64 kernels, 1x1 stride)
- 1x1 pointwise conv layer (128 kernels, 1x1 stride)
- 3x3 depthwise conv layer (128 kernels, 2x2 stride)
- 1x1 pointwise conv layer (128 kernels, 1x1 stride)
- 3x3 depthwise conv layer (128 kernels, 1x1 stride)
- 1x1 pointwise conv layer (256 kernels, 1x1 stride)
- 3x3 depthwise conv layer (256 kernels, 2x2 stride)
- 1x1 pointwise conv layer (256 kernels, 1x1 stride)
- 3x3 depthwise conv layer (256 kernels, 1x1 stride)
- 1x1 pointwise conv layer (512 kernels, 1x1 stride)
- 3x3 depthwise conv layer (512 kernels, 2x2 stride)
- 1x1 pointwise conv layer (512 kernels, 1x1 stride)
- Global average pooling (reducing each feature map to 1x1)
- Linear layer (50 output features)
Other things to note about my choice of model architecture:
- Each convolutional layer is followed by a hard swish and a subspectral normalization.
- The total number of parameters is 575,986.
- I used the sequence conv-activation-norm mainly for data whitening, which cannot be achieved with the sequence conv-norm-activation. There is still a lot of debate regarding this ordering (see more here).
- Global average pooling was used so that the model is forced to predict directly from the output of the convolutional layers.
- Dropout (p=0.5) was applied right before the linear layer.
- Xavier normal initialization was used.
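One depthwise/pointwise pair from the structure above could be sketched as follows. This is a minimal illustration, not the actual implementation: BatchNorm2d stands in for subspectral normalization to keep the sketch short, and the channel counts are examples. Causality is enforced by padding the time axis on the left (past) side only, so no output frame sees future frames.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDSBlock(nn.Module):
    """Depthwise separable conv pair with causal time padding and the
    conv -> hardswish -> norm ordering described above."""
    def __init__(self, in_ch, out_ch, stride=(1, 1)):
        super().__init__()
        # 3x3 depthwise conv: symmetric padding on frequency, none on time
        # (time padding is done causally in forward())
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   stride=stride, padding=(1, 0),
                                   groups=in_ch, bias=False)
        # 1x1 pointwise conv mixes channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.act = nn.Hardswish()
        self.norm1 = nn.BatchNorm2d(in_ch)   # stand-in for subspectral norm
        self.norm2 = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        # Pad 2 frames on the left of the time axis (kernel size 3 - 1),
        # so each output frame depends only on current and past frames.
        x = F.pad(x, (2, 0, 0, 0))  # (time-left, time-right, freq-top, freq-bottom)
        x = self.norm1(self.act(self.depthwise(x)))
        x = self.norm2(self.act(self.pointwise(x)))
        return x

# Usage: 32 -> 64 channels on a (batch, ch, freq, time) input
block = CausalDSBlock(32, 64).eval()  # eval so BatchNorm uses running stats
y = block(torch.randn(2, 32, 16, 20))
```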
I will now present the methods I used for training.
- The ESC-50 dataset was used.
- Log-Mel spectrograms were generated with a frame size of 25 ms and hop size of 10 ms for each audio sample.
- I used the following augmentations:
- Pitch shifting (0, ±1, ±2, ±2.5, and ±3.5 semitones). Because pitch shifting is slow, this augmentation was done ahead of time and the results were stored in my Google Drive account. I tested multiple pitch-shifting libraries, and PyRubberband was the best.
- Time shifting in the range [0, 1] seconds.
- SNR mixing. The "noise" is a sound sample drawn randomly from a target class different from that of the main sound sample.
- Time stretching by a factor in the range [0.8, 1.2].
- Axis masking. A maximum of 16 consecutive frequency bins and a maximum of 100 time bins were masked.
- Stochastic gradient descent (SGD) with Nesterov momentum was used with the following parameters:
- Batch size: 32
- Learning rate: 0.0005 (fixed)
- Weight decay: 0.001
- Momentum: 0.9
- A maximum of 200 epochs was used. I noticed that model performance had still not reached a stable value after 200 epochs, so the true model performance could be even better with longer training.
The accuracy for each fold is as follows:
- Fold 1: 80.2%
- Fold 2: 74.5%
- Fold 3: 77.8%
- Fold 4: 82.2%
- Fold 5: 73.0%
The corresponding five-fold cross-validation accuracy is 77.5% ± 3.4%.
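The reported summary can be reproduced from the per-fold accuracies above; the ± value matches the population standard deviation across folds:

```python
import statistics

fold_acc = [80.2, 74.5, 77.8, 82.2, 73.0]
mean_acc = statistics.fmean(fold_acc)   # 77.54 -> reported as 77.5%
std_acc = statistics.pstdev(fold_acc)   # population std, about 3.43 -> 3.4%
print(f"{mean_acc:.1f}% \u00b1 {std_acc:.1f}%")
```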