IoSR-Surrey/WaveLoc

 
 

End2End sound localization model

This is a fork of bingo-todd's repository trying to replicate the results of the following paper: P. Vecchiotti, N. Ma, S. Squartini, and G. J. Brown, “End-to-end binaural sound localisation from the raw waveform,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 2019, pp. 451–455.

Only WaveLoc-GTF is implemented, which is one of the two models proposed in the paper.

bingo-todd reported slightly different results from those presented in the paper (see below). This fork also includes some very minor changes with respect to bingo-todd's version, which were necessary to make the code run correctly, as well as a fix for a bug where the training dataset leaked in some cases.

The results obtained here are slightly different from those reported by bingo-todd. This could be due to the minor code changes, but, given how small the changes are, it could also be due to differences in the architecture used for training.

Model

Requirements

You need:

  • Python 3.7
  • BasicTools (in bingo-todd's other repository) (included here as submodule)
  • A number of dependencies listed in the requirements.txt file (e.g., TensorFlow 1.14, pysofar, etc.)
  • To train and evaluate you need:
    • A copy of the TIMIT dataset (you have to secure this yourself)
    • The Surrey RealRoomBRIRs dataset (included here as submodule)

Installation

Download this repository recursively:

git clone --recursive https://github.com/enzodesena/WaveLoc.git
cd WaveLoc

Dev container quick start (Cursor / VS Code)

This repository includes a .devcontainer/ setup so you can run the project inside Docker while keeping your editor on the host.

  1. Install and start Docker Desktop.
  2. Open the WaveLoc folder in Cursor or VS Code.
  3. Run Dev Containers: Reopen in Container from the Command Palette.
  4. Wait for the initial build and dependency install to finish.

Notes:

  • The dev container uses python:3.7-bullseye.
  • On Apple Silicon, it runs as linux/amd64 for compatibility with this project.
  • After opening in the container, terminals and Python execution run inside Docker.

Manual setup

Start Docker (Apple Silicon users only)

If you haven't done so already, install Docker (you need Homebrew installed to run this):

brew install --cask docker
open -a docker

Then start a Docker container using Python 3.7 (make sure you are within the WaveLoc folder):

docker run --platform linux/amd64 -it --rm \
  -v "$PWD":/work \
  -w /work \
  python:3.7-bullseye bash

Download the data and extract TIMIT

Obtain the TIMIT dataset and unzip it into WaveLoc/data/external/darpa-timit-acousticphonetic-continuous-speech so that the files end up under, for example, WaveLoc/data/external/darpa-timit-acousticphonetic-continuous-speech/data/Test/DR1/... .
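Before generating the dataset, it can help to confirm that TIMIT was extracted to the expected place. The following is a minimal sketch, not part of the repository: the function name `timit_layout_ok` is hypothetical, and it only checks the example directory mentioned above (data/Test/DR1), not the full dataset.

```python
from pathlib import Path

# Location given in this README, relative to the WaveLoc folder.
TIMIT_ROOT = Path("data/external/darpa-timit-acousticphonetic-continuous-speech")


def timit_layout_ok(repo_root="."):
    """Return True if the example TIMIT directory from this README exists.

    Hypothetical helper: it checks only <repo_root>/<TIMIT_ROOT>/data/Test/DR1,
    which is the example path given above, not the whole dataset.
    """
    expected = Path(repo_root) / TIMIT_ROOT / "data" / "Test" / "DR1"
    return expected.is_dir()
```

Calling `timit_layout_ok()` from within the WaveLoc folder returns `False` if the archive was unpacked one level too deep or too shallow.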

Install dependencies (not needed if using dev container)

apt-get update
apt-get install -y pkg-config libhdf5-dev libnetcdf-dev gcc g++ gfortran
conda create -n waveloc python=3.7  # Not needed if using Docker
conda activate waveloc              # Not needed if using Docker
pip install -r requirements.txt

Generating dataset and training model

(cd gen_dataset && ./run.sh)  # This takes a few hours
python train_mct.py           # This also takes a few hours

Training

Dataset

  • BRIR

    Surrey binaural room impulse response (BRIR) database, which includes an anechoic room and four reverberant rooms.

    Room       A     B     C     D
    RT_60 (s)  0.32  0.47  0.68  0.89
    DRR (dB)   6.09  5.31  8.82  6.12
  • Sound source (TIMIT database), sentences per azimuth

    Train  Validate  Evaluate
    24     6         15

Multi-conditional training (MCT)

For each reverberant room used for evaluation, the model is trained on the other three reverberant rooms plus the anechoic room.
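The room assignment described above can be sketched as a leave-one-room-out split. This is only an illustration: the function name `mct_splits` and the label "Anechoic" are assumptions made here and are not taken from train_mct.py.

```python
# Room labels A-D follow the BRIR table; "Anechoic" is a label chosen here
# for illustration and may not match the names used in the repository.
REVERB_ROOMS = ["A", "B", "C", "D"]
ANECHOIC = "Anechoic"


def mct_splits(reverb_rooms=REVERB_ROOMS, anechoic=ANECHOIC):
    """Map each held-out evaluation room to the rooms used for training."""
    splits = {}
    for test_room in reverb_rooms:
        # Train on the anechoic room plus every reverberant room except
        # the one held out for evaluation.
        train_rooms = [anechoic] + [r for r in reverb_rooms if r != test_room]
        splits[test_room] = train_rooms
    return splits
```

For example, `mct_splits()["B"]` lists the anechoic room and rooms A, C and D, so room B is never seen during training.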

Training curves

Evaluation

Root mean square error (RMSE) is used as the performance metric. For each reverberant room, the evaluation was run three times, regenerating the test dataset each time, to obtain more stable results.
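As a reminder of how the metric is computed, here is a minimal sketch. It assumes azimuths are given as plain numbers in a common unit and that the repeated runs are combined by averaging their RMSE values; the source does not state how the runs are combined, and the function names are hypothetical.

```python
import math


def rmse(estimated, true):
    """Root mean square error between estimated and true azimuths."""
    if len(estimated) != len(true) or not true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(e - t) ** 2 for e, t in zip(estimated, true)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mean_rmse_over_runs(runs):
    """Average the RMSE over several runs.

    `runs` is a list of (estimated, true) pairs, one per evaluation run.
    Averaging per-run RMSE values is an assumption made for this sketch.
    """
    return sum(rmse(est, true) for est, true in runs) / len(runs)
```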

Because the binaural signal is fed directly to the model without additional preprocessing, and speech may contain short pulses, localization results are reported per chunk rather than per frame. Each chunk consists of 25 consecutive frames.
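One way to obtain a chunk-level decision is sketched below. It assumes that the model outputs one probability per azimuth class for every frame, that chunks do not overlap, and that the frame outputs within a chunk are averaged before taking the most likely class. None of these details are confirmed by the source, and `chunk_estimates` is a hypothetical name.

```python
def chunk_estimates(frame_posteriors, chunk_size=25):
    """Return one azimuth class index per chunk of consecutive frames.

    `frame_posteriors` is a list of per-frame lists of class probabilities.
    Chunks are non-overlapping and a trailing incomplete chunk is dropped;
    both choices are assumptions made for this sketch.
    """
    estimates = []
    last_start = len(frame_posteriors) - chunk_size
    for start in range(0, last_start + 1, chunk_size):
        chunk = frame_posteriors[start:start + chunk_size]
        n_classes = len(chunk[0])
        # Average each class probability over the frames of the chunk, so
        # that a few outlier frames cannot dominate the decision.
        mean = [sum(frame[k] for frame in chunk) / chunk_size
                for k in range(n_classes)]
        estimates.append(max(range(n_classes), key=mean.__getitem__))
    return estimates
```

With this averaging, a handful of frames that point to the wrong azimuth are outvoted by the remaining frames of the same chunk.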

Paper results vs original

Reverberant room            A    B    C    D
Results of this repository  1.7  2.0  1.0  2.7
bingo-todd's results        1.5  2.0  1.4  2.7
Results in the paper        1.5  3.0  1.7  3.5
