Skip to content

sarulab-speech/RELATE

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

67 Commits
 
 
 
 
 
 

Repository files navigation

RELATE dataset

Contents

This dataset consists of the following contents:

  • Audio–text pairs

    This dataset includes a number of audio-text pairs shown in the table in statistics section below. The audios consist of both natural and synthetic types. Each text is written in English. All audio sample was converted to mono 16-bit 16 kHz audio.

  • 11-point subjective evaluation scores between audio and text by each listener

    This dataset includes three types of subjective evaluation scores for relevance between audio and text.

    • REL score: The overall relevance of the text and audio.
    • Inclusion of sound event (IS) score: The extent to which the sound events described in the text are included in the audio.
    • Order of sound event (OS) score: The degree of matching between the time series of sound events described in the text and the audio.

    The score is on a scale of eleven from 0 (does not match at all) to 10 (matched exactly). Each audio–text pair is evaluated by four listeners on average, who are native English speakers.

  • Listener attributes

    For listener attributes, we gathered information such as age, gender, nationality, birthplace, residence, and experience in audio evaluation. The table below shows the questions and options.

ID Question Options
Q1 Age 0-20, 21-30, 31-40, 41-50, 51-60, 61-
Q2 Gender Male, Female, Non-binary
Q3 What is your nationality? Choose from pulldown
Q4 Where are you from? Choose from pulldown
Q5 In which country do you live? Choose from pulldown
Q6 How many times have you participated in ratings of audio samples? 0, 1, 2, 3, 4, 5-
Q7 When did you last participate in other ratings of audio samples? Never, 0-1 month ago, 1-2 months ago, 2-6 months ago,
7-12 months ago, 1-5 years ago, Longer ago
Q8 On average, how many times have you heard an audio repeatedly? 1, 2, 3, 4, 5-
Q9 What type of audio device did you use? Headphone, Earphone, Others
Q10 Was the surrounding environment quiet during the ratings of audio samples? Quiet the whole time, Quiet most of the time, Sometimes quiet, Noisy most of the time, Always noisy
Q11 How difficult were the evaluations? Easy, Not so difficult, Not so easy, Difficult
Q12 Do/did you work in the field of speech or audio technology? Yes, No

Statistics

This table shows the statistics of RELATE dataset.

REL IS OS
Train Test Train Test Train Test
Evaluations 9,963 7,797 7,641 5,865 4,017 2,943
Audio-text pairs 2,862 2,598 2,649 2,334 1,281 1,185
Audio duration [s] 28,806 26,129 26,654 23,476 12,880 11,901
Listeners 1,085 873 864 635 714 525

Directory structure

The directory structure of this dataset is as follows:

RELATE
├── listener_attributes
│   ├── REL.csv
│   └── IS_and_OS.csv 
└── scores
	├── REL.csv
	├── IS.csv 
	└── OS.csv

RELATE_wave
├── audiocaps
│   ├── train
│	│	├── 10176.wav
│	│	├──   .
│   │   ├──   .
│	│   └──   .
│   └── test
├── audioldm
│	├── train
│   └── test
├── audioldm2
│	├── train
│   └── test
├── tango
│	├── train
│   └── test
└── tango2
 	├── train
    └── test

File format

  • listener_attributes/[REL, IS_and_OS].csv: There are 13 columns in the each csv file.

    • listener_id: The listener ID who scored the subjective evaluation scores.
    • question[1, 2, ..., 12]: The answer of the question in questionnaire. The answes are quantified, with options assigned a numerical value starting from 1 for the leftmost choice and increasing sequentially.
  • scores/REL.csv: There are 8 columns in the each csv file.

    • wavname: The file name of audio.
    • text: The text's corresponding audio.
    • score: The evaluation score. The score range is 0 to 10.
    • listener_id: The listener ID who scored the subjective evaluation scores.
    • audio type: Shows whether the audio is natural or synthetic. If synthetic, contains the name of synthetic model.
    • anchor label: Whether this is an anchor pair (irrelevant audio–text pair for screening) or not.
    • in RELATE dataset: Whether the pair belongs to the train set, validation set or test set in RELATE dataset. "Excluded" if that is an anchor pair or screened.
    • in AudioCaps: Whether the pair belongs to the train set or test set in AudioCaps. "None" if that is an anchor pair.
  • scores/[IS, OS].csv: There are 7 columns in the each csv file.

    • wavname: The file name of audio.
    • text: The text's corresponding audio.
    • score: The evaluation score. The score range is 0 to 10.
    • listener_id: The listener ID who scored the subjective evaluation scores.
    • audio type: Shows whether the audio is natural or synthetic. If synthetic, contains the name of synthetic model.
    • anchor label: Whether this is an anchor pair (irrelevant audio–text pair for screening) or not.
    • in AudioCaps: Whether the pair belongs to the train set or test set in AudioCaps. "None" if that is an anchor pair.

Download wave files

You can download wave files from here

Screening of data

You can conduct screening of listeners according to the method in our paper.

import pandas as pd
df = pd.read_csv("scores/REL.csv")
df = df[df["in RELATE dataset"]!="excluded"]
df.to_csv("scores/REL_screened.csv", index=False)

Citation

Y. Kanamori, Y. Okamoto, T. Takano, S. Takamichi, Y. Saito, H. Saruwatari, "RELATE: Subjective evaluation dataset for automatic evaluation of relevance between text and audio," Proc. INTERSPEECH 2025, pp. 3155-3159, 2025, doi: 10.21437/Interspeech.2025-1830.

Acknowledgment

The work was supported by JSPS KAKENHI Grant Number 23K24895, 24K23880, 25K21221, JST Moonshot Grant Number JPMJMS2237.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors