Multimodal_Emotion_Analysis

Download Links

Please download both datasets before cloning this repository:

Processed Dataset - http://sorena.multicomp.cs.cmu.edu/downloads/MOSEI/
Raw Dataset - http://sorena.multicomp.cs.cmu.edu/downloads_raw/MOSEI

Set the environment variable: export LC_ALL=C.UTF-8

Random seed initialization (see https://discuss.pytorch.org/t/random-seed-initialization/7854):

    import numpy as np
    import torch

    torch.manual_seed(777)
    torch.cuda.manual_seed(777)
    np.random.seed(777)

Baselines and Metrics:

  The following metrics are defined:
  Metric 1 = MSE with sum across categories [0.73]
  Metric 2 = MAE with sum across categories [0.8686]
  Metric 3 = Huber Loss (Smooth L1 Loss)    [0.3263]
  Metric 4 = Binary classification accuracy at threshold=0.5 [0.xxxx]
  Metric 5 = Weighted accuracy at threshold=0.1 [0.xxxx]
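
For reference, a minimal sketch of how such metrics can be computed on (n_samples, n_emotions) NumPy arrays; the function and variable names (ref, hyp) are illustrative and not part of this repository, and the exact weighted-accuracy definition is an assumption:

    import numpy as np

    def compute_metrics(ref, hyp, bin_thresh=0.5, wacc_thresh=0.1, delta=1.0):
        err = hyp - ref
        mse = np.mean(np.sum(err ** 2, axis=1))       # Metric 1: MSE summed across categories
        mae = np.mean(np.sum(np.abs(err), axis=1))    # Metric 2: MAE summed across categories
        abs_err = np.abs(err)                         # Metric 3: Huber / Smooth L1 loss
        huber = np.mean(np.where(abs_err < delta, 0.5 * err ** 2, delta * (abs_err - 0.5 * delta)))
        # Metric 4: binary classification accuracy after thresholding intensities at 0.5
        bin_acc = np.mean((hyp >= bin_thresh) == (ref >= bin_thresh))
        # Metric 5: weighted accuracy at threshold 0.1, sketched as the average of the
        # accuracies on positive and on negative reference labels
        pos, neg = ref >= wacc_thresh, ref < wacc_thresh
        wacc = 0.5 * (np.mean(hyp[pos] >= wacc_thresh) + np.mean(hyp[neg] < wacc_thresh))
        return {'MSE': mse, 'MAE': mae, 'Huber': huber, 'BinaryAcc': bin_acc, 'WeightedAcc': wacc}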

Attention with sum_{t}{av} instead of mean(av) and m+a+v+t instead of m+(avt)

Model Modality Metric 1 Val Metric 1 Test Metric 1 Train Metric 4 Val Metric 4 Test Metric 5 Val Metric 5 Test Metric 2 Val Metric 2 Test Metric 3 Val Metric 3 Test
Random - 0.63 0.647 0.7938 0.8121 0.2822 0.2946

Attention with sum_{t}{av} instead of mean(av)

Model Modality Metric 1 Val Metric 1 Test Metric 1 Train Metric 4 Val Metric 4 Test Metric 5 Val Metric 5 Test Metric 2 Val Metric 2 Test Metric 3 Val Metric 3 Test
Random - 0.63 0.647 0.7938 0.8121 0.2822 0.2946
Triple Attention-scalar-1024 V+A+T(scalarAttTime)- __4.pth(seed777) -- testclean -- 0.4696 0.4090
Triple Attention-scalar-1024 V+A+T(scalarAttTime)- __4.pth(seed777) -- testaudionoise -- 0.5071 0.4090
Triple Attention-scalar-1024 V+A+T(scalarAttTime)- __4.pth(seed777) -- testvisionnoise -- 0.5034 0.4090
Triple Attention-scalar-1024-mixvisionnoise V+A+T(scalarAttTime)- __4.pth(seed777) -- testclean -- 0.4932 0.4131
Triple Attention-scalar-1024-mixvisionnoise V+A+T(scalarAttTime)- __4.pth(seed777) -- testaudionoise -- 0.5173 0.4131
Triple Attention-scalar-1024-mixvisionnoise V+A+T(scalarAttTime)- __4.pth(seed777) -- testvisionnoise -- 0.4934 0.4131

Old attention (downweighting the memory update by mean(0).unsqueeze(0) instead of sum(0))

Model Modality Metric 1 Val Metric 1 Test Metric 1 Train Metric 4 Val Metric 4 Test Metric 5 Val Metric 5 Test Metric 2 Val Metric 2 Test Metric 3 Val Metric 3 Test
Random - 0.63 0.647 0.7938 0.8121 0.2822 0.2946
Triple Attention V+A+T 0.4765 0.4709
Triple Attention-scalar V+A+T(scalarAttTime) __5.pth 0.5193 0.5346 0.5986
Triple Attention-scalar V+A+T(scalarAttTime) __6.pth 0.5439 0.5520 0.5742
Triple Attention-scalar-1024 pretrained V+A+T (scalarAttTime)- __1.pth 0.5159 0.5072 0.4498
Triple Attention-scalar-1024 pretrained V+A+T (scalarAttTime)- __2.pth 0.5018 0.4866 0.4103
Triple Attention-scalar-1024 pretrained V+A+T (scalarAttTime)- __3.pth 0.5176 0.5043 0.3703
Triple Attention-scalar-1024 V+A+T(scalarAttTime)- __3.pth 0.4816 0.4790 0.4605
Triple Attention-scalar-1024 V+A+T(scalarAttTime)- __4.pth 0.4884 0.4806 0.4345
Triple Attention-scalar-1024 V+A+T(scalarAttTime)- __4.pth(seed777) -- testclean -- 0.4772 0.4239 0.9072 0.9115 0.5900 0.6115
Triple Attention-scalar-1024 V+A+T(scalarAttTime)- __4.pth(seed777) -- testaudionoise -- 0.5221 0.4239
Triple Attention-scalar-1024 V+A+T(scalarAttTime)- __4.pth(seed777) -- testvisionnoise -- 0.5026 0.4239
Triple Attention-scalar-1024 V+A+T(scalarAttTime)- __5.pth 0.4789 0.4745 0.4087
Triple Attention-scalar-1024 V+A+T(scalarAttTime)- __5.pth(seed777) -- testclean -- 0.4806 0.3986
Triple Attention-scalar-1024 V+A+T(scalarAttTime)- __5.pth(seed777) -- testnoise -- 0.3986
Triple Attention-scalar-1024 V+A+T(scalarAttTime)- __6.pth 0.4870 0.4929 0.3830
Triple Attention-scalar-1024-audioablation V+A+T(scalarAttTime)- __2.pth -- testclean 0.4823 0.5005
Triple Attention-scalar-1024-audioablation V+A+T(scalarAttTime)- __3.pth -- testclean 0.4843 0.4703
Triple Attention-scalar-1024-audioablation V+A+T(scalarAttTime)- __4.pth -- testclean *0.4780 0.4439
Triple Attention-scalar-1024-audioablation V+A+T(scalarAttTime)- __5.pth -- testclean 0.4831 0.4180
Triple Attention-scalar-1024-audioablation V+A+T(scalarAttTime)- __2.pth -- testnoise 0.5135 0.5005
Triple Attention-scalar-1024-audioablation V+A+T(scalarAttTime)- __3.pth -- testnoise 0.4781 0.4703
Triple Attention-scalar-1024-audioablation V+A+T(scalarAttTime)- __4.pth -- testnoise *0.4843 0.4439
Triple Attention-scalar-1024-audioablation V+A+T(scalarAttTime)- __5.pth -- testnoise 0.4845 0.4180
Triple Attention-scalar-1024-visionablation V+A+T(scalarAttTime)- __2.pth -- testclean 0.4968 0.4950
Triple Attention-scalar-1024-visionablation V+A+T(scalarAttTime)- __3.pth -- testclean 0.5065 0.4701
Triple Attention-scalar-1024-visionablation V+A+T(scalarAttTime)- __4.pth -- testclean *0.4811 0.4473
Triple Attention-scalar-1024-visionablation V+A+T(scalarAttTime)- __5.pth -- testclean 0.5031 0.4242
Triple Attention-scalar-1024-visionablation V+A+T(scalarAttTime)- __2.pth -- testnoise 0.4987 0.4950
Triple Attention-scalar-1024-visionablation V+A+T(scalarAttTime)- __3.pth -- testnoise 0.5032 0.4701
Triple Attention-scalar-1024-visionablation V+A+T(scalarAttTime)- __4.pth -- testnoise *0.4850 0.4473
Triple Attention-scalar-1024-visionablation V+A+T(scalarAttTime)- __5.pth -- testnoise 0.5072 0.4242
Triple Attention-1024 V+A+T(attElement)- __3.pth 0.4910 0.4859 0.4604
Triple Attention-1024 V+A+T(attElement)- __4.pth 0.5003 0.4793 0.4303
Triple Attention-1024 V+A+T(attTime)- __3.pth 0.4855 0.4919 0.4671
Triple Attention-1024 V+A+T(attTime)- __4.pth 0.4888 0.4889 0.4409
Triple Attention-1024 V+A+T(attTime)- __5.pth 0.4761 0.4816 0.4144
Triple Attention-1024 V+A+T(attTime)- __6.pth 0.4970 0.4973 0.3879
Triple Attention-1024 V+A+T(attTime)- __7.pth 0.5103 0.5121 0.3608
Triple Attention-1024-gated V+A+T(attTime)- __4.pth 0.4746 0.4631 0.4512
Triple Attention-1024-gated V+A+T(attTime)- __5.pth 0.4911 0.4734 0.4225
Triple Attention-scalar-1024-gated V+A+T(scalarAttTime)- __4.pth 0.4812 0.4683 0.4470
Triple Attention-scalar-1024-gated V+A+T(scalarAttTime)- __5.pth 0.4838 0.4649 0.4262
Triple Attention-scalar-1024-gated-k3 V+A+T(scalarAttTime)- __5.pth 0.4812 0.4912 0.4171
Triple Attention-scalar-1024-gated-k3 V+A+T(scalarAttTime)- __4.pth 0.4730 0.4709 0.3993
Triple Attention-scalar-1024-gated-k3 V+A+T(scalarAttTime)- __3.pth 0.4715 0.4602 0.4670 0.9110 0.9142 0.5775 0.5924 0.9066
Triple Attention-scalar-1024-gated-k1 V+A+T(scalarAttTime)- __5.pth 0.5151 0.5034 0.4579
Triple Attention-scalar-1024-gated-k1 V+A+T(scalarAttTime)- __4.pth 0.5088 0.4957 0.4763
Triple Attention-scalar-1024-gated-k1 V+A+T(scalarAttTime)- __3.pth 0.5105 0.5042 0.4953
Triple Attention-scalar-1024-pretrained-gated-k3 V+A+T(scalarAttTime)- __4.pth 0.4789 0.4219
Triple Attention-scalar-1024-pretrained-gated-k3 V+A+T(scalarAttTime)- __5.pth 0.4992 0.4524
Early Concatenation V+A+T
Late Weighting V+A+T __0.pth 0.5175 0.5647
Late Weighting V+A+T __1.pth 0.5047 0.4302
Late Weighting V+A+T __2.pth 0.5098 0.3781
Late Weighting V+A+T __3.pth 0.5413 0.3434
Dual Attention V+T
Dual Attention V+A 0.5157 0.5103
Dual Attention A+T
Early Concatenation V+T
Early Concatenation V+A
Early Concatenation A+T
Late Weighting V+T
Late Weighting V+A
Late Weighting A+T
LSTM + Attention V
LSTM + Attention T 0.6285
LSTM + Attention A
LSTM V 0.5170 0.5106
LSTM T 0.6026 0.6056
LSTM A

Scoring

To add scoring capabilities to your Python script:

  1. Import the scoring function:

         from cmu_score import ComputePerformance

  2. Inside the epoch loop (while epoch < no_of_epochs), initialise NumPy arrays that will accumulate all references and hypotheses:

         overall_hyp = np.zeros((0, no_of_emotions))
         overall_ref = np.zeros((0, no_of_emotions))

  3. After each batch, append the model outputs and the ground-truth values:

         overall_hyp = np.concatenate((overall_hyp, outputs.data.cpu().numpy()), axis=0)
         overall_ref = np.concatenate((overall_ref, gt.data.cpu().numpy()), axis=0)

  4. At the end of the epoch, compute the scores:

         score = ComputePerformance(overall_ref, overall_hyp)
         print('Scoring -- Epoch [%d], Sample [%d], Binary accuracy %.4f' % (epoch+1, K, score['binaryaccuracy']))
         print('Scoring -- Epoch [%d], Sample [%d], MSE %.4f' % (epoch+1, K, score['MSE']))
         print('Scoring -- Epoch [%d], Sample [%d], MAE %.4f' % (epoch+1, K, score['MAE']))
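
Putting the four steps together, a minimal sketch of an evaluation loop; the model, the data loader, and the (inputs, gt) batch format are assumptions, while ComputePerformance and its result keys are used exactly as above:

    import numpy as np
    from cmu_score import ComputePerformance

    def score_model(model, loader, no_of_epochs, no_of_emotions=6):
        # loader is assumed to yield (inputs, ground-truth) batches of emotion intensities
        for epoch in range(no_of_epochs):
            overall_hyp = np.zeros((0, no_of_emotions))
            overall_ref = np.zeros((0, no_of_emotions))
            for K, (inputs, gt) in enumerate(loader):
                outputs = model(inputs)  # (batch, no_of_emotions) predicted intensities
                overall_hyp = np.concatenate((overall_hyp, outputs.data.cpu().numpy()), axis=0)
                overall_ref = np.concatenate((overall_ref, gt.data.cpu().numpy()), axis=0)
            score = ComputePerformance(overall_ref, overall_hyp)
            print('Scoring -- Epoch [%d], Sample [%d], Binary accuracy %.4f' % (epoch + 1, K, score['binaryaccuracy']))
            print('Scoring -- Epoch [%d], Sample [%d], MSE %.4f' % (epoch + 1, K, score['MSE']))
            print('Scoring -- Epoch [%d], Sample [%d], MAE %.4f' % (epoch + 1, K, score['MAE']))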

Structure of Pickle Files

1. Emotions.pkl

Let emo_intsts = array([Anger_Intensity, Disgust_Intensity, Fear_Intensity, Happy_Intensity, Sad_Intensity, Surprise_Intensity]). The file then has the structure:

{"Video Name": {"Segment ID i_1": emo_intsts, "Segment ID i_2": emo_intsts, ..., "Segment ID i_n": emo_intsts}}

There are 23453 segments in total in MOSEI (train + val + test).

To count the segments:

    k = 0
    for i in mosei_emotions.keys():
        for j in mosei_emotions[i].keys():
            k = k + 1

These emotion labels are emotion intensities; out of the 23453 segments, 6542 give indecisive classes (no single emotion has the strictly highest intensity).

To count the indecisive segments:

    k2 = 0
    for i in mosei_emotions.keys():
        for j in mosei_emotions[i].keys():
            if (max(mosei_emotions[i][j]) == min(mosei_emotions[i][j])
                    or sorted(mosei_emotions[i][j], reverse=True)[0] == sorted(mosei_emotions[i][j], reverse=True)[1]):
                k2 = k2 + 1

Train Set Emotion Intensity Stats:
    0-1 = 94964
    1-2 = 3275 
    2-3 = 515 
    Max Intensity  = 3.0
    Min Intensity  = 0.0 
    Mean Intensity = 0.17
    Mean Non-Zero Intensity = 0.74
    Mean Per Emotion Intensity = [ 0.1565  0.1233  0.0401  0.4836  0.1596  0.04842]
Validation Set Emotion Intensity Stats:
    0-1 = 11031 
    1-2 = 278 
    2-3 = 37 
    Max Intensity  = 3.0
    Min Intensity  = 0.0
    Mean Intensity = 0.15
    Mean Non-Zero Intensity = 0.68
    Mean Per Emotion Intensity = [ 0.1207   0.0888  0.0436  0.4341  0.1656  0.0497]
Test Set Emotion Intensity Stats:
    0-1 = 29574 
    1-2 = 914 
    2-3 = 130 
    Max Intensity  = 3.0
    Min Intensity  = 0.0
    Mean Intensity = 0.16
    Mean Non-Zero Intensity = 0.72
    Mean Per Emotion Intensity = [ 0.1602  0.1140   0.0409   0.4685  0.1407  0.0437]
2. Words.pkl
3. Embeddings.pkl

To load the embeddings from CMU-MOSEI, run the following:

  1. Build softlinks in mmdata/data/pickled/:

         $ cd mmdata/data/pickled/
         $ ln -s ../../../*.pkl .

  2. Run:

         $ python3 creating_text_files-SDKload.py

Two folders will be created: text_files_segbased (segment-based embeddings) and text_files_videobased (segment-based embeddings, but each embedding file has a scope covering the whole video).

4. Train/Test/Valid.pkl

Contains a set of all the train/test/validation video names
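
As an illustration, a minimal sketch of loading one of the split files and counting its videos; the exact file name Train.pkl is an assumption based on this section's title:

    import pickle

    with open('Train.pkl', 'rb') as f:
        train_videos = pickle.load(f)   # a set of video names

    print(len(train_videos), 'training videos')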

Length of Dataset - 3228 Videos divided into 22677 Video Clips of ~3-8 seconds
Length of Training Set - 2250 Videos divided into 16303 Video Clips
Length of Validation Set - 300 Videos divided into 1861 Video Clips
Length of Test Set - 678 Videos divided into 4645 Video Clips
Length of Truncated Set:
    Train - 11112
    Test - 3303
    Val - 1341

Train Set Video Length Stats:
    0-2 = 322
    2-4 = 2975
    4-6 = 3979
    6-8 = 3340
    8-10 = 2111
    10-15 = 2398
    15-20 = 780
    20+ = 398
Validation Set Video Length Stats:
    0-2 = 28
    2-4 = 260
    4-6 = 434
    6-8 = 400
    8-10 = 291
    10-15 = 324
    15-20 = 80
    20+ = 44
Test Set Video Length Stats:
    0-2 = 80
    2-4 = 845
    4-6 = 1246
    6-8 = 1019
    8-10 = 743
    10-15 = 689
    15-20 = 214
    20+ = 139
5. Facet.pkl

Let facet_features = array([feature_1_val, feature_2_val, ..., feature_35_val]). There are 35 features for each frame. The pickle has the structure:

{ "facet": {"Video Name": {"Segment ID i_1": ((start_time_frame_1, end_time_frame_1, facet_features), ..., (start_time_frame_n, end_time_frame_n, facet_features)), "Segment ID i_2": ..., ..., "Segment ID i_n": ...}}}

6. Sentiments.pkl
7. Covarep.pkl

Let covarep_features = array([feature_1_val, feature_2_val, ..., feature_74_val]). COVAREP features are extracted at 0.01 s (10 ms) intervals, the original sampling rate used by the COVAREP authors, so there are 74 features for each 0.01 s segment. The pickle has the structure:

{ "facet": {"Video Name": {"Segment ID i_1": ((start_time_frame_1, end_time_frame_1, covarep_features), ..., (start_time_frame_n, end_time_frame_n, covarep_features)), "Segment ID i_2": ..., ..., "Segment ID i_n": ...}}}

Because some files contain only 43 features, the following numbers of segments had vision features available but could not be used for audio (and were hence removed from all three folds):

  1. Test : 1322
  2. Train : 5105
  3. Val : 494

TO DOs After Cloning the Repository

  1. Put all downloaded pickle files in the same directory.
  2. Run creating_audio_files.py using Python 2.7 (this might take a while depending on your machine). This generates a folder of audio files containing the COVAREP features, named after the corresponding videos.
  3. Run transfer_valid_audio.py
  4. Run dual_attention.py
