MOST is a malware dataset comprising 34,343 samples across 62 families. Each sample is assigned a reliable family label and annotated with a first-seen timestamp based on VirusTotal reports.
We split MOST into training and test sets using December 5, 2018 as the split point. In addition, we varied the number of large families containing over 100 malware samples in the training set and obtained different settings.
To make sure that we have sufficient malware samples in the training set to avoid overfitting, we selected the top M largest families as seen families, where M = 10, 20, or 30, as shown in the following table:
| #Seen : #Unseen | Split | Seen Families: #Sample | Seen Families: #Family | Unseen Families: #Sample | Unseen Families: #Family |
|---|---|---|---|---|---|
| 10:52 | Training set | 18,970 | 10 | - | - |
| 10:52 | Test set | 4,654 | 10 | 2,316 | 52 |
| 20:42 | Training set | 22,816 | 20 | - | - |
| 20:42 | Test set | 5,443 | 20 | 1,527 | 42 |
| 30:32 | Training set | 25,037 | 30 | - | - |
| 30:32 | Test set | 5,838 | 30 | 1,132 | 32 |
For example, in the first setting, the training set contains 18,970 samples from 10 seen families, while the test set has 4,654 samples from 10 seen families and 2,316 samples from 52 unseen families. In all settings, the training set includes samples found before the split point, and the test set includes those found afterward.
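The split described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the script used to build the dataset: the record layout `(sample_id, family, first_seen)` and the function name `temporal_split` are hypothetical, family size is assumed to be counted over pre-split samples, and a sample first seen exactly on the split date is assumed to go to the test set.

```python
from collections import Counter
from datetime import date

SPLIT_POINT = date(2018, 12, 5)  # split point stated in this README


def temporal_split(samples, num_seen):
    """Split (sample_id, family, first_seen) records into train and test.

    Assumption: a sample first seen exactly on the split date goes to the
    test set; the README does not say how that boundary case is handled.
    """
    before = [s for s in samples if s[2] < SPLIT_POINT]
    after = [s for s in samples if s[2] >= SPLIT_POINT]

    # Seen families: the num_seen largest families among pre-split samples
    # (assumed ranking criterion).
    counts = Counter(family for _, family, _ in before)
    seen = {family for family, _ in counts.most_common(num_seen)}

    # Training keeps only seen families; the test set keeps every
    # post-split sample, from seen and unseen families alike.
    train = [s for s in before if s[1] in seen]
    return train, after, seen


# Toy records, made up purely for illustration.
toy = [
    ("a1", "famA", date(2018, 1, 1)),
    ("a2", "famA", date(2018, 6, 1)),
    ("b1", "famB", date(2018, 3, 1)),
    ("a3", "famA", date(2019, 1, 1)),
    ("c1", "famC", date(2019, 2, 1)),
]
train, test, seen = temporal_split(toy, num_seen=1)
```

With `num_seen=1`, only `famA` is a seen family, so `b1` is dropped from training while both post-split samples stay in the test set.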
This repository is organized as follows:

    /
    │
    └─data
        ├─10-52
        │  ├─training
        │  │  ├─0
        │  │  ├─1
        │  │  ├─...
        │  │  └─9
        │  └─testing
        │     ├─0
        │     ├─1
        │     ├─...
        │     └─61
        │
        ├─20-42
        │  ├─training
        │  │  ├─0
        │  │  ├─1
        │  │  ├─...
        │  │  └─19
        │  └─testing
        │     ├─0
        │     ├─1
        │     ├─...
        │     └─61
        │
        └─30-32
           ├─training
           │  ├─0
           │  ├─1
           │  ├─...
           │  └─29
           └─testing
              ├─0
              ├─1
              ├─...
              └─61
The `data/` directory contains datasets for three distinct experimental settings. Each setting is represented by a subdirectory named in the format `XX-YY`, where `XX` is the number of seen families and `YY` is the number of unseen families. For example, the `10-52` directory corresponds to a setting with 10 seen families and 52 unseen families.
Within each setting's directory (e.g., `data/10-52/`):
- The `training` subdirectory exclusively contains samples from the seen families. These are further organized into numerically named subdirectories. For instance, in `data/10-52/training/`, subdirectories are named `0` through `9`, each representing one of the 10 seen families.
- The `testing` subdirectory contains samples from both seen and unseen families. These are also organized into numerically named subdirectories. For example, in `data/10-52/testing/`, subdirectories are named `0` through `61`, encompassing all 10 seen families and 52 unseen families.
Finally, each numerically named subdirectory (e.g., `data/10-52/training/0/` or `data/10-52/testing/61/`) holds the samples belonging to that family. Each sample is a 54×54 static feature image.
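As a rough starting point for working with this layout, the sketch below walks one setting and pairs every sample file with its numeric family label. The helper names (`parse_setting`, `index_split`, `index_setting`) are hypothetical and not part of this repository. Because the on-disk file format of the feature images is not stated here, the sketch only collects file paths and does not decode them. It also assumes that a label number refers to the same family in `training` and `testing`, so that test labels missing from training are the unseen families.

```python
from pathlib import Path


def parse_setting(name):
    """Turn a setting name such as '10-52' into (num_seen, num_unseen)."""
    seen, unseen = name.split("-")
    return int(seen), int(unseen)


def index_split(split_dir):
    """Return sorted (file_path, family_label) pairs for one split directory."""
    pairs = []
    for family_dir in Path(split_dir).iterdir():
        # Family directories are named with plain integers (0, 1, ...).
        if family_dir.is_dir() and family_dir.name.isdigit():
            label = int(family_dir.name)
            pairs.extend((p, label) for p in family_dir.iterdir() if p.is_file())
    return sorted(pairs, key=lambda pair: (pair[1], str(pair[0])))


def index_setting(data_root, setting):
    """Index both splits of a setting and work out which test labels are unseen."""
    root = Path(data_root) / setting
    train = index_split(root / "training")
    test = index_split(root / "testing")
    seen_labels = {label for _, label in train}
    # Assumption: label numbers are consistent across the two splits, so any
    # test label absent from training belongs to an unseen family.
    unseen_labels = {label for _, label in test} - seen_labels
    return train, test, seen_labels, unseen_labels
```

For example, `index_setting("data", "10-52")` would be expected to report as many seen and unseen labels as `parse_setting("10-52")` returns, provided the label-consistency assumption holds.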