MOST: a dataset for Malware Open-Set classification with Time-consistent splitting

Introduction

MOST is a malware dataset comprising 34,343 samples across 62 families. Each sample is assigned a reliable family label and annotated with a first-seen timestamp based on VirusTotal reports.

Time-consistent splitting

We split MOST into training and test sets using December 5, 2018 as the split point. In addition, we varied the number of large families (those with over 100 malware samples in the training set) to obtain three settings.

To ensure that the training set contains enough malware samples to avoid overfitting, we selected the M largest families as seen families, where M = 10, 20, or 30, as shown in the following table:

#Seen : #Unseen	Split	Seen #Samples	Seen #Families	Unseen #Samples	Unseen #Families
10:52	Training set	18,970	10	-	-
10:52	Test set	4,654	10	2,316	52
20:42	Training set	22,816	20	-	-
20:42	Test set	5,443	20	1,527	42
30:32	Training set	25,037	30	-	-
30:32	Test set	5,838	30	1,132	32

For example, in the first setting, the training set contains 18,970 samples from 10 seen families, while the test set has 4,654 samples from 10 seen families and 2,316 samples from 52 unseen families. In all settings, the training set includes samples found before the split point, and the test set includes those found afterward.
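The splitting procedure described above can be sketched roughly as follows. This is an illustrative sketch, not the authors' script: the function name `time_consistent_split`, the `(sample_id, family, first_seen)` record layout, and the toy records are hypothetical. It also assumes that a sample first seen exactly on the split date goes to the test set and that families are ranked by their number of pre-split samples; the README does not state either detail.

```python
from collections import Counter
from datetime import date

# Split point stated in the README.
SPLIT_POINT = date(2018, 12, 5)


def time_consistent_split(samples, num_seen):
    """Split (sample_id, family, first_seen) records at SPLIT_POINT.

    Assumptions (not confirmed by the dataset authors):
      * samples first seen strictly before SPLIT_POINT go to training;
      * families are ranked by their number of pre-split samples.
    """
    before = [s for s in samples if s[2] < SPLIT_POINT]
    after = [s for s in samples if s[2] >= SPLIT_POINT]

    # Seen families: the num_seen largest families among pre-split samples.
    counts = Counter(family for _, family, _ in before)
    seen = {family for family, _ in counts.most_common(num_seen)}

    # Training keeps only seen families; the test set keeps everything
    # found at or after the split point, divided into seen and unseen.
    train = [s for s in before if s[1] in seen]
    test_seen = [s for s in after if s[1] in seen]
    test_unseen = [s for s in after if s[1] not in seen]
    return train, test_seen, test_unseen


# Hypothetical toy records, for illustration only.
toy = [
    ("a1", "famA", date(2018, 1, 10)),
    ("a2", "famA", date(2018, 6, 2)),
    ("a3", "famA", date(2019, 2, 1)),
    ("b1", "famB", date(2018, 3, 3)),
    ("b2", "famB", date(2019, 1, 15)),
    ("c1", "famC", date(2019, 4, 9)),
]
train, test_seen, test_unseen = time_consistent_split(toy, num_seen=1)
```

With `num_seen=1`, only the largest pre-split family is treated as seen, so later samples from every other family land in the unseen part of the test set.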

Usage

This repository is organized as follows:

/
│  
└─data
    ├─10-52
    │  ├─training
    │  │  ├─0
    │  │  ├─1
    │  │  ├─...
    │  │  └─9
    │  └─testing
    │     ├─0
    │     ├─1
    │     ├─...
    │     └─61
    │  
    ├─20-42
    │  ├─training
    │  │  ├─0
    │  │  ├─1
    │  │  ├─...
    │  │  └─19
    │  └─testing
    │     ├─0
    │     ├─1
    │     ├─...
    │     └─61
    │   
    └─30-32
       ├─training
       │  ├─0
       │  ├─1
       │  ├─...
       │  └─29
       └─testing
          ├─0
          ├─1
          ├─...
          └─61

The data/ directory contains datasets for three distinct experimental settings. Each setting is represented by a subdirectory named in the format XX-YY, where XX signifies the number of seen families and YY signifies the number of unseen families. For example, the 10-52 directory corresponds to a setting with 10 seen families and 52 unseen families.
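As a small illustration of the XX-YY naming scheme, a setting name can be parsed into its two family counts. The helper `parse_setting` and the constant `TOTAL_FAMILIES` are hypothetical names introduced here; the check simply reflects that each setting covers all 62 families.

```python
TOTAL_FAMILIES = 62  # family count stated in the README


def parse_setting(dirname):
    """Turn a setting directory name such as '10-52' into (seen, unseen)."""
    seen_text, unseen_text = dirname.split("-")
    seen, unseen = int(seen_text), int(unseen_text)
    if seen + unseen != TOTAL_FAMILIES:
        raise ValueError(f"{dirname!r} does not cover {TOTAL_FAMILIES} families")
    return seen, unseen


settings = {name: parse_setting(name) for name in ("10-52", "20-42", "30-32")}
```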

Within each setting's directory (e.g., data/10-52/):

The training subdirectory exclusively contains samples from the seen families. These are further organized into numerically named subdirectories. For instance, in data/10-52/training/, subdirectories are named 0 through 9, each representing one of the 10 seen families.
The testing subdirectory contains samples from both seen and unseen families. These are also organized into numerically named subdirectories. For example, in data/10-52/testing/, subdirectories are named 0 through 61, encompassing all 10 seen families and 52 unseen families.

Finally, each numerically named subdirectory (e.g., data/10-52/training/0/ or data/10-52/testing/61/) holds the actual data samples belonging to that specific family. Each sample is a 54×54 static feature image.
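One possible way to index a setting from this layout is sketched below. The helpers `index_split` and `index_setting` are hypothetical and not part of the repository. The sketch only lists file paths, since the README does not state the image file format. It also assumes that a numeric label under testing/ refers to the same family as the identically named directory under training/, which the README does not confirm.

```python
from pathlib import Path


def index_split(split_dir):
    """Map each family label (the subdirectory name as an int) to its sample files."""
    index = {}
    for family_dir in Path(split_dir).iterdir():
        if family_dir.is_dir() and family_dir.name.isdigit():
            index[int(family_dir.name)] = sorted(
                p for p in family_dir.iterdir() if p.is_file()
            )
    return index


def index_setting(data_root, setting):
    """Index the training and testing splits of one setting, e.g. '10-52'."""
    root = Path(data_root) / setting
    train = index_split(root / "training")
    test = index_split(root / "testing")
    # Assumption: a test label that also names a training directory
    # is a seen family; every other test label is an unseen family.
    seen_labels = set(train)
    test_seen = {k: v for k, v in test.items() if k in seen_labels}
    test_unseen = {k: v for k, v in test.items() if k not in seen_labels}
    return train, test_seen, test_unseen
```

For example, `index_setting("data", "10-52")` would return three dictionaries keyed by family label, which can then be passed to whatever image loader matches the sample files.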


Repository: Yuxia-Sun/MOST
