Yuxia-Sun/MOST

MOST: a dataset for Malware Open-Set classification with Time-consistent splitting

Introduction

MOST is a malware dataset comprising 34,343 samples across 62 families. Each sample is assigned a reliable family label and annotated with a first-seen timestamp based on VirusTotal reports.

Time-consistent splitting

We split MOST into training and test sets using December 5, 2018 as the split point. We also vary how many of the large families (those with over 100 malware samples in the training set) are treated as seen, which yields several settings.

To ensure the training set has enough samples to avoid overfitting, we select the M largest families as seen families, where M = 10, 20, or 30, as shown in the following table:

Setting (#Seen:#Unseen)  Split         Seen families          Unseen families
                                       #Samples   #Families   #Samples   #Families
10:52                    Training set  18,970     10          -          -
                         Test set      4,654      10          2,316      52
20:42                    Training set  22,816     20          -          -
                         Test set      5,443      20          1,527      42
30:32                    Training set  25,037     30          -          -
                         Test set      5,838      30          1,132      32

For example, in the first setting, the training set contains 18,970 samples from 10 seen families, while the test set has 4,654 samples from those 10 seen families and 2,316 samples from 52 unseen families. In all settings, the training set includes only samples first seen before the split point, and the test set includes only samples first seen afterward.
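The splitting procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: the function name `time_consistent_split` and the `(sample_id, family, first_seen)` tuple layout are hypothetical, families are ranked by their pre-split sample count, and a sample first seen exactly on the split date is assumed to fall on the test side.

```python
from collections import Counter
from datetime import date

SPLIT_DATE = date(2018, 12, 5)  # split point stated in this README


def time_consistent_split(samples, num_seen):
    """Split (sample_id, family, first_seen) tuples into train / test parts.

    Hypothetical helper: the tuple layout and tie-breaking are assumptions.
    """
    samples = list(samples)
    # Assumption: a sample first seen on the split date itself goes to the test side.
    before = [s for s in samples if s[2] < SPLIT_DATE]
    after = [s for s in samples if s[2] >= SPLIT_DATE]

    # Rank families by pre-split sample count and keep the num_seen largest as seen.
    counts = Counter(family for _, family, _ in before)
    seen = {family for family, _ in counts.most_common(num_seen)}

    # Training uses only pre-split samples of seen families; the test set
    # holds post-split samples of seen and unseen families alike.
    train = [s for s in before if s[1] in seen]
    test_seen = [s for s in after if s[1] in seen]
    test_unseen = [s for s in after if s[1] not in seen]
    return train, test_seen, test_unseen
```

Under this sketch, pre-split samples of families that are not selected as seen end up in neither set, which would explain why the per-setting sample totals in the table are smaller than the full dataset.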

Usage

This repository is organized as follows:

/
│  
└─data
    ├─10-52
    │  ├─training
    │  │  ├─0
    │  │  ├─1
    │  │  ├─...
    │  │  └─9
    │  └─testing
    │     ├─0
    │     ├─1
    │     ├─...
    │     └─61
    │  
    ├─20-42
    │  ├─training
    │  │  ├─0
    │  │  ├─1
    │  │  ├─...
    │  │  └─19
    │  └─testing
    │     ├─0
    │     ├─1
    │     ├─...
    │     └─61
    │   
    └─30-32
       ├─training
       │  ├─0
       │  ├─1
       │  ├─...
       │  └─29
       └─testing
          ├─0
          ├─1
          ├─...
          └─61

The data/ directory contains datasets for three distinct experimental settings. Each setting is represented by a subdirectory named in the format XX-YY, where XX signifies the number of seen families and YY signifies the number of unseen families. For example, the 10-52 directory corresponds to a setting with 10 seen families and 52 unseen families.

Within each setting's directory (e.g., data/10-52/):

  • The training subdirectory exclusively contains samples from the seen families. These are further organized into numerically named subdirectories. For instance, in data/10-52/training/, subdirectories are named 0 through 9, each representing one of the 10 seen families.
  • The testing subdirectory contains samples from both seen and unseen families. These are also organized into numerically named subdirectories. For example, in data/10-52/testing/, subdirectories are named 0 through 61, encompassing all 10 seen families and 52 unseen families.

Finally, each numerically named subdirectory (e.g., data/10-52/training/0/ or data/10-52/testing/61/) holds the actual data samples belonging to that specific family. Each sample is a 54×54 static feature image.
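As a rough illustration of how this layout could be traversed, the sketch below lists sample files with their numeric family labels using only the Python standard library. The helper names (`parse_setting`, `family_labels`, `list_samples`) are hypothetical. The README does not state the image file format, so the sketch only collects paths and does not decode the images. It also assumes that a family keeps the same numeric directory name under training/ and testing/, which is what makes the set difference at the end meaningful.

```python
from pathlib import Path


def parse_setting(name):
    """Split a setting name such as '10-52' into (num_seen, num_unseen)."""
    seen, unseen = name.split("-")
    return int(seen), int(unseen)


def family_labels(split_dir):
    """Return the numeric family labels found directly under a split directory."""
    return {
        int(d.name)
        for d in Path(split_dir).iterdir()
        if d.is_dir() and d.name.isdigit()
    }


def list_samples(split_dir):
    """Return (file_path, label) pairs for one training/ or testing/ directory."""
    pairs = []
    for label in sorted(family_labels(split_dir)):
        family_dir = Path(split_dir) / str(label)
        # Each file is one sample; decoding is left out because the
        # image format is not specified in this README.
        pairs.extend((p, label) for p in sorted(family_dir.iterdir()) if p.is_file())
    return pairs


if __name__ == "__main__":
    root = Path("data/10-52")
    num_seen, num_unseen = parse_setting(root.name)
    train = list_samples(root / "training")
    test = list_samples(root / "testing")
    # Assumption: labels present only under testing/ are the unseen families.
    unseen = family_labels(root / "testing") - family_labels(root / "training")
    print(num_seen, num_unseen, len(train), len(test), len(unseen))
```

For open-set evaluation, one option is to map every label in `unseen` to a single "unknown" class at test time, since those families never appear during training.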
