The data_features_combined folder has a small dataset with extracted features. To recreate the full dataset, see the Generate Dataset section.
The data_test folder has executables for testing the model. ATTENTION: this folder contains real malware executables which can be harmful.
- Quick start
- Build the sample solution
- Train and test model
- Requirements
- Generate Dataset
- PE Files Datasets
Instead of building the solution from source, download the competition docker image from here.
An additional docker image with a better overall model is provided here.
Before you proceed, you must install Docker Engine for your operating system.
Load the docker image:
docker load -i ml.rar
Run the docker container:
docker run -itp 8080:8080 --memory=1g ml
Test the solution on malicious and benign samples of your choosing via:
python -m test -m data/DikeDataset-main/files/malware -b data/DikeDataset-main/files/benign
Before you proceed, you must install Docker Engine for your operating system.
A sample solution that you may modify is included in the defender folder.
Install Python requirements needed to test the solution:
pip install -r requirements.txt
OPTIONAL: To obfuscate the code, first copy the defender folder somewhere else, since the obfuscation is applied in place, then run:
pyminify defender/ --in-place --remove-literal-statements
To compile the Python code so that it runs faster and is slightly obfuscated, run:
python out.py
Some trained models can be found in defender/saved_models.
Copy the *.pkl file you want to use as the model into docker/models/. The model to use is selected later, when running docker run.
From the root folder that contains the Dockerfile, build the solution:
docker build -t ml .
Run the docker container:
docker run -itp 8080:8080 --memory=1g ml
The flag -p 8080:8080 maps the container's port 8080 to the host's port 8080.
The flag --memory=1g limits the container to 1 GB of RAM.
The flag --env MODEL_FILE="models/ml_classifier.pkl" can be added to specify which model to run.
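As an illustration of how a value passed with --env MODEL_FILE could be consumed inside the container, here is a minimal sketch. It is not the repository's actual loading code: the function name load_model and the choice of fallback path are assumptions, and the only detail taken from this README is that models are stored as *.pkl files.

```python
import os
import pickle

# Assumed fallback: the example value shown for MODEL_FILE in this README.
DEFAULT_MODEL_FILE = "models/ml_classifier.pkl"


def load_model(default_path=DEFAULT_MODEL_FILE):
    """Load the pickled model named by MODEL_FILE, or a default path if unset.

    Hypothetical helper: the real server may resolve the model differently.
    """
    model_path = os.environ.get("MODEL_FILE", default_path)
    with open(model_path, "rb") as handle:
        # Only unpickle files you trust: pickle can execute arbitrary code.
        return pickle.load(handle)
```

Reading the path from the environment keeps the image unchanged between runs, so switching models only requires a different --env value.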
Test the solution on malicious and benign samples of your choosing via:
python -m test -m data/DikeDataset-main/files/malware -b data/DikeDataset-main/files/benign
You can also use the system folder C:\Windows\System32\ as benign samples.
Sample collections may be in a folder, or in an archive of type zip, tar, tar.bz2, tar.gz or tgz.
You do not need to unzip the archive, and it is strongly recommended that you do not unzip it when testing malicious samples.
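To show how samples can be read from a folder or archive without ever writing malicious files to disk, here is a minimal sketch using only the standard library. The function name iter_samples is hypothetical and this is not the test module's actual implementation. It assumes unencrypted archives (a password-protected zip would need the pwd argument of ZipFile.read).

```python
import tarfile
import zipfile
from pathlib import Path


def iter_samples(collection):
    """Yield (name, bytes) for each file in a folder, zip, or tar archive."""
    path = Path(collection)
    if path.is_dir():
        for item in sorted(path.rglob("*")):
            if item.is_file():
                yield str(item.relative_to(path)), item.read_bytes()
    elif zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            for info in archive.infolist():
                if not info.is_dir():
                    # Contents are read into memory, never extracted to disk.
                    yield info.filename, archive.read(info)
    elif tarfile.is_tarfile(path):
        # Mode "r:*" auto-detects plain, gzip, and bzip2 compression.
        with tarfile.open(path, "r:*") as archive:
            for member in archive:
                if member.isfile():
                    yield member.name, archive.extractfile(member).read()
    else:
        raise ValueError(f"unsupported sample collection: {collection}")
```

Each yielded byte string can be sent to the running container directly, so the samples stay inside the archive.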
Once you have a trained model, it can be tested by running:
python test.py -m model_path.pkl
To train a Random Forest model check defender/models/ml.py:
python -m defender.models.ml && python test.py -m defender/ml.pkl
To train a Deep Learning model check defender/models/malware_gpt.py:
./scripts/run.sh && python test.py -m defender/ml.pkl
Minimum scores
- FPR 1%
- TPR 95%
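The two scores above can be computed from ground-truth labels and model verdicts as in the following sketch. It assumes that 1 means malicious and 0 means benign, and it reads the thresholds as "FPR at most 1%" and "TPR at least 95%". The function names are illustrative and not part of this repository.

```python
def rates(labels, predictions):
    """Return (fpr, tpr) for binary labels where 1 = malicious, 0 = benign."""
    tp = fp = tn = fn = 0
    for label, pred in zip(labels, predictions):
        if label == 1 and pred == 1:
            tp += 1  # malware correctly flagged
        elif label == 1:
            fn += 1  # malware missed
        elif pred == 1:
            fp += 1  # benign file wrongly flagged
        else:
            tn += 1  # benign file correctly passed
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    return fpr, tpr


def meets_minimum(fpr, tpr, max_fpr=0.01, min_tpr=0.95):
    """Check the scores against the assumed reading of the minimum scores."""
    return fpr <= max_fpr and tpr >= min_tpr
```

For example, flagging 1 of 100 benign files and catching 95 of 100 malicious files gives an FPR of 0.01 and a TPR of 0.95, which sits exactly on both thresholds.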
Constraints
- 1GB of RAM
- Response time 5 seconds per sample
A valid submission for the defense track consists of the following
- a Docker image
- listens on port 8080
- accepts POST / with header Content-Type: application/octet-stream and the contents of a PE file in the body
- returns {"result": 0} for benign files and {"result": 1} for malicious files
- for files up to 2**21 bytes (2 MiB), must respond in less than 5 seconds (a timeout results in a benign verdict)
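The request and response format described above can be exercised with a small client such as the sketch below. It uses only the standard library. The helper names and the default URL http://127.0.0.1:8080/ are assumptions that match the docker run port mapping shown earlier, and this is not the repository's test module.

```python
import json
import urllib.request


def build_request(pe_bytes, url="http://127.0.0.1:8080/"):
    """Build a POST / request carrying the raw PE bytes in the body."""
    return urllib.request.Request(
        url,
        data=pe_bytes,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def parse_verdict(body):
    """Extract the verdict from the JSON body: 0 = benign, 1 = malicious."""
    return int(json.loads(body)["result"])


def classify(pe_bytes, url="http://127.0.0.1:8080/", timeout=5.0):
    """Send one sample to a running container and return its verdict."""
    request = build_request(pe_bytes, url)
    # The 5-second timeout mirrors the response-time constraint.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_verdict(response.read())
```

With the container running, classify(open(path, "rb").read()) returns 0 or 1 for a single file. A timeout raises an exception here, and a caller that wants to mirror the rule above would treat it as a benign verdict.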
The datasets used are listed in data.txt.
To apply the feature extractor to a folder of PE files and save the extracted features for training models, use:
python -m defender.dataset -s save_folder/save_name [--dike, --windows, --programs, --benign, --malware]
The parameters select which sources the dataset is built from:
- --large_dataset: the Practical Security Analytics dataset
- --dike: the DikeDataset
- --windows: the local machine's own Windows files
- --programs: the Program Files and Drivers
- --benign: any number of folders considered benign
- --malware: any number of folders considered malware
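To make the shape of these options concrete, the sketch below shows one way such a command line could be declared with argparse. It is a hypothetical reconstruction based only on the flags named above, not the actual parser in defender/dataset. In particular, the long name --save and the use of on/off switches for the predefined sources are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags listed in this README."""
    parser = argparse.ArgumentParser(prog="defender.dataset")
    # Assumed: -s takes the output location, e.g. save_folder/save_name.
    parser.add_argument("-s", "--save", required=True,
                        help="where to save the extracted features")
    # Assumed: predefined sources are simple on/off switches.
    for flag in ("--large_dataset", "--dike", "--windows", "--programs"):
        parser.add_argument(flag, action="store_true")
    # "Any number of folders" maps naturally to nargs="*".
    parser.add_argument("--benign", nargs="*", default=[],
                        help="folders whose files are labeled benign")
    parser.add_argument("--malware", nargs="*", default=[],
                        help="folders whose files are labeled malware")
    return parser
```

Under these assumptions, build_parser().parse_args(["-s", "out/features", "--dike", "--benign", "dirA", "dirB"]) enables the DikeDataset source and records two benign folders.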
- https://practicalsecurityanalytics.com/pe-malware-machine-learning-dataset/
- DikeDataset: https://github.com/iosifache/DikeDataset
- Malware-Feed: https://github.com/MalwareSamples/Malware-Feed
- https://github.com/ytisf/theZoo
- https://github.com/fabrimagic72/malware-samples
- https://github.com/InQuest/malware-samples
- https://github.com/mstfknn/malware-sample-library
- https://github.com/wolfvan/some-samples
- https://github.com/RamadhanAmizudin/malware
- BODMAS: https://whyisyoung.github.io/BODMAS/
- https://github.com/elastic/ember
- https://github.com/Virus-Samples/Malware-Sample-Sources
- https://bazaar.abuse.ch/
- Malware Detection PE-Based Analysis Using Deep Learning Algorithm Dataset. It has the same files as DikeDataset, except that DikeDataset has passed them through the VirusTotal API. It contains malicious and benign PE files and is released under the CC BY 4.0 license.