Hello everyone,
I’m learning about malware classification with CNNs and making good progress, but I’ve run into two issues with the dataset. I downloaded the “malimg” dataset from Kaggle, where many people upload copies that tend to be removed fairly quickly. I know the dataset originates from the paper “Malware Images: Visualization and Automatic Classification,” but every link to the “original” dataset that I have found is dead, so I can’t obtain it from the source. Does anyone know how to access it, or why it is no longer publicly available?
Why do I want the original? I manually reviewed the 25 classes in this dataset and noticed something suspicious about the last one, “Yuner.A”: all of its images look identical. To check, I wrote a small program that compares the images (it takes a while to run). Surprisingly, apart from the file size, all 800 .png files in that class are indeed pixelwise identical.
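For anyone who wants to reproduce the check, here is a minimal sketch of one way to do it (an illustration, not necessarily the program I ran). It hashes the decoded pixel data of each image and groups files by hash, so each file is read once instead of being compared against every other file. It assumes Pillow is installed for decoding; the names `load_pixels` and `group_identical_images` and the folder path in the usage line are made up for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def load_pixels(path):
    """Decode an image and return (size, mode, raw pixel bytes).

    Assumes the third-party Pillow package is installed; it is imported
    lazily so the grouping function can be used with another loader.
    """
    from PIL import Image
    with Image.open(path) as img:
        return img.size, img.mode, img.tobytes()


def group_identical_images(folder, pattern="*.png", loader=load_pixels):
    """Group files whose decoded pixels (plus size and mode) are identical.

    Returns a list of lists of file names; each inner list is one group.
    File size and PNG compression settings do not affect the result,
    because only the decoded pixel data is hashed.
    """
    groups = defaultdict(list)
    for path in sorted(Path(folder).glob(pattern)):
        size, mode, data = loader(path)
        # Include size and mode so that, e.g., a 2x8 and a 4x4 image with
        # the same raw bytes are not reported as identical.
        header = repr((size, mode)).encode()
        digest = hashlib.sha256(header + data).hexdigest()
        groups[digest].append(path.name)
    return list(groups.values())
```

Calling something like `group_identical_images("malimg/Yuner.A")` (hypothetical path) returns a single group if every image in the folder is pixelwise identical, and several groups otherwise.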
Does anyone have an explanation for either of these two issues?