IT majors Microsoft and Intel have collaborated on a new research project to detect and classify malware. The project, called Stamina (short for STAtic Malware-as-Image Network Analysis), uses a deep learning technique that converts malware samples into grayscale images. The images are then analysed for textural and structural patterns specific to malware samples. The project is part of Microsoft’s malware detection programme using machine learning techniques.
Stamina works in a few steps. First, the binary content of the input file is read as a one-dimensional stream of raw pixel data, which is then reshaped into a 2D image. The resulting image is resized to smaller dimensions and fed into a pre-trained deep neural network (DNN), which scans the image and tags it as clean or infected.
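The conversion steps can be illustrated with a minimal Python sketch. It is not the researchers' code: the function names, the fixed row width, the zero-padding of the last row and the nearest-neighbour resizing are all assumptions made for illustration, and the final classification by the DNN is left out.

```python
import math


def bytes_to_grayscale(data: bytes, width: int = 256) -> list[list[int]]:
    """Reshape a 1-D byte stream into rows of `width` pixels (0-255 each)."""
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    # Zero-pad the tail so the final row is complete (an assumption).
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


def resize_nearest(image: list[list[int]], new_h: int, new_w: int) -> list[list[int]]:
    """Resize by nearest-neighbour sampling (an assumed, simple method)."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


# Hypothetical stand-in for the raw bytes of an input file.
sample = bytes(range(256)) * 4
image = bytes_to_grayscale(sample, width=32)  # 32 rows of 32 pixels
small = resize_nearest(image, 8, 8)           # fixed-size input for a classifier
```

Each byte maps directly to one pixel intensity, so no information is added or interpreted at this stage; only the resizing step discards detail.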
Stamina relies on deep learning, in which neural networks are capable of learning on their own from input data stored in an unstructured or unlabelled format; in this case, an arbitrary malware binary.
According to Microsoft, 2.2 million infected Portable Executable file hashes served as the basis for the research, and 60 per cent of the known malware samples were used to train the original DNN algorithm. Stamina achieved an accuracy of 99.07 per cent in recognising and classifying malware samples, the team claims. Based on these results, Stamina could well become one of the ML modules that Microsoft deploys to detect malware.
However, Microsoft said that while Stamina was accurate and fast on smaller files, it faltered with larger ones.
“The results certainly encourage the use of deep transfer learning for the purpose of malware classification,” said Jugal Parikh and Marc Marino, the two Microsoft researchers who participated in the research on behalf of the Microsoft Threat Protection Intelligence Team.
“Microsoft now uses client-side machine learning model engines, cloud-side machine learning model engines, machine learning modules for capturing sequences of behaviours or capturing the content of the file itself”, according to Tanmay Ganacharya, Director for Security Research of Microsoft Threat Protection.
“Anybody can build a model, but the labelled data and the quantity of it and the quality of it helps train the machine learning models appropriately and hence defines how effective they are going to be,” Ganacharya reportedly said.