An advanced profile hidden Markov model for malware detection

Abstract

The rapid growth of malicious software (malware) production in recent decades and the increasing number of threats posed by malware to network environments, such as the Internet and intelligent environments, emphasize the need for more research on the security of computer networks in information security and digital forensics. The method presented in this study identifies “species” of malware families, which are more sophisticated, obfuscated, and structurally diverse. We propose a hybrid technique combining aspects of signature detection with machine learning based methods to classify malware families. The method is carried out by utilizing Profile Hidden Markov Models (PHMMs) on the behavioral characteristics of malware species. This paper explains the process of modeling and training an advanced PHMM using sequences obtained from the extraction of each malware family’s paramount features, and the canonical sequences created in the process of Multiple Sequence Alignment (MSA) production. Due to the fact that not all parts of a file are malicious, the goal is to distinguish the malicious portions from the benign ones and place more emphasis on them in order to increase the likelihood of malware detection by having the least impact from the benign portions. Based on “consensus sequences”, the experimental results show that our proposed approach outperforms other HMM-based techniques even when limited training data is available. All supplementary materials including the code, datasets, and a complete list of results are available for public access on the Web. ${}^{1}$

Keywords

Malware detection metamorphic static analysis profile hidden Markov models

1. Introduction

Currently, we can observe a tremendous increase in the use of computers in everyday life and the Internet is playing a great role in people’s daily routines by providing important services. The dramatic advances in information technology and electronic commerce have provided new opportunities for business environments but has also brought about issues of security and confidentiality of information in cyberspace. Since Frederick Cohen used the phrase “computer virus” in 1983, a large number of new malicious codes have emerged every year.

Malicious software also known as malware, in general, refers to all unwanted computer programs such as computer viruses, Trojans, rootkits, worms, adware, spyware, and ransomware [1, 2, 3, 4, 5]. Malware detection is the first step in protecting computer systems. The general way to deal with malware is using anti-virus software, but these programs mostly use signature-based detection techniques and are not capable of identifying new malware and the zero-day threats. Today’s malware techniques are more insidious and better written than before. Some types of malware are more difficult to detect compared to the old malware generations; recent malware generations (i.e., polymorphic and metamorphic) use encryption and obfuscation techniques [1, 6]. They produce different strains of themselves while keeping the inherent functionality the same in order to create an entirely new variant, which is hardly similar to the original variant after reproduction.

Generally, malware detection techniques consist of two main steps. The first one is the malware analysis and feature extraction and the second is the identification phase. For the first step, the two common methods, static and dynamic analysis as well as new hybrid approaches are described in Section 2. The classification methods in the second step are categorized as signature-based detection, behavior-based detection, and statistical-based detection. They are discussed in Section 3.

Our work has been carried out in the field of cybersecurity, in which we focused on malware detection. Given the fact that one aspect of systems operation is time, its production is a sequence of data. Such sequences can be dealt with using process models and profiling approaches. In the case of process modeling, profiling and its related issues, hidden Markov models and in particular profile hidden Markov models are considered to be powerful tools. They have many use cases in computer science, including almost all issues in this field that are arranged by events or have a function based on time. Audio processing and speech recognition, handwriting and text recognition, detection of cyber-attacks and network penetration, biomarker modeling or protein structure prediction, detection and prevention of fraud or extraction of customer behavior patterns in bank transactions, and forecasting financial markets and stock prices are some cases that can be mentioned. HMM-based models have always been highlighted and played a significant role in the proper solutions to such issues. Therefore, modifying and upgrading the HMM-based models, which result in better performance and higher detection and classification accuracy, would have a significant impact on such applications and motif problems that were briefly mentioned above.

The specific sequence-based architecture of HMM decreases the complexity, making it plausible to reduce the overall run-time of the training phase. Furthermore, a plain structure implies that the associated model has a few parameters and thus can be trained on a small amounts of data. Such HMM architectures can lead to apt models if they proportion the structure of the application elegantly. Profile HMM (also known as linear HMM) is one example of these architectures, which is befitted as a motif representation for interesting positions in sequences. It is useful for two reasons. First, its structure is linear to match that of a position and second, it strives to return the process of evolution. The foundation of a profile HMM is a sequence of states, so-called “match states”, which depict the canonical sequence for the family under consideration. Each of the match states corresponds to one position in the canonical sequence. The series of states is similar to a position weight matrix (known as a profile) because each state has a frequency distribution across the whole symbols. Profile HMM also consists of two additional types of states (insert and delete states) in order to model the evolution process. Each delete state stands alongside its corresponding match state, which allows the match state to be skipped. Insert states with self-loops are placed between match states, allowing one or more bases to be inserted between two match states.

Contributions

We develop a statistical model originally introduced by Attaluri et al. [1] to strike obfuscated malware detection and propose a novel profiling algorithm to consider important features of data. The central objective of this paper is to model an advanced profile hidden Markov model (APHMM), which improves the performance of the proposed model. Moreover, we build a model incorporating statistical learning methods for malware behavior analysis using limited data and modify it to deal with binary features while maintaining the high detection rates of malicious activities.

The rest of this work is organized as follows. In Section 2, we review related work. Section 3 presents a brief history of malware detection techniques. In Section 4, the proposed approach and experimental results are described. We discuss our experiments and the utilized datasets in Section 5. The paper concludes with Section 6 where we discuss possible future work.

2. Related work

As mentioned earlier in Section 1, malware analysis techniques can be divided into two main categories of static and dynamic analysis methods. Static analysis is more popular because it analyzes the structure of malicious code and detects malware without running it in contrast to dynamic analysis, which usually requires emulation and malicious code analysis in runtime. Damodaran et al. identified three general approaches in malware analysis, namely, static analysis, dynamic analysis, and hybrid approaches [7]. Certain malware detection methods are based on static analysis discussed in [1, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18] and only rely on the features extracted from malware or benign files without executing them. On the other hand, dynamic analysis discussed in [5, 19, 20, 21, 22, 23, 24, 25, 26, 27], in which there is runtime monitoring, has been successfully used for malware detection. Recently, hybrid approaches in [7, 28, 29, 30] have been investigated, where static and dynamic features are used for precise identification. In the following sections, previous work is reviewed based on whether they rely on static, dynamic or hybrid static-dynamic analysis.

2.1 Static analysis

Static analysis of a malware is performed without the need for program execution. Some useful information we can extract using static analysis is the sequence of operational codes (known as opcodes), and control flow graphs. Khalilian et al. proposed an expert system for static detection of metamorphic malware. Their proposed method, named G3MD, applies graph mining to the opcode graphs of the metamorphic malware family to extract the frequent sub-graphs. Using these sub-graphs, a classifier is learned to distinguish between the benign and malicious files [31]. Christodorescu et al. presented a malware detection technique relying on static analysis and control flow graphs [4]. The approach focuses on recognizing obfuscation patterns in malicious codes and their results with good accuracy. Principal Component Analysis (PCA) [15] was used by Deshpande et al., and Support Vector Machines (SVM) [3] were utilized for malware classification by Singh et al. Fan et al. proposed a sequence mining algorithm to discover malicious sequential patterns along with All-Nearest-Neighbor (ANN) classifier which is constructed based on the discovered patterns for malware detection [32]. Annachhatre et al. hired a clustering algorithm based on features taken from static analysis for malware recognition [8]. Deshpande et al. used function call graph analysis for detecting malicious software [17] while Shanmugam et al. analyzed opcode-based similarity scores which rely on substitution cryptanalysis manners [16]. API call and operational code sequences were employed by Shankarapani et al. to assign whether a part of the code is similar to some specific malicious software [18]. Wagner et al. designed and developed, a knowledge-assisted visual malware analysis system (KAMAS) [33] for behavior-based malware analysis. They focused on a subset of function calls – the critical API calls that can be used for accessing private data and utilizing system resources [33]. In [34], authors developed an assembly code clone search system called ScalClone whose aim is to identify the code clones of the target malware.

2.2 Dynamic analysis

In dynamic analysis, the input sample program is executed. It is usually carried out in an isolated environment such as Virtual machines [35] or tools which are famous as sandboxes [36]. Some of the information that can be prepared by dynamic analysis is API calls, system calls, instruction traces, registry changes, and memory writes [7]. Kolbitsch et al. constructed fine-grained models which consider the malware’s behavior based on system calls [37]. These models’ results are depicted in the form of graphs, where the vertices signify system calls and the edges signify affiliation between the calls. Ahmed et al. presented a runtime monitoring tool that obtains statistical features based on spatiotemporal information in the logs of API calls [27]. Eskandari et al. extracted programs’ API calls and then combined them with control flow graphs to build an API-CFG model [23]. In an improved version [30], n-gram methods were used to extract the API calls. Qiao et al. analyzed malware based on frequency analysis of API call sequences [21]. Anderson et al. introduced a malware detection method that uses instruction trace executables logs. The traces are analyzed as graphs, where the nodes of the graphs indicate instructions, and statistics from instruction traces are used for computing transition probabilities [24].

2.3 Hybrid approach

Hybrid methods use both static and dynamic analysis aspects. In this section, we review some recent work of this type. Choi et al. introduced a framework for classifying malware using static analysis along with dynamic analysis. They extract malware features using a method called Malware DNA (Mal-DNA) [29]. The main focus of this method is behavior monitoring which occurs in the extraction of dynamic features. Eskandari et al. developed a tool called the HDM analyzer [30]. This tool applies both static and dynamic analysis during a learning phase while the test phase only relies on static analysis. The aim is to benefit from the apparently superior dynamic analysis in the training phase while the static analysis efficiency advantage is maintained at the scoring stage. For comparison, they show that the HDM analyzer has better overall precision and complexity than other static and dynamic analysis methods. The dynamic analysis was employed by Eskandari et al. which was based on the extracted API call sequences. Kabakus et al. introduced a novel hybrid Android malware analysis approach namely mad4a which uses the advantages of both static and dynamic analysis techniques and finds some unknown characteristics of Android malware through the used hybrid approaches [38].

3. Background

In this section, we present a brief overview on malware detection techniques and explain the general concepts used in this paper, paying special attention to profile hidden Markov models, which are the basis for our performance evaluation. We also review some notable previous works.

3.1 Malware detection

Various advanced recognition techniques have been considered in recent studies on malware detection. The common methods used in this field are: signature-based approaches such as pattern matching [4] and N-gram based [39] techniques; behavior-based (also called heuristic-based) approaches such as control flow graph [4], support vector machines [3], decision forest, K-means, and phylogenetic-tree; and statistical-based approaches such as hidden Markov models [11] and behavior profiling with hidden Markov models (also known as profile hidden Markov models) [1, 40]. We briefly review these methods before explaining PHMMs in detail.

3.1.1 Signature-based detection

Signature-based detection is extensively used in anti-virus programs [2, 3, 4]. A signature is a sequence of bytes, which is used to recognize malware belonging to a certain family. An anti-virus program which works based on the signature-based analysis must hold the signature repositories of the known malicious software. These databases are regularly updated by detecting and adding new malicious programs. This approach is efficient and effective for all known types of malware. A disadvantage of this approach is that it must have an up-to-date database and if a piece of malware is not present in the database, it will not be detected. In addition, simple obfuscation techniques are able to defeat and evade this method [1, 5, 7, 32, 41].

3.1.2 Behavior-based detection

Behavior-based detection focuses on transactions with the operating system while malware is being executed. The behavior of malware/benign files is analyzed in a training step. Based on patterns taken during the training phase, an input file is divided into malware or benign in the test step [33, 42]. In this method, the activities of a file are monitored to identify the functional patterns of the object and then to indicate whether the input file is malicious, or at least suspicious without scanning the file to determine its digital signature.

3.1.3 Statistical-based detection

Some techniques such as Markov process models used for malware detection are based on statistical analysis of features taken from the malware file. For example, PHMM was used by Attaluri et al. for identifying sophistical malware [1]. This technique has been used as a benchmark in other studies [1, 10, 11, 19, 20, 40, 43]. PHMMs have also been used in Bioinformatics to represent families of proteins [44]. Protein sequences belonging to one family are aligned to construct a multiple sequence alignment, which is used as the family’s representative profile hidden Markov model. These profiles can be further used to classify new proteins. Due to the inherent similarity of the production of metamorphic malware to the evolution and mutation process in the gene or protein sequences, PHMM is proposed as a potential statistical model for classifying malware families.

3.2 Markov chains

Assume that there are several states with transition probabilities for each move from one state to another. Markov chain rule states that each state is independent from all the other states. However, if we make each state dependent on its previous state, the first-order Markov chain will be established. In higher order Markov chains, more previous states are taken into account in the process of computing the probability of one state. In other words, in the $n^{\text{th}}$ -order Markov chain, the $n^{\text{th}}$ state is dependent on $n-1$ previous states, which means that the memory of a Markov chain is limited. As an example of use, a Markov chain for a DNA sequence is depicted in Fig. 1 [44]. The representation of the chemical code for a DNA sequence consists of four symbols (i.e., bases) which are A (adenosine), C (cytosine), G (guanine) and T (thymine). Figure 1 shows a first-order Markov model because of each state depending only on the one previous state. The transition probability of moving from a specific state to another is represented by an arrow. The transition probabilities are calculated after several DNA sequences have been observed. Then, all the probabilities make up a matrix, which is called transition probability matrix. Transition probabilities are calculated after observing several DNA sequences with the corresponding transition probability matrix providing a compact representation of these transition probabilities. This DNA Markov model is a first-order Markov model since each event depends only on the previous event.

Figure 1.

Markov chain for DNA.

3.3 Higher order markov chains

As described in Section 3.2, in higher order Markov chains each state depends on more than one previous state. The higher order Markov chains can also be represented as first-order ones by combining the states and the symbols. For example, a second-order Markov chain with a four-symbol alphabet such as {A, C, T, G} for representing a DNA sequence such as ATCCGA can be represented as a five-state first-order Markov model with states AT, TC, CC, CG, GA. This type of representation is called n-gram (bigram in this example). An n-gram is a commonly used concept in natural language processing, which we are able to use for detecting malware [39] as a data segmentation technique during pre-processing in order to perform feature extraction as has been carried out by Kang et al.

3.4 Hidden Markov models

A hidden Markov model (HMM) is a statistical approach for modeling systems which are assumed to contain Markov processes with hidden states and is based on the discrete hill climbing method [45]. HMMs have a variety applications in artificial intelligence area such as speech recognition, computational molecular biology, and malware detection [46]. An HMM can be made up of a series of observations which are affiliated to the “hidden” Markov process states (orange circles in Fig. 2) by a set of discrete probability distributions [7].

Figure 2.

Generic hidden Markov model example [7].

The practical advantage of HMMs is in solving each of the following three problems [45].

Problem 1: Evaluating

“Given an HMM $M$ and an observation sequence $O$ , determine the probability of observed sequence $O$ given $M$ , i.e., $p(O|M)$ . That is, we are able to score a given observation sequence against a given model, the better the score, the more closely the observation sequence matches the observations used to train the model” [7].

Problem 2: Decoding

“Given an HMM and an observation sequence $O$ , determine the optimal sequence of state transitions $X$ associated with $O$ . That is, we can uncover the “best” hidden state sequence. Here, “best” is in the sense of maximizing the expected number of correct states $X_{i}$ . This is in contrast to a dynamic program, which yields the $X_{i}$ corresponding to the highest scoring path” [7].

Problem 3: Training

“Given an observed sequence $O$ , adjust all the HMMs parameters such that the probability of observing $O$ by the model is maximized. That is, we are able to train a model to fit a given observation sequence $O$ ” [7].

In this paper, our focus is on the third problem; our main goal is to train a model based on observation sequences extracted from a given malware family. Then, in order to make decisions about unknown files, the solution for problem 1 is utilized for scoring the observation sequences which are extracted from malware files. The sequences extracted from benign files are also scored the same way. A complete discussion of the solutions for HMM’s three problems can be found in [45]; reference [46] is also a valuable one to find additional information on them. In addition, reference [41] contains some information about the application of HMMs in malware detection.

3.5 Profile hidden Markov models

Profile hidden Markov models can be used to classify malicious files into families or groups based on their behavior on the host system. PHMMs have been widely used in the field of bioinformatics to find the similar proteins and DNA sequences in a large database [43, 44]. As is presented in Fig. 3, a PHMM consists of three types of states: matches (rectangles), inserts (ovals) and deletes (circles). Using this particular model helps us overcome the obstacles caused by polymorphic and metamorphic malware. This model can lead to a high degree of classification accuracy compared to other advanced methods in this field. PHMM is a special case of standard hidden Markov model that is designed to cope with fundamental problems in bioinformatics.

Figure 3.

A Profile Hidden Markov Model (PHMM) with three match states.

3.6 Multiple sequence alignment

One key step in the PHMM construction procedure is the multiple sequence alignment. A similarity score can be computed for the MSA of DNA sequences which can be used to determine whether the gene sequences are descended from a common ancestor. In malware detection, in order to decide whether the unknown file of the input, which is a sequence of symbols, belongs to a malware family, the file is presented to the models built from MSAs produced from each malware family. Each malware family is modeled by a PHMM and the Viterbi algorithm for PHMMs can be used in this step to compute the similarity score of the sequence to each PHMM model.

3.6.1 Partial order alignment

Partial order alignment (POA) is a fast technique for multiple sequence alignment in bioinformatics. POA guarantees a desirable alignment for each new sequence, as compared to any sequence in the MSA. POA is also scalable and is capable of managing INDELs or branching in the alignment phase. Figure 4 shows an example of how POA constructs a directed acyclic graph for aligning two sequences. This algorithm makes it possible to build MSA for complex structures [47, 48, 49].

Figure 4.

Partial order alignment (POA) [50].

Partial order alignment is a graph-based algorithm that acts the same way as a dynamic programming method does. The only difference between the original dynamic programming algorithm (i.e., Needleman-Wunsch Algorithm; Fig. 5) and POA is that instead of the two sequences used in the former, a graph and a sequence are used as the input of the latter.

Figure 5.

The original dynamic programming algorithm [50].

For each alignment, the generated weighted DAG is updated and topologically ordered. Figure 6 depicts one step in updating the directed graph by POA. Finally, the consensus sequences made from this graph can be used to build an advanced PHMM.

Figure 6.

String to graph alignment [50].

4. Proposed approach

There are two main challenges in using machine learning techniques for malware detection problem. The first one is to develop the machine learning algorithm which is addressed at the training phase of our proposed method and the second challenge is to identify and select an appropriate feature set. The consensus sequences (or canonical sequences) are produced in the process of generating multiple sequence alignments. A consensus sequence is a sequence which is constructed from the weighted DAG generated by POA algorithm and is a representative of all the sequences in the MSA (Fig. 7). A PHMM may have more than one consensus sequence if the sequence family members are diverse.

Figure 7.

An example MSA along with the consensus sequence (in the red box).

As is shown in Fig. 7, the consensus sequence is added to the constructed MSA in order to obtain an Advanced Profile Hidden Markov Model (APHMM). Therefore, the features related to each family are emphasized leading to improvement in the performance of our statistical model. Note that the proposed model is superior to standard PHMMs in terms of improved MSA used in building the APHMM, and the model structure is unaltered. The second challenge has recently been the focus of many studies and the existing techniques have produced acceptable results. Our model selection phase is equivalent to scoring every opcode sequence against the constructed PHMM belonging to each malware family. Detection and recognition of malware with high accuracy and speed is one of the most important goals of researchers in the computer security field. To identify malicious code, our proposed approach is based on static behavior and statistical pattern recognition. In this method, we use PHMMs to identify the families of malicious software. By modifying the way of creating MSA, which is considered as learning step in this model, we are able to achieve desirable results. In addition to 100% accuracy, low false positive rate, simplicity, and low computational overhead are other outcomes of the proposed method.

4.1 Training

For training and scoring, we used opcode sequences, which are extracted from executable files by using static analysis. As noted in Section 2, opcode sequences have been used in many research studies on malware detection. We use IDA Pro to disassemble files and extract the static opcode sequences from the result. For example, suppose we disassemble a portable executable file and obtain the ASM file. A snapshot from an ASM file is shown in Fig. 8.

Figure 8.

Example disassembly [7].

The static opcode sequence corresponding to this disassembly is

Unigram: call, push, lea, push, push, call, add, test, jzBigram: call push, push lea, lea push, push push, push call, call add, add test, test jz

Figure 9.

Ten samples of NGVCK family residues.

The following sequences, which are the beginning of some residues shown in Fig. 9, were generated from ASM files belonging to NGVCK malware family. Each base (symbol) in the residue can indicate different types of N-grams. The residues along with the consensus sequences (indicated by red) constitute the MSA matrix as can be seen in Fig. 10.

Figure 10.

An MSA snapshot generated form NGVCK family.

We discard all operands such as labels and directives, and only retain the mnemonics (opcodes). For the top N frequent opcodes, N unique symbols were used and one symbol as a wild card for the rest. These unique symbols can be generated from single opcodes (unigram) or N general combinations of every N consecutive opcodes (N-gram). This is discussed further in Section 5.3. Then, sequences are used to train a PHMM, and the resulting model is used to score the files in the malware and benign families test sets. The scores from a given experiment are used to form a scatterplot.

5. Experiments

In this section, we first discuss the tools that can be used to extract features from binary code. Then, we give an overview of the malware dataset used in this work. Finally, we elaborate on the experiments conducted. In Section 5.3, we provide the results of our experiments. All supplementary materials including the code, datasets, and results are available for public access on the Web.2

²
All supplementary materials including codes, datasets and full results are presented in https://iasbs.ac.ir/∼ansari/malware/ aphmm.html.

5.1 Tools for static analysis

5.1.1 IDA Pro

This disassembler produces highly accurate assembly code from an executable file. It can also be used as a debugger. IDA is a powerful tool that supports scripting, function tracing, instruction tracing, instruction logging, and so forth. IDA Pro is used for static analysis, and, in particular, for the generation of “.asm” files from “.exe” files, from which opcodes and windows API calls can be extracted. Being a multi-processor and debugger, IDA Pro is available for various platforms such as Windows, Linux and Mac OS. Note that, there are Evaluation and Freeware versions of IDA freely available for non-commercial use with limited availability, but the Full version is not free of charge. We used the IDA Pro Disassembler version 6.8.1 for extracting opcode sequences from malware/benign family file sets.

5.1.2 PE explorer

Compared to other disassemblers, this tool has been designed to be more user-friendly. To achieve this aim, some of the components of other disassemblers have been removed which has resulted in an easy and rapid process. PE Explorer is the same as the other famous and proprietary tools for disassembling. In addition, it focuses on clarity, user-friendliness, and smoothness of navigation. PE Explorer is capable of opening, viewing and editing a variety of Windows executable file types (i.e. PE files). The evaluation version of PE Explorer is free of charge for 30 days with basic technical support.

5.1.3 Capstone

As said in Capstone webpage, “Capstone is a lightweight multi-platform, multi-architecture disassembly framework”. It comes with some exported methods helping to disassemble binary files. Some of its features are: Multi-architectures, providing details on the disassembler and some semantics of the disassembled instruction, and Native support for operating systems such as Windows, Mac, iOS, Android, and Linux. Capstone is free and available for well-known operating systems in package repositories. It is a Python/C library and using “pip install capstone” command, the Python module can easily be installed on Windows, Mac OS and Linux. The process of disassembling and debugging the portable executable files (for extracting features) and model generation and classification can be implemented as an integrated system using Capstone.

5.2 Datasets

Due to constant improvement and progress in development of new malware, a single benchmark cannot be used over a long period since using the models trained on these old datasets will not be desirable in malware detection field. Therefore, there is not one common dataset available as an assessment reference to all researchers. In this study, we used a collection of benign files from Cygwin related files. Additionally, we amassed a dataset containing four malware families used in several previous related work. The opcodes were extracted by the IDA Pro when not available. For each family, there are 50 examples of construction kits available on the VX Heaven website, with the exception of the NGVCK family of 200 samples, which is one of the most obscure malicious pieces of software available on malware detection because of their complex transformations. These datasets are also used in evaluations of various methods for malware detection that have been addressed in a number of previous works [1, 14, 31, 41, 51, 52, 53, 54, 55]. According to measurements of similarity carried out on the dataset by Rezaei et al. [14], after computing the similarity score using the method described in [41], samples in each family are compared in pairs. The least, the most and the average quantity of samples pair-wise comparison in each virus creation kit can be seen in Table 1.

Table 1
Min, max, and average kits similarities percentage

	NGVCK	G2	VCL32	MPCGEN	Cygwin
Min	0.01493	0.62845	0.34376	0.44964	0.13603
Max	0.21018	0.84864	0.92907	0.96568	0.93395
Average	0.10087	0.74491	0.60631	0.62704	0.34689

NGVCK generates viruses with similarities ranging from 1.5 to 21.0% with an average of about 10.0%. This is a far lower degree of similarity than any of the other three generators. For the non-NGVCK generators, the similarity between two variants of the same virus ranges from 34.4 to 96.6%, and the average scores of G2, VCL32, and MPCGEN are 74.5, 60.6, and 62.7%, respectively. On the other hand, normal files give an average similarity of 34.7%. From these results, we can see that the NGVCK viruses are substantially different from one another while the virus variants generated by the other generators are more similar to one another than normal files. We conclude that the non-NGVCK generators we tested are not nearly as effective as NGVCK at generating metamorphic viruses.

5.3 Results

To evaluate our models, we implemented a statistical-based malware detection technique based on profile hidden Markov models. In addition, we compared our results to n-gram analysis (with different $N$ parameter).

Attaluri et al. [1] proposed a method based on PHMMs. Since a different dataset was used by them, we implemented their method using Python and simulated the results on our dataset. The results of this method are provided in Fig. 11. As can be interpreted from the plot and also stated in the original paper, the method was not successful on classifying NGVCK family. In the evaluation phase, the Log Likelihood per Opcode (LLPO) is used to compute similarity score in order to detect a member of malware families (models). LLPO is computed by dividing the output of log likelihood function by the file’s length to create a length independent measure.

Figure 11.

The simulated results of Attaluri’s proposed method using standard PHMMs (without consensus sequences).

The strength of the proposed approach lies in adding consensus sequences to the MSA for building APHMMs. By adding the consensus sequence to the MSA, we build an advanced PHMM, which puts more emphasis on the frequent patterns existing in each family of malware. The effect of the number of consensus sequences is analyzed in the experiments reported in the various figures. Figures 12 and 13, present the result of classifying the NGVCK malware family with one and four consensus sequences, respectively. As mentioned before, for this family we had 200 sequences from which 150 sequences were used for training and 50 sequences were used for testing. The results for classifying other families are provided in supplementary material.3

https://iasbs.ac.ir/∼ansari/malware/aphmm.html.

For the rest of malware families only 50 sequences were available from which 40 were used for training and 10 sequences were used for testing. For a benign family, we had only 40 sequences (32 used in training and 8 for testing).

Figure 12.

The results of our proposed method for classifying NGVCK family by using APHMMs (one consensus sequence).

Figure 13.

The results of our proposed method for classifying NGVCK family by using APHMMs (four consensus sequences).

In feature extraction phase, we use a symbol for each frequent opcode. For the remaining opcodes, we use one symbol. In the former experiments, only top 36 unique opcodes are used (totally 37 opcodes). The results of unigram feature extraction method are presented in Figs 14 and 15 for 37 and 200 symbols, respectively. In Fig. 16, the results on bigram with 100 symbols are presented. By comparing Figs 14 and 15, we see that with unigram features using less symbols results in better distinction of the NGVCK class. If we analyze the frequency of all possible unigram symbols, we see that the top 14 opcodes cover 93% of all opcodes existing in all of the files with 98% of these opcodes covered by the top 50 opcodes [56]. We chose the top 36 opcodes as used in the research carried out by Attaluri et al. [1]. Having bigram symbols, the number of frequent ones will increase relatively. As we can see in Figs 14 and 15, having more symbols does not improve the results of unigram feature setting. The reason is the non-frequent symbols affects the scoring emission matrix and makes the score values unnecessarily small. For larger N values, the symbols are generated sequences from the combination of N successive opcodes, which are unique and informative (leading to better separation of families) even for very large numbers of symbols. Since PHMM is a position-specific model, using N-grams (N $>$ 1) has a positive effect on the model classification performance. By comparing Figs 15 and 16, we observe this positive effect on bigrams with 100 symbols.

Figure 14.

The results of our proposed method for classifying NGVCK family by using one consensus sequence, unigram and 37 symbols.

Figure 15.

The results of our proposed method for classifying NGVCK family by using one consensus sequence, unigram and 200 symbols.

Figure 16.

The results of our proposed method for classifying NGVCK family by using one consensus sequence, bigram and 100 symbols.

Table 2 shows a summary of all the results. We evaluate our results with three parameters, which directly impact on precision and recall. First is the number of consensus sequences (No. CSs) which plays an important role in the classification of malware families by emphasizing the common features. The second parameter (No. Syms) is the number of different symbols used in our MSA implementation and the third input parameter shows the number of consequence opcodes groups as an atomic part (unigram or bigram). The higher the order for this parameter, the better the classification. While higher orders of N-grams consist of more number of unique bases, it might make the classification better. Equations (1) and (2) show accuracy and FPR (False Positive Rate) respectively:

$\displaystyle\text{Accuracy}=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{FN}+% \text{TN}+\text{FP}}$ (1) $\displaystyle\text{FPR}=\frac{\text{FP}}{\text{TN}+\text{FP}}$ (2)

where, TP and TN mean True Positive and True Negative, respectively. FP stands for False Positive, and FN means False Negative.

Table 2

The summary of our experimental results on NGVCK family where FPR means false positive rate, CSs stands for consensus sequence and Syms means symbols

Number of CSs/number of Syms/N-gram	Accuracy %	FPR
One/37/unigram	98	0.010
Two/37/unigram	98	0.005
Four/37/unigram	100	0.005
One/100/unigram	98	0.016
Two/100/bigram	100	0.000
One/200/unigram	96	0.000
Two/200/bigram	100	0.016

We presented a separate experiment for each of the malware datasets listed in Section 5.2 for each of the various combinations of conditions mentioned in Table 2. For each experiment, we evaluated the effect of three parameters, the number of consensus sequences, the order of n-grams, and the number of symbols (bases).

Table 3

Comparison of existing techniques with our method

Paper	Analysis type	Accuracy	FPR
Khalilian et al. (2018) [31] ${}^{*}$	Static	0.983	0.016
Fazlali et al. (2016) [52] ${}^{*}$	Static	0.994	0.01
Khodamoradi et al. (2015) [54] ${}^{*}$	Static	0.986	0.019
Canfora et al. (2014) [51] ${}^{*}$	Static	0.94	0.02
Rezaei et al. (2014) [14] ${}^{*}$	Static	0.94	0.00
Eskandari et al. (2012) [23]	Dynamic	0.98	0.02
Anderson et al. (2011) [24]	Dynamic	1.00	0.00
Eskandari et al. (2013) [30]	Hybrid	0.97	0.02
Islam et al. (2013) [28]	Hybrid	0.97	0.05
Our approach	Static	1.00	0.00

${}^{*}$ These studies applied the same datasets that we used to evaluate our method.

6. Conclusion and future work

In this paper, we designed and evaluated various malware detection techniques based on opcode sequences. We trained profile hidden Markov models and compared detection rates for models based on N-grams and consensus sequence approaches. Additionally, in order to obtain the perfect sensitivity score (i.e. one hundred percent), the false positive rate is 0.01 when using one consensus sequence in the model construction, and is reduced toward zero when four consensus sequences are considered. The evaluation results show that our classification technique obtains the perfect accuracy even when there are a few training datasets to build the models. We compare the performance of our advanced classification approach with some previous works. As one of our principal targets is high accuracy, we compare only with those methods which achieve the accuracy of 90% or higher on the set of executables. Table 3 summarizes this comparison.

Moreover, in the proposed approach presented in this study, we have leveraged reverse engineering and disassembly techniques in preprocessing and data preparation phases to extract the features needed to create our statistical model which increase the speed of the whole process. As a proposal for future research, another model could be considered to obtain the raw information from reading files in their binary form. By reading under question files byte-by-byte and in hexadecimal format instead of extracting information from opcodes, we may extract enough features to create the profile hidden Markov model. This process automatically minimizes the need for human intervention with the results and increases the processing speed especially in large-scale datasets analysis.

Footnotes

Acknowledgments

We would like to thank all the Artificial Intelligence Lab members at the Institute for Advanced Studies in Basic Sciences (IASBS) especially Dr. Zahra Narimani, Assistant Professor of Computer Science and Information Technology Department, IASBS, and Akbar Karimi, PhD student at the University of Parma, for their invaluable help in the process of developing this paper.

References

Attaluri

McGhee

and Stamp

, Profile hidden Markov models and metamorphic virus detection, Journal in Computer Virology 5(2) (2009), 151–169.

Aycock

, Computer viruses and malware, Vol. 22, 2006: Springer Science & Business Media.

Singh

, Support Vector Machines and Metamorphic Malware Detection, 2015.

Christodorescu

and Jha

, Static analysis of executables to detect malicious patterns, 2006, Wisconsin Univ-Madison Dept of Computer Sciences.

Galal

H.S.

Mahdy

Y.B.

and Atiea

M.A.

, Behavior-based features model for malware detection, Journal of Computer Virology and Hacking Techniques 12(2) (2016), 59–67.

Baysa

Low

R.M.

and Stamp

, Structural entropy and metamorphic malware, Journal of Computer Virology and Hacking Techniques 9(4) (2013), 179–192.

Damodaran

et al., A comparison of static, dynamic, and hybrid analysis for malware detection, Journal of Computer Virology and Hacking Techniques 13(1) (2017), 1–12.

Annachhatre

Austin

T.H.

and Stamp

, Hidden Markov models for malware classification, Journal of Computer Virology and Hacking Techniques 11(2) (2015), 59–73.

Wang

and Xu

, Malware variants detection based on opcode image recognition in small training set, in: Cloud Computing and Big Data Analysis (ICCCBDA), 2017 IEEE 2nd International Conference on, 2017.

10.

Rezaei

et al., Malware detection using opcodes statistical features, in: Telecommunications (IST), 2016 8th International Symposium on, 2016.

11.

Gharacheh

et al., Detection of metamorphic malware based on HMM: a hierarchical approach, International Journal of Intelligent Systems and Applications 8(4) (2016), 18.

12.

Jidigam

R.K.

Austin

T.H.

and Stamp

, Singular value decomposition and metamorphic detection, Journal of Computer Virology and Hacking Techniques 11(4) (2015), 203–216.

13.

Gharacheh

et al., Proposing an HMM-based approach to detect metamorphic malware, in: Fuzzy and Intelligent Systems (CFIS), 2015 4th Iranian Joint Congress on, 2015.

14.

Rezaei

et al., Metamorphic viruses detection by hidden Markov models, in: Telecommunications (IST), 2014 7th International Symposium on, 2014. IEEE.

15.

Deshpande

Park

and Stamp

, Eigenvalue analysis for metamorphic detection, Journal of Computer Virology and Hacking Techniques 10(1) (2014), 53–65.

16.

Shanmugam

Low

R.M.

and Stamp

, Simple substitution distance and metamorphic detection, Journal of Computer Virology and Hacking Techniques 9(3) (2013), 159–170.

17.

Deshpande

, Metamorphic Detection Using Function Call Graph Analysis, 2013.

18.

Shankarapani

M.K.

et al., Malware detection using assembly and API call sequences, Journal in Computer Virology 7(2) (2011), 107–119.

19.

Vemparala

et al., Malware Detection Using Dynamic Birthmarks, in: Proceedings of the 2016 ACM on International Workshop on Security and Privacy Analytics, 2016.

20.

Vemparala

, Malware Detection Using Dynamic Analysis, 2015.

21.

Qiao

et al., Analyzing malware by abstracting the frequent itemsets in API call sequences, in: Trust, Security and Privacy in Computing and Communications (TrustCom), 2013 12th IEEE International Conference on, 2013.

22.

Eskandari

Khorshidpur

and Hashemi

, To incorporate sequential dynamic features in malware detection engines, in: Intelligence and Security Informatics Conference (EISIC), 2012 European, 2012.

23.

Eskandari

and Hashemi

, A graph mining approach for detecting unknown malwares, Journal of Visual Languages & Computing 23(3) (2012), 154–162.

24.

Anderson

et al., Graph-based malware detection using dynamic analysis, Journal in Computer Virology 7(4) (2011), 247–258.

25.

Park

et al., Fast malware classification by automated behavioral graph matching, in: Proceedings of the Sixth Annual Workshop on Cyber Security and Information Intelligence Research, 2010.

26.

Dai

Guha

R.K.

and Lee

, Efficient virus detection using dynamic instruction sequences, JCP 4(5) (2009), 405–414.

27.

Ahmed

et al., Using spatio-temporal information in API calls with machine learning algorithms for malware detection, in: Proceedings of the 2nd ACM Workshop on Security and Artificial Intelligence, 2009.

28.

Islam

et al., Classification of malware based on integrated static and dynamic features, Journal of Network and Computer Applications 36(2) (2013), 646–656.

29.

Choi

Y.H.

et al., Toward extracting malware features for classification using static and dynamic analysis, in: Computing and Networking Technology (ICCNT), 2012 8th International Conference on, 2012.

30.

Eskandari

Khorshidpour

and Hashemi

, HDM-Analyser: A hybrid analysis approach based on data mining techniques for malware detection, 9(2) (2013), 77–93.

31.

Khalilian

et al., G3MD: mining frequent opcode sub-graphs for metamorphic malware detection of existing families, Expert Systems with Applications 112 (2018), 15–33.

32.

Fan

and Chen

, Malicious sequential pattern mining for automatic malware detection, Expert Systems with Applications 52 (2016), 16–25.

33.

Wagner

et al., A knowledge-assisted visual malware analysis system: design, validation, and reflection of KAMAS, Computers & Security 67 (2017), 1–15.

34.

Farhadi

M.R.

et al., Scalable code clone search for malware analysis, Digital Investigation 15 (2015), 46–60.

35.

Egele

et al., A survey on automated dynamic malware-analysis techniques and tools, ACM Computing Surveys (CSUR) 44(2) (2012), 6.

36.

Stiborek

Pevný

and Rehák

, Multiple instance learning for malware classification, Expert Systems with Applications 93 (2018), 346–357.

37.

Kolbitsch

et al., Effective and Efficient Malware Detection at the End Host, in: USENIX Security Symposium, 2009.

38.

Kabakus

A.T.

and Dogru

I.A.

, An in-depth analysis of Android malware using hybrid techniques, Digital Investigation 24 (2018), 25–33.

39.

Kang

et al., N-gram opcode analysis for android malware detection, arXiv preprint arXiv:1612.01445, 2016.

40.

Saradha

and Education

, Malware Analysis using Profile Hidden Markov Models and Intrusion Detection in a stream learning setting, Supercomputer Education and Research Centre, Indian Institute of Science, 2014.

41.

Wong

and Stamp

, Hunting for metamorphic engines, Journal in Computer Virology 2(3) (2006), 211–229.

42.

Rieck

et al., Learning and classification of malware behavior, in: International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, 2008. Springer.

43.

Huang

and Stamp

, Masquerade detection using profile hidden Markov models, Computers & Security 30(8) (2011), 732–747.

44.

Durbin

et al., Probabilistic models of proteins and nucleic acids, Biol Seq Anal 14 (1998), 164–173.

45.

Stamp

, A revealing introduction to hidden Markov models, Department of Computer Science San Jose State University, 2004.

46.

Ghahramani

, An introduction to hidden Markov models and Bayesian networks, International Journal of Pattern Recognition and Artificial Intelligence 15(1) (2001), 9–42.

47.

Grasso

and Lee

, Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems, Bioinformatics 20(10) (2004), 1546–1556.

48.

Lee

, Generating consensus sequences from partial order multiple sequence alignment graphs, Bioinformatics 19(8) (2003), 999–1008.

49.

Lee

Grasso

and Sharlow

M.F.

, Multiple sequence alignment using partial order graphs, Bioinformatics 18(3) (2002), 452–464.

50.

Lab

, Understanding Partial Order Alignment for Multiple Sequence Alignment, 2017. Available from: http://simpsonlab.github.io/2015/05/01/understanding-poa/.

51.

Canfora

Iannaccone

A.N.

and Visaggio

C.A.

, Static analysis for the detection of metamorphic computer viruses using repeated-instructions counting heuristics, Journal of Computer Virology and Hacking Techniques 10(1) (2014), 11–27.

52.

Fazlali

et al., Metamorphic malware detection using opcode frequency rate and decision tree, International Journal of Information Security and Privacy (IJISP) 10(3) (2016), 67–86.

53.

Govindaraju

, Exhaustive statistical analysis for detection of metamorphic malware, 2010.

54.

Khodamoradi

et al., Heuristic metamorphic malware detection based on statistics of assembly instructions using classification algorithms, in: Computer Architecture and Digital Systems (CADS), 2015 18th CSI International Symposium on, 2015. IEEE.

55.

Lin

and Stamp

, Hunting for undetectable metamorphic viruses, Journal in Computer Virology 7(3) (2011), 201–214.

56.

Bilar

, Fingerprinting malicious code through statistical opcode analysis, in: ICGeS’07: Proceedings of the 3rd International Conference on Global E-Security, 2007.

An advanced profile hidden Markov model for malware detection

Abstract

Keywords

1. Introduction

Contributions

2. Related work

2.1 Static analysis

2.2 Dynamic analysis

2.3 Hybrid approach

3. Background

3.1 Malware detection

3.1.1 Signature-based detection

3.1.2 Behavior-based detection

3.1.3 Statistical-based detection

3.2 Markov chains

3.4 Hidden Markov models

Problem 1: Evaluating

Problem 2: Decoding

Problem 3: Training

3.6.1 Partial order alignment

2 All supplementary materials including codes, datasets and full results are presented in https://iasbs.ac.ir/∼ansari/malware/ aphmm.html.

5.1.1 IDA Pro

5.1.2 PE explorer

5.1.3 Capstone

5.2 Datasets

Table 1 Min, max, and average kits similarities percentage

Footnotes

Acknowledgments

References

²
All supplementary materials including codes, datasets and full results are presented in https://iasbs.ac.ir/∼ansari/malware/ aphmm.html.

Table 1
Min, max, and average kits similarities percentage