Introduction
For the past three decades, malware has posed a continuous threat to networks and systems. Malware can be defined as software or malicious code injected into a target system or network to make the system work abnormally (Christodorescu et al., 2005). Viruses, Trojans, backdoors, worms, rootkits, spyware, and adware are among the forms malware takes. In general usage, any malware is commonly termed a virus, a term first framed by Cohen (1987). Malware is typically designed to damage a system or to gain illegitimate access to it in order to retrieve sensitive information. The type of malware, and of the corresponding anti-malware or malware detection system, depends on the hardware/software platform and the operating system. The main goal of attackers is to infect systems and to morph malware so that it evades malware detectors.
At present, most systems use signature-based methods to identify malicious code. This technique relies on a database of expressions or byte sequences that are considered malicious. Malware is detected and an alert is triggered if the signature of the screened code or program matches an entry in the database. The major drawback of this technique is that the database must be updated continually: malicious code whose sequence is not yet in the database enters the system undetected, which poses a major threat. Recent work has shown that metamorphism and polymorphism are employed for code obfuscation to successfully evade detection of viruses and other malware (Christodorescu et al., 2005).
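The matching step described above can be illustrated with a minimal sketch. It assumes a signature is simply a byte sequence looked up in the screened data; the signature names and values below are hypothetical placeholders, not entries from any real anti-malware database, and production scanners use far more elaborate matching.

```python
# Hypothetical signature database: name -> byte sequence.
# The values are illustrative placeholders, not real malware signatures.
SIGNATURES = {
    "demo_sig_a": bytes.fromhex("deadbeef"),
    "demo_sig_b": b"EVIL_PAYLOAD",
}


def scan(data: bytes, signatures=SIGNATURES) -> list:
    """Return the sorted names of all signatures that occur in `data`."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)


if __name__ == "__main__":
    sample = b"header" + bytes.fromhex("deadbeef") + b"tail"
    print(scan(sample))           # the known sequence is found
    print(scan(b"EVIL_PAYL0AD"))  # a one-byte variation is not
```

In this sketch, changing a single byte of a known sequence is enough to avoid a match, which is the weakness that polymorphic and metamorphic code exploits.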
Malware identification is of three types: static analysis, real-time analysis, or a mixture of both. Detection methods in static analysis are categorized as signature-based, heuristic-based, and behavior-based. Figure 1 shows a taxonomy of malware detection techniques. Amro and Ali (2016) describe the signature-based technique as the most commonly used, since it produces a very low error rate.
Figure 1. A taxonomy of malware detection techniques
In this work, features extracted from the PE header of Windows executable files are used. The PE header has four sections embedded within it (Liao, 2012). Figure 2 shows a snapshot of an executable file’s PE header as viewed in a hex editor.
Figure 2. Snapshot of an executable file’s PE header in hex editor
The following features can be extracted from the different headers within the PE header.
Table 1 describes the features that can be identified from the DOS header. The feature e_magic is the most basic one: the file begins with the hex bytes 4D 5A, the ASCII characters ‘MZ’ (Zatloukal & Znoj, 2017), which indicate that the file is an MS-DOS executable file.
Table 1. Features extracted from DOS header
Feature | Description | Type |
e_magic | Magic number | Numeric |
e_cblp | Bytes on the last page of the file | Numeric |
e_cp | Total number of pages in the file | Numeric |
e_cparhdr | Header size in paragraphs | Numeric |
e_maxalloc | Maximum number of extra paragraphs required | Numeric |
e_sp | Initial SP (stack pointer) value | Numeric |
e_lfanew | File address of the new exe header | Numeric |
e_csum | Checksum value | Numeric |
e_minalloc | Minimum number of extra paragraphs required | Numeric |
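As an illustration, the DOS header features in Table 1 could be read with Python's standard `struct` module. This is a minimal sketch, not the extraction code used in this work: it assumes the standard 64-byte `IMAGE_DOS_HEADER` layout (thirty little-endian 16-bit words followed by a 32-bit `e_lfanew`), and the names `extract_dos_features` and `FIELD_INDEX` are hypothetical. Because the fields are little-endian, the on-disk bytes 4D 5A are read as the 16-bit value 0x5A4D.

```python
import struct

# Assumed layout of IMAGE_DOS_HEADER: 30 little-endian 16-bit words,
# then a 32-bit signed e_lfanew, for 64 bytes in total.
DOS_HEADER_FORMAT = "<30Hl"
DOS_HEADER_SIZE = struct.calcsize(DOS_HEADER_FORMAT)

# Position of each Table 1 feature within the unpacked tuple.
FIELD_INDEX = {
    "e_magic": 0,
    "e_cblp": 1,
    "e_cp": 2,
    "e_cparhdr": 4,
    "e_minalloc": 5,
    "e_maxalloc": 6,
    "e_sp": 8,
    "e_csum": 9,
    "e_lfanew": 30,
}

MZ_MAGIC = 0x5A4D  # bytes 4D 5A ('MZ') read as a little-endian word


def extract_dos_features(data: bytes) -> dict:
    """Return the Table 1 features from the first 64 bytes of `data`."""
    if len(data) < DOS_HEADER_SIZE:
        raise ValueError("data is shorter than a DOS header")
    fields = struct.unpack_from(DOS_HEADER_FORMAT, data, 0)
    features = {name: fields[i] for name, i in FIELD_INDEX.items()}
    if features["e_magic"] != MZ_MAGIC:
        raise ValueError("missing 'MZ' magic number")
    return features


if __name__ == "__main__":
    # Synthetic header for demonstration; the field values are made up.
    header = struct.pack(
        DOS_HEADER_FORMAT,
        MZ_MAGIC, 0x90, 3, 0, 4, 0, 0xFFFF, 0, 0xB8, 0,
        *([0] * 20),
        0x80,
    )
    print(extract_dos_features(header))
```

For a real file, the same function could be applied to the first 64 bytes read in binary mode, for example `extract_dos_features(open(path, "rb").read(64))`, where `path` is a placeholder.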