Currently, log anomaly detection involves several distinct processing steps. In this section, we review some notable practices associated with each step.
Log Parsing
The raw log file has a semi-structured text format and cannot be used directly for machine learning and data mining. To enable effective analysis, the raw logs must first be preprocessed by log parsing, which extracts key information and eliminates redundant events and irrelevant elements. The traditional approach is to parse logs with handcrafted regular expressions, which is time-consuming and difficult to maintain in practice. Several more effective methods for log parsing exist. One category uses similarity-based clustering, computing distances between logs and grouping them by similarity; representative methods include LKE (Fu et al., 2009), LogSig (Tang et al., 2011), LogMine (Hamooni et al., 2016), and SHISO (Mizutani, 2013). Another category is frequency-based clustering, which includes approaches such as LFA (Nagappan & Vouk, 2010), SLCT (Vaarandi, 2003), and LogCluster (Vaarandi & Pihelgas, 2015); these methods group log items into clusters based on the frequency of their occurrence. The third category comprises heuristic methods that use specific data structures to parse logs into multiple templates. Representative techniques include FT-Tree (Zhang et al., 2017), Drain (He et al., 2017), Spell (Du & Li, 2016), and LogStamp (Tao et al., 2022), which employ heuristic rules and data structures to identify common log patterns and generate templates for log parsing.
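The core idea shared by these parsers, separating the constant template of a log message from its variable fields, can be illustrated with a minimal sketch. The masking rules and placeholder names below are hypothetical choices for illustration; practical parsers such as Drain learn templates from the data rather than relying on a fixed rule set.

```python
import re

# Hypothetical masking rules: each pattern replaces one kind of
# variable field with a fixed placeholder token.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),   # IPv4 addresses
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),           # hex values
    (re.compile(r"\b\d+\b"), "<NUM>"),                      # integers
]

def to_template(line: str) -> str:
    """Replace variable fields in a raw log line with placeholders,
    yielding a structured template usable for downstream mining."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line

logs = [
    "Connection from 10.0.0.5 port 22",
    "Connection from 10.0.0.9 port 22",
]
# Both raw lines collapse to the single template
# "Connection from <IP> port <NUM>".
templates = {to_template(line) for line in logs}
```

Grouping raw lines by their extracted template is what turns the semi-structured log stream into the structured event sequence that the anomaly detection steps below operate on.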
By employing these log parsing techniques, raw log data can be transformed into a structured format suitable for machine learning and data mining tasks, facilitating efficient analysis and knowledge extraction from log files.