Introduction
File systems are important sources of confidential and private information. Various types of data (e.g., documents, audio, videos, and pictures) are stored in file systems. Data are often intentionally deleted or unintentionally lost, and therefore different methods of file recovery have been developed for various purposes, such as digital forensics and file rescue.
Depending on whether they use file system metadata, existing file recovery approaches fall into two categories: metadata-based file recovery (MFR) (Dewald & Seufert, 2017; Fairbanks, 2012; Jo et al., 2018; Kim et al., 2021; Lee et al., 2020; Lee & Shon, 2014) and carving-based file recovery (CFR) (Garfinkel, 2007; Garfinkel & McCarrin, 2015; Gladyshev & James, 2017; Golden & Vassil, 2005; Hand et al., 2012; Pal et al., 2003; Pal et al., 2008; Tang et al., 2016). MFR is fast and accurate because it uses file system metadata to interpret user data. However, MFR does not work if the metadata are missing or corrupted. In contrast, CFR does not rely on metadata. It restores files using syntactic signatures (e.g., file header-footer pairs) (Tang et al., 2016), semantic structures (e.g., explicit control flow paths within a binary executable) (Hand et al., 2012), heuristic techniques (Garfinkel & McCarrin, 2015; Gladyshev & James, 2017; Pal et al., 2008), timestamps (Nordvik et al., 2020; Portera et al., 2021), or deep learning techniques (Heo et al., 2019; Mohammad & Alqahtani, 2019). Whereas MFR can precisely recover a file under the “direct guidance” of metadata, CFR must “indirectly infer” which data blocks belong to the file to be recovered. As a result, CFR suffers from problems such as false positives and higher time overhead. In summary, MFR and CFR each have advantages and disadvantages. They complement each other, and neither can replace the other.
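To make the header-footer idea concrete, the following is a minimal sketch of signature-based carving. It is not taken from any of the cited works. It assumes the well-known JPEG start and end markers, assumes that each file is stored contiguously, and uses a hypothetical function name (`carve_jpegs`) and a hypothetical size limit (`max_size`). The sketch also suggests why CFR is prone to false positives: any byte run that happens to match a signature yields a candidate, whether or not it is a real file.

```python
from typing import List, Tuple

# Well-known JPEG signatures: start-of-image prefix and end-of-image marker.
JPEG_HEADER = b"\xff\xd8\xff"
JPEG_FOOTER = b"\xff\xd9"


def carve_jpegs(image: bytes, max_size: int = 10 * 1024 * 1024) -> List[Tuple[int, int]]:
    """Return (start, end) byte offsets of candidate JPEG files in a raw image.

    Hypothetical helper for illustration only. It assumes contiguous files
    and does not validate the bytes between header and footer.
    """
    candidates = []
    pos = 0
    while True:
        start = image.find(JPEG_HEADER, pos)
        if start == -1:
            break  # no more headers in the image
        # Look for the first footer after the header, within max_size bytes.
        end = image.find(JPEG_FOOTER, start + len(JPEG_HEADER), start + max_size)
        if end == -1:
            pos = start + 1  # header without a footer: skip it and keep scanning
            continue
        end += len(JPEG_FOOTER)  # make the end offset exclusive
        candidates.append((start, end))
        pos = end
    return candidates


# Synthetic "disk image": one complete header-footer pair and one orphan header.
demo = (b"\x00" * 16 + JPEG_HEADER + b"abc" + JPEG_FOOTER
        + b"\x11" * 8 + JPEG_HEADER + b"zz")
print(carve_jpegs(demo))
```

On the synthetic buffer, only the complete header-footer pair is reported, and the orphan header is skipped. A practical carver would additionally have to handle fragmented files and validate the carved content, which is where the heuristic and semantic techniques mentioned above come in.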
Although MFR and CFR have been studied extensively, open issues remain for both. A critical one is that most existing approaches rely heavily on services provided by an operating system (OS) (Fairbanks, 2012; Garfinkel, 2007; Garfinkel & McCarrin, 2015; Golden & Vassil, 2005; Hand et al., 2012; Jo et al., 2018; Kim et al., 2021; Lee et al., 2020; Lee & Shon, 2014; Pal et al., 2003; Pal et al., 2008; Tang et al., 2016). However, in many cases an OS is not available. For example, when a hard disk fails, or when it has been sanitized according to the U.S. federal guideline NIST SP 800-88 (Kissel et al., 2014), the disk can neither be mounted on another machine and accessed through that machine's OS nor boot its own OS. This renders the existing approaches (Fairbanks, 2012; Garfinkel, 2007; Garfinkel & McCarrin, 2015; Golden & Vassil, 2005; Hand et al., 2012; Jo et al., 2018; Kim et al., 2021; Lee et al., 2020; Lee & Shon, 2014; Pal et al., 2003; Pal et al., 2008; Tang et al., 2016) unusable.