1. Introduction
A large number of modern applications and systems involve transaction processing. These transactions refer to events such as commercial transactions, banking, and the entry or updating of health records. Each such event generates data, known as transactional data, which may be recorded to derive useful information in the future. Examples of such data's utility include generating product recommendations from a user's past purchase history, managing inventory from sales records, detecting fraud from users' financial transactions, and many more. However, publishing or publicly sharing any individual's data may lead to serious privacy implications (Barbaro & Zeller, 2006; Narayanan & Shmatikov, 2008). This is especially true when the data is sensitive, for example, when it contains a user's financial or health records. Further, data privacy is an important facet of data security and needs the utmost attention. This has led to a vast body of research in the domain of privacy-preserving data publishing (Terrovitis et al., 2008; He & Naughton, 2009; Zhang et al., 2012; Kohlmayer et al., 2012). Approaches to privacy-preserving data publishing include data encryption and data anonymization. Data anonymization is a popular approach: it removes the association between records and the individuals to whom they belong, and is thus sufficient to prevent privacy attacks on published data. Therefore, this paper proposes an anonymization-based method for the privacy-preserving publication of transactional data.
Existing techniques such as OLA (Zhang et al., 2012) and Flash (Kohlmayer et al., 2012) are available to anonymize structured or relational data. It has been observed (Puri et al., 2019) that anonymization techniques for relational data do not apply to transactional data due to the lack of structure and the sparseness of the latter. Hence, different models are required to define the privacy-preserving publication of transactional data. A few models, such as complete k-anonymity (He & Naughton, 2009) and k^m-anonymity (Terrovitis et al., 2008), have been developed in the past to define the privacy of transactional data. Complete k-anonymity assumes that every combination of attributes may be sensitive and requires each such combination to occur at least k times. Anonymizing data to achieve complete k-anonymity requires multiple additions or deletions of items from the dataset and thus results in a high amount of information loss. Information loss is said to occur when the anonymized data is no longer useful for statistical analysis and mining purposes, or no longer provides information similar to the original data. In contrast, the k^m-anonymity model assumes that not every combination of attributes is sensitive; it therefore only requires that every m-combination of items occurs at least k times. Since the k^m-anonymity model limits the anonymization to combinations of up to m items, its information loss is low compared to complete k-anonymity, and it is therefore commonly used to protect transactional data from identity disclosure attacks (Terrovitis et al., 2008).
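The k^m-anonymity requirement described above can be sketched as a simple check: count how often each m-sized itemset occurs across the transactions and verify that every observed m-itemset appears in at least k transactions. The following is a minimal illustrative sketch, not the algorithm of Terrovitis et al.; the function name and the handling of transactions shorter than m are our own assumptions.

```python
from itertools import combinations
from collections import Counter

def satisfies_km_anonymity(transactions, k, m):
    """Sketch of a k^m-anonymity check: every m-combination of items
    that occurs in the dataset must occur in at least k transactions.

    `transactions` is a list of item collections (one per transaction);
    transactions with fewer than m distinct items are checked as a whole
    (an assumption of this sketch)."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))          # distinct items, canonical order
        r = min(m, len(items))          # handle short transactions
        for combo in combinations(items, r):
            counts[combo] += 1          # count once per transaction
    return all(c >= k for c in counts.values())

# Example: the itemset {a, b} occurs twice, so k=2, m=2 is satisfied.
print(satisfies_km_anonymity([["a", "b"], ["a", "b"]], k=2, m=2))
# Here {a, c} occurs only once, so the same requirement fails.
print(satisfies_km_anonymity([["a", "b"], ["a", "c"]], k=2, m=2))
```

This brute-force check enumerates all m-itemsets and is only practical for small datasets; the cited work uses generalization hierarchies and pruning to achieve (and not merely verify) k^m-anonymity efficiently.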