Introduction
Maintaining privacy has always been crucial for any information system that collects large amounts of data pertaining to its customers. ‘Data mining’, or ‘knowledge discovery’, involves extracting potentially useful information from a large body of data. On the darker side, the user-friendliness of data mining results jeopardizes the privacy of the data. This can be countered by integrating privacy-preserving mechanisms into data mining tools. Given the increasing need for distributed databases in business environments, Privacy Preserving Distributed Data Mining (PPDDM) becomes imperative. In distributed databases, the dataset may be horizontally or vertically partitioned. In horizontal partitioning, the parties may hold different numbers of objects, each described by the same set of attributes; in vertical partitioning, the parties hold the same set of objects, but each party holds only a subset of the attributes.
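The two partitioning schemes can be illustrated with a minimal sketch. The toy records, the attribute names (`age`, `income`, `visits`), the shared `id` key, and the helper functions below are all hypothetical and chosen only for illustration; they are not part of the approach proposed in this paper.

```python
# Toy dataset (hypothetical): 5 objects, each with 3 attributes plus an id.
dataset = [
    {"id": 1, "age": 34, "income": 52000, "visits": 3},
    {"id": 2, "age": 45, "income": 61000, "visits": 7},
    {"id": 3, "age": 29, "income": 48000, "visits": 1},
    {"id": 4, "age": 52, "income": 75000, "visits": 5},
    {"id": 5, "age": 38, "income": 58000, "visits": 2},
]


def horizontal_partition(rows, split_at):
    """Split by objects: each party gets different rows, all attributes."""
    return rows[:split_at], rows[split_at:]


def vertical_partition(rows, attrs_a, attrs_b, key="id"):
    """Split by attributes: each party gets every row, but only some columns.

    The shared key is an assumption that lets the parties refer to the
    same object without revealing their other attributes.
    """
    party_a = [{key: r[key], **{a: r[a] for a in attrs_a}} for r in rows]
    party_b = [{key: r[key], **{a: r[a] for a in attrs_b}} for r in rows]
    return party_a, party_b


# Horizontal: party A holds 2 objects, party B holds 3; same attributes.
h_a, h_b = horizontal_partition(dataset, 2)

# Vertical: both parties hold all 5 objects; attributes are divided.
v_a, v_b = vertical_partition(dataset, ["age"], ["income", "visits"])
```

In the horizontal case the union of the parties' rows recovers the full dataset, whereas in the vertical case the rows must be joined on the shared key, which is why the two settings generally call for different protocols.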
PPDDM applications can be characterized by two models, viz. the corporate model and the World Wide Web model (Clifton, 2002). In the corporate model, we assume that data is created and held by the participating parties, whereas in the World Wide Web model individuals provide the data in electronic form themselves. In this paper, our focus is on investigating the privacy concerns associated with the corporate model when it is necessary to share the data. Because the records are confidential, privacy policies and laws prevent the parties from pooling their data or revealing it to each other. In such cases, classical data mining solutions cannot be used. Rather, it is necessary to find a solution that enables the parties to collaboratively run the desired data mining algorithms on the union of their databases without ever pooling or revealing their data. The goal of our study is to propose an approach that enables multiple parties to collaboratively perform data mining in the corporate model without jeopardizing the privacy of their data. We focus on detecting malicious behaviour by the parties so as to assure the trustworthiness of the data. In particular, we focus on the clustering application of data mining.
Of the two main approaches to Privacy Preserving Data Mining (PPDM), viz. the randomization-based and the cryptography-based approach, the latter provides a higher level of privacy (Oliveira, 2003; Pederson, 2007). However, the cryptography-based approach is expensive in terms of computational and communication overheads, and so the existing protocols do not scale with respect to dataset size and number of parties (Pederson, 2007). Therefore, the chief concern in designing such protocols must be to minimize the overheads incurred in their design and implementation. In this paper, we address this issue. We focus on the K-Means clustering algorithm and propose a cryptography-based approach for distributed K-Means clustering.