1. Introduction
A record may be viewed conceptually as consisting of a set of fields. When unique identifiers are unavailable or do not exist in records, determining which records represent the same real-world entity is an important and challenging problem with many applications. For instance, it addresses data quality issues such as "data accuracy, redundancy, consistency, currency and completeness" (Li, Zhang, & Bheemavaram, 2006). Ensuring data quality is becoming a critical issue that impacts organizational performance (Ballou, Wang, & Pazer, 1998; Ballou, 1999; DeLone & McLean, 1992; Redman, 1998). This problem is also referred to in the literature as the record linkage problem (Fellegi & Sunter, 1969; Newcombe, 1988), the data cleaning problem (Do & Rahm, 2002), the object identification problem (Tejada, Knoblock, & Minton, 2001; Tejada, Knoblock, & Minton, 2002), or the entity resolution problem (Benjelloun, Garcia-Molina, Su, & Widom, 2005). All these research efforts deal with the fundamental question of how to effectively identify record "duplicates" when unique identifiers are unavailable or do not exist in records. The main idea is to rely on matching other fields in the records, such as name and address. It is not uncommon for a record in a real data file to have over a hundred fields, so only a relatively small subset of fields is used to carry out the matching. The selected fields are application dependent and are often referred to as keys.
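As a rough illustration of matching on keys, the following minimal Python sketch groups records whose selected key fields agree after simple normalization. The field names, the sample records, and the helpers `normalize`, `match_key`, and `group_duplicates` are all hypothetical, and exact matching on normalized values is a simplifying assumption made here; it is not meant to represent the method of any of the works cited above.

```python
from collections import defaultdict

# Hypothetical choice of key fields; in practice the choice is application dependent.
KEY_FIELDS = ("name", "address")


def normalize(value):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(str(value).lower().split())


def match_key(record, key_fields=KEY_FIELDS):
    """Build a comparison key from the selected fields only; other fields are ignored."""
    return tuple(normalize(record.get(field, "")) for field in key_fields)


def group_duplicates(records, key_fields=KEY_FIELDS):
    """Return lists of record indices whose key fields agree exactly after normalization."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[match_key(record, key_fields)].append(index)
    # Keep only groups with more than one record, i.e. candidate duplicates.
    return [indices for indices in groups.values() if len(indices) > 1]


# Made-up sample records with no shared unique identifier.
records = [
    {"name": "John Smith", "address": "12 Main St", "phone": "555-0100"},
    {"name": "john  smith", "address": "12 main st", "phone": ""},
    {"name": "Jane Doe", "address": "4 Elm Rd", "phone": "555-0199"},
]

duplicate_groups = group_duplicates(records)
```

Here the first two records fall into the same group even though their `phone` fields differ, because only the key fields take part in the comparison. A misspelled name or an abbreviated address would defeat this exact comparison, so the sketch shows only the role of keys, not a complete matching procedure.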