Background
Entity-based data integration (EBDI) is the process of integrating and rationalizing the collective information associated with the same real-world entities. Each record related to a particular entity may provide only a small portion of the information about that entity, but when it is combined with the information from other records, a more comprehensive picture can emerge. Having multiple values for the same attribute can have both positive and negative implications. When the attribute values agree, confidence that the values are accurate tends to increase. On the other hand, when the values conflict, it raises the question of which value, if any, is correct. The process of resolving these conflicts and deciding which values to keep or discard is sometimes called knowledgebase arbitration (Doerr, 2003; Liberatore, 1995; Revesz, 1993).
The formal description of EBDI extends the algebraic model of entity resolution (ER) proposed by Talburt, Wang, Hess, and Kuo (2007), in which an ER process is defined in terms of an equivalence relation on a set of entity references (Talburt & Hashemi, 2008; Holland & Talburt, 2009; Talburt, 2011). The description begins with the concept of an integration context, which provides an explicit mechanism for describing both entity equivalence (the ER part) and attribute equivalence (the integration part) across a collection of information sources. Both kinds of equivalence must be considered when dealing with entity-based integration.
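The two kinds of equivalence can be illustrated with a minimal sketch. It is not the formal model from the cited papers: the reference identifiers, the match links, the field names in `ATTRIBUTE_MAP`, and the helper `equivalence_classes` are all hypothetical, chosen only to show how pairwise matches induce an equivalence relation (a partition of the references) and how source attributes can be mapped to a shared integration attribute.

```python
from collections import defaultdict

def equivalence_classes(references, matches):
    """Partition references into equivalence classes implied by pairwise
    match links, using a simple union-find (illustrative helper)."""
    parent = {r: r for r in references}

    def find(r):
        # Follow parent pointers to the class representative (path halving).
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in matches:
        parent[find(a)] = find(b)  # merge the two classes

    classes = defaultdict(set)
    for r in references:
        classes[find(r)].add(r)
    return sorted(sorted(c) for c in classes.values())

# Hypothetical entity references, written as (source, record id).
refs = [("S1", 1), ("S1", 2), ("S2", 7), ("S3", 4)]
# Hypothetical match decisions produced by some ER process.
links = [(("S1", 1), ("S2", 7)), (("S2", 7), ("S3", 4))]

# Hypothetical attribute equivalence: (source, field) -> integration attribute.
ATTRIBUTE_MAP = {
    ("S1", "phone"): "phone",
    ("S2", "tel"): "phone",
    ("S3", "phone_no"): "phone",
}

entities = equivalence_classes(refs, links)
```

Here `entities` holds one list of references per integration entity, and `ATTRIBUTE_MAP` records which source fields are treated as the same integration attribute.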
The evaluation of selection operators is best illustrated by example. Table 1 shows an integration context of three sources S1, S2, and S3 for which the entity equivalence relation X creates 10 integration entities. The columns labeled S1, S2, and S3 contain the values contributed by each of these sources for a particular integration attribute that can take on any one of the values “A”, “B”, “C”, “D”, or null (shown as “-” in the table). The column labeled True shows the correct value of this attribute for each of the 10 integration entities, and the final row gives the percentage of entities for which each column holds the correct value.
Table 1. Accuracy of sources and selection operators
Entity | True | S1 | S2 | S3 | Naive | Best | Worst |
1 | B | A | - | - | A | A | A |
2 | C | - | C | A | C | C | A |
3 | A | A | A | D | A | A | D |
4 | B | B | B | C | B | B | C |
5 | D | C | D | D | C | D | C |
6 | C | C | B | - | C | C | B |
7 | D | - | - | D | D | D | D |
8 | B | - | D | B | D | B | D |
9 | A | A | - | B | A | A | B |
10 | B | B | A | C | B | B | A |
Accuracy | 100% | 50% | 40% | 30% | 70% | 90% | 10% |
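The accuracy figures in the last row of Table 1 can be recomputed with a short sketch. The operator definitions are not given in this excerpt, so the three rules below are assumptions inferred from the table columns: `naive` takes the first non-null value in source order S1, S2, S3; `best` returns the true value whenever some source supplies it; `worst` returns the first incorrect value whenever one is available. Under these assumptions, `best` and `worst` require knowledge of the true value, so they act as upper and lower bounds rather than practical operators.

```python
# Rows of Table 1 as (true value, S1, S2, S3); None stands for a null ("-").
ROWS = [
    ("B", "A", None, None),
    ("C", None, "C", "A"),
    ("A", "A", "A", "D"),
    ("B", "B", "B", "C"),
    ("D", "C", "D", "D"),
    ("C", "C", "B", None),
    ("D", None, None, "D"),
    ("B", None, "D", "B"),
    ("A", "A", None, "B"),
    ("B", "B", "A", "C"),
]

def naive(true, values):
    """Assumed rule: first non-null value in source order S1, S2, S3."""
    return next((v for v in values if v is not None), None)

def best(true, values):
    """Assumed rule: the true value if any source supplies it, else naive."""
    return true if true in values else naive(true, values)

def worst(true, values):
    """Assumed rule: first non-null incorrect value, else naive."""
    wrong = [v for v in values if v is not None and v != true]
    return wrong[0] if wrong else naive(true, values)

def source_column(i):
    """Values contributed by source i (0 for S1, 1 for S2, 2 for S3)."""
    return [row[1 + i] for row in ROWS]

def operator_column(op):
    """Value selected by operator op for each integration entity."""
    return [op(true, values) for true, *values in ROWS]

def accuracy(selected):
    """Fraction of entities whose selected value equals the true value."""
    hits = sum(1 for (true, *_), v in zip(ROWS, selected) if v == true)
    return hits / len(ROWS)

for name, column in [
    ("S1", source_column(0)),
    ("S2", source_column(1)),
    ("S3", source_column(2)),
    ("Naive", operator_column(naive)),
    ("Best", operator_column(best)),
    ("Worst", operator_column(worst)),
]:
    print(f"{name}: {accuracy(column):.0%}")
```

A null contribution counts as incorrect in this calculation, which is consistent with the 50%, 40%, and 30% reported for the individual sources.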