Choosing the indexing unit is still a challenging problem in Arabic IR (Elayeb & Bounhas, 2015). Indexing surface words yields high precision but low recall, because Arabic is highly derivational and agglutinative. Furthermore, this choice requires more storage space and processing time (Aljlayl & Frieder, 2002). Other types of indexing units, such as stems and roots, have therefore given Arabic IR systems better performance. To better understand this problem, we present an example in section 1. We summarize our contribution in section 2.
1.1. Motivation
Let us consider the Arabic root “kassama-قسم” and some related words (cf. Table 1). On the one hand, root-based methods map many derived words to the same form, which causes ambiguity and reduces precision. In our example, “inkassama-انقسم” (was divided) and “akssama-اقسم” (swear) are represented by the same index, i.e. “kassama-قسم” (divide).
Table 1. Some Arabic words related to the root “kassama-قسم”
انقسم inkassama | استقسم istakssama | اقتسم iktassama | اقسم akssama | قاسم kaassama | انقسام inkissam | قسمة kisma | الانقسامات alinkissamat |
Was divided | Conjure | Share somebody in something | Swear | Share something with somebody | Division | Division | The divisions |
Stem indexes reduce ambiguity (Ayed, 2014; Bounhas et al., 2015). On the other hand, they reduce recall: some studies (Al-Kabi et al., 2011) showed that, in most cases, the morphological variants of a word have similar semantic interpretations, so different word forms may bear similar meanings. When we apply light stemming to the same example, “alinkissamat-الانقسامات” (the divisions) and “inkissam-انقسام” (division) will have the same stem, but their semantic relation with “kisma-قسمة” (division) will be ignored.
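The contrast between the two indexing units can be sketched on a few words of Table 1. This is a minimal illustration, not an actual stemmer or root extractor: the stem labels are assigned by hand, and only the shared stem of “alinkissamat” and “inkissam” is taken from the text; the remaining stems and the helper `build_index` are assumptions made for the example.

```python
from collections import defaultdict

# Hand-assigned word -> stem pairs, using the transliterations of Table 1.
# Only the shared stem of "alinkissamat" and "inkissam" comes from the text;
# the other stems are illustrative assumptions, not real stemmer output.
STEMS = {
    "alinkissamat": "inkissam",
    "inkissam": "inkissam",
    "kisma": "kisma",
    "inkassama": "inkassama",
    "akssama": "akssama",
}
ROOT = "kassama"  # all the words above are related to this root


def build_index(words, unit_of):
    """Group words under the indexing unit returned by unit_of (hypothetical helper)."""
    index = defaultdict(set)
    for word in words:
        index[unit_of(word)].add(word)
    return dict(index)


# Root indexing: every word falls under one entry (ambiguity, lower precision).
root_index = build_index(STEMS, lambda word: ROOT)

# Stem indexing: entries are finer, but "kisma" is cut off from "inkissam"
# even though both mean "division" (lower recall).
stem_index = build_index(STEMS, STEMS.get)

for unit, words in stem_index.items():
    print(unit, sorted(words))
```

Under the root index, a query on “akssama” (swear) would also retrieve documents about division; under the stem index, a query on “kisma” would miss documents containing “inkissam”.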
Indeed, light stemming methods offer more precision but less recall, whereas lemmatization achieves better recall at the cost of precision. Combining these techniques therefore seems promising, as it strikes a compromise between precision and recall. With such a combination, we ensure that “alinkissamat-الانقسامات” (the divisions) is treated as equivalent to “inkissam-انقسام” (division), close to “inkassama-انقسم” (was divided), slightly different from “iktassama-اقتسم” (share somebody in something) but not totally different from “kaassama-قاسم” (share something with somebody), and that “akssama-اقسم” (swear) and “istakssama-استقسم” (conjure) are somehow related (cf. Figure 1).
Figure 1. Hybrid index multi-level structure example for the root “kassama-قسم”
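The graded relatedness described above can be approximated by a small multi-level lookup. The sketch below is only one possible reading of the idea: the four levels (root, sense group, lemma, stem), the path assigned to each word, and the `similarity` function are all hypothetical choices made for illustration and are not taken from Figure 1.

```python
# Hypothetical path for each word: (root, sense group, lemma, stem).
# The level names and every assignment below are assumptions for illustration.
PATHS = {
    "alinkissamat": ("kassama", "divide", "inkassama", "inkissam"),
    "inkissam":     ("kassama", "divide", "inkassama", "inkissam"),
    "inkassama":    ("kassama", "divide", "inkassama", "inkassama"),
    "iktassama":    ("kassama", "divide", "iktassama", "iktassama"),
    "kaassama":     ("kassama", "divide", "kaassama", "kaassama"),
    "akssama":      ("kassama", "swear", "akssama", "akssama"),
    "istakssama":   ("kassama", "swear", "istakssama", "istakssama"),
}


def shared_levels(a, b):
    """Count how many levels two words share, starting from the root."""
    count = 0
    for level_a, level_b in zip(PATHS[a], PATHS[b]):
        if level_a != level_b:
            break
        count += 1
    return count


def similarity(a, b):
    """Fraction of shared levels: 1.0 for the same stem, 0.25 for the same root only."""
    return shared_levels(a, b) / len(PATHS[a])


for other in ("inkissam", "inkassama", "iktassama", "akssama"):
    print("alinkissamat", other, similarity("alinkissamat", other))
```

With these assumed paths, the score decreases as two words share fewer levels, which mirrors the ordering given in the text: same stem first, then same lemma, then same sense group, then same root only.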