Introduction
Over the last decades, we have been experiencing an explosion of textual information from sources such as social media, digital books, and online encyclopedias. Natural Language Processing (NLP) techniques have been designed to help users analyze and extract insights from huge amounts of textual data. Innovative Machine Learning (ML) approaches, such as neural networks and deep learning models, have shown significant improvements in many NLP applications (information retrieval, document clustering, etc.).
Text categorization (TC) is a fundamental task in diverse text mining applications such as sentiment analysis (Kim, 2014), question classification (Alami, En-Nahnahi, Zidani, & Ouatik, 2019), information filtering, and topic classification (El-Alami & El Alaoui, 2018). This process consists of assigning a predefined label or category to a textual document. However, building a TC system remains a challenging task for two main reasons: (1) the high dimensionality of the feature space, which decreases the performance of the categorization system; and (2) the existence of redundant and noisy features that mislead the TC results. To address these issues, various feature representation methods have been proposed. The best-known representations are Bag-of-Words (BoW), pLSA, LDA, word embeddings, and doc2vec. BoW (Wang & Manning, 2012) extracts patterns such as unigrams, bigrams, and higher-order n-grams as features by treating text as a collection of independent tokens. However, this method cannot capture the semantics within texts and fails to reflect similarities among words. pLSA (Cai & Hofmann, 2003) and LDA (Hingmire, Chougule, Palshikar, & Chakraborti, 2013) are topic modeling methods that are generally applied to select more discriminative features, but they suffer from inference problems.
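To make the BoW limitation concrete, the following is a minimal, dependency-free sketch (not the authors' implementation) of the Bag-of-Words representation: each document becomes a vector of token counts over a shared vocabulary, so word order and word similarity are discarded entirely.

```python
# Minimal illustrative Bag-of-Words sketch: documents are mapped to
# count vectors over a shared vocabulary. Note that "cat" and "dog"
# end up as unrelated dimensions -- no semantic similarity is captured.
from collections import Counter

def bow_vectors(docs):
    """Return (sorted vocabulary, list of count vectors) for raw documents."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts.get(tok, 0) for tok in vocab])
    return vocab, vectors

vocab, vecs = bow_vectors(["the cat sat on the mat",
                           "the dog sat on the log"])
print(vocab)    # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(vecs[0])  # [1, 0, 0, 1, 1, 1, 2]
```

Each new vocabulary word adds a dimension, which is exactly the high-dimensionality problem noted above.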
More efficient representations such as word embeddings (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013b) and document embeddings (Le & Mikolov, 2014) are language modeling techniques that represent vocabulary words or whole texts as low-dimensional vectors of real numbers learned by neural language models. These representations have shown good performance in Arabic text categorization. However, they ignore the information embedded in lexical databases.
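The following toy example (with hand-made, purely hypothetical embedding values) illustrates why such dense vectors help: cosine similarity between embeddings can reflect word relatedness, whereas any two distinct words in a BoW/one-hot space are equally dissimilar.

```python
# Illustrative only: the embedding values below are invented, not
# learned. Real embeddings would come from a model such as word2vec.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dimensional embeddings (hypothetical values for illustration).
emb = {
    "king":  [0.9, 0.8, 0.1, 0.2],
    "queen": [0.8, 0.9, 0.2, 0.1],
    "apple": [0.1, 0.0, 0.9, 0.8],
}

print(cosine(emb["king"], emb["queen"]))  # high: related words
print(cosine(emb["king"], emb["apple"]))  # low: unrelated words
```

In a learned embedding space, related words receive nearby vectors, which is precisely the similarity information that BoW cannot express.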
While several TC systems have been proposed for other languages (English, French, etc.), Arabic TC still faces numerous difficulties in addition to the challenges discussed above. This can be explained by the complexity of the Arabic language, which is both inflectional and derivational.
In this paper, we explore deep neural models and retrofitting for Arabic text categorization to address the aforementioned shortcomings: the lack of semantics, the high dimensionality of the representation space, and the complexity of the Arabic language. Retrofitting is a graph-based learning technique that uses lexical relational resources to train higher-quality semantic vectors; it is employed here for further enhancement. Deep neural networks, for their part, achieve strong results in many NLP tasks. Convolutional Neural Networks (CNNs) perform well at extracting salient features and enabling deeper models (Kim, 2014; Kalchbrenner, Grefenstette, & Blunsom, 2014). Long Short-Term Memory networks (LSTMs) are a kind of Recurrent Neural Network (RNN) whose cell connections form a directed graph along a sequence; they have demonstrated great capability in modeling the dynamic behavior of sequential data (Hochreiter & Schmidhuber, 1997). The main contributions of this work can be summarized as follows: