Introduction
Over the past two decades, with the digitization of printed documents and the continuous growth in their analysis, Devanagari script and Hindi-based processing systems have become an established area for information searching, extraction and retrieval from text and imaged documents. Although text processing and accurate information retrieval (Puri & Kaushik, 2011; Puri & Kaushik, 2012) from scanned Hindi printed documents (Puri & Singh, 2018) have always been complicated and challenging, the field has achieved a great deal of success in accurate word and character recognition and has attracted considerable research attention in recent years. In this article, a new automated Hindi Printed Document Classification System using Support Vector Machine and Fuzzy logic (HPDC-SF) is introduced, which proves to be an efficient advancement over currently available offline Hindi document processing systems. HPDC-SF is designed to classify scanned printed document images into pre-defined, mutually exclusive categories by using a Support Vector Machine (SVM) for character-level classification and fuzzy matching for document-level classification.
Many Hindi-based processing systems have emerged in recent years through the combination of artificial intelligence (Padhy, 2005), pattern recognition, image processing (Gonzalez & Woods, 2008) and text mining (Han, Kamber, & Pei, 2012) concepts. These systems have contributed substantially to discrete and dynamic real-time application areas in distributed environments. Applications of automatic Hindi text processing systems cover text syntax and semantics, editors, spell checkers, formatters, linguistics-based grammar and vocabulary, convertors, translators, transliteration, summarization, speech recognition with conversion, cross-lingual processing and many other related fields. On the other hand, many methodologies for Hindi text imaged documents have emerged in recent years (Puri & Singh, 2018; Sinha, 2009), covering the extraction and recognition of optical characters, words and lines in multi-script, multi-colored, multi-form, multi-pattern, multi-oriented, multi-font and multi-sized documents. There is therefore a strong need for an advanced imaged document processing and classification system that works beyond Optical Character Recognition (OCR). Such a system needs to build words from the extracted optical characters, gather the image contents, and classify the Hindi printed images optimally. Accuracy estimation is a major and highly critical aspect of such systems, because only correct OCR, correct word building and an effective classifier implementation can lead to accurate classification of Hindi printed images (Puri & Singh, 2018). The application areas of these automated document processing systems include the categorization of Government legal files and security files, the identification of property owners, etc. In addition, they play a major role in separating important text images from non-important ones.
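The steps named above (building words from recognized characters, then classifying the document) can be illustrated with a minimal sketch. It is not the HPDC-SF implementation: it assumes that character recognition has already been done (no SVM is included), it stands in for fuzzy matching with a simple string-similarity score from Python's `difflib`, and the category names, keyword lists, threshold and function names are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical category keyword lists, for illustration only
# (not taken from the article).
CATEGORY_KEYWORDS = {
    "legal": ["न्यायालय", "कानून", "याचिका"],
    "property": ["संपत्ति", "भूमि", "मालिक"],
}


def build_words(chars):
    """Join recognized characters into words; None marks a word gap."""
    words, current = [], []
    for ch in chars:
        if ch is None:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


def fuzzy_membership(word, keywords):
    """Best similarity in [0, 1] between a word and any category keyword."""
    return max(SequenceMatcher(None, word, k).ratio() for k in keywords)


def classify(words, categories=CATEGORY_KEYWORDS, threshold=0.8):
    """Score each category by its thresholded memberships; return the best.

    The threshold and the averaging rule are assumptions of this sketch.
    """
    scores = {}
    for name, keywords in categories.items():
        memberships = [fuzzy_membership(w, keywords) for w in words]
        kept = sum(m for m in memberships if m >= threshold)
        scores[name] = kept / max(len(words), 1)
    best = max(scores, key=scores.get)
    return best, scores


# Characters as a recognizer might emit them, with None as a word gap.
recognized = list("कानून") + [None] + list("याचिका") + [None] + list("भूमि")
label, scores = classify(build_words(recognized))
```

Because similarity is graded rather than exact, a word with a small recognition error can still contribute to its category, which is the motivation for fuzzy rather than exact matching at the document level. Exactly one label is returned per document, consistent with mutually exclusive categories.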
To evaluate the performance measures and efficiency of HPDC-SF, various experiments were performed on different types of Hindi printed images, collected from Government sites, newsletters, novels, magazines, blogs, newspaper cuttings, etc.