Search the World's Largest Database of Information Science & Technology Terms & Definitions
InfInfoScipedia LogoScipedia
A Free Service of IGI Global Publishing House
Below please find a list of definitions for the term that
you selected from multiple scholarly research resources.

What is Ngram (N-Gram)

Machine Learning for Societal Improvement, Modernization, and Progress
It is the general name given to consecutive arrays consisting of n elements. In the context of NLP and computational linguistics, the elements that make up n-grams can be selected as words, syllables, phonemes or letters in a spoken text or written text. The variable expressed with n represents the value for which the repetition is controlled, while the gram corresponds to the weight of this repeated value in the array.
Published in Chapter:
Deep Learning for Information Extraction From Digital Documents: An Innovative Approach to Automatic Parsing and Rich Text Extraction From PDF Files
Yavuz Kömeçoğlu (Kodiks Bilişim, Turkey), Serdar Akyol (Kodiks Bilisim, Turkey), Fethi Su (Kodiks Bilisim, Turkey), and Başak Buluz Kömeçoğlu (Gebze Technical University, Turkey)
DOI: 10.4018/978-1-6684-4045-2.ch009
Abstract
Print-oriented PDF documents are excellent at preserving the position of text and other objects but have difficulties in processing. Processable PDF documents will provide solutions to the unique needs of different sectors by paving the way for many innovations such as searching within documents, linking with different documents, or restructuring in a format that will increase the reading experience. In this chapter, a deep learning-based system design is presented that aims to export clean text content, separate all visual elements, and extract rich information from the content without losing the integrated structure of content types. While the F-RCNN model using the Detectron2 library was used to extract the layout, the cosine similarities between the wod2vec representations of the texts were used to identify the related clips, and the transformer language models were used to classify the clip type. The performance values on the 200-sample data set created by the researchers were determined as 1.87 WER and 2.11 CER in the headings and 0.22 WER and 0.21 CER in the paragraphs.
Full Text Chapter Download: US $37.50 Add to Cart
eContent Pro Discount Banner
InfoSci OnDemandECP Editorial ServicesAGOSR