Personalized Content Extraction and Text Classification Using Effective Web Scraping Techniques

Karthikeyan T., Karthik Sekaran, Ranjith D., Vinoth Kumar V., Balajee J M
Copyright: © 2019 | Pages: 12
DOI: 10.4018/IJWP.2019070103

Web scraping is a technique for automatically extracting information from web documents. It retrieves content relevant to a query, then aggregates the data and transforms it from an unstructured format into a structured representation. Text classification is a vital phase for summarizing the data and categorizing webpages adequately. In this article, data is first extracted from websites using effective web scraping methodologies and then transformed into a structured form. The documents are classified and labeled based on keywords from the data. A recursive feature elimination technique is applied to the data to select the best candidate feature subset. The final dataset is then used to train standard machine learning algorithms. The proposed model performs well at classifying documents from the extracted data, with a better accuracy rate.
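
The first steps described in the abstract (extracting page content, transforming it into a structured form, and labeling documents by keyword) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names PageExtractor, to_record, label_record, the KEYWORDS table, the sample page, and the example.com URL are all hypothetical, and a simple keyword count stands in for whatever labeling rule the article actually uses. Only the Python standard library is assumed.

```python
from html.parser import HTMLParser
import re


class PageExtractor(HTMLParser):
    """Collect the <title> text and the visible text of one HTML page."""

    SKIP = {"script", "style"}  # tags whose content is not visible text

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style content
        if self._in_title:
            self.title_parts.append(data)
        else:
            self.text_parts.append(data)


def to_record(url, html):
    """Turn unstructured HTML into a structured record (a plain dict)."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.title_parts).split())
    text = " ".join(" ".join(parser.text_parts).split())
    return {"url": url, "title": title, "text": text}


# Hypothetical keyword lists; a real system would derive these from the data.
KEYWORDS = {
    "sports": {"match", "team", "score"},
    "technology": {"software", "browser", "algorithm"},
}


def label_record(record, keywords=KEYWORDS):
    """Label a record with the category whose keywords occur most often."""
    tokens = re.findall(r"[a-z]+", (record["title"] + " " + record["text"]).lower())
    counts = {
        label: sum(tok in words for tok in tokens)
        for label, words in keywords.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "unlabeled"


# Hypothetical sample page used in place of a live HTTP request.
SAMPLE_HTML = """<html><head><title>Team wins final match</title>
<style>p { color: red; }</style></head>
<body><p>The home team improved its score late in the match.</p>
<script>var software = 1;</script></body></html>"""

record = to_record("https://example.com/news/1", SAMPLE_HTML)
record["label"] = label_record(record)
print(record)
```

For the sample page the record is labeled "sports", since the word "software" appears only inside a script element and is skipped. In a fuller pipeline the structured records would then be vectorized and passed to a feature selection step such as recursive feature elimination before training a classifier.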

2. Background

Liang et al. (2008) developed a personalized recommendation system that provides customized content based on the user’s searches. Similarly, news categorization has been personalized for different groups of target users with scalable document classification techniques (Ioannis et al., 2006).

Collaborative tagging improves the outcomes of the keyword extraction process. A content-based tagging system represents the capabilities of search systems (Nirmala et al., 2010). A personalized blog recommendation system was developed for mobile phone users (Chiu et al., 2018). User history and browsing content are analyzed to provide targeted recommendations.

A personalized web-bot was created to help users view specific content and webpages based on their interests (Jung et al., 2004). This system was developed by fusing collaborative filtering and hybrid content-based filtering techniques. A classification system for web browsing on mobile interfaces was developed with six standard perspectives (Roudaki et al., 2015).

A genetic-algorithm-based document clustering method was proposed to mine text from large amounts of biomedical information (Wahiba et al., 2016). A structured metadata extraction method was deployed to fetch information from scientific studies (Tkaczyk et al., 2015) and is available to researchers under an open source license.
