Most researchers treat similarity measures as scoring functions that quantify the relationship between a pair of webpages. Scores usually lie between 0 and 1: values near 0 indicate that the two webpages are dissimilar, and values near 1 indicate that they are nearly identical (Smucker et al., 2007). Calado et al. (2006) proposed link-based techniques, in which hyperlinks are used to determine webpage similarity. Wan (2008) stated that similarity measures are central to many important applications, such as searching, clustering, classification and recommendation.
Lin et al. (2006) stated that similarity measures fall broadly into two categories: the text-based approach and the hyperlink-based approach. In the text-based approach, the similarity between webpages is evaluated from their contents. Turney and Pantel (2010) suggested that the most widely used content-based similarity measure in Information Retrieval is the cosine of TF-IDF vectors, which has several issues when applied to the web. The first is scalability: since the web consists of billions of webpages, comparing their full text requires a large amount of storage and a long computation time. The second is accuracy: the measure is imprecise because most webpages are not properly edited.
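To make the content-based measure concrete, the following is a minimal sketch of cosine similarity over TF-IDF vectors. It is an illustration under stated assumptions, not the formulation used by any of the cited authors: the whitespace tokenizer, the smoothed IDF variant, and the function names `tokenize`, `tfidf_vectors` and `cosine` are all choices made here for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real webpages would
    # need HTML stripping, stop-word removal, stemming, and so on.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, so that
    # every term keeps a positive weight even if it occurs everywhere.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (count / len(tokens)) * idf[t] for t, count in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Because TF-IDF weights are non-negative, the resulting score falls in the range 0 to 1, consistent with the scoring convention described above. The sketch also hints at the scalability concern: each page needs a vector over its full vocabulary, and a naive all-pairs comparison grows quadratically with the number of pages.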