Document Similarity Measure for Classification and Clustering using TF-IDF
Author(s):
Neethu K P, Mar Athanasius College of Engineering, Kothamangalam, Kerala; Sony K T, Mar Athanasius College of Engineering, Kothamangalam, Kerala; Aby Abahai T, Mar Athanasius College of Engineering, Kothamangalam, Kerala
Keywords:
term frequency, inverse term frequency, tf-idf, SMTP, k-means clustering
Abstract:
Measuring the similarity between documents is an important operation in text processing. A feature with a larger spread contributes more to the similarity between documents. The feature value can be a term frequency or a relative term frequency, that is, a tf-idf combination. The similarity measure with tf-idf is extended to gauge the similarity between two sets of documents. Instead of counting the differences between features, the proposed system assigns a weight to each feature. In this system, the presence or absence of a feature is more important than the similarity between the feature values of the documents. The measure is applied in several text applications, including label classification and k-means-like clustering. The results show that the performance obtained by the proposed measure is better than that achieved by other measures.
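As a rough illustration of the tf-idf feature weighting mentioned in the abstract, the following minimal Python sketch builds tf-idf vectors for a few tokenized documents and compares them. It is not the measure proposed in the paper: cosine similarity is swapped in as a familiar stand-in, and the function names (`tf_idf_vectors`, `cosine_similarity`), the relative-frequency form of tf, the unsmoothed logarithmic idf, and the sample documents are all assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: relative term frequency times log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, already split into tokens.
docs = [
    "text mining with term weights".split(),
    "text clustering with k-means".split(),
    "football match results".split(),
]
vecs = tf_idf_vectors(docs)
sim_related = cosine_similarity(vecs[0], vecs[1])
sim_unrelated = cosine_similarity(vecs[0], vecs[2])
```

In this sketch the first two documents share the terms "text" and "with", so `sim_related` is positive, while the first and third share no terms and `sim_unrelated` is zero. Vectors like these could also serve as input to a k-means style clustering step.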
Other Details:
Manuscript Id: IJSTEV3I2035
Published in: Volume 3, Issue 2
Publication Date: 01/09/2016
Page(s): 87-91