Python: Using TF-IDF and Cosine Similarity to Find Document Similarity
This article explores how to use TF-IDF and cosine similarity in Python to determine the similarity between documents.
Introduction
Document similarity is a crucial task in various natural language processing applications. It involves comparing documents to determine their degree of similarity. TF-IDF (Term Frequency-Inverse Document Frequency) and cosine similarity are two widely used techniques for this purpose.
TF-IDF
TF-IDF is a weighting scheme that assigns a score to each word in a document based on its frequency in the document and its rarity across a corpus of documents. It consists of two components:
Term Frequency (TF)
The term frequency of a word in a document is the number of times that word appears in the document, often normalized by the total number of words in the document so that longer documents do not dominate.
Inverse Document Frequency (IDF)
The inverse document frequency of a word measures how rare the word is across the entire corpus. It is typically computed as the logarithm of the total number of documents divided by the number of documents that contain the word, so the IDF value is higher for rarer words and lower for more common words.
The TF-IDF score for a word in a document is calculated by multiplying the TF and IDF values.
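As a rough illustration of how the two factors combine, here is a minimal sketch that computes TF-IDF scores by hand for a tiny corpus. The helper names (`compute_tf`, `compute_idf`) are made up for this example, and scikit-learn's `TfidfVectorizer`, used later, applies extra smoothing and L2 normalization, so its scores will differ slightly.

```python
import math
from collections import Counter

def compute_tf(document_tokens):
    # Term frequency: count of each word divided by the document length
    counts = Counter(document_tokens)
    total = len(document_tokens)
    return {word: count / total for word, count in counts.items()}

def compute_idf(corpus_tokens):
    # Inverse document frequency: log(number of documents / documents containing the word)
    num_docs = len(corpus_tokens)
    all_words = set(word for doc in corpus_tokens for word in doc)
    return {
        word: math.log(num_docs / sum(1 for doc in corpus_tokens if word in doc))
        for word in all_words
    }

corpus = [
    "this is the first document".split(),
    "this document is the second document".split(),
]

idf = compute_idf(corpus)
for doc in corpus:
    tf = compute_tf(doc)
    # TF-IDF score: term frequency multiplied by inverse document frequency
    tfidf = {word: tf[word] * idf[word] for word in tf}
    print(tfidf)
```

Words that appear in every document (such as "this" or "the" above) get an IDF of zero and therefore a TF-IDF score of zero, while words unique to one document keep a positive score.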
Cosine Similarity
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space, calculated by taking the cosine of the angle between them: the dot product of the vectors divided by the product of their magnitudes. For non-negative TF-IDF vectors the result ranges from 0 to 1, where 1 means the vectors point in the same direction (identical term distributions) and 0 means they are orthogonal (no terms in common). In the context of document similarity, we compute cosine similarity between the vector representations of the documents.
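A minimal NumPy sketch of this calculation, using two toy term-count vectors made up purely for illustration:

```python
import numpy as np

# Two toy term-frequency vectors over the same vocabulary
a = np.array([1.0, 2.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 1.0, 0.0])

# Cosine similarity: dot product divided by the product of the vector norms
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)  # 3 / (sqrt(6) * sqrt(3)) ≈ 0.707
```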
Python Implementation
Here’s a Python implementation using the scikit-learn library to calculate TF-IDF and cosine similarity for document similarity.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "This is the third document.",
    "This document is the fourth document."
]

# Create a TF-IDF vectorizer
tfidf = TfidfVectorizer()

# Fit the vectorizer to the documents
tfidf.fit(documents)

# Transform the documents into TF-IDF vectors
tfidf_matrix = tfidf.transform(documents)

# Calculate the pairwise cosine similarity between all documents
cosine_sim = cosine_similarity(tfidf_matrix)

# Print the cosine similarity matrix
print(cosine_sim)
```
Output
The output of the above code is a symmetric matrix in which the entry in row i and column j is the cosine similarity between document i and document j; the diagonal entries are 1 because every document is perfectly similar to itself.
| | Document 1 | Document 2 | Document 3 | Document 4 |
|---|---|---|---|---|
| Document 1 | 1.000000 | 0.612372 | 0.408248 | 0.612372 |
| Document 2 | 0.612372 | 1.000000 | 0.408248 | 0.816497 |
| Document 3 | 0.408248 | 0.408248 | 1.000000 | 0.408248 |
| Document 4 | 0.612372 | 0.816497 | 0.408248 | 1.000000 |
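The same fitted vectorizer can also rank the corpus against a new query, which is the basis of simple information retrieval. The following sketch continues from the code above (`tfidf`, `tfidf_matrix`, and `documents` are already defined); the query string is an arbitrary example.

```python
# Transform a new query with the already-fitted vectorizer
query = "the second document"
query_vec = tfidf.transform([query])

# Cosine similarity between the query and every document in the corpus
scores = cosine_similarity(query_vec, tfidf_matrix).flatten()

# Index of the most similar document
best_match = scores.argmax()
print(f"Most similar document: {documents[best_match]} (score={scores[best_match]:.3f})")
```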
Conclusion
TF-IDF and cosine similarity are powerful techniques for determining the similarity between documents. By using them, you can measure how closely documents overlap in their weighted word usage and apply this information to various NLP tasks, such as document clustering, information retrieval, and plagiarism detection.