Python: TF-IDF and cosine similarity to find document similarity

This article explores how to use TF-IDF and cosine similarity in Python to determine the similarity between documents.

Introduction

Document similarity is a crucial task in various natural language processing applications. It involves comparing documents to determine their degree of similarity. TF-IDF (Term Frequency-Inverse Document Frequency) and cosine similarity are two widely used techniques for this purpose.

TF-IDF

TF-IDF is a weighting scheme that assigns a score to each word in a document based on its frequency in the document and its rarity across a corpus of documents. It consists of two components:

Term Frequency (TF)

The term frequency of a word in a document is the number of times the word appears in that document, often normalized by the total number of words in the document.

Inverse Document Frequency (IDF)

The inverse document frequency of a word measures how rare the word is across the entire corpus of documents. It is typically computed as the logarithm of the total number of documents divided by the number of documents that contain the word, so the IDF value is higher for rarer words and lower for more common ones.

The TF-IDF score for a word in a document is calculated by multiplying the TF and IDF values.
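
To make the scoring concrete, here is a minimal from-scratch sketch of the classic formulas on a toy corpus (the corpus, the tokenization, and the plain log(N / df) form of IDF are illustrative choices; scikit-learn's TfidfVectorizer, used later in this article, applies a smoothed IDF and L2 normalization, so its numbers will differ):

import math

# Toy corpus: three tiny "documents", already tokenized.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

def tf(term, doc):
    # Term frequency: count of the term, normalized by document length.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log(N / df); rarer terms score higher.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "the" occurs in every document, so its TF-IDF is 0;
# "dog" occurs in only one document, so it gets the highest weight.
for term in ["the", "sat", "dog"]:
    print(term, round(tf_idf(term, corpus[1], corpus), 4))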

Cosine Similarity

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space, calculated as the cosine of the angle between them. For TF-IDF vectors, which have no negative components, the value ranges from 0 to 1: a value of 1 means the two documents use the same terms in the same proportions, while 0 means they share no terms at all. In the context of document similarity, each document is represented as a TF-IDF vector and the cosine similarity is computed between these vectors.
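
As a quick numerical illustration of the formula (the dot product of the two vectors divided by the product of their lengths), the snippet below computes the cosine similarity of two made-up term-count vectors:

import numpy as np

# Hypothetical term-count vectors for two short documents.
a = np.array([1, 2, 0, 1])
b = np.array([1, 1, 1, 0])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cosine, 4))  # 0.7071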

Python Implementation

Here’s a Python implementation using the scikit-learn library to calculate TF-IDF and cosine similarity for document similarity.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "This is the third document.",
    "This document is the fourth document."
]

# Create a TF-IDF vectorizer
tfidf = TfidfVectorizer()

# Fit the vectorizer to the documents
tfidf.fit(documents)

# Transform the documents into TF-IDF vectors
tfidf_matrix = tfidf.transform(documents)

# Calculate cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix)

# Print the cosine similarity matrix
print(cosine_sim)

Output

The code prints a symmetric matrix in which the element at row i, column j is the cosine similarity between document i and document j; the diagonal is all ones because every document is identical to itself. Labelled, the matrix looks like this:

             Document 1  Document 2  Document 3  Document 4
Document 1     1.000000    0.612372    0.408248    0.612372
Document 2     0.612372    1.000000    0.408248    0.816497
Document 3     0.408248    0.408248    1.000000    0.408248
Document 4     0.612372    0.816497    0.408248    1.000000
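
Once the similarity matrix is available, looking up the closest match for any document is straightforward. As a small follow-up sketch (reusing the cosine_sim array from the code above), this finds the document most similar to the first one while ignoring the trivial match with itself:

import numpy as np

query_index = 0
scores = cosine_sim[query_index].copy()
scores[query_index] = -1  # exclude the document itself
best_match = int(np.argmax(scores))

print(f"Document most similar to document {query_index + 1}: "
      f"document {best_match + 1} (similarity {cosine_sim[query_index, best_match]:.3f})")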

Conclusion

TF-IDF and cosine similarity are simple but effective techniques for determining the similarity between documents. By combining them, you can measure how similar documents are in their vocabulary and term usage, and apply that information to NLP tasks such as document clustering, information retrieval, and plagiarism detection.

