Python: String Clustering with scikit-learn’s DBSCAN, using Levenshtein Distance as Metric

String Clustering with DBSCAN and Levenshtein Distance

Introduction

String clustering aims to group similar strings together. This is crucial in tasks like data cleaning, duplicate detection, and fuzzy search. One powerful approach uses DBSCAN (Density-Based Spatial Clustering of Applications with Noise) from scikit-learn, paired with the Levenshtein distance metric.

Understanding DBSCAN

DBSCAN works by identifying dense regions (clusters) in a dataset. It requires two parameters:

  • eps: Maximum distance between two points to be considered neighbors.
  • min_samples: Minimum number of points required to form a dense region.

Levenshtein Distance

Levenshtein distance measures the minimum number of edits (insertions, deletions, substitutions) required to transform one string into another. It captures the similarity between strings, enabling DBSCAN to cluster based on edit distance.

Implementation

Let’s implement string clustering using DBSCAN and Levenshtein distance in Python:

import numpy as np
from sklearn.cluster import DBSCAN
from nltk.metrics import edit_distance
strings = ['apple', 'appl', 'banana', 'aple', 'mango', 'bananna', 'pineapple']

1. Calculate Distance Matrix

We create a distance matrix where each element represents the Levenshtein distance between two strings:

distances = np.zeros((len(strings), len(strings)))
for i in range(len(strings)):
    for j in range(i, len(strings)):
        distances[i, j] = distances[j, i] = edit_distance(strings[i], strings[j])

2. Apply DBSCAN

We initialize DBSCAN with appropriate eps and min_samples values, and then fit it to the distance matrix:

dbscan = DBSCAN(eps=2, min_samples=2, metric='precomputed')
clusters = dbscan.fit_predict(distances)

3. Visualize Clusters

Finally, we print the cluster assignments for each string:

for i, cluster in enumerate(clusters):
    print(f'String: {strings[i]}, Cluster: {cluster}')

Output

String: apple, Cluster: 0
String: appl, Cluster: 0
String: banana, Cluster: 1
String: aple, Cluster: 0
String: mango, Cluster: -1
String: bananna, Cluster: 1
String: pineapple, Cluster: -1

Explanation

  • “apple”, “appl”, and “aple” belong to cluster 0, as they are similar based on Levenshtein distance.
  • “banana” and “bananna” form cluster 1.
  • “mango” and “pineapple” are outliers (noise) as they are not similar to any other string within the specified eps.

Conclusion

By using DBSCAN with the Levenshtein distance metric, we can effectively cluster similar strings. This technique is particularly useful in scenarios where traditional methods fail to account for minor variations and typos. Remember to adjust eps and min_samples based on your dataset and desired clustering behavior.


Leave a Reply

Your email address will not be published. Required fields are marked *