Implementing K-Means++ Algorithm

K-Means++ is an improved variant of the traditional K-Means clustering algorithm. It addresses the main weakness of purely random centroid initialization, which can lead to slow convergence and suboptimal clustering results. This article walks through an implementation of the K-Means++ algorithm in Python.

Algorithm Overview

1. Initialization

  • Choose the number of clusters, k.
  • Randomly select the first centroid from the data points.
  • For each subsequent centroid, calculate the distance from each data point to its nearest existing centroid.
  • Select the next centroid from the data points with probability proportional to each point's squared distance to its nearest existing centroid (so-called D² weighting).
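The D²-weighted selection in the last two bullets can be sketched on its own. The helper below is illustrative (the `dsq_probs` name is mine, not from the article); it computes the selection probabilities for the next centroid given the centroids chosen so far:

```python
import numpy as np

def dsq_probs(X, chosen):
    """Selection probabilities for the next K-Means++ centroid.

    Each point's weight is its squared Euclidean distance to the
    nearest already-chosen centroid, normalized to sum to 1.
    """
    X = np.asarray(X, dtype=float)
    chosen = np.asarray(chosen, dtype=float)
    # Squared distances from every point to every chosen centroid,
    # shape (n_samples, n_chosen), via broadcasting.
    d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2)
    weights = d2.min(axis=1)
    return weights / weights.sum()

X = [[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]]
probs = dsq_probs(X, [X[0]])  # suppose X[0] was picked uniformly at random
# A point that is already a centroid gets probability 0; distant points dominate.
```

Note that points coinciding with an existing centroid can never be picked again, while outlying points are strongly favored.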

2. Cluster Assignment

  • For each data point, calculate the distance to all centroids.
  • Assign the data point to the cluster of the nearest centroid.
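Step 2 can be vectorized with NumPy broadcasting instead of looping over points. A minimal sketch (the `assign_labels` name is mine) that assigns every point to its nearest centroid at once:

```python
import numpy as np

def assign_labels(X, centroids):
    """Index of the nearest centroid for every data point."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # (n_samples, k) matrix of squared distances via broadcasting.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

X = [[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]]
labels = assign_labels(X, [[1, 1], [8, 9]])
# → array([0, 0, 1, 1, 0, 1])
```

Squared distances suffice here, since argmin is unchanged by the monotone square root.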

3. Centroid Update

  • Recalculate the mean of all data points in each cluster.
  • Update the centroids to the new mean values.
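Step 3 is just a per-cluster mean. A minimal sketch (the `update_centroids` name is mine), assuming no cluster ends up empty:

```python
import numpy as np

def update_centroids(X, labels, k):
    """New centroid j = mean of the points currently assigned to cluster j.

    Assumes every cluster is non-empty; production code should handle
    empty clusters (e.g. by re-seeding them from a distant point).
    """
    X = np.asarray(X, dtype=float)
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

X = [[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]]
labels = np.array([0, 0, 1, 1, 0, 1])
centroids = update_centroids(X, labels, 2)
```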

4. Repeat

  • Repeat steps 2 and 3 until convergence (no significant change in cluster assignments).

Python Implementation

import numpy as np
from sklearn.metrics import pairwise_distances


def kmeans_pp(X, k, max_iter=100):
    """Implements the K-Means++ algorithm.

    Args:
        X: A numpy array of data points, shape (n_samples, n_features).
        k: The number of clusters.
        max_iter: The maximum number of iterations.

    Returns:
        A tuple containing:
        - centroids: A numpy array of cluster centroids, shape (k, n_features).
        - labels: A numpy array of cluster assignments for each data point.
    """
    n_samples = X.shape[0]
    centroids = np.zeros((k, X.shape[1]))

    # Initialize the first centroid randomly
    centroids[0] = X[np.random.choice(n_samples)]

    # Initialize the remaining centroids using K-Means++ (D^2 weighting)
    for i in range(1, k):
        distances = pairwise_distances(X, centroids[:i])
        probs = distances.min(axis=1) ** 2
        probs /= np.sum(probs)
        centroids[i] = X[np.random.choice(n_samples, p=probs)]

    # Run the standard K-Means iterations
    labels = np.zeros(n_samples, dtype=int)
    for _ in range(max_iter):
        # Copy, not alias: `prev_labels = labels` would make the
        # convergence check always succeed after the assignment step.
        prev_labels = labels.copy()

        # Assign every point to its nearest centroid
        distances = pairwise_distances(X, centroids)
        labels = distances.argmin(axis=1)
        if np.array_equal(labels, prev_labels):
            break

        # Move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                centroids[j] = np.mean(X[labels == j], axis=0)
    return centroids, labels


# Example usage
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
k = 2
centroids, labels = kmeans_pp(X, k)
print("Centroids:")
print(centroids)
print("\nLabels:")
print(labels)

Output

Centroids:
[[1.16666667 1.46666667]
 [7.33333333 9.        ]]

Labels:
[0 0 1 1 0 1]

(The cluster indices may be swapped between runs, since the initialization is random.)
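For real workloads, note that scikit-learn's `KMeans` already uses this seeding via `init='k-means++'` (its default); the hand-rolled version above is for illustration. A quick equivalent:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# n_init independent seedings are run and the best result is kept,
# which further reduces sensitivity to the random initialization.
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
print(km.labels_)
```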

Conclusion

K-Means++ provides a more robust and effective method for initializing centroids in the K-Means algorithm. By selecting centroids with consideration for their distances to existing centroids, it improves the quality of clustering results and reduces the risk of getting stuck in local optima. The Python implementation demonstrates the practicality and effectiveness of this algorithm.
