Finding Cluster Centroids with Scikit-learn
Cluster analysis is a fundamental task in machine learning, and finding the centroids of clusters is often a key step. Scikit-learn provides powerful tools for clustering, and in this article, we’ll explore how to obtain the cluster centroids using different methods.
K-means Clustering
K-means is a popular clustering algorithm that partitions data points into ‘k’ clusters. It alternates between two steps until the assignments stop changing: assign each point to its nearest centroid, then recompute each centroid as the mean of the points assigned to it.
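The two alternating steps can be sketched in plain NumPy. This is only an illustration of the idea, not scikit-learn's implementation: it assumes a naive initialization (the first two points), a fixed cap of 10 iterations, and it does not handle clusters that become empty.

```python
import numpy as np

# The same six sample points used in the K-means example in this article.
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Assumption: start from the first two points. Real implementations
# usually pick starting centroids more carefully (e.g. k-means++).
centroids = X[:2].astype(float)

for _ in range(10):
    # Assignment step: distance from every point to every centroid,
    # then pick the index of the closest centroid for each point.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Update step: each centroid becomes the mean of its assigned points.
    # (This sketch assumes no cluster ends up empty.)
    new_centroids = np.array(
        [X[labels == k].mean(axis=0) for k in range(len(centroids))]
    )

    # Stop once the centroids no longer move.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)
```

On this small dataset the loop settles after a few passes, with one centroid for the three points in the lower left and one for the three points in the upper right.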
1. Implementing K-means
from sklearn.cluster import KMeans
import numpy as np

# Sample data
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Initialize KMeans with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=0)

# Fit the model to the data
kmeans.fit(X)

# Get the cluster centroids
centroids = kmeans.cluster_centers_

# Print the centroids
print(centroids)
2. Output
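[[1.16666667 1.46666667]
 [7.33333333 9.        ]]
The code above performs K-means clustering with 2 clusters and reads the centroids from the fitted model's cluster_centers_ attribute. Each row is the mean of the points assigned to one cluster: about (1.17, 1.47) for the three points in the lower left and (7.33, 9.0) for the three points in the upper right. Cluster numbering is arbitrary, so the rows may appear in the opposite order.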
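One way to check that the centroids really are per-cluster averages is to recompute them from the fitted labels. The sketch below repeats the sample data so it runs on its own; passing n_init=10 explicitly is a choice made here to avoid depending on the version-specific default, and the test point [0, 0] is just an illustrative value.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# Recompute each centroid as the mean of the points carrying that label.
manual = np.array(
    [X[kmeans.labels_ == k].mean(axis=0) for k in range(kmeans.n_clusters)]
)
matches = np.allclose(manual, kmeans.cluster_centers_)
print(matches)

# predict() assigns a new point to the cluster with the nearest centroid.
new_label = kmeans.predict(np.array([[0.0, 0.0]]))[0]
print(new_label)
```

Row k of cluster_centers_ corresponds to label k in labels_, which is what makes the comparison line up.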
Other Clustering Methods
Scikit-learn also provides other clustering algorithms, such as DBSCAN and Agglomerative Clustering. Unlike KMeans, these estimators do not expose a cluster_centers_ attribute, because they do not use centroids internally. To get centroids, take the labels_ array from the fitted model and average the points that share each label.
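Since the label-then-average pattern is the same for any algorithm that produces a labels array, it can be wrapped in a small helper. The function name centroids_from_labels and the hand-made points below are hypothetical, introduced only for illustration; the helper treats -1 as noise, which is the convention DBSCAN uses.

```python
import numpy as np

def centroids_from_labels(X, labels, noise_label=-1):
    """Return {label: mean of the points with that label}, skipping noise."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    return {
        int(label): X[labels == label].mean(axis=0)
        for label in np.unique(labels)
        if label != noise_label
    }

# Tiny hand-made example: two clusters and one noise point.
points = np.array([[0, 0], [2, 0], [10, 10], [10, 12], [50, 50]])
example_labels = np.array([0, 0, 1, 1, -1])

result = centroids_from_labels(points, example_labels)
print(result)
```

Returning a dictionary keyed by label keeps the mapping explicit, which helps when some labels (such as noise) are skipped.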
1. DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) uses density-based connectivity to identify clusters.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Generate sample data
X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

# Initialize DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)

# Fit the model
dbscan.fit(X)

# Get the cluster labels (noise points are labeled -1)
labels = dbscan.labels_

# Find unique labels (including noise)
unique_labels = np.unique(labels)

# Calculate the centroid of each cluster as the mean of its points
centroids = []
for label in unique_labels:
    if label != -1:  # Ignore noise points
        cluster_points = X[labels == label]
        centroid = np.mean(cluster_points, axis=0)
        centroids.append(centroid)

print(centroids)
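Because DBSCAN does not take a number of clusters, how many centroids you end up with depends on eps and min_samples. The sketch below reruns the same data with a few eps values to show this; the three values are arbitrary choices for illustration, not recommended settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

summary = []
for eps in (0.3, 0.5, 1.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit(X).labels_

    # Cluster ids are the unique labels other than the noise label -1.
    cluster_ids = [label for label in np.unique(labels) if label != -1]
    n_noise = int(np.sum(labels == -1))

    summary.append((eps, len(cluster_ids), n_noise))
    print(f"eps={eps}: {len(cluster_ids)} clusters, {n_noise} noise points")
```

If the cluster count comes out as zero, the centroid list from the loop above will simply be empty, so it is worth checking these numbers before relying on the centroids.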
2. Agglomerative Clustering
Agglomerative Clustering is a hierarchical clustering approach. It builds a hierarchy of clusters by merging smaller clusters into larger ones.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Generate sample data
X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

# Initialize Agglomerative Clustering
agg_clustering = AgglomerativeClustering(n_clusters=3)

# Fit the model
agg_clustering.fit(X)

# Get the cluster labels
labels = agg_clustering.labels_

# Calculate the centroid of each cluster as the mean of its points
centroids = []
for label in range(agg_clustering.n_clusters_):
    cluster_points = X[labels == label]
    centroid = np.mean(cluster_points, axis=0)
    centroids.append(centroid)

print(centroids)
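As an alternative to the manual loop, scikit-learn's NearestCentroid classifier computes one centroid per class when fitted, so fitting it on the cluster labels yields the same averages. This is a sketch of that idea, assuming the default Euclidean metric (where each centroid is the class mean); it repurposes a classifier and is not a dedicated centroid API for hierarchical clustering.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestCentroid

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Treat the cluster labels as classes; centroids_ holds one mean per class,
# in the order given by classes_.
nc = NearestCentroid().fit(X, labels)
centroids = nc.centroids_

# Cross-check against the manual per-label mean.
manual = np.array([X[labels == k].mean(axis=0) for k in nc.classes_])
print(np.allclose(centroids, manual))
```

The fitted object can also assign new points to the nearest centroid via predict, which AgglomerativeClustering itself does not offer.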
Summary
This article illustrated how to retrieve cluster centroids using Scikit-learn. KMeans exposes them directly through cluster_centers_, while DBSCAN and Agglomerative Clustering only return labels, so their centroids are computed by averaging the points in each cluster (and, for DBSCAN, skipping the noise label -1). Remember to choose the appropriate clustering algorithm based on the nature of your dataset and the specific task at hand.