Visualizing Principal Component Analysis (PCA) in scikit-learn: Loadings and Biplots
Principal Component Analysis (PCA) is a powerful dimensionality reduction technique widely used in data analysis. After performing PCA, visualizing the loadings and creating biplots can provide valuable insights into the relationships between variables and principal components.
Understanding PCA Loadings
PCA loadings are the weights of the original features in each principal component. Every component is a linear combination of the (standardized) features, and a feature's loading is its coefficient in that combination: a large absolute value means the feature strongly influences the component, and the sign gives the direction of that influence. Loadings are often visualized as a bar chart or a heatmap to see which variables matter most for each principal component.
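A note on terminology may help here. scikit-learn's `components_` attribute stores unit-length principal axes (eigenvectors), and this article calls those loadings. Some texts reserve the word for eigenvectors scaled by the square root of the corresponding eigenvalue. The sketch below shows both conventions side by side; the variable names `eigenvectors` and `scaled_loadings` are illustrative and not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2).fit(X_scaled)

# Rows of components_ are unit-length eigenvectors, one row per component.
eigenvectors = pca.components_                     # shape (2, 4)
row_norms = np.linalg.norm(eigenvectors, axis=1)   # each norm is 1 up to rounding

# Assumption: the "scaled loadings" convention, eigenvector * sqrt(eigenvalue).
# explained_variance_ holds the eigenvalues of the covariance matrix.
scaled_loadings = eigenvectors.T * np.sqrt(pca.explained_variance_)  # shape (4, 2)

for name, row in zip(iris.feature_names, scaled_loadings):
    print(f"{name:<20} PC1={row[0]: .3f}  PC2={row[1]: .3f}")
```

Both versions rank the features of a given component in the same order; the scaled version additionally makes coefficients comparable across components, because it reflects how much variance each component carries.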
Creating a Loadings Plot in scikit-learn
Let’s illustrate how to plot loadings using scikit-learn. We’ll use the famous Iris dataset as an example:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import pandas as pd
# Load the Iris dataset
iris = load_iris()
X = iris.data
features = iris.feature_names
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA with 2 components
pca = PCA(n_components=2)
pca.fit(X_scaled)
# Get the loadings: components_ has one row per component and one column per feature
loadings = pca.components_
# Create a dataframe for easier visualization
loadings_df = pd.DataFrame(loadings, columns=features)
loadings_df.index = ['PC1', 'PC2']
# Plot the loadings
# Let pandas create the figure; a separate plt.figure() call would leave an empty figure behind
loadings_df.T.plot(kind='bar', rot=0, figsize=(10, 6))
plt.xlabel('Features')
plt.ylabel('Loadings')
plt.title('PCA Loadings')
plt.legend(title='Principal Components')
plt.show()
In the above code, we first load and scale the Iris dataset. Then, we perform PCA with two components and obtain the loadings. Finally, we create a pandas DataFrame for the loadings and plot them using a bar chart. This visualization shows the contribution of each feature to PC1 and PC2.
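The heatmap alternative mentioned earlier can be sketched in a similar way. This is a minimal, self-contained example using Matplotlib's `imshow`; the colormap, color limits, and figure size are arbitrary choices made for illustration, not recommendations from the original walkthrough.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2).fit(X_scaled)

loadings_df = pd.DataFrame(
    pca.components_, columns=iris.feature_names, index=["PC1", "PC2"]
)

fig, ax = plt.subplots(figsize=(8, 3))
# Symmetric color limits so that zero sits in the middle of the diverging colormap.
image = ax.imshow(loadings_df.values, cmap="coolwarm", vmin=-1, vmax=1, aspect="auto")
ax.set_xticks(range(loadings_df.shape[1]))
ax.set_xticklabels(loadings_df.columns, rotation=20)
ax.set_yticks(range(loadings_df.shape[0]))
ax.set_yticklabels(loadings_df.index)

# Write each coefficient into its cell.
for row in range(loadings_df.shape[0]):
    for col in range(loadings_df.shape[1]):
        ax.text(col, row, f"{loadings_df.iat[row, col]:.2f}", ha="center", va="center")

fig.colorbar(image, ax=ax, label="Loading")
ax.set_title("PCA loadings heatmap")
fig.tight_layout()
# Call plt.show() here when running interactively.
```

A heatmap tends to scale better than grouped bars once there are many features or more than a handful of components.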
Biplots: Visualizing Data and Loadings Together
A biplot combines the principal component scores (the coordinates of each sample in the reduced space) with the loadings. This lets you see both the data points and the relationships between variables in the same plot.
Creating a Biplot in scikit-learn
To create a biplot, we’ll need to calculate the principal component scores and then use Matplotlib’s `plt.scatter` and `plt.arrow` functions:
import matplotlib.pyplot as plt
# Get the principal component scores
scores = pca.transform(X_scaled)
# Create the biplot
plt.figure(figsize=(10, 6))
plt.scatter(scores[:, 0], scores[:, 1], c=iris.target, cmap='viridis', alpha=0.7)
# Plot the loadings: loadings.T has one row per feature holding its (PC1, PC2) coefficients
for i, (x, y) in enumerate(loadings.T):
    plt.arrow(0, 0, x, y, color='k', head_width=0.05, head_length=0.1, alpha=0.8)
    plt.text(x, y, features[i], fontsize=10)
plt.xlabel('PC1 ({}%)'.format(round(pca.explained_variance_ratio_[0] * 100, 2)))
plt.ylabel('PC2 ({}%)'.format(round(pca.explained_variance_ratio_[1] * 100, 2)))
plt.title('Biplot of Iris Dataset')
plt.grid(True)
plt.show()
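In this code, we calculate the principal component scores, plot the data points as a scatter plot colored by species, and then draw one arrow per feature. Each arrow runs from the origin to the point given by that feature's loadings on PC1 (horizontal axis) and PC2 (vertical axis). A long arrow means the feature is well represented in the plane of the first two components, and an arrow that lies close to one axis means the feature contributes mainly to that component.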
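Because the loadings are bounded by 1 in absolute value while the scores can spread over a wider range, the arrows may look short next to the point cloud. One way to handle this is a small helper that multiplies the arrows by a display factor. The function `biplot` and its `arrow_scale` parameter below are hypothetical names introduced for illustration; scikit-learn does not provide a biplot function.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def biplot(scores, components, feature_names, labels=None, arrow_scale=1.0, ax=None):
    """Draw a 2-D biplot (hypothetical helper, not part of scikit-learn).

    scores:      array of shape (n_samples, 2), e.g. pca.transform(X)
    components:  array of shape (2, n_features), e.g. pca.components_
    arrow_scale: display-only factor that lengthens or shortens every arrow
    """
    if ax is None:
        _, ax = plt.subplots(figsize=(10, 6))
    ax.scatter(scores[:, 0], scores[:, 1], c=labels, cmap="viridis", alpha=0.7)
    # components.T has one row per feature: (coefficient on PC1, coefficient on PC2).
    for name, (x, y) in zip(feature_names, components.T):
        ax.arrow(0, 0, arrow_scale * x, arrow_scale * y, color="k",
                 head_width=0.05, length_includes_head=True)
        ax.text(arrow_scale * x * 1.1, arrow_scale * y * 1.1, name, fontsize=10)
    ax.axhline(0, color="grey", linewidth=0.5)
    ax.axvline(0, color="grey", linewidth=0.5)
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    return ax


iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2).fit(X_scaled)
scores = pca.transform(X_scaled)

# arrow_scale=2.0 is an arbitrary choice for readability on this dataset.
ax = biplot(scores, pca.components_, iris.feature_names,
            labels=iris.target, arrow_scale=2.0)
# Call plt.show() here when running interactively.
```

The scaling only affects how the arrows are drawn. Their directions and relative lengths, which carry the interpretation, stay the same.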
Benefits of Visualization
Plotting loadings and creating biplots offers several advantages:
- Data Interpretation: Loadings help understand which variables are most responsible for each principal component, providing insights into the underlying structure of the data.
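- Variable Relationships: Biplots reveal the relationships between variables and how they contribute to the principal components. Arrows pointing in similar directions suggest positively correlated variables, arrows pointing in opposite directions suggest negative correlation, and roughly perpendicular arrows suggest little correlation.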
- Data Exploration: Visualizing the data and loadings together facilitates a comprehensive analysis and exploration of the reduced dimensionality space.
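How far these readings can be trusted depends on how much variance the first two components retain, which is the same quantity shown in the biplot's axis labels. A quick check, sketched here by fitting PCA with all components, prints the per-component and cumulative explained variance ratios; the variable names `pca_full`, `ratios`, and `cumulative` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Keep every component so that the ratios sum to 1.
pca_full = PCA().fit(X_scaled)
ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for i, (ratio, total) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {ratio:.1%} of variance, {total:.1%} cumulative")
```

If the cumulative value for PC2 is low, a two-dimensional loadings plot or biplot hides a substantial part of the structure, and further components are worth inspecting.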
Conclusion
Plotting PCA loadings and biplots from a scikit-learn model gives you deeper insight into the relationships between variables and principal components. A loadings plot shows which features define each component, and a biplot places the samples and the features in the same reduced space, so you can interpret the structure of the data rather than only compress it.