Feature Encoding Techniques in Python: A Comprehensive Guide
In machine learning, categorical features often need to be transformed into numerical representations for models to effectively process them. Python provides several methods for this conversion, including:
pd.factorize
pd.get_dummies
sklearn.preprocessing.LabelEncoder
sklearn.preprocessing.OneHotEncoder
Let’s explore each of these techniques and their applications.
1. pd.factorize: Label Encoding
Functionality:
The pd.factorize function converts categorical features into numerical labels, assigning a unique integer to each distinct category, starting from 0.
Code Example:
import pandas as pd
data = {'City': ['New York', 'London', 'Paris', 'New York', 'London']}
df = pd.DataFrame(data)
# Factorizing the 'City' column
encoded_labels, uniques = pd.factorize(df['City'])
print(f'Encoded Labels: {encoded_labels}')
print(f'Unique Categories: {uniques}')
Output:
Encoded Labels: [0 1 2 0 1]
Unique Categories: ['New York' 'London' 'Paris']
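Because pd.factorize returns both the integer codes and the unique categories, the original values can be recovered by indexing the uniques with the codes. A quick round-trip sketch, continuing from the example above:
# Decode by indexing the unique categories with the integer codes
decoded = uniques[encoded_labels]
print(f'Decoded: {list(decoded)}')
# Decoded: ['New York', 'London', 'Paris', 'New York', 'London']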
2. pd.get_dummies: One-Hot Encoding
Functionality:
pd.get_dummies performs one-hot encoding, creating a new column for each distinct category and assigning a binary value (0 or 1) based on the presence or absence of the category.
Code Example:
import pandas as pd
data = {'City': ['New York', 'London', 'Paris', 'New York', 'London']}
df = pd.DataFrame(data)
# One-hot encoding the 'City' column
df_encoded = pd.get_dummies(df, columns=['City'], prefix='City')
print(df_encoded)
Output:
   City_London  City_New York  City_Paris
0            0              1           0
1            1              0           0
2            0              0           1
3            0              1           0
4            1              0           0
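Note that pandas 2.0+ returns boolean True/False columns by default; pass dtype=int for the 0/1 output shown above. Also, with k categories the k dummy columns are linearly dependent (the "dummy variable trap"), which can matter for linear models. A minimal sketch of both options, continuing from the example above:
# drop_first=True drops the first (alphabetical) category; dtype=int forces 0/1
df_reduced = pd.get_dummies(df, columns=['City'], prefix='City',
                            drop_first=True, dtype=int)
print(df_reduced.columns.tolist())
# ['City_New York', 'City_Paris']  (London becomes the implicit baseline)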
3. sklearn.preprocessing.LabelEncoder: Label Encoding
Functionality:
Similar to pd.factorize, LabelEncoder transforms categorical features into numerical labels, assigning a unique integer to each distinct category, starting from 0. One difference: LabelEncoder sorts the categories alphabetically before assigning integers, whereas pd.factorize numbers them in order of appearance, so the resulting codes can differ.
Code Example:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
data = {'City': ['New York', 'London', 'Paris', 'New York', 'London']}
df = pd.DataFrame(data)
le = LabelEncoder()
encoded_labels = le.fit_transform(df['City'])
print(f'Encoded Labels: {encoded_labels}')
print(f'Class Mapping: {le.classes_}')
Output:
Encoded Labels: [1 0 2 1 0]
Class Mapping: ['London' 'New York' 'Paris']
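Unlike pd.factorize, a fitted LabelEncoder retains its mapping, so it can encode new values consistently and map codes back to labels. Continuing from the example above:
# Encode unseen rows with the learned mapping, then invert some codes
print(le.transform(['Paris', 'London']))   # [2 0]
print(le.inverse_transform([1, 0]))        # ['New York' 'London']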
4. sklearn.preprocessing.OneHotEncoder: One-Hot Encoding
Functionality:
OneHotEncoder also performs one-hot encoding, producing (by default) a sparse matrix with one column for each distinct category. Each row represents an instance, and a value of 1 in a column indicates the presence of that category for that instance.
Code Example:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
data = {'City': ['New York', 'London', 'Paris', 'New York', 'London']}
df = pd.DataFrame(data)
# sparse_output=False returns a dense array (use sparse=False on scikit-learn < 1.2)
ohe = OneHotEncoder(sparse_output=False)
encoded_features = ohe.fit_transform(df[['City']])
print(f'Encoded Features:\n{encoded_features}')
print(f'Categories: {ohe.categories_}')
Output:
Encoded Features:
[[0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
Categories: [array(['London', 'New York', 'Paris'], dtype=object)]
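In practice the encoder is fitted on training data and may later meet categories it has never seen. With handle_unknown='ignore', unseen values are encoded as an all-zero row instead of raising an error. A minimal sketch, reusing df from above:
# Unknown categories become all-zero rows instead of raising an error
ohe_safe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe_safe.fit(df[['City']])
print(ohe_safe.transform(pd.DataFrame({'City': ['Tokyo', 'Paris']})))
# [[0. 0. 0.]
#  [0. 0. 1.]]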
Comparison Table
| Technique | Function | Output | Notes |
| --- | --- | --- | --- |
| pd.factorize | pd.factorize(column) | Returns encoded labels and the unique categories | Label encoding with integer codes |
| pd.get_dummies | pd.get_dummies(df, columns=['column']) | Adds a binary (0/1) column for each category | One-hot encoding |
| LabelEncoder | LabelEncoder().fit_transform(column) | Returns encoded labels; the category mapping is stored in classes_ | Label encoding with integer codes |
| OneHotEncoder | OneHotEncoder().fit_transform(df[['column']]) | Produces a matrix (sparse by default) with one column per category | One-hot encoding, suitable for large datasets |
Key Differences and Considerations
- Label Encoding (pd.factorize, LabelEncoder): Useful for ordered categorical features, where the numerical representation can reflect the inherent order. However, it can introduce unintended ordinal relationships between categories.
- One-Hot Encoding (pd.get_dummies, OneHotEncoder): Preferred for unordered categorical features, as it avoids creating artificial relationships. It can, however, produce a large number of columns when a feature has many categories.
- Sparse Matrices (OneHotEncoder): Efficient for large datasets with many categories, since only the non-zero values are stored (see the sketch after this list).
- Interpretability: Label encoding keeps a single, easy-to-read column, while a wide block of one-hot columns can be harder to interpret.
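To see the sparse behaviour mentioned above, leave OneHotEncoder at its default settings: the result is a SciPy CSR matrix that stores only the non-zero entries. A short sketch, reusing df from section 4:
# Default OneHotEncoder output is a SciPy CSR sparse matrix
sparse_ohe = OneHotEncoder()
X_sparse = sparse_ohe.fit_transform(df[['City']])
print(type(X_sparse).__name__)   # csr_matrix
print(X_sparse.nnz)              # 5 stored values instead of 5 x 3 = 15 cells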
Conclusion
Selecting the appropriate encoding technique depends on your data, model requirements, and interpretation needs. Experiment and evaluate different techniques to determine the best fit for your specific use case.