Feature Encoding Techniques in Python: A Comprehensive Guide

In machine learning, categorical features often need to be transformed into numerical representations for models to effectively process them. Python provides several methods for this conversion, including:

  • pd.factorize
  • pd.get_dummies
  • sklearn.preprocessing.LabelEncoder
  • sklearn.preprocessing.OneHotEncoder

Let’s explore each of these techniques and their applications.

1. pd.factorize: Label Encoding

Functionality:

The pd.factorize function converts a categorical column into integer labels, assigning each distinct category a unique integer starting from 0, in order of first appearance.

Code Example:


import pandas as pd

data = {'City': ['New York', 'London', 'Paris', 'New York', 'London']}
df = pd.DataFrame(data)

# Factorizing the 'City' column
encoded_labels, uniques = pd.factorize(df['City'])

print(f'Encoded Labels: {encoded_labels}')
print(f'Unique Categories: {uniques}')

Output:

Encoded Labels: [0 1 2 0 1]
Unique Categories: ['New York' 'London' 'Paris']
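Because pd.factorize returns both the codes and the unique categories, the original values can be recovered by indexing the uniques with the codes. A small sketch:

```python
import pandas as pd

cities = pd.Series(['New York', 'London', 'Paris', 'New York', 'London'])
codes, uniques = pd.factorize(cities)

# Index the unique categories with the integer codes to decode
decoded = uniques[codes]
print(list(decoded))  # ['New York', 'London', 'Paris', 'New York', 'London']
```

This round trip is handy for mapping model output back to human-readable categories.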

2. pd.get_dummies: One-Hot Encoding

Functionality:

pd.get_dummies performs one-hot encoding, creating a new column for each distinct category and assigning a binary value (0 or 1) based on the presence or absence of the category.

Code Example:


import pandas as pd

data = {'City': ['New York', 'London', 'Paris', 'New York', 'London']}
df = pd.DataFrame(data)

# One-hot encoding the 'City' column (dtype=int keeps 0/1 integers;
# pandas 2.x otherwise returns boolean True/False columns)
df_encoded = pd.get_dummies(df, columns=['City'], prefix='City', dtype=int)
print(df_encoded)

Output:

   City_London  City_New York  City_Paris
0           0              1           0
1           1              0           0
2           0              0           1
3           0              1           0
4           1              0           0
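One-hot encoding creates perfectly correlated columns (any one column is determined by the others), which can cause trouble for linear models. pd.get_dummies offers drop_first=True to keep only k-1 indicator columns; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({'City': ['New York', 'London', 'Paris', 'New York', 'London']})

# drop_first=True drops the first (alphabetically sorted) category,
# leaving k-1 indicator columns; dtype=int forces 0/1 integers
df_encoded = pd.get_dummies(df, columns=['City'], prefix='City',
                            drop_first=True, dtype=int)
print(df_encoded)  # only City_New York and City_Paris remain
```

Rows where both remaining columns are 0 implicitly encode the dropped category (London).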

3. sklearn.preprocessing.LabelEncoder: Label Encoding

Functionality:

Similar to pd.factorize, LabelEncoder transforms categorical features into numerical labels starting from 0. Unlike pd.factorize, it sorts the categories first (alphabetically for strings) before assigning integers, which is why its mapping differs from the order-of-appearance codes produced by pd.factorize.

Code Example:


import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = {'City': ['New York', 'London', 'Paris', 'New York', 'London']}
df = pd.DataFrame(data)

le = LabelEncoder()
encoded_labels = le.fit_transform(df['City'])

print(f'Encoded Labels: {encoded_labels}')
print(f'Class Mapping: {le.classes_}')

Output:

Encoded Labels: [1 0 2 1 0]
Class Mapping: ['London' 'New York' 'Paris']
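LabelEncoder also supports the reverse mapping via inverse_transform, which turns integer labels back into the original categories. A small sketch:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cities = pd.Series(['New York', 'London', 'Paris', 'New York', 'London'])
le = LabelEncoder()
codes = le.fit_transform(cities)

# inverse_transform maps integer labels back to the original categories
decoded = le.inverse_transform(codes)
print(list(decoded))  # ['New York', 'London', 'Paris', 'New York', 'London']
```

This is useful for converting model predictions back into category names.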

4. sklearn.preprocessing.OneHotEncoder: One-Hot Encoding

Functionality:

OneHotEncoder also performs one-hot encoding, producing one column per distinct category. By default it returns a SciPy sparse matrix; a dense NumPy array can be requested instead. Each row represents an instance, and a 1 in a column indicates the presence of that category for that instance.

Code Example:


import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = {'City': ['New York', 'London', 'Paris', 'New York', 'London']}
df = pd.DataFrame(data)

# sparse_output=False returns a dense NumPy array (scikit-learn >= 1.2;
# older versions used the now-removed sparse=False argument)
ohe = OneHotEncoder(sparse_output=False)
encoded_features = ohe.fit_transform(df[['City']])

print(f'Encoded Features:\n{encoded_features}')
print(f'Categories: {ohe.categories_}')

Output:

Encoded Features:
[[0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
Categories: [array(['London', 'New York', 'Paris'], dtype=object)]

Comparison Table

| Technique | Function | Output | Notes |
| --- | --- | --- | --- |
| pd.factorize | pd.factorize(column) | Returns encoded labels and unique categories | Label encoding with numerical integers |
| pd.get_dummies | pd.get_dummies(df, columns=['column']) | Creates new columns for each category with binary values | One-hot encoding |
| LabelEncoder | LabelEncoder().fit_transform(column) | Returns encoded labels; the category mapping is stored in the fitted encoder's classes_ attribute | Label encoding with numerical integers |
| OneHotEncoder | OneHotEncoder().fit_transform(df[['column']]) | Creates a sparse matrix with one column per category | One-hot encoding, suitable for large datasets |

Key Differences and Considerations

  • Label Encoding (pd.factorize, LabelEncoder): Compact (a single column) and acceptable when the categories have a meaningful order. However, the integers are assigned arbitrarily (order of appearance or alphabetical), so models that treat inputs numerically may infer relationships between categories that do not exist.
  • One-Hot Encoding (pd.get_dummies, OneHotEncoder): Preferred for unordered categorical features, as it avoids creating artificial relationships. Can lead to a higher number of features, especially with many categories.
  • Sparse Matrices (OneHotEncoder): Efficient for handling large datasets with many categories, as it stores only non-zero values.
  • Interpretability: Label encoding can be more interpretable, while one-hot encoding might be less intuitive.
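The sparse-matrix point above is easy to verify directly: with its default settings, OneHotEncoder stores only the non-zero entries. A small sketch:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'City': ['New York', 'London', 'Paris', 'New York', 'London']})

# With default settings the output is a SciPy sparse matrix
sparse_ohe = OneHotEncoder()
sparse_matrix = sparse_ohe.fit_transform(df[['City']])

print(sparse_matrix.nnz)        # 5 stored non-zeros for a 5 x 3 result
print(sparse_matrix.toarray())  # convert to a dense array only when needed
```

For a feature with thousands of categories, the dense representation would be almost entirely zeros, so the sparse form can reduce memory use by orders of magnitude.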

Conclusion

Selecting the appropriate encoding technique depends on your data, model requirements, and interpretation needs. Experiment and evaluate different techniques to determine the best fit for your specific use case.
