OneHotEncoder: categorical_features Deprecated
The categorical_features parameter of scikit-learn’s OneHotEncoder was deprecated in version 0.20 and removed in version 0.22. This article explains the change and shows how to encode only specific columns.
Understanding the Change
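In older versions of scikit-learn, OneHotEncoder accepted a categorical_features parameter that listed the indices of the columns to encode. The parameter no longer exists in current releases, and passing it raises an error. Column selection is now done outside the encoder, either by passing it only the relevant columns or by wrapping it in a ColumnTransformer.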
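The column selection that categorical_features used to perform can be sketched with ColumnTransformer, which the scikit-learn deprecation notice recommends as the replacement. The sample array and the transformer name "onehot" below are illustrative assumptions, not part of the original article.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative data: column 0 is categorical, column 1 is numeric
X = np.array([["New York", 25], ["London", 30], ["Paris", 28]], dtype=object)

# Encode only column 0 and pass the remaining columns through unchanged.
# The index list [0] plays the role that categorical_features=[0] once did.
ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_encoded = ct.fit_transform(X)

# Columns 0-2 are the one-hot columns (London, New York, Paris, in sorted
# order); column 3 is the untouched numeric column.
print(X_encoded.shape)
```

The encoded columns come first, followed by the passthrough columns, so the column order of the result differs from the input.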
New Approach: handle_unknown and drop
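The updated OneHotEncoder encodes every column it receives, regardless of type, so column selection now happens outside the encoder: pass it only the columns to encode, or wrap it in a ColumnTransformer. Two parameters then control how the encoding itself behaves: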
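- handle_unknown: Controls how the encoder handles categories that appear at transform time but were not seen during fitting. Options include:
  - ‘error’: Raises an error for unseen categories. (default)
  - ‘ignore’: Encodes unseen categories as all zeros.
  - ‘infrequent_if_exist’: Maps unseen categories to the infrequent category if one is configured; otherwise behaves like ‘ignore’. (scikit-learn 1.1 and later)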
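- drop: Determines whether one category per feature is dropped, which avoids perfectly collinear output columns. Options include:
  - None: Keeps every category. (default)
  - ‘first’: Drops the first category of each feature.
  - ‘if_binary’: Drops the first category only for features with exactly two categories.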
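The effect of these two parameters can be seen on a small example. This is a minimal sketch; the city names and variable names are made up for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

train = [["London"], ["New York"], ["Paris"]]
test = [["Paris"], ["Tokyo"]]  # "Tokyo" was not seen during fit

# handle_unknown="ignore": the unseen category becomes a row of zeros
enc_ignore = OneHotEncoder(handle_unknown="ignore").fit(train)
ignored = enc_ignore.transform(test).toarray()
print(ignored)  # Paris -> [0, 0, 1], Tokyo -> [0, 0, 0]

# handle_unknown="error" (the default): the unseen category raises ValueError
enc_error = OneHotEncoder(handle_unknown="error").fit(train)
try:
    enc_error.transform(test)
    raised = False
except ValueError:
    raised = True
print(raised)

# drop="first": the first category in sorted order ("London") gets no column
enc_drop = OneHotEncoder(drop="first").fit(train)
dropped = enc_drop.transform(train).toarray()
print(dropped)  # London -> [0, 0], New York -> [1, 0], Paris -> [0, 1]
```

Note that with drop='first' the dropped category is encoded as all zeros, which is the same pattern that handle_unknown='ignore' produces for an unseen category.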
Example: Encoding a Specific Column
Let’s illustrate the process with an example:
Column Name | Data Type |
---|---|
City | Categorical |
Age | Numeric |
Income | Numeric |
We want to encode the ‘City’ column only.
Code
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Sample data
data = {'City': ['New York', 'London', 'Paris', 'New York', 'London'],
        'Age': [25, 30, 28, 32, 27],
        'Income': [50000, 60000, 45000, 70000, 55000]}
df = pd.DataFrame(data)
# Create the encoder (combining handle_unknown='ignore' with drop
# requires scikit-learn 1.0 or later)
encoder = OneHotEncoder(handle_unknown='ignore', drop='first')
# Fit the encoder on the 'City' column
encoder.fit(df[['City']])
# Transform the 'City' column
encoded_city = encoder.transform(df[['City']]).toarray()
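# Create a new DataFrame with the encoded features; categories_ is sorted,
# so [1:] skips the first category, which drop='first' removed
encoded_df = pd.DataFrame(encoded_city, columns=encoder.categories_[0][1:], index=df.index)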
# Concatenate the encoded features with the original DataFrame
df = pd.concat([df, encoded_df], axis=1)
# Print the result
print(df)
Output
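       City  Age  Income  New York  Paris
0  New York   25   50000       1.0    0.0
1    London   30   60000       0.0    0.0
2     Paris   28   45000       0.0    1.0
3  New York   32   70000       1.0    0.0
4    London   27   55000       0.0    0.0
In this example, the ‘City’ column has been one-hot encoded. Because scikit-learn sorts the categories and drop='first' removes the first one, ‘London’ is the dropped reference category: a row with zeros in both ‘New York’ and ‘Paris’ represents London. The encoded features are appended to the original DataFrame, which still contains the original ‘City’ column.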