OneHotEncoder categorical_features deprecated, how to transform specific column

OneHotEncoder: categorical_features Deprecated

The categorical_features parameter of scikit-learn’s OneHotEncoder was deprecated and has since been removed. This article explains the change and shows how to encode specific columns without it.

Understanding the Change

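In older versions of scikit-learn, OneHotEncoder accepted a categorical_features parameter that listed the indices of the columns to encode; the remaining columns were passed through unchanged. The parameter was deprecated in version 0.20 and removed in version 0.22, so passing it to a current release raises a TypeError.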

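New Approach: Column Selection, handle_unknown and drop

The updated OneHotEncoder encodes every column it is given, regardless of type, and no longer selects columns itself. To encode only specific columns, pass just those columns to the encoder (for example, df[['City']]) or wrap the encoder in a ColumnTransformer. The parameters below do not select columns; they control how the selected columns are encoded: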

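  • handle_unknown: Controls how the encoder handles categories at transform time that were not seen during fit. Options include:
    • ‘error’: Raises an error for unseen categories. (default)
    • ‘ignore’: Encodes unseen categories as all zeros.
    • ‘infrequent_if_exist’: Maps unseen categories to the infrequent category if one was configured, otherwise behaves like ‘ignore’ (scikit-learn 1.1 and later). Note that ‘use_encoded_value’ is an option of OrdinalEncoder, not OneHotEncoder.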
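  • drop: Determines whether one category per feature is dropped from the output, which avoids perfectly collinear columns. Options include:
    • None: Keeps a column for every category. (default)
    • ‘first’: Drops the first category of each feature, where categories are in sorted order.
    • ‘if_binary’: Drops the first category only for features that have exactly two categories.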
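The behaviour of these two parameters can be checked with a small experiment. The sketch below is illustrative only: the city names and the unseen value 'Tokyo' are made-up sample data, and the variable names are arbitrary.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: one categorical column with three categories
train = [["London"], ["New York"], ["Paris"]]
unseen = [["Tokyo"]]  # a category the encoder never saw during fit

# handle_unknown='ignore': an unseen category is encoded as a row of zeros
ignore_enc = OneHotEncoder(handle_unknown="ignore").fit(train)
ignored = ignore_enc.transform(unseen).toarray()

# handle_unknown='error' (the default): an unseen category raises ValueError
strict_enc = OneHotEncoder().fit(train)
try:
    strict_enc.transform(unseen)
    raised = False
except ValueError:
    raised = True

# drop='first': the first category in sorted order ('London') gets no column,
# so three categories produce two output columns
drop_enc = OneHotEncoder(drop="first").fit(train)
dropped = drop_enc.transform(train).toarray()

print(ignored)
print(raised)
print(dropped)
```

Because categories are sorted before the first one is dropped, the reference category is determined by alphabetical order, not by the order in which values appear in the data.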

Example: Encoding a Specific Column

Let’s illustrate the process with an example:

Column Name    Data Type
City           Categorical
Age            Numeric
Income         Numeric

We want to encode the ‘City’ column only.

Code


import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample data
data = {'City': ['New York', 'London', 'Paris', 'New York', 'London'],
        'Age': [25, 30, 28, 32, 27],
        'Income': [50000, 60000, 45000, 70000, 55000]}
df = pd.DataFrame(data)

# Create the encoder
# (combining handle_unknown='ignore' with drop requires scikit-learn 1.0 or later)
encoder = OneHotEncoder(handle_unknown='ignore', drop='first')

# Fit the encoder on the 'City' column
encoder.fit(df[['City']])

# Transform the 'City' column
encoded_city = encoder.transform(df[['City']]).toarray()

# Create a new DataFrame with the encoded features.
# drop='first' removes categories_[0][0], so the remaining labels are categories_[0][1:]
encoded_df = pd.DataFrame(encoded_city, columns=encoder.categories_[0][1:])

# Concatenate the encoded features with the original DataFrame
df = pd.concat([df, encoded_df], axis=1)

# Print the result
print(df)

Output


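       City  Age  Income  New York  Paris
0  New York   25   50000       1.0    0.0
1    London   30   60000       0.0    0.0
2     Paris   28   45000       0.0    1.0
3  New York   32   70000       1.0    0.0
4    London   27   55000       0.0    0.0

In this example, the ‘City’ column has been one-hot encoded. The encoder sorts categories alphabetically, so drop='first' removes ‘London’ and makes it the reference category: a row with zeros in both ‘New York’ and ‘Paris’ represents London. The encoded values are floats because the encoder’s default dtype is float64. The encoded features are appended to the original DataFrame, which still contains the untouched ‘City’ column.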
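As an alternative to selecting the column by hand, the encoder can be wrapped in a ColumnTransformer from sklearn.compose, which is the replacement the deprecation notice pointed to. The following is a minimal sketch using the same sample data; the transformer label "city" is an arbitrary name chosen for illustration, and sparse_threshold=0 is set only to force a dense array so the result is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Same sample data as in the example above
df = pd.DataFrame({
    "City": ["New York", "London", "Paris", "New York", "London"],
    "Age": [25, 30, 28, 32, 27],
    "Income": [50000, 60000, 45000, 70000, 55000],
})

# Encode only 'City'; remainder='passthrough' keeps 'Age' and 'Income' as they are.
# "city" is just a label for this step (an arbitrary, illustrative name).
ct = ColumnTransformer(
    transformers=[("city", OneHotEncoder(drop="first"), ["City"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense NumPy array
)

# Output columns: the encoded 'City' columns first, then the passthrough columns
result = ct.fit_transform(df)
print(result.shape)
```

The output places the encoded columns (‘New York’, ‘Paris’) first, followed by ‘Age’ and ‘Income’, and the original ‘City’ column is not included. The fitted transformer can be reused on new data with ct.transform, which keeps the column selection and the encoding in one object.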

