Linear Regression with String/Categorical Features
Linear regression is a powerful statistical technique used for predicting a continuous target variable based on one or more independent variables. Traditionally, linear regression assumes numerical features. However, real-world datasets often contain string or categorical features, presenting challenges for direct application of linear regression. This article explores how to handle string/categorical features within the context of linear regression.
Understanding the Challenge
Linear regression models rely on numerical relationships between variables. String/categorical features, representing qualitative data, cannot be directly interpreted by the model. Consider these examples:
- Customer City: “New York,” “London,” “Tokyo” are string values that don’t directly translate to numerical values.
- Product Category: “Electronics,” “Clothing,” “Food” are categories that lack inherent numerical meaning.
Strategies for Handling Categorical Features
Several techniques can be employed to incorporate categorical features into linear regression:
1. One-Hot Encoding
One-hot encoding is a widely used method for converting categorical features into numerical representations. It creates binary columns (0 or 1) for each unique category. For instance:
| City     | New York | London | Tokyo |
|----------|----------|--------|-------|
| New York | 1        | 0      | 0     |
| London   | 0        | 1      | 0     |
| Tokyo    | 0        | 0      | 1     |
This encoding allows linear regression to treat the distinct categories as numerical inputs; a complete, runnable example appears later in this article.
2. Dummy Encoding
Similar to one-hot encoding, dummy encoding converts categorical features into binary columns. However, it creates one fewer column than the number of categories. For example:
| City     | New York | London |
|----------|----------|--------|
| New York | 1        | 0      |
| London   | 0        | 1      |
| Tokyo    | 0        | 0      |
This approach uses the dropped category as a reference level and avoids the "dummy variable trap": with an intercept in the model, a full set of one-hot columns is perfectly collinear, so dropping one column prevents multicollinearity.
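As a minimal sketch, pandas' get_dummies can produce this k-1 encoding directly. Note that drop_first=True drops the first category in sorted order (here "London"), whereas the table above drops "Tokyo"; which category serves as the reference is an arbitrary choice.
import pandas as pd
df = pd.DataFrame({'City': ['New York', 'London', 'Tokyo']})
# drop_first=True keeps k-1 columns; the dropped category becomes the reference level
dummies = pd.get_dummies(df['City'], drop_first=True)
print(dummies)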
3. Label Encoding
Label encoding assigns a unique integer to each category. For example:
| City     | Encoded City |
|----------|--------------|
| New York | 1            |
| London   | 2            |
| Tokyo    | 3            |
While simpler than one-hot encoding, label encoding introduces an artificial order: a linear model will treat "Tokyo" (3) as numerically greater than "New York" (1), implying a ranking and spacing between categories that may not exist.
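As a minimal sketch, scikit-learn's LabelEncoder performs this mapping, though it assigns integers in sorted order, so the numbers differ from the table above. (scikit-learn intends LabelEncoder for target labels; OrdinalEncoder is the usual choice for feature columns.)
from sklearn.preprocessing import LabelEncoder
cities = ['New York', 'London', 'Tokyo']
encoder = LabelEncoder()
# Integers are assigned in sorted order: London=0, New York=1, Tokyo=2
encoded = encoder.fit_transform(cities)
print(dict(zip(cities, encoded)))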
4. Ordinal Encoding
Ordinal encoding is suitable when categories have a natural order. For example, “Small,” “Medium,” “Large” can be encoded as 1, 2, 3 respectively.
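A minimal sketch using scikit-learn's OrdinalEncoder with an explicitly supplied order (note that it numbers categories from 0 rather than 1):
from sklearn.preprocessing import OrdinalEncoder
sizes = [['Small'], ['Large'], ['Medium']]
# Pass the natural order explicitly so it is not inferred alphabetically
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
encoded = encoder.fit_transform(sizes)
print(encoded.ravel())  # [0. 2. 1.]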
5. Feature Hashing
For high-cardinality categorical features (many unique values), feature hashing offers a space-efficient approach. It maps categorical values to a fixed number of numerical indices using a hash function, which caps the number of columns at the cost of occasional hash collisions (distinct categories mapping to the same index).
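A minimal sketch with scikit-learn's FeatureHasher; n_features=4 is an illustrative assumption, and a real application would use a much larger value to keep collisions rare:
from sklearn.feature_extraction import FeatureHasher
# Each sample is a list of string tokens; n_features fixes the output width
hasher = FeatureHasher(n_features=4, input_type='string')
cities = [['New York'], ['London'], ['Tokyo']]
hashed = hasher.transform(cities).toarray()
print(hashed.shape)  # (3, 4)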
Example: Implementing One-Hot Encoding
Here’s a Python example using Pandas and scikit-learn for one-hot encoding:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
# Sample dataset
data = {'City': ['New York', 'London', 'Tokyo', 'New York'],
        'Price': [100, 150, 200, 120]}
df = pd.DataFrame(data)
# One-hot encode the 'City' feature
# (scikit-learn >= 1.2 uses sparse_output=False; older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False)
city_encoded = encoder.fit_transform(df[['City']])
# Create a new dataframe with the encoded feature columns
columns = encoder.get_feature_names_out(['City'])
encoded_df = pd.DataFrame(city_encoded, columns=columns)
df = pd.concat([df, encoded_df], axis=1)
# Drop the original 'City' column
df.drop('City', axis=1, inplace=True)
# Build and train the linear regression model
model = LinearRegression()
model.fit(df.drop('Price', axis=1), df['Price'])
# Make predictions, reusing the fitted encoder so the encoded
# columns match the order used during training
new_city = pd.DataFrame({'City': ['New York']})
new_encoded = pd.DataFrame(encoder.transform(new_city), columns=columns)
prediction = model.predict(new_encoded)
print("Predicted Price:", prediction[0])
Considerations
- Dimensionality: One-hot encoding can significantly increase the number of features, potentially leading to the curse of dimensionality.
- Interpretability: One-hot encoding spreads a single categorical feature across many binary columns, so the model's coefficients can be harder to read than a single original feature.
- Performance: High-cardinality features can make model training computationally expensive.
Conclusion
Handling string/categorical features in linear regression requires careful consideration and appropriate encoding techniques. One-hot encoding, dummy encoding, label encoding, ordinal encoding, and feature hashing are popular choices. Choosing the right approach depends on the specific dataset, model requirements, and trade-offs between dimensionality, interpretability, and performance.