Categorical Features Correlation

Categorical features, representing discrete categories or groups, play a vital role in machine learning models. Understanding the relationships between these features is crucial for effective model building. While traditional correlation measures like Pearson’s correlation are designed for numerical data, specialized methods are needed to assess correlation between categorical features.

Methods for Categorical Feature Correlation

1. Chi-Square Test

The Chi-Square test is a statistical method used to determine whether there is a significant association between two categorical variables. It compares the observed frequencies of category combinations in a contingency table against the frequencies expected under the assumption of independence. A chi-square statistic that is large relative to its degrees of freedom, equivalently a small p-value, indicates a strong association.

Category 1   Category 2   Observed Frequency   Expected Frequency
A            B            10                   15
A            C            20                   15
B            B            15                   10
B            C            5                    10
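
To make the table concrete, the statistic can be checked by hand; the four values below are simply the cells of the table above:

# Observed and expected counts from the four cells of the table above
observed = [10, 20, 15, 5]
expected = [15, 15, 10, 10]

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 4))  # 8.3333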

Example:

import pandas as pd
from scipy.stats import chi2_contingency

# Sample data whose counts match the contingency table above
data = {
    'Category1': ['A'] * 30 + ['B'] * 20,
    'Category2': ['B'] * 10 + ['C'] * 20 + ['B'] * 15 + ['C'] * 5,
}
df = pd.DataFrame(data)

# Contingency table of observed frequencies
contingency_table = pd.crosstab(df['Category1'], df['Category2'])

# correction=False disables Yates' continuity correction, so the statistic
# matches the classical formula (SciPy applies the correction by default
# to 2x2 tables)
chi2, p, dof, expected = chi2_contingency(contingency_table, correction=False)

print(f"Chi-Square Statistic: {chi2:.4f}")
print(f"P-Value: {p:.4f}")

Chi-Square Statistic: 8.3333
P-Value: 0.0039

A p-value below 0.05 indicates a statistically significant association between the two features; here p ≈ 0.0039, so the association is significant.
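
The expected array returned by chi2_contingency holds the frequencies implied by independence, which makes it easy to sanity-check the Expected Frequency column of the table above:

# Expected frequencies under the independence assumption
print(pd.DataFrame(expected,
                   index=contingency_table.index,
                   columns=contingency_table.columns))

Category2     B     C
Category1
A          15.0  15.0
B          10.0  10.0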

2. Cramer’s V

Cramer’s V is a measure of association between two categorical variables that ranges from 0 to 1, with higher values indicating stronger correlation. It is based on the chi-square statistic and accounts for the size of the contingency table.

Formula:

V = sqrt(chi-square / (n * min(r-1, c-1)))

Where:

  • chi-square is the (uncorrected) chi-square statistic from the test above
  • n is the sample size
  • r is the number of rows in the contingency table
  • c is the number of columns in the contingency table
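
Substituting the values from the chi-square example above (chi-square = 8.3333, n = 50, min(r-1, c-1) = 1):

V = sqrt(8.3333 / (50 * 1)) ≈ 0.4082

which matches the output of the code below.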

Example:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

data = {
    'Category1': ['A'] * 30 + ['B'] * 20,
    'Category2': ['B'] * 10 + ['C'] * 20 + ['B'] * 15 + ['C'] * 5,
}
df = pd.DataFrame(data)

contingency_table = pd.crosstab(df['Category1'], df['Category2'])

# Uncorrected chi-square statistic, as used in the Cramer's V formula
chi2, p, dof, expected = chi2_contingency(contingency_table, correction=False)

n = len(df)                      # sample size
r, c = contingency_table.shape   # rows and columns of the table

cramers_v = np.sqrt(chi2 / (n * min(r - 1, c - 1)))

print(f"Cramer's V: {cramers_v:.4f}")

Cramer's V: 0.4082

3. Mutual Information

Mutual Information (MI) measures the dependency between two variables. It quantifies how much knowing the value of one variable reduces the uncertainty about the other. MI is zero when the variables are independent, and higher values indicate stronger dependence.

Formula:

MI(X, Y) = sum over x, y of P(x, y) * log(P(x, y) / (P(x) * P(y)))

Where:

  • P(x, y) is the joint probability of X and Y
  • P(x) is the marginal probability of X
  • P(y) is the marginal probability of Y
  • the base of the logarithm sets the unit: base 2 gives bits, while the natural logarithm (used by scikit-learn's mutual_info_score) gives nats

Example:

import pandas as pd
from sklearn.metrics import mutual_info_score

data = {
    'Category1': ['A'] * 30 + ['B'] * 20,
    'Category2': ['B'] * 10 + ['C'] * 20 + ['B'] * 15 + ['C'] * 5,
}
df = pd.DataFrame(data)

# mutual_info_score uses the natural logarithm, so the result is in nats
mi = mutual_info_score(df['Category1'], df['Category2'])

print(f"Mutual Information: {mi:.4f}")

Mutual Information: 0.0863
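
To tie the library result back to the formula, the same value can be computed by hand from the joint and marginal probabilities of the contingency table used above:

import numpy as np

# Joint probabilities: cell counts of the contingency table divided by n = 50
joint = np.array([[10, 20],
                  [15, 5]]) / 50

px = joint.sum(axis=1, keepdims=True)  # marginal distribution of Category1
py = joint.sum(axis=0, keepdims=True)  # marginal distribution of Category2

# MI in nats: sum over cells of P(x, y) * ln(P(x, y) / (P(x) * P(y)))
mi = np.sum(joint * np.log(joint / (px * py)))
print(f"Mutual Information: {mi:.4f}")  # 0.0863, matching mutual_info_score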

Conclusion

Understanding correlation between categorical features is essential for building accurate machine learning models. The Chi-Square test, Cramer's V, and Mutual Information are effective techniques for assessing these relationships, providing insight into the dependencies and associations between categorical variables.
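
As a practical wrap-up, here is a sketch of how one of these measures might be applied across every pair of categorical columns in a DataFrame. The helper names cramers_v and association_matrix are illustrative rather than a standard API, and each column is assumed to contain at least two distinct categories:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramer's V between two categorical series (0 = independent, 1 = perfect association)."""
    table = pd.crosstab(x, y)
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * min(r - 1, c - 1))))

def association_matrix(df):
    """Pairwise Cramer's V for every pair of columns in df."""
    cols = df.columns
    return pd.DataFrame(
        [[cramers_v(df[a], df[b]) for b in cols] for a in cols],
        index=cols, columns=cols,
    )

The resulting matrix can be inspected or plotted like a numerical correlation heatmap. Unlike Pearson's correlation, Cramer's V is never negative, so it conveys the strength of an association but not its direction.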

