Python Scikit-learn Multiple Linear Regression: Displaying R-squared

Multiple Linear Regression with Scikit-learn

Introduction

Multiple linear regression models the relationship between one dependent variable and two or more independent variables. Scikit-learn (sklearn), a widely used Python machine learning library, implements it in the LinearRegression estimator, which fits the coefficients by ordinary least squares.
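
In symbols, a model with p independent variables is usually written as follows. The notation is the standard textbook one and is an assumption here, not something specific to scikit-learn:

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
```

Here \beta_0 is the intercept, \beta_1 through \beta_p are the coefficients of the independent variables, and \varepsilon is the error term.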

Steps Involved

Let’s outline the essential steps for performing multiple linear regression in Python using sklearn and displaying the R-squared value:

  1. Import necessary libraries
  2. Load and prepare your data
  3. Create the model
  4. Train the model
  5. Evaluate the model: Calculate R-squared

Code Implementation

1. Importing Libraries


from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

2. Loading and Preparing Data


import pandas as pd
data = pd.read_csv('your_data.csv')  # Replace 'your_data.csv' with your file
X = data[['Independent Variable 1', 'Independent Variable 2']]  # Select your independent variables; list as many columns as you need
y = data['Dependent Variable']  # Select your dependent variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # Split into training and testing sets
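
Before splitting, it can help to confirm that the selected columns contain no missing values, since LinearRegression rejects NaN inputs. The following is a minimal sketch, assuming a small hypothetical DataFrame whose column names (x1, x2, y) and values are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with one missing value; names and numbers are illustrative only.
data = pd.DataFrame({
    'x1': [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    'x2': [0.5, 1.5, 2.5, 3.5, 4.5, 5.5],
    'y': [3.0, 5.0, 7.0, 9.0, 11.0, 13.0],
})

# Drop any row that has a missing value in a column the model will use.
clean = data.dropna(subset=['x1', 'x2', 'y'])

X = clean[['x1', 'x2']]
y = clean['y']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print('Rows kept:', len(clean), 'train:', len(X_train), 'test:', len(X_test))
```

Dropping rows is only one option; filling missing values is another, and which is appropriate depends on your data.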

3. Creating the Model


model = LinearRegression()

4. Training the Model


model.fit(X_train, y_train)
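
After fitting, the learned coefficients are available on the model as coef_ and intercept_. The sketch below uses synthetic, noise-free data generated from y = 1 + 2*x1 + 3*x2 (an assumption chosen for illustration) so that the recovered values are easy to check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 1 + 2*x1 + 3*x2 exactly (illustrative assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1]

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

# coef_ holds one slope per column of X; intercept_ holds the constant term.
print('Coefficients:', demo_model.coef_)
print('Intercept:', demo_model.intercept_)
```

Each coefficient is the estimated change in y for a one-unit change in that variable while the others are held fixed.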

5. Evaluation: Calculating R-squared


y_pred = model.predict(X_test)
r_squared = r2_score(y_test, y_pred)
print('R-squared:', r_squared)
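
The same value is also available from the estimator's own score method, which returns R-squared for regressors. A short sketch on synthetic data (the data and variable names are assumptions for illustration) showing that the two routes agree:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only): a linear signal plus random noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
demo_model = LinearRegression().fit(X_tr, y_tr)

# Route 1: predict, then pass true and predicted values to r2_score.
r2_from_metric = r2_score(y_te, demo_model.predict(X_te))
# Route 2: let the estimator predict and score in one call.
r2_from_score = demo_model.score(X_te, y_te)

print('r2_score:', r2_from_metric)
print('model.score:', r2_from_score)
```

Note the argument order: r2_score takes the true values first, while score takes the features first and the true values second.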

Example

Let’s see a complete example using a hypothetical dataset for house prices.

Data

Size (sqft)   Bedrooms   Bathrooms   Price (USD)
1500          3          2           250000
2000          4          3           350000
1800          3          2.5         300000
2200          4          3.5         400000

Code


import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

data = {'Size (sqft)': [1500, 2000, 1800, 2200],
        'Bedrooms': [3, 4, 3, 4],
        'Bathrooms': [2, 3, 2.5, 3.5],
        'Price (USD)': [250000, 350000, 300000, 400000]}
df = pd.DataFrame(data)

X = df[['Size (sqft)', 'Bedrooms', 'Bathrooms']]
y = df['Price (USD)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
r_squared = r2_score(y_test, y_pred)

print('R-squared:', r_squared)

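Output


R-squared: nan

With only four rows, test_size=0.2 leaves a single row in the test set. R-squared is not defined for fewer than two samples, so recent scikit-learn versions emit an UndefinedMetricWarning and return nan. A meaningful score needs a dataset large enough to put at least two rows, and realistically many more, in the test set.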
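
A test set's size determines whether R-squared can be computed at all. The sketch below reruns the four-row house price example, prints how many rows land in each split, and checks the score for NaN. It assumes a recent scikit-learn release, where r2_score warns and returns NaN for fewer than two samples; the warning is silenced here only to keep the demo output short:

```python
import math
import warnings

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'Size (sqft)': [1500, 2000, 1800, 2200],
                   'Bedrooms': [3, 4, 3, 4],
                   'Bathrooms': [2, 3, 2.5, 3.5],
                   'Price (USD)': [250000, 350000, 300000, 400000]})
X = df[['Size (sqft)', 'Bedrooms', 'Bathrooms']]
y = df['Price (USD)']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print('Training rows:', len(X_train), 'Test rows:', len(X_test))

model = LinearRegression().fit(X_train, y_train)

with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # hide the "not well-defined" warning in this demo
    single_row_r2 = r2_score(y_test, model.predict(X_test))

print('Score is NaN:', math.isnan(single_row_r2))
```

Checking len(X_test) before scoring is a cheap way to catch this on small datasets.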

Interpretation

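The R-squared value is the proportion of the variance in the dependent variable that the model explains from the independent variables. On the data the model was trained on, it lies between 0 and 1. On held-out test data it can be negative, which means the model predicts worse than simply using the mean of the dependent variable. A value close to 1 indicates that the model explains most of the variability, but it does not by itself show that the model is correctly specified.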
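
R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this with NumPy on made-up observed and predicted values (assumptions for illustration only) and compares the result with r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up observed and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_hat = np.array([2.8, 5.3, 7.0, 9.4, 10.9])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
manual_r2 = 1.0 - ss_res / ss_tot

print('Manual R-squared:', manual_r2)
print('r2_score:', r2_score(y_true, y_hat))
```

The formula also shows why a negative value is possible: when the residual sum of squares exceeds the total sum of squares, the ratio is greater than 1.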

Conclusion

By applying these steps, you can use Scikit-learn to build a multiple linear regression model and measure its performance on held-out data with the R-squared value. This helps you judge the model's predictive power and whether it suits your specific problem.

