Multiple Linear Regression with Scikit-learn
Introduction
Multiple linear regression is a statistical method that models the relationship between one dependent variable and two or more independent variables. Scikit-learn (sklearn), a widely used Python machine-learning library, implements it in the LinearRegression class.
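In symbols, the model described above is usually written as follows. The notation is standard textbook notation rather than something this article defines: $y$ is the dependent variable, $x_1, \dots, x_p$ are the independent variables, $\beta_0$ is the intercept, $\beta_1, \dots, \beta_p$ are the coefficients, and $\varepsilon$ is the error term.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
```

Fitting the model means estimating the intercept and coefficients from data. LinearRegression does this by ordinary least squares, that is, by minimizing the sum of squared differences between the observed and predicted values of $y$.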
Steps Involved
Fitting a multiple linear regression with sklearn and reporting its R-squared value takes five steps:
- Import necessary libraries
- Load and prepare your data
- Create the model
- Train the model
- Evaluate the model: Calculate R-squared
Code Implementation
1. Importing Libraries
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
2. Loading and Preparing Data
import pandas as pd
data = pd.read_csv('your_data.csv')  # Replace 'your_data.csv' with the path to your file
# Select your independent variables; list every column you want to use
X = data[['Independent Variable 1', 'Independent Variable 2']]
y = data['Dependent Variable']  # Select your dependent variable
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
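To see what the split produces, here is a minimal sketch on a small synthetic DataFrame. The column names (`x1`, `x2`, `y`) and the data are made up for illustration and stand in for your own file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, two independent variables, one dependent variable
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "x1": rng.normal(size=10),
    "x2": rng.normal(size=10),
})
demo["y"] = 3.0 * demo["x1"] - 2.0 * demo["x2"] + 1.0

X_demo = demo[["x1", "x2"]]
y_demo = demo["y"]

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

The rows of X and y are shuffled together, so each training row of X still lines up with its own y value.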
3. Creating the Model
model = LinearRegression()
4. Training the Model
model.fit(X_train, y_train)
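After fitting, the estimated parameters are stored on the model as `coef_` (one value per independent variable) and `intercept_`. A minimal sketch, assuming noise-free synthetic data generated from known coefficients so the result is easy to check; the names `X_demo`, `y_demo` and `demo_model` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated from y = 1 + 3*x1 - 2*x2
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.0 + 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1]

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

# One coefficient per column of X, plus a single intercept
print(demo_model.coef_)
print(demo_model.intercept_)
```

Because the data contain no noise, the printed values should match the generating coefficients (3 and -2) and intercept (1) up to floating-point error. With real data they are estimates.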
5. Evaluation: Calculating R-squared
y_pred = model.predict(X_test)
r_squared = r2_score(y_test, y_pred)
print('R-squared:', r_squared)
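It may help to see what `r2_score` computes. The sketch below, on made-up noisy data, calculates R-squared by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, then compares it with `r2_score` and with the model's own `score` method, which returns the same quantity for regressors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical noisy data, for illustration only
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
noise = rng.normal(scale=0.5, size=100)
y_demo = 1.0 + 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + noise

# Hold out the last 20 rows as a simple test set
X_tr, X_te = X_demo[:80], X_demo[80:]
y_tr, y_te = y_demo[:80], y_demo[80:]

demo_model = LinearRegression().fit(X_tr, y_tr)
pred = demo_model.predict(X_te)

ss_res = np.sum((y_te - pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_te - y_te.mean()) ** 2)  # total sum of squares around the test mean
manual_r2 = 1.0 - ss_res / ss_tot

print(manual_r2)
print(r2_score(y_te, pred))
print(demo_model.score(X_te, y_te))
```

All three printed values should agree up to floating-point error.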
Example
Let’s see a complete example using a hypothetical dataset for house prices.
Data
| Size (sqft) | Bedrooms | Bathrooms | Price (USD) |
|---|---|---|---|
| 1500 | 3 | 2 | 250000 |
| 2000 | 4 | 3 | 350000 |
| 1800 | 3 | 2.5 | 300000 |
| 2200 | 4 | 3.5 | 400000 |
Code
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
# Hypothetical house-price data from the table above
data = {'Size (sqft)': [1500, 2000, 1800, 2200],
        'Bedrooms': [3, 4, 3, 4],
        'Bathrooms': [2, 3, 2.5, 3.5],
        'Price (USD)': [250000, 350000, 300000, 400000]}
df = pd.DataFrame(data)
# Independent variables (features) and dependent variable (target)
X = df[['Size (sqft)', 'Bedrooms', 'Bathrooms']]
y = df['Price (USD)']
# Split into training and testing sets (with 4 rows, test_size=0.2 holds out a single row)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict on the held-out data and compute R-squared
y_pred = model.predict(X_test)
r_squared = r2_score(y_test, y_pred)
print('R-squared:', r_squared)
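Output
R-squared: nan
With only four rows, test_size=0.2 leaves a single row in the test set, and R-squared is undefined for fewer than two samples: recent versions of scikit-learn emit an UndefinedMetricWarning and return nan. A meaningful score needs a dataset large enough that the test set holds several rows.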
Interpretation
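The R-squared value is the proportion of the variance in the dependent variable that the model explains. Computed on the test set, a value close to 1 means the predictions track the held-out data closely. A value near 0 means the model does no better than always predicting the mean, and the value can be negative when the model does worse than that. On very small datasets the score is unreliable, so treat it as a rough guide rather than proof of a good model.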
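A single train/test split yields one R-squared value, and that value depends on which rows happen to land in the test set. The sketch below, on made-up data, shows two related checks: five-fold cross-validation with `cross_val_score`, which gives one R-squared value per held-out fold, and a tiny hand-checkable case where R-squared comes out negative. The dataset size, coefficients and noise level are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

# Hypothetical data: 200 rows, three independent variables, moderate noise
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 3))
noise = rng.normal(scale=1.0, size=200)
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 0.5 * X_demo[:, 2] + noise

# Five-fold cross-validation: each fold is held out once and scored with R-squared
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")
print(scores)
print(scores.mean(), scores.std())

# Predictions that run opposite to the truth do worse than predicting the mean
neg = r2_score([1, 2, 3], [3, 2, 1])
print(neg)  # -3.0
```

The spread of the five scores gives a rough sense of how much a single split's R-squared could vary.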
Conclusion
By applying these steps, you can effectively use Scikit-learn to build a multiple linear regression model and obtain a measure of its performance using the R-squared value. This allows you to understand the predictive power of your model and assess its suitability for your specific problem.