Introduction to Isolation Forest
Isolation Forest is an unsupervised machine learning algorithm used for anomaly detection. It identifies outliers by isolating them in a forest of randomly built decision trees. The algorithm is based on the principle that outliers are few and different from the rest of the data, so they take fewer random splits to isolate than normal data points.
How Isolation Forest Works
Isolation Forest operates by following these steps:
1. Building the Forest
- Randomly select a feature from the dataset.
- Randomly select a split value within the range of the chosen feature.
- Split the data based on the selected feature and value, creating two branches.
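- Repeat the three steps above recursively on each branch until every data point is isolated in its own branch or a maximum tree depth is reached. The whole procedure is repeated to build each tree in the forest.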
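The steps above can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation: the names `build_tree`, `path_length`, and `mean_path_length`, the dictionary-based tree, the depth limit of 10, and the forest size of 200 are all assumptions chosen for readability.

```python
import random

def build_tree(points, rng, depth=0, max_depth=10):
    """Recursively split `points` (a list of equal-length tuples) at random."""
    # Stop once the point is isolated or the (assumed) depth limit is reached.
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    # Step 1: pick one feature at random.
    feature = rng.randrange(len(points[0]))
    values = [p[feature] for p in points]
    low, high = min(values), max(values)
    if low == high:
        # Simplification: stop if the chosen feature cannot separate the points.
        return {"size": len(points)}
    # Step 2: pick a split value within the feature's observed range.
    split = rng.uniform(low, high)
    # Step 3: send each point down one of two branches.
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, rng, depth + 1, max_depth),
        "right": build_tree(right, rng, depth + 1, max_depth),
    }

def path_length(tree, point, depth=0):
    """Count the splits passed on the way from the root to the point's leaf."""
    if "feature" not in tree:
        return depth
    branch = "left" if point[tree["feature"]] < tree["split"] else "right"
    return path_length(tree[branch], point, depth + 1)

# The seven rows of the example dataset used later in this article.
points = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (10, 100), (20, 200)]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
forest = [build_tree(points, rng) for _ in range(200)]

def mean_path_length(point):
    """Average the path length of `point` over every tree in the forest."""
    return sum(path_length(tree, point) for tree in forest) / len(forest)

avg_inlier = mean_path_length((3, 3))
avg_outlier = mean_path_length((20, 200))
print(avg_inlier, avg_outlier)
```

The point (20, 200) sits far from the others, so a single random split often separates it, while (3, 3) needs several splits that land between its close neighbors.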
2. Isolating Outliers
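For each data point, the algorithm measures the path length: the number of splits needed to reach the point from the root of a tree. Outliers are typically easier to isolate and thus have shorter path lengths. The path length averaged over all trees in the forest is converted into an anomaly score, with shorter average paths indicating more anomalous points.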
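The article does not give the scoring formula. The sketch below assumes the normalization from the original Isolation Forest paper (Liu, Ting, and Zhou, 2008), where the score is 2 raised to the power of minus the average path length divided by c(n), and c(n) is the average path length of an unsuccessful binary search tree lookup over n points. The function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def average_path_length(n):
    """c(n): expected path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    # The harmonic number H(n - 1) is approximated by ln(n - 1) + gamma.
    harmonic = math.log(n - 1) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """Score in (0, 1]; values near 1 suggest an anomaly. Requires n >= 2."""
    return 2.0 ** (-mean_path_length / average_path_length(n))

# A short average path scores higher (more anomalous) than a long one.
short_path_score = anomaly_score(1.0, 256)
long_path_score = anomaly_score(10.0, 256)
print(short_path_score, long_path_score)
```

Under this definition, a point whose average path length equals c(n) scores exactly 0.5. Note that scikit-learn's `score_samples` returns the opposite of this score, so there lower values are more anomalous.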
Implementing Isolation Forest
1. Importing Libraries
import pandas as pd
from sklearn.ensemble import IsolationForest
2. Loading Data
data = pd.read_csv('your_data.csv')
3. Creating Isolation Forest Model
model = IsolationForest(contamination=0.05)
Here, the ‘contamination’ parameter specifies the expected proportion of outliers in the dataset. It sets the score threshold that the model uses to label points as outliers.
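To see the effect of this parameter, the sketch below fits two models that differ only in `contamination` and counts how many points each one flags. The synthetic Gaussian data, the `flagged` dictionary, and the fixed `random_state` are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (an assumption for this sketch): 200 points in two dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

flagged = {}
for contamination in (0.05, 0.2):
    # Same random_state, so both forests are built from the same random splits.
    model = IsolationForest(contamination=contamination, random_state=0)
    labels = model.fit_predict(X)
    flagged[contamination] = int((labels == -1).sum())

print(flagged)
```

A larger value moves the threshold so that more points are labeled -1. If the proportion of outliers is unknown, the default `contamination='auto'` uses a fixed threshold from the original paper instead.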
4. Fitting the Model
model.fit(data)
5. Predicting Outliers
predictions = model.predict(data)
The ‘predictions’ variable will contain a NumPy array with one label per row: -1 for outliers and 1 for inliers.
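Beyond the hard labels, a fitted model also exposes continuous scores through `score_samples` and `decision_function`. The following sketch, on a small made-up column named `x`, shows how the labels relate to those scores.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Made-up data for illustration: seven similar values and one distant value.
data = pd.DataFrame({"x": [1.0, 1.1, 0.9, 1.2, 0.8, 1.05, 0.95, 8.0]})
model = IsolationForest(contamination=0.05, random_state=0).fit(data)

scores = model.score_samples(data)        # lower means more anomalous
decision = model.decision_function(data)  # scores shifted by model.offset_
predictions = model.predict(data)         # -1 where decision < 0, else 1

# Row index of the point with the lowest (most anomalous) score.
most_anomalous = int(np.argmin(scores))
print(most_anomalous, predictions)
```

Ranking rows by `scores` is often more useful than the labels alone, since it does not depend on the chosen `contamination` threshold.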
6. Viewing Results
print(predictions)
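Printing the raw array is hard to read for larger datasets. One option, sketched here with made-up data and a hypothetical column name `anomaly`, is to attach the labels to the DataFrame and filter the flagged rows.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Made-up data for illustration; replace with your own DataFrame.
data = pd.DataFrame({"value": [10, 11, 9, 10, 12, 11, 10, 95]})
model = IsolationForest(contamination=0.05, random_state=0)
predictions = model.fit_predict(data)

# Attach the labels as a new column, leaving the original DataFrame unchanged.
labeled = data.assign(anomaly=predictions)

# Keep only the rows flagged as outliers.
outliers = labeled[labeled["anomaly"] == -1]
print(outliers)
```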
Example
Dataset
Let’s consider a simple dataset with some outliers:
Feature 1 | Feature 2 |
---|---|
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
5 | 5 |
10 | 100 |
20 | 200 |
Code
import pandas as pd
from sklearn.ensemble import IsolationForest

data = pd.DataFrame({'Feature 1': [1, 2, 3, 4, 5, 10, 20],
                     'Feature 2': [1, 2, 3, 4, 5, 100, 200]})

model = IsolationForest(contamination=0.1)
model.fit(data)
predictions = model.predict(data)
print(predictions)
Output
[-1 1 1 1 1 -1 -1]
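In the output shown, the last two data points (10, 100) and (20, 200) are flagged as outliers, and the first point (1, 1) is flagged as well. Because the trees are built from random splits, the labels can differ between runs, and the number of points flagged depends on the ‘contamination’ value.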
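Since the forest is built from random splits, repeated runs of the code above are not guaranteed to give identical labels. A minimal sketch of one way to make a run repeatable is to pass `random_state`; the seed value 42 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

data = pd.DataFrame({
    "Feature 1": [1, 2, 3, 4, 5, 10, 20],
    "Feature 2": [1, 2, 3, 4, 5, 100, 200],
})

def run(seed):
    """Fit a fresh model with the given seed and return its labels."""
    model = IsolationForest(contamination=0.1, random_state=seed)
    return model.fit(data).predict(data)

first = run(42)
second = run(42)

# Two fits with the same seed produce the same labels.
same_labels = np.array_equal(first, second)
print(same_labels)
```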
Advantages of Isolation Forest
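- Handles high-dimensional data, since each split looks at only one randomly chosen feature.
- Relatively fast training and prediction, since it needs no distance or density calculations.
- Can be trained on data that already contains outliers, with no need for labels or a clean training set.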
Applications of Isolation Forest
- Fraud detection
- Network intrusion detection
- Anomaly detection in sensor data
- Medical diagnosis