Isolation Forest
Isolation Forest (iForest) is a popular machine learning algorithm specifically designed for outlier/anomaly detection. It's efficient, scales well to high-dimensional data, and works on the principle that anomalies are easier to isolate than normal points.
How Isolation Forest Works (Outlier Detection)
· It constructs binary trees (isolation trees) by randomly selecting a feature and then randomly selecting a split value between that feature's minimum and maximum.
· The idea: anomalies are isolated more quickly (in fewer splits), so they have shorter path lengths in the trees.
· Aggregating the path lengths across many trees gives an anomaly score between 0 and 1 (see the sketch below):
  - Closer to 1 → Likely anomaly.
  - Closer to 0.5 → Normal data.
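As a rough illustration of this scoring, the sketch below fits scikit-learn's IsolationForest on a small synthetic dataset; the toy data, the contamination value, and all variable names are illustrative assumptions rather than anything from the source. Negating `score_samples` recovers a score on the paper's 0-to-1 scale.

```python
# Minimal sketch: scoring points with scikit-learn's IsolationForest.
# The toy data and parameter values below are illustrative assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))    # dense cluster of "normal" points
X_outliers = rng.uniform(low=-6.0, high=6.0, size=(5, 2))   # a few scattered anomalies
X = np.vstack([X_normal, X_outliers])

iforest = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
labels = iforest.fit_predict(X)       # 1 = normal, -1 = anomaly
scores = -iforest.score_samples(X)    # roughly the paper's score: near 1 = anomaly, near 0.5 = normal

print("Points flagged as anomalies:", int(np.sum(labels == -1)))
print("Highest anomaly score:", scores.max().round(3))
```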
Outlier Correction Approach
· Isolation Forest (iForest) is designed for detecting outliers, but it does not modify the data itself.
· To correct the detected outliers (see the sketch after this list):
  - First, compute the median or mean of the values classified as normal (anomaly == 1) for each feature.
  - Then, replace the values identified as outliers (anomaly == -1) with the corresponding computed median or mean.
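Here is a minimal sketch of this median-replacement step, assuming the data lives in a pandas DataFrame with numeric columns; the column names, toy data, and contamination value are hypothetical, not from the source.

```python
# Sketch: detect outliers with IsolationForest, then replace them per feature
# with the median of the rows the model labelled as normal.
# Column names and data are hypothetical, not from the source.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
df = pd.DataFrame({
    "temperature": np.r_[rng.normal(20.0, 2.0, 100), [95.0, -40.0]],   # two injected outliers
    "pressure":    np.r_[rng.normal(1.0, 0.1, 100), [9.5, 0.001]],
})
features = ["temperature", "pressure"]

iforest = IsolationForest(contamination=0.05, random_state=42)
df["anomaly"] = iforest.fit_predict(df[features])   # 1 = normal, -1 = outlier

# Per-feature median computed only over the rows labelled normal
normal_medians = df.loc[df["anomaly"] == 1, features].median()

# Replace every outlier value with its feature's "normal" median
for col in features:
    df.loc[df["anomaly"] == -1, col] = normal_medians[col]

df = df.drop(columns="anomaly")   # the helper label is no longer needed
```

Computing the median over the normal rows only (rather than over all rows) keeps the replacement value from being skewed by the very outliers being corrected.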
Isolation Forest Key Parameters
Param Name | Description | Default Value | Possible Values |
---|---|---|---|
CONTAMINATION | Estimated proportion of outliers in the dataset | 0.05 | Float between 0.0 and 0.5, or 'auto' |
N_ESTIMATORS | Number of trees (base estimators) in the forest | 100 | Any positive integer |
MAX_SAMPLES | Number of samples to draw for training each tree | 'auto' | Integer (1 to n_samples) or float in (0.0, 1.0] |
MAX_FEATURES | Number of features to draw for each tree | 1.0 | Float in (0.0, 1.0] or integer (1 to n_features) |
N_JOBS | Number of parallel jobs for computation | -1 | -1 (all CPUs), 1, 2, ..., or None |
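For reference, the sketch below shows how these table entries map onto the lower-case, underscore-named keyword arguments of scikit-learn's IsolationForest; the values simply echo the defaults listed above, and random_state is an extra assumption added for reproducibility.

```python
# Sketch: instantiating IsolationForest with the parameters from the table.
# Values mirror the table's defaults; random_state is an added assumption.
from sklearn.ensemble import IsolationForest

iforest = IsolationForest(
    contamination=0.05,    # CONTAMINATION: expected fraction of outliers (float or 'auto')
    n_estimators=100,      # N_ESTIMATORS: number of isolation trees in the forest
    max_samples="auto",    # MAX_SAMPLES: samples drawn to train each tree
    max_features=1.0,      # MAX_FEATURES: fraction (or count) of features per tree
    n_jobs=-1,             # N_JOBS: -1 uses all available CPU cores
    random_state=42,       # not in the table; fixes the randomness for reproducible results
)
```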