Isolation Forest

Isolation Forest (iForest) is an unsupervised, tree-based ensemble algorithm designed specifically for outlier/anomaly detection. It is efficient because each tree is built from a small random subsample, it scales well to large and high-dimensional datasets, and it works on the principle that anomalies, being few and different, are easier to isolate than normal points.

How Isolation Forest Works (Outlier Detection)

· It constructs binary trees (isolation trees) by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum of that feature, repeating the split on each side until points are isolated or a height limit is reached.

· The idea: Anomalies are isolated in fewer splits, so they have shorter path lengths (fewer edges from the root to their leaf) in the tree.

· Averaging the path length across many trees and normalizing it by the expected path length for the sample size gives an anomaly score between 0 and 1:

  • Closer to 1 → Likely anomaly.

  • Below 0.5 → Likely normal data (if every point scores around 0.5, the data has no distinct anomalies).
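The tree-building and scoring steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names (`build_tree`, `path_length`, `anomaly_scores`), the dictionary-based tree, and the toy data are all assumptions made for the example. The score follows the usual iForest form, 2 ** (-mean_path_length / c(n)), where c(n) is the average path length of an unsuccessful binary-search-tree lookup over n points.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Grow one isolation tree: random feature, then random split value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}  # leaf
    feature = rng.randrange(len(points[0]))
    lo = min(p[feature] for p in points)
    hi = max(p[feature] for p in points)
    if lo == hi:  # this feature cannot separate the remaining points
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(point, node, depth=0):
    """Edges from the root to the point's leaf, plus c(leaf size) for unsplit leaves."""
    if "size" in node:
        return depth + c(node["size"])
    side = "left" if point[node["feature"]] < node["split"] else "right"
    return path_length(point, node[side], depth + 1)


def anomaly_scores(points, n_trees=100, sample_size=256, seed=0):
    """Score every point; assumes at least two points are supplied."""
    rng = random.Random(seed)
    psi = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(psi))
    trees = [
        build_tree(rng.sample(points, psi), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    scores = []
    for p in points:
        mean_h = sum(path_length(p, t) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_h / c(psi)))
    return scores


# Toy data (made up for this sketch): one tight cluster plus one far-away point.
data_rng = random.Random(42)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
data.append((8.0, 8.0))
scores = anomaly_scores(data)
```

In this sketch the far-away point is separated from the cluster after very few splits, so its average path length is short and its score is the highest in the dataset.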

Outlier Correction Approach

· Isolation Forest (iForest) is designed for detecting outliers, but it does not modify the data itself.

· To correct the detected outliers:

  • First, compute the median or mean of the values classified as normal (anomaly == 1) for each feature.

  • Then, replace the values identified as outliers (anomaly == -1) with the corresponding computed median or mean.
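The correction steps above might look like the following sketch. It assumes the labels use the 1 (normal) / -1 (outlier) convention mentioned above, for example as returned by scikit-learn's `IsolationForest.fit_predict`. The function name `correct_outliers` and the sample rows are hypothetical. Because iForest labels whole rows rather than individual cells, this sketch replaces every feature of a flagged row with the per-feature median of the normal rows.

```python
from statistics import median


def correct_outliers(rows, labels):
    """Replace rows labelled -1 with the per-feature median of rows labelled 1."""
    normal_rows = [row for row, label in zip(rows, labels) if label == 1]
    if not normal_rows:
        raise ValueError("no rows labelled as normal; cannot compute medians")
    # Column-wise medians computed from the normal rows only
    medians = [median(column) for column in zip(*normal_rows)]
    return [
        list(medians) if label == -1 else list(row)
        for row, label in zip(rows, labels)
    ]


# Hypothetical data: three normal rows and one flagged outlier row.
rows = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [50.0, 400.0]]
labels = [1, 1, 1, -1]
corrected = correct_outliers(rows, labels)
```

Swapping `median` for `statistics.mean` gives the mean-based variant. The median is usually the safer choice when a few extreme values were not flagged and remain among the normal rows.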

Isolation Forest Key Parameters

Param Name | Description | Default Value | Possible Values
CONTAMINATION | Estimated proportion of outliers in the dataset | 0.05 | Float between 0.0 and 0.5, or 'auto'
N_ESTIMATORS | Number of trees (base estimators) in the forest | 100 | Any positive integer
MAX-SAMPLES | Number of samples to draw for training each tree | 'auto' | Integer (1 to n_samples) or float (0.0–1.0]
MAX-FEATURES | Number of features to draw for each tree | 1.0 | Float (0.0–1.0], or integer (1 to n_features)
N-JOBS | Number of parallel jobs for computation | -1 | -1 (all CPUs), 1, 2, ..., or None
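As an illustration of how these parameters could be passed to an implementation, the sketch below uses scikit-learn's `IsolationForest`. It assumes the parameter names in the table correspond to scikit-learn's lowercase constructor arguments, which the table itself does not confirm. The defaults listed above are also not all scikit-learn's own (there `contamination` defaults to 'auto' and `n_jobs` to None), so the values are set explicitly. The synthetic data and `random_state` are made up for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (assumption for this sketch): 200 rows, 3 features,
# with the first five rows shifted far away from the main cluster.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:5] += 10.0

model = IsolationForest(
    contamination=0.05,   # expected share of outliers
    n_estimators=100,     # number of isolation trees
    max_samples="auto",   # samples drawn per tree
    max_features=1.0,     # fraction of features drawn per tree
    n_jobs=-1,            # use all available CPUs
    random_state=0,       # fixed seed so the run is repeatable
)

# fit_predict returns 1 for rows judged normal and -1 for outliers
labels = model.fit_predict(X)
n_flagged = int((labels == -1).sum())
```

With `contamination=0.05`, the decision threshold is chosen so that roughly 5% of the training rows are labelled -1, regardless of how many true anomalies the data contains.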