Detect outliers (anomalies in training data) by using the `iforest` function.

Load the sample data set `NYCHousing2015`.

The data set includes 10 variables with information on the sales of properties in New York City in 2015. Print a summary of the data set.
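The load and summary steps described above can be sketched as follows. This assumes that Statistics and Machine Learning Toolbox is installed, so that the sample file `NYCHousing2015` is on the path and loads a table of the same name.

```matlab
% Load the sample table NYCHousing2015 into the workspace.
load NYCHousing2015

% Print the size, data type, and basic statistics of each table variable.
summary(NYCHousing2015)
```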

Variables:

    BOROUGH: 91446x1 double
        Values:
            Min       1
            Median    3
            Max       5

    NEIGHBORHOOD: 91446x1 cell array of character vectors

    BUILDINGCLASSCATEGORY: 91446x1 cell array of character vectors

    RESIDENTIALUNITS: 91446x1 double
        Values:
            Min       0
            Median    1
            Max       8759

    COMMERCIALUNITS: 91446x1 double
        Values:
            Min       0
            Median    0
            Max       612

    LANDSQUAREFEET: 91446x1 double
        Values:
            Min       0
            Median    1700
            Max       2.9306e+07

    GROSSSQUAREFEET: 91446x1 double
        Values:
            Min       0
            Median    1056
            Max       8.9422e+06

    YEARBUILT: 91446x1 double
        Values:
            Min       0
            Median    1939
            Max       2016

    SALEPRICE: 91446x1 double
        Values:
            Min       0
            Median    3.3333e+05
            Max       4.1111e+09

    SALEDATE: 91446x1 datetime
        Values:
            Min       01-Jan-2015
            Median    09-Jul-2015
            Max       31-Dec-2015

The `SALEDATE` column is a `datetime` array, which is not supported by `iforest`. Create columns for the month and day numbers of the `datetime` values, and delete the `SALEDATE` column.
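One way to perform this conversion is with the `ymd` function. This is a sketch that assumes the table `NYCHousing2015` from the previous steps is in the workspace; the new column names `MM` and `DD` are illustrative choices.

```matlab
% ymd returns the year, month, and day numbers of datetime values.
% The year output is discarded because every sale occurred in 2015.
[~,NYCHousing2015.MM,NYCHousing2015.DD] = ymd(NYCHousing2015.SALEDATE);

% Remove the datetime column, which iforest does not support.
NYCHousing2015.SALEDATE = [];
```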

The columns `BOROUGH`, `NEIGHBORHOOD`, and `BUILDINGCLASSCATEGORY` contain categorical predictors. Display the number of categories for the categorical predictors.
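The category counts can be obtained by counting the unique values in each column, as in the sketch below. It assumes the table from the previous steps is in the workspace; the variable names `numBorough`, `numNeighborhood`, and `numBuildingClass` are hypothetical and used only for illustration.

```matlab
% Count the distinct values in each categorical predictor.
% Omitting the semicolons displays each count in the Command Window.
numBorough = numel(unique(NYCHousing2015.BOROUGH))
numNeighborhood = numel(unique(NYCHousing2015.NEIGHBORHOOD))
numBuildingClass = numel(unique(NYCHousing2015.BUILDINGCLASSCATEGORY))
```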

For a categorical variable with more than 64 categories, the `iforest` function uses an approximate splitting method that can reduce the accuracy of the isolation forest model. Remove the `NEIGHBORHOOD` column, which contains a categorical variable with 254 categories.
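Removing the column takes a single assignment, sketched here under the assumption that the table from the previous steps is in the workspace.

```matlab
% Delete the high-cardinality predictor from the table.
NYCHousing2015.NEIGHBORHOOD = [];
```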

Train an isolation forest model for `NYCHousing2015`. Specify the fraction of anomalies in the training observations as 0.1, and specify the first variable (`BOROUGH`) as a categorical predictor. The first variable is a numeric array, so `iforest` assumes it is a continuous variable unless you specify the variable as a categorical variable.
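The training call described above can be sketched as follows. It mirrors the retraining call shown later in this example, with the contamination fraction set to 0.1, and assumes the preprocessed table is in the workspace.

```matlab
rng("default") % For reproducibility

% Train the isolation forest. CategoricalPredictors=1 marks the first
% table variable (BOROUGH) as categorical rather than continuous.
[Mdl,tf,scores] = iforest(NYCHousing2015, ...
    ContaminationFraction=0.1,CategoricalPredictors=1);
```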

`Mdl` is an `IsolationForest` object. `iforest` also returns the anomaly indicators (`tf`) and anomaly scores (`scores`) for the training data `NYCHousing2015`.

Plot a histogram of the score values. Create a vertical line at the score threshold corresponding to the specified fraction.
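A minimal plotting sketch follows, assuming `Mdl` and `scores` from the training step are in the workspace. The threshold is read from the `ScoreThreshold` property of the model; the line style and label are illustrative choices.

```matlab
% Plot the distribution of anomaly scores for the training data.
histogram(scores)

% Mark the score threshold with a red vertical line and a two-line label.
xline(Mdl.ScoreThreshold,"r-",["Threshold" Mdl.ScoreThreshold])
```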

If you want to identify anomalies with a different contamination fraction (for example, 0.01), you can retrain an isolation forest model.

rng("default") % For reproducibility
[newMdl,newtf,scores] = iforest(NYCHousing2015, ...
    ContaminationFraction=0.01,CategoricalPredictors=1);

If you want to identify anomalies with a different score threshold value (for example, 0.65), you can pass the `IsolationForest` object, the training data, and a new threshold value to the `isanomaly` function.

[newtf,scores] = isanomaly(Mdl,NYCHousing2015,ScoreThreshold=0.65);

Note that changing the contamination fraction or score threshold does not change the anomaly scores. Therefore, if you do not want to compute the anomaly scores again by using `iforest` or `isanomaly`, you can obtain a new anomaly identifier with the existing score values.

Change the fraction of anomalies in the training data to 0.01.

Find a new score threshold by using the `quantile` function.
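The two steps above can be sketched as follows, assuming the `scores` vector from the training step is in the workspace. The variable name `newContaminationFraction` is a hypothetical choice. Because a fraction of 0.01 means the top 1% of scores are anomalies, the threshold is the 0.99 quantile of the scores.

```matlab
% New fraction of anomalies in the training data.
newContaminationFraction = 0.01;

% The threshold is the score value below which a (1 - fraction)
% share of the observations falls.
newScoreThreshold = quantile(scores,1-newContaminationFraction)
```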

newScoreThreshold = 0.6597

Obtain a new anomaly identifier.
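A short sketch of this step, assuming `scores` and `newScoreThreshold` are in the workspace: observations whose scores exceed the new threshold are flagged as anomalies.

```matlab
% Logical vector that is true for observations above the new threshold.
newtf = scores > newScoreThreshold;
```

Summing the result, for example with `sum(newtf)`, gives the number of observations flagged under the new threshold.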