Anomaly detection#

  • In contrast to statistical outlier detection, anomaly detection is more often concerned with local deviations than with deviations from a common distribution.

  • We will cover two density-based methods and one tree-based method.

  • Density-based methods are highly dependent on scaling.

    • The amount and type of scaling is problem-dependent and can be considered part of the tuning.

DBSCAN#

  • Density-Based Spatial Clustering of Applications with Noise.

  • No assumptions about distributions.

  • Definitions:

    • A ‘core point’ has at least MinPts points within radius \(\epsilon\).

    • A ‘border point’ has fewer than MinPts points within radius \(\epsilon\), but lies within the radius of a ‘core point’.

    • All other points are ‘noise points’.

  • Clustering:

    • Cluster ‘core points’ that lie within each other’s radii.

    • Assign ‘border points’ to their respective ‘core point’ clusters.

  • Our main interest is in detecting ‘noise points’, i.e., outliers.

    • Also, small clusters may indicate a series of outliers.

DBSCAN in 2D#
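  • A minimal sketch of DBSCAN on synthetic 2D data (an addition assuming scikit-learn's make_blobs; the data and parameters are illustrative, not from the lecture), showing how core, border, and noise points follow from the \(\epsilon\)/MinPts definitions above.

# Illustrative 2D sketch (assumed data via make_blobs, not the original demo)
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=1)
db2 = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Core points are listed in core_sample_indices_, label -1 marks noise points,
# and the remaining labelled points are border points of their clusters
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db2.core_sample_indices_] = True
noise_mask = db2.labels_ == -1
border_mask = ~core_mask & ~noise_mask

plt.figure(figsize=(5, 5))
plt.plot(X[core_mask, 0], X[core_mask, 1], 'o', label='core')
plt.plot(X[border_mask, 0], X[border_mask, 1], 'o', color='orange', label='border')
plt.plot(X[noise_mask, 0], X[noise_mask, 1], 'o', color='red', label='noise')
plt.legend()
plt.show()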

DBSCAN in 1D#

  • To avoid purely vertical clustering in the charts, we need to use the observation/time dimension actively.

  • The horizontal spacing between points in the chart will be an extra parameter to tune.

  • More than one variable can be included in DBSCAN, but the variables must then be matched by some form of scaling, e.g., standardisation (see the sketch after this list).

  • DBSCAN does not care about drift in mean values, only about local density (both a pro and a con).
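
  • A minimal sketch (an addition, assuming scikit-learn's StandardScaler and hypothetical variables x1 and x2) of how several variables plus the index column could be standardised before DBSCAN.

# Hedged sketch: standardise several variables plus the index column (assumed example)
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(0, 1, 500)      # first variable
x2 = rng.normal(5, 10, 500)     # second variable on a different scale
index = np.arange(500)          # observation/time dimension

X = np.column_stack([x1, x2, index])
X_scaled = StandardScaler().fit_transform(X)

# Re-weight the standardised index column to control the horizontal spacing,
# analogous to the step_size used in the demo below
X_scaled[:, 2] *= 0.5

labels = DBSCAN(eps=0.7, min_samples=3).fit_predict(X_scaled)
print('Outliers found:', np.sum(labels == -1))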

# Random normal data
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
N = 1000
data = np.random.normal(0, 1, N)
plt.figure(figsize=(10,3))
plt.plot(data, 'o')
plt.ylim(-4.5, 4.5)
plt.xlim(0, N)
plt.axhline(0, color='black', linestyle='--')
plt.ylabel('Values')
plt.xlabel('Observations')
plt.show()
[Figure: random normal data plotted against observation index]
# Import DBSCAN from sklearn
from sklearn.cluster import DBSCAN

# Stack the data and a scaled index column into a two-column array
step_size = 0.02
data2D = np.array([data, np.linspace(0, N*step_size, N)]).T

# Initialize and fit the DBSCAN model
db = DBSCAN(eps=0.5, min_samples=3, metric='euclidean')
db.fit(data2D)

# Obtain the predicted labels and calculate number of clusters
pred_labels = db.labels_
# -1 is an outlier, >=0 is a cluster

# Count the number of samples in each cluster (the +1 shift puts the outliers, label -1, in bin 0)
counts = np.bincount(pred_labels+1)
counts
array([ 28, 957,   4,   3,   5,   3])
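  • Choosing eps is part of the tuning; a common heuristic (an addition, not part of the original analysis) is a k-distance plot, sketched here with sklearn.neighbors.NearestNeighbors.

# Sketch of a k-distance plot for guiding the choice of eps (assumption, not
# from the original analysis); a 'knee' in the sorted curve is a common starting value
from sklearn.neighbors import NearestNeighbors

k = 3  # matched to min_samples
nn = NearestNeighbors(n_neighbors=k + 1).fit(data2D)  # +1: each point is its own nearest neighbour
dist, _ = nn.kneighbors(data2D)
kth_dist = np.sort(dist[:, -1])

plt.figure(figsize=(10, 3))
plt.plot(kth_dist)
plt.ylabel(f'Distance to {k}th neighbour')
plt.xlabel('Points sorted by distance')
plt.grid()
plt.show()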
# Show the cluster labels
from matplotlib.lines import Line2D
plt.figure(figsize=(10,3))
plt.plot(data2D[:,1], data2D[:,0], 'o')
plt.ylim(-4.5, 4.5)
plt.xlim(0, max(data2D[:,1]))
plt.axhline(0, color='black', linestyle='--')
plt.ylabel('Values')
plt.xlabel('Scaled observations')
# Overplot outliers (label -1) in red and small clusters (label > 0) in orange; the main cluster (label 0) stays blue
for i in range(len(data)):
    if pred_labels[i] == -1:
        plt.plot(data2D[i,1], data2D[i,0], 'o', color='red')
    if pred_labels[i] > 0:
        plt.plot(data2D[i,1], data2D[i,0], 'o', color='orange')
legend_elements = [Line2D([0], [0], marker='o', color='red', label='outlier', linestyle='None'),
                   Line2D([0], [0], marker='o', color='orange', label='cluster', linestyle='None')]
plt.legend(handles=legend_elements, bbox_to_anchor=(1.05, 1), loc=2)
plt.grid()
plt.show()
[Figure: DBSCAN labels over scaled observations; outliers in red, small clusters in orange]
# Convert the above plot to Plotly format
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=data2D[:,1], y=data2D[:,0], mode='markers', marker=dict(color='blue'), name='normal'))
fig.add_trace(go.Scatter(x=data2D[pred_labels==-1,1], y=data2D[pred_labels==-1,0], mode='markers', marker=dict(color='red'), name='outlier'))
fig.add_trace(go.Scatter(x=data2D[pred_labels>0,1], y=data2D[pred_labels>0,0], mode='markers', marker=dict(color='orange'), name='cluster'))
fig.update_layout(title='DBSCAN Anomaly Detection', xaxis_title='Scaled observations', yaxis_title='Values')
fig.show()