# Enhanced Anomaly Detection Algorithm for Effective Alerts

## Motivation

ThousandEyes employs agents to gather and analyze a wide array of metrics. By tracking changes in these metrics, users can gain insights that help refine alert settings based on particular trends and occurrences. This principle is fundamental to our Alert Rule framework, enabling users to establish various criteria for notifications related to tests and agents. Alert rules can evaluate specific values detected in a test (e.g., a particular error code in a response header), set a fixed threshold for time-series data (e.g., when response times exceed a designated limit), or utilize Dynamic Baselining for time-series analysis. Dynamic Baselining is particularly advantageous for identifying deviations from a typical range of values for parameters like latency, as this range may fluctuate over time and vary based on specific configurations, making static thresholds ineffective.

One example is the dynamic Standard Deviation (SD) approach, which generates an alert when a response time exceeds the mean by a chosen multiple of the standard deviation of the metric's distribution. In the illustration below, ThousandEyes alerts users on any new entry in the time series that is at least two standard deviations above the mean within specific time windows.

Although this outlier detection method works well on approximately normal data, it is not guaranteed to be as effective on other distributions. In this article, we propose a statistical method based on Tukey Fences, a percentile-based anomaly detector, to enhance dynamic baselining. This improvement can significantly reduce both false positives and false negatives, that is, erroneous alerts and missed alerts.

## Comparing Statistical Anomaly Detectors

### Statistical Anomaly Detectors

In this document, we compare three methods: the Standard Deviation Method, the Percentile Based Anomaly Detector, and the Tukey Fences approach. These methods represent different variants of Dynamic Baselining; the Standard Deviation Method serves as our legacy approach, while we benchmark it against two more recent alternatives.

**Standard Deviation (SD) Method**

The SD method establishes a threshold for outliers and treats every point above it as anomalous. For a set of observations X of a specific metric over a time window, the threshold is:

$$
\text{threshold}(X) = \operatorname{mean}(X) + \text{num\_sd} \cdot \operatorname{std}(X)
$$

In essence, we flag points that exceed the mean of the observations by more than a specified number of standard deviations. This method has a single tuning parameter (num_sd), which sets the algorithm's sensitivity: a lower value flags more anomalies, and a higher value flags fewer.
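The SD threshold can be sketched in a few lines of Python. This is a minimal illustration, not the production implementation: the function names and sample values are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the article does not say which estimator is used.

```python
from statistics import mean, pstdev


def sd_threshold(window, num_sd):
    """Upper outlier limit: mean plus num_sd standard deviations."""
    # Assumption: population standard deviation over the window.
    return mean(window) + num_sd * pstdev(window)


def is_anomaly(value, window, num_sd=2.0):
    # The Motivation section describes alerting on values at or above the limit.
    return value >= sd_threshold(window, num_sd)


# Hypothetical response times (ms) observed in one time window.
window = [100, 102, 98, 101, 99, 100, 103, 97]
limit = sd_threshold(window, num_sd=2.0)
```

Lowering `num_sd` pulls the limit toward the mean, so more points are flagged.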

**Percentile Based Anomaly Detector**

The SD Method assumes that the timing metric (e.g., response time) aggregated over a time window follows a Gaussian distribution. If the distribution is indeed Gaussian, it is feasible to calibrate confidence levels to the appropriate standard deviation bounds and identify anomalies. This feature is crucial for our use case, as it provides users with control over the desired sensitivity and confidence levels. However, if the Gaussian assumption is violated (e.g., due to a heavy-tailed distribution), calibration errors may occur. Consequently, users may struggle to manage the number of detected anomalies based on their target confidence or sensitivity levels. Ideally, a high confidence and low sensitivity level should yield only a few extreme values, while a low confidence and high sensitivity level should result in a greater number of anomalies.

Under the Gaussian assumption, setting num_sd = 1 corresponds to a confidence level of 68%, meaning approximately 68% of samples would fall within that limit, leaving 32% (100–68) as anomalies. Increasing num_sd to 2 raises the confidence level to 95%, resulting in 5% of samples being flagged as anomalies. The following illustration demonstrates the varying threshold curves for num_sd = 1, 2, and 3, where samples crossing these limits are identified as anomalies:
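The 68%, 95%, and 99.7% figures are the two-sided coverage of a normal distribution and can be reproduced with the error function. The short sketch below is only a sanity check of those numbers; the function name is ours.

```python
from math import erf, sqrt


def gaussian_coverage(num_sd):
    """Fraction of a normal distribution within num_sd standard deviations of the mean."""
    return erf(num_sd / sqrt(2))


for num_sd in (1, 2, 3):
    inside = gaussian_coverage(num_sd)
    print(f"num_sd={num_sd}: confidence {inside:.1%}, expected anomalies {1 - inside:.1%}")
```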

The red dots in the image mark the anomalies flagged by the num_sd = 3 threshold, which in theory corresponds to a 99.7% confidence level. If the distribution were Gaussian, we would expect far fewer anomalies (red dots). Because the actual distribution is not Gaussian, a considerable number of anomalies are detected even at the highest confidence level. The high number of false positives flagged by the SD algorithm therefore stems from miscalibration, and this calibration error means the number of anomalies cannot be controlled through the desired confidence and sensitivity levels.

To address this issue, the anomaly detector computes rolling percentiles (quantiles) over the same time window for a specified sensitivity level, instead of calculating the mean and standard deviation. The envelope obtained by varying the confidence level is then used to identify anomalies. This technique is known as the percentile (quantile) anomaly detector.
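A rolling percentile detector along these lines might look as follows. This is a sketch under stated assumptions: the percentile uses linear interpolation, each point is compared against the window of points that precede it, and the names, window size, and sample data are hypothetical.

```python
from collections import deque


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between sorted points."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def rolling_percentile_flags(series, window_size, confidence):
    """Flag each point that exceeds the `confidence` percentile of the preceding window."""
    window = deque(maxlen=window_size)
    flags = []
    for value in series:
        if len(window) == window_size:
            flags.append(value > percentile(window, confidence))
        else:
            flags.append(False)  # not enough history yet
        window.append(value)
    return flags


# Hypothetical response times (ms) with one spike.
series = [100, 101, 99, 102, 100, 180, 101, 100, 99, 103]
flags = rolling_percentile_flags(series, window_size=5, confidence=95)
```

Because the limit is read directly from the empirical distribution, the share of flagged points tracks the requested confidence level without any Gaussian assumption.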

In Figure 3, the confidence levels for the percentile detector align with the 1, 2, and 3 standard deviation boundaries of the SD detector. The red dots in both plots indicate the highest confidence level. The figure illustrates that the highest percentile substantially reduces false positives identified by the SD approach.

Below, we present a calibration plot:

In Figure 4, the "ideal" category represents the target confidence level. For example, at a low sensitivity level, the confidence level is 68%, leading to 32% anomalies; similarly, a 95% confidence level should ideally result in 5% anomalies. The percentile (quantile) algorithm demonstrates a closer match to the target confidence levels compared to the STD (rolling standard deviation) approach. The total deviation from the ideal for the STD detector is 15%, while the percentile detector achieves a reduction to 6%, resulting in a 60% improvement in calibration.
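The arithmetic behind the calibration comparison can be written out as below. The article does not define "total deviation", so the sketch assumes it is the sum of absolute gaps between the observed and ideal anomaly rates across confidence levels; the function names are hypothetical.

```python
def total_calibration_deviation(observed_rates, confidence_levels):
    """Sum of |observed anomaly rate - ideal anomaly rate| across confidence levels."""
    # Assumption: the ideal anomaly rate at confidence c is 1 - c.
    return sum(abs(obs - (1 - conf)) for obs, conf in zip(observed_rates, confidence_levels))


def relative_improvement(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Figures quoted in the text: 15% total deviation for STD, 6% for percentile.
improvement = relative_improvement(0.15, 0.06)
```

With the quoted figures, the reduction is (15 - 6) / 15, which is the 60% reported above.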

In addition to calibration (or the ability to control the number of anomalies) according to customer preferences, the percentile detector identifies anomalies of superior quality. The following plot illustrates the distinction between anomalous and non-anomalous samples for both STD and percentile anomaly detectors:

Figure 5 reveals that the percentile algorithm more effectively separates anomalous samples from non-anomalous ones, as evidenced by the reduced overlap in the histogram. The effect size, which measures the separation between the two categories, improves from 0.73 for the STD method to 0.85 for the percentile method, indicating a 16% enhancement in distinguishing anomalies from normal samples.

The percentile algorithm has been further refined and optimized to create a high-performance online solution using the Tukey Fences method.

**Tukey Fences Method**

The Tukey Fences algorithm likewise establishes a threshold for outliers and treats every point above it as anomalous. For a set of observations X of a specific metric over a time window, the threshold is:

$$
\text{threshold}(X) = P_{low}(X) + k \cdot \bigl(P_{high}(X) - P_{low}(X)\bigr)
$$

Here, *P_low* and *P_high* are the lower and upper percentiles, respectively, and *k* is a sensitivity parameter akin to num_sd in the SD method. Essentially, we examine the observations over the time window to find a low percentile value and a high one, multiply the distance between them by *k*, and add the result to the low percentile value to obtain the outlier limit. Many of our clients’ metrics have positively skewed distributions, which means that low percentile values lie close together, whereas high percentile values can grow quickly. As a result, the threshold is more sensitive to the choice of *k* and *P_high* than to *P_low*.
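The description above translates into a short function. In this sketch the percentile pair (25, 75), the value of *k*, and the sample window are hypothetical choices made for illustration; the article does not publish the parameters used in production. The comparison with the SD limit at the end shows why a single heavy-tail sample can hide from a mean-and-deviation threshold.

```python
from statistics import mean, pstdev


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between sorted points."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def tukey_threshold(window, p_low, p_high, k):
    """Outlier limit: low percentile plus k times the percentile spread."""
    low = percentile(window, p_low)
    high = percentile(window, p_high)
    return low + k * (high - low)


# Hypothetical response times (ms) with one heavy-tail sample.
window = [100, 102, 98, 101, 99, 100, 103, 97, 250]
tukey_limit = tukey_threshold(window, p_low=25, p_high=75, k=3.0)

# The outlier inflates the mean and standard deviation, which raises the SD limit.
sd_limit = mean(window) + 3 * pstdev(window)
```

On this window the Tukey limit stays near the bulk of the data, while the SD limit is pushed above the 250 ms sample by the sample itself.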

## Comparing SD and Tukey Anomaly Scores

We developed a custom scoring function tailored to our specific use case. The primary objective of this score is to enhance the quality of detected anomalies rather than simply reduce their quantity, while also addressing the challenges posed by an unsupervised environment.

Our analysis focused on three major clients and the three most commonly used metrics for alert definitions. To accurately compare the quality of anomalies detected by the two methods, we ensured that the number of generated anomalies was equivalent. This involved identifying the tuples of low percentile/high percentile/k-parameter that produced the same quantity of anomalies as *num_sd = 3*.
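One way to find such tuples is a plain grid search. The sketch below is an illustration only: it evaluates both detectors against a single static window rather than a rolling one, and the candidate grid, function names, and data are hypothetical.

```python
from itertools import product
from statistics import mean, pstdev


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between sorted points."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def count_above(values, limit):
    return sum(v > limit for v in values)


def matching_tukey_params(values, num_sd, p_lows, p_highs, ks):
    """Return the SD anomaly count and the (p_low, p_high, k) tuples that reproduce it."""
    target = count_above(values, mean(values) + num_sd * pstdev(values))
    matches = []
    for p_low, p_high, k in product(p_lows, p_highs, ks):
        low, high = percentile(values, p_low), percentile(values, p_high)
        if count_above(values, low + k * (high - low)) == target:
            matches.append((p_low, p_high, k))
    return target, matches


# Hypothetical data: twenty ordinary samples and one extreme value.
values = list(range(1, 21)) + [100]
target, matches = matching_tukey_params(values, 3, p_lows=[25], p_highs=[75], ks=[1.0, 3.0])
```

Holding the anomaly count fixed means any difference in score reflects which points were flagged, not how many.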

This analysis informed our selection of Tukey Parameters. Our bespoke scoring method indicated a 15% improvement in anomaly quality at the medium sensitivity level and up to 33% at the low sensitivity level.

## Choice of k-Parameters for Low and High Sensitivities

ThousandEyes aims to offer customizable anomaly detectors to better align with our clients’ unique requirements. To facilitate this, we provide sensitivity levels that significantly influence the number of detected anomalies. In addition to the proposed medium sensitivity level, we introduce two additional levels: low and high sensitivity, which would yield 50% fewer or 50% more anomalies, respectively. The sensitivity will only adjust the *k*-parameter in the Tukey method.
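Because the anomaly count can only stay the same or fall as *k* grows, a value of *k* for each sensitivity level can be found by bisection. The sketch below assumes a static window, a hypothetical medium *k* of 1.1, and hypothetical percentiles of 25 and 75; none of these values come from the article.

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between sorted points."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def count_anomalies(values, p_low, p_high, k):
    low, high = percentile(values, p_low), percentile(values, p_high)
    return sum(v > low + k * (high - low) for v in values)


def k_for_target_count(values, p_low, p_high, target, k_max=100.0, tol=1e-3):
    """Smallest k (within tol) whose anomaly count is at most `target`."""
    lo, hi = 0.0, k_max
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if count_anomalies(values, p_low, p_high, mid) <= target:
            hi = mid  # few enough anomalies: try a smaller k
        else:
            lo = mid  # too many anomalies: k must grow
    return hi


values = list(range(1, 101))  # hypothetical samples
medium_k = 1.1                # hypothetical medium-sensitivity k
medium_count = count_anomalies(values, 25, 75, medium_k)

# Low sensitivity targets 50% fewer anomalies, high sensitivity 50% more.
k_low = k_for_target_count(values, 25, 75, 0.5 * medium_count)
k_high = k_for_target_count(values, 25, 75, 1.5 * medium_count)
```

Low sensitivity ends up with a larger *k* than medium, and high sensitivity with a smaller one, while the percentiles stay fixed.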

## Real-Time Integration with Flink Alerting Pipeline

The new Tukey algorithm runs inside a real-time application built on Flink. The Flink job computes percentiles every five minutes and stores the results in a cache, and the Alerter Engine reads the percentile values from that cache in real time. Customers can adjust the sensitivity of anomaly detection while configuring alert rules, and the Engine applies the chosen sensitivity by varying the *k*-parameter. This design separates the periodic percentile computation from the per-sample threshold check, gives users an intuitive control, and keeps sensitivity configurable per alert rule. With these advancements, we anticipate substantial improvements in our ability to swiftly and accurately identify and resolve network issues.
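The split between the periodic percentile job and the real-time check can be illustrated with a small in-memory model. Everything in this sketch is hypothetical: the class and function names, the cache keys, and especially the *k* values per sensitivity level, which the article does not publish. It is plain Python, not Flink code.

```python
from dataclasses import dataclass

# Hypothetical k values per sensitivity level (not taken from the article).
K_BY_SENSITIVITY = {"low": 4.0, "medium": 3.0, "high": 2.0}


@dataclass(frozen=True)
class Baseline:
    p_low: float   # low percentile value for the current window
    p_high: float  # high percentile value for the current window


class PercentileCache:
    """Stands in for the cache that the periodic percentile job writes to."""

    def __init__(self):
        self._store = {}

    def put(self, test_id, metric, baseline):
        self._store[(test_id, metric)] = baseline

    def get(self, test_id, metric):
        return self._store.get((test_id, metric))


def is_anomalous(cache, test_id, metric, value, sensitivity="medium"):
    """Per-sample check: read cached percentiles and apply the Tukey limit."""
    baseline = cache.get(test_id, metric)
    if baseline is None:
        return False  # no baseline yet, so nothing to compare against
    k = K_BY_SENSITIVITY[sensitivity]
    return value > baseline.p_low + k * (baseline.p_high - baseline.p_low)


cache = PercentileCache()
cache.put("test-1", "latency", Baseline(p_low=100.0, p_high=110.0))
```

With this baseline, a 135 ms sample is flagged at medium and high sensitivity but not at low, because only *k* changes between levels.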

## Conclusion

Our analysis identified a combination of Tukey parameter values that, according to our custom scoring method, improves anomaly quality by 15% at the medium sensitivity level and by up to 33% at the low sensitivity level. The scoring method itself is what distinguishes this work: it targets the quality of detected anomalies rather than only their quantity, which in turn reduces false positives. And this is just the beginning! This work is part of our ongoing mission to transform the alerting system at ThousandEyes. Stay tuned for more exciting updates!

*If you're interested in tackling complex challenges, explore our Engineering careers.*