
Anomaly Detection: Behind the Scenes

Anomaly detection discovers any inconsistencies or irregularities in the data and plays a key role in maintaining up-to-date and good-quality data.

This lets users:

  • Address and solve any issues with the data in a timely manner. Users are notified of potential problems and variations in the data or the data pipeline, which can then be handled early on.

  • Monitor the quality of the data set to ensure it accurately reflects the actual state of data. This also improves the accuracy of reports and analytics generated on the data for a specific period of time.

The anomaly detection microservice is configurable; details of the configurable properties can be found in configuration-reference:anomaly-detection-configuration.adoc.

Anomaly detection runs on historical profiling versions of the given entity (catalog items and their attributes) using one of two models: isolation forest (time-independent) or time series analysis (time-dependent).

The Anomaly Detector microservice receives anomaly detection requests from the Metadata Management Module (MMM) and sends back the results once the detection is complete.

Isolation forest (time-independent)

The time-independent model uses the Isolation Forest Algorithm to single out anomalous data points. This is done by measuring how much each data point diverges from other points in the same data set.

Data points are individually evaluated based on a number of categories, each corresponding to a different type of statistical data. When all categories are considered, the more a data point departs from the rest of the points in the set, the more anomalous it is.

The algorithm then compiles this multidimensional input to identify anomalies at the level of categories and provides explainability by attempting to detect the single most anomalous feature, for example, the minimum feature in the numeric statistics category.

Based on these results, users can further investigate in which ways the identified category causes the anomaly and take steps to resolve the issue.

Time-independent anomaly detection is active after profiling has been run at least six times. It becomes more accurate as more data points become available.
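
As a rough illustration of this approach (not the actual microservice code), the following sketch represents each historical profiling version as a feature vector, scores the versions with scikit-learn's IsolationForest, and reports the single feature that deviates most as a simple explanation. The feature names, sample values, and z-score heuristic are illustrative assumptions.

```python
# Illustrative sketch only: score historical profiling versions with an
# isolation forest and report the single most deviating feature.
import numpy as np
from sklearn.ensemble import IsolationForest

# Assumed feature vector: one row per profiling version, one column per statistic.
FEATURES = ["null count", "row size", "distinct count", "minimum", "maximum", "mean"]
X = np.array([
    [0, 1000, 950, 1.0, 99.0, 50.2],
    [1, 1010, 960, 1.0, 98.5, 50.0],
    [0, 1005, 955, 1.2, 99.1, 49.8],
    [0,  998, 948, 0.9, 98.9, 50.1],
    [2, 1012, 957, 1.1, 99.3, 50.4],
    [0,  130, 120, 1.0, 12.0,  6.3],  # a suspicious profiling version
])

model = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
labels = model.fit_predict(X)          # -1 = anomalous, 1 = normal
scores = model.decision_function(X)    # lower score = more anomalous

# Rough explainability: which single feature deviates most from the other versions?
mean, std = X.mean(axis=0), X.std(axis=0) + 1e-9
for i in np.where(labels == -1)[0]:
    z_scores = np.abs((X[i] - mean) / std)
    print(f"version {i}: most anomalous feature = {FEATURES[int(z_scores.argmax())]}")
```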

Time-independent anomaly detection feature categories
Count features

These features take into account how many values of a certain type are found in an attribute. As such, they are applicable to attributes of all data types.

  • null count: The number of NULL values in a catalog item attribute.

  • row size: The number of rows in a catalog item attribute.

  • distinct count: The total number of non-unique and unique values, not including NULL values. Each non-unique value is counted in this category once, while all of its duplicates, whether there is one or several, fall into the duplicate count category.

  • duplicate count: The number of values that have duplicates. This does not include NULL values.

  • unique count: The number of values that appear only once in an attribute. This does not include NULL values.
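
As a rough illustration of how these counts relate to each other, the sketch below computes them for a single attribute using pandas. The sample values are made up, and the exact interpretation of the duplicate count is an assumption, as noted in the comments.

```python
# Sketch (assumptions noted in comments): count features for one attribute.
import pandas as pd

col = pd.Series(["a", "b", "b", "c", "c", "c", None])

row_size = len(col)                               # 7 rows in total
null_count = int(col.isna().sum())                # 1 NULL value

non_null = col.dropna()
counts = non_null.value_counts()

distinct_count = int(non_null.nunique())          # 3: each value counted once (a, b, c)
unique_count = int((counts == 1).sum())           # 1: only "a" appears exactly once
# Assumed reading of "duplicate count": all occurrences of repeated values
# beyond the first one (b contributes 1, c contributes 2).
duplicate_count = int((counts[counts > 1] - 1).sum())   # 3
```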

Main string features

The features from this category are used only for the string data type.

  • minimal string length: The number of characters in the shortest string value.

  • maximal string length: The number of characters in the longest string value.

  • mean of string lengths: The average number of characters across all attribute values.

Main numeric features

The features from this category are used only for numeric data types.

  • minimum: The lowest value in an attribute.

  • maximum: The highest value in an attribute.

  • mean: The average value of an attribute.

  • standard deviation: The standard deviation calculated on the values of an attribute.

  • numeric sum: The total of all values in an attribute.

Frequency head

The values that occur most frequently in a catalog item attribute and their counts.

Frequency tail

The least frequent values in a catalog item attribute and their counts.

Mask head

The most frequent masks in a catalog item attribute and their counts.

Mask tail

The least frequent masks in a catalog item attribute and their counts.

Fingerprints

Applicable only when the Isolation Forest algorithm is used and only on string values. Corresponds to the fingerprint vector.

For more information, see Term Suggestions.

Quantiles

Applicable only when the Isolation Forest algorithm is used and only on numeric values. This category shows the frequency distribution of data points and identifies the most anomalous quantile ranges.

Each percentile is a value greater than or equal to the corresponding percentage of all values in a column. In other words, the percentiles divide the values of a column in such a way that 10% of those values fall between each two neighboring percentiles. The 0th percentile is therefore the minimum, the 100th percentile is the maximum, and the 50th percentile is the median. The following percentiles are used:

  • 0th percentile (minimum)

  • 10th percentile

  • 20th percentile

  • 30th percentile

  • 40th percentile

  • 50th percentile (median)

  • 60th percentile

  • 70th percentile

  • 80th percentile

  • 90th percentile

  • 100th percentile (maximum)
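
As a small illustration with made-up values, the eleven percentile features above can be computed as follows:

```python
# Sketch: the 0th-100th percentiles (in steps of 10) for one numeric attribute.
import numpy as np

values = np.array([1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # 100 is a likely outlier
steps = np.arange(0, 101, 10)
percentiles = np.percentile(values, steps)

for p, v in zip(steps, percentiles):
    print(f"{p}th percentile: {v}")   # 0th = minimum, 50th = median, 100th = maximum
```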

Time-dependent

The time series model is based on periodicity (seasonality), which describes how often a pattern repeats in the data at regular, fixed intervals. This can be determined based on metadata, that is, profiling results, or on the data itself.

In the context of ONE, some common periodicity values indicate the following:

  • 7: The data pattern repeats every seven days (or every seven profile versions), which suggests daily data, that is, one profiling per day.

    For example, if the data is profiled every day, the values measured on one Monday can be expected to be similar to the values measured on each following Monday, and so on.

  • 12: The data pattern repeats every 12 data points or profiles, which suggests that the data is profiled on a monthly basis.

  • 24: The data pattern repeats every 24 data points or profiles, which suggests hourly data, that is, the data is profiled every hour.

The model breaks the data down into three components:

  • Trend: The general direction in which the data is developing over time, which can be, for instance, growing or decreasing. This pattern can be linear or nonlinear and it does not repeat.

  • Season: A general pattern in the data that repeats over time in accordance with the period specified by the periodicity.

    For example, daily seasonality can mean that a business makes the most sales on Mondays and Thursdays, with very low sales on Fridays and no sales on Saturdays, and that the same pattern is consistently valid for all analyzed weeks.

  • Residuals (residual errors): These account for any noise or errors in the data that cannot be explained either by trend or by season. This part of variation is typically random and therefore unpredictable.
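
The following is a minimal sketch of such a decomposition (not the product implementation), using statsmodels' seasonal_decompose on a synthetic daily series with a periodicity of 7; all values are made up.

```python
# Sketch: trend / season / residual decomposition of a daily metric, periodicity 7.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

dates = pd.date_range("2024-01-01", periods=28, freq="D")
trend = np.linspace(100, 120, 28)                       # slow upward trend
season = np.tile([5, 3, 4, 4, 1, 0, 0], 4)              # repeating weekly shape
noise = np.random.default_rng(0).normal(0, 1, 28)       # residual noise
series = pd.Series(trend + 10 * season + noise, index=dates)

result = seasonal_decompose(series, model="additive", period=7)
print(result.trend.dropna().head())   # general direction over time
print(result.seasonal.head(7))        # one full weekly cycle
print(result.resid.dropna().head())   # unexplained variation
```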

A minimum of six profiles is required for time-dependent anomaly detection. When the periodicity is provided by the user, the time-dependent model requires more data points than two times the periodicity value.

For example, if the periodicity is seven, there needs to be at least 15 profile versions. Otherwise, an error is returned.
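
Expressed as a simple check, these requirements could look as follows; the function name and error messages are illustrative only.

```python
# Sketch: the history requirements described above, expressed as a simple check.
from typing import Optional

def check_history(num_profiles: int, periodicity: Optional[int]) -> None:
    if num_profiles < 6:
        raise ValueError("At least six profiling versions are required.")
    if periodicity is not None and num_profiles <= 2 * periodicity:
        raise ValueError(f"More than {2 * periodicity} profiling versions are "
                         f"required for periodicity {periodicity}.")

check_history(15, 7)   # passes: 15 > 2 * 7 = 14
```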

Timestamps

If the periodicity is not known, the system can derive it for you based on the timestamps that are passed from MMM to the anomaly detection service.

The model also adjusts for irregular timestamps by handling them in one of two ways:

  • A profiling that is outside of the schedule is detected and is not used in the anomaly detection algorithm.

  • A profiling that is missing is detected and its values are imputed for the purpose of anomaly detection.

When the system needs to derive the periodicity, a greater number of data points can sometimes be required to account for any noise in the time series (for example, duplicated or missing points). If there are not enough data points, an error is returned.
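
The following is a rough pandas sketch of these two behaviors, assuming a daily profiling schedule; the actual detection and imputation logic is more involved.

```python
# Sketch: handling irregular timestamps for a daily profiling schedule.
import pandas as pd

profiles = pd.Series(
    [100, 101, 99, 250, 102],
    index=pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-03 13:45", "2024-01-05"]
    ),
)

# Out-of-schedule profiling (the 13:45 run) is detected and excluded.
on_schedule = profiles[profiles.index == profiles.index.normalize()]

# Missing profiling (2024-01-04) is detected and its value imputed by interpolation.
regular = on_schedule.asfreq("D").interpolate()
print(regular)
```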

How user feedback is processed in anomaly detection

When an anomaly is detected in the data, users are notified of the issue and prompted to confirm or reject the detected result.

The anomaly detection model is constantly improved based on the confirmed anomalies, which are handled as follows:

  • If the time-independent model (Isolation Forest algorithm) is applied, confirmed anomalous points are no longer taken into account.

  • If the time-dependent model (time series analysis) is applied, confirmed anomalous points are replaced with values in the expected range.

In both cases, this leads to a more sensitive and precise model.

The data point for which the anomaly was confirmed cannot be used to identify other anomalies, as those are not related to that specific value. Excluding the data point from future detection runs therefore increases the chances of accurately discovering other anomalous data points.

In addition, data points that have been marked as anomalous keep their status permanently since the user feedback is treated as the ground truth.
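
As an illustrative sketch of how confirmed feedback could be folded back into the two models, consider the following; the replacement value used in the time-dependent case is only a stand-in for the model's expected value.

```python
# Sketch (illustrative only): applying confirmed anomaly feedback.
import numpy as np

history = np.array([100.0, 101.0, 99.0, 250.0, 102.0, 100.0])
confirmed = [3]                        # index confirmed as anomalous by the user

keep = np.ones(len(history), dtype=bool)
keep[confirmed] = False

# Time-independent model: confirmed points are excluded from future detection runs.
isolation_forest_input = history[keep]

# Time-dependent model: confirmed points are replaced with a value in the
# expected range (here simply the median of the remaining points).
time_series_input = history.copy()
time_series_input[confirmed] = np.median(history[keep])
```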

Edge cases

This section examines how anomaly detection behaves when dealing with simpler edge cases:

  • Empty profiles. When profiling is performed on tables with empty columns, the only useful statistic is the row count, which is either equal to zero or, if the column contains only NULL values, matches the null count statistic.

    In that case, we run anomaly detection on the row count category for all the available profiles, both empty and non-empty, and on all the statistical categories for the most recent non-empty sequence of profiles.

    For example, if there are 10 profiles and the first four of them are empty, anomaly detection runs on the row count category for all 10 profiles and on all the categories for the remaining six profiles.

  • Profiles with different data types. When profiling results are obtained from data consisting of different data types, anomaly detection is applied on the most recent sequence of profiles of the same data type.

    For example, if we have 10 profiles, with the first four based on string data type and the last six on numeric data, anomaly detection is run on all the statistics for the last six profiles and only on the row count category for all the 10 profiles.

  • Profiles with inconsistent statistics. In certain cases, profiles run on the same data set can contain different statistical information; for example, some profiles might be missing a number of categories.

    In that case, the statistical categories included in the most recent profiling version are taken as a reference point and the earlier profiles are analyzed to determine which of them contain the same statistics. Anomaly detection is then applied on the profiles that share the same categories of statistical data.
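
A simplified sketch of this selection logic is shown below; the profile contents, statistic names, and helper function are illustrative assumptions.

```python
# Sketch: pick the most recent sequence of profiles that share the reference statistics.
def most_recent_consistent_suffix(profiles, reference_keys):
    suffix = []
    for profile in reversed(profiles):
        if reference_keys <= set(profile):   # profile contains all reference statistics
            suffix.insert(0, profile)
        else:
            break
    return suffix

profiles = [
    {"row count": 0},                                 # empty profile
    {"row count": 0},                                 # empty profile
    {"row count": 120, "null count": 0, "mean": 5.1},
    {"row count": 118, "null count": 1, "mean": 5.0},
]

# Row count is analyzed across all profiles...
row_counts = [p["row count"] for p in profiles]
# ...while the full set of statistics is analyzed only on the consistent recent suffix.
reference = set(profiles[-1])
full_stats_profiles = most_recent_consistent_suffix(profiles, reference)
print(len(full_stats_profiles))   # 2
```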

Hard-to-detect anomalies and false positives

A number of custom rules have been added to the models to detect previously hard-to-detect anomalies more effectively and to reduce the overall number of false positives. The models have been tweaked to better detect unexpected nulls, unexpected negative or positive values, unexpected zero values, and changes from established trends.

Regarding false positives, values identical to those previously dismissed as not anomalous are no longer flagged, and values aren’t flagged based on frequency if no significant key or changes are detected.

The low-sensitivity options are now even less sensitive, so altering the anomaly detection sensitivity can help if false positives persist.

Sizing guidelines

Due to the nature of the implemented AI algorithms, the resource consumption of Anomaly Detection depends on the amount of data that is processed: the more data there is, the more resources are required. At startup, the Anomaly Detector microservice requires 140 MB of memory (RAM).

The Anomaly Detector microservice can leverage multiple threads or processes to improve performance. These options are provided through several machine learning libraries, namely OpenMP and BLAS, each of which can be configured individually.

The number of threads can be set to one of the following values:

  • null: No value is provided. In this case, machine learning libraries use their default settings.

  • 0: All physical CPU cores are used, without hyper-threads.

  • n: n CPU cores are used. Hyper-threading is applied as needed.

  • -n: The number of CPU cores used is reduced by n. For example, if your machine has six CPUs, setting this option to -2 means that four CPU cores are employed.
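
As an illustrative sketch, the following shows how these values could map to an effective thread count; the helper function is an assumption, not the actual microservice logic.

```python
# Sketch: interpreting the thread-count setting (null / 0 / n / -n).
from typing import Optional

def resolve_threads(setting: Optional[int], physical_cores: int) -> Optional[int]:
    if setting is None:                # null: libraries keep their default settings
        return None
    if setting == 0:                   # 0: all physical cores, no hyper-threads
        return physical_cores
    if setting > 0:                    # n: exactly n cores, hyper-threading as needed
        return setting
    return max(physical_cores + setting, 1)   # -n: physical cores reduced by n

print(resolve_threads(-2, 6))          # 4, matching the example above
```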

For some of the libraries used, it is possible to specify the preferred number of parallel calculations using the property ataccama.one.aicore.parallelism.jobs. In addition, the libraries use parallelism in low-level operations, which can be controlled through the property ataccama.one.aicore.parallelism.omp.

In cases where more specific tuning is necessary, the following options are available for lower-level parallel processing:

  • ataccama.one.aicore.parallelism.blas: Has a higher priority than OpenMP but can be ignored by some libraries depending on the compilation options of the OpenBLAS dependency.

  • ataccama.one.aicore.parallelism.threads: Has a higher priority than OpenMP and BLAS but comes with a higher overhead than these two since it uses the dynamic API.

We strongly caution against adjusting the values of the ataccama.one.aicore.parallelism.blas and ataccama.one.aicore.parallelism.threads properties without prior consultation with our team, as the use case is specific and changes might lead to unexpected results.

The same applies to other parallelism properties. Performance tests clearly show an added benefit of using parallel processing; however, as it consumes more resources, the option is disabled by default. Therefore, enable this option only after carefully considering your needs.

As the Anomaly Detector microservice retrieves and processes profiling results for each attribute in a given catalog item, its memory usage is mainly determined by the size of the entire profiling history. While extensive performance testing is in progress, the current results show that, with a limited number of profiling versions per attribute (under 100), the memory consumption of the microservice should not exceed 170 MB.
