Monitoring Metrics
This article lists the module-specific monitoring metrics available for the Ataccama ONE Platform.
DPM and DPE
DPM
For Data Processing Module (DPM), the endpoint publishes the metrics as follows:
Metric | Metric type | Description | ||
---|---|---|---|---|
|
Gauge |
The number of jobs in each job status.
Possible values of the
|
||
|
Gauge |
The age of the highest-priority job in the DPM job queue, measured from the time the job is created to the time it is retrieved. |
||
|
Gauge |
The number of currently active DPEs. |
||
|
Gauge |
The number of connected DPEs. |
||
|
Gauge |
The number of disconnected DPEs. |
||
|
Gauge |
The number of currently inactive DPEs. |
||
|
Gauge |
The maximum duration of retrieving DPM events, in seconds. |
||
|
Gauge |
The total time spent retrieving DPM events, in seconds. |
||
|
Gauge |
The number of attempts at retrieving DPM events. |
||
|
Gauge |
The number of jobs for which events are currently in memory waiting to be processed. |
||
|
Gauge |
Measures how long it takes to check the status of DPEs, expressed in seconds. |
||
|
Gauge |
The number of jobs waiting to be canceled. |
||
|
Timer |
The duration of creating jobs. |
||
|
Timer |
The duration of updating jobs. |
||
|
Timer |
The duration of add operations on the submit job queue. |
||
|
Timer |
The duration of add operations on the cancel job queue. |
||
|
Timer |
The duration of canceling jobs as admin. |
||
|
Timer |
The duration of submitting jobs. |
||
|
Timer |
The duration of canceling jobs. |
||
|
Timer |
The duration of resubmitting jobs. |
||
|
Gauge |
The size of incoming |
||
|
Timer |
The duration of uploading files. |
||
|
Timer |
The duration of downloading provider files. |
||
|
Timer |
The duration of job preprocessing.
|
||
|
Timer |
The duration of job status changes.
The status transition is determined using tags |
||
|
Timer |
The duration of job preprocessing. The metric uses |
||
|
Timer |
The duration of job postprocessing. The metric uses |
||
|
Threadpool |
The name of the thread pool in the tag |
||
|
Gauge |
Health check metrics grouped by dependency and depender (in this case, DPM). Must be explicitly enabled through the property |
||
|
Gauge |
The maximum duration of a DPE status check, expressed in seconds. |
||
|
Gauge |
The minimum duration of a DPE status check, expressed in seconds. |
||
|
Gauge |
The average duration of a DPE status check, expressed in seconds. |
||
|
Gauge |
The median duration of a DPE status check, expressed in seconds. |
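
The four DPE status check gauges above (maximum, minimum, average, and median duration) are ordinary summary statistics over a set of measured durations. The following is a minimal sketch of how such values relate to each other, not DPM's actual implementation; the function name `summarize_durations` and the sample values are hypothetical.

```python
from statistics import mean, median


def summarize_durations(durations_s):
    """Return max, min, average, and median of status check durations (seconds)."""
    if not durations_s:
        raise ValueError("at least one duration is required")
    return {
        "max": max(durations_s),
        "min": min(durations_s),
        "avg": mean(durations_s),
        "median": median(durations_s),
    }


# Hypothetical sample of five DPE status check durations, in seconds.
sample = [0.12, 0.08, 0.30, 0.10, 0.15]
summary = summarize_durations(sample)
print(summary)
```

A large gap between the median and the maximum usually points to a few slow checks rather than a general slowdown, which is why it can help to watch both values together.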
DPE
For Data Processing Engine (DPE), the following metrics are available:
Metric | Metric type | Description |
---|---|---|
|
Timer |
Average total processing time of pushdown processing over the previous five runs. |
|
Timer |
Average time spent on deploying Java functions to Snowflake over the previous five runs. |
|
Timer |
Average time spent on processing column statistics requirements and translating them to Snowpark Dataframes over the previous five runs. |
|
Timer |
Average time spent on processing domain detection rules and translating them to Snowpark Dataframes over the previous five runs. |
|
Timer |
Average time spent on uploading lookups to Snowflake temporary tables over the previous five runs. |
|
Timer |
Average time spent on processing and executing a query that gathers frequency statistics over the previous five runs. |
|
Timer |
Average time spent on executing Snowpark Dataframes over the previous five runs. |
|
Timer |
Average time spent on processing returned results over the previous five runs. |
|
Gauge |
Displays how many Snowflake Pushdown jobs were processed during the specified time window.
Configurable via property |
|
Timer |
Average time spent on processing fingerprint requirements and translating them to Snowpark Dataframes over the previous five runs. |
|
Gauge |
The number of jobs that are currently running in DPE. |
|
Timer |
Measures how long it takes to collect and compute constraints. Static constraints are calculated only once during the first collection. |
|
Timer |
Measures how long it takes to compute constraints based on digests of the cached data source connections. |
|
Gauge |
Health check metrics grouped by dependency and depender (in this case, DPE). Must be explicitly enabled through the property
Health statuses are mapped as follows:
|
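
Several DPE timers above report an average "over the previous five runs". The following sketch shows one way such a rolling average can be computed. It is an illustration only: the source does not describe how DPE calculates these values, and the class name `RollingAverage` and the sample durations are hypothetical.

```python
from collections import deque


class RollingAverage:
    """Keeps the most recent `window` durations and averages them."""

    def __init__(self, window=5):
        # A deque with maxlen silently drops the oldest value when full.
        self._values = deque(maxlen=window)

    def record(self, duration_s):
        self._values.append(duration_s)

    def value(self):
        # No runs recorded yet: there is nothing to average.
        if not self._values:
            return None
        return sum(self._values) / len(self._values)


# Six hypothetical run durations; only the last five count toward the average.
avg = RollingAverage(window=5)
for duration in [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]:
    avg.record(duration)
result = avg.value()
print(result)
```

Because only five runs contribute, a single slow run can shift the reported value noticeably, and its effect disappears once five newer runs have completed.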
MMM
For Metadata Management Module (MMM), the following metrics are available:
Metric | Metric type | Description |
---|---|---|
|
Gauge |
The number of metadata transactions. |
|
Gauge |
The total time spent on transactions. |
|
Gauge |
The maximum duration of a transaction. |
|
Gauge |
The number of currently active transactions. |
|
Gauge |
The total number of times the listener was invoked. |
|
Gauge |
The total time spent in the listener. |
|
Gauge |
The maximum time spent in the listener. |
|
Gauge |
The total number of events fired. |
|
Gauge |
The number of events fired. |
|
Gauge |
The maximum duration of the handler. |
|
Gauge |
The number of actively running threads in the pool for processing DPx results. |
|
Gauge |
The number of tasks waiting to be processed in the DPx queue. |
|
Gauge |
The current size of the pool for processing DPx results. |
|
Counter |
The total number of events received from DPM and DPE, grouped by |
|
Histogram |
The size of DPM results per job |
|
Histogram |
The number of seconds that MMM requires to execute the job success result handler, provided by job |
|
Timer |
The total duration of jobs grouped by exit. |
|
Timer |
The number of seconds jobs spent in each status, grouped by |
|
Timer |
The total duration of metadata operations, grouped by |
|
Timer |
The total time spent in the flow event handler. |
|
Timer |
The duration of profiling, grouped by |
|
Counter |
The number of catalog item attributes that were profiled. |
|
Counter |
The total number of entities affected during a metadata import grouped by |
|
Gauge |
The number of stored external events (not delivered or not acknowledged). |
|
Counter |
The total number of all external events. |
|
Gauge |
The number of external event subscribers that are currently online. |
|
Gauge |
The total number of external event subscribers. |
|
Timer |
The duration of outgoing gRPC calls, grouped by |
|
Counter |
The number of received gRPC streaming messages on the client side, grouped by |
|
Counter |
The number of sent gRPC streaming messages on the client side, grouped by |
|
Timer |
The duration of incoming gRPC calls, grouped by |
|
Counter |
The number of received gRPC streaming messages on the server side, grouped by |
|
Counter |
The number of sent gRPC streaming messages on the server side, grouped by |
|
Gauge |
The number of tasks waiting to be processed in the gRPC server executor queue. |
|
Gauge |
The number of active threads in the gRPC server executor. |
|
Gauge |
The current number of threads in the gRPC server executor. |
|
Timer |
The duration of catalog item search queries. |
|
Counter |
The number of all compacted events in the catalog search outbox table (used for optimizing performance). |
|
Gauge |
The current number of failed events that will not be reprocessed. |
|
Gauge |
The current number of failed catalog search sync events that will be reprocessed. |
|
Gauge |
The current number of catalog search sync events waiting to be reprocessed. |
|
Counter |
The total number of events added to the catalog search outbox queue. |
|
Gauge |
The number of actively running threads in a pool, grouped by the thread pool |
|
Gauge |
The number of tasks waiting to be processed in a pool, grouped by the thread pool |
|
Gauge |
The current pool size, grouped by the thread pool |
|
Gauge |
The number of users that are simultaneously working with the application (application node). |
|
Timer |
The duration of GraphQL operations.
Operations can be distinguished by the tag |
|
Gauge |
Health check metrics grouped by dependency and depender (in this case, MMM). Other health checks are explicitly enabled using the following properties:
Health statuses are mapped as follows:
|
|
Gauge |
The current number of threads in the scheduler pool, grouped by the scheduler |
|
Gauge |
The number of active threads in the scheduler pool, grouped by the scheduler |
|
Gauge |
The number of tasks waiting to be processed in the scheduler queue, grouped by the scheduler |
|
Gauge |
An approximate total number of tasks that have been scheduled for execution at any point, grouped by the scheduler |
|
Gauge |
An approximate total number of tasks that have been completed, grouped by the scheduler |
|
Counter |
The number of entities involved in DQ evaluation, grouped by |
|
Counter |
The number of lookup files ( |
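
The MMM table above pairs several "number of" metrics with a matching "total time spent" metric, for example for metadata transactions and listener invocations. An average duration for a given interval can be derived from two readings of such a pair. The following is a minimal sketch under that assumption; the function name and the sample readings are hypothetical, and the time unit is whatever unit the metric itself reports.

```python
def average_duration(total_before, count_before, total_after, count_after):
    """Average duration per operation between two readings of a count/total pair."""
    delta_count = count_after - count_before
    # No new operations (or a counter reset): the average is undefined.
    if delta_count <= 0:
        return None
    return (total_after - total_before) / delta_count


# Hypothetical readings: 40 transactions taking 120.0 in total, then 50 taking 150.0.
avg_tx = average_duration(120.0, 40, 150.0, 50)
print(avg_tx)
```

Comparing this interval average with the corresponding maximum duration metric helps distinguish a general slowdown from occasional outliers.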
Catalog Search plugin
The following metrics related to the Catalog Search plugin in MMM are particularly important because they can indicate a critical error in the plugin or in its communication with Elasticsearch. We therefore recommend monitoring them proactively, with alerts configured for the listed thresholds.
Metric | Threshold | Recommended action |
---|---|---|
|
5000 |
The suggested threshold corresponds to the expected peak number of pending events during a metadata import. Exceeding it indicates that the application is lagging in processing existing events. If the queue grew because of a bulk import, the issue typically resolves on its own. If the issue recurs, the processing capabilities of the application can be scaled up as needed. |
|
101 |
The suggested threshold corresponds to the size of the event batch that should be processed. Because the metric then indicates that all events finish with an error, check the MMM log for any technical or connectivity problems between Elasticsearch and the Catalog Search plugin. If events finish in the failed state instead (because they could not be reprocessed), the application administrator should run the recovery process. |
|
1 |
Check the MMM log for any technical or connectivity problems between Elasticsearch and the Catalog Search plugin. In most cases, the application administrator should run the recovery process. |
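
The thresholds in the table above can be checked with a simple comparison against the current metric values. The following is a minimal sketch, not a ready-made alerting rule. The metric keys (`pending_events`, `errored_events`, `failed_events`) are hypothetical placeholders for the real metric names, and treating a value equal to the threshold as a breach is an assumption.

```python
# Thresholds from the table above; the keys are hypothetical placeholder names.
THRESHOLDS = {
    "pending_events": 5000,
    "errored_events": 101,
    "failed_events": 1,
}


def breached(samples, thresholds=THRESHOLDS):
    """Return the sorted names of metrics whose value reached the threshold."""
    return sorted(
        name
        for name, limit in thresholds.items()
        # Assumption: a missing metric counts as 0, and value >= limit is a breach.
        if samples.get(name, 0) >= limit
    )


# Hypothetical current values scraped from the metrics endpoint.
alerts = breached({"pending_events": 6200, "errored_events": 12, "failed_events": 0})
print(alerts)
```

In practice, the same comparisons would typically be expressed as rules in the monitoring system that scrapes the metrics, so that the recommended action for each row can be attached to the alert.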
Transaction Data plugin
Metric | Labels | Description |
---|---|---|
|
Time taken, in seconds, to fetch transaction data and map it to a GraphQL response in the data history plugin. |
|
|
|
Time taken to execute anomaly detection on transaction data. |
|
|
Time taken to persist anomaly information of transaction data. |
|
|
Time taken to fetch transaction data related to an anomaly detection job request. |
|
Time taken to create and submit a transaction data executor job. |
|
|
Time taken to import the results of a transaction data executor job. |
|
|
Time taken to serialize the transaction data executor job context. |
|
|
|
Possible values:
|
Anomaly Detection
Component | Metric | Metric type | Description | Labels |
---|---|---|---|---|
Anomaly Detector |
|
Counter |
The number of stream requests for anomaly detection issued from MMM. |
|
Anomaly Detector |
|
Counter |
The number of unary requests for anomaly detection issued from MMM. |
|
Anomaly Detector |
|
Histogram |
A histogram representing the duration of anomaly detection requests over all provided categories. |
|
Anomaly Detector |
|
Summary |
The number of seconds needed to complete anomaly detection processing for the chosen model and the given category. |
|
Anomaly Detector |
|
Summary |
The number of data points (for example, profiling versions) that were fetched from MMM. |
|
Anomaly Detector |
|
Summary |
The number of confirmed anomalies (feedback) that users provided. |
|
Anomaly Detector |
|
Summary |
The number of data points that the model identified as anomalous. |
|
Anomaly Detector |
|
Summary |
The number of features (metrics) sent for the anomaly detection. |
|
Anomaly Detector |
|
Summary |
The number of seconds needed to fit the Isolation Forest model. |
|
Anomaly Detector |
|
Summary |
The number of seconds needed to run the Isolation Forest to obtain the explainability of anomalies. |
|
Anomaly Detector |
|
Summary |
The number of seconds needed to run the frequency cutoff method in the Isolation Forest model. |
|
gRPC Server |
|
Counter |
The total number of gRPC requests with authentication failures. |
|
gRPC Server |
|
Counter |
The total number of gRPC commands received. |
|
gRPC Server |
|
Summary |
The processing time of a gRPC request, expressed in seconds. |
|
gRPC Server |
|
Gauge |
The number of active RPCs, either queued or currently processed. |
|
Microservice |
|
Info |
The microservice details. |
|
WSGI Server |
|
Counter |
The total number of HTTP requests with authentication failures. |
|
WSGI Server |
|
Counter |
The total number of HTTP request status codes. |
|
Term Suggestions
Feedback
Component | Metric | Metric type | Description | Labels |
---|---|---|---|---|
Feedback |
|
Counter |
The total number of positive or negative feedback items received from users. |
|
Feedback |
|
Histogram |
The current distance thresholds. |
|
Database |
|
Summary |
The number of seconds a database query takes to complete. |
|
gRPC Server |
|
Counter |
The total number of gRPC requests with authentication failures. |
|
gRPC Server |
|
Counter |
The total number of gRPC commands received. |
|
gRPC Server |
|
Summary |
The processing time of a gRPC request, expressed in seconds. |
|
gRPC Server |
|
Gauge |
The number of active RPCs, either queued or currently processed. |
|
Microservice |
|
Info |
The microservice details. |
|
WSGI Server |
|
Counter |
The total number of HTTP requests with authentication failures. |
|
WSGI Server |
|
Counter |
The total number of HTTP request status codes. |
|
Neighbors
Component | Metric | Metric type | Description | Labels | ||
---|---|---|---|---|---|---|
Neighbors |
|
Gauge |
The number of attributes available to the Term Suggestions microservices.
|
|
||
Neighbors |
|
Gauge |
The number of attributes currently stored in the memory. |
|
||
Neighbors |
|
Gauge |
The maximum number of attributes that can be stored in the memory. |
|
||
Neighbors |
|
Histogram |
Distances to k-th nearest neighbors. |
|
||
Database |
|
Summary |
The number of seconds a database query takes to complete. |
|
||
gRPC Server |
|
Counter |
The total number of gRPC requests with authentication failures. |
|
||
gRPC Server |
|
Counter |
The total number of gRPC commands received. |
|
||
gRPC Server |
|
Summary |
The processing time of a gRPC request, expressed in seconds. |
|
||
gRPC Server |
|
Gauge |
The number of active RPCs, either queued or currently processed. |
|
||
Microservice |
|
Info |
The microservice details. |
|
||
WSGI Server |
|
Counter |
The total number of HTTP requests with authentication failures. |
|
||
WSGI Server |
|
Counter |
The total number of HTTP request status codes. |
|
Recommender
Component | Metric | Metric type | Description | Labels |
---|---|---|---|---|
Recommender |
|
Counter |
The number of attributes for which suggestions were computed. |
|
Recommender |
|
Counter |
The number of suggestions created. |
|
Recommender |
|
Gauge |
The number of known terms. |
|
Recommender |
|
Gauge |
The number of disabled terms. |
|
Recommender |
|
Counter |
The number of times all suggestions were rendered outdated. |
|
Recommender |
|
Counter |
The number of times all suggestions were brought up to date. |
|
Recommender |
|
Gauge |
The number of attributes that have up-to-date suggestions. |
|
Recommender |
|
Gauge |
The number of attributes that have up-to-date suggestions and for which the ground truth is known. |
|
Recommender |
|
Gauge |
The confusion matrix computed between suggestions and assigned terms. |
|
Database |
|
Summary |
The number of seconds a database query takes to complete. |
|
gRPC Client |
|
Summary |
The number of seconds a gRPC query takes to complete. |
|
gRPC Server |
|
Counter |
The total number of gRPC requests with authentication failures. |
|
gRPC Server |
|
Counter |
The total number of gRPC commands received. |
|
gRPC Server |
|
Summary |
The processing time of a gRPC request, expressed in seconds. |
|
gRPC Server |
|
Gauge |
The number of active RPCs, either queued or currently processed. |
|
Microservice |
|
Info |
The microservice details. |
|
WSGI Server |
|
Counter |
The total number of HTTP requests with authentication failures. |
|
WSGI Server |
|
Counter |
The total number of HTTP request status codes. |
|
JVM
Metric | Metric type | Description |
---|---|---|
|
Summary |
The duration of garbage collector pauses. |
|
Gauge |
The maximum amount of memory that is available for memory management, expressed in bytes. |
|
Gauge |
The number of idle connections in the thread pool. |
PostgreSQL
Metric | Metric type | Description |
---|---|---|
|
Counter |
The number of deadlocks detected in the database. |
|
Gauge |
The maximum duration of an active transaction. |
|
Gauge |
The maximum number of concurrent connections allowed. |
|
Gauge |
The number of active connections. |