Monitoring Metrics
This article lists the module-specific monitoring metrics available for the Ataccama ONE Platform.
DPM and DPE
DPM
For the Data Processing Module (DPM), the endpoint publishes the following metrics:
Metric | Metric type | Description | ||
---|---|---|---|---|
|
Gauge |
The number of jobs in each job status.
Possible values of the
|
||
|
Gauge |
Measures the age of the job with the highest priority level in the DPM job queue. This is measured from the time of creation up to the time of retrieval. |
||
|
Gauge |
The number of currently active DPEs. |
||
|
Gauge |
The number of connected DPEs. |
||
|
Gauge |
The number of disconnected DPEs. |
||
|
Gauge |
The number of currently inactive DPEs. |
||
|
Gauge |
The maximum duration of retrieving DPM events, in seconds. |
||
|
Gauge |
The total time spent while retrieving DPM events, in seconds. |
||
|
Gauge |
The number of attempts at retrieving DPM events. |
||
|
Gauge |
The number of jobs for which events are currently in memory waiting to be processed. |
||
|
Gauge |
Measures how long it takes to check the status of DPEs, expressed in seconds. |
||
|
Gauge |
The number of jobs waiting to be canceled. |
||
|
Timer |
The duration of creating jobs. |
||
|
Timer |
The duration of updating jobs. |
||
|
Timer |
The duration of add operations on the submit job queue. |
||
|
Timer |
The duration of add operations on the cancel job queue. |
||
|
Timer |
The duration of canceling jobs as admin. |
||
|
Timer |
The duration of submitting jobs. |
||
|
Timer |
The duration of canceling jobs. |
||
|
Timer |
The duration of resubmitting jobs. |
||
|
Gauge |
The size of incoming |
||
|
Timer |
The duration of uploading files. |
||
|
Timer |
The duration of downloading provider files. |
||
|
Timer |
The duration of job preprocessing.
|
||
|
Timer |
The duration of job status changes.
The status transition is determined using tags |
||
|
Timer |
The duration of job preprocessing. The metric uses |
||
|
Timer |
The duration of job postprocessing. The metric uses |
||
|
Threadpool |
The name of the thread pool in the tag |
||
|
Gauge |
Health check metrics grouped by dependency and depender (in this case, DPM). Must be explicitly enabled through the property |
||
|
Gauge |
The maximum duration of a DPE status check, expressed in seconds. |
||
|
Gauge |
The minimum duration of a DPE status check, expressed in seconds. |
||
|
Gauge |
The average duration of a DPE status check, expressed in seconds. |
||
|
Gauge |
The median duration of a DPE status check, expressed in seconds. |
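
The three event-retrieval gauges listed above (maximum duration, total time spent, and number of attempts) can be combined into a mean duration per attempt. The following is a minimal Python sketch, not part of the product: `RetrievalSample` and `mean_retrieval_seconds` are hypothetical names, and the sketch assumes that the total time and the attempt count are cumulative values read from two consecutive scrapes of the DPM endpoint.

```python
from typing import NamedTuple


class RetrievalSample(NamedTuple):
    """One scrape of the two DPM event-retrieval gauges (hypothetical container)."""
    total_seconds: float  # total time spent retrieving DPM events, in seconds
    attempts: int         # number of attempts at retrieving DPM events


def mean_retrieval_seconds(previous: RetrievalSample, current: RetrievalSample) -> float:
    """Return the mean seconds per retrieval attempt between two scrapes."""
    attempts = current.attempts - previous.attempts
    if attempts <= 0:
        # No new attempts, or the values were reset: nothing to average.
        return 0.0
    return (current.total_seconds - previous.total_seconds) / attempts
```

Comparing this mean with the maximum-duration gauge helps distinguish a general slowdown from a single slow retrieval.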
DPE
For the Data Processing Engine (DPE), the following metrics are available:
| Metric | Metric type | Description |
|---|---|---|
| | Timer | Average total processing time of pushdown processing over the previous five runs. |
| | Timer | Average time spent on deploying Java functions to Snowflake over the previous five runs. |
| | Timer | Average time spent on processing column statistics requirements and translating them to Snowpark DataFrames over the previous five runs. |
| | Timer | Average time spent on processing domain detection rules and translating them to Snowpark DataFrames over the previous five runs. |
| | Timer | Average time spent on uploading lookups to Snowflake temporary tables over the previous five runs. |
| | Timer | Average time spent on processing and executing a query that gathers frequency statistics over the previous five runs. |
| | Timer | Average time spent on executing Snowpark DataFrames over the previous five runs. |
| | Timer | Average time spent on processing returned results over the previous five runs. |
| | Gauge | Displays how many Snowflake Pushdown jobs were processed during the specified time window. Configurable via property |
| | Timer | Average time spent on processing fingerprint requirements and translating them to Snowpark DataFrames over the previous five runs. |
| | Gauge | The number of jobs that are currently running in DPE. |
| | Timer | Measures how long it takes to collect and compute constraints. Static constraints are calculated only once during the first collection. |
| | Timer | Measures how long it takes to compute constraints based on digests of the cached data source connections. |
| | Gauge | Health check metrics grouped by dependency and depender (in this case, DPE). Must be explicitly enabled through the property Health statuses are mapped as follows: |

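
Several of the DPE timers above report an average over the previous five runs. How DPE computes this internally is not described here; the following Python sketch only illustrates what such a rolling average means, using a hypothetical `RollingAverage` helper.

```python
from collections import deque


class RollingAverage:
    """Average of the most recent `window` samples (illustrative helper, not a DPE API)."""

    def __init__(self, window: int = 5) -> None:
        # A deque with maxlen drops the oldest sample automatically.
        self._samples = deque(maxlen=window)

    def record(self, seconds: float) -> float:
        """Add the duration of one run and return the current rolling average."""
        self._samples.append(seconds)
        return sum(self._samples) / len(self._samples)
```

With a window of five, a single slow run raises the reported value for the next five runs and then drops out of the average entirely.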
MMM
For the Metadata Management Module (MMM), the following metrics are available:
| Metric | Metric type | Description |
|---|---|---|
| | Gauge | The number of metadata transactions. |
| | Gauge | The total time spent on transactions. |
| | Gauge | The maximum duration of a transaction. |
| | Gauge | The number of currently active transactions. |
| | Gauge | The total number of times the listener was invoked. |
| | Gauge | The total time spent in the listener. |
| | Gauge | The maximum time spent in the listener. |
| | Gauge | The total number of events fired. |
| | Gauge | The number of events fired. |
| | Gauge | The maximum duration of the handler. |
| | Gauge | The number of actively running threads in the pool for processing DPx results. |
| | Gauge | The number of tasks waiting to be processed in the DPx queue. |
| | Gauge | The current size of the pool for processing DPx results. |
| | Counter | The total number of events received from DPM and DPE, grouped by |
| | Histogram | The size of DPM results per job |
| | Histogram | The number of seconds that MMM requires to execute the job success result handler, provided by job |
| | Timer | The total duration of jobs grouped by exit. |
| | Timer | The number of seconds jobs spent in each status, grouped by |
| | Timer | The total duration of metadata operations, grouped by |
| | Timer | The total time spent in the flow event handler. |
| | Timer | The duration of profiling, grouped by |
| | Counter | The number of catalog item attributes that were profiled. |
| | Counter | The total number of entities affected during a metadata import grouped by |
| | Gauge | The number of stored external events (not delivered or not acknowledged). |
| | Counter | The total number of all external events. |
| | Gauge | The number of external event subscribers that are currently online. |
| | Gauge | The total number of external event subscribers. |
| | Timer | The duration of outgoing gRPC calls, grouped by |
| | Counter | The number of received gRPC streaming messages on the client side, grouped by |
| | Counter | The number of sent gRPC streaming messages on the client side, grouped by |
| | Timer | The duration of incoming gRPC calls, grouped by |
| | Counter | The number of received gRPC streaming messages on the server side, grouped by |
| | Counter | The number of sent gRPC streaming messages on the server side, grouped by |
| | Gauge | The number of tasks waiting to be processed in the gRPC server executor queue. |
| | Gauge | The number of active threads in the gRPC server executor. |
| | Gauge | The current number of threads in the gRPC server executor. |
| | Timer | The duration of catalog item search queries. |
| | Counter | The number of all compacted events in the catalog search outbox table (used for optimizing performance). |
| | Gauge | The current number of failed events that will not be reprocessed. |
| | Gauge | The current number of failed catalog search sync events that will be reprocessed. |
| | Gauge | The current number of catalog search sync events waiting to be reprocessed. |
| | Counter | The total number of events added to the catalog search outbox queue. |
| | Gauge | The number of actively running threads in a pool, grouped by the thread pool |
| | Gauge | The number of tasks waiting to be processed in a pool, grouped by the thread pool |
| | Gauge | The current pool size, grouped by the thread pool |
| | Gauge | The number of users that are simultaneously working with the application (application node). |
| | Timer | The duration of GraphQL operations. Operations can be distinguished by the tag |
| | Gauge | Health check metrics grouped by dependency and depender (in this case, MMM). Other health checks are explicitly enabled using the following properties: Health statuses are mapped as follows: |
| | Gauge | The current number of threads in the scheduler pool, grouped by the scheduler |
| | Gauge | The number of active threads in the scheduler pool, grouped by the scheduler |
| | Gauge | The number of tasks waiting to be processed in the scheduler queue, grouped by the scheduler |
| | Gauge | An approximate total number of tasks that have been scheduled for execution at any point, grouped by the scheduler |
| | Gauge | An approximate total number of tasks that have been completed, grouped by the scheduler |
| | Counter | The number of entities involved in DQ evaluation, grouped by |
| | Counter | The number of lookup files ( |

Catalog Search plugin
The following suffixes related to the Catalog Search plugin in MMM are particularly important because they can indicate a critical error in the plugin or in its communication with Elasticsearch. We therefore recommend monitoring these metrics proactively, with alerts configured for the listed thresholds.
| Metric | Threshold | Recommended action |
|---|---|---|
| | 5000 | The suggested threshold corresponds to the expected peak number of pending events during a metadata import. Exceeding it indicates that the application is lagging in processing existing events. If the queue grew because of a bulk import, the issue typically resolves on its own. If the issue recurs, scale up the processing capabilities of the application as needed. |
| | 101 | The suggested threshold corresponds to the size of the event batch that should be processed. Because the metric then indicates that all events finish with an error, check the MMM log for any technical or connectivity problems between Elasticsearch and the Catalog Search plugin. If events finish in the failed state instead (because they could not be reprocessed), the application administrator should run the recovery process. |
| | 1 | Check the MMM log for any technical or connectivity problems between Elasticsearch and the Catalog Search plugin. In most cases, the application administrator should run the recovery process. |

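
The thresholds above translate directly into alert conditions. The following Python sketch shows one way to evaluate them against scraped values. The keys `pending_events`, `retry_events`, and `failed_events` are hypothetical placeholders for the three Catalog Search metrics, and treating a value equal to the threshold as a breach is an assumption.

```python
# Suggested thresholds from the table above, keyed by placeholder names
# (the real metric names are not reproduced here).
THRESHOLDS = {
    "pending_events": 5000,  # events waiting to be processed
    "retry_events": 101,     # failed events that will be reprocessed
    "failed_events": 1,      # failed events that will not be reprocessed
}


def breached(samples: dict) -> list:
    """Return the sorted names of metrics whose value reached its threshold."""
    return sorted(
        name
        for name, limit in THRESHOLDS.items()
        if samples.get(name, 0) >= limit
    )
```

For example, `breached({"pending_events": 6000})` reports only the pending queue, which after a bulk import usually calls for waiting rather than running the recovery process.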
Transaction Data plugin
Metric | Labels | Description |
---|---|---|
|
Time taken, in seconds, to fetch transaction data and map it to a GraphQL response in the data history plugin. |
|
|
|
Time taken to execute anomaly detection on transaction data. |
|
|
Time taken to persist anomaly information of transaction data. |
|
|
Time taken to fetch transaction data related to anomaly detection job request. |
|
Time taken to create and submit a transaction data executor job. |
|
|
Time taken to import the results of a transaction data executor job. |
|
|
Time taken to serialize the transaction data executor job context. |
|
|
|
Possible values:
|
Anomaly Detection
Component | Metric | Metric type | Description | Labels |
---|---|---|---|---|
Anomaly Detector | | Counter | The number of stream requests for anomaly detection issued from MMM. | |
Anomaly Detector | | Counter | The number of unary requests for anomaly detection issued from MMM. | |
Anomaly Detector | | Histogram | A histogram representing the duration of anomaly detection requests over all provided categories. | |
Anomaly Detector | | Summary | The number of seconds needed to complete anomaly detection processing for the chosen model and the given category. | |
Anomaly Detector | | Summary | The number of data points (for example, profiling versions) that were fetched from MMM. | |
Anomaly Detector | | Summary | The number of confirmed anomalies (feedback) that users provided. | |
Anomaly Detector | | Summary | The number of data points that the model identified as anomalous. | |
Anomaly Detector | | Summary | The number of features (metrics) sent for anomaly detection. | |
Anomaly Detector | | Summary | The number of seconds needed to fit the Isolation Forest model. | |
Anomaly Detector | | Summary | The number of seconds needed to run the Isolation Forest to obtain the explainability of anomalies. | |
Anomaly Detector | | Summary | The number of seconds needed to run the frequency cutoff method in the Isolation Forest model. | |
gRPC Server | | Counter | The total number of gRPC requests with authentication failures. | |
gRPC Server | | Counter | The total number of gRPC commands received. | |
gRPC Server | | Summary | The processing time of a gRPC request, expressed in seconds. | |
gRPC Server | | Gauge | The number of active RPCs, either queued or currently processed. | |
Microservice | | Info | The microservice details. | |
WSGI Server | | Counter | The total number of HTTP requests with authentication failures. | |
WSGI Server | | Counter | The total number of HTTP request status codes. | |

Term Suggestions
Feedback
Component | Metric | Metric type | Description | Labels |
---|---|---|---|---|
Feedback | | Counter | The total number of positive or negative feedback items received from users. | |
Feedback | | Histogram | The current distance thresholds. | |
Database | | Summary | The number of seconds a database query takes to complete. | |
gRPC Server | | Counter | The total number of gRPC requests with authentication failures. | |
gRPC Server | | Counter | The total number of gRPC commands received. | |
gRPC Server | | Summary | The processing time of a gRPC request, expressed in seconds. | |
gRPC Server | | Gauge | The number of active RPCs, either queued or currently processed. | |
Microservice | | Info | The microservice details. | |
WSGI Server | | Counter | The total number of HTTP requests with authentication failures. | |
WSGI Server | | Counter | The total number of HTTP request status codes. | |

Neighbors
Component | Metric | Metric type | Description | Labels | ||
---|---|---|---|---|---|---|
Neighbors |
|
Gauge |
The number of attributes available to the Term Suggestions microservices.
|
|
||
Neighbors |
|
Gauge |
The number of attributes currently stored in the memory. |
|
||
Neighbors |
|
Gauge |
The maximum number of attributes that can be stored in the memory. |
|
||
Neighbors |
|
Histogram |
Distances to k-th nearest neighbors. |
|
||
Database |
|
Summary |
The number of seconds a database query takes to complete. |
|
||
gRPC Server |
|
Counter |
The total number of gRPC requests with authentication failures. |
|
||
gRPC Server |
|
Counter |
The total number of gRPC commands received. |
|
||
gRPC Server |
|
Summary |
The processing time of a gRPC request, expressed in seconds. |
|
||
gRPC Server |
|
Gauge |
The number of active RPCs, either queued or currently processed. |
|
||
Microservice |
|
Info |
The microservice details. |
|
||
WSGI Server |
|
Counter |
The total number of HTTP requests with authentication failures. |
|
||
WSGI Server |
|
Counter |
The total number of HTTP request status codes. |
|
Recommender
Component | Metric | Metric type | Description | Labels |
---|---|---|---|---|
Recommender | | Counter | The number of attributes for which suggestions were computed. | |
Recommender | | Counter | The number of suggestions created. | |
Recommender | | Gauge | The number of known terms. | |
Recommender | | Gauge | The number of disabled terms. | |
Recommender | | Counter | The number of times all suggestions were rendered outdated. | |
Recommender | | Counter | The number of times all suggestions were brought up to date. | |
Recommender | | Gauge | The number of attributes that have up-to-date suggestions. | |
Recommender | | Gauge | The number of attributes that have up-to-date suggestions and for which the ground truth is known. | |
Recommender | | Gauge | The confusion matrix computed between suggestions and assigned terms. | |
Database | | Summary | The number of seconds a database query takes to complete. | |
gRPC Client | | Summary | The number of seconds a gRPC query takes to complete. | |
gRPC Server | | Counter | The total number of gRPC requests with authentication failures. | |
gRPC Server | | Counter | The total number of gRPC commands received. | |
gRPC Server | | Summary | The processing time of a gRPC request, expressed in seconds. | |
gRPC Server | | Gauge | The number of active RPCs, either queued or currently processed. | |
Microservice | | Info | The microservice details. | |
WSGI Server | | Counter | The total number of HTTP requests with authentication failures. | |
WSGI Server | | Counter | The total number of HTTP request status codes. | |

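
The confusion matrix gauge in the Recommender table compares suggested terms with assigned terms, so it can be summarized as precision and recall. The following Python sketch assumes the matrix is available as counts of true positives, false positives, and false negatives; how the metric actually labels its cells is not shown here, so the parameter names are hypothetical.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple:
    """Return (precision, recall) computed from confusion matrix counts."""
    suggested = true_pos + false_pos  # all suggestions that were made
    assigned = true_pos + false_neg   # all terms that were actually assigned
    precision = true_pos / suggested if suggested else 0.0
    recall = true_pos / assigned if assigned else 0.0
    return precision, recall
```

Low precision means many suggestions were not assigned; low recall means many assigned terms were never suggested.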
AI Matching
Matching Manager
Component | Metric | Type | Description | Labels |
---|---|---|---|---|
Background thread | | Counter | The total number of failures during the infinite processing loop. | |
Background thread | | Summary | The processing time of a single work unit computed in the infinite processing loop, expressed in seconds. | |
Background thread | | Counter | The total number of failures raised during the whole background thread run method. | |
Background thread | | Summary | The processing time of a whole background thread run method, expressed in seconds. | |
Database | | Summary | The number of seconds a database query takes to complete. | |
gRPC Client | | Summary | The number of seconds a gRPC query takes to complete. | |
gRPC Server | | Counter | The total number of gRPC commands received. | |
gRPC Server | | Counter | The total number of gRPC requests with failures. | |
gRPC Server | | Summary | The processing time of a gRPC request, expressed in seconds. | |
gRPC Server | | Gauge | The number of active RPCs, either queued or currently processed. | |
gRPC Server | | Counter | Deprecated, use | |
HTTP Server | | Counter | The total number of HTTP requests with failures. | |
HTTP Server | | Counter | The total number of HTTP request status codes. | |
HTTP Server | | Summary | The processing time of an HTTP request, expressed in seconds. | |
Job | | Histogram | The total execution time of a job, expressed in seconds. | |
Job | | Counter | The number of executed jobs. | |
Microservice | | Info | The microservice details. | |
Model | | Gauge | The total number of labeled pairs. | |
Model | | Gauge | The model quality represented as a floating point value between 0 and 1. | |

Matching Worker
Component | Metric | Type | Description | Labels |
---|---|---|---|---|
Background thread | | Counter | The total number of failures during the infinite processing loop. | |
Background thread | | Summary | The processing time of a single work unit computed in the infinite processing loop, expressed in seconds. | |
Background thread | | Counter | The total number of failures raised during the whole background thread run method. | |
Background thread | | Summary | The processing time of a whole background thread run method, expressed in seconds. | |
Database | | Summary | The number of seconds a database query takes to complete. | |
gRPC Client | | Summary | The number of seconds a gRPC query takes to complete. | |
gRPC Server | | Counter | The total number of gRPC commands received. | |
gRPC Server | | Counter | The total number of gRPC requests with failures. | |
gRPC Server | | Summary | The processing time of a gRPC request, expressed in seconds. | |
gRPC Server | | Gauge | The number of active RPCs, either queued or currently processed. | |
gRPC Server | | Counter | Deprecated, use | |
HTTP Server | | Counter | The total number of HTTP requests with failures. | |
HTTP Server | | Counter | The total number of HTTP request status codes. | |
HTTP Server | | Summary | The processing time of an HTTP request, expressed in seconds. | |
Job | | Histogram | The total execution time of a job, expressed in seconds. | |
Job | | Counter | The number of executed jobs. | |
Microservice | | Info | The microservice details. | |

JVM
| Metric | Metric type | Description |
|---|---|---|
| | Summary | The duration of garbage collector pauses. |
| | Gauge | The maximum amount of memory that is available for memory management, expressed in bytes. |
| | Gauge | The number of idle connections in the thread pool. |

PostgreSQL
| Metric | Metric type | Description |
|---|---|---|
| | Counter | The number of deadlocks detected in the database. |
| | Gauge | The maximum duration of an active transaction. |
| | Gauge | The configured maximum number of concurrent connections. |
| | Gauge | The number of active connections. |

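
The last two PostgreSQL gauges (maximum number of concurrent connections and number of active connections) are most useful when read together as a saturation ratio. The following Python sketch uses a hypothetical `connection_saturation` helper; the values would come from your own scrape of the two gauges.

```python
def connection_saturation(active: int, max_connections: int) -> float:
    """Return the fraction of the connection limit that is currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections
```

A ratio that stays close to 1.0 suggests that clients may soon be refused new connections.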
MDM
| Metric | Metric type | Description |
|---|---|---|
| | Gauge | The duration of incoming gRPC calls, grouped by |
| | Summary | The duration of incoming gRPC calls, grouped by |
| | Gauge | The number of active clients in the gRPC pool. |
| | Gauge | The duration of outgoing gRPC calls, grouped by |
| | Counter | The number of outgoing gRPC calls, grouped by |
| | Gauge | The duration of outgoing gRPC calls, grouped by |
| | Counter | The number of received gRPC streaming messages on the server side, grouped by |
| | Counter | The total number of exceptions on gRPC request processing. |
| | Counter | The number of sent gRPC streaming messages on the client side, grouped by |
| | Counter | The total number of exceptions on HTTP request processing. |
| | Summary | The duration of incoming HTTP calls, grouped by |
| | Counter | The total number of incoming HTTP calls, grouped by |
| | Gauge | The total duration of incoming HTTP calls, grouped by |
| | Gauge | The maximum time spent in incoming HTTP call processing, grouped by |
| | Counter | The total number of calls to the MDM webapp layer, grouped by |
| | Gauge | The total duration of calls to the MDM webapp layer, grouped by |
| | Gauge | The maximum time spent in calls to the MDM webapp layer, grouped by |
| | Gauge | The maximum duration of an alive Tomcat session. |
| | Gauge | The current number of active Tomcat sessions. |
| | Gauge | An estimate of the total capacity of the buffers in this pool. |
| | Gauge | The size of the long-lived heap memory pool after reclamation. |
| | Summary | The time spent in GC pause. |
| | Gauge | The maximum time spent in GC pause. |
| | Gauge | The amount of memory used. |
| | Counter | Incremented for an increase in the size of the (young) heap memory pool after one GC until the next. |
| | Gauge | The uptime of the Java Virtual Machine (JVM). |
| | Gauge | The recent CPU usage for the JVM process. |
| | Gauge | The number of processors available to the JVM. |

RDM
For ONE RDM, the endpoint publishes the metrics as follows:
Metric | Metric type | Description | Labels | ||
---|---|---|---|---|---|
|
Gauge |
The number of SQL statements in progress. |
|
||
|
Gauge |
The number of methods in progress. |
|
||
|
Gauge |
The number of required locks. |
|
||
|
Gauge |
The number of requested but not acquired locks. |
|
||
|
Gauge |
The indicator of whether RDM Web App is business ready, where |
|
||
|
Timer |
The time taken to load all RDM users and their roles from Keycloak. |
|
||
|
Timer with buckets |
The query execution duration. Bucket |
|
||
|
Counter |
The number of failed query executions.
Complementary to |
|
||
|
Timer |
The duration of SOAP calls to the RDM server for different actions defined in RDM configuration.
The |
|
||
|
Counter |
The number of failed SOAP calls to the RDM server.
Complementary to |
|
||
|
Gauge |
The state of internal RDM locks for entities and system objects.
Possible state values are The
|
|
||
|
Counter |
The number of times acquiring a lock failed because of a timeout.
Complementary to |
|
||
|
Timer |
The duration of administration, security, and data service method calls.
The |
|
||
|
Counter |
The number of failed method calls.
Complementary to |
|
||
|
Timer with buckets |
The duration of finished RDM background operations.
Bucket |
|
||
|
Counter |
The number of failed RDM background operations. |
|
||
|
Gauge |
The number of entity records. Collected once per day (configurable). The
|
|
||
|
Gauge |
The number of RDM background operations in progress. |
|