System Users Service Metrics
This document details the telemetry metrics exposed by the UpdateSystemUsersInMemoryCache hosted service. These metrics provide insights into the performance and reliability of the system user data synchronization into the local in-memory cache.
The primary meter for these metrics is CRMFacade.SystemUsers.
Service Goal
This service is responsible for periodically fetching system user data from DataVerse and caching it in memory. This reduces latency for operations that need to look up user information.
Metrics
| Metric Name | Type | Description |
|---|---|---|
| systemusers_sync_total | Counter | Total number of system user sync operations that succeeded or were skipped. Failures are counted separately in systemusers_sync_errors_total. |
| systemusers_sync_errors_total | Counter | Total number of failed system user sync operations. |
| systemusers_sync_duration_seconds | Histogram | The duration, in seconds, of each sync operation. |
| systemusers_last_sync_timestamp | Observable Gauge | The Unix epoch timestamp, in seconds, of the last successful sync. |
| systemusers_total_count | Observable Gauge | The total number of system users currently held in the cache. |
| systemusers_cache_hits_total | Observable Gauge | The total number of times the system user cache was successfully accessed. |
Dimensions
These dimensions (tags) can be used to filter and group metric data.
| Metric Name | Dimension Name | Possible Values | Description |
|---|---|---|---|
| systemusers_sync_total | operation | success, skipped | The outcome of the sync operation. |
| systemusers_sync_total | reason | signal_present, lock_not_acquired | The reason an operation was skipped (only present if operation=skipped). |
| systemusers_sync_errors_total | operation | error | Indicates a failed operation. |
| systemusers_sync_errors_total | error_type | Exception name (e.g., TimeoutException) | The type of exception that caused the failure. |
| systemusers_sync_duration_seconds | operation | success, error | The outcome of the operation whose duration was measured. |
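
The constraints in the table can be made explicit in code. The following is a minimal Python sketch, not the service's actual implementation: the helper names `sync_tags` and `error_tags` are hypothetical and exist only for illustration. It builds the tag set for each metric and rejects combinations the table does not allow, such as a `reason` on a successful sync.

```python
from typing import Dict, Optional

# Allowed skip reasons, copied from the dimensions table.
SKIP_REASONS = {"signal_present", "lock_not_acquired"}


def sync_tags(operation: str, reason: Optional[str] = None) -> Dict[str, str]:
    """Build tags for systemusers_sync_total (hypothetical helper)."""
    if operation == "success":
        if reason is not None:
            raise ValueError("reason is only present when operation=skipped")
        return {"operation": "success"}
    if operation == "skipped":
        if reason not in SKIP_REASONS:
            raise ValueError(f"unknown skip reason: {reason!r}")
        return {"operation": "skipped", "reason": reason}
    raise ValueError(f"unknown operation: {operation!r}")


def error_tags(exc: BaseException) -> Dict[str, str]:
    """Build tags for systemusers_sync_errors_total (hypothetical helper)."""
    # error_type carries the exception's type name, as in the table.
    return {"operation": "error", "error_type": type(exc).__name__}
```

For example, `sync_tags("skipped", "lock_not_acquired")` yields both tags, while `sync_tags("success", "signal_present")` raises. In this Python sketch the `error_type` value is a Python exception name such as `TimeoutError`; the real service would report its own exception names, such as `TimeoutException`.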
Example KQL Queries
Here are some example queries you can use in Azure Application Insights to monitor the service.
Sync operations by outcome and skip reason over the last 24 hours:

```kusto
customMetrics
| where name == "systemusers_sync_total"
| where timestamp > ago(1d)
| extend operation = tostring(customDimensions.operation)
| extend reason = tostring(customDimensions.reason)
| summarize Count = sum(value) by operation, reason
| order by Count desc
```
Average sync duration over the last 24 hours:

```kusto
customMetrics
| where name == "systemusers_sync_duration_seconds"
| where timestamp > ago(1d)
| summarize AvgDuration_sec = avg(value)
```
This query can be used to create an alert if the last successful sync is older than a specified threshold (e.g., 2 hours).
```kusto
customMetrics
| where name == "systemusers_last_sync_timestamp"
| summarize LastSync = max(value)
| extend AgeInSeconds = datetime_diff('second', now(), unixtime_seconds_todatetime(LastSync))
| where AgeInSeconds > 7200 // 2 hours
| project readable_time_utc = unixtime_seconds_todatetime(LastSync), AgeInSeconds
```
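
The staleness arithmetic in this query can also be checked outside Application Insights. The sketch below is an illustration in Python under stated assumptions: the function names and the `STALE_AFTER_SECONDS` constant are hypothetical, and the gauge value is assumed to be a Unix epoch timestamp in seconds, as described in the metrics table. It approximates the query's age calculation rather than reproducing KQL's `datetime_diff` exactly.

```python
from datetime import datetime, timezone

STALE_AFTER_SECONDS = 7200  # 2 hours, matching the threshold in the query


def sync_age_seconds(last_sync_epoch: float, now: datetime) -> int:
    """Whole seconds elapsed between the last successful sync and `now`."""
    last_sync = datetime.fromtimestamp(last_sync_epoch, tz=timezone.utc)
    return int((now - last_sync).total_seconds())


def is_stale(last_sync_epoch: float, now: datetime) -> bool:
    """True when the last sync is older than the alert threshold."""
    return sync_age_seconds(last_sync_epoch, now) > STALE_AFTER_SECONDS
```

Passing `now` explicitly (as a timezone-aware UTC `datetime`) keeps the check deterministic, which makes the threshold boundary easy to test.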
Metric Flow Diagram
The following diagram illustrates the flow of the UpdateSystemUsersInMemoryCache job and when each metric is recorded.
```mermaid
graph TD
    A[Start Job] --> B{Try Acquire Lock};
    B -- Lock Acquired --> C{Signal Present?};
    B -- Lock Not Acquired --> D["Record: sync_total<br>(skipped, lock_not_acquired)"];
    C -- Yes --> E["Record: sync_total<br>(skipped, signal_present)"];
    C -- No --> F[Fetch from DataVerse];
    F -- Success --> G[Update Cache];
    F -- Failure --> H["Record: sync_errors_total<br>Record: sync_duration_seconds(error)"];
    G --> I["Record: sync_total(success)<br>Record: sync_duration_seconds(success)<br>Update: last_sync_timestamp<br>Update: total_count"];
    J[API Request] --> K{Get Users from Cache};
    K -- Cache Hit --> L["Update: cache_hits_total"];
```
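
The sync branch of the diagram can be sketched in code to show which metrics each path records. This is a simplified Python illustration, not the hosted service's implementation: `run_sync_job`, its parameters, and the `recorded` list are hypothetical stand-ins for the real job, lock, signal, and meter. Observable gauges are normally read through callbacks; here they are appended directly to keep the sketch short. The cache-hit path (API Request) is not covered.

```python
import time
from typing import Callable, Dict, List, Tuple

# One recorded measurement: (metric name, value, tags).
Measurement = Tuple[str, float, Dict[str, str]]


def run_sync_job(
    lock_acquired: bool,
    signal_present: bool,
    fetch_users: Callable[[], List[str]],
    cache: List[str],
    recorded: List[Measurement],
) -> None:
    """Follow the diagram's branches, appending each metric as it is recorded."""
    if not lock_acquired:
        recorded.append(("systemusers_sync_total", 1,
                         {"operation": "skipped", "reason": "lock_not_acquired"}))
        return
    if signal_present:
        recorded.append(("systemusers_sync_total", 1,
                         {"operation": "skipped", "reason": "signal_present"}))
        return

    started = time.monotonic()
    try:
        users = fetch_users()
    except Exception as exc:
        elapsed = time.monotonic() - started
        recorded.append(("systemusers_sync_errors_total", 1,
                         {"operation": "error", "error_type": type(exc).__name__}))
        recorded.append(("systemusers_sync_duration_seconds", elapsed,
                         {"operation": "error"}))
        return

    cache[:] = users  # replace the cached users in place
    elapsed = time.monotonic() - started
    recorded.append(("systemusers_sync_total", 1, {"operation": "success"}))
    recorded.append(("systemusers_sync_duration_seconds", elapsed,
                     {"operation": "success"}))
    recorded.append(("systemusers_last_sync_timestamp", time.time(), {}))
    recorded.append(("systemusers_total_count", len(cache), {}))
```

Note that only the success and failure paths record a duration, which matches the `operation` values listed for `systemusers_sync_duration_seconds` in the dimensions table.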