# HashiCorp Nomad by HTTP

## Overview

This template is designed to monitor HashiCorp Nomad by Zabbix.
It works without any external scripts.
Currently the template supports Nomad servers and clients discovery.

## Requirements

Zabbix version: 7.0 and higher.

## Tested versions

This template has been tested on:
- HashiCorp Nomad version 1.5.6/1.6.0

## Configuration

> Zabbix should be configured according to the instructions in the [Templates out of the box](https://www.zabbix.com/documentation/7.0/manual/config/templates_out_of_the_box) section.

## Setup

1. Create a synthetic Nomad host. It should be one of the Nomad cluster members, a load-balancing service (if a cluster is used) or a single node in a selected Nomad region.
2. Define the `{$NOMAD.ENDPOINT.API.URL}` macro value with the correct web protocol, host and port.
3. Prepare an ACL token with `node:read`, `namespace:read-job`, `agent:read` and `management` permissions applied. Define the `{$NOMAD.TOKEN}` macro value.

> Refer to the vendor documentation about [`Nomad native ACL`](https://developer.hashicorp.com/nomad/tutorials/access-control/access-control-policies) or [`Nomad Vault-generated tokens`](https://developer.hashicorp.com/nomad/tutorials/access-control/vault-nomad-secrets) if you have the HashiCorp Vault integration configured.

**Additional information**:

* The synthetic Nomad host is used only as an endpoint for servers and clients discovery (general cluster information). It is not monitored as a Nomad server or client, to prevent duplicate entities.
* If you are not using ACLs, skip setup step 3.
* The Nomad servers/clients discovery is limited by region. If you're using a multi-region cluster, create one synthetic host per region.
* The Nomad server/client templates are ready for separate usage. Feel free to use them if you prefer manual host creation.

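
Before assigning the macros, it can help to confirm that the endpoint and token from the setup steps are accepted by Nomad. The following is a minimal sketch, not part of the template: it assumes the standard Nomad `/v1/nodes` endpoint and the `X-Nomad-Token` header, and the function names `build_nodes_request` and `list_node_names` are hypothetical helpers introduced only for illustration.

```python
import json
import urllib.request


def build_nodes_request(api_url: str, token: str = "") -> urllib.request.Request:
    """Build a GET request for the Nomad nodes list (hypothetical helper)."""
    # api_url plays the role of {$NOMAD.ENDPOINT.API.URL}, e.g. http://localhost:4646
    url = api_url.rstrip("/") + "/v1/nodes"
    request = urllib.request.Request(url, method="GET")
    if token:
        # token plays the role of {$NOMAD.TOKEN}; leave it empty when ACLs are disabled.
        request.add_header("X-Nomad-Token", token)
    return request


def list_node_names(api_url: str, token: str = "", timeout: float = 15.0) -> list:
    """Return the names of the client nodes visible with the given token."""
    request = build_nodes_request(api_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # The nodes endpoint returns a JSON array of node stubs with a "Name" field.
        return [node["Name"] for node in json.load(response)]
```

Calling `list_node_names("http://localhost:4646", token="...")` with your own values should return the client node names; an HTTP 403 error usually indicates that the token lacks the required permissions.
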
**Useful links**

* [HashiCorp Nomad multi-region federation](https://developer.hashicorp.com/nomad/tutorials/manage-clusters/federation)
* [HashiCorp Nomad agent API reference](https://developer.hashicorp.com/nomad/api-docs/agent)
* [HashiCorp Nomad raft operator API reference](https://developer.hashicorp.com/nomad/api-docs/operator/raft)
* [HashiCorp Nomad nodes API reference](https://developer.hashicorp.com/nomad/api-docs/nodes)

### Macros used

|Name|Description|Default|
|----|-----------|-------|
|{$NOMAD.ENDPOINT.API.URL}|API endpoint URL for one of the Nomad cluster members.|`http://localhost:4646`|
|{$NOMAD.TOKEN}|Nomad authentication token.
|`Response timeout for an API.
|`15s`|
|{$NOMAD.HTTP.PROXY}|Sets the HTTP proxy for script and HTTP agent items. If this parameter is empty, then no proxy is used.||
|{$NOMAD.API.RESPONSE.SUCCESS}|HTTP response code expected from a successful API call; used as the threshold for availability triggers. Change if needed.|`200`|
|{$NOMAD.SERVER.NAME.MATCHES}|The filter to include HashiCorp Nomad servers by name.|`.*`|
|{$NOMAD.SERVER.NAME.NOT_MATCHES}|The filter to exclude HashiCorp Nomad servers by name.|`CHANGE_IF_NEEDED`|
|{$NOMAD.SERVER.DC.MATCHES}|The filter to include HashiCorp Nomad servers by datacenter.|`.*`|
|{$NOMAD.SERVER.DC.NOT_MATCHES}|The filter to exclude HashiCorp Nomad servers by datacenter.|`CHANGE_IF_NEEDED`|
|{$NOMAD.CLIENT.NAME.MATCHES}|The filter to include HashiCorp Nomad clients by name.|`.*`|
|{$NOMAD.CLIENT.NAME.NOT_MATCHES}|The filter to exclude HashiCorp Nomad clients by name.|`CHANGE_IF_NEEDED`|
|{$NOMAD.CLIENT.DC.MATCHES}|The filter to include HashiCorp Nomad clients by datacenter.|`.*`|
|{$NOMAD.CLIENT.DC.NOT_MATCHES}|The filter to exclude HashiCorp Nomad clients by datacenter.|`CHANGE_IF_NEEDED`|
|{$NOMAD.CLIENT.SCHEDULE.ELIGIBILITY.MATCHES}|The filter to include HashiCorp Nomad clients by scheduling eligibility.|`.*`|
|{$NOMAD.CLIENT.SCHEDULE.ELIGIBILITY.NOT_MATCHES}|The filter to exclude HashiCorp Nomad clients by scheduling eligibility.|`CHANGE_IF_NEEDED`|

### Items

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|HashiCorp Nomad: Nomad clients get|Nomad clients data in raw format.
|HTTP agent|nomad.client.nodes.get**Preprocessing**
Check for not supported value
⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
Client nodes API response message.
|Dependent item|nomad.client.nodes.api.response**Preprocessing**
JavaScript: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `1h`
Nomad servers data in raw format.
|Script|nomad.server.nodes.get| |HashiCorp Nomad: Server-related APIs response|Server-related (`operator/raft/configuration`, `agent/members`) APIs error response message.
|Dependent item|nomad.server.api.response**Preprocessing**
JSON Path: `$.error`
⛔️Custom on fail: Set value to: `HTTP/1.1 200 OK`
Discard unchanged with heartbeat: `1h`
Current cluster region.
|Dependent item|nomad.region**Preprocessing**
JSON Path: `$..region.first()`
Nomad servers count.
|Dependent item|nomad.servers.count**Preprocessing**
JSON Path: `$[?(@.Name)].length()`
Nomad clients count.
|Dependent item|nomad.clients.count**Preprocessing**
JSON Path: `$.body[?(@.Name)].length()`
Client nodes API connection has failed.
Ensure that the Nomad API URL and the necessary permissions have been defined correctly, and check the service state and network connectivity between Nomad and Zabbix.
Server-related API connection has failed.
Ensure that the Nomad API URL and the necessary permissions have been defined correctly, and check the service state and network connectivity between Nomad and Zabbix.
Client nodes discovery.
|Dependent item|nomad.clients.discovery**Preprocessing**
JSON Path: `$.body`
⛔️Custom on fail: Discard value
Discard unchanged with heartbeat: `1h`
Server nodes discovery.
|Dependent item|nomad.servers.discovery**Preprocessing**
Check for error in JSON: `$.error`
⛔️Custom on fail: Discard value
Discard unchanged with heartbeat: `1h`
Nomad client API scheme.
|`http`|
|{$NOMAD.CLIENT.API.PORT}|Nomad client API port.|`4646`|
|{$NOMAD.TOKEN}|Nomad authentication token.
|`Response timeout for an API.
|`15s`|
|{$NOMAD.HTTP.PROXY}|Sets the HTTP proxy for HTTP agent items. If this parameter is empty, then no proxy is used.||
|{$NOMAD.API.RESPONSE.SUCCESS}|HTTP response code expected from a successful API call; used as the threshold for availability triggers. Change if needed.|`200`|
|{$NOMAD.CLIENT.RPC.PORT}|Nomad RPC service port.|`4647`|
|{$NOMAD.CLIENT.SERF.PORT}|Nomad serf service port.|`4648`|
|{$NOMAD.CLIENT.OPEN.FDS.MAX.WARN}|Maximum percentage of used file descriptors.|`90`|
|{$NOMAD.DISK.NAME.MATCHES}|The filter to include HashiCorp Nomad client disks by name.|`.*`|
|{$NOMAD.DISK.NAME.NOT_MATCHES}|The filter to exclude HashiCorp Nomad client disks by name.|`CHANGE_IF_NEEDED`|
|{$NOMAD.JOB.NAME.MATCHES}|The filter to include HashiCorp Nomad client jobs by name.|`.*`|
|{$NOMAD.JOB.NAME.NOT_MATCHES}|The filter to exclude HashiCorp Nomad client jobs by name.|`CHANGE_IF_NEEDED`|
|{$NOMAD.JOB.NAMESPACE.MATCHES}|The filter to include HashiCorp Nomad client jobs by namespace.|`.*`|
|{$NOMAD.JOB.NAMESPACE.NOT_MATCHES}|The filter to exclude HashiCorp Nomad client jobs by namespace.|`CHANGE_IF_NEEDED`|
|{$NOMAD.JOB.TYPE.MATCHES}|The filter to include HashiCorp Nomad client jobs by type.|`.*`|
|{$NOMAD.JOB.TYPE.NOT_MATCHES}|The filter to exclude HashiCorp Nomad client jobs by type.|`CHANGE_IF_NEEDED`|
|{$NOMAD.JOB.TASK.GROUP.MATCHES}|The filter to include HashiCorp Nomad client jobs by task group.|`.*`|
|{$NOMAD.JOB.TASK.GROUP.NOT_MATCHES}|The filter to exclude HashiCorp Nomad client jobs by task group.|`CHANGE_IF_NEEDED`|
|{$NOMAD.DRIVER.NAME.MATCHES}|The filter to include HashiCorp Nomad client drivers by name.|`.*`|
|{$NOMAD.DRIVER.NAME.NOT_MATCHES}|The filter to exclude HashiCorp Nomad client drivers by name.|`CHANGE_IF_NEEDED`|
|{$NOMAD.DRIVER.DETECT.MATCHES}|The filter to include HashiCorp Nomad client drivers by detection state. Possible filtering values: `true`, `false`.|`.*`|
|{$NOMAD.DRIVER.DETECT.NOT_MATCHES}|The filter to exclude HashiCorp Nomad client drivers by detection state. Possible filtering values: `true`, `false`.|`CHANGE_IF_NEEDED`|
|{$NOMAD.CPU.UTIL.MIN}|CPU utilization threshold. Measured as a percentage.|`90`|
|{$NOMAD.RAM.AVAIL.MIN}|Available RAM threshold. Measured as a percentage of total memory.|`5`|
|{$NOMAD.INODES.FREE.MIN.WARN}|Warning threshold of the filesystem metadata utilization. Measured as a percentage.|`20`|
|{$NOMAD.INODES.FREE.MIN.CRIT}|Critical threshold of the filesystem metadata utilization. Measured as a percentage.|`10`|

### Items

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|HashiCorp Nomad Client: Telemetry get|Telemetry data in raw format.
|HTTP agent|nomad.client.data.get**Preprocessing**
Check for not supported value
⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
Nomad client metrics in raw format.
|Dependent item|nomad.client.metrics.get**Preprocessing**
JSON Path: `$.body`
⛔️Custom on fail: Discard value
Monitoring API response message.
|Dependent item|nomad.client.data.api.response**Preprocessing**
JavaScript: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `1h`
Current [rpc] service state.
|Simple check|net.tcp.service[tcp,,{$NOMAD.CLIENT.RPC.PORT}]**Preprocessing**
Discard unchanged with heartbeat: `1h`
Current [serf] service state.
|Simple check|net.tcp.service[tcp,,{$NOMAD.CLIENT.SERF.PORT}]**Preprocessing**
Discard unchanged with heartbeat: `1h`
Total amount of CPU shares the scheduler has allocated to tasks.
|Dependent item|nomad.client.allocated.cpu**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_allocated_cpu)`
⛔️Custom on fail: Discard value
Total amount of CPU shares free for the scheduler to allocate to tasks.
|Dependent item|nomad.client.unallocated.cpu**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_unallocated_cpu)`
⛔️Custom on fail: Discard value
Total amount of memory the scheduler has allocated to tasks.
|Dependent item|nomad.client.allocated.memory**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_allocated_memory)`
⛔️Custom on fail: Discard value
Custom multiplier: `1.0E+6`
Total amount of memory free for the scheduler to allocate to tasks.
|Dependent item|nomad.client.unallocated.memory**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_unallocated_memory)`
⛔️Custom on fail: Discard value
Custom multiplier: `1.0E+6`
Total amount of disk space the scheduler has allocated to tasks.
|Dependent item|nomad.client.allocated.disk**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_allocated_disk)`
⛔️Custom on fail: Discard value
Custom multiplier: `1.0E+6`
Total amount of disk space free for the scheduler to allocate to tasks.
|Dependent item|nomad.client.unallocated.disk**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_unallocated_disk)`
⛔️Custom on fail: Discard value
Custom multiplier: `1.0E+6`
Number of allocations waiting for previous versions.
|Dependent item|nomad.client.allocations.blocked**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_allocations_blocked)`
⛔️Custom on fail: Set value to: `0`
Number of allocations migrating data from previous versions.
|Dependent item|nomad.client.allocations.migrating**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_allocations_migrating)`
⛔️Custom on fail: Set value to: `0`
Number of allocations pending (received by the client but not yet running).
|Dependent item|nomad.client.allocations.pending**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_allocations_pending)`
⛔️Custom on fail: Set value to: `0`
Number of allocations starting.
|Dependent item|nomad.client.allocations.start**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_allocations_start)`
⛔️Custom on fail: Set value to: `0`
Number of allocations running.
|Dependent item|nomad.client.allocations.running**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_allocations_running)`
⛔️Custom on fail: Set value to: `0`
Number of allocations terminal.
|Dependent item|nomad.client.allocations.terminal**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_allocations_terminal)`
⛔️Custom on fail: Set value to: `0`
Number of allocations failed.
|Dependent item|nomad.client.allocations.failed**Preprocessing**
Prometheus pattern: `SUM(nomad_client_allocs_failed)`
⛔️Custom on fail: Set value to: `0`
Discard unchanged with heartbeat: `1h`
Number of allocations completed.
|Dependent item|nomad.client.allocations.complete**Preprocessing**
Prometheus pattern: `SUM(nomad_client_allocs_complete)`
⛔️Custom on fail: Set value to: `0`
Discard unchanged with heartbeat: `1h`
Number of allocations restarted.
|Dependent item|nomad.client.allocations.restart**Preprocessing**
Prometheus pattern: `SUM(nomad_client_allocs_restart)`
⛔️Custom on fail: Set value to: `0`
Discard unchanged with heartbeat: `1h`
Number of allocations OOM killed.
|Dependent item|nomad.client.allocations.oom_killed**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_allocs_oom_killed)`
⛔️Custom on fail: Set value to: `0`
Discard unchanged with heartbeat: `1h`
CPU utilization in idle state.
|Dependent item|nomad.client.cpu.idle**Preprocessing**
Prometheus pattern: `AVG(nomad_client_host_cpu_idle)`
⛔️Custom on fail: Discard value
CPU utilization in system space.
|Dependent item|nomad.client.cpu.system**Preprocessing**
Prometheus pattern: `AVG(nomad_client_host_cpu_system)`
⛔️Custom on fail: Discard value
Total CPU utilization.
|Dependent item|nomad.client.cpu.total**Preprocessing**
Prometheus pattern: `AVG(nomad_client_host_cpu_total)`
⛔️Custom on fail: Discard value
CPU utilization in user space.
|Dependent item|nomad.client.cpu.user**Preprocessing**
Prometheus pattern: `AVG(nomad_client_host_cpu_user)`
⛔️Custom on fail: Discard value
Total amount of memory available to processes which includes free and cached memory.
|Dependent item|nomad.client.memory.available**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_host_memory_available)`
⛔️Custom on fail: Discard value
Amount of memory which is free.
|Dependent item|nomad.client.memory.free**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_host_memory_free)`
Total amount of physical memory on the node.
|Dependent item|nomad.client.memory.total**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_host_memory_total)`
Amount of memory used by processes.
|Dependent item|nomad.client.memory.used**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_host_memory_used)`
Uptime of the host running the Nomad client.
|Dependent item|nomad.client.uptime**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_uptime)`
Node info data in raw format.
|HTTP agent|nomad.client.node.info.get**Preprocessing**
Check for not supported value
⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
Nomad client version.
|Dependent item|nomad.client.version**Preprocessing**
JSON Path: `$.body..Version.first()`
Nodes API response message.
|Dependent item|nomad.client.node.info.api.response**Preprocessing**
JavaScript: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `1h`
Allocated jobs data in raw format.
|HTTP agent|nomad.client.job.allocs.get**Preprocessing**
Check for not supported value
⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
Allocations API response message.
|Dependent item|nomad.client.job.allocs.api.response**Preprocessing**
JavaScript: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `1h`
Monitoring API connection has failed.
Ensure that the Nomad API URL and the necessary permissions have been defined correctly, and check the service state and network connectivity between Nomad and Zabbix.
Cannot establish the connection to [rpc] service port {$NOMAD.CLIENT.RPC.PORT}.
Check the Nomad state and network connectivity between Nomad and Zabbix.
Cannot establish the connection to [serf] service port {$NOMAD.CLIENT.SERF.PORT}.
Check the Nomad state and network connectivity between Nomad and Zabbix.
OOM killed allocations found.
|`last(/HashiCorp Nomad Client by HTTP/nomad.client.allocations.oom_killed) > 0`|Warning|**Manual close**: Yes|
|HashiCorp Nomad Client: High CPU utilization|CPU utilization is too high. The system might be slow to respond.|`min(/HashiCorp Nomad Client by HTTP/nomad.client.cpu.total, 10m) >= {$NOMAD.CPU.UTIL.MIN}`|Average||
|HashiCorp Nomad Client: High memory utilization|RAM utilization is too high. The system might be slow to respond.|`(min(/HashiCorp Nomad Client by HTTP/nomad.client.memory.available, 10m) / last(/HashiCorp Nomad Client by HTTP/nomad.client.memory.total))*100 <= {$NOMAD.RAM.AVAIL.MIN}`|Average||
|HashiCorp Nomad Client: The host has been restarted|The host uptime is less than 10 minutes.|`last(/HashiCorp Nomad Client by HTTP/nomad.client.uptime) < 10m`|Warning|**Manual close**: Yes|
|HashiCorp Nomad Client: Nomad client version has changed|Nomad client version has changed.|`change(/HashiCorp Nomad Client by HTTP/nomad.client.version)<>0`|Info|**Manual close**: Yes|
|HashiCorp Nomad Client: Nodes API connection has failed|Nodes API connection has failed.
Ensure that the Nomad API URL and the necessary permissions have been defined correctly, and check the service state and network connectivity between Nomad and Zabbix.
Allocations API connection has failed.
Ensure that the Nomad API URL and the necessary permissions have been defined correctly, and check the service state and network connectivity between Nomad and Zabbix.
Client drivers discovery.
|Dependent item|nomad.client.drivers.discovery**Preprocessing**
JavaScript: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `1h`
Driver [{#DRIVER.NAME}] state.
|Dependent item|nomad.client.driver.state["{#DRIVER.NAME}"]**Preprocessing**
JSON Path: `$.body..Drivers.{#DRIVER.NAME}.Healthy.first()`
Discard unchanged with heartbeat: `1h`
Driver [{#DRIVER.NAME}] detection state.
|Dependent item|nomad.client.driver.detected["{#DRIVER.NAME}"]**Preprocessing**
JSON Path: `$.body..Drivers.{#DRIVER.NAME}.Detected.first()`
The [{#DRIVER.NAME}] driver is detected, but its state is unhealthy.
|`last(/HashiCorp Nomad Client by HTTP/nomad.client.driver.state["{#DRIVER.NAME}"]) = 0 and last(/HashiCorp Nomad Client by HTTP/nomad.client.driver.detected["{#DRIVER.NAME}"]) = 1`|Warning|**Manual close**: Yes|
|HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] detection state has changed|The [{#DRIVER.NAME}] driver detection state has changed.|`change(/HashiCorp Nomad Client by HTTP/nomad.client.driver.detected["{#DRIVER.NAME}"]) <> 0`|Info|**Manual close**: Yes|

### LLD rule Physical disks discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Physical disks discovery|Physical disks discovery.
|Dependent item|nomad.client.disk.discovery**Preprocessing**
Prometheus to JSON: `nomad_client_host_disk_available{disk=~".*"}`
Amount of space which is available on ["{#DEV.NAME}"] disk.
|Dependent item|nomad.client.disk.available["{#DEV.NAME}"]**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_host_disk_available{disk="{#DEV.NAME}"})`
Disk space consumed by the inodes on ["{#DEV.NAME}"] disk.
|Dependent item|nomad.client.disk.inodes_percent["{#DEV.NAME}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Total size of the ["{#DEV.NAME}"] device.
|Dependent item|nomad.client.disk.size["{#DEV.NAME}"]**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_host_disk_size{disk="{#DEV.NAME}"})`
Percentage of disk ["{#DEV.NAME}"] space used.
|Dependent item|nomad.client.disk.used_percent["{#DEV.NAME}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Amount of disk ["{#DEV.NAME}"] space which has been used.
|Dependent item|nomad.client.disk.used["{#DEV.NAME}"]**Preprocessing**
Prometheus pattern: `VALUE(nomad_client_host_disk_used{disk="{#DEV.NAME}"})`
It may become impossible to write to a disk if there are no index nodes left.
The following error messages may be returned as symptoms, even though free space is available:
- No space left on device;
- Disk is full.
It may become impossible to write to a disk if there are no index nodes left.
The following error messages may be returned as symptoms, even though free space is available:
- No space left on device;
- Disk is full.
High disk [{#DEV.NAME}] utilization.
|`min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.used_percent["{#DEV.NAME}"],5m) >= {$NOMAD.DISK.UTIL.MIN.WARN:"{#DEV.NAME}"}`|Warning|**Manual close**: Yes
High disk [{#DEV.NAME}] utilization.
|`min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.used_percent["{#DEV.NAME}"],5m) >= {$NOMAD.DISK.UTIL.MIN.CRIT:"{#DEV.NAME}"}`|Average|**Manual close**: Yes|

### LLD rule Allocated jobs discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Allocated jobs discovery|Allocated jobs discovery.
|Dependent item|nomad.client.alloc.discovery**Preprocessing**
JavaScript: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `1h`
Total CPU resources allocated by the ["{#JOB.NAME}"] job across all cores.
|Dependent item|nomad.client.allocs.cpu.allocated["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Total CPU resources consumed by the ["{#JOB.NAME}"] job in system space.
|Dependent item|nomad.client.allocs.cpu.system["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Total CPU resources consumed by the ["{#JOB.NAME}"] job in user space.
|Dependent item|nomad.client.allocs.cpu.user["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Total CPU resources consumed by the ["{#JOB.NAME}"] job across all cores.
|Dependent item|nomad.client.allocs.cpu.total_percent["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Total number of CPU periods that the ["{#JOB.NAME}"] job was throttled.
|Dependent item|nomad.client.allocs.cpu.throttled_periods["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Custom multiplier: `1e-09`
Total time that the ["{#JOB.NAME}"] job was throttled.
|Dependent item|nomad.client.allocs.cpu.throttled_time["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
CPU ticks consumed by the process for the ["{#JOB.NAME}"] job in the last collection interval.
|Dependent item|nomad.client.allocs.cpu.total_ticks["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Amount of memory allocated by the ["{#JOB.NAME}"] job.
|Dependent item|nomad.client.allocs.memory.allocated["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Amount of memory cached by the ["{#JOB.NAME}"] job.
|Dependent item|nomad.client.allocs.memory.cache["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Total amount of memory used by the ["{#JOB.NAME}"] job.
|Dependent item|nomad.client.allocs.memory.usage["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Amount of memory swapped by the ["{#JOB.NAME}"] job.
|Dependent item|nomad.client.allocs.memory.swap["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Nomad server API scheme.
|`http`|
|{$NOMAD.SERVER.API.PORT}|Nomad server API port.|`4646`|
|{$NOMAD.TOKEN}|Nomad authentication token.
|`Response timeout for an API.
|`15s`|
|{$NOMAD.HTTP.PROXY}|Sets the HTTP proxy for HTTP agent items. If this parameter is empty, then no proxy is used.||
|{$NOMAD.API.RESPONSE.SUCCESS}|HTTP response code expected from a successful API call; used as the threshold for availability triggers. Change if needed.|`200`|
|{$NOMAD.SERVER.RPC.PORT}|Nomad RPC service port.|`4647`|
|{$NOMAD.SERVER.SERF.PORT}|Nomad serf service port.|`4648`|
|{$NOMAD.REDUNDANCY.MIN}|Number of redundant servers required to keep the cluster safe. The default value of `1` suits a 3-node cluster. Change if needed.|`1`|
|{$NOMAD.OPEN.FDS.MAX}|Maximum percentage of used file descriptors.|`90`|
|{$NOMAD.SERVER.LEADER.LATENCY}|Leader last contact latency threshold.|`0.3s`|

### Items

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|HashiCorp Nomad Server: Telemetry get|Telemetry data in raw format.
|HTTP agent|nomad.server.data.get**Preprocessing**
Check for not supported value
⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
Nomad server metrics in raw format.
|Dependent item|nomad.server.metrics.get**Preprocessing**
JSON Path: `$.body`
⛔️Custom on fail: Discard value
Monitoring API response message.
|Dependent item|nomad.server.data.api.response**Preprocessing**
JavaScript: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `1h`
Internal stats data in raw format.
|HTTP agent|nomad.server.stats.get**Preprocessing**
Check for not supported value
⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
Internal stats API response message.
|Dependent item|nomad.server.stats.api.response**Preprocessing**
JavaScript: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `1h`
Nomad server version.
|Dependent item|nomad.server.version**Preprocessing**
JSON Path: `$.body.config.Version.Version`
Nomad raft version.
|Dependent item|nomad.raft.version**Preprocessing**
JSON Path: `$.body.stats.raft.protocol_version`
⛔️Custom on fail: Discard value
Current cluster raft peers amount.
|Dependent item|nomad.server.raft.peers**Preprocessing**
JSON Path: `$.body.stats.raft.num_peers`
⛔️Custom on fail: Discard value
Current role in the cluster.
|Dependent item|nomad.server.raft.cluster_role**Preprocessing**
JSON Path: `$.body.stats.raft.state`
⛔️Custom on fail: Discard value
JavaScript: `The text is too long. Please see the template.`
Total user and system CPU time spent in seconds.
|Dependent item|nomad.server.cpu.time**Preprocessing**
Prometheus pattern: `VALUE(process_cpu_seconds_total)`
⛔️Custom on fail: Discard value
Memory utilization in bytes.
|Dependent item|nomad.server.runtime.alloc_bytes**Preprocessing**
Prometheus pattern: `VALUE(nomad_runtime_alloc_bytes)`
⛔️Custom on fail: Discard value
Virtual memory size in bytes.
|Dependent item|nomad.server.virtual_memory_bytes**Preprocessing**
Prometheus pattern: `VALUE(process_virtual_memory_bytes)`
⛔️Custom on fail: Discard value
Resident memory size in bytes.
|Dependent item|nomad.server.resident_memory_bytes**Preprocessing**
Prometheus pattern: `VALUE(process_resident_memory_bytes)`
⛔️Custom on fail: Discard value
Number of objects on the heap.
General memory pressure indicator.
|Dependent item|nomad.server.runtime.heap_objects**Preprocessing**
Prometheus pattern: `VALUE(nomad_runtime_heap_objects)`
⛔️Custom on fail: Discard value
Number of open file descriptors.
|Dependent item|nomad.server.process_open_fds**Preprocessing**
Prometheus pattern: `VALUE(process_open_fds)`
⛔️Custom on fail: Discard value
Maximum number of open file descriptors.
|Dependent item|nomad.server.process_max_fds**Preprocessing**
Prometheus pattern: `VALUE(process_max_fds)`
⛔️Custom on fail: Discard value
Number of goroutines and general load pressure indicator.
|Dependent item|nomad.server.runtime.num_goroutines**Preprocessing**
Prometheus pattern: `VALUE(nomad_runtime_num_goroutines)`
⛔️Custom on fail: Discard value
Evaluations that are pending until an existing evaluation for the same job completes.
|Dependent item|nomad.server.broker.total_pending**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_broker_total_pending)`
⛔️Custom on fail: Discard value
Number of evaluations ready to be processed.
|Dependent item|nomad.server.broker.total_ready**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_broker_total_ready)`
⛔️Custom on fail: Discard value
Evaluations dispatched for processing but incomplete.
|Dependent item|nomad.server.broker.total_unacked**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_broker_total_unacked)`
⛔️Custom on fail: Discard value
Amount of CPU shares requested by blocked evals.
|Dependent item|nomad.server.blocked_evals.cpu**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_cpu)`
⛔️Custom on fail: Discard value
Amount of memory requested by blocked evals.
|Dependent item|nomad.server.blocked_evals.memory**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_memory)`
⛔️Custom on fail: Discard value
Amount of CPU shares requested by blocked evals of a job.
|Dependent item|nomad.server.blocked_evals.job.cpu**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_job_cpu)`
⛔️Custom on fail: Discard value
Amount of memory requested by blocked evals of a job.
|Dependent item|nomad.server.blocked_evals.job.memory**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_job_memory)`
⛔️Custom on fail: Discard value
Count of evals in the blocked state for any reason (cluster resource exhaustion or quota limits).
|Dependent item|nomad.server.blocked_evals.total_blocked**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_total_blocked)`
⛔️Custom on fail: Discard value
Count of evals that have escaped computed node classes.
This indicates a scheduler optimization was skipped and is not usually a source of concern.
|Dependent item|nomad.server.blocked_evals.total_escaped**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_total_escaped)`
⛔️Custom on fail: Discard value
Count of evals waiting to be enqueued.
|Dependent item|nomad.server.broker.total_waiting**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_broker_total_waiting)`
⛔️Custom on fail: Discard value
Count of blocked evals due to quota limits (the resources for these jobs are not counted in other blocked_evals metrics, except for total_blocked).
|Dependent item|nomad.server.blocked_evals.total_quota_limit**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_total_quota_limit)`
⛔️Custom on fail: Discard value
Average time elapsed with evaluations waiting to be enqueued.
|Dependent item|nomad.server.broker.eval_waiting**Preprocessing**
Prometheus pattern: `AVG(nomad_nomad_broker_eval_waiting)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed for Eval.Ack RPC call.
|Dependent item|nomad.server.eval.ack**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_eval_ack_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed for Job.Summary RPC call.
|Dependent item|nomad.server.job_summary.get_job_summary**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_job_summary_get_job_summary_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Number of active heartbeat timers.
Each timer represents a Nomad client connection.
|Dependent item|nomad.server.heartbeat.active**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_heartbeat_active)`
⛔️Custom on fail: Discard value
Number of RPC requests being handled.
|Dependent item|nomad.server.rpc.request**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_rpc_request)`
⛔️Custom on fail: Discard value
Number of RPC requests being handled that result in an error.
|Dependent item|nomad.server.rpc.request_error**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_rpc_request_error)`
⛔️Custom on fail: Discard value
Number of RPC queries.
|Dependent item|nomad.server.rpc.query**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_rpc_query)`
⛔️Custom on fail: Discard value
Time elapsed for Job.Allocations RPC call.
|Dependent item|nomad.server.job.allocations**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_job_allocations_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed for Job.Evaluations RPC call.
|Dependent item|nomad.server.job.evaluations**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_job_evaluations_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed for Job.GetJob RPC call.
|Dependent item|nomad.server.job.get_job**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_job_get_job_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed to apply a plan.
|Dependent item|nomad.server.plan.apply**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_plan_apply_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed to evaluate a plan.
|Dependent item|nomad.server.plan.evaluate**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_plan_evaluate_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed for Plan.Submit RPC call.
|Dependent item|nomad.server.plan.submit**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_plan_submit_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed that planner waits for the raft index of the plan to be processed.
|Dependent item|nomad.server.plan.wait_for_index**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_plan_wait_for_index_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed for Node.List RPC call.
|Dependent item|nomad.server.client.list**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_client_list_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed for Node.UpdateAlloc RPC call.
|Dependent item|nomad.server.client.update_alloc**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_client_update_alloc_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed for Node.UpdateStatus RPC call.
|Dependent item|nomad.server.client.update_status**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_client_update_status_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed for Node.GetClientAllocs RPC call.
|Dependent item|nomad.server.client.get_client_allocs**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_client_get_client_allocs_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed for Eval.Dequeue RPC call.
|Dependent item|nomad.server.client.dequeue**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_eval_dequeue_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time since last successful Vault token renewal.
|Dependent item|nomad.server.vault.token_last_renewal**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_vault_token_last_renewal)`
⛔️Custom on fail: Discard value
Custom multiplier: `0.001`
Time until next Vault token renewal attempt.
|Dependent item|nomad.server.vault.token_next_renewal**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_vault_token_next_renewal)`
⛔️Custom on fail: Discard value
Custom multiplier: `0.001`
Time to live for Vault token.
|Dependent item|nomad.server.vault.token_ttl**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_vault_token_ttl)`
⛔️Custom on fail: Discard value
Custom multiplier: `0.001`
Count of revoked tokens.
|Dependent item|nomad.server.vault.distributed_tokens_revoked**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_vault_distributed_tokens_revoked)`
⛔️Custom on fail: Discard value
Number of dead jobs.
|Dependent item|nomad.server.job_status.dead**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_job_status_dead)`
⛔️Custom on fail: Set value to: `0`
Number of pending jobs.
|Dependent item|nomad.server.job_status.pending**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_job_status_pending)`
⛔️Custom on fail: Set value to: `0`
Number of running jobs.
|Dependent item|nomad.server.job_status.running**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_job_status_running)`
⛔️Custom on fail: Set value to: `0`
Number of complete allocations for a job.
|Dependent item|nomad.server.job_summary.complete**Preprocessing**
Prometheus pattern: `SUM(nomad_nomad_job_summary_complete)`
⛔️Custom on fail: Set value to: `0`
Number of failed allocations for a job.
|Dependent item|nomad.server.job_summary.failed**Preprocessing**
Prometheus pattern: `SUM(nomad_nomad_job_summary_failed)`
⛔️Custom on fail: Set value to: `0`
Number of lost allocations for a job.
|Dependent item|nomad.server.job_summary.lost**Preprocessing**
Prometheus pattern: `SUM(nomad_nomad_job_summary_lost)`
⛔️Custom on fail: Set value to: `0`
Number of unknown allocations for a job.
|Dependent item|nomad.server.job_summary.unknown**Preprocessing**
Prometheus pattern: `SUM(nomad_nomad_job_summary_unknown)`
⛔️Custom on fail: Set value to: `0`
Number of queued allocations for a job.
|Dependent item|nomad.server.job_summary.queued**Preprocessing**
Prometheus pattern: `SUM(nomad_nomad_job_summary_queued)`
⛔️Custom on fail: Set value to: `0`
Number of running allocations for a job.
|Dependent item|nomad.server.job_summary.running**Preprocessing**
Prometheus pattern: `SUM(nomad_nomad_job_summary_running)`
⛔️Custom on fail: Set value to: `0`
Number of starting allocations for a job.
|Dependent item|nomad.server.job_summary.starting**Preprocessing**
Prometheus pattern: `SUM(nomad_nomad_job_summary_starting)`
⛔️Custom on fail: Set value to: `0`
Time elapsed to broadcast gossip messages.
|Dependent item|nomad.server.memberlist.gossip**Preprocessing**
Prometheus pattern: `VALUE(nomad_memberlist_gossip_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed to establish a raft barrier during leader transition.
|Dependent item|nomad.server.leader.barrier**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_leader_barrier_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed to reconcile a serf peer with state store.
|Dependent item|nomad.server.leader.reconcile_member**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_leader_reconcileMember_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed to reconcile all serf peers with state store.
|Dependent item|nomad.server.leader.reconcile**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_leader_reconcile_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time since last contact to leader.
General indicator of Raft latency.
|Dependent item|nomad.server.raft.leader.lastContact**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_leader_lastContact{quantile="0.99"})`
⛔️Custom on fail: Discard value
Replace: `NaN -> 0`
Custom multiplier: `0.001`
Count of evals in the plan queue.
|Dependent item|nomad.server.plan.queue_depth**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_plan_queue_depth)`
⛔️Custom on fail: Discard value
Time elapsed for worker to create an eval.
|Dependent item|nomad.server.worker.create_eval**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_worker_create_eval_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed for worker to dequeue an eval.
|Dependent item|nomad.server.worker.dequeue_eval**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_worker_dequeue_eval_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed for worker to invoke the scheduler.
|Dependent item|nomad.server.worker.invoke_scheduler_service**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_worker_invoke_scheduler_service_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed for worker to send acknowledgement.
|Dependent item|nomad.server.worker.send_ack**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_worker_send_ack_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed for worker to submit plan.
|Dependent item|nomad.server.worker.submit_plan**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_worker_submit_plan_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed for worker to submit updated eval.
|Dependent item|nomad.server.worker.update_eval**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_worker_update_eval_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed that worker waits for the raft index of the eval to be processed.
|Dependent item|nomad.server.worker.wait_for_index**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_worker_wait_for_index_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Count of blocking raft API calls.
|Dependent item|nomad.server.raft.barrier**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_barrier)`
⛔️Custom on fail: Discard value
Count of logs enqueued.
|Dependent item|nomad.server.raft.commit_num_logs**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_commitNumLogs)`
⛔️Custom on fail: Discard value
Number of Raft transactions.
|Dependent item|nomad.server.raft.apply**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_apply)`
⛔️Custom on fail: Set value to: `0`
Time elapsed to commit writes.
|Dependent item|nomad.server.raft.commit_time**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_commitTime_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Raft transaction commit time.
|Dependent item|nomad.server.raft.replication.appendEntries**Preprocessing**
Prometheus pattern: `AVG(nomad_raft_replication_appendEntries_rpc)`
⛔️Custom on fail: Discard value
Custom multiplier: `0.001`
Time elapsed to apply write to FSM.
|Dependent item|nomad.server.raft.fsm.apply**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_fsm_apply_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed to enqueue write to FSM.
|Dependent item|nomad.server.raft.fsm.enqueue**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_fsm_enqueue_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed to apply Autopilot raft entry.
|Dependent item|nomad.server.raft.fsm.autopilot**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_fsm_autopilot_sum)`
⛔️Custom on fail: Set value to: `0`
Custom multiplier: `1e-09`
Time elapsed to apply RegisterNode raft entry.
|Dependent item|nomad.server.raft.fsm.register_node**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_fsm_register_node_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Current index applied to FSM.
|Dependent item|nomad.server.raft.applied_index**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_appliedIndex)`
⛔️Custom on fail: Discard value
Most recent index seen.
|Dependent item|nomad.server.raft.last_index**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_lastIndex)`
⛔️Custom on fail: Discard value
Time elapsed to write log, mark in flight, and start replication.
|Dependent item|nomad.server.raft.leader.dispatch_log**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_leader_dispatchLog_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Count of logs dispatched.
|Dependent item|nomad.server.raft.leader.dispatch_num_logs**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_leader_dispatchNumLogs)`
⛔️Custom on fail: Set value to: `0`
Count of heartbeat timeouts that resulted in starting an election.
|Dependent item|nomad.server.raft.transition.heartbeat_timeout**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_transition_heartbeat_timeout)`
⛔️Custom on fail: Set value to: `0`
Discard unchanged with heartbeat: `1h`
Count of objects freed from heap by go runtime GC.
|Dependent item|nomad.server.runtime.free_count**Preprocessing**
Prometheus pattern: `VALUE(nomad_runtime_free_count)`
⛔️Custom on fail: Discard value
Go runtime GC pause times.
|Dependent item|nomad.server.runtime.gc_pause_ns**Preprocessing**
Prometheus pattern: `VALUE(nomad_runtime_gc_pause_ns_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Go runtime GC metadata size in bytes.
|Dependent item|nomad.server.runtime.sys_bytes**Preprocessing**
Prometheus pattern: `VALUE(nomad_runtime_sys_bytes)`
⛔️Custom on fail: Discard value
Count of go runtime GC runs.
|Dependent item|nomad.server.runtime.total_gc_runs**Preprocessing**
Prometheus pattern: `VALUE(nomad_runtime_total_gc_runs)`
⛔️Custom on fail: Discard value
Count of memberlist events received.
|Dependent item|nomad.server.serf.queue.event**Preprocessing**
Prometheus pattern: `VALUE(nomad_serf_queue_Event_sum)`
⛔️Custom on fail: Discard value
Count of memberlist changes.
|Dependent item|nomad.server.serf.queue.intent**Preprocessing**
Prometheus pattern: `VALUE(nomad_serf_queue_Intent_sum)`
⛔️Custom on fail: Discard value
Count of memberlist queries.
|Dependent item|nomad.server.serf.queue.queries**Preprocessing**
Prometheus pattern: `VALUE(nomad_serf_queue_Query_sum)`
⛔️Custom on fail: Discard value
Current snapshot index.
|Dependent item|nomad.server.state.snapshot.index**Preprocessing**
Prometheus pattern: `VALUE(nomad_state_snapshotIndex)`
⛔️Custom on fail: Discard value
Count of service evals ready to be scheduled.
|Dependent item|nomad.server.broker.service_ready**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_broker_service_ready)`
⛔️Custom on fail: Discard value
Count of unacknowledged service evals.
|Dependent item|nomad.server.broker.service_unacked**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_broker_service_unacked)`
⛔️Custom on fail: Discard value
Count of system evals ready to be scheduled.
|Dependent item|nomad.server.broker.system_ready**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_broker_system_ready)`
⛔️Custom on fail: Discard value
Count of unacknowledged system evals.
|Dependent item|nomad.server.broker.system_unacked**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_broker_system_unacked)`
⛔️Custom on fail: Discard value
Number of BoltDB free pages.
|Dependent item|nomad.server.raft.boltdb.num_free_pages**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_boltdb_numFreePages)`
⛔️Custom on fail: Discard value
Number of BoltDB pending pages.
|Dependent item|nomad.server.raft.boltdb.num_pending_pages**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_boltdb_numPendingPages)`
⛔️Custom on fail: Discard value
Number of free page bytes.
|Dependent item|nomad.server.raft.boltdb.free_page_bytes**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_boltdb_freePageBytes)`
⛔️Custom on fail: Discard value
Number of freelist bytes.
|Dependent item|nomad.server.raft.boltdb.freelist_bytes**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_boltdb_freelistBytes)`
⛔️Custom on fail: Discard value
Count of total read transactions.
|Dependent item|nomad.server.raft.boltdb.total_read_txn**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_boltdb_totalReadTxn)`
⛔️Custom on fail: Discard value
Number of current open read transactions.
|Dependent item|nomad.server.raft.boltdb.open_read_txn**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_boltdb_openReadTxn)`
⛔️Custom on fail: Discard value
Number of pages in use.
|Dependent item|nomad.server.raft.boltdb.txstats.page_count**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_pageCount)`
⛔️Custom on fail: Discard value
Number of page allocations.
|Dependent item|nomad.server.raft.boltdb.txstats.page_alloc**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_pageAlloc)`
⛔️Custom on fail: Discard value
Count of total database cursors.
|Dependent item|nomad.server.raft.boltdb.txstats.cursor_count**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_cursorCount)`
⛔️Custom on fail: Discard value
Count of total database nodes.
|Dependent item|nomad.server.raft.boltdb.txstats.node_count**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_nodeCount)`
⛔️Custom on fail: Discard value
Count of total database node dereferences.
|Dependent item|nomad.server.raft.boltdb.txstats.node_deref**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_nodeDeref)`
⛔️Custom on fail: Discard value
Count of total rebalance operations.
|Dependent item|nomad.server.raft.boltdb.txstats.rebalance**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_rebalance)`
⛔️Custom on fail: Discard value
Count of total split operations.
|Dependent item|nomad.server.raft.boltdb.txstats.split**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_split)`
⛔️Custom on fail: Discard value
Count of total spill operations.
|Dependent item|nomad.server.raft.boltdb.txstats.spill**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_spill)`
⛔️Custom on fail: Discard value
Count of total write operations.
|Dependent item|nomad.server.raft.boltdb.txstats.write**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_write)`
⛔️Custom on fail: Discard value
Sample of rebalance operation times.
|Dependent item|nomad.server.raft.boltdb.txstats.rebalance_time**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_rebalanceTime_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Sample of spill operation times.
|Dependent item|nomad.server.raft.boltdb.txstats.spill_time**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_spillTime_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Sample of write operation times.
|Dependent item|nomad.server.raft.boltdb.txstats.write_time**Preprocessing**
Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_writeTime_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Current [rpc] service state.
|Simple check|net.tcp.service[tcp,,{$NOMAD.SERVER.RPC.PORT}]**Preprocessing**
Discard unchanged with heartbeat: `1h`
Current [serf] service state.
|Simple check|net.tcp.service[tcp,,{$NOMAD.SERVER.SERF.PORT}]**Preprocessing**
Discard unchanged with heartbeat: `1h`
Time elapsed for Namespace.ListNamespaces.
|Dependent item|nomad.server.namespace.list_namespace**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_namespace_list_namespace_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Current autopilot state.
|Dependent item|nomad.server.autopilot.state**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_autopilot_healthy)`
⛔️Custom on fail: Discard value
The number of redundant healthy servers that can fail without causing an outage.
|Dependent item|nomad.server.autopilot.failure_tolerance**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_autopilot_failure_tolerance)`
⛔️Custom on fail: Discard value
Time elapsed to apply AllocClientUpdate raft entry.
|Dependent item|nomad.server.alloc_client_update**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_fsm_alloc_client_update_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed to apply ApplyPlanResults raft entry.
|Dependent item|nomad.server.fsm.apply_plan_results**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_fsm_apply_plan_results_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed to apply UpdateEval raft entry.
|Dependent item|nomad.server.fsm.update_eval**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_fsm_update_eval_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Time elapsed to apply RegisterJob raft entry.
|Dependent item|nomad.server.fsm.register_job**Preprocessing**
Prometheus pattern: `VALUE(nomad_nomad_fsm_register_job_sum)`
⛔️Custom on fail: Discard value
Custom multiplier: `1e-09`
Count of attempts to reschedule an allocation.
|Dependent item|nomad.server.scheduler.allocs.rescheduled.attempted**Preprocessing**
Prometheus pattern: `SUM(nomad_scheduler_allocs_reschedule_attempted)`
⛔️Custom on fail: Set value to: `0`
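
The preprocessing chains listed above follow one shape: a Prometheus pattern (`VALUE` or `SUM`), a custom-on-fail action (discard, or set to `0`), and an optional custom multiplier such as `1e-09`. The Python sketch below only illustrates that shape and is not Zabbix's implementation. The helper names (`samples`, `preprocess`), the `job` label, and the sample values are hypothetical. It assumes that `VALUE` fails unless exactly one sample matches, that `SUM` fails when nothing matches, and that the multiplier is still applied after a "set value" fallback.

```python
import re

# Simplified matcher for one sample line: name, optional {labels}, value.
# Assumption: label values never contain a '}' character.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')

DISCARD = object()  # marker for "Custom on fail: Discard value"


def samples(text, metric):
    """Return the values of all samples named `metric` in Prometheus text."""
    values = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match and match.group(1) == metric:
            values.append(float(match.group(3)))
    return values


def preprocess(text, metric, func="VALUE", multiplier=1.0, on_fail=DISCARD):
    """Apply pattern, on-fail action, then multiplier; None means discarded."""
    found = samples(text, metric)
    if func == "VALUE" and len(found) == 1:
        value = found[0]
    elif func == "SUM" and found:
        value = sum(found)
    elif on_fail is DISCARD:
        return None  # the value is dropped and nothing is stored
    else:
        value = on_fail  # "Custom on fail: Set value to: ..."
    return value * multiplier


# Made-up metrics output, for illustration only.
SAMPLE = """\
# TYPE nomad_nomad_plan_apply summary
nomad_nomad_plan_apply_sum 2500000000
nomad_nomad_job_summary_running{job="a"} 2
nomad_nomad_job_summary_running{job="b"} 3
"""

if __name__ == "__main__":
    # Timer sum scaled by 1e-09, as in the plan.apply item.
    print(preprocess(SAMPLE, "nomad_nomad_plan_apply_sum", multiplier=1e-09))
    # Per-job gauge summed across labels, with a fallback of 0.
    print(preprocess(SAMPLE, "nomad_nomad_job_summary_running", func="SUM", on_fail=0))
    # Metric absent from the output: discarded.
    print(preprocess(SAMPLE, "nomad_nomad_rpc_query"))
```

Items that fall back to `0` (job status, job summary) keep reporting when a metric is absent, while items that discard the value simply record no data for that poll.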
Monitoring API connection has failed.
Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.
Internal stats API connection has failed.
Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.
Nomad server version has changed.
|`change(/HashiCorp Nomad Server by HTTP/nomad.server.version)<>0`|Info|**Manual close**: Yes|
|HashiCorp Nomad Server: Cluster role has changed|Cluster role has changed.
|`change(/HashiCorp Nomad Server by HTTP/nomad.server.raft.cluster_role) <> 0`|Info|**Manual close**: Yes|
|HashiCorp Nomad Server: Current number of open files is too high|Heavy file descriptor usage (i.e., near the process file descriptor limit) indicates a potential file descriptor exhaustion issue.
|`min(/HashiCorp Nomad Server by HTTP/nomad.server.process_open_fds,5m)/last(/HashiCorp Nomad Server by HTTP/nomad.server.process_max_fds)*100>{$NOMAD.OPEN.FDS.MAX}`|Warning||
|HashiCorp Nomad Server: Dead jobs found|Jobs with the `Dead` state discovered.
Check the {$NOMAD.SERVER.API.SCHEME}://{HOST.IP}:{$NOMAD.SERVER.API.PORT}/v1/jobs URL for the details.
The nomad.raft.leader.lastContact metric is a general indicator of Raft latency which can be used to observe how Raft timing is performing and guide infrastructure provisioning.
If this number trends upwards, look at CPU, disk IOPS, and network latency. nomad.raft.leader.lastContact should not get too close to the leader lease timeout of 500ms.
Cannot establish the connection to [rpc] service port {$NOMAD.SERVER.RPC.PORT}.
Check the Nomad state and network connectivity between Nomad and Zabbix.
Cannot establish the connection to [serf] service port {$NOMAD.SERVER.SERF.PORT}.
Check the Nomad state and network connectivity between Nomad and Zabbix.
The autopilot is in an unhealthy state. The probability of a successful failover is extremely low.
|`last(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.state) = 0 and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.state,5m) = 0`|Average|**Manual close**: Yes|
|HashiCorp Nomad Server: Autopilot redundancy is low|The autopilot redundancy is low.
The risk of a cluster crash is high if one more server fails.