84 KiB
HashiCorp Nomad by HTTP
Overview
This template is designed to monitor HashiCorp Nomad by Zabbix. It works without any external scripts. Currently the template supports Nomad servers and clients discovery.
Requirements
Zabbix version: 7.0 and higher.
Tested versions
This template has been tested on:
- HashiCorp Nomad version 1.5.6/1.6.0
Configuration
Zabbix should be configured according to the instructions in the Templates out of the box section.
Setup
- Create a synthetic Nomad host. It should be one of the Nomad cluster members, load-balancing service (if cluster is used) or a single node in a selected Nomad region.
- Define the
{$NOMAD.ENDPOINT.API.URL}
macro value with correct web protocol, host and port. - Prepare an ACL token with
node:read
,namespace:read-job
,agent:read
andmanagement
permissions applied. Define the{$NOMAD.TOKEN}
macro value.
Refer to the vendor documentation about
Nomad native ACL
orNomad Vault-generated tokens
if you have the HashiCorp Vault integration configured.
Additional information:
- Synthetic Nomad host will be used just as an endpoint for servers and clients discovery (general cluster information), it will not be monitored as a Nomad server or client, so that to prevent duplicate entities.
- If you're not using ACL - skip 3rd setup step.
- The Nomad servers/clients discovery is limited by region. If you're using multi-region cluster- create one synthetic host per region.
- The Nomad server/client templates are ready for separate usage. Feel free to use if you prefer manual host creation.
Useful links
- HashiCorp Nomad multi-region federation
- HashiCorp Nomad agent API reference
- HashiCorp Nomad raft operator API reference
- HashiCorp Nomad nodes API reference
Macros used
Name | Description | Default |
---|---|---|
{$NOMAD.ENDPOINT.API.URL} | API endpoint URL for one of the Nomad cluster members. |
http://localhost:4646 |
{$NOMAD.TOKEN} | Nomad authentication token. |
<PUT YOUR AUTH TOKEN> |
{$NOMAD.DATA.TIMEOUT} | Response timeout for an API. |
15s |
{$NOMAD.HTTP.PROXY} | Sets the HTTP proxy for script and HTTP agent items. If this parameter is empty, then no proxy is used. |
|
{$NOMAD.API.RESPONSE.SUCCESS} | HTTP API successful response code. Availability triggers threshold. Change, if needed. |
200 |
{$NOMAD.SERVER.NAME.MATCHES} | The filter to include HashiCorp Nomad servers by name. |
.* |
{$NOMAD.SERVER.NAME.NOT_MATCHES} | The filter to exclude HashiCorp Nomad servers by name. |
CHANGE_IF_NEEDED |
{$NOMAD.SERVER.DC.MATCHES} | The filter to include HashiCorp Nomad servers by datacenter belonging. |
.* |
{$NOMAD.SERVER.DC.NOT_MATCHES} | The filter to exclude HashiCorp Nomad servers by datacenter belonging. |
CHANGE_IF_NEEDED |
{$NOMAD.CLIENT.NAME.MATCHES} | The filter to include HashiCorp Nomad clients by name. |
.* |
{$NOMAD.CLIENT.NAME.NOT_MATCHES} | The filter to exclude HashiCorp Nomad clients by name. |
CHANGE_IF_NEEDED |
{$NOMAD.CLIENT.DC.MATCHES} | The filter to include HashiCorp Nomad clients by datacenter belonging. |
.* |
{$NOMAD.CLIENT.DC.NOT_MATCHES} | The filter to exclude HashiCorp Nomad clients by datacenter belonging. |
CHANGE_IF_NEEDED |
{$NOMAD.CLIENT.SCHEDULE.ELIGIBILITY.MATCHES} | The filter to include HashiCorp Nomad clients by scheduling eligibility. |
.* |
{$NOMAD.CLIENT.SCHEDULE.ELIGIBILITY.NOT_MATCHES} | The filter to exclude HashiCorp Nomad clients by scheduling eligibility. |
CHANGE_IF_NEEDED |
Items
Name | Description | Type | Key and additional info |
---|---|---|---|
HashiCorp Nomad: Nomad clients get | Nomad clients data in raw format. |
HTTP agent | nomad.client.nodes.get Preprocessing
|
HashiCorp Nomad: Client nodes API response | Client nodes API response message. |
Dependent item | nomad.client.nodes.api.response Preprocessing
|
HashiCorp Nomad: Nomad servers get | Nomad servers data in raw format. |
Script | nomad.server.nodes.get |
HashiCorp Nomad: Server-related APIs response | Server-related ( |
Dependent item | nomad.server.api.response Preprocessing
|
HashiCorp Nomad: Region | Current cluster region. |
Dependent item | nomad.region Preprocessing
|
HashiCorp Nomad: Nomad servers count | Nomad servers count. |
Dependent item | nomad.servers.count Preprocessing
|
HashiCorp Nomad: Nomad clients count | Nomad clients count. |
Dependent item | nomad.clients.count Preprocessing
|
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
HashiCorp Nomad: Client nodes API connection has failed | Client nodes API connection has failed. |
find(/HashiCorp Nomad by HTTP/nomad.client.nodes.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 |
Average | Manual close: Yes |
HashiCorp Nomad: Server-related API connection has failed | Server-related API connection has failed. |
find(/HashiCorp Nomad by HTTP/nomad.server.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 |
Average | Manual close: Yes |
LLD rule Clients discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Clients discovery | Client nodes discovery. |
Dependent item | nomad.clients.discovery Preprocessing
|
LLD rule Servers discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Servers discovery | Server nodes discovery. |
Dependent item | nomad.servers.discovery Preprocessing
|
HashiCorp Nomad Client by HTTP
Overview
This template is designed to monitor HashiCorp Nomad clients by Zabbix. It works without any external scripts.
Requirements
Zabbix version: 7.0 and higher.
Tested versions
This template has been tested on:
- HashiCorp Nomad version 1.5.6/1.6.0
Configuration
Zabbix should be configured according to the instructions in the Templates out of the box section.
Setup
- Enable telemetry in HashiCorp Nomad agent configuration file. Set the Prometheus metrics format.
Refer to the
vendor documentation
.
- Prepare an ACL token with
node:read
,namespace:read-job
permissions applied. Define the{$NOMAD.TOKEN}
macro value.
Refer to the vendor documentation about
Nomad native ACL
orNomad Vault-generated tokens
if you're using integration with HashiCorp Vault.
- Set the values for the
{$NOMAD.CLIENT.API.SCHEME}
and{$NOMAD.CLIENT.API.PORT}
macros to define the common Nomad API web schema and connection port.
Additional information:
-
You have to prepare an additional ACL token only if you wish to monitor Nomad clients as separate entities. If you're using clients discovery - token will be inherited from the master host linked to the HashiCorp Nomad by HTTP template.
-
If you're not using ACL - skip 2nd setup step.
-
The Nomad clients use the default web schema -
HTTP
and default API port -4646
. If you're using clients discovery and you need to re-define macros for the particular host created from prototype, use the context macros like {{$NOMAD.CLIENT.API.SCHEME:NECESSARY.IP
}} or/and {{$NOMAD.CLIENT.API.PORT:NECESSARY.IP
}} on master host or template level. -
Some metrics may not be collected depending on your HashiCorp Nomad agent version and configuration.
Useful links:
- HashiCorp Nomad metrics list
- HashiCorp Nomad telemetry configuration reference
- HashiCorp Nomad metrics API reference
- HashiCorp Nomad nodes API reference
- HashiCorp Nomad allocations API reference
- Zabbix user macros with context
Macros used
Name | Description | Default |
---|---|---|
{$NOMAD.CLIENT.API.SCHEME} | Nomad client API scheme. |
http |
{$NOMAD.CLIENT.API.PORT} | Nomad client API port. |
4646 |
{$NOMAD.TOKEN} | Nomad authentication token. |
<PUT YOUR AUTH TOKEN> |
{$NOMAD.DATA.TIMEOUT} | Response timeout for an API. |
15s |
{$NOMAD.HTTP.PROXY} | Sets the HTTP proxy for HTTP agent item. If this parameter is empty, then no proxy is used. |
|
{$NOMAD.API.RESPONSE.SUCCESS} | HTTP API successful response code. Availability triggers threshold. Change, if needed. |
200 |
{$NOMAD.CLIENT.RPC.PORT} | Nomad RPC service port. |
4647 |
{$NOMAD.CLIENT.SERF.PORT} | Nomad serf service port. |
4648 |
{$NOMAD.CLIENT.OPEN.FDS.MAX.WARN} | Maximum percentage of used file descriptors. |
90 |
{$NOMAD.DISK.NAME.MATCHES} | The filter to include HashiCorp Nomad client disks by name. |
.* |
{$NOMAD.DISK.NAME.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client disks by name. |
CHANGE_IF_NEEDED |
{$NOMAD.JOB.NAME.MATCHES} | The filter to include HashiCorp Nomad client jobs by name. |
.* |
{$NOMAD.JOB.NAME.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client jobs by name. |
CHANGE_IF_NEEDED |
{$NOMAD.JOB.NAMESPACE.MATCHES} | The filter to include HashiCorp Nomad client jobs by namespace. |
.* |
{$NOMAD.JOB.NAMESPACE.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client jobs by namespace. |
CHANGE_IF_NEEDED |
{$NOMAD.JOB.TYPE.MATCHES} | The filter to include HashiCorp Nomad client jobs by type. |
.* |
{$NOMAD.JOB.TYPE.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client jobs by type. |
CHANGE_IF_NEEDED |
{$NOMAD.JOB.TASK.GROUP.MATCHES} | The filter to include HashiCorp Nomad client jobs by task group belonging. |
.* |
{$NOMAD.JOB.TASK.GROUP.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client jobs by task group belonging. |
CHANGE_IF_NEEDED |
{$NOMAD.DRIVER.NAME.MATCHES} | The filter to include HashiCorp Nomad client drivers by name. |
.* |
{$NOMAD.DRIVER.NAME.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client drivers by name. |
CHANGE_IF_NEEDED |
{$NOMAD.DRIVER.DETECT.MATCHES} | The filter to include HashiCorp Nomad client drivers by detection state. Possible filtering values: |
.* |
{$NOMAD.DRIVER.DETECT.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client drivers by detection state. Possible filtering values: |
CHANGE_IF_NEEDED |
{$NOMAD.CPU.UTIL.MIN} | CPU utilization threshold. Measured as a percentage. |
90 |
{$NOMAD.RAM.AVAIL.MIN} | CPU utilization threshold. Measured as a percentage. |
5 |
{$NOMAD.INODES.FREE.MIN.WARN} | Warning threshold of the filesystem metadata utilization. Measured as a percentage. |
20 |
{$NOMAD.INODES.FREE.MIN.CRIT} | Critical threshold of the filesystem metadata utilization. Measured as a percentage. |
10 |
Items
Name | Description | Type | Key and additional info |
---|---|---|---|
HashiCorp Nomad Client: Telemetry get | Telemetry data in raw format. |
HTTP agent | nomad.client.data.get Preprocessing
|
HashiCorp Nomad Client: Metrics | Nomad client metrics in raw format. |
Dependent item | nomad.client.metrics.get Preprocessing
|
HashiCorp Nomad Client: Monitoring API response | Monitoring API response message. |
Dependent item | nomad.client.data.api.response Preprocessing
|
HashiCorp Nomad Client: Service [rpc] state | Current [rpc] service state. |
Simple check | net.tcp.service[tcp,,{$NOMAD.CLIENT.RPC.PORT}] Preprocessing
|
HashiCorp Nomad Client: Service [serf] state | Current [serf] service state. |
Simple check | net.tcp.service[tcp,,{$NOMAD.CLIENT.SERF.PORT}] Preprocessing
|
HashiCorp Nomad Client: CPU allocated | Total amount of CPU shares the scheduler has allocated to tasks. |
Dependent item | nomad.client.allocated.cpu Preprocessing
|
HashiCorp Nomad Client: CPU unallocated | Total amount of CPU shares free for the scheduler to allocate to tasks. |
Dependent item | nomad.client.unallocated.cpu Preprocessing
|
HashiCorp Nomad Client: Memory allocated | Total amount of memory the scheduler has allocated to tasks. |
Dependent item | nomad.client.allocated.memory Preprocessing
|
HashiCorp Nomad Client: Memory unallocated | Total amount of memory free for the scheduler to allocate to tasks. |
Dependent item | nomad.client.unallocated.memory Preprocessing
|
HashiCorp Nomad Client: Disk allocated | Total amount of disk space the scheduler has allocated to tasks. |
Dependent item | nomad.client.allocated.disk Preprocessing
|
HashiCorp Nomad Client: Disk unallocated | Total amount of disk space free for the scheduler to allocate to tasks. |
Dependent item | nomad.client.unallocated.disk Preprocessing
|
HashiCorp Nomad Client: Allocations blocked | Number of allocations waiting for previous versions. |
Dependent item | nomad.client.allocations.blocked Preprocessing
|
HashiCorp Nomad Client: Allocations migrating | Number of allocations migrating data from previous versions. |
Dependent item | nomad.client.allocations.migrating Preprocessing
|
HashiCorp Nomad Client: Allocations pending | Number of allocations pending (received by the client but not yet running). |
Dependent item | nomad.client.allocations.pending Preprocessing
|
HashiCorp Nomad Client: Allocations starting | Number of allocations starting. |
Dependent item | nomad.client.allocations.start Preprocessing
|
HashiCorp Nomad Client: Allocations running | Number of allocations running. |
Dependent item | nomad.client.allocations.running Preprocessing
|
HashiCorp Nomad Client: Allocations terminal | Number of allocations terminal. |
Dependent item | nomad.client.allocations.terminal Preprocessing
|
HashiCorp Nomad Client: Allocations failed, rate | Number of allocations failed. |
Dependent item | nomad.client.allocations.failed Preprocessing
|
HashiCorp Nomad Client: Allocations completed, rate | Number of allocations completed. |
Dependent item | nomad.client.allocations.complete Preprocessing
|
HashiCorp Nomad Client: Allocations restarted, rate | Number of allocations restarted. |
Dependent item | nomad.client.allocations.restart Preprocessing
|
HashiCorp Nomad Client: Allocations OOM killed | Number of allocations OOM killed. |
Dependent item | nomad.client.allocations.oom_killed Preprocessing
|
HashiCorp Nomad Client: CPU idle utilization | CPU utilization in idle state. |
Dependent item | nomad.client.cpu.idle Preprocessing
|
HashiCorp Nomad Client: CPU system utilization | CPU utilization in system space. |
Dependent item | nomad.client.cpu.system Preprocessing
|
HashiCorp Nomad Client: CPU total utilization | Total CPU utilization. |
Dependent item | nomad.client.cpu.total Preprocessing
|
HashiCorp Nomad Client: CPU user utilization | CPU utilization in user space. |
Dependent item | nomad.client.cpu.user Preprocessing
|
HashiCorp Nomad Client: Memory available | Total amount of memory available to processes which includes free and cached memory. |
Dependent item | nomad.client.memory.available Preprocessing
|
HashiCorp Nomad Client: Memory free | Amount of memory which is free. |
Dependent item | nomad.client.memory.free Preprocessing
|
HashiCorp Nomad Client: Memory size | Total amount of physical memory on the node. |
Dependent item | nomad.client.memory.total Preprocessing
|
HashiCorp Nomad Client: Memory used | Amount of memory used by processes. |
Dependent item | nomad.client.memory.used Preprocessing
|
HashiCorp Nomad Client: Uptime | Uptime of the host running the Nomad client. |
Dependent item | nomad.client.uptime Preprocessing
|
HashiCorp Nomad Client: Node info get | Node info data in raw format. |
HTTP agent | nomad.client.node.info.get Preprocessing
|
HashiCorp Nomad Client: Nomad client version | Nomad client version. |
Dependent item | nomad.client.version Preprocessing
|
HashiCorp Nomad Client: Nodes API response | Nodes API response message. |
Dependent item | nomad.client.node.info.api.response Preprocessing
|
HashiCorp Nomad Client: Allocated jobs get | Allocated jobs data in raw format. |
HTTP agent | nomad.client.job.allocs.get Preprocessing
|
HashiCorp Nomad Client: Allocations API response | Allocations API response message. |
Dependent item | nomad.client.job.allocs.api.response Preprocessing
|
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
HashiCorp Nomad Client: Monitoring API connection has failed | Monitoring API connection has failed. |
find(/HashiCorp Nomad Client by HTTP/nomad.client.data.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 |
Average | Manual close: Yes |
HashiCorp Nomad Client: Service [rpc] is down | Cannot establish the connection to [rpc] service port {$NOMAD.CLIENT.RPC.PORT}. |
last(/HashiCorp Nomad Client by HTTP/net.tcp.service[tcp,,{$NOMAD.CLIENT.RPC.PORT}]) = 0 |
Average | Manual close: Yes |
HashiCorp Nomad Client: Service [serf] is down | Cannot establish the connection to [serf] service port {$NOMAD.CLIENT.SERF.PORT}. |
last(/HashiCorp Nomad Client by HTTP/net.tcp.service[tcp,,{$NOMAD.CLIENT.SERF.PORT}]) = 0 |
Average | Manual close: Yes |
HashiCorp Nomad Client: OOM killed allocations found | OOM killed allocations found. |
last(/HashiCorp Nomad Client by HTTP/nomad.client.allocations.oom_killed) > 0 |
Warning | Manual close: Yes |
HashiCorp Nomad Client: High CPU utilization | CPU utilization is too high. The system might be slow to respond. |
min(/HashiCorp Nomad Client by HTTP/nomad.client.cpu.total, 10m) >= {$NOMAD.CPU.UTIL.MIN} |
Average | |
HashiCorp Nomad Client: High memory utilization | RAM utilization is too high. The system might be slow to respond. |
(min(/HashiCorp Nomad Client by HTTP/nomad.client.memory.available, 10m) / last(/HashiCorp Nomad Client by HTTP/nomad.client.memory.total))*100 <= {$NOMAD.RAM.AVAIL.MIN} |
Average | |
HashiCorp Nomad Client: The host has been restarted | The host uptime is less than 10 minutes. |
last(/HashiCorp Nomad Client by HTTP/nomad.client.uptime) < 10m |
Warning | Manual close: Yes |
HashiCorp Nomad Client: Nomad client version has changed | Nomad client version has changed. |
change(/HashiCorp Nomad Client by HTTP/nomad.client.version)<>0 |
Info | Manual close: Yes |
HashiCorp Nomad Client: Nodes API connection has failed | Nodes API connection has failed. |
find(/HashiCorp Nomad Client by HTTP/nomad.client.node.info.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 |
Average | Manual close: Yes Depends on:
|
HashiCorp Nomad Client: Allocations API connection has failed | Allocations API connection has failed. |
find(/HashiCorp Nomad Client by HTTP/nomad.client.job.allocs.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 |
Average | Manual close: Yes Depends on:
|
LLD rule Drivers discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Drivers discovery | Client drivers discovery. |
Dependent item | nomad.client.drivers.discovery Preprocessing
|
Item prototypes for Drivers discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] state | Driver [{#DRIVER.NAME}] state. |
Dependent item | nomad.client.driver.state["{#DRIVER.NAME}"] Preprocessing
|
HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] detection state | Driver [{#DRIVER.NAME}] detection state. |
Dependent item | nomad.client.driver.detected["{#DRIVER.NAME}"] Preprocessing
|
Trigger prototypes for Drivers discovery
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] is in unhealthy state | The [{#DRIVER.NAME}] driver detected, but its state is unhealthy. |
last(/HashiCorp Nomad Client by HTTP/nomad.client.driver.state["{#DRIVER.NAME}"]) = 0 and last(/HashiCorp Nomad Client by HTTP/nomad.client.driver.detected["{#DRIVER.NAME}"]) = 1 |
Warning | Manual close: Yes |
HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] detection state has changed | The [{#DRIVER.NAME}] driver detection state has changed. |
change(/HashiCorp Nomad Client by HTTP/nomad.client.driver.detected["{#DRIVER.NAME}"]) <> 0 |
Info | Manual close: Yes |
LLD rule Physical disks discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Physical disks discovery | Physical disks discovery. |
Dependent item | nomad.client.disk.discovery Preprocessing
|
Item prototypes for Physical disks discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
HashiCorp Nomad Client: Disk ["{#DEV.NAME}"] space available | Amount of space which is available on ["{#DEV.NAME}"] disk. |
Dependent item | nomad.client.disk.available["{#DEV.NAME}"] Preprocessing
|
HashiCorp Nomad Client: Disk ["{#DEV.NAME}"] inodes utilization | Disk space consumed by the inodes on ["{#DEV.NAME}"] disk. |
Dependent item | nomad.client.disk.inodes_percent["{#DEV.NAME}"] Preprocessing
|
HashiCorp Nomad Client: Disk ["{#DEV.NAME}"] size | Total size of the ["{#DEV.NAME}"] device. |
Dependent item | nomad.client.disk.size["{#DEV.NAME}"] Preprocessing
|
HashiCorp Nomad Client: Disk ["{#DEV.NAME}"] space utilization | Percentage of disk ["{#DEV.NAME}"] space used. |
Dependent item | nomad.client.disk.used_percent["{#DEV.NAME}"] Preprocessing
|
HashiCorp Nomad Client: Disk ["{#DEV.NAME}"] space used | Amount of disk ["{#DEV.NAME}"] space which has been used. |
Dependent item | nomad.client.disk.used["{#DEV.NAME}"] Preprocessing
|
Trigger prototypes for Physical disks discovery
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
HashiCorp Nomad Client: Running out of free inodes on [{#DEV.NAME}] device | It may become impossible to write to a disk if there are no index nodes left. |
min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.inodes_percent["{#DEV.NAME}"],5m) >= {$NOMAD.INODES.FREE.MIN.WARN:"{#DEV.NAME}"} |
Warning | Manual close: Yes Depends on:
|
HashiCorp Nomad Client: Running out of free inodes on [{#DEV.NAME}] device | It may become impossible to write to a disk if there are no index nodes left. |
min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.inodes_percent["{#DEV.NAME}"],5m) >= {$NOMAD.INODES.FREE.MIN.CRIT:"{#DEV.NAME}"} |
Average | Manual close: Yes |
HashiCorp Nomad Client: High disk [{#DEV.NAME}] utilization | High disk [{#DEV.NAME}] utilization. |
min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.used_percent["{#DEV.NAME}"],5m) >= {$NOMAD.DISK.UTIL.MIN.WARN:"{#DEV.NAME}"} |
Warning | Manual close: Yes Depends on:
|
HashiCorp Nomad Client: High disk [{#DEV.NAME}] utilization | High disk [{#DEV.NAME}] utilization. |
min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.used_percent["{#DEV.NAME}"],5m) >= {$NOMAD.DISK.UTIL.MIN.CRIT:"{#DEV.NAME}"} |
Average | Manual close: Yes |
LLD rule Allocated jobs discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Allocated jobs discovery | Allocated jobs discovery. |
Dependent item | nomad.client.alloc.discovery Preprocessing
|
Item prototypes for Allocated jobs discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU allocated | Total CPU resources allocated by the ["{#JOB.NAME}"] job across all cores. |
Dependent item | nomad.client.allocs.cpu.allocated["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing
|
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU system utilization | Total CPU resources consumed by the ["{#JOB.NAME}"] job in system space. |
Dependent item | nomad.client.allocs.cpu.system["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing
|
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU user utilization | Total CPU resources consumed by the ["{#JOB.NAME}"] job in user space. |
Dependent item | nomad.client.allocs.cpu.user["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing
|
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU total utilization | Total CPU resources consumed by the ["{#JOB.NAME}"] job across all cores. |
Dependent item | nomad.client.allocs.cpu.total_percent["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing
|
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU throttled periods time | Total number of CPU periods that the ["{#JOB.NAME}"] job was throttled. |
Dependent item | nomad.client.allocs.cpu.throttled_periods["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing
|
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU throttled time | Total time that the ["{#JOB.NAME}"] job was throttled. |
Dependent item | nomad.client.allocs.cpu.throttled_time["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing
|
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU ticks | CPU ticks consumed by the process for the ["{#JOB.NAME}"] job in the last collection interval. |
Dependent item | nomad.client.allocs.cpu.total_ticks["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing
|
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] Memory allocated | Amount of memory allocated by the ["{#JOB.NAME}"] job. |
Dependent item | nomad.client.allocs.memory.allocated["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing
|
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] Memory cached | Amount of memory cached by the ["{#JOB.NAME}"] job. |
Dependent item | nomad.client.allocs.memory.cache["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing
|
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] Memory used | Total amount of memory used by the ["{#JOB.NAME}"] job. |
Dependent item | nomad.client.allocs.memory.usage["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing
|
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] Memory swapped | Amount of memory swapped by the ["{#JOB.NAME}"] job. |
Dependent item | nomad.client.allocs.memory.swap["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing
|
HashiCorp Nomad Server by HTTP
Overview
This template is designed to monitor HashiCorp Nomad servers by Zabbix. It works without any external scripts.
Requirements
Zabbix version: 7.0 and higher.
Tested versions
This template has been tested on:
- HashiCorp Nomad version 1.5.6/1.6.0
Configuration
Zabbix should be configured according to the instructions in the Templates out of the box section.
Setup
- Enable telemetry in HashiCorp Nomad agent configuration file. Set the Prometheus metrics format.
Refer to the
vendor documentation
.
- Set the values for the
{$NOMAD.SERVER.API.SCHEME}
and{$NOMAD.SERVER.API.PORT}
macros to define the common Nomad API web schema and connection port.
Additional information:
- The Nomad servers use the default web schema -
HTTP
and default API port -4646
. If you're using servers discovery and you need to re-define macros for the particular host created from prototype, use the context macros like {{$NOMAD.SERVER.API.SCHEME:NECESSARY.IP
}} or/and {{$NOMAD.SERVER.API.PORT:NECESSARY.IP
}} on master host or template level. - Some metrics may not be collected depending on your HashiCorp Nomad agent version, configuration and cluster role.
- Don't forget to define the
{$NOMAD.REDUNDANCY.MIN}
macro value, based on your cluster nodes amount to configure the failure tolerance triggers correctly.
Useful links:
- HashiCorp Nomad metrics list
- HashiCorp Nomad telemetry configuration reference
- HashiCorp Nomad metrics API reference
- HashiCorp Nomad agent API reference
- HashiCorp Nomad cluster failure tolerance reference
- Zabbix user macros with context
Macros used
Name | Description | Default |
---|---|---|
{$NOMAD.SERVER.API.SCHEME} | Nomad SERVER API scheme. |
http |
{$NOMAD.SERVER.API.PORT} | Nomad SERVER API port. |
4646 |
{$NOMAD.TOKEN} | Nomad authentication token. |
<PUT YOUR AUTH TOKEN> |
{$NOMAD.DATA.TIMEOUT} | Response timeout for an API. |
15s |
{$NOMAD.HTTP.PROXY} | Sets the HTTP proxy for HTTP agent item. If this parameter is empty, then no proxy is used. |
|
{$NOMAD.API.RESPONSE.SUCCESS} | HTTP API successful response code. Availability triggers threshold. Change, if needed. |
200 |
{$NOMAD.SERVER.RPC.PORT} | Nomad RPC service port. |
4647 |
{$NOMAD.SERVER.SERF.PORT} | Nomad serf service port. |
4648 |
{$NOMAD.REDUNDANCY.MIN} | Amount of redundant servers to keep the cluster safe. Default value - '1' for the 3-nodes cluster. Change if needed. |
1 |
{$NOMAD.OPEN.FDS.MAX} | Maximum percentage of used file descriptors. |
90 |
{$NOMAD.SERVER.LEADER.LATENCY} | Leader last contact latency threshold. |
0.3s |
Items
Name | Description | Type | Key and additional info |
---|---|---|---|
HashiCorp Nomad Server: Telemetry get | Telemetry data in raw format. |
HTTP agent | nomad.server.data.get Preprocessing
|
HashiCorp Nomad Server: Metrics | Nomad server metrics in raw format. |
Dependent item | nomad.server.metrics.get Preprocessing
|
HashiCorp Nomad Server: Monitoring API response | Monitoring API response message. |
Dependent item | nomad.server.data.api.response Preprocessing
|
HashiCorp Nomad Server: Internal stats get | Internal stats data in raw format. |
HTTP agent | nomad.server.stats.get Preprocessing
|
HashiCorp Nomad Server: Internal stats API response | Internal stats API response message. |
Dependent item | nomad.server.stats.api.response Preprocessing
|
HashiCorp Nomad Server: Nomad server version | Nomad server version. |
Dependent item | nomad.server.version Preprocessing
|
HashiCorp Nomad Server: Nomad raft version | Nomad raft version. |
Dependent item | nomad.raft.version Preprocessing
|
HashiCorp Nomad Server: Raft peers | Current cluster raft peers amount. |
Dependent item | nomad.server.raft.peers Preprocessing
|
HashiCorp Nomad Server: Cluster role | Current role in the cluster. |
Dependent item | nomad.server.raft.cluster_role Preprocessing
|
HashiCorp Nomad Server: CPU time, rate | Total user and system CPU time spent in seconds. |
Dependent item | nomad.server.cpu.time Preprocessing
|
HashiCorp Nomad Server: Memory used | Memory utilization in bytes. |
Dependent item | nomad.server.runtime.alloc_bytes Preprocessing
|
HashiCorp Nomad Server: Virtual memory size | Virtual memory size in bytes. |
Dependent item | nomad.server.virtual_memory_bytes Preprocessing
|
HashiCorp Nomad Server: Resident memory size | Resident memory size in bytes. |
Dependent item | nomad.server.resident_memory_bytes Preprocessing
|
HashiCorp Nomad Server: Heap objects | Number of objects on the heap. General memory pressure indicator. |
Dependent item | nomad.server.runtime.heap_objects Preprocessing
|
HashiCorp Nomad Server: Open file descriptors | Number of open file descriptors. |
Dependent item | nomad.server.process_open_fds Preprocessing
|
HashiCorp Nomad Server: Open file descriptors, max | Maximum number of open file descriptors. |
Dependent item | nomad.server.process_max_fds Preprocessing
|
HashiCorp Nomad Server: Goroutines | Number of goroutines and general load pressure indicator. |
Dependent item | nomad.server.runtime.num_goroutines Preprocessing
|
HashiCorp Nomad Server: Evaluations pending | Evaluations that are pending until an existing evaluation for the same job completes. |
Dependent item | nomad.server.broker.total_pending Preprocessing
|
HashiCorp Nomad Server: Evaluations ready | Number of evaluations ready to be processed. |
Dependent item | nomad.server.broker.total_ready Preprocessing
|
HashiCorp Nomad Server: Evaluations unacked | Evaluations dispatched for processing but incomplete. |
Dependent item | nomad.server.broker.total_unacked Preprocessing
|
HashiCorp Nomad Server: CPU shares for blocked evaluations | Amount of CPU shares requested by blocked evals. |
Dependent item | nomad.server.blocked_evals.cpu Preprocessing
|
HashiCorp Nomad Server: Memory shares by blocked evaluations | Amount of memory requested by blocked evals. |
Dependent item | nomad.server.blocked_evals.memory Preprocessing
|
HashiCorp Nomad Server: CPU shares for blocked job evaluations | Amount of CPU shares requested by blocked evals of a job. |
Dependent item | nomad.server.blocked_evals.job.cpu Preprocessing
|
HashiCorp Nomad Server: Memory shares for blocked job evaluations | Amount of memory requested by blocked evals of a job. |
Dependent item | nomad.server.blocked_evals.job.memory Preprocessing
|
HashiCorp Nomad Server: Evaluations blocked | Count of evals in the blocked state for any reason (cluster resource exhaustion or quota limits). |
Dependent item | nomad.server.blocked_evals.total_blocked Preprocessing
|
HashiCorp Nomad Server: Evaluations escaped | Count of evals that have escaped computed node classes. This indicates a scheduler optimization was skipped and is not usually a source of concern. |
Dependent item | nomad.server.blocked_evals.total_escaped Preprocessing
|
HashiCorp Nomad Server: Evaluations waiting | Count of evals waiting to be enqueued. |
Dependent item | nomad.server.broker.total_waiting Preprocessing
|
HashiCorp Nomad Server: Evaluations blocked due to quota limit | Count of blocked evals due to quota limits (the resources for these jobs are not counted in other blocked_evals metrics, except for total_blocked). |
Dependent item | nomad.server.blocked_evals.total_quota_limit Preprocessing
|
HashiCorp Nomad Server: Evaluations enqueue time | Average time elapsed with evaluations waiting to be enqueued. |
Dependent item | nomad.server.broker.eval_waiting Preprocessing
|
HashiCorp Nomad Server: RPC evaluation acknowledgement time | Time elapsed for Eval.Ack RPC call. |
Dependent item | nomad.server.eval.ack Preprocessing
|
HashiCorp Nomad Server: RPC job summary time | Time elapsed for Job.Summary RPC call. |
Dependent item | nomad.server.job_summary.get_job_summary Preprocessing
|
HashiCorp Nomad Server: Heartbeats active | Number of active heartbeat timers. Each timer represents a Nomad client connection. |
Dependent item | nomad.server.heartbeat.active Preprocessing
|
HashiCorp Nomad Server: RPC requests, rate | Number of RPC requests being handled. |
Dependent item | nomad.server.rpc.request Preprocessing
|
HashiCorp Nomad Server: RPC error requests, rate | Number of RPC requests being handled that result in an error. |
Dependent item | nomad.server.rpc.request_error Preprocessing
|
HashiCorp Nomad Server: RPC queries, rate | Number of RPC queries. |
Dependent item | nomad.server.rpc.query Preprocessing
|
HashiCorp Nomad Server: RPC job allocations time | Time elapsed for Job.Allocations RPC call. |
Dependent item | nomad.server.job.allocations Preprocessing
|
HashiCorp Nomad Server: RPC job evaluations time | Time elapsed for Job.Evaluations RPC call. |
Dependent item | nomad.server.job.evaluations Preprocessing
|
HashiCorp Nomad Server: RPC get job time | Time elapsed for Job.GetJob RPC call. |
Dependent item | nomad.server.job.get_job Preprocessing
|
HashiCorp Nomad Server: Plan apply time | Time elapsed to apply a plan. |
Dependent item | nomad.server.plan.apply Preprocessing
|
HashiCorp Nomad Server: Plan evaluate time | Time elapsed to evaluate a plan. |
Dependent item | nomad.server.plan.evaluate Preprocessing
|
HashiCorp Nomad Server: RPC plan submit time | Time elapsed for Plan.Submit RPC call. |
Dependent item | nomad.server.plan.submit Preprocessing
|
HashiCorp Nomad Server: Plan raft index processing time | Time elapsed that planner waits for the raft index of the plan to be processed. |
Dependent item | nomad.server.plan.wait_for_index Preprocessing
|
HashiCorp Nomad Server: RPC list time | Time elapsed for Node.List RPC call. |
Dependent item | nomad.server.client.list Preprocessing
|
HashiCorp Nomad Server: RPC update allocations time | Time elapsed for Node.UpdateAlloc RPC call. |
Dependent item | nomad.server.client.update_alloc Preprocessing
|
HashiCorp Nomad Server: RPC update status time | Time elapsed for Node.UpdateStatus RPC call. |
Dependent item | nomad.server.client.update_status Preprocessing
|
HashiCorp Nomad Server: RPC get client allocs time | Time elapsed for Node.GetClientAllocs RPC call. |
Dependent item | nomad.server.client.get_client_allocs Preprocessing
|
HashiCorp Nomad Server: RPC eval dequeue time | Time elapsed for Eval.Dequeue RPC call. |
Dependent item | nomad.server.client.dequeue Preprocessing
|
HashiCorp Nomad Server: Vault token last renewal | Time since last successful Vault token renewal. |
Dependent item | nomad.server.vault.token_last_renewal Preprocessing
|
HashiCorp Nomad Server: Vault token next renewal | Time until next Vault token renewal attempt. |
Dependent item | nomad.server.vault.token_next_renewal Preprocessing
|
HashiCorp Nomad Server: Vault token TTL | Time to live for Vault token. |
Dependent item | nomad.server.vault.token_ttl Preprocessing
|
HashiCorp Nomad Server: Vault tokens revoked | Count of revoked tokens. |
Dependent item | nomad.server.vault.distributed_tokens_revoked Preprocessing
|
HashiCorp Nomad Server: Jobs dead | Number of dead jobs. |
Dependent item | nomad.server.job_status.dead Preprocessing
|
HashiCorp Nomad Server: Jobs pending | Number of pending jobs. |
Dependent item | nomad.server.job_status.pending Preprocessing
|
HashiCorp Nomad Server: Jobs running | Number of running jobs. |
Dependent item | nomad.server.job_status.running Preprocessing
|
HashiCorp Nomad Server: Job allocations completed | Number of complete allocations for a job. |
Dependent item | nomad.server.job_summary.complete Preprocessing
|
HashiCorp Nomad Server: Job allocations failed | Number of failed allocations for a job. |
Dependent item | nomad.server.job_summary.failed Preprocessing
|
HashiCorp Nomad Server: Job allocations lost | Number of lost allocations for a job. |
Dependent item | nomad.server.job_summary.lost Preprocessing
|
HashiCorp Nomad Server: Job allocations unknown | Number of unknown allocations for a job. |
Dependent item | nomad.server.job_summary.unknown Preprocessing
|
HashiCorp Nomad Server: Job allocations queued | Number of queued allocations for a job. |
Dependent item | nomad.server.job_summary.queued Preprocessing
|
HashiCorp Nomad Server: Job allocations running | Number of running allocations for a job. |
Dependent item | nomad.server.job_summary.running Preprocessing
|
HashiCorp Nomad Server: Job allocations starting | Number of starting allocations for a job. |
Dependent item | nomad.server.job_summary.starting Preprocessing
|
HashiCorp Nomad Server: Gossip time | Time elapsed to broadcast gossip messages. |
Dependent item | nomad.server.memberlist.gossip Preprocessing
|
HashiCorp Nomad Server: Leader barrier time | Time elapsed to establish a raft barrier during leader transition. |
Dependent item | nomad.server.leader.barrier Preprocessing
|
HashiCorp Nomad Server: Reconcile peer time | Time elapsed to reconcile a serf peer with state store. |
Dependent item | nomad.server.leader.reconcile_member Preprocessing
|
HashiCorp Nomad Server: Total reconcile time | Time elapsed to reconcile all serf peers with state store. |
Dependent item | nomad.server.leader.reconcile Preprocessing
|
HashiCorp Nomad Server: Leader last contact | Time since last contact to leader. General indicator of Raft latency. |
Dependent item | nomad.server.raft.leader.lastContact Preprocessing
|
HashiCorp Nomad Server: Plan queue | Count of evals in the plan queue. |
Dependent item | nomad.server.plan.queue_depth Preprocessing
|
HashiCorp Nomad Server: Worker evaluation create time | Time elapsed for worker to create an eval. |
Dependent item | nomad.server.worker.create_eval Preprocessing
|
HashiCorp Nomad Server: Worker evaluation dequeue time | Time elapsed for worker to dequeue an eval. |
Dependent item | nomad.server.worker.dequeue_eval Preprocessing
|
HashiCorp Nomad Server: Worker invoke scheduler time | Time elapsed for worker to invoke the scheduler. |
Dependent item | nomad.server.worker.invoke_scheduler_service Preprocessing
|
HashiCorp Nomad Server: Worker acknowledgement send time | Time elapsed for worker to send acknowledgement. |
Dependent item | nomad.server.worker.send_ack Preprocessing
|
HashiCorp Nomad Server: Worker submit plan time | Time elapsed for worker to submit plan. |
Dependent item | nomad.server.worker.submit_plan Preprocessing
|
HashiCorp Nomad Server: Worker update evaluation time | Time elapsed for worker to submit updated eval. |
Dependent item | nomad.server.worker.update_eval Preprocessing
|
HashiCorp Nomad Server: Worker log replication time | Time elapsed that worker waits for the raft index of the eval to be processed. |
Dependent item | nomad.server.worker.wait_for_index Preprocessing
|
HashiCorp Nomad Server: Raft calls blocked, rate | Count of blocking raft API calls. |
Dependent item | nomad.server.raft.barrier Preprocessing
|
HashiCorp Nomad Server: Raft commit logs enqueued | Count of logs enqueued. |
Dependent item | nomad.server.raft.commit_num_logs Preprocessing
|
HashiCorp Nomad Server: Raft transactions, rate | Number of Raft transactions. |
Dependent item | nomad.server.raft.apply Preprocessing
|
HashiCorp Nomad Server: Raft commit time | Time elapsed to commit writes. |
Dependent item | nomad.server.raft.commit_time Preprocessing
|
HashiCorp Nomad Server: Raft transaction commit time | Raft transaction commit time. |
Dependent item | nomad.server.raft.replication.appendEntries Preprocessing
|
HashiCorp Nomad Server: FSM apply time | Time elapsed to apply write to FSM. |
Dependent item | nomad.server.raft.fsm.apply Preprocessing
|
HashiCorp Nomad Server: FSM enqueue time | Time elapsed to enqueue write to FSM. |
Dependent item | nomad.server.raft.fsm.enqueue Preprocessing
|
HashiCorp Nomad Server: FSM autopilot time | Time elapsed to apply Autopilot raft entry. |
Dependent item | nomad.server.raft.fsm.autopilot Preprocessing
|
HashiCorp Nomad Server: FSM register node time | Time elapsed to apply RegisterNode raft entry. |
Dependent item | nomad.server.raft.fsm.register_node Preprocessing
|
HashiCorp Nomad Server: FSM index | Current index applied to FSM. |
Dependent item | nomad.server.raft.applied_index Preprocessing
|
HashiCorp Nomad Server: Raft last index | Most recent index seen. |
Dependent item | nomad.server.raft.last_index Preprocessing
|
HashiCorp Nomad Server: Dispatch log time | Time elapsed to write log, mark in flight, and start replication. |
Dependent item | nomad.server.raft.leader.dispatch_log Preprocessing
|
HashiCorp Nomad Server: Logs dispatched | Count of logs dispatched. |
Dependent item | nomad.server.raft.leader.dispatch_num_logs Preprocessing
|
HashiCorp Nomad Server: Heartbeat fails | Count of failing to heartbeat and starting election. |
Dependent item | nomad.server.raft.transition.heartbeat_timeout Preprocessing
|
HashiCorp Nomad Server: Objects freed, rate | Count of objects freed from heap by go runtime GC. |
Dependent item | nomad.server.runtime.free_count Preprocessing
|
HashiCorp Nomad Server: GC pause time | Go runtime GC pause times. |
Dependent item | nomad.server.runtime.gc_pause_ns Preprocessing
|
HashiCorp Nomad Server: GC metadata size | Go runtime GC metadata size in bytes. |
Dependent item | nomad.server.runtime.sys_bytes Preprocessing
|
HashiCorp Nomad Server: GC runs | Count of go runtime GC runs. |
Dependent item | nomad.server.runtime.total_gc_runs Preprocessing
|
HashiCorp Nomad Server: Memberlist events | Count of memberlist events received. |
Dependent item | nomad.server.serf.queue.event Preprocessing
|
HashiCorp Nomad Server: Memberlist changes | Count of memberlist changes. |
Dependent item | nomad.server.serf.queue.intent Preprocessing
|
HashiCorp Nomad Server: Memberlist queries | Count of memberlist queries. |
Dependent item | nomad.server.serf.queue.queries Preprocessing
|
HashiCorp Nomad Server: Snapshot index | Current snapshot index. |
Dependent item | nomad.server.state.snapshot.index Preprocessing
|
HashiCorp Nomad Server: Services ready to schedule | Count of service evals ready to be scheduled. |
Dependent item | nomad.server.broker.service_ready Preprocessing
|
HashiCorp Nomad Server: Services unacknowledged | Count of unacknowledged service evals. |
Dependent item | nomad.server.broker.service_unacked Preprocessing
|
HashiCorp Nomad Server: System evaluations ready to schedule | Count of service evals ready to be scheduled. |
Dependent item | nomad.server.broker.system_ready Preprocessing
|
HashiCorp Nomad Server: System evaluations unacknowledged | Count of unacknowledged system evals. |
Dependent item | nomad.server.broker.system_unacked Preprocessing
|
HashiCorp Nomad Server: BoltDB free pages | Number of BoltDB free pages. |
Dependent item | nomad.server.raft.boltdb.num_free_pages Preprocessing
|
HashiCorp Nomad Server: BoltDB pending pages | Number of BoltDB pending pages. |
Dependent item | nomad.server.raft.boltdb.num_pending_pages Preprocessing
|
HashiCorp Nomad Server: BoltDB free page bytes | Number of free page bytes. |
Dependent item | nomad.server.raft.boltdb.free_page_bytes Preprocessing
|
HashiCorp Nomad Server: BoltDB freelist bytes | Number of freelist bytes. |
Dependent item | nomad.server.raft.boltdb.freelist_bytes Preprocessing
|
HashiCorp Nomad Server: BoltDB read transactions, rate | Count of total read transactions. |
Dependent item | nomad.server.raft.boltdb.total_read_txn Preprocessing
|
HashiCorp Nomad Server: BoltDB open read transactions | Number of current open read transactions. |
Dependent item | nomad.server.raft.boltdb.open_read_txn Preprocessing
|
HashiCorp Nomad Server: BoltDB pages in use | Number of pages in use. |
Dependent item | nomad.server.raft.boltdb.txstats.page_count Preprocessing
|
HashiCorp Nomad Server: BoltDB page allocations, rate | Number of page allocations. |
Dependent item | nomad.server.raft.boltdb.txstats.page_alloc Preprocessing
|
HashiCorp Nomad Server: BoltDB cursors | Count of total database cursors. |
Dependent item | nomad.server.raft.boltdb.txstats.cursor_count Preprocessing
|
HashiCorp Nomad Server: BoltDB nodes, rate | Count of total database nodes. |
Dependent item | nomad.server.raft.boltdb.txstats.node_count Preprocessing
|
HashiCorp Nomad Server: BoltDB node dereferences, rate | Count of total database node dereferences. |
Dependent item | nomad.server.raft.boltdb.txstats.node_deref Preprocessing
|
HashiCorp Nomad Server: BoltDB rebalance operations, rate | Count of total rebalance operations. |
Dependent item | nomad.server.raft.boltdb.txstats.rebalance Preprocessing
|
HashiCorp Nomad Server: BoltDB split operations, rate | Count of total split operations. |
Dependent item | nomad.server.raft.boltdb.txstats.split Preprocessing
|
HashiCorp Nomad Server: BoltDB spill operations, rate | Count of total spill operations. |
Dependent item | nomad.server.raft.boltdb.txstats.spill Preprocessing
|
HashiCorp Nomad Server: BoltDB write operations, rate | Count of total write operations. |
Dependent item | nomad.server.raft.boltdb.txstats.write Preprocessing
|
HashiCorp Nomad Server: BoltDB rebalance time | Sample of rebalance operation times. |
Dependent item | nomad.server.raft.boltdb.txstats.rebalance_time Preprocessing
|
HashiCorp Nomad Server: BoltDB spill time | Sample of spill operation times. |
Dependent item | nomad.server.raft.boltdb.txstats.spill_time Preprocessing
|
HashiCorp Nomad Server: BoltDB write time | Sample of write operation times. |
Dependent item | nomad.server.raft.boltdb.txstats.write_time Preprocessing
|
HashiCorp Nomad Server: Service [rpc] state | Current [rpc] service state. |
Simple check | net.tcp.service[tcp,,{$NOMAD.SERVER.RPC.PORT}] Preprocessing
|
HashiCorp Nomad Server: Service [serf] state | Current [serf] service state. |
Simple check | net.tcp.service[tcp,,{$NOMAD.SERVER.SERF.PORT}] Preprocessing
|
HashiCorp Nomad Server: Namespace list time | Time elapsed for Namespace.ListNamespaces. |
Dependent item | nomad.server.namespace.list_namespace Preprocessing
|
HashiCorp Nomad Server: Autopilot state | Current autopilot state. |
Dependent item | nomad.server.autopilot.state Preprocessing
|
HashiCorp Nomad Server: Autopilot failure tolerance | The number of redundant healthy servers that can fail without causing an outage. |
Dependent item | nomad.server.autopilot.failure_tolerance Preprocessing
|
HashiCorp Nomad Server: FSM allocation client update time | Time elapsed to apply AllocClientUpdate raft entry. |
Dependent item | nomad.server.alloc_client_update Preprocessing
|
HashiCorp Nomad Server: FSM apply plan results time | Time elapsed to apply ApplyPlanResults raft entry. |
Dependent item | nomad.server.fsm.apply_plan_results Preprocessing
|
HashiCorp Nomad Server: FSM update evaluation time | Time elapsed to apply UpdateEval raft entry. |
Dependent item | nomad.server.fsm.update_eval Preprocessing
|
HashiCorp Nomad Server: FSM job registration time | Time elapsed to apply RegisterJob raft entry. |
Dependent item | nomad.server.fsm.register_job Preprocessing
|
HashiCorp Nomad Server: Allocation reschedule attempts | Count of attempts to reschedule an allocation. |
Dependent item | nomad.server.scheduler.allocs.rescheduled.attempted Preprocessing
|
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
HashiCorp Nomad Server: Monitoring API connection has failed | Monitoring API connection has failed. |
find(/HashiCorp Nomad Server by HTTP/nomad.server.data.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 |
Average | Manual close: Yes |
HashiCorp Nomad Server: Internal stats API connection has failed | Internal stats API connection has failed. |
find(/HashiCorp Nomad Server by HTTP/nomad.server.stats.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 |
Average | Manual close: Yes Depends on:
|
HashiCorp Nomad Server: Nomad server version has changed | Nomad server version has changed. |
change(/HashiCorp Nomad Server by HTTP/nomad.server.version)<>0 |
Info | Manual close: Yes |
HashiCorp Nomad Server: Cluster role has changed | Cluster role has changed. |
change(/HashiCorp Nomad Server by HTTP/nomad.server.raft.cluster_role) <> 0 |
Info | Manual close: Yes |
HashiCorp Nomad Server: Current number of open files is too high | Heavy file descriptor usage (i.e., near the process file descriptor limit) indicates a potential file descriptor exhaustion issue. |
min(/HashiCorp Nomad Server by HTTP/nomad.server.process_open_fds,5m)/last(/HashiCorp Nomad Server by HTTP/nomad.server.process_max_fds)*100>{$NOMAD.OPEN.FDS.MAX} |
Warning | |
HashiCorp Nomad Server: Dead jobs found | Jobs with the |
last(/HashiCorp Nomad Server by HTTP/nomad.server.job_status.dead) > 0 and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.job_status.dead,5m) = 0 |
Warning | Manual close: Yes |
HashiCorp Nomad Server: Leader last contact timeout exceeded | The nomad.raft.leader.lastContact metric is a general indicator of Raft latency which can be used to observe how Raft timing is performing and guide infrastructure provisioning. |
min(/HashiCorp Nomad Server by HTTP/nomad.server.raft.leader.lastContact,5m) >= {$NOMAD.SERVER.LEADER.LATENCY} and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.raft.leader.lastContact,5m) = 0 |
Warning | |
HashiCorp Nomad Server: Service [rpc] is down | Cannot establish the connection to [rpc] service port {$NOMAD.SERVER.RPC.PORT}. |
last(/HashiCorp Nomad Server by HTTP/net.tcp.service[tcp,,{$NOMAD.SERVER.RPC.PORT}]) = 0 |
Average | Manual close: Yes |
HashiCorp Nomad Server: Service [serf] is down | Cannot establish the connection to [serf] service port {$NOMAD.SERVER.SERF.PORT}. |
last(/HashiCorp Nomad Server by HTTP/net.tcp.service[tcp,,{$NOMAD.SERVER.SERF.PORT}]) = 0 |
Average | Manual close: Yes |
HashiCorp Nomad Server: Autopilot is unhealthy | The autopilot is in unhealthy state. The successful failover probability is extremely low. |
last(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.state) = 0 and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.state,5m) = 0 |
Average | Manual close: Yes |
HashiCorp Nomad Server: Autopilot redundancy is low | The autopilot redundancy is low. |
last(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.failure_tolerance) < {$NOMAD.REDUNDANCY.MIN} and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.failure_tolerance,5m) = 0 |
Warning | Manual close: Yes |
Feedback
Please report any issues with the template at https://support.zabbix.com
You can also provide feedback, discuss the template, or ask for help at ZABBIX forums