HashiCorp Nomad by HTTP

Overview

This template is designed to monitor HashiCorp Nomad by Zabbix. It works without any external scripts. Currently, the template supports discovery of Nomad servers and clients.

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

  • HashiCorp Nomad version 1.5.6/1.6.0

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

  1. Create a synthetic Nomad host. It should be one of the Nomad cluster members, a load-balancing service (if a cluster is used), or a single node in the selected Nomad region.
  2. Define the {$NOMAD.ENDPOINT.API.URL} macro value with the correct web protocol, host, and port.
  3. Prepare an ACL token with the node:read, namespace:read-job, agent:read, and management permissions applied. Define the {$NOMAD.TOKEN} macro value.

Refer to the vendor documentation about Nomad native ACL or Nomad Vault-generated tokens if you have the HashiCorp Vault integration configured.

Additional information:

  • The synthetic Nomad host is used only as an endpoint for servers and clients discovery (general cluster information). To prevent duplicate entities, it is not monitored as a Nomad server or client itself.
  • If you're not using ACL, skip the 3rd setup step.
  • The Nomad servers/clients discovery is limited by region. If you're using a multi-region cluster, create one synthetic host per region.
  • The Nomad server/client templates can also be used on their own. Feel free to use them if you prefer to create hosts manually.
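
Before linking the template, it can help to confirm that the endpoint URL and token are accepted by Nomad. The following is a minimal sketch, not part of the template: it builds (without sending) a request of the kind the discovery item is expected to make, assuming the Nomad /v1/nodes endpoint and the X-Nomad-Token header. The function name and the sample token are hypothetical.

```python
import urllib.request


def build_nodes_request(api_url: str, token: str = "") -> urllib.request.Request:
    """Build (but do not send) a GET request for the Nomad node list.

    api_url corresponds to {$NOMAD.ENDPOINT.API.URL}, token to {$NOMAD.TOKEN}.
    The /v1/nodes path is an assumption about which endpoint to probe.
    """
    request = urllib.request.Request(api_url.rstrip("/") + "/v1/nodes", method="GET")
    if token:
        # Nomad reads the ACL token from the X-Nomad-Token header.
        request.add_header("X-Nomad-Token", token)
    return request


request = build_nodes_request("http://localhost:4646", token="example-token")
print(request.full_url)
```

To actually send it, pass the request to urllib.request.urlopen(request, timeout=15); the 15-second timeout mirrors the {$NOMAD.DATA.TIMEOUT} default. If ACL is not used, omit the token and no header is added.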

Macros used

Name Description Default
{$NOMAD.ENDPOINT.API.URL}

API endpoint URL for one of the Nomad cluster members.

http://localhost:4646
{$NOMAD.TOKEN}

Nomad authentication token.

<PUT YOUR AUTH TOKEN>
{$NOMAD.DATA.TIMEOUT}

Response timeout for an API.

15s
{$NOMAD.HTTP.PROXY}

Sets the HTTP proxy for script and HTTP agent items. If this parameter is empty, then no proxy is used.

{$NOMAD.API.RESPONSE.SUCCESS}

HTTP response code that indicates a successful API call. Used as the threshold for the availability triggers. Change if needed.

200
{$NOMAD.SERVER.NAME.MATCHES}

The filter to include HashiCorp Nomad servers by name.

.*
{$NOMAD.SERVER.NAME.NOT_MATCHES}

The filter to exclude HashiCorp Nomad servers by name.

CHANGE_IF_NEEDED
{$NOMAD.SERVER.DC.MATCHES}

The filter to include HashiCorp Nomad servers by datacenter.

.*
{$NOMAD.SERVER.DC.NOT_MATCHES}

The filter to exclude HashiCorp Nomad servers by datacenter.

CHANGE_IF_NEEDED
{$NOMAD.CLIENT.NAME.MATCHES}

The filter to include HashiCorp Nomad clients by name.

.*
{$NOMAD.CLIENT.NAME.NOT_MATCHES}

The filter to exclude HashiCorp Nomad clients by name.

CHANGE_IF_NEEDED
{$NOMAD.CLIENT.DC.MATCHES}

The filter to include HashiCorp Nomad clients by datacenter.

.*
{$NOMAD.CLIENT.DC.NOT_MATCHES}

The filter to exclude HashiCorp Nomad clients by datacenter.

CHANGE_IF_NEEDED
{$NOMAD.CLIENT.SCHEDULE.ELIGIBILITY.MATCHES}

The filter to include HashiCorp Nomad clients by scheduling eligibility.

.*
{$NOMAD.CLIENT.SCHEDULE.ELIGIBILITY.NOT_MATCHES}

The filter to exclude HashiCorp Nomad clients by scheduling eligibility.

CHANGE_IF_NEEDED
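
The *.MATCHES and *.NOT_MATCHES macros work as include/exclude pairs: an entity is discovered only if it matches the first pattern and does not match the second. The sketch below illustrates that logic under the assumption that both are applied as partial regular-expression matches; Python's re module stands in for the regex engine Zabbix uses, and the function name and sample server names are hypothetical.

```python
import re


def passes_filter(value: str, matches: str = ".*",
                  not_matches: str = "CHANGE_IF_NEEDED") -> bool:
    """Return True if value should be discovered.

    matches / not_matches mirror a *.MATCHES / *.NOT_MATCHES macro pair,
    with the template defaults as default arguments.
    """
    included = re.search(matches, value) is not None
    excluded = re.search(not_matches, value) is not None
    return included and not excluded


servers = ["nomad-server-01", "nomad-server-02", "nomad-test-01"]
# Exclude anything with "test" in its name; keep the default include pattern.
kept = [name for name in servers if passes_filter(name, not_matches="test")]
print(kept)
```

With the defaults, everything is included, because CHANGE_IF_NEEDED is a placeholder that is unlikely to match a real name.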

Items

Name Description Type Key and additional info
HashiCorp Nomad: Nomad clients get

Nomad clients data in raw format.

HTTP agent nomad.client.nodes.get

Preprocessing

  • Check for not supported value

    Custom on fail: Set value to: {"header":{"HTTP/1.1 408 Request timeout":""}}

HashiCorp Nomad: Client nodes API response

Client nodes API response message.

Dependent item nomad.client.nodes.api.response

Preprocessing

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

HashiCorp Nomad: Nomad servers get

Nomad servers data in raw format.

Script nomad.server.nodes.get
HashiCorp Nomad: Server-related APIs response

Server-related (operator/raft/configuration, agent/members) APIs error response message.

Dependent item nomad.server.api.response

Preprocessing

  • JSON Path: $.error

    Custom on fail: Set value to: HTTP/1.1 200 OK

  • Discard unchanged with heartbeat: 1h

HashiCorp Nomad: Region

Current cluster region.

Dependent item nomad.region

Preprocessing

  • JSON Path: $..region.first()

HashiCorp Nomad: Nomad servers count

Nomad servers count.

Dependent item nomad.servers.count

Preprocessing

  • JSON Path: $[?(@.Name)].length()

HashiCorp Nomad: Nomad clients count

Nomad clients count.

Dependent item nomad.clients.count

Preprocessing

  • JSON Path: $.body[?(@.Name)].length()
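
As a rough illustration of what the count expressions above compute, the sketch below counts the entries that carry a Name key, which is how the [?(@.Name)] filter is read here (an existence check). The response shape and all sample values are made up for the example.

```python
import json

# Hypothetical, trimmed response: the HTTP agent output is assumed to be
# wrapped as {"header": {...}, "body": [...]}.
raw = json.dumps({
    "header": {"HTTP/1.1 200 OK": ""},
    "body": [
        {"Name": "client-1", "Datacenter": "dc1"},
        {"Name": "client-2", "Datacenter": "dc1"},
        {"Datacenter": "dc2"},  # no Name key, so the filter skips it
    ],
})


def count_named(entries: list) -> int:
    """Count entries that have a Name key, like $.body[?(@.Name)].length()."""
    return sum(1 for entry in entries if isinstance(entry, dict) and "Name" in entry)


clients_count = count_named(json.loads(raw)["body"])
print(clients_count)
```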

Triggers

Name Description Expression Severity Dependencies and additional info
HashiCorp Nomad: Client nodes API connection has failed

Client nodes API connection has failed.
Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.

find(/HashiCorp Nomad by HTTP/nomad.client.nodes.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 Average Manual close: Yes
HashiCorp Nomad: Server-related API connection has failed

Server-related API connection has failed.
Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.

find(/HashiCorp Nomad by HTTP/nomad.server.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 Average Manual close: Yes
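
Both trigger expressions use find() with the "like" operator, which is a substring test, and compare the result with 0. A minimal sketch of that check, with a hypothetical function name:

```python
def api_connection_failed(response_message: str, success_code: str = "200") -> bool:
    """Mimic find(...,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0.

    The trigger fires when the latest response message does not
    contain the success code anywhere in its text.
    """
    return success_code not in response_message


print(api_connection_failed("HTTP/1.1 200 OK"))
print(api_connection_failed("HTTP/1.1 408 Request timeout"))
```

Because the match is a plain substring test, the 408 placeholder that the raw item sets on a failed request is enough to fire the trigger.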

LLD rule Clients discovery

Name Description Type Key and additional info
Clients discovery

Client nodes discovery.

Dependent item nomad.clients.discovery

Preprocessing

  • JSON Path: $.body

    Custom on fail: Discard value

  • Discard unchanged with heartbeat: 1h

LLD rule Servers discovery

Name Description Type Key and additional info
Servers discovery

Server nodes discovery.

Dependent item nomad.servers.discovery

Preprocessing

  • Check for error in JSON: $.error

    Custom on fail: Discard value

  • Discard unchanged with heartbeat: 1h

HashiCorp Nomad Client by HTTP

Overview

This template is designed to monitor HashiCorp Nomad clients by Zabbix. It works without any external scripts.

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

  • HashiCorp Nomad version 1.5.6/1.6.0

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

  1. Enable telemetry in the HashiCorp Nomad agent configuration file. Set the Prometheus metrics format.

     Refer to the vendor documentation.

  2. Prepare an ACL token with the node:read and namespace:read-job permissions applied. Define the {$NOMAD.TOKEN} macro value.

     Refer to the vendor documentation about Nomad native ACL or Nomad Vault-generated tokens if you're using the integration with HashiCorp Vault.

  3. Set the values for the {$NOMAD.CLIENT.API.SCHEME} and {$NOMAD.CLIENT.API.PORT} macros to define the common Nomad API web scheme and connection port.

Additional information:

  • You have to prepare an additional ACL token only if you wish to monitor Nomad clients as separate entities. If you're using clients discovery, the token will be inherited from the master host linked to the HashiCorp Nomad by HTTP template.

  • If you're not using ACL, skip the 2nd setup step.

  • The Nomad clients use the default web scheme (HTTP) and the default API port (4646). If you're using clients discovery and need to redefine the macros for a particular host created from a prototype, use context macros such as {$NOMAD.CLIENT.API.SCHEME:NECESSARY.IP} and/or {$NOMAD.CLIENT.API.PORT:NECESSARY.IP} on the master host or template level.

  • Some metrics may not be collected depending on your HashiCorp Nomad agent version and configuration.
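
The context macros mentioned above can be pictured as a lookup with a fallback: a value defined for a specific context wins, otherwise the plain macro value applies. The sketch below models that and builds a telemetry URL from the result. It assumes the Nomad /v1/metrics?format=prometheus endpoint; the macro table, function names, and the 192.0.2.x addresses are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Macro table keyed by (name, context); context None is the plain macro.
MACROS: Dict[Tuple[str, Optional[str]], str] = {
    ("{$NOMAD.CLIENT.API.SCHEME}", None): "http",
    ("{$NOMAD.CLIENT.API.PORT}", None): "4646",
    # Override for one discovered host only.
    ("{$NOMAD.CLIENT.API.SCHEME}", "192.0.2.10"): "https",
}


def resolve(name: str, context: str) -> str:
    """Prefer the context-specific value, else fall back to the plain macro."""
    return MACROS.get((name, context), MACROS[(name, None)])


def metrics_url(host_ip: str) -> str:
    """Build the telemetry URL for one client from its resolved macros."""
    scheme = resolve("{$NOMAD.CLIENT.API.SCHEME}", host_ip)
    port = resolve("{$NOMAD.CLIENT.API.PORT}", host_ip)
    return f"{scheme}://{host_ip}:{port}/v1/metrics?format=prometheus"


print(metrics_url("192.0.2.10"))
print(metrics_url("192.0.2.11"))
```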

Macros used

Name Description Default
{$NOMAD.CLIENT.API.SCHEME}

Nomad client API scheme.

http
{$NOMAD.CLIENT.API.PORT}

Nomad client API port.

4646
{$NOMAD.TOKEN}

Nomad authentication token.

<PUT YOUR AUTH TOKEN>
{$NOMAD.DATA.TIMEOUT}

Response timeout for an API.

15s
{$NOMAD.HTTP.PROXY}

Sets the HTTP proxy for HTTP agent item. If this parameter is empty, then no proxy is used.

{$NOMAD.API.RESPONSE.SUCCESS}

HTTP response code that indicates a successful API call. Used as the threshold for the availability triggers. Change if needed.

200
{$NOMAD.CLIENT.RPC.PORT}

Nomad RPC service port.

4647
{$NOMAD.CLIENT.SERF.PORT}

Nomad serf service port.

4648
{$NOMAD.CLIENT.OPEN.FDS.MAX.WARN}

Maximum percentage of used file descriptors.

90
{$NOMAD.DISK.NAME.MATCHES}

The filter to include HashiCorp Nomad client disks by name.

.*
{$NOMAD.DISK.NAME.NOT_MATCHES}

The filter to exclude HashiCorp Nomad client disks by name.

CHANGE_IF_NEEDED
{$NOMAD.JOB.NAME.MATCHES}

The filter to include HashiCorp Nomad client jobs by name.

.*
{$NOMAD.JOB.NAME.NOT_MATCHES}

The filter to exclude HashiCorp Nomad client jobs by name.

CHANGE_IF_NEEDED
{$NOMAD.JOB.NAMESPACE.MATCHES}

The filter to include HashiCorp Nomad client jobs by namespace.

.*
{$NOMAD.JOB.NAMESPACE.NOT_MATCHES}

The filter to exclude HashiCorp Nomad client jobs by namespace.

CHANGE_IF_NEEDED
{$NOMAD.JOB.TYPE.MATCHES}

The filter to include HashiCorp Nomad client jobs by type.

.*
{$NOMAD.JOB.TYPE.NOT_MATCHES}

The filter to exclude HashiCorp Nomad client jobs by type.

CHANGE_IF_NEEDED
{$NOMAD.JOB.TASK.GROUP.MATCHES}

The filter to include HashiCorp Nomad client jobs by task group.

.*
{$NOMAD.JOB.TASK.GROUP.NOT_MATCHES}

The filter to exclude HashiCorp Nomad client jobs by task group.

CHANGE_IF_NEEDED
{$NOMAD.DRIVER.NAME.MATCHES}

The filter to include HashiCorp Nomad client drivers by name.

.*
{$NOMAD.DRIVER.NAME.NOT_MATCHES}

The filter to exclude HashiCorp Nomad client drivers by name.

CHANGE_IF_NEEDED
{$NOMAD.DRIVER.DETECT.MATCHES}

The filter to include HashiCorp Nomad client drivers by detection state. Possible filtering values: true, false.

.*
{$NOMAD.DRIVER.DETECT.NOT_MATCHES}

The filter to exclude HashiCorp Nomad client drivers by detection state. Possible filtering values: true, false.

CHANGE_IF_NEEDED
{$NOMAD.CPU.UTIL.MIN}

CPU utilization threshold. Measured as a percentage.

90
{$NOMAD.RAM.AVAIL.MIN}

Available memory threshold. Measured as a percentage.

5
{$NOMAD.INODES.FREE.MIN.WARN}

Warning threshold of the filesystem metadata utilization. Measured as a percentage.

20
{$NOMAD.INODES.FREE.MIN.CRIT}

Critical threshold of the filesystem metadata utilization. Measured as a percentage.

10

Items

Name Description Type Key and additional info
HashiCorp Nomad Client: Telemetry get

Telemetry data in raw format.

HTTP agent nomad.client.data.get

Preprocessing

  • Check for not supported value

    Custom on fail: Set value to: {"header":{"HTTP/1.1 408 Request timeout":""}}

HashiCorp Nomad Client: Metrics

Nomad client metrics in raw format.

Dependent item nomad.client.metrics.get

Preprocessing

  • JSON Path: $.body

    Custom on fail: Discard value

HashiCorp Nomad Client: Monitoring API response

Monitoring API response message.

Dependent item nomad.client.data.api.response

Preprocessing

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

HashiCorp Nomad Client: Service [rpc] state

Current [rpc] service state.

Simple check net.tcp.service[tcp,,{$NOMAD.CLIENT.RPC.PORT}]

Preprocessing

  • Discard unchanged with heartbeat: 1h

HashiCorp Nomad Client: Service [serf] state

Current [serf] service state.

Simple check net.tcp.service[tcp,,{$NOMAD.CLIENT.SERF.PORT}]

Preprocessing

  • Discard unchanged with heartbeat: 1h

HashiCorp Nomad Client: CPU allocated

Total amount of CPU shares the scheduler has allocated to tasks.

Dependent item nomad.client.allocated.cpu

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_allocated_cpu)

    Custom on fail: Discard value

HashiCorp Nomad Client: CPU unallocated

Total amount of CPU shares free for the scheduler to allocate to tasks.

Dependent item nomad.client.unallocated.cpu

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_unallocated_cpu)

    Custom on fail: Discard value

HashiCorp Nomad Client: Memory allocated

Total amount of memory the scheduler has allocated to tasks.

Dependent item nomad.client.allocated.memory

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_allocated_memory)

    Custom on fail: Discard value

  • Custom multiplier: 1.0E+6

HashiCorp Nomad Client: Memory unallocated

Total amount of memory free for the scheduler to allocate to tasks.

Dependent item nomad.client.unallocated.memory

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_unallocated_memory)

    Custom on fail: Discard value

  • Custom multiplier: 1.0E+6

HashiCorp Nomad Client: Disk allocated

Total amount of disk space the scheduler has allocated to tasks.

Dependent item nomad.client.allocated.disk

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_allocated_disk)

    Custom on fail: Discard value

  • Custom multiplier: 1.0E+6

HashiCorp Nomad Client: Disk unallocated

Total amount of disk space free for the scheduler to allocate to tasks.

Dependent item nomad.client.unallocated.disk

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_unallocated_disk)

    Custom on fail: Discard value

  • Custom multiplier: 1.0E+6

HashiCorp Nomad Client: Allocations blocked

Number of allocations waiting for previous versions.

Dependent item nomad.client.allocations.blocked

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_allocations_blocked)

    Custom on fail: Set value to: 0

HashiCorp Nomad Client: Allocations migrating

Number of allocations migrating data from previous versions.

Dependent item nomad.client.allocations.migrating

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_allocations_migrating)

    Custom on fail: Set value to: 0

HashiCorp Nomad Client: Allocations pending

Number of allocations pending (received by the client but not yet running).

Dependent item nomad.client.allocations.pending

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_allocations_pending)

    Custom on fail: Set value to: 0

HashiCorp Nomad Client: Allocations starting

Number of allocations starting.

Dependent item nomad.client.allocations.start

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_allocations_start)

    Custom on fail: Set value to: 0

HashiCorp Nomad Client: Allocations running

Number of allocations running.

Dependent item nomad.client.allocations.running

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_allocations_running)

    Custom on fail: Set value to: 0

HashiCorp Nomad Client: Allocations terminal

Number of allocations in a terminal state.

Dependent item nomad.client.allocations.terminal

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_allocations_terminal)

    Custom on fail: Set value to: 0

HashiCorp Nomad Client: Allocations failed, rate

Number of allocations failed.

Dependent item nomad.client.allocations.failed

Preprocessing

  • Prometheus pattern: SUM(nomad_client_allocs_failed)

    Custom on fail: Set value to: 0

  • Change per second
  • Discard unchanged with heartbeat: 1h

HashiCorp Nomad Client: Allocations completed, rate

Number of allocations completed.

Dependent item nomad.client.allocations.complete

Preprocessing

  • Prometheus pattern: SUM(nomad_client_allocs_complete)

    Custom on fail: Set value to: 0

  • Change per second
  • Discard unchanged with heartbeat: 1h

HashiCorp Nomad Client: Allocations restarted, rate

Number of allocations restarted.

Dependent item nomad.client.allocations.restart

Preprocessing

  • Prometheus pattern: SUM(nomad_client_allocs_restart)

    Custom on fail: Set value to: 0

  • Change per second
  • Discard unchanged with heartbeat: 1h

HashiCorp Nomad Client: Allocations OOM killed

Number of allocations OOM killed.

Dependent item nomad.client.allocations.oom_killed

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_allocs_oom_killed)

    Custom on fail: Set value to: 0

  • Discard unchanged with heartbeat: 1h

HashiCorp Nomad Client: CPU idle utilization

CPU utilization in idle state.

Dependent item nomad.client.cpu.idle

Preprocessing

  • Prometheus pattern: AVG(nomad_client_host_cpu_idle)

    Custom on fail: Discard value

HashiCorp Nomad Client: CPU system utilization

CPU utilization in system space.

Dependent item nomad.client.cpu.system

Preprocessing

  • Prometheus pattern: AVG(nomad_client_host_cpu_system)

    Custom on fail: Discard value

HashiCorp Nomad Client: CPU total utilization

Total CPU utilization.

Dependent item nomad.client.cpu.total

Preprocessing

  • Prometheus pattern: AVG(nomad_client_host_cpu_total)

    Custom on fail: Discard value

HashiCorp Nomad Client: CPU user utilization

CPU utilization in user space.

Dependent item nomad.client.cpu.user

Preprocessing

  • Prometheus pattern: AVG(nomad_client_host_cpu_user)

    Custom on fail: Discard value

HashiCorp Nomad Client: Memory available

Total amount of memory available to processes which includes free and cached memory.

Dependent item nomad.client.memory.available

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_host_memory_available)

    Custom on fail: Discard value

HashiCorp Nomad Client: Memory free

Amount of memory which is free.

Dependent item nomad.client.memory.free

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_host_memory_free)

HashiCorp Nomad Client: Memory size

Total amount of physical memory on the node.

Dependent item nomad.client.memory.total

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_host_memory_total)

HashiCorp Nomad Client: Memory used

Amount of memory used by processes.

Dependent item nomad.client.memory.used

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_host_memory_used)

HashiCorp Nomad Client: Uptime

Uptime of the host running the Nomad client.

Dependent item nomad.client.uptime

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_uptime)

HashiCorp Nomad Client: Node info get

Node info data in raw format.

HTTP agent nomad.client.node.info.get

Preprocessing

  • Check for not supported value

    Custom on fail: Set value to: {"header":{"HTTP/1.1 408 Request timeout":""}}

HashiCorp Nomad Client: Nomad client version

Nomad client version.

Dependent item nomad.client.version

Preprocessing

  • JSON Path: $.body..Version.first()

HashiCorp Nomad Client: Nodes API response

Nodes API response message.

Dependent item nomad.client.node.info.api.response

Preprocessing

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

HashiCorp Nomad Client: Allocated jobs get

Allocated jobs data in raw format.

HTTP agent nomad.client.job.allocs.get

Preprocessing

  • Check for not supported value

    Custom on fail: Set value to: {"header":{"HTTP/1.1 408 Request timeout":""}}

HashiCorp Nomad Client: Allocations API response

Allocations API response message.

Dependent item nomad.client.job.allocs.api.response

Preprocessing

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h
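
Most of the dependent items above extract one number from the Prometheus-format telemetry, using VALUE (exactly one matching sample expected) or an aggregate such as AVG, sometimes followed by a custom multiplier. The sketch below mimics those steps on a made-up excerpt. It is a simplified parser for illustration only (no timestamps, no escaped label values), and the helper names are hypothetical.

```python
from typing import List

# Hypothetical excerpt of Prometheus-format telemetry; values are made up.
SAMPLE = """\
# TYPE nomad_client_allocated_memory gauge
nomad_client_allocated_memory{node_id="a1"} 2048
nomad_client_host_cpu_total{cpu="cpu0"} 30
nomad_client_host_cpu_total{cpu="cpu1"} 50
"""


def samples(text: str, metric: str) -> List[float]:
    """Return every sample value whose metric name equals `metric`."""
    values = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, raw_value = line.rpartition(" ")
        if name_and_labels.split("{", 1)[0] == metric:
            values.append(float(raw_value))
    return values


def prom_value(text: str, metric: str) -> float:
    """Like VALUE(metric): fail unless exactly one sample matches."""
    found = samples(text, metric)
    if len(found) != 1:
        raise ValueError(f"expected one sample for {metric}, got {len(found)}")
    return found[0]


def prom_avg(text: str, metric: str) -> float:
    """Like AVG(metric): mean over all matching samples."""
    found = samples(text, metric)
    if not found:
        raise ValueError(f"no samples for {metric}")
    return sum(found) / len(found)


# VALUE followed by the 1.0E+6 custom multiplier used by the memory items.
allocated_memory = prom_value(SAMPLE, "nomad_client_allocated_memory") * 1.0e6
cpu_total = prom_avg(SAMPLE, "nomad_client_host_cpu_total")
print(allocated_memory, cpu_total)
```

A raised error here corresponds to a failed preprocessing step, which the items handle with "Custom on fail" by either discarding the value or setting it to 0.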

Triggers

Name Description Expression Severity Dependencies and additional info
HashiCorp Nomad Client: Monitoring API connection has failed

Monitoring API connection has failed.
Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.

find(/HashiCorp Nomad Client by HTTP/nomad.client.data.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 Average Manual close: Yes
HashiCorp Nomad Client: Service [rpc] is down

Cannot establish the connection to [rpc] service port {$NOMAD.CLIENT.RPC.PORT}.
Check the Nomad state and network connectivity between Nomad and Zabbix.

last(/HashiCorp Nomad Client by HTTP/net.tcp.service[tcp,,{$NOMAD.CLIENT.RPC.PORT}]) = 0 Average Manual close: Yes
HashiCorp Nomad Client: Service [serf] is down

Cannot establish the connection to [serf] service port {$NOMAD.CLIENT.SERF.PORT}.
Check the Nomad state and network connectivity between Nomad and Zabbix.

last(/HashiCorp Nomad Client by HTTP/net.tcp.service[tcp,,{$NOMAD.CLIENT.SERF.PORT}]) = 0 Average Manual close: Yes
HashiCorp Nomad Client: OOM killed allocations found

OOM killed allocations found.

last(/HashiCorp Nomad Client by HTTP/nomad.client.allocations.oom_killed) > 0 Warning Manual close: Yes
HashiCorp Nomad Client: High CPU utilization

CPU utilization is too high. The system might be slow to respond.

min(/HashiCorp Nomad Client by HTTP/nomad.client.cpu.total, 10m) >= {$NOMAD.CPU.UTIL.MIN} Average
HashiCorp Nomad Client: High memory utilization

RAM utilization is too high. The system might be slow to respond.

(min(/HashiCorp Nomad Client by HTTP/nomad.client.memory.available, 10m) / last(/HashiCorp Nomad Client by HTTP/nomad.client.memory.total))*100 <= {$NOMAD.RAM.AVAIL.MIN} Average
HashiCorp Nomad Client: The host has been restarted

The host uptime is less than 10 minutes.

last(/HashiCorp Nomad Client by HTTP/nomad.client.uptime) < 10m Warning Manual close: Yes
HashiCorp Nomad Client: Nomad client version has changed

Nomad client version has changed.

change(/HashiCorp Nomad Client by HTTP/nomad.client.version)<>0 Info Manual close: Yes
HashiCorp Nomad Client: Nodes API connection has failed

Nodes API connection has failed.
Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.

find(/HashiCorp Nomad Client by HTTP/nomad.client.node.info.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 Average Manual close: Yes
Depends on:
  • HashiCorp Nomad Client: Monitoring API connection has failed
HashiCorp Nomad Client: Allocations API connection has failed

Allocations API connection has failed.
Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.

find(/HashiCorp Nomad Client by HTTP/nomad.client.job.allocs.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 Average Manual close: Yes
Depends on:
  • HashiCorp Nomad Client: Monitoring API connection has failed
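
The memory trigger above divides the lowest available-memory sample of the last 10 minutes by the latest total and compares the percentage with {$NOMAD.RAM.AVAIL.MIN}. A small worked sketch of that arithmetic, with made-up sample values and a hypothetical function name:

```python
from typing import Sequence


def high_memory_utilization(available_history: Sequence[float],
                            total_last: float,
                            avail_min_percent: float = 5.0) -> bool:
    """Evaluate (min(available,10m) / last(total)) * 100 <= threshold.

    available_history stands for the samples collected in the 10m window.
    """
    if not available_history:
        raise ValueError("need at least one sample")
    available_percent = min(available_history) / total_last * 100
    return available_percent <= avail_min_percent


# Lowest sample is 700 of 16000 units: 4.375 % available, so the trigger fires.
print(high_memory_utilization([2000, 700, 1500], total_last=16000))
# Lowest sample is 900 of 16000 units: 5.625 % available, so it does not.
print(high_memory_utilization([2000, 900], total_last=16000))
```

Note the asymmetry with the CPU trigger: min() >= threshold requires every sample in the window to be high, whereas min() <= threshold here is satisfied by a single low sample.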

LLD rule Drivers discovery

Name Description Type Key and additional info
Drivers discovery

Client drivers discovery.

Dependent item nomad.client.drivers.discovery

Preprocessing

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

Item prototypes for Drivers discovery

Name Description Type Key and additional info
HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] state

Driver [{#DRIVER.NAME}] state.

Dependent item nomad.client.driver.state["{#DRIVER.NAME}"]

Preprocessing

  • JSON Path: $.body..Drivers.{#DRIVER.NAME}.Healthy.first()

  • Boolean to decimal
  • Discard unchanged with heartbeat: 1h

HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] detection state

Driver [{#DRIVER.NAME}] detection state.

Dependent item nomad.client.driver.detected["{#DRIVER.NAME}"]

Preprocessing

  • JSON Path: $.body..Drivers.{#DRIVER.NAME}.Detected.first()

  • Boolean to decimal

Trigger prototypes for Drivers discovery

Name Description Expression Severity Dependencies and additional info
HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] is in unhealthy state

The [{#DRIVER.NAME}] driver is detected, but its state is unhealthy.

last(/HashiCorp Nomad Client by HTTP/nomad.client.driver.state["{#DRIVER.NAME}"]) = 0 and last(/HashiCorp Nomad Client by HTTP/nomad.client.driver.detected["{#DRIVER.NAME}"]) = 1 Warning Manual close: Yes
HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] detection state has changed

The [{#DRIVER.NAME}] driver detection state has changed.

change(/HashiCorp Nomad Client by HTTP/nomad.client.driver.detected["{#DRIVER.NAME}"]) <> 0 Info Manual close: Yes
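
The driver prototypes above read the Healthy and Detected flags per driver, convert them to 0/1, and raise a problem only for a driver that is detected but unhealthy. The sketch below walks through that logic on a hypothetical, trimmed node info document; only the Drivers, Detected, and Healthy field names come from the JSON paths above, everything else is invented.

```python
from typing import Dict, Tuple

# Hypothetical, trimmed node info; only the fields used here are shown.
node_info: Dict[str, Dict[str, Dict[str, bool]]] = {
    "Drivers": {
        "docker": {"Detected": True, "Healthy": False},
        "exec": {"Detected": True, "Healthy": True},
        "java": {"Detected": False, "Healthy": False},
    }
}


def driver_items(info: dict, driver: str) -> Tuple[int, int]:
    """Return (state, detected) as 0/1, like JSON Path + Boolean to decimal."""
    entry = info["Drivers"][driver]
    return int(entry["Healthy"]), int(entry["Detected"])


def unhealthy_trigger(state: int, detected: int) -> bool:
    """Mirror the expression: state = 0 and detected = 1."""
    return state == 0 and detected == 1


problems = [
    name for name in node_info["Drivers"]
    if unhealthy_trigger(*driver_items(node_info, name))
]
print(problems)
```

The undetected java driver is ignored even though it is not healthy, which is the point of the second condition in the trigger.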

LLD rule Physical disks discovery

Name Description Type Key and additional info
Physical disks discovery

Physical disks discovery.

Dependent item nomad.client.disk.discovery

Preprocessing

  • Prometheus to JSON: nomad_client_host_disk_available{disk=~".*"}

Item prototypes for Physical disks discovery

Name Description Type Key and additional info
HashiCorp Nomad Client: Disk ["{#DEV.NAME}"] space available

Amount of space which is available on ["{#DEV.NAME}"] disk.

Dependent item nomad.client.disk.available["{#DEV.NAME}"]

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_host_disk_available{disk="{#DEV.NAME}"})

HashiCorp Nomad Client: Disk ["{#DEV.NAME}"] inodes utilization

Disk space consumed by the inodes on ["{#DEV.NAME}"] disk.

Dependent item nomad.client.disk.inodes_percent["{#DEV.NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

HashiCorp Nomad Client: Disk ["{#DEV.NAME}"] size

Total size of the ["{#DEV.NAME}"] device.

Dependent item nomad.client.disk.size["{#DEV.NAME}"]

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_host_disk_size{disk="{#DEV.NAME}"})

HashiCorp Nomad Client: Disk ["{#DEV.NAME}"] space utilization

Percentage of disk ["{#DEV.NAME}"] space used.

Dependent item nomad.client.disk.used_percent["{#DEV.NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

HashiCorp Nomad Client: Disk ["{#DEV.NAME}"] space used

Amount of disk ["{#DEV.NAME}"] space which has been used.

Dependent item nomad.client.disk.used["{#DEV.NAME}"]

Preprocessing

  • Prometheus pattern: VALUE(nomad_client_host_disk_used{disk="{#DEV.NAME}"})

Trigger prototypes for Physical disks discovery

Name Description Expression Severity Dependencies and additional info
HashiCorp Nomad Client: Running out of free inodes on [{#DEV.NAME}] device

It may become impossible to write to a disk if there are no index nodes left.
The following error messages may be returned as symptoms, even though free space is available:
- No space left on device;
- Disk is full.

min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.inodes_percent["{#DEV.NAME}"],5m) >= {$NOMAD.INODES.FREE.MIN.WARN:"{#DEV.NAME}"} Warning Manual close: Yes
Depends on:
  • HashiCorp Nomad Client: Running out of free inodes on [{#DEV.NAME}] device
HashiCorp Nomad Client: Running out of free inodes on [{#DEV.NAME}] device

It may become impossible to write to a disk if there are no index nodes left.
The following error messages may be returned as symptoms, even though free space is available:
- No space left on device;
- Disk is full.

min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.inodes_percent["{#DEV.NAME}"],5m) >= {$NOMAD.INODES.FREE.MIN.CRIT:"{#DEV.NAME}"} Average Manual close: Yes
HashiCorp Nomad Client: High disk [{#DEV.NAME}] utilization

High disk [{#DEV.NAME}] utilization.

min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.used_percent["{#DEV.NAME}"],5m) >= {$NOMAD.DISK.UTIL.MIN.WARN:"{#DEV.NAME}"} Warning Manual close: Yes
Depends on:
  • HashiCorp Nomad Client: Running out of free inodes on [{#DEV.NAME}] device
HashiCorp Nomad Client: High disk [{#DEV.NAME}] utilization

High disk [{#DEV.NAME}] utilization.

min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.used_percent["{#DEV.NAME}"],5m) >= {$NOMAD.DISK.UTIL.MIN.CRIT:"{#DEV.NAME}"} Average Manual close: Yes
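
The physical disks discovery turns the labeled nomad_client_host_disk_available samples into one discovery row per disk. The sketch below shows the idea on made-up telemetry. It assumes that {#DEV.NAME} is filled from the disk label, as the item prototype patterns suggest; the actual LLD macro mapping lives in the template, and the parser here is a simplified illustration.

```python
import re
from typing import Dict, List

# Hypothetical excerpt of Prometheus-format telemetry; values are made up.
SAMPLE = """\
nomad_client_host_disk_available{disk="/dev/sda1",node_id="a1"} 5.0e+10
nomad_client_host_disk_available{disk="/dev/sdb1",node_id="a1"} 2.0e+11
nomad_client_host_disk_size{disk="/dev/sda1",node_id="a1"} 1.0e+11
"""

LINE = re.compile(r'^nomad_client_host_disk_available\{(?P<labels>[^}]*)\}\s+\S+$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def discover_disks(text: str) -> List[Dict[str, str]]:
    """Return one LLD-style row per disk label found in the metric."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # other metrics are not part of this discovery
        labels = dict(LABEL.findall(match.group("labels")))
        if "disk" in labels:
            rows.append({"{#DEV.NAME}": labels["disk"]})
    return rows


rows = discover_disks(SAMPLE)
print(rows)
```

Each row then drives one set of item and trigger prototypes, subject to the {$NOMAD.DISK.NAME.MATCHES} and {$NOMAD.DISK.NAME.NOT_MATCHES} filters.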

LLD rule Allocated jobs discovery

Name Description Type Key and additional info
Allocated jobs discovery

Allocated jobs discovery.

Dependent item nomad.client.alloc.discovery

Preprocessing

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

Item prototypes for Allocated jobs discovery

Name Description Type Key and additional info
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU allocated

Total CPU resources allocated by the ["{#JOB.NAME}"] job across all cores.

Dependent item nomad.client.allocs.cpu.allocated["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU system utilization

Total CPU resources consumed by the ["{#JOB.NAME}"] job in system space.

Dependent item nomad.client.allocs.cpu.system["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU user utilization

Total CPU resources consumed by the ["{#JOB.NAME}"] job in user space.

Dependent item nomad.client.allocs.cpu.user["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU total utilization

Total CPU resources consumed by the ["{#JOB.NAME}"] job across all cores.

Dependent item nomad.client.allocs.cpu.total_percent["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU throttled periods time

Total number of CPU periods that the ["{#JOB.NAME}"] job was throttled.

Dependent item nomad.client.allocs.cpu.throttled_periods["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

  • Custom multiplier: 1e-09

HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU throttled time

Total time that the ["{#JOB.NAME}"] job was throttled.

Dependent item nomad.client.allocs.cpu.throttled_time["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU ticks

CPU ticks consumed by the process for the ["{#JOB.NAME}"] job in the last collection interval.

Dependent item nomad.client.allocs.cpu.total_ticks["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

HashiCorp Nomad Client: Job ["{#JOB.NAME}"] Memory allocated

Amount of memory allocated by the ["{#JOB.NAME}"] job.

Dependent item nomad.client.allocs.memory.allocated["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

HashiCorp Nomad Client: Job ["{#JOB.NAME}"] Memory cached

Amount of memory cached by the ["{#JOB.NAME}"] job.

Dependent item nomad.client.allocs.memory.cache["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

HashiCorp Nomad Client: Job ["{#JOB.NAME}"] Memory used

Total amount of memory used by the ["{#JOB.NAME}"] job.

Dependent item nomad.client.allocs.memory.usage["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

HashiCorp Nomad Client: Job ["{#JOB.NAME}"] Memory swapped

Amount of memory swapped by the ["{#JOB.NAME}"] job.

Dependent item nomad.client.allocs.memory.swap["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

HashiCorp Nomad Server by HTTP

Overview

This template is designed to monitor HashiCorp Nomad servers by Zabbix. It works without any external scripts.

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

  • HashiCorp Nomad version 1.5.6/1.6.0

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

  1. Enable telemetry in the HashiCorp Nomad agent configuration file and set the Prometheus metrics format. Refer to the vendor documentation for details.

  2. Set the values of the {$NOMAD.SERVER.API.SCHEME} and {$NOMAD.SERVER.API.PORT} macros to define the common Nomad API web scheme and connection port.

Additional information:

  • The Nomad servers use the default web scheme (HTTP) and the default API port (4646). If you use server discovery and need to redefine these macros for a particular host created from a prototype, use context macros such as {$NOMAD.SERVER.API.SCHEME:NECESSARY.IP} and/or {$NOMAD.SERVER.API.PORT:NECESSARY.IP} at the master host or template level.
  • Some metrics may not be collected, depending on your HashiCorp Nomad agent version, configuration, and cluster role.
  • Don't forget to set the {$NOMAD.REDUNDANCY.MIN} macro value according to the number of nodes in your cluster, so that the failure tolerance triggers work correctly.
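
Before linking the template, it can help to confirm that a server really serves Prometheus-formatted telemetry on the scheme and port set in the macros. The following is a minimal Python sketch, assuming the standard Nomad `/v1/metrics?format=prometheus` endpoint and the `X-Nomad-Token` header. The helper names and the example address are hypothetical, and the exact request sent by the template is defined in the template itself.

```python
import urllib.request


def build_metrics_request(scheme, host, port, token=None):
    """Build a GET request for Nomad telemetry in Prometheus format."""
    url = f"{scheme}://{host}:{port}/v1/metrics?format=prometheus"
    request = urllib.request.Request(url, method="GET")
    if token:
        # Only needed when ACLs are enabled on the cluster.
        request.add_header("X-Nomad-Token", token)
    return request


def fetch_metrics(request, timeout=15.0):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# The scheme and port mirror the defaults of {$NOMAD.SERVER.API.SCHEME}
# and {$NOMAD.SERVER.API.PORT}; 192.0.2.10 is a placeholder address.
request = build_metrics_request("http", "192.0.2.10", 4646, token="example-token")
print(request.full_url)
```

Calling `fetch_metrics(request)` against a reachable server should return plain text with lines such as `nomad_runtime_num_goroutines ...` if telemetry is enabled.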

Useful links:

Macros used

| Name | Description | Default |
|------|-------------|---------|
| {$NOMAD.SERVER.API.SCHEME} | Nomad SERVER API scheme. | http |
| {$NOMAD.SERVER.API.PORT} | Nomad SERVER API port. | 4646 |
| {$NOMAD.TOKEN} | Nomad authentication token. | <PUT YOUR AUTH TOKEN> |
| {$NOMAD.DATA.TIMEOUT} | Response timeout for an API. | 15s |
| {$NOMAD.HTTP.PROXY} | Sets the HTTP proxy for HTTP agent item. If this parameter is empty, then no proxy is used. | |
| {$NOMAD.API.RESPONSE.SUCCESS} | HTTP API successful response code. Availability triggers threshold. Change, if needed. | 200 |
| {$NOMAD.SERVER.RPC.PORT} | Nomad RPC service port. | 4647 |
| {$NOMAD.SERVER.SERF.PORT} | Nomad serf service port. | 4648 |
| {$NOMAD.REDUNDANCY.MIN} | Amount of redundant servers to keep the cluster safe. Default value - '1' for the 3-nodes cluster. Change if needed. | 1 |
| {$NOMAD.OPEN.FDS.MAX} | Maximum percentage of used file descriptors. | 90 |
| {$NOMAD.SERVER.LEADER.LATENCY} | Leader last contact latency threshold. | 0.3s |
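
The default {$NOMAD.REDUNDANCY.MIN} value of 1 for a three-node cluster is consistent with Raft quorum arithmetic: a cluster of N servers needs floor(N/2) + 1 of them to stay available, so it tolerates the loss of the remainder. A short Python sketch of that calculation follows. The function name is illustrative, and how the triggers actually use the macro is defined in the template.

```python
def failure_tolerance(servers: int) -> int:
    """Return how many servers may fail while a Raft quorum survives."""
    if servers < 1:
        raise ValueError("a cluster needs at least one server")
    quorum = servers // 2 + 1  # majority of the servers
    return servers - quorum


# Print the tolerance for a few common cluster sizes.
for size in (1, 3, 5, 7):
    print(size, failure_tolerance(size))
```

Under this reading, a five-node cluster would call for a value of 2.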

Items

Name Description Type Key and additional info
HashiCorp Nomad Server: Telemetry get

Telemetry data in raw format.

HTTP agent nomad.server.data.get

Preprocessing

  • Check for not supported value

    Custom on fail: Set value to: {"header":{"HTTP/1.1 408 Request timeout":""}}

HashiCorp Nomad Server: Metrics

Nomad server metrics in raw format.

Dependent item nomad.server.metrics.get

Preprocessing

  • JSON Path: $.body

    Custom on fail: Discard value
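
The two raw items above form a small pipeline. The HTTP agent item stores the whole response as JSON, a timeout is replaced with a synthetic 408 header so that dependent items still receive a value, and the metrics item keeps only `$.body`. A rough Python sketch of that behaviour is shown below. The function names and the sample response are hypothetical, and the real response shape depends on the item settings in the template.

```python
import json

# Fallback taken from the "Check for not supported value" step.
TIMEOUT_FALLBACK = '{"header":{"HTTP/1.1 408 Request timeout":""}}'


def master_value(raw):
    """Return the raw response, or the fallback when nothing was received."""
    return TIMEOUT_FALLBACK if raw is None else raw


def extract_body(value):
    """Emulate JSON Path $.body with "Custom on fail: Discard value".

    Returns None when the value would be discarded.
    """
    try:
        return json.loads(value)["body"]
    except (ValueError, KeyError, TypeError):
        return None


# Hypothetical successful response with one metric line in the body.
ok_response = json.dumps({
    "header": {"HTTP/1.1 200 OK": ""},
    "body": "nomad_runtime_num_goroutines 42\n",
})

print(extract_body(master_value(ok_response)))
print(extract_body(master_value(None)))
```

The second call yields no value because the fallback has no `body` key, which mirrors the metrics item discarding the sample after a timeout.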

HashiCorp Nomad Server: Monitoring API response

Monitoring API response message.

Dependent item nomad.server.data.api.response

Preprocessing

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h
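
The "Discard unchanged with heartbeat: 1h" step keeps history small: a value identical to the last stored one is dropped unless an hour has passed since that value was stored. A minimal Python sketch of this throttling rule follows, with a hypothetical class name and made-up response strings.

```python
class HeartbeatThrottle:
    """Store a value only if it changed or the heartbeat period elapsed."""

    def __init__(self, heartbeat_seconds):
        self.heartbeat_seconds = heartbeat_seconds
        self._last_value = None
        self._last_time = None

    def accept(self, value, now):
        """Return True if the value should be stored, False if discarded."""
        unchanged = self._last_time is not None and value == self._last_value
        if unchanged and (now - self._last_time) < self.heartbeat_seconds:
            return False
        # Remember what was stored and when.
        self._last_value = value
        self._last_time = now
        return True


throttle = HeartbeatThrottle(heartbeat_seconds=3600)
results = [
    throttle.accept("HTTP/1.1 200 OK", now=0),     # first value: stored
    throttle.accept("HTTP/1.1 200 OK", now=60),    # unchanged: discarded
    throttle.accept("HTTP/1.1 408 Request timeout", now=120),  # changed: stored
    throttle.accept("HTTP/1.1 408 Request timeout", now=3720),  # heartbeat: stored
]
print(results)
```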

HashiCorp Nomad Server: Internal stats get

Internal stats data in raw format.

HTTP agent nomad.server.stats.get

Preprocessing

  • Check for not supported value

    Custom on fail: Set value to: {"header":{"HTTP/1.1 408 Request timeout":""}}

HashiCorp Nomad Server: Internal stats API response

Internal stats API response message.

Dependent item nomad.server.stats.api.response

Preprocessing

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

HashiCorp Nomad Server: Nomad server version

Nomad server version.

Dependent item nomad.server.version

Preprocessing

  • JSON Path: $.body.config.Version.Version

HashiCorp Nomad Server: Nomad raft version

Nomad raft version.

Dependent item nomad.raft.version

Preprocessing

  • JSON Path: $.body.stats.raft.protocol_version

    Custom on fail: Discard value

HashiCorp Nomad Server: Raft peers

Current number of raft peers in the cluster.

Dependent item nomad.server.raft.peers

Preprocessing

  • JSON Path: $.body.stats.raft.num_peers

    Custom on fail: Discard value

HashiCorp Nomad Server: Cluster role

Current role in the cluster.

Dependent item nomad.server.raft.cluster_role

Preprocessing

  • JSON Path: $.body.stats.raft.state

    Custom on fail: Discard value

  • JavaScript: The text is too long. Please see the template.

HashiCorp Nomad Server: CPU time, rate

Total user and system CPU time spent in seconds.

Dependent item nomad.server.cpu.time

Preprocessing

  • Prometheus pattern: VALUE(process_cpu_seconds_total)

    Custom on fail: Discard value

  • Change per second
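
The item above combines two steps: `VALUE(process_cpu_seconds_total)` picks a single sample from the Prometheus text, and "Change per second" turns the growing counter into a rate. A simplified Python sketch of both steps is given below. The helper names and sample values are made up, only unlabeled samples are handled, and this is not the parser Zabbix itself uses.

```python
def prometheus_value(text, metric):
    """Return the value of the first unlabeled sample named `metric`."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) >= 2 and parts[0] == metric:
            return float(parts[1])
    return None


def change_per_second(prev_value, prev_time, value, time):
    """Return (value - prev_value) / (time - prev_time), or None."""
    if time <= prev_time or value < prev_value:
        return None  # no elapsed time, or the counter was reset
    return (value - prev_value) / (time - prev_time)


# Two hypothetical scrapes taken 60 seconds apart.
first_scrape = """# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 120.5
"""
second_scrape = """# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 123.5
"""

rate = change_per_second(
    prometheus_value(first_scrape, "process_cpu_seconds_total"), 0,
    prometheus_value(second_scrape, "process_cpu_seconds_total"), 60,
)
print(rate)
```
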
HashiCorp Nomad Server: Memory used

Memory utilization in bytes.

Dependent item nomad.server.runtime.alloc_bytes

Preprocessing

  • Prometheus pattern: VALUE(nomad_runtime_alloc_bytes)

    Custom on fail: Discard value

HashiCorp Nomad Server: Virtual memory size

Virtual memory size in bytes.

Dependent item nomad.server.virtual_memory_bytes

Preprocessing

  • Prometheus pattern: VALUE(process_virtual_memory_bytes)

    Custom on fail: Discard value

HashiCorp Nomad Server: Resident memory size

Resident memory size in bytes.

Dependent item nomad.server.resident_memory_bytes

Preprocessing

  • Prometheus pattern: VALUE(process_resident_memory_bytes)

    Custom on fail: Discard value

HashiCorp Nomad Server: Heap objects

Number of objects on the heap.

General memory pressure indicator.

Dependent item nomad.server.runtime.heap_objects

Preprocessing

  • Prometheus pattern: VALUE(nomad_runtime_heap_objects)

    Custom on fail: Discard value

HashiCorp Nomad Server: Open file descriptors

Number of open file descriptors.

Dependent item nomad.server.process_open_fds

Preprocessing

  • Prometheus pattern: VALUE(process_open_fds)

    Custom on fail: Discard value

HashiCorp Nomad Server: Open file descriptors, max

Maximum number of open file descriptors.

Dependent item nomad.server.process_max_fds

Preprocessing

  • Prometheus pattern: VALUE(process_max_fds)

    Custom on fail: Discard value
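
The two file descriptor items are most useful together: dividing the open count by the maximum gives the usage percentage that {$NOMAD.OPEN.FDS.MAX} (default 90) is meant to bound. A minimal Python sketch of the arithmetic follows, assuming a simple greater-than comparison. The function names are illustrative, and the actual trigger expression lives in the template.

```python
def fd_usage_percent(open_fds, max_fds):
    """Return the share of used file descriptors as a percentage."""
    if max_fds <= 0:
        raise ValueError("max_fds must be positive")
    return open_fds * 100.0 / max_fds


def exceeds_threshold(open_fds, max_fds, threshold_percent=90.0):
    """Return True when usage is above the configured percentage."""
    return fd_usage_percent(open_fds, max_fds) > threshold_percent


# Made-up readings of process_open_fds and process_max_fds.
print(fd_usage_percent(512, 1024))
print(exceeds_threshold(950, 1000))
```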

HashiCorp Nomad Server: Goroutines

Number of goroutines and general load pressure indicator.

Dependent item nomad.server.runtime.num_goroutines

Preprocessing

  • Prometheus pattern: VALUE(nomad_runtime_num_goroutines)

    Custom on fail: Discard value

HashiCorp Nomad Server: Evaluations pending

Evaluations that are pending until an existing evaluation for the same job completes.

Dependent item nomad.server.broker.total_pending

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_broker_total_pending)

    Custom on fail: Discard value

HashiCorp Nomad Server: Evaluations ready

Number of evaluations ready to be processed.

Dependent item nomad.server.broker.total_ready

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_broker_total_ready)

    Custom on fail: Discard value

HashiCorp Nomad Server: Evaluations unacked

Evaluations dispatched for processing but incomplete.

Dependent item nomad.server.broker.total_unacked

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_broker_total_unacked)

    Custom on fail: Discard value

HashiCorp Nomad Server: CPU shares for blocked evaluations

Amount of CPU shares requested by blocked evals.

Dependent item nomad.server.blocked_evals.cpu

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_blocked_evals_cpu)

    Custom on fail: Discard value

HashiCorp Nomad Server: Memory shares by blocked evaluations

Amount of memory requested by blocked evals.

Dependent item nomad.server.blocked_evals.memory

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_blocked_evals_memory)

    Custom on fail: Discard value

HashiCorp Nomad Server: CPU shares for blocked job evaluations

Amount of CPU shares requested by blocked evals of a job.

Dependent item nomad.server.blocked_evals.job.cpu

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_blocked_evals_job_cpu)

    Custom on fail: Discard value

HashiCorp Nomad Server: Memory shares for blocked job evaluations

Amount of memory requested by blocked evals of a job.

Dependent item nomad.server.blocked_evals.job.memory

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_blocked_evals_job_memory)

    Custom on fail: Discard value

HashiCorp Nomad Server: Evaluations blocked

Count of evals in the blocked state for any reason (cluster resource exhaustion or quota limits).

Dependent item nomad.server.blocked_evals.total_blocked

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_blocked_evals_total_blocked)

    Custom on fail: Discard value

HashiCorp Nomad Server: Evaluations escaped

Count of evals that have escaped computed node classes.

This indicates a scheduler optimization was skipped and is not usually a source of concern.

Dependent item nomad.server.blocked_evals.total_escaped

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_blocked_evals_total_escaped)

    Custom on fail: Discard value

HashiCorp Nomad Server: Evaluations waiting

Count of evals waiting to be enqueued.

Dependent item nomad.server.broker.total_waiting

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_broker_total_waiting)

    Custom on fail: Discard value

HashiCorp Nomad Server: Evaluations blocked due to quota limit

Count of blocked evals due to quota limits (the resources for these jobs are not counted in other blocked_evals metrics, except for total_blocked).

Dependent item nomad.server.blocked_evals.total_quota_limit

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_blocked_evals_total_quota_limit)

    Custom on fail: Discard value

HashiCorp Nomad Server: Evaluations enqueue time

Average time elapsed with evaluations waiting to be enqueued.

Dependent item nomad.server.broker.eval_waiting

Preprocessing

  • Prometheus pattern: AVG(nomad_nomad_eval_ack_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: RPC evaluation acknowledgement time

Time elapsed for Eval.Ack RPC call.

Dependent item nomad.server.eval.ack

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_eval_ack_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: RPC job summary time

Time elapsed for Job.Summary RPC call.

Dependent item nomad.server.job_summary.get_job_summary

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_job_summary_get_job_summary_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Heartbeats active

Number of active heartbeat timers.

Each timer represents a Nomad client connection.

Dependent item nomad.server.heartbeat.active

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_heartbeat_active)

    Custom on fail: Discard value

HashiCorp Nomad Server: RPC requests, rate

Number of RPC requests being handled.

Dependent item nomad.server.rpc.request

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_rpc_request)

    Custom on fail: Discard value

  • Change per second
HashiCorp Nomad Server: RPC error requests, rate

Number of RPC requests being handled that result in an error.

Dependent item nomad.server.rpc.request_error

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_rpc_request_error)

    Custom on fail: Discard value

  • Change per second
HashiCorp Nomad Server: RPC queries, rate

Number of RPC queries.

Dependent item nomad.server.rpc.query

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_rpc_query)

    Custom on fail: Discard value

  • Change per second
HashiCorp Nomad Server: RPC job allocations time

Time elapsed for Job.Allocations RPC call.

Dependent item nomad.server.job.allocations

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_job_allocations_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: RPC job evaluations time

Time elapsed for Job.Evaluations RPC call.

Dependent item nomad.server.job.evaluations

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_job_evaluations_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: RPC get job time

Time elapsed for Job.GetJob RPC call.

Dependent item nomad.server.job.get_job

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_job_get_job_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Plan apply time

Time elapsed to apply a plan.

Dependent item nomad.server.plan.apply

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_plan_apply_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Plan evaluate time

Time elapsed to evaluate a plan.

Dependent item nomad.server.plan.evaluate

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_plan_evaluate_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: RPC plan submit time

Time elapsed for Plan.Submit RPC call.

Dependent item nomad.server.plan.submit

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_plan_submit_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Plan raft index processing time

Time that the planner waits for the raft index of the plan to be processed.

Dependent item nomad.server.plan.wait_for_index

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_plan_wait_for_index_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: RPC list time

Time elapsed for Node.List RPC call.

Dependent item nomad.server.client.list

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_client_list_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: RPC update allocations time

Time elapsed for Node.UpdateAlloc RPC call.

Dependent item nomad.server.client.update_alloc

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_client_update_alloc_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: RPC update status time

Time elapsed for Node.UpdateStatus RPC call.

Dependent item nomad.server.client.update_status

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_client_update_status_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: RPC get client allocs time

Time elapsed for Node.GetClientAllocs RPC call.

Dependent item nomad.server.client.get_client_allocs

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_client_get_client_allocs_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: RPC eval dequeue time

Time elapsed for Eval.Dequeue RPC call.

Dependent item nomad.server.client.dequeue

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_eval_dequeue_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Vault token last renewal

Time since last successful Vault token renewal.

Dependent item nomad.server.vault.token_last_renewal

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_vault_token_last_renewal)

    Custom on fail: Discard value

  • Custom multiplier: 0.001

HashiCorp Nomad Server: Vault token next renewal

Time until next Vault token renewal attempt.

Dependent item nomad.server.vault.token_next_renewal

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_vault_token_next_renewal)

    Custom on fail: Discard value

  • Custom multiplier: 0.001

HashiCorp Nomad Server: Vault token TTL

Time to live for Vault token.

Dependent item nomad.server.vault.token_ttl

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_vault_token_ttl)

    Custom on fail: Discard value

  • Custom multiplier: 0.001

HashiCorp Nomad Server: Vault tokens revoked

Count of revoked tokens.

Dependent item nomad.server.vault.distributed_tokens_revoked

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_vault_distributed_tokens_revoking)

    Custom on fail: Discard value

HashiCorp Nomad Server: Jobs dead

Number of dead jobs.

Dependent item nomad.server.job_status.dead

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_job_status_dead)

    Custom on fail: Set value to: 0

HashiCorp Nomad Server: Jobs pending

Number of pending jobs.

Dependent item nomad.server.job_status.pending

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_job_status_pending)

    Custom on fail: Set value to: 0

HashiCorp Nomad Server: Jobs running

Number of running jobs.

Dependent item nomad.server.job_status.running

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_job_status_running)

    Custom on fail: Set value to: 0

HashiCorp Nomad Server: Job allocations completed

Number of complete allocations for a job.

Dependent item nomad.server.job_summary.complete

Preprocessing

  • Prometheus pattern: SUM(nomad_nomad_job_summary_complete)

    Custom on fail: Set value to: 0

HashiCorp Nomad Server: Job allocations failed

Number of failed allocations for a job.

Dependent item nomad.server.job_summary.failed

Preprocessing

  • Prometheus pattern: SUM(nomad_nomad_job_summary_failed)

    Custom on fail: Set value to: 0

HashiCorp Nomad Server: Job allocations lost

Number of lost allocations for a job.

Dependent item nomad.server.job_summary.lost

Preprocessing

  • Prometheus pattern: SUM(nomad_nomad_job_summary_lost)

    Custom on fail: Set value to: 0

HashiCorp Nomad Server: Job allocations unknown

Number of unknown allocations for a job.

Dependent item nomad.server.job_summary.unknown

Preprocessing

  • Prometheus pattern: SUM(nomad_nomad_job_summary_unknown)

    Custom on fail: Set value to: 0

HashiCorp Nomad Server: Job allocations queued

Number of queued allocations for a job.

Dependent item nomad.server.job_summary.queued

Preprocessing

  • Prometheus pattern: SUM(nomad_nomad_job_summary_queued)

    Custom on fail: Set value to: 0

HashiCorp Nomad Server: Job allocations running

Number of running allocations for a job.

Dependent item nomad.server.job_summary.running

Preprocessing

  • Prometheus pattern: SUM(nomad_nomad_job_summary_running)

    Custom on fail: Set value to: 0

HashiCorp Nomad Server: Job allocations starting

Number of starting allocations for a job.

Dependent item nomad.server.job_summary.starting

Preprocessing

  • Prometheus pattern: SUM(nomad_nomad_job_summary_starting)

    Custom on fail: Set value to: 0
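
The job allocation items above use `SUM(...)` rather than `VALUE(...)` because the metric is exposed once per job and task group, so a single number requires adding all matching samples. "Set value to: 0" then covers the case where no sample exists. A simplified Python sketch of that aggregation is shown below. The helper names, label names, and sample values are hypothetical.

```python
import re

# Matches "name value" and "name{labels} value" sample lines.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)


def sum_metric(text, metric):
    """Sum every sample named `metric`; return None if none is found."""
    total = 0.0
    matched = False
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match and match.group("name") == metric:
            total += float(match.group("value"))
            matched = True
    return total if matched else None


def sum_or_zero(text, metric):
    """Emulate "Custom on fail: Set value to: 0" for a missing metric."""
    value = sum_metric(text, metric)
    return 0.0 if value is None else value


SAMPLE = """# TYPE nomad_nomad_job_summary_running gauge
nomad_nomad_job_summary_running{exported_job="web",task_group="app"} 3
nomad_nomad_job_summary_running{exported_job="batch",task_group="worker"} 2
nomad_nomad_job_summary_failed{exported_job="web",task_group="app"} 1
"""

print(sum_or_zero(SAMPLE, "nomad_nomad_job_summary_running"))
print(sum_or_zero(SAMPLE, "nomad_nomad_job_summary_lost"))
```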

HashiCorp Nomad Server: Gossip time

Time elapsed to broadcast gossip messages.

Dependent item nomad.server.memberlist.gossip

Preprocessing

  • Prometheus pattern: VALUE(nomad_memberlist_gossip_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Leader barrier time

Time elapsed to establish a raft barrier during leader transition.

Dependent item nomad.server.leader.barrier

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_leader_barrier_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Reconcile peer time

Time elapsed to reconcile a serf peer with state store.

Dependent item nomad.server.leader.reconcile_member

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_leader_reconcileMember_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Total reconcile time

Time elapsed to reconcile all serf peers with state store.

Dependent item nomad.server.leader.reconcile

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_leader_reconcile_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Leader last contact

Time since last contact to leader.

General indicator of Raft latency.

Dependent item nomad.server.raft.leader.lastContact

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_leader_lastContact{quantile="0.99"})

    Custom on fail: Discard value

  • Replace: NaN -> 0

  • Custom multiplier: 0.001
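
The leader last contact item chains three steps: it selects the sample labelled `quantile="0.99"`, replaces the text `NaN` with `0`, and multiplies by 0.001. A small Python sketch of that chain follows, assuming the raw value is in milliseconds so that the result is in seconds. The helper names and sample values are made up.

```python
import re


def select_quantile(text, metric, quantile):
    """Return the raw text value of the sample with the given quantile."""
    pattern = re.compile(r'^' + re.escape(metric) + r'\{([^}]*)\}\s+(\S+)$')
    wanted = f'quantile="{quantile}"'
    for line in text.splitlines():
        match = pattern.match(line.strip())
        if match and wanted in match.group(1):
            return match.group(2)
    return None


def last_contact_seconds(text):
    """Apply the three preprocessing steps; None means "discard"."""
    raw = select_quantile(text, "nomad_raft_leader_lastContact", "0.99")
    if raw is None:
        return None
    raw = raw.replace("NaN", "0")  # Replace: NaN -> 0
    return float(raw) * 0.001      # Custom multiplier: 0.001


SAMPLE = """nomad_raft_leader_lastContact{quantile="0.5"} 12
nomad_raft_leader_lastContact{quantile="0.99"} 45
nomad_raft_leader_lastContact_sum 1234
"""

print(last_contact_seconds(SAMPLE))
```

With the default {$NOMAD.SERVER.LEADER.LATENCY} of 0.3s, the sample value above would sit well below the threshold.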

HashiCorp Nomad Server: Plan queue

Count of evals in the plan queue.

Dependent item nomad.server.plan.queue_depth

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_plan_queue_depth)

    Custom on fail: Discard value

HashiCorp Nomad Server: Worker evaluation create time

Time elapsed for worker to create an eval.

Dependent item nomad.server.worker.create_eval

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_worker_create_eval_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Worker evaluation dequeue time

Time elapsed for worker to dequeue an eval.

Dependent item nomad.server.worker.dequeue_eval

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_worker_dequeue_eval_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Worker invoke scheduler time

Time elapsed for worker to invoke the scheduler.

Dependent item nomad.server.worker.invoke_scheduler_service

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_worker_invoke_scheduler_service_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Worker acknowledgement send time

Time elapsed for worker to send acknowledgement.

Dependent item nomad.server.worker.send_ack

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_worker_send_ack_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Worker submit plan time

Time elapsed for worker to submit plan.

Dependent item nomad.server.worker.submit_plan

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_worker_submit_plan_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Worker update evaluation time

Time elapsed for worker to submit updated eval.

Dependent item nomad.server.worker.update_eval

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_worker_update_eval_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Worker log replication time

Time that the worker waits for the raft index of the eval to be processed.

Dependent item nomad.server.worker.wait_for_index

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_worker_wait_for_index_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Raft calls blocked, rate

Count of blocking raft API calls.

Dependent item nomad.server.raft.barrier

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_barrier)

    Custom on fail: Discard value

  • Change per second
HashiCorp Nomad Server: Raft commit logs enqueued

Count of logs enqueued.

Dependent item nomad.server.raft.commit_num_logs

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_commitNumLogs)

    Custom on fail: Discard value

HashiCorp Nomad Server: Raft transactions, rate

Number of Raft transactions.

Dependent item nomad.server.raft.apply

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_apply)

    Custom on fail: Set value to: 0

  • Change per second
HashiCorp Nomad Server: Raft commit time

Time elapsed to commit writes.

Dependent item nomad.server.raft.commit_time

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_commitTime_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Raft transaction commit time

Raft transaction commit time.

Dependent item nomad.server.raft.replication.appendEntries

Preprocessing

  • Prometheus pattern: AVG(nomad_raft_replication_appendEntries_rpc)

    Custom on fail: Discard value

  • Custom multiplier: 0.001

HashiCorp Nomad Server: FSM apply time

Time elapsed to apply write to FSM.

Dependent item nomad.server.raft.fsm.apply

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_fsm_apply_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: FSM enqueue time

Time elapsed to enqueue write to FSM.

Dependent item nomad.server.raft.fsm.enqueue

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_fsm_enqueue_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: FSM autopilot time

Time elapsed to apply Autopilot raft entry.

Dependent item nomad.server.raft.fsm.autopilot

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_fsm_autopilot_sum)

    Custom on fail: Set value to: 0

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: FSM register node time

Time elapsed to apply RegisterNode raft entry.

Dependent item nomad.server.raft.fsm.register_node

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_fsm_register_node_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: FSM index

Current index applied to FSM.

Dependent item nomad.server.raft.applied_index

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_appliedIndex)

    Custom on fail: Discard value

HashiCorp Nomad Server: Raft last index

Most recent index seen.

Dependent item nomad.server.raft.last_index

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_lastIndex)

    Custom on fail: Discard value

HashiCorp Nomad Server: Dispatch log time

Time elapsed to write log, mark in flight, and start replication.

Dependent item nomad.server.raft.leader.dispatch_log

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_leader_dispatchLog_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Logs dispatched

Count of logs dispatched.

Dependent item nomad.server.raft.leader.dispatch_num_logs

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_leader_dispatchNumLogs)

    Custom on fail: Set value to: 0

HashiCorp Nomad Server: Heartbeat fails

Number of times a heartbeat failed and an election was started.

Dependent item nomad.server.raft.transition.heartbeat_timeout

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_transition_heartbeat_timeout)

    Custom on fail: Set value to: 0

  • Discard unchanged with heartbeat: 1h

HashiCorp Nomad Server: Objects freed, rate

Count of objects freed from the heap by the Go runtime GC.

Dependent item nomad.server.runtime.free_count

Preprocessing

  • Prometheus pattern: VALUE(nomad_runtime_free_count)

    Custom on fail: Discard value

  • Change per second
HashiCorp Nomad Server: GC pause time

Go runtime GC pause times.

Dependent item nomad.server.runtime.gc_pause_ns

Preprocessing

  • Prometheus pattern: VALUE(nomad_runtime_gc_pause_ns_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: GC metadata size

Go runtime GC metadata size in bytes.

Dependent item nomad.server.runtime.sys_bytes

Preprocessing

  • Prometheus pattern: VALUE(nomad_runtime_sys_bytes)

    Custom on fail: Discard value

HashiCorp Nomad Server: GC runs

Count of Go runtime GC runs.

Dependent item nomad.server.runtime.total_gc_runs

Preprocessing

  • Prometheus pattern: VALUE(nomad_runtime_total_gc_runs)

    Custom on fail: Discard value

HashiCorp Nomad Server: Memberlist events

Count of memberlist events received.

Dependent item nomad.server.serf.queue.event

Preprocessing

  • Prometheus pattern: VALUE(nomad_serf_queue_Event_sum)

    Custom on fail: Discard value

HashiCorp Nomad Server: Memberlist changes

Count of memberlist changes.

Dependent item nomad.server.serf.queue.intent

Preprocessing

  • Prometheus pattern: VALUE(nomad_serf_queue_Intent_sum)

    Custom on fail: Discard value

HashiCorp Nomad Server: Memberlist queries

Count of memberlist queries.

Dependent item nomad.server.serf.queue.queries

Preprocessing

  • Prometheus pattern: VALUE(nomad_serf_queue_Query_sum)

    Custom on fail: Discard value

HashiCorp Nomad Server: Snapshot index

Current snapshot index.

Dependent item nomad.server.state.snapshot.index

Preprocessing

  • Prometheus pattern: VALUE(nomad_state_snapshotIndex)

    Custom on fail: Discard value

HashiCorp Nomad Server: Services ready to schedule

Count of service evals ready to be scheduled.

Dependent item nomad.server.broker.service_ready

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_broker_service_ready)

    Custom on fail: Discard value

HashiCorp Nomad Server: Services unacknowledged

Count of unacknowledged service evals.

Dependent item nomad.server.broker.service_unacked

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_broker_service_unacked)

    Custom on fail: Discard value

HashiCorp Nomad Server: System evaluations ready to schedule

Count of system evals ready to be scheduled.

Dependent item nomad.server.broker.system_ready

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_broker_system_ready)

    Custom on fail: Discard value

HashiCorp Nomad Server: System evaluations unacknowledged

Count of unacknowledged system evals.

Dependent item nomad.server.broker.system_unacked

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_broker_system_unacked)

    Custom on fail: Discard value

HashiCorp Nomad Server: BoltDB free pages

Number of BoltDB free pages.

Dependent item nomad.server.raft.boltdb.num_free_pages

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_boltdb_numFreePages)

    Custom on fail: Discard value

HashiCorp Nomad Server: BoltDB pending pages

Number of BoltDB pending pages.

Dependent item nomad.server.raft.boltdb.num_pending_pages

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_boltdb_numPendingPages)

    Custom on fail: Discard value

HashiCorp Nomad Server: BoltDB free page bytes

Number of free page bytes.

Dependent item nomad.server.raft.boltdb.free_page_bytes

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_boltdb_freePageBytes)

    Custom on fail: Discard value

HashiCorp Nomad Server: BoltDB freelist bytes

Number of freelist bytes.

Dependent item nomad.server.raft.boltdb.freelist_bytes

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_boltdb_freelistBytes)

    Custom on fail: Discard value

HashiCorp Nomad Server: BoltDB read transactions, rate

Count of total read transactions.

Dependent item nomad.server.raft.boltdb.total_read_txn

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_boltdb_totalReadTxn)

    Custom on fail: Discard value

  • Change per second
HashiCorp Nomad Server: BoltDB open read transactions

Number of current open read transactions.

Dependent item nomad.server.raft.boltdb.open_read_txn

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_boltdb_openReadTxn)

    Custom on fail: Discard value

HashiCorp Nomad Server: BoltDB pages in use

Number of pages in use.

Dependent item nomad.server.raft.boltdb.txstats.page_count

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_boltdb_txstats_pageCount)

    Custom on fail: Discard value

HashiCorp Nomad Server: BoltDB page allocations, rate

Number of page allocations.

Dependent item nomad.server.raft.boltdb.txstats.page_alloc

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_boltdb_txstats_pageAlloc)

    Custom on fail: Discard value

  • Change per second
HashiCorp Nomad Server: BoltDB cursors

Count of total database cursors.

Dependent item nomad.server.raft.boltdb.txstats.cursor_count

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_boltdb_txstats_cursorCount)

    Custom on fail: Discard value

  • Change per second
HashiCorp Nomad Server: BoltDB nodes, rate

Count of total database nodes.

Dependent item nomad.server.raft.boltdb.txstats.node_count

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_boltdb_txstats_nodeCount)

    Custom on fail: Discard value

  • Change per second
HashiCorp Nomad Server: BoltDB node dereferences, rate

Count of total database node dereferences.

Dependent item nomad.server.raft.boltdb.txstats.node_deref

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_boltdb_txstats_nodeDeref)

    Custom on fail: Discard value

  • Change per second
HashiCorp Nomad Server: BoltDB rebalance operations, rate

Count of total rebalance operations.

Dependent item nomad.server.raft.boltdb.txstats.rebalance

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_boltdb_txstats_rebalance)

    Custom on fail: Discard value

  • Change per second
HashiCorp Nomad Server: BoltDB split operations, rate

Count of total split operations.

Dependent item nomad.server.raft.boltdb.txstats.split

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_boltdb_txstats_split)

    Custom on fail: Discard value

  • Change per second

HashiCorp Nomad Server: BoltDB spill operations, rate

Count of total spill operations.

Dependent item nomad.server.raft.boltdb.txstats.spill

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_boltdb_txstats_spill)

    Custom on fail: Discard value

  • Change per second

HashiCorp Nomad Server: BoltDB write operations, rate

Count of total write operations.

Dependent item nomad.server.raft.boltdb.txstats.write

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_boltdb_txstats_write)

    Custom on fail: Discard value

  • Change per second

HashiCorp Nomad Server: BoltDB rebalance time

Sample of rebalance operation times.

Dependent item nomad.server.raft.boltdb.txstats.rebalance_time

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_boltdb_txstats_rebalanceTime_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: BoltDB spill time

Sample of spill operation times.

Dependent item nomad.server.raft.boltdb.txstats.spill_time

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_boltdb_txstats_spillTime_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: BoltDB write time

Sample of write operation times.

Dependent item nomad.server.raft.boltdb.txstats.write_time

Preprocessing

  • Prometheus pattern: VALUE(nomad_raft_boltdb_txstats_writeTime_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Service [rpc] state

Current [rpc] service state.

Simple check net.tcp.service[tcp,,{$NOMAD.SERVER.RPC.PORT}]

Preprocessing

  • Discard unchanged with heartbeat: 1h

HashiCorp Nomad Server: Service [serf] state

Current [serf] service state.

Simple check net.tcp.service[tcp,,{$NOMAD.SERVER.SERF.PORT}]

Preprocessing

  • Discard unchanged with heartbeat: 1h

HashiCorp Nomad Server: Namespace list time

Time elapsed for Namespace.ListNamespaces.

Dependent item nomad.server.namespace.list_namespace

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_namespace_list_namespace_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Autopilot state

Current autopilot state.

Dependent item nomad.server.autopilot.state

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_autopilot_healthy)

    Custom on fail: Discard value

HashiCorp Nomad Server: Autopilot failure tolerance

The number of redundant healthy servers that can fail without causing an outage.

Dependent item nomad.server.autopilot.failure_tolerance

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_autopilot_failure_tolerance)

    Custom on fail: Discard value

HashiCorp Nomad Server: FSM allocation client update time

Time elapsed to apply AllocClientUpdate raft entry.

Dependent item nomad.server.alloc_client_update

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_fsm_alloc_client_update_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: FSM apply plan results time

Time elapsed to apply ApplyPlanResults raft entry.

Dependent item nomad.server.fsm.apply_plan_results

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_fsm_apply_plan_results_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: FSM update evaluation time

Time elapsed to apply UpdateEval raft entry.

Dependent item nomad.server.fsm.update_eval

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_fsm_update_eval_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: FSM job registration time

Time elapsed to apply RegisterJob raft entry.

Dependent item nomad.server.fsm.register_job

Preprocessing

  • Prometheus pattern: VALUE(nomad_nomad_fsm_register_job_sum)

    Custom on fail: Discard value

  • Custom multiplier: 1e-09

HashiCorp Nomad Server: Allocation reschedule attempts

Count of attempts to reschedule an allocation.

Dependent item nomad.server.scheduler.allocs.rescheduled.attempted

Preprocessing

  • Prometheus pattern: SUM(nomad_scheduler_allocs_reschedule_attempted)

    Custom on fail: Set value to: 0
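
The preprocessing steps listed for the dependent items above can be pictured with a minimal Python sketch. It is not how Zabbix implements them: the sample payload, the function names and the simplified line parser are assumptions made for illustration. The sketch also assumes that the `1e-09` multiplier converts nanoseconds to seconds, and it treats anything other than exactly one matching line as a failure of the `VALUE(...)` pattern.

```python
import re
from typing import List, Optional

# Hypothetical payload in Prometheus text format; labels and numbers are invented.
SAMPLE = """\
# TYPE nomad_raft_boltdb_txstats_write counter
nomad_raft_boltdb_txstats_write{host="srv1"} 1200
nomad_raft_boltdb_txstats_writeTime_sum{host="srv1"} 2500000000
nomad_scheduler_allocs_reschedule_attempted{job="a"} 2
nomad_scheduler_allocs_reschedule_attempted{job="b"} 3
"""


def samples(text: str, metric: str) -> List[float]:
    """Return every sample value whose metric name equals `metric` exactly."""
    pattern = re.compile(
        r"^" + re.escape(metric) + r"(?:\{[^}]*\})?\s+(\S+)", re.MULTILINE
    )
    return [float(v) for v in pattern.findall(text)]


def value(text: str, metric: str) -> Optional[float]:
    """Sketch of 'Prometheus pattern: VALUE(metric)' with 'Custom on fail: Discard value'.

    None stands for a discarded value (nothing is stored for the item).
    """
    found = samples(text, metric)
    return found[0] if len(found) == 1 else None


def sum_or_zero(text: str, metric: str) -> float:
    """Sketch of 'Prometheus pattern: SUM(metric)' with 'Custom on fail: Set value to: 0'."""
    return sum(samples(text, metric), 0.0)


def ns_to_seconds(raw: Optional[float]) -> Optional[float]:
    """Sketch of 'Custom multiplier: 1e-09' (assumed nanoseconds to seconds)."""
    return None if raw is None else raw * 1e-09


def change_per_second(prev: float, prev_ts: float, cur: float, cur_ts: float) -> Optional[float]:
    """Sketch of 'Change per second': counter difference divided by elapsed seconds.

    Returns None when the counter went backwards (for example after a restart)
    or when no time has elapsed, so no rate is produced for that sample.
    """
    if cur < prev or cur_ts <= prev_ts:
        return None
    return (cur - prev) / (cur_ts - prev_ts)
```

For example, two polls of `nomad_raft_boltdb_txstats_write` taken 60 seconds apart with values 1200 and 1260 give `change_per_second(1200, 0, 1260, 60)`, that is, one write operation per second.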

Triggers

Name Description Expression Severity Dependencies and additional info
HashiCorp Nomad Server: Monitoring API connection has failed

Monitoring API connection has failed.
Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.

find(/HashiCorp Nomad Server by HTTP/nomad.server.data.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 Average Manual close: Yes
HashiCorp Nomad Server: Internal stats API connection has failed

Internal stats API connection has failed.
Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.

find(/HashiCorp Nomad Server by HTTP/nomad.server.stats.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 Average Manual close: Yes
Depends on:
  • HashiCorp Nomad Server: Monitoring API connection has failed
HashiCorp Nomad Server: Nomad server version has changed

Nomad server version has changed.

change(/HashiCorp Nomad Server by HTTP/nomad.server.version)<>0 Info Manual close: Yes
HashiCorp Nomad Server: Cluster role has changed

Cluster role has changed.

change(/HashiCorp Nomad Server by HTTP/nomad.server.raft.cluster_role) <> 0 Info Manual close: Yes
HashiCorp Nomad Server: Current number of open files is too high

Heavy file descriptor usage (i.e., near the process file descriptor limit) indicates a potential file descriptor exhaustion issue.

min(/HashiCorp Nomad Server by HTTP/nomad.server.process_open_fds,5m)/last(/HashiCorp Nomad Server by HTTP/nomad.server.process_max_fds)*100>{$NOMAD.OPEN.FDS.MAX} Warning
HashiCorp Nomad Server: Dead jobs found

Jobs in the Dead state have been discovered.
Check the {$NOMAD.SERVER.API.SCHEME}://{HOST.IP}:{$NOMAD.SERVER.API.PORT}/v1/jobs URL for the details.

last(/HashiCorp Nomad Server by HTTP/nomad.server.job_status.dead) > 0 and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.job_status.dead,5m) = 0 Warning Manual close: Yes
HashiCorp Nomad Server: Leader last contact timeout exceeded

The nomad.raft.leader.lastContact metric is a general indicator of Raft latency which can be used to observe how Raft timing is performing and guide infrastructure provisioning.
If this number trends upwards, look at CPU, disk IOPS, and network latency. nomad.raft.leader.lastContact should not get too close to the leader lease timeout of 500ms.

min(/HashiCorp Nomad Server by HTTP/nomad.server.raft.leader.lastContact,5m) >= {$NOMAD.SERVER.LEADER.LATENCY} and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.raft.leader.lastContact,5m) = 0 Warning
HashiCorp Nomad Server: Service [rpc] is down

Cannot establish the connection to [rpc] service port {$NOMAD.SERVER.RPC.PORT}.
Check the Nomad state and network connectivity between Nomad and Zabbix.

last(/HashiCorp Nomad Server by HTTP/net.tcp.service[tcp,,{$NOMAD.SERVER.RPC.PORT}]) = 0 Average Manual close: Yes
HashiCorp Nomad Server: Service [serf] is down

Cannot establish the connection to [serf] service port {$NOMAD.SERVER.SERF.PORT}.
Check the Nomad state and network connectivity between Nomad and Zabbix.

last(/HashiCorp Nomad Server by HTTP/net.tcp.service[tcp,,{$NOMAD.SERVER.SERF.PORT}]) = 0 Average Manual close: Yes
HashiCorp Nomad Server: Autopilot is unhealthy

The autopilot is in an unhealthy state. The probability of a successful failover is extremely low.

last(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.state) = 0 and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.state,5m) = 0 Average Manual close: Yes
HashiCorp Nomad Server: Autopilot redundancy is low

The autopilot redundancy is low.
The cluster is at high risk of an outage if one more server fails.

last(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.failure_tolerance) < {$NOMAD.REDUNDANCY.MIN} and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.failure_tolerance,5m) = 0 Warning Manual close: Yes
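
To make the logic of two of the trigger expressions above easier to follow, here is a rough Python sketch of how they evaluate. It only mirrors the arithmetic of the expressions; the function names, the argument shapes and the threshold values used in the example are assumptions, not the template's defaults.

```python
from typing import Optional, Sequence


def open_fds_too_high(open_fds_5m: Sequence[float], max_fds: float, threshold_pct: float) -> bool:
    """Mirror of: min(open_fds,5m) / last(max_fds) * 100 > {$NOMAD.OPEN.FDS.MAX}.

    Using min() over the window means every sample in the last 5 minutes
    must be above the threshold, so a short spike does not fire the trigger.
    """
    if not open_fds_5m or max_fds <= 0:
        return False  # no data or no meaningful limit: nothing to compare
    return min(open_fds_5m) / max_fds * 100 > threshold_pct


def redundancy_is_low(
    last_value: Optional[float],
    seconds_since_last: float,
    minimum: float,
    nodata_window: float = 300.0,
) -> bool:
    """Mirror of: last(failure_tolerance) < {$NOMAD.REDUNDANCY.MIN} and nodata(...,5m) = 0.

    The nodata(...) = 0 part requires a fresh value within the window, so a
    stale last value cannot keep the problem open after data stops arriving.
    """
    has_fresh_data = last_value is not None and seconds_since_last <= nodata_window
    return has_fresh_data and last_value < minimum


# Example with assumed thresholds (90 percent and a minimum tolerance of 1).
print(open_fds_too_high([950, 960, 970], max_fds=1000, threshold_pct=90))
print(redundancy_is_low(0, seconds_since_last=30, minimum=1))
```

The same `nodata(...) = 0` guard appears in the dead jobs, leader last contact and autopilot health triggers, and it can be read the same way.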

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help on the ZABBIX forums.