# HashiCorp Consul Node by HTTP

## Overview

This template monitors HashiCorp Consul by Zabbix and works without any external scripts.
Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

Do not forget to enable the Prometheus format for exported metrics.
See [documentation](https://www.consul.io/docs/agent/options#telemetry-prometheus_retention_time).
More information about the metrics can be found in the [official documentation](https://www.consul.io/docs/agent/telemetry).

The template `HashiCorp Consul Node by HTTP` collects metrics by HTTP agent from the /v1/agent/metrics endpoint.

## Requirements

Zabbix version: 7.0 and higher.

## Tested versions

This template has been tested on:
- HashiCorp Consul 1.10.0

## Configuration

> Zabbix should be configured according to the instructions in the [Templates out of the box](https://www.zabbix.com/documentation/7.0/manual/config/templates_out_of_the_box) section.

## Setup

Internal service metrics are collected from the /v1/agent/metrics endpoint.
Do not forget to enable the Prometheus format for exported metrics. See [documentation](https://www.consul.io/docs/agent/options#telemetry-prometheus_retention_time).
The template needs to use authorization via an API token.

Do not forget to set the macros {$CONSUL.NODE.API.URL} and {$CONSUL.TOKEN}.
Also, see the Macros section for a list of macros used to set trigger values.
More information about the metrics can be found in the [official documentation](https://www.consul.io/docs/agent/telemetry).

This template supports [Consul namespaces](https://www.consul.io/docs/enterprise/namespaces). You can set the macros {$CONSUL.LLD.FILTER.SERVICE_NAMESPACE.MATCHES} and {$CONSUL.LLD.FILTER.SERVICE_NAMESPACE.NOT_MATCHES} if you want to filter discovered services by namespace.
In the Open Source version, the service namespace is set to 'None'.

*NOTE.* Some metrics may not be collected depending on your HashiCorp Consul instance version and configuration.
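
To see which metrics an instance actually exposes before linking the template, the endpoint can be queried by hand. The following is a minimal Python sketch and is not part of the template. The function names are hypothetical, the parser handles only simple `name value` sample lines without trailing timestamps, and it assumes the agent accepts the token in an `Authorization: Bearer` header and serves Prometheus output when `format=prometheus` is passed.

```python
import urllib.request
from typing import Dict


def build_metrics_request(base_url: str, token: str) -> urllib.request.Request:
    """Build a request for the agent metrics endpoint in Prometheus format.

    base_url plays the role of {$CONSUL.NODE.API.URL}, token of {$CONSUL.TOKEN}.
    """
    url = base_url.rstrip("/") + "/v1/agent/metrics?format=prometheus"
    # Send the token only when one is configured (the macro is empty by default).
    headers = {"Authorization": "Bearer " + token} if token else {}
    return urllib.request.Request(url, headers=headers)


def parse_prometheus(text: str) -> Dict[str, float]:
    """Parse Prometheus text exposition lines into {sample_name: value}.

    Simplification: comment lines are skipped and each sample line is assumed
    to end with its value (no trailing timestamp).
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end with a number
    return samples


def fetch_metrics(base_url: str, token: str, timeout: float = 5.0) -> Dict[str, float]:
    """Fetch and parse the metrics from a running Consul agent."""
    request = build_metrics_request(base_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_prometheus(response.read().decode("utf-8"))
```

Calling `fetch_metrics("http://localhost:8500", "")` against a local agent returns a dictionary of sample names and values. If the Prometheus format is not enabled on the agent, the request is expected to fail, which is a quick way to confirm the setup step above.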
*NOTE.* You may also be interested in the Envoy Proxy by HTTP [template](../../envoy_proxy_http).

### Macros used

|Name|Description|Default|
|----|-----------|-------|
|{$CONSUL.NODE.API.URL}|Consul instance URL.|`http://localhost:8500`|
|{$CONSUL.TOKEN}|Consul auth token.|``|
|{$CONSUL.OPEN.FDS.MAX.WARN}|Maximum percentage of used file descriptors.|`90`|
|{$CONSUL.LLD.FILTER.LOCAL_SERVICE_NAME.MATCHES}|Filter for discoverable services on the local node.|`.*`|
|{$CONSUL.LLD.FILTER.LOCAL_SERVICE_NAME.NOT_MATCHES}|Filter to exclude discovered services on the local node.|`CHANGE IF NEEDED`|
|{$CONSUL.LLD.FILTER.SERVICE_NAMESPACE.MATCHES}|Filter for discoverable services by namespace on the local node. Enterprise only; in the Open Source version, the namespace is set to 'None'.|`.*`|
|{$CONSUL.LLD.FILTER.SERVICE_NAMESPACE.NOT_MATCHES}|Filter to exclude discovered services by namespace on the local node. Enterprise only; in the Open Source version, the namespace is set to 'None'.|`CHANGE IF NEEDED`|
|{$CONSUL.NODE.HEALTH_SCORE.MAX.WARN}|Maximum acceptable value of node's health score for WARNING trigger expression.|`2`|
|{$CONSUL.NODE.HEALTH_SCORE.MAX.HIGH}|Maximum acceptable value of node's health score for AVERAGE trigger expression.|`4`|

### Items

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Consul: Get instance metrics|Get raw metrics from Consul instance /metrics endpoint.|HTTP agent|consul.get_metrics<p>**Preprocessing**</p>|
|Consul: Get node info|Get configuration and member information of the local agent.|HTTP agent|consul.get_node_info<p>**Preprocessing**</p>|
|Consul: Role|Role of current Consul agent.|Dependent item|consul.role<p>**Preprocessing**</p>|
|Consul: Version|Version of Consul agent.|Dependent item|consul.version<p>**Preprocessing**</p>|
|Consul: Number of services|Number of services on current node.|Dependent item|consul.services_number<p>**Preprocessing**</p>|
|Consul: Number of checks|Number of checks on current node.|Dependent item|consul.checks_number<p>**Preprocessing**</p>|
|Consul: Number of check monitors|Number of check monitors on current node.|Dependent item|consul.check_monitors_number<p>**Preprocessing**</p>|
|Consul: Process CPU seconds, total|Total user and system CPU time spent in seconds.|Dependent item|consul.cpu_seconds_total.rate<p>**Preprocessing**</p>|
|Consul: Virtual memory size|Virtual memory size in bytes.|Dependent item|consul.virtual_memory_bytes<p>**Preprocessing**</p>|
|Consul: RSS memory usage|Resident memory size in bytes.|Dependent item|consul.resident_memory_bytes<p>**Preprocessing**</p>|
|Consul: Goroutine count|The number of Goroutines on Consul instance.|Dependent item|consul.goroutines<p>**Preprocessing**</p>|
|Consul: Open file descriptors|Number of open file descriptors.|Dependent item|consul.process_open_fds<p>**Preprocessing**</p>|
|Consul: Open file descriptors, max|Maximum number of open file descriptors.|Dependent item|consul.process_max_fds<p>**Preprocessing**</p>|
|Consul: Client RPC, per second|<p>Number of RPC requests per second that a Consul agent in client mode makes to a Consul server.</p><p>This gives a measure of how much a given agent is loading the Consul servers.</p><p>This is only generated by agents in client mode, not Consul servers.</p>|Dependent item|consul.client_rpc<p>**Preprocessing**</p>|
|Consul: Client RPC failed ,per second|Number of RPC requests per second that a Consul agent in client mode makes to a Consul server and that fail.|Dependent item|consul.client_rpc_failed<p>**Preprocessing**</p>|
|Consul: TCP connections, accepted per second|This metric counts the number of times a Consul agent has accepted an incoming TCP stream connection per second.|Dependent item|consul.memberlist.tcp_accept<p>**Preprocessing**</p>|
|Consul: TCP connections, per second|This metric counts the number of times a Consul agent has initiated a push/pull sync with another agent per second.|Dependent item|consul.memberlist.tcp_connect<p>**Preprocessing**</p>|
|Consul: TCP send bytes, per second|This metric measures the total number of bytes sent by a Consul agent through the TCP protocol per second.|Dependent item|consul.memberlist.tcp_sent<p>**Preprocessing**</p>|
|Consul: UDP received bytes, per second|This metric measures the total number of bytes received by a Consul agent through the UDP protocol per second.|Dependent item|consul.memberlist.udp_received<p>**Preprocessing**</p>|
|Consul: UDP sent bytes, per second|This metric measures the total number of bytes sent by a Consul agent through the UDP protocol per second.|Dependent item|consul.memberlist.udp_sent<p>**Preprocessing**</p>|
|Consul: GC pause, p90|The 90th percentile of the time consumed by stop-the-world garbage collection (GC) pauses since Consul started, in milliseconds.|Dependent item|consul.gc_pause.p90<p>**Preprocessing**</p>|
|Consul: GC pause, p50|The 50th percentile (median) of the time consumed by stop-the-world garbage collection (GC) pauses since Consul started, in milliseconds.|Dependent item|consul.gc_pause.p50<p>**Preprocessing**</p>|
|Consul: Memberlist: degraded|<p>This metric counts the number of times the Consul agent has performed failure detection on another agent at a slower probe rate.</p><p>The agent uses its own health metric as an indicator to perform this action.</p><p>If its health score is low, it means that the node is healthy, and vice versa.</p>|Dependent item|consul.memberlist.degraded<p>**Preprocessing**</p>|
|Consul: Memberlist: health score|<p>This metric describes a node's perception of its own health based on how well it is meeting the soft real-time requirements of the protocol.</p><p>This metric ranges from 0 to 8, where 0 indicates "totally healthy".</p>|Dependent item|consul.memberlist.health_score<p>**Preprocessing**</p>|
|Consul: Memberlist: gossip, p90|The 90th percentile for the number of gossips (messages) broadcast to a set of randomly selected nodes.|Dependent item|consul.memberlist.dispatch_log.p90<p>**Preprocessing**</p>|
|Consul: Memberlist: gossip, p50|The 50th percentile (median) for the number of gossips (messages) broadcast to a set of randomly selected nodes.|Dependent item|consul.memberlist.gossip.p50<p>**Preprocessing**</p>|
|Consul: Memberlist: msg alive|This metric counts the number of alive Consul agents that the agent has mapped out so far, based on the message information given by the network layer.|Dependent item|consul.memberlist.msg.alive<p>**Preprocessing**</p>|
|Consul: Memberlist: msg dead|This metric counts the number of times a Consul agent has marked another agent to be a dead node.|Dependent item|consul.memberlist.msg.dead<p>**Preprocessing**</p>|
|Consul: Memberlist: msg suspect|The number of times a Consul agent suspects another as failed while probing during gossip protocol.|Dependent item|consul.memberlist.msg.suspect<p>**Preprocessing**</p>|
|Consul: Memberlist: probe node, p90|The 90th percentile for the time taken to perform a single round of failure detection on a selected Consul agent.|Dependent item|consul.memberlist.probe_node.p90<p>**Preprocessing**</p>|
|Consul: Memberlist: probe node, p50|The 50th percentile (median) for the time taken to perform a single round of failure detection on a selected Consul agent.|Dependent item|consul.memberlist.probe_node.p50<p>**Preprocessing**</p>|
|Consul: Memberlist: push pull node, p90|The 90th percentile for the number of Consul agents that have exchanged state with this agent.|Dependent item|consul.memberlist.push_pull_node.p90<p>**Preprocessing**</p>|
|Consul: Memberlist: push pull node, p50|The 50th percentile (median) for the number of Consul agents that have exchanged state with this agent.|Dependent item|consul.memberlist.push_pull_node.p50<p>**Preprocessing**</p>|
|Consul: KV store: apply, p90|The 90th percentile for the time it takes to complete an update to the KV store.|Dependent item|consul.kvs.apply.p90<p>**Preprocessing**</p>|
|Consul: KV store: apply, p50|The 50th percentile (median) for the time it takes to complete an update to the KV store.|Dependent item|consul.kvs.apply.p50<p>**Preprocessing**</p>|
|Consul: KV store: apply, rate|The number of updates to the KV store per second.|Dependent item|consul.kvs.apply.rate<p>**Preprocessing**</p>|
|Consul: Serf member: flap, rate|<p>Increments when an agent is marked dead and then recovers within a short time period.</p><p>This can be an indicator of overloaded agents, network problems, or configuration errors where agents cannot connect to each other on the required ports.</p><p>Shown as events per second.</p>|Dependent item|consul.serf.member.flap.rate<p>**Preprocessing**</p>|
|Consul: Serf member: failed, rate|<p>Increments when an agent is marked dead.</p><p>This can be an indicator of overloaded agents, network problems, or configuration errors where agents cannot connect to each other on the required ports.</p><p>Shown as events per second.</p>|Dependent item|consul.serf.member.failed.rate<p>**Preprocessing**</p>|
|Consul: Serf member: join, rate|<p>Increments when an agent joins the cluster. If an agent flapped or failed, this counter also increments when it rejoins.</p><p>Shown as events per second.</p>|Dependent item|consul.serf.member.join.rate<p>**Preprocessing**</p>|
|Consul: Serf member: left, rate|Increments when an agent leaves the cluster. Shown as events per second.|Dependent item|consul.serf.member.left.rate<p>**Preprocessing**</p>|
|Consul: Serf member: update, rate|Increments when a Consul agent updates. Shown as events per second.|Dependent item|consul.serf.member.update.rate<p>**Preprocessing**</p>|
|Consul: ACL: resolves, rate|The number of ACL resolves per second.|Dependent item|consul.acl.resolves.rate<p>**Preprocessing**</p>|
|Consul: Catalog: register, rate|The number of catalog register operations per second.|Dependent item|consul.catalog.register.rate<p>**Preprocessing**</p>|
|Consul: Catalog: deregister, rate|The number of catalog deregister operations per second.|Dependent item|consul.catalog.deregister.rate<p>**Preprocessing**</p>|
|Consul: Snapshot: append line, p90|The 90th percentile for the time taken by the Consul agent to append an entry into the existing log.|Dependent item|consul.snapshot.append_line.p90<p>**Preprocessing**</p>|
|Consul: Snapshot: append line, p50|The 50th percentile (median) for the time taken by the Consul agent to append an entry into the existing log.|Dependent item|consul.snapshot.append_line.p50<p>**Preprocessing**</p>|
|Consul: Snapshot: append line, rate|The number of snapshot appendLine operations per second.|Dependent item|consul.snapshot.append_line.rate<p>**Preprocessing**</p>|
|Consul: Snapshot: compact, p90|<p>The 90th percentile for the time taken by the Consul agent to compact a log.</p><p>This operation occurs only when the snapshot becomes large enough to justify the compaction.</p>|Dependent item|consul.snapshot.compact.p90<p>**Preprocessing**</p>|
|Consul: Snapshot: compact, p50|<p>The 50th percentile (median) for the time taken by the Consul agent to compact a log.</p><p>This operation occurs only when the snapshot becomes large enough to justify the compaction.</p>|Dependent item|consul.snapshot.compact.p50<p>**Preprocessing**</p>|
|Consul: Snapshot: compact, rate|The number of snapshot compact operations per second.|Dependent item|consul.snapshot.compact.rate<p>**Preprocessing**</p>|
|Consul: Get local services|Get all the services that are registered with the local agent and their status.|Script|consul.get_local_services|
|Consul: Get local services check|Data collection check.|Dependent item|consul.get_local_services.check<p>**Preprocessing**</p>|

### Triggers

|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
|Consul: Version has been changed|Consul version has changed. Acknowledge to close the problem manually.|`last(/HashiCorp Consul Node by HTTP/consul.version,#1)<>last(/HashiCorp Consul Node by HTTP/consul.version,#2) and length(last(/HashiCorp Consul Node by HTTP/consul.version))>0`|Info|**Manual close**: Yes|
|Consul: Current number of open files is too high|"Heavy file descriptor usage (i.e., near the process’s file descriptor limit) indicates a potential file descriptor exhaustion issue."|`min(/HashiCorp Consul Node by HTTP/consul.process_open_fds,5m)/last(/HashiCorp Consul Node by HTTP/consul.process_max_fds)*100>{$CONSUL.OPEN.FDS.MAX.WARN}`|Warning||
|Consul: Node's health score is warning|<p>This metric ranges from 0 to 8, where 0 indicates "totally healthy".</p><p>This health score is used to scale the time between outgoing probes, and higher scores translate into longer probing intervals.</p><p>For more details see section IV of the Lifeguard paper: https://arxiv.org/pdf/1707.00788.pdf</p>|`max(/HashiCorp Consul Node by HTTP/consul.memberlist.health_score,#3)>{$CONSUL.NODE.HEALTH_SCORE.MAX.WARN}`|Warning|**Depends on**:|
|Consul: Node's health score is critical|<p>This metric ranges from 0 to 8, where 0 indicates "totally healthy".</p><p>This health score is used to scale the time between outgoing probes, and higher scores translate into longer probing intervals.</p><p>For more details see section IV of the Lifeguard paper: https://arxiv.org/pdf/1707.00788.pdf</p>|`max(/HashiCorp Consul Node by HTTP/consul.memberlist.health_score,#3)>{$CONSUL.NODE.HEALTH_SCORE.MAX.HIGH}`|Average||
|Consul: Failed to get local services|Failed to get local services. Check debug log for more information.|`length(last(/HashiCorp Consul Node by HTTP/consul.get_local_services.check))>0`|Warning||

### LLD rule Local node services discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Local node services discovery|Discover metrics for services that are registered with the local agent.|Dependent item|consul.node_services_lld<p>**Preprocessing**</p>|

### Item prototypes for Local node services discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Consul: ["{#SERVICE_NAME}"]: Aggregated status|Aggregated values of all health checks for the service instance.|Dependent item|consul.service.aggregated_state["{#SERVICE_ID}"]<p>**Preprocessing**</p>|
|Consul: ["{#SERVICE_NAME}"]: Check ["{#SERVICE_CHECK_NAME}"]: Status|Current state of health check for the service.|Dependent item|consul.service.check.state["{#SERVICE_ID}/{#SERVICE_CHECK_ID}"]<p>**Preprocessing**</p>|
|Consul: ["{#SERVICE_NAME}"]: Check ["{#SERVICE_CHECK_NAME}"]: Output|Current output of health check for the service.|Dependent item|consul.service.check.output["{#SERVICE_ID}/{#SERVICE_CHECK_ID}"]<p>**Preprocessing**</p>|

### Trigger prototypes for Local node services discovery

|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
|Consul: Aggregated status is 'warning'|Aggregated state of service on the local agent is 'warning'.|`last(/HashiCorp Consul Node by HTTP/consul.service.aggregated_state["{#SERVICE_ID}"]) = 1`|Warning||
|Consul: Aggregated status is 'critical'|Aggregated state of service on the local agent is 'critical'.|`last(/HashiCorp Consul Node by HTTP/consul.service.aggregated_state["{#SERVICE_ID}"]) = 2`|Average||

### LLD rule HTTP API methods discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|HTTP API methods discovery|Discover metrics specific to HTTP API methods.|Dependent item|consul.http_api_discovery<p>**Preprocessing**</p>|

### Item prototypes for HTTP API methods discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Consul: HTTP request: ["{#HTTP_METHOD}"], p90|The 90th percentile of how long it takes to service the given HTTP request for the given verb.|Dependent item|consul.http.api.p90["{#HTTP_METHOD}"]<p>**Preprocessing**</p>|
|Consul: HTTP request: ["{#HTTP_METHOD}"], p50|The 50th percentile (median) of how long it takes to service the given HTTP request for the given verb.|Dependent item|consul.http.api.p50["{#HTTP_METHOD}"]<p>**Preprocessing**</p>|
|Consul: HTTP request: ["{#HTTP_METHOD}"], rate|The number of HTTP requests for the given verb per second.|Dependent item|consul.http.api.rate["{#HTTP_METHOD}"]<p>**Preprocessing**</p>|

### LLD rule Raft server metrics discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Raft server metrics discovery|Discover raft metrics for server nodes.|Dependent item|consul.raft.server.discovery<p>**Preprocessing**</p>|

### Item prototypes for Raft server metrics discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Consul: Raft state|Current state of Consul agent.|Dependent item|consul.raft.state[{#SINGLETON}]<p>**Preprocessing**</p>|
|Consul: Raft state: leader|Increments when a server becomes a leader.|Dependent item|consul.raft.state_leader[{#SINGLETON}]<p>**Preprocessing**</p>|
|Consul: Raft state: candidate|The number of initiated leader elections.|Dependent item|consul.raft.state_candidate[{#SINGLETON}]<p>**Preprocessing**</p>|
|Consul: Raft: apply, rate|<p>Incremented whenever a leader first passes a message into the Raft commit process (called an Apply operation).</p><p>This metric describes the arrival rate of new logs into Raft per second.</p>|Dependent item|consul.raft.apply.rate[{#SINGLETON}]<p>**Preprocessing**</p>|

### LLD rule Raft leader metrics discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Raft leader metrics discovery|Discover raft metrics for leader nodes.|Dependent item|consul.raft.leader.discovery<p>**Preprocessing**</p>|

### Item prototypes for Raft leader metrics discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Consul: Raft state: leader last contact, p90|The 90th percentile of how long it takes a leader node to communicate with followers during a leader lease check, in milliseconds.|Dependent item|consul.raft.leader_last_contact.p90[{#SINGLETON}]<p>**Preprocessing**</p>|
|Consul: Raft state: leader last contact, p50|The 50th percentile (median) of how long it takes a leader node to communicate with followers during a leader lease check, in milliseconds.|Dependent item|consul.raft.leader_last_contact.p50[{#SINGLETON}]<p>**Preprocessing**</p>|
|Consul: Raft state: commit time, p90|The 90th percentile of the time it takes to commit a new entry to the raft log on the leader, in milliseconds.|Dependent item|consul.raft.commit_time.p90[{#SINGLETON}]<p>**Preprocessing**</p>|
|Consul: Raft state: commit time, p50|The 50th percentile (median) of the time it takes to commit a new entry to the raft log on the leader, in milliseconds.|Dependent item|consul.raft.commit_time.p50[{#SINGLETON}]<p>**Preprocessing**</p>|
|Consul: Raft state: dispatch log, p90|The 90th percentile of the time it takes for the leader to write log entries to disk, in milliseconds.|Dependent item|consul.raft.dispatch_log.p90[{#SINGLETON}]<p>**Preprocessing**</p>|
|Consul: Raft state: dispatch log, p50|The 50th percentile (median) of the time it takes for the leader to write log entries to disk, in milliseconds.|Dependent item|consul.raft.dispatch_log.p50[{#SINGLETON}]<p>**Preprocessing**</p>|
|Consul: Raft state: dispatch log, rate|The number of times a Raft leader writes a log to disk per second.|Dependent item|consul.raft.dispatch_log.rate[{#SINGLETON}]<p>**Preprocessing**</p>|
|Consul: Raft state: commit, rate|The number of new entries committed to the Raft log on the leader per second.|Dependent item|consul.raft.commit_time.rate[{#SINGLETON}]<p>**Preprocessing**</p>|
|Consul: Autopilot healthy|Tracks the overall health of the local server cluster. 1 if all servers are healthy, 0 if one or more are unhealthy.|Dependent item|consul.autopilot.healthy[{#SINGLETON}]<p>**Preprocessing**</p>|

## Feedback

Please report any issues with the template at [`https://support.zabbix.com`](https://support.zabbix.com).

You can also provide feedback, discuss the template, or ask for help at [`ZABBIX forums`](https://www.zabbix.com/forum/zabbix-suggestions-and-feedback).