
Etcd by HTTP

Overview

This template monitors etcd with Zabbix and works without any external scripts. Most metrics are collected in one go, thanks to Zabbix bulk data collection.

The Etcd by HTTP template collects metrics with the HTTP agent from the /metrics endpoint.
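As a rough illustration of the bulk collection described above: a master item fetches the /metrics text once, and each dependent item extracts its own value from that text. The Python sketch below mimics a `VALUE(metric_name)` extraction for a metric without labels. The sample payload and the helper name `prometheus_value` are made up for illustration and are not part of Zabbix.

```python
from typing import Optional

# Hypothetical sample of what an etcd /metrics response may contain.
SAMPLE_METRICS = """\
# HELP etcd_server_has_leader Whether or not a leader exists.
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
# TYPE etcd_server_leader_changes_seen_total counter
etcd_server_leader_changes_seen_total 3
# TYPE process_open_fds gauge
process_open_fds 42
"""


def prometheus_value(payload: str, metric: str) -> Optional[float]:
    """Return the value of an unlabeled metric, or None if it is absent."""
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) >= 2 and parts[0] == metric:
            return float(parts[1])
    return None


# One payload, several "dependent items" reading from it.
has_leader = prometheus_value(SAMPLE_METRICS, "etcd_server_has_leader")
leader_changes = prometheus_value(SAMPLE_METRICS, "etcd_server_leader_changes_seen_total")
open_fds = prometheus_value(SAMPLE_METRICS, "process_open_fds")
```

The point of the design is that the etcd node is queried once per interval, no matter how many dependent items read from the response.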

Refer to the vendor documentation.

For users of etcd version 3.4 and earlier:

Some metrics were deprecated in etcd v3.5. For details, see Upgrade etcd from 3.4 to 3.5. Either upgrade your etcd instance or use an older version of the Etcd by HTTP template.

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

Etcd 3.5.6

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

Follow these instructions:

Import the template into Zabbix.
After importing the template, make sure that etcd allows the collection of metrics. You can test it by running: curl -L http://localhost:2379/metrics.
Check if etcd is accessible from Zabbix proxy or Zabbix server depending on where you are planning to do the monitoring. To verify it, run curl -L http://<etcd_node_address>:2379/metrics.
Add the template to each etcd node. By default, the template uses the client port. You can change the metrics endpoint location with the --listen-metrics-urls flag (see the etcd documentation for details).
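The reachability check from the steps above can also be scripted. The following Python sketch builds the metrics URL from a scheme, host, and port (defaults mirror the `{$ETCD.SCHEME}` and `{$ETCD.PORT}` defaults) and reports whether the endpoint answers with HTTP 200. The function names are hypothetical, and the sketch assumes an endpoint without authentication.

```python
import urllib.request


def metrics_url(host: str, scheme: str = "http", port: int = 2379) -> str:
    """Build the metrics URL from scheme, host, and port."""
    return f"{scheme}://{host}:{port}/metrics"


def metrics_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200.

    urlopen follows redirects by default, similar to curl -L.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, timeouts, and refused connections
        return False


# Example call (run it from the Zabbix server or proxy host):
#   metrics_reachable(metrics_url("<etcd_node_address>"))
```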

Additional points to consider:

If you have specified a non-standard port for etcd, don't forget to change the macros {$ETCD.SCHEME} and {$ETCD.PORT}.
You can set the {$ETCD.USER} and {$ETCD.PASSWORD} macros in the template for use at the host level if necessary.
To test availability, run: zabbix_get -s etcd-host -k etcd.health.
See the Macros used section; the macro values set the trigger thresholds.

Macros used

Name	Description	Default
{$ETCD.PORT}	The port of `etcd` API endpoint.	`2379`
{$ETCD.SCHEME}	The request scheme which may be `http` or `https`.	`http`
{$ETCD.USER}
{$ETCD.PASSWORD}
{$ETCD.LEADER.CHANGES.MAX.WARN}	The maximum number of leader changes.	`5`
{$ETCD.PROPOSAL.FAIL.MAX.WARN}	The maximum number of proposal failures.	`2`
{$ETCD.HTTP.FAIL.MAX.WARN}	The maximum number of HTTP request failures.	`2`
{$ETCD.PROPOSAL.PENDING.MAX.WARN}	The maximum number of proposals in queue.	`5`
{$ETCD.OPEN.FDS.MAX.WARN}	The maximum percentage of used file descriptors.	`90`
{$ETCD.GRPC_CODE.MATCHES}	The filter of discoverable gRPC codes. See more details on https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.	`.*`
{$ETCD.GRPC_CODE.NOT_MATCHES}	The filter to exclude discovered gRPC codes. See more details on https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.	`CHANGE_IF_NEEDED`
{$ETCD.GRPC.ERRORS.MAX.WARN}	The maximum number of gRPC request failures.	`1`
{$ETCD.GRPC_CODE.TRIGGER.MATCHES}	The filter of discoverable gRPC codes, which will create triggers.	`Aborted|Unavailable`
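To make the three gRPC code filter macros more concrete, here is a small Python sketch of how such regular-expression filters could combine. It assumes that a code is discovered when it matches `{$ETCD.GRPC_CODE.MATCHES}` and does not match `{$ETCD.GRPC_CODE.NOT_MATCHES}`, and that a trigger is created only when the code also matches `{$ETCD.GRPC_CODE.TRIGGER.MATCHES}` (a regex alternation of `Aborted` and `Unavailable`). The sample code names and function names are illustrative, and Zabbix's own regex engine may differ in details.

```python
import re

# Default macro values taken from the table above.
GRPC_CODE_MATCHES = r".*"
GRPC_CODE_NOT_MATCHES = r"CHANGE_IF_NEEDED"
GRPC_CODE_TRIGGER_MATCHES = r"Aborted|Unavailable"


def is_discovered(code: str) -> bool:
    """A code is kept if it matches MATCHES and does not match NOT_MATCHES."""
    return (re.search(GRPC_CODE_MATCHES, code) is not None
            and re.search(GRPC_CODE_NOT_MATCHES, code) is None)


def gets_trigger(code: str) -> bool:
    """A discovered code gets a trigger only if it also matches TRIGGER.MATCHES."""
    return (is_discovered(code)
            and re.search(GRPC_CODE_TRIGGER_MATCHES, code) is not None)


# Illustrative gRPC status code names.
codes = ["OK", "Aborted", "Unavailable", "NotFound"]
discovered = [c for c in codes if is_discovered(c)]
with_triggers = [c for c in codes if gets_trigger(c)]
```

With the defaults, every code is discovered, but only `Aborted` and `Unavailable` would receive a trigger.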

Items

Name	Description	Type	Key and additional info
Etcd: Service's TCP port state		Simple check	net.tcp.service["{$ETCD.SCHEME}","{HOST.CONN}","{$ETCD.PORT}"] Preprocessing Discard unchanged with heartbeat: `10m`
Etcd: Get node metrics		HTTP agent	etcd.get_metrics
Etcd: Node health		HTTP agent	etcd.health Preprocessing JSON Path: `$.health` Boolean to decimal ⛔️Custom on fail: Set value to: `0` Discard unchanged with heartbeat: `10m`
Etcd: Server is a leader	Whether this member is a leader: 1 if it is, 0 otherwise.	Dependent item	etcd.is.leader Preprocessing Prometheus pattern: `VALUE(etcd_server_is_leader)` ⛔️Custom on fail: Set value to: `0` Discard unchanged with heartbeat: `10m`
Etcd: Server has a leader	Whether a leader exists: 1 if it exists, 0 if it does not.	Dependent item	etcd.has.leader Preprocessing Prometheus pattern: `VALUE(etcd_server_has_leader)` Discard unchanged with heartbeat: `10m`
Etcd: Leader changes	The number of leader changes the member has seen since its start.	Dependent item	etcd.leader.changes Preprocessing Prometheus pattern: `VALUE(etcd_server_leader_changes_seen_total)`
Etcd: Proposals committed per second	The number of consensus proposals committed.	Dependent item	etcd.proposals.committed.rate Preprocessing Prometheus pattern: `VALUE(etcd_server_proposals_committed_total)` Change per second
Etcd: Proposals applied per second	The number of consensus proposals applied.	Dependent item	etcd.proposals.applied.rate Preprocessing Prometheus pattern: `VALUE(etcd_server_proposals_applied_total)` Change per second
Etcd: Proposals failed per second	The number of failed proposals seen.	Dependent item	etcd.proposals.failed.rate Preprocessing Prometheus pattern: `VALUE(etcd_server_proposals_failed_total)` Change per second
Etcd: Proposals pending	The current number of pending proposals to commit.	Dependent item	etcd.proposals.pending Preprocessing Prometheus pattern: `VALUE(etcd_server_proposals_pending)`
Etcd: Reads per second	The number of read actions by `get/getRecursive`, local to this member.	Dependent item	etcd.reads.rate Preprocessing Prometheus to JSON: `etcd_debugging_store_reads_total` JavaScript: `The text is too long. Please see the template.` Change per second
Etcd: Writes per second	The number of writes (e.g., `set/compareAndDelete`) seen by this member.	Dependent item	etcd.writes.rate Preprocessing Prometheus to JSON: `etcd_debugging_store_writes_total` JavaScript: `The text is too long. Please see the template.` Change per second
Etcd: Client gRPC received bytes per second	The number of bytes received from gRPC clients per second.	Dependent item	etcd.network.grpc.received.rate Preprocessing Prometheus pattern: `VALUE(etcd_network_client_grpc_received_bytes_total)` Change per second
Etcd: Client gRPC sent bytes per second	The number of bytes sent to gRPC clients per second.	Dependent item	etcd.network.grpc.sent.rate Preprocessing Prometheus pattern: `VALUE(etcd_network_client_grpc_sent_bytes_total)` Change per second
Etcd: HTTP requests received	The number of requests received into the system (successfully parsed and `authd`).	Dependent item	etcd.http.requests.rate Preprocessing Prometheus to JSON: `etcd_http_received_total` JavaScript: `The text is too long. Please see the template.` Change per second
Etcd: HTTP 5XX	The number of handled failures of requests (non-watches), by the method (`GET/PUT` etc.), and the code `5XX`.	Dependent item	etcd.http.requests.5xx.rate Preprocessing Prometheus to JSON: `etcd_http_failed_total{code=~"5.+"}` JavaScript: `The text is too long. Please see the template.` Change per second
Etcd: HTTP 4XX	The number of handled failures of requests (non-watches), by the method (`GET/PUT` etc.), and the code `4XX`.	Dependent item	etcd.http.requests.4xx.rate Preprocessing Prometheus to JSON: `etcd_http_failed_total{code=~"4.+"}` JavaScript: `The text is too long. Please see the template.` Change per second
Etcd: RPCs received per second	The number of RPC stream messages received on the server.	Dependent item	etcd.grpc.received.rate Preprocessing Prometheus to JSON: `grpc_server_msg_received_total` JavaScript: `The text is too long. Please see the template.` Change per second
Etcd: RPCs sent per second	The number of gRPC stream messages sent by the server.	Dependent item	etcd.grpc.sent.rate Preprocessing Prometheus to JSON: `grpc_server_msg_sent_total` JavaScript: `The text is too long. Please see the template.` Change per second
Etcd: RPCs started per second	The number of RPCs started on the server.	Dependent item	etcd.grpc.started.rate Preprocessing Prometheus to JSON: `grpc_server_started_total` JavaScript: `The text is too long. Please see the template.` Change per second
Etcd: Get version		HTTP agent	etcd.get_version
Etcd: Server version	The version of the `etcd server`.	Dependent item	etcd.server.version Preprocessing JSON Path: `$.etcdserver` Discard unchanged with heartbeat: `1d`
Etcd: Cluster version	The version of the `etcd cluster`.	Dependent item	etcd.cluster.version Preprocessing JSON Path: `$.etcdcluster` Discard unchanged with heartbeat: `1d`
Etcd: DB size	The total size of the underlying database.	Dependent item	etcd.db.size Preprocessing Prometheus pattern: `VALUE(etcd_mvcc_db_total_size_in_bytes)`
Etcd: Keys compacted per second	The number of DB keys compacted per second.	Dependent item	etcd.keys.compacted.rate Preprocessing Prometheus pattern: `VALUE(etcd_debugging_mvcc_db_compaction_keys_total)` ⛔️Custom on fail: Set value to: `0` Change per second
Etcd: Keys expired per second	The number of expired keys per second.	Dependent item	etcd.keys.expired.rate Preprocessing Prometheus pattern: `VALUE(etcd_debugging_store_expires_total)` Change per second
Etcd: Keys total	The total number of keys.	Dependent item	etcd.keys.total Preprocessing Prometheus pattern: `VALUE(etcd_debugging_mvcc_keys_total)`
Etcd: Uptime	`Etcd` server uptime.	Dependent item	etcd.uptime Preprocessing Prometheus pattern: `VALUE(process_start_time_seconds)` JavaScript: `The text is too long. Please see the template.`
Etcd: Virtual memory	The size of virtual memory expressed in bytes.	Dependent item	etcd.virtual.bytes Preprocessing Prometheus pattern: `VALUE(process_virtual_memory_bytes)`
Etcd: Resident memory	The size of resident memory expressed in bytes.	Dependent item	etcd.res.bytes Preprocessing Prometheus pattern: `VALUE(process_resident_memory_bytes)`
Etcd: CPU	The total user and system CPU time spent in seconds.	Dependent item	etcd.cpu.util Preprocessing Prometheus pattern: `VALUE(process_cpu_seconds_total)` Change per second
Etcd: Open file descriptors	The number of open file descriptors.	Dependent item	etcd.open.fds Preprocessing Prometheus pattern: `VALUE(process_open_fds)`
Etcd: Maximum open file descriptors	The maximum number of open file descriptors.	Dependent item	etcd.max.fds Preprocessing Prometheus pattern: `VALUE(process_max_fds)`
Etcd: Deletes per second	The number of deletes seen by this member per second.	Dependent item	etcd.delete.rate Preprocessing Prometheus pattern: `VALUE(etcd_mvcc_delete_total)` Change per second
Etcd: PUT per second	The number of puts seen by this member per second.	Dependent item	etcd.put.rate Preprocessing Prometheus pattern: `VALUE(etcd_mvcc_put_total)` Change per second
Etcd: Range per second	The number of ranges seen by this member per second.	Dependent item	etcd.range.rate Preprocessing Prometheus pattern: `VALUE(etcd_debugging_mvcc_range_total)` Change per second
Etcd: Transaction per second	The number of transactions seen by this member per second.	Dependent item	etcd.txn.rate Preprocessing Prometheus pattern: `VALUE(etcd_debugging_mvcc_txn_total)` Change per second
Etcd: Pending events	The total number of pending events to be sent.	Dependent item	etcd.events.sent.rate Preprocessing Prometheus pattern: `VALUE(etcd_debugging_mvcc_pending_events_total)`
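Two preprocessing steps recur throughout the table above: "Change per second" and "Discard unchanged with heartbeat". The Python sketch below shows one plausible reading of each. It assumes that a rate is the value difference divided by the time difference, that a decreasing counter yields no value, and that an unchanged value is kept again once the heartbeat interval has elapsed. The class and function names are hypothetical and this is not Zabbix's actual implementation.

```python
from typing import Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, value)


def change_per_second(prev: Sample, curr: Sample) -> Optional[float]:
    """Rate between two counter samples; None if it cannot be computed."""
    (t0, v0), (t1, v1) = prev, curr
    if t1 <= t0 or v1 < v0:
        return None  # no time elapsed, or the counter was reset
    return (v1 - v0) / (t1 - t0)


class DiscardUnchangedWithHeartbeat:
    """Keep a value if it changed or if the heartbeat interval has elapsed."""

    def __init__(self, heartbeat_seconds: float) -> None:
        self.heartbeat = heartbeat_seconds
        self.last_kept: Optional[Sample] = None

    def keep(self, sample: Sample) -> bool:
        timestamp, value = sample
        if (self.last_kept is None
                or value != self.last_kept[1]
                or timestamp - self.last_kept[0] >= self.heartbeat):
            self.last_kept = sample
            return True
        return False


# 600 proposals committed over 60 seconds -> 10 per second.
rate = change_per_second((1000.0, 1200.0), (1060.0, 1800.0))

# An unchanged health value polled every minute with a 10-minute heartbeat.
step = DiscardUnchangedWithHeartbeat(heartbeat_seconds=600)
kept = [step.keep((60.0 * i, 1.0)) for i in range(12)]
```

In this example only the first sample and the one taken ten minutes later are kept, which is why heartbeat-based items store far fewer values than they poll.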

Triggers

Name	Description	Expression	Severity	Dependencies and additional info
Etcd: Service is unavailable		`last(/Etcd by HTTP/net.tcp.service["{$ETCD.SCHEME}","{HOST.CONN}","{$ETCD.PORT}"])=0`	Average	Manual close: Yes
Etcd: Node healthcheck failed	See more details on https://etcd.io/docs/v3.5/op-guide/monitoring/#health-check.	`last(/Etcd by HTTP/etcd.health)=0`	Average	Depends on: Etcd: Service is unavailable
Etcd: Failed to fetch info data	Zabbix has not received any data for items for the last 30 minutes.	`nodata(/Etcd by HTTP/etcd.is.leader,30m)=1`	Warning	Manual close: Yes Depends on: Etcd: Service is unavailable
Etcd: Member has no leader	If a member does not have a leader, it is totally unavailable.	`last(/Etcd by HTTP/etcd.has.leader)=0`	Average
Etcd: Instance has seen too many leader changes	Rapid leadership changes impact the performance of `etcd` significantly. It also signals that the leader is unstable, perhaps due to network connectivity issues or excessive load hitting the `etcd cluster`.	`(max(/Etcd by HTTP/etcd.leader.changes,15m)-min(/Etcd by HTTP/etcd.leader.changes,15m))>{$ETCD.LEADER.CHANGES.MAX.WARN}`	Warning
Etcd: Too many proposal failures	Normally related to two issues: temporary failures related to a leader election or longer downtime caused by a loss of quorum in the cluster.	`min(/Etcd by HTTP/etcd.proposals.failed.rate,5m)>{$ETCD.PROPOSAL.FAIL.MAX.WARN}`	Warning
Etcd: Too many proposals are queued to commit	A rising number of pending proposals suggests high client load or that the member cannot commit proposals.	`min(/Etcd by HTTP/etcd.proposals.pending,5m)>{$ETCD.PROPOSAL.PENDING.MAX.WARN}`	Warning
Etcd: Too many HTTP requests failures	Too many requests failed on `etcd` instance with the `5xx HTTP code`.	`min(/Etcd by HTTP/etcd.http.requests.5xx.rate,5m)>{$ETCD.HTTP.FAIL.MAX.WARN}`	Warning
Etcd: Server version has changed	Etcd version has changed. Acknowledge to close the problem manually.	`last(/Etcd by HTTP/etcd.server.version,#1)<>last(/Etcd by HTTP/etcd.server.version,#2) and length(last(/Etcd by HTTP/etcd.server.version))>0`	Info	Manual close: Yes
Etcd: Cluster version has changed	Etcd version has changed. Acknowledge to close the problem manually.	`last(/Etcd by HTTP/etcd.cluster.version,#1)<>last(/Etcd by HTTP/etcd.cluster.version,#2) and length(last(/Etcd by HTTP/etcd.cluster.version))>0`	Info	Manual close: Yes
Etcd: Host has been restarted	Uptime is less than 10 minutes.	`last(/Etcd by HTTP/etcd.uptime)<10m`	Info	Manual close: Yes
Etcd: Current number of open files is too high	Heavy file descriptor usage (i.e., close to the process's file descriptor limit) indicates a potential file descriptor exhaustion issue. If the file descriptors are exhausted, `etcd` may panic because it cannot create new WAL files.	`min(/Etcd by HTTP/etcd.open.fds,5m)/last(/Etcd by HTTP/etcd.max.fds)*100>{$ETCD.OPEN.FDS.MAX.WARN}`	Warning
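The arithmetic behind two of the trigger expressions above can be sketched in Python. The thresholds are the macro defaults from the Macros used table. The function names and the sample values are hypothetical, and the lists stand in for the item history that Zabbix would read for the 15-minute and 5-minute windows.

```python
from typing import Sequence

LEADER_CHANGES_MAX_WARN = 5  # default of {$ETCD.LEADER.CHANGES.MAX.WARN}
OPEN_FDS_MAX_WARN = 90       # default of {$ETCD.OPEN.FDS.MAX.WARN}, in percent


def too_many_leader_changes(changes_15m: Sequence[float]) -> bool:
    """max(...,15m) - min(...,15m) > threshold on the leader-change counter."""
    return (max(changes_15m) - min(changes_15m)) > LEADER_CHANGES_MAX_WARN


def open_files_too_high(open_fds_5m: Sequence[float], max_fds: float) -> bool:
    """min(open_fds,5m) / last(max_fds) * 100 > threshold."""
    return min(open_fds_5m) / max_fds * 100 > OPEN_FDS_MAX_WARN


# The counter grew from 3 to 9 within the window: 6 changes, above the limit of 5.
leader_alert = too_many_leader_changes([3, 3, 4, 9])

# Even the lowest reading (930 of 1024) is above 90 percent.
fds_alert = open_files_too_high([950, 930, 960], max_fds=1024)
```

Using `min` over the window means the open-files trigger fires only when usage stayed above the threshold for the whole 5 minutes, which filters out short spikes.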

LLD rule gRPC codes discovery

Name	Description	Type	Key and additional info
gRPC codes discovery		Dependent item	etcd.grpc_code.discovery Preprocessing Prometheus to JSON: `grpc_server_handled_total` JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`

Item prototypes for gRPC codes discovery

Name	Description	Type	Key and additional info
Etcd: RPCs completed with code {#GRPC.CODE}	The number of RPCs completed on the server with grpc_code {#GRPC.CODE}.	Dependent item	etcd.grpc.handled.rate[{#GRPC.CODE}] Preprocessing Prometheus to JSON: `grpc_server_handled_total{grpc_method="{#GRPC.CODE}"}` JavaScript: `The text is too long. Please see the template.` Change per second

Trigger prototypes for gRPC codes discovery

Name	Description	Expression	Severity	Dependencies and additional info
Etcd: Too many failed gRPC requests with code: {#GRPC.CODE}		`min(/Etcd by HTTP/etcd.grpc.handled.rate[{#GRPC.CODE}],5m)>{$ETCD.GRPC.ERRORS.MAX.WARN}`	Warning
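The JavaScript preprocessing step for the gRPC items is not shown in this document, so the following Python sketch is only a guess at the general idea: sum `grpc_server_handled_total` across all label sets and group the result by the `grpc_code` label, so that each discovered code gets one counter. The sample lines, the label values, and the function name are assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import Dict

# Hypothetical sample lines; label sets on a real etcd node will differ.
SAMPLE = """\
grpc_server_handled_total{grpc_code="OK",grpc_method="Range",grpc_service="etcdserverpb.KV",grpc_type="unary"} 120
grpc_server_handled_total{grpc_code="OK",grpc_method="Put",grpc_service="etcdserverpb.KV",grpc_type="unary"} 30
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 2
"""

LINE_RE = re.compile(r'^grpc_server_handled_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def handled_by_code(payload: str) -> Dict[str, float]:
    """Sum grpc_server_handled_total over all label sets, grouped by grpc_code."""
    totals: Dict[str, float] = defaultdict(float)
    for line in payload.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a grpc_server_handled_total sample
        labels = dict(LABEL_RE.findall(match.group("labels")))
        totals[labels.get("grpc_code", "")] += float(match.group("value"))
    return dict(totals)


totals = handled_by_code(SAMPLE)
```

The per-code totals are counters, so a "Change per second" step applied afterwards would turn them into the rates that the trigger prototype compares against `{$ETCD.GRPC.ERRORS.MAX.WARN}`.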

LLD rule Peers discovery

Name	Description	Type	Key and additional info
Peers discovery		Dependent item	etcd.peer.discovery Preprocessing Prometheus to JSON: `etcd_network_peer_sent_bytes_total`

Item prototypes for Peers discovery

Name	Description	Type	Key and additional info
Etcd: Etcd peer {#ETCD.PEER}: Bytes sent	The number of bytes sent to a peer with the ID `{#ETCD.PEER}`.	Dependent item	etcd.bytes.sent.rate[{#ETCD.PEER}] Preprocessing Prometheus pattern: `VALUE(etcd_network_peer_sent_bytes_total{To="{#ETCD.PEER}"})` ⛔️Custom on fail: Set value to: `0` Change per second
Etcd: Etcd peer {#ETCD.PEER}: Bytes received	The number of bytes received from a peer with the ID `{#ETCD.PEER}`.	Dependent item	etcd.bytes.received.rate[{#ETCD.PEER}] Preprocessing Prometheus pattern: `The text is too long. Please see the template.` ⛔️Custom on fail: Set value to: `0` Change per second
Etcd: Etcd peer {#ETCD.PEER}: Send failures	The number of send failures from a peer with the ID `{#ETCD.PEER}`.	Dependent item	etcd.sent.fail.rate[{#ETCD.PEER}] Preprocessing Prometheus pattern: `The text is too long. Please see the template.` ⛔️Custom on fail: Set value to: `0` Change per second
Etcd: Etcd peer {#ETCD.PEER}: Receive failures	The number of receive failures from a peer with the ID `{#ETCD.PEER}`.	Dependent item	etcd.received.fail.rate[{#ETCD.PEER}] Preprocessing Prometheus pattern: `The text is too long. Please see the template.` ⛔️Custom on fail: Set value to: `0` Change per second
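Peer discovery relies on the `To` label of `etcd_network_peer_sent_bytes_total`. As a hedged sketch of that idea, the Python code below collects the distinct `To` values from a metrics payload and emits one discovery row per peer with the `{#ETCD.PEER}` macro. The peer IDs and the function name are invented for illustration, and the real template may map the label to the macro differently.

```python
import json
import re
from typing import Dict, List

# Hypothetical sample; the peer IDs are made up.
SAMPLE = """\
etcd_network_peer_sent_bytes_total{To="8e9e05c52164694d"} 1024
etcd_network_peer_sent_bytes_total{To="91bc3c398fb3c146"} 2048
etcd_server_has_leader 1
"""

PEER_RE = re.compile(r'^etcd_network_peer_sent_bytes_total\{[^}]*\bTo="([^"]+)"[^}]*\}')


def discover_peers(payload: str) -> List[Dict[str, str]]:
    """Return one discovery row per distinct peer ID, in order of appearance."""
    rows: List[Dict[str, str]] = []
    seen = set()
    for line in payload.splitlines():
        match = PEER_RE.match(line.strip())
        if match and match.group(1) not in seen:
            seen.add(match.group(1))
            rows.append({"{#ETCD.PEER}": match.group(1)})
    return rows


peers = discover_peers(SAMPLE)
lld_json = json.dumps(peers)
```

Each row then drives one set of item prototypes, for example `etcd.bytes.sent.rate[{#ETCD.PEER}]` with the macro replaced by the peer ID.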

Feedback

Please report any issues with the template at https://support.zabbix.com.

You can also provide feedback, discuss the template, or ask for help at the Zabbix forums.
