# Etcd by HTTP

## Overview

This template is designed to monitor `etcd` with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

The template `Etcd by HTTP` collects metrics with the HTTP agent from the `/metrics` endpoint.

> Refer to the [vendor documentation](https://etcd.io/docs/v3.5/op-guide/monitoring/#metrics-endpoint).

**For users of `etcd` version 3.4 and earlier:**

> In `etcd v3.5` some metrics have been deprecated. See more details on [Upgrade etcd from 3.4 to 3.5](https://etcd.io/docs/v3.4/upgrades/upgrade_3_5/). Please upgrade your `etcd` instance, or use an older version of the `Etcd by HTTP` template.

## Requirements

Zabbix version: 7.0 and higher.

## Tested versions

This template has been tested on:
- Etcd 3.5.6

## Configuration

> Zabbix should be configured according to the instructions in the [Templates out of the box](https://www.zabbix.com/documentation/7.0/manual/config/templates_out_of_the_box) section.

## Setup

Follow these instructions:

1. Import the template into Zabbix.
2. After importing the template, make sure that `etcd` allows the collection of metrics. You can test it by running: `curl -L http://localhost:2379/metrics`.
3. Check that `etcd` is accessible from the Zabbix proxy or Zabbix server, depending on where you plan to do the monitoring. To verify it, run `curl -L http://<etcd-node-address>:2379/metrics`, replacing the placeholder with the address of your `etcd` node.
4. Add the template to each `etcd` node. By default, the template uses the client port. You can configure the metrics endpoint location with the `--listen-metrics-urls` flag. (For more details, see [etcd documentation](https://etcd.io/docs/v3.5/op-guide/configuration/#profiling-and-monitoring)).

Additional points to consider:

- If you have specified a non-standard port for `etcd`, don't forget to change the macros `{$ETCD.SCHEME}` and `{$ETCD.PORT}`.
- You can set the `{$ETCD.USER}` and `{$ETCD.PASSWORD}` macros in the template for use at the host level if necessary.
- To test availability, run: `zabbix_get -s etcd-host -k etcd.health`.
- See the Macros section, as the macros there set the trigger thresholds.

### Macros used

|Name|Description|Default|
|----|-----------|-------|
|{$ETCD.PORT}|The port of the `etcd` API endpoint.|`2379`|
|{$ETCD.SCHEME}|The request scheme, which may be `http` or `https`.|`http`|
|{$ETCD.USER}|||
|{$ETCD.PASSWORD}|||
|{$ETCD.LEADER.CHANGES.MAX.WARN}|The maximum number of leader changes.|`5`|
|{$ETCD.PROPOSAL.FAIL.MAX.WARN}|The maximum number of proposal failures.|`2`|
|{$ETCD.HTTP.FAIL.MAX.WARN}|The maximum number of HTTP request failures.|`2`|
|{$ETCD.PROPOSAL.PENDING.MAX.WARN}|The maximum number of proposals in queue.|`5`|
|{$ETCD.OPEN.FDS.MAX.WARN}|The maximum percentage of used file descriptors.|`90`|
|{$ETCD.GRPC_CODE.MATCHES}|The filter of discoverable gRPC codes. See more details on https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.|`.*`|
|{$ETCD.GRPC_CODE.NOT_MATCHES}|The filter to exclude discovered gRPC codes. See more details on https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.|`CHANGE_IF_NEEDED`|
|{$ETCD.GRPC.ERRORS.MAX.WARN}|The maximum number of gRPC request failures.|`1`|
|{$ETCD.GRPC_CODE.TRIGGER.MATCHES}|The filter of discoverable gRPC codes, which will create triggers.|`Aborted\|Unavailable`|

### Items

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Etcd: Service's TCP port state||Simple check|net.tcp.service["{$ETCD.SCHEME}","{HOST.CONN}","{$ETCD.PORT}"]|
|Etcd: Get node metrics||HTTP agent|etcd.get_metrics|
|Etcd: Node health||HTTP agent|etcd.health|
|Etcd: Server is a leader|Whether or not this member is a leader: 1 - it is; 0 - otherwise.|Dependent item|etcd.is.leader|
|Etcd: Server has a leader|Whether or not a leader exists: 1 - it exists; 0 - it does not.|Dependent item|etcd.has.leader|
|Etcd: Leader changes|The number of leader changes the member has seen since its start.|Dependent item|etcd.leader.changes|
|Etcd: Proposals committed per second|The number of consensus proposals committed.|Dependent item|etcd.proposals.committed.rate|
|Etcd: Proposals applied per second|The number of consensus proposals applied.|Dependent item|etcd.proposals.applied.rate|
|Etcd: Proposals failed per second|The number of failed proposals seen.|Dependent item|etcd.proposals.failed.rate|
|Etcd: Proposals pending|The current number of pending proposals to commit.|Dependent item|etcd.proposals.pending|
|Etcd: Reads per second|The number of read actions by `get/getRecursive`, local to this member.|Dependent item|etcd.reads.rate|
|Etcd: Writes per second|The number of writes (e.g., `set/compareAndDelete`) seen by this member.|Dependent item|etcd.writes.rate|
|Etcd: Client gRPC received bytes per second|The number of bytes received from gRPC clients per second.|Dependent item|etcd.network.grpc.received.rate|
|Etcd: Client gRPC sent bytes per second|The number of bytes sent to gRPC clients per second.|Dependent item|etcd.network.grpc.sent.rate|
|Etcd: HTTP requests received|The number of requests received into the system (successfully parsed and `authd`).|Dependent item|etcd.http.requests.rate|
|Etcd: HTTP 5XX|The number of handled failures of requests (non-watches), by the method (`GET/PUT` etc.), and the code `5XX`.|Dependent item|etcd.http.requests.5xx.rate|
|Etcd: HTTP 4XX|The number of handled failures of requests (non-watches), by the method (`GET/PUT` etc.), and the code `4XX`.|Dependent item|etcd.http.requests.4xx.rate|
|Etcd: RPCs received per second|The number of gRPC stream messages received on the server.|Dependent item|etcd.grpc.received.rate|
|Etcd: RPCs sent per second|The number of gRPC stream messages sent by the server.|Dependent item|etcd.grpc.sent.rate|
|Etcd: RPCs started per second|The number of RPCs started on the server.|Dependent item|etcd.grpc.started.rate|
|Etcd: Get version||HTTP agent|etcd.get_version|
|Etcd: Server version|The version of the `etcd` server.|Dependent item|etcd.server.version|
|Etcd: Cluster version|The version of the `etcd` cluster.|Dependent item|etcd.cluster.version|
|Etcd: DB size|The total size of the underlying database.|Dependent item|etcd.db.size|
|Etcd: Keys compacted per second|The number of DB keys compacted per second.|Dependent item|etcd.keys.compacted.rate|
|Etcd: Keys expired per second|The number of expired keys per second.|Dependent item|etcd.keys.expired.rate|
|Etcd: Keys total|The total number of keys.|Dependent item|etcd.keys.total|
|Etcd: Uptime|`Etcd` server uptime.|Dependent item|etcd.uptime|
|Etcd: Virtual memory|The size of virtual memory expressed in bytes.|Dependent item|etcd.virtual.bytes|
|Etcd: Resident memory|The size of resident memory expressed in bytes.|Dependent item|etcd.res.bytes|
|Etcd: CPU|The total user and system CPU time spent in seconds.|Dependent item|etcd.cpu.util|
|Etcd: Open file descriptors|The number of open file descriptors.|Dependent item|etcd.open.fds|
|Etcd: Maximum open file descriptors|The maximum number of open file descriptors.|Dependent item|etcd.max.fds|
|Etcd: Deletes per second|The number of deletes seen by this member per second.|Dependent item|etcd.delete.rate|
|Etcd: PUT per second|The number of puts seen by this member per second.|Dependent item|etcd.put.rate|
|Etcd: Range per second|The number of ranges seen by this member per second.|Dependent item|etcd.range.rate|
|Etcd: Transaction per second|The number of transactions seen by this member per second.|Dependent item|etcd.txn.rate|
|Etcd: Pending events|The total number of pending events to be sent.|Dependent item|etcd.events.sent.rate|
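
The dependent items listed here are all derived from the single text response of the `/metrics` endpoint. The following Python sketch illustrates the idea outside Zabbix; it is not the template's actual preprocessing. The sample text, the metric names in it, and the helper names `parse_metrics` and `per_second` are assumptions made for illustration, and the per-second calculation assumes that the `.rate` items are computed as the change between two counter samples.

```python
import re

# Hypothetical excerpt of a /metrics response; metric names are assumed.
SAMPLE = """\
# HELP etcd_server_has_leader Whether or not a leader exists.
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
etcd_server_proposals_committed_total 1200
process_open_fds 45
process_max_fds 1024
"""

# name, optional {labels}, value. Simplified: label values containing "}"
# are not handled.
LINE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)")


def parse_metrics(text):
    """Map 'name{labels}' to float values, skipping comments and blank lines."""
    values = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match:
            key = match.group(1) + (match.group(2) or "")
            values[key] = float(match.group(3))
    return values


def per_second(previous, current, seconds):
    """Change per second between two counter samples taken `seconds` apart."""
    return (current - previous) / seconds


metrics = parse_metrics(SAMPLE)

# Gauge value, comparable to the "Server has a leader" item.
has_leader = int(metrics["etcd_server_has_leader"])

# Percentage of used file descriptors, the quantity that
# {$ETCD.OPEN.FDS.MAX.WARN} (default 90) is described as limiting.
fds_used_pct = metrics["process_open_fds"] / metrics["process_max_fds"] * 100

# Rate from a hypothetical earlier sample of 1140 taken 60 seconds before.
commit_rate = per_second(1140.0, metrics["etcd_server_proposals_committed_total"], 60)
```

Collecting the whole response once and deriving every value from it is what keeps the number of HTTP requests to `etcd` low, regardless of how many items the template defines.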

### Triggers

|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
|Etcd: Service is unavailable||`last(/Etcd by HTTP/net.tcp.service["{$ETCD.SCHEME}","{HOST.CONN}","{$ETCD.PORT}"])=0`|Average|**Manual close**: Yes|
|Etcd: Node healthcheck failed|See more details on https://etcd.io/docs/v3.5/op-guide/monitoring/#health-check.|`last(/Etcd by HTTP/etcd.health)=0`|Average|**Depends on**:|
|Etcd: Failed to fetch info data|Zabbix has not received any data for items for the last 30 minutes.|`nodata(/Etcd by HTTP/etcd.is.leader,30m)=1`|Warning|**Manual close**: Yes. **Depends on**:|
|Etcd: Member has no leader|If a member does not have a leader, it is totally unavailable.|`last(/Etcd by HTTP/etcd.has.leader)=0`|Average||
|Etcd: Instance has seen too many leader changes|Rapid leadership changes impact the performance of `etcd` significantly. They also signal that the leader is unstable, perhaps due to network connectivity issues or excessive load hitting the `etcd` cluster.|`(max(/Etcd by HTTP/etcd.leader.changes,15m)-min(/Etcd by HTTP/etcd.leader.changes,15m))>{$ETCD.LEADER.CHANGES.MAX.WARN}`|Warning||
|Etcd: Too many proposal failures|Normally related to two issues: temporary failures related to a leader election or longer downtime caused by a loss of quorum in the cluster.|`min(/Etcd by HTTP/etcd.proposals.failed.rate,5m)>{$ETCD.PROPOSAL.FAIL.MAX.WARN}`|Warning||
|Etcd: Too many proposals are queued to commit|A rising number of pending proposals suggests that there is a high client load or that the member cannot commit proposals.|`min(/Etcd by HTTP/etcd.proposals.pending,5m)>{$ETCD.PROPOSAL.PENDING.MAX.WARN}`|Warning||
|Etcd: Too many HTTP requests failures|Too many requests failed on the `etcd` instance with a `5xx` HTTP code.|`min(/Etcd by HTTP/etcd.http.requests.5xx.rate,5m)>{$ETCD.HTTP.FAIL.MAX.WARN}`|Warning||
|Etcd: Server version has changed|The `etcd` server version has changed. Acknowledge to close the problem manually.|`last(/Etcd by HTTP/etcd.server.version,#1)<>last(/Etcd by HTTP/etcd.server.version,#2) and length(last(/Etcd by HTTP/etcd.server.version))>0`|Info|**Manual close**: Yes|
|Etcd: Cluster version has changed|The `etcd` cluster version has changed. Acknowledge to close the problem manually.|`last(/Etcd by HTTP/etcd.cluster.version,#1)<>last(/Etcd by HTTP/etcd.cluster.version,#2) and length(last(/Etcd by HTTP/etcd.cluster.version))>0`|Info|**Manual close**: Yes|
|Etcd: Host has been restarted|Uptime is less than 10 minutes.|`last(/Etcd by HTTP/etcd.uptime)<10m`|Info|**Manual close**: Yes|
|Etcd: Current number of open files is too high|Heavy file descriptor usage (i.e., near the process's file descriptor limit) indicates a potential file descriptor exhaustion issue. If the file descriptors are exhausted, `etcd` may panic because it cannot create new WAL files.|`min(/Etcd by HTTP/etcd.open.fds,5m)/last(/Etcd by HTTP/etcd.max.fds)*100>{$ETCD.OPEN.FDS.MAX.WARN}`|Warning||

### LLD rule gRPC codes discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|gRPC codes discovery||Dependent item|etcd.grpc_code.discovery|
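
The gRPC codes discovery is governed by three filter macros from the Macros section: `{$ETCD.GRPC_CODE.MATCHES}`, `{$ETCD.GRPC_CODE.NOT_MATCHES}`, and `{$ETCD.GRPC_CODE.TRIGGER.MATCHES}`. The Python sketch below shows how their default values would combine, assuming the filters behave like unanchored regular expressions. The function name `discover` and the list of sample codes are hypothetical; this is an illustration of the filtering logic, not Zabbix's implementation.

```python
import re

# Default filter values copied from the Macros table.
MATCHES = r".*"                           # {$ETCD.GRPC_CODE.MATCHES}
NOT_MATCHES = r"CHANGE_IF_NEEDED"         # {$ETCD.GRPC_CODE.NOT_MATCHES}
TRIGGER_MATCHES = r"Aborted|Unavailable"  # {$ETCD.GRPC_CODE.TRIGGER.MATCHES}


def discover(codes, matches=MATCHES, not_matches=NOT_MATCHES,
             trigger_matches=TRIGGER_MATCHES):
    """Return {code: gets_trigger} for the codes that pass both filters.

    Assumption: each filter is applied as an unanchored regular expression.
    """
    result = {}
    for code in codes:
        if not re.search(matches, code):
            continue  # not selected by the include filter
        if re.search(not_matches, code):
            continue  # removed by the exclude filter
        result[code] = bool(re.search(trigger_matches, code))
    return result


# Hypothetical set of gRPC codes found in the metrics output.
sample_codes = ["OK", "Aborted", "Unavailable", "NotFound"]

# With the defaults, every code is discovered and two of them get triggers.
discovered = discover(sample_codes)

# Excluding successful calls by overriding the exclude filter.
without_ok = discover(sample_codes, not_matches=r"^OK$")
```

Under these assumptions, the default `CHANGE_IF_NEEDED` value excludes nothing, because no gRPC status code contains that string. To drop a code from discovery, replace the value with a pattern that matches the code's name.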

### Item prototypes for gRPC codes discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Etcd: RPCs completed with code {#GRPC.CODE}|The number of RPCs completed on the server with grpc_code {#GRPC.CODE}.|Dependent item|etcd.grpc.handled.rate[{#GRPC.CODE}]|

### Trigger prototypes for gRPC codes discovery

|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
|Etcd: Too many failed gRPC requests with code: {#GRPC.CODE}||`min(/Etcd by HTTP/etcd.grpc.handled.rate[{#GRPC.CODE}],5m)>{$ETCD.GRPC.ERRORS.MAX.WARN}`|Warning||

### LLD rule Peers discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Peers discovery||Dependent item|etcd.peer.discovery|

### Item prototypes for Peers discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Etcd: Etcd peer {#ETCD.PEER}: Bytes sent|The number of bytes sent to a peer with the ID `{#ETCD.PEER}`.|Dependent item|etcd.bytes.sent.rate[{#ETCD.PEER}]|
|Etcd: Etcd peer {#ETCD.PEER}: Bytes received|The number of bytes received from a peer with the ID `{#ETCD.PEER}`.|Dependent item|etcd.bytes.received.rate[{#ETCD.PEER}]|
|Etcd: Etcd peer {#ETCD.PEER}: Send failures|The number of send failures from a peer with the ID `{#ETCD.PEER}`.|Dependent item|etcd.sent.fail.rate[{#ETCD.PEER}]|
|Etcd: Etcd peer {#ETCD.PEER}: Receive failures|The number of receive failures from a peer with the ID `{#ETCD.PEER}`.|Dependent item|etcd.received.fail.rate[{#ETCD.PEER}]|

## Feedback

Please report any issues with the template at [`https://support.zabbix.com`](https://support.zabbix.com).

You can also provide feedback, discuss the template, or ask for help at [`ZABBIX forums`](https://www.zabbix.com/forum/zabbix-suggestions-and-feedback).