You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

177 lines
16 KiB

1 year ago
# TiDB PD by HTTP
## Overview
The template to monitor PD server of TiDB cluster by Zabbix that works without any external scripts.
Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.
Template `TiDB PD by HTTP` — collects metrics by HTTP agent from PD /metrics endpoint and from monitoring API.
See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api.
## Requirements
Zabbix version: 7.0 and higher.
## Tested versions
This template has been tested on:
- TiDB cluster 4.0.10, 6.5.1
## Configuration
> Zabbix should be configured according to the instructions in the [Templates out of the box](https://www.zabbix.com/documentation/7.0/manual/config/templates_out_of_the_box) section.
## Setup
This template works with PD server of TiDB cluster.
Internal service metrics are collected from PD /metrics endpoint and from monitoring API.
See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api.
Don't forget to change the macros {$PD.URL}, {$PD.PORT}.
Also, see the Macros section for a list of macros used to set trigger values.
### Macros used
|Name|Description|Default|
|----|-----------|-------|
|{$PD.PORT}|<p>The port of PD server metrics web endpoint</p>|`2379`|
|{$PD.URL}|<p>PD server URL</p>|`localhost`|
|{$PD.MISS_REGION.MAX.WARN}|<p>Maximum number of missed regions</p>|`100`|
|{$PD.STORAGE_USAGE.MAX.WARN}|<p>Maximum percentage of cluster space used</p>|`80`|
### Items
|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|PD: Get instance metrics|<p>Get TiDB PD instance metrics.</p>|HTTP agent|pd.get_metrics<p>**Preprocessing**</p><ul><li><p>Check for not supported value</p><p>Custom on fail: Discard value</p></li><li>Prometheus to JSON</li></ul>|
|PD: Get instance status|<p>Get TiDB PD instance status info.</p>|HTTP agent|pd.get_status<p>**Preprocessing**</p><ul><li><p>Check for not supported value</p><p>Custom on fail: Set value to: `{"status": "0"}`</p></li></ul>|
|PD: Status|<p>Status of PD instance.</p>|Dependent item|pd.status<p>**Preprocessing**</p><ul><li><p>JSON Path: `$.status`</p><p>Custom on fail: Set value to: `1`</p></li><li><p>Discard unchanged with heartbeat: `1h`</p></li></ul>|
|PD: gRPC Commands total, rate|<p>The rate at which gRPC commands are completed.</p>|Dependent item|pd.grpc_command.rate<p>**Preprocessing**</p><ul><li><p>JSON Path: `The text is too long. Please see the template.`</p><p>Custom on fail: Discard value</p></li><li>Change per second</li></ul>|
|PD: Version|<p>Version of the PD instance.</p>|Dependent item|pd.version<p>**Preprocessing**</p><ul><li><p>JSON Path: `$.version`</p></li><li><p>Discard unchanged with heartbeat: `3h`</p></li></ul>|
|PD: Uptime|<p>The runtime of each PD instance.</p>|Dependent item|pd.uptime<p>**Preprocessing**</p><ul><li><p>JSON Path: `$.start_timestamp`</p></li><li><p>JavaScript: `The text is too long. Please see the template.`</p></li></ul>|
|PD: Get cluster metrics|<p>Get cluster metrics.</p>|Dependent item|pd.cluster_status.get_metrics<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.name == "pd_cluster_status")]`</p><p>Custom on fail: Discard value</p></li></ul>|
|PD: Get region metrics|<p>Get region metrics.</p>|Dependent item|pd.regions.get_metrics<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.name == "pd_scheduler_region_heartbeat")]`</p><p>Custom on fail: Discard value</p></li></ul>|
|PD: Get region label metrics|<p>Get region label metrics.</p>|Dependent item|pd.region_labels.get_metrics<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.name == "pd_regions_label_level")]`</p><p>Custom on fail: Discard value</p></li></ul>|
|PD: Get region status metrics|<p>Get region status metrics.</p>|Dependent item|pd.region_status.get_metrics<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.name == "pd_regions_status")]`</p><p>Custom on fail: Discard value</p></li></ul>|
|PD: Get gRPC command metrics|<p>Get gRPC command metrics.</p>|Dependent item|pd.grpc_commands.get_metrics<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.name == "grpc_server_handling_seconds_count")]`</p><p>Custom on fail: Discard value</p></li></ul>|
|PD: Get scheduler metrics|<p>Get scheduler metrics.</p>|Dependent item|pd.scheduler.get_metrics<p>**Preprocessing**</p><ul><li><p>JSON Path: `The text is too long. Please see the template.`</p><p>Custom on fail: Discard value</p></li></ul>|
### Triggers
|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
|PD: Instance is not responding||`last(/TiDB PD by HTTP/pd.status)=0`|Average||
|PD: Version has changed|<p>PD version has changed. Acknowledge to close the problem manually.</p>|`last(/TiDB PD by HTTP/pd.version,#1)<>last(/TiDB PD by HTTP/pd.version,#2) and length(last(/TiDB PD by HTTP/pd.version))>0`|Info|**Manual close**: Yes|
|PD: has been restarted|<p>Uptime is less than 10 minutes.</p>|`last(/TiDB PD by HTTP/pd.uptime)<10m`|Info|**Manual close**: Yes|
### LLD rule Cluster metrics discovery
|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Cluster metrics discovery|<p>Discovery cluster specific metrics.</p>|Dependent item|pd.cluster.discovery<p>**Preprocessing**</p><ul><li><p>JavaScript: `The text is too long. Please see the template.`</p></li><li><p>Discard unchanged with heartbeat: `1h`</p></li></ul>|
### Item prototypes for Cluster metrics discovery
|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|TiDB cluster: Offline stores||Dependent item|pd.cluster_status.store_offline[{#SINGLETON}]<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.labels.type == "store_offline_count")].value.first()`</p></li><li><p>Discard unchanged with heartbeat: `1h`</p></li></ul>|
|TiDB cluster: Tombstone stores|<p>The count of tombstone stores.</p>|Dependent item|pd.cluster_status.store_tombstone[{#SINGLETON}]<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.labels.type == "store_tombstone_count")].value.first()`</p></li><li><p>Discard unchanged with heartbeat: `1h`</p></li></ul>|
|TiDB cluster: Down stores|<p>The count of down stores.</p>|Dependent item|pd.cluster_status.store_down[{#SINGLETON}]<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.labels.type == "store_down_count")].value.first()`</p></li><li><p>Discard unchanged with heartbeat: `1h`</p></li></ul>|
|TiDB cluster: Lowspace stores|<p>The count of low space stores.</p>|Dependent item|pd.cluster_status.store_low_space[{#SINGLETON}]<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.labels.type == "store_low_space_count")].value.first()`</p></li><li><p>Discard unchanged with heartbeat: `1h`</p></li></ul>|
|TiDB cluster: Unhealth stores|<p>The count of unhealthy stores.</p>|Dependent item|pd.cluster_status.store_unhealth[{#SINGLETON}]<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.labels.type == "store_unhealth_count")].value.first()`</p></li><li><p>Discard unchanged with heartbeat: `1h`</p></li></ul>|
|TiDB cluster: Disconnect stores|<p>The count of disconnected stores.</p>|Dependent item|pd.cluster_status.store_disconnected[{#SINGLETON}]<p>**Preprocessing**</p><ul><li><p>JSON Path: `The text is too long. Please see the template.`</p></li><li><p>Discard unchanged with heartbeat: `1h`</p></li></ul>|
|TiDB cluster: Normal stores|<p>The count of healthy storage instances.</p>|Dependent item|pd.cluster_status.store_up[{#SINGLETON}]<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.labels.type == "store_up_count")].value.first()`</p></li><li><p>Discard unchanged with heartbeat: `1h`</p></li></ul>|
|TiDB cluster: Storage capacity|<p>The total storage capacity for this TiDB cluster.</p>|Dependent item|pd.cluster_status.storage_capacity[{#SINGLETON}]<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.labels.type == "storage_capacity")].value.first()`</p></li><li><p>Discard unchanged with heartbeat: `1h`</p></li></ul>|
|TiDB cluster: Storage size|<p>The storage size that is currently used by the TiDB cluster.</p>|Dependent item|pd.cluster_status.storage_size[{#SINGLETON}]<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.labels.type == "storage_size")].value.first()`</p></li></ul>|
|TiDB cluster: Number of regions|<p>The total count of cluster Regions.</p>|Dependent item|pd.cluster_status.leader_count[{#SINGLETON}]<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.labels.type == "leader_count")].value.first()`</p></li></ul>|
|TiDB cluster: Current peer count|<p>The current count of all cluster peers.</p>|Dependent item|pd.cluster_status.region_count[{#SINGLETON}]<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.labels.type == "region_count")].value.first()`</p></li></ul>|
### Trigger prototypes for Cluster metrics discovery
|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
|TiDB cluster: There are offline TiKV nodes|<p>PD has not received a TiKV heartbeat for a long time.</p>|`last(/TiDB PD by HTTP/pd.cluster_status.store_down[{#SINGLETON}])>0`|Average||
|TiDB cluster: There are low space TiKV nodes|<p>Indicates that there is no sufficient space on the TiKV node.</p>|`last(/TiDB PD by HTTP/pd.cluster_status.store_low_space[{#SINGLETON}])>0`|Average||
|TiDB cluster: There are disconnected TiKV nodes|<p>PD does not receive a TiKV heartbeat within 20 seconds. Normally a TiKV heartbeat comes in every 10 seconds.</p>|`last(/TiDB PD by HTTP/pd.cluster_status.store_disconnected[{#SINGLETON}])>0`|Warning||
|TiDB cluster: Current storage usage is too high|<p>Over {$PD.STORAGE_USAGE.MAX.WARN}% of the cluster space is occupied.</p>|`min(/TiDB PD by HTTP/pd.cluster_status.storage_size[{#SINGLETON}],5m)/last(/TiDB PD by HTTP/pd.cluster_status.storage_capacity[{#SINGLETON}])*100>{$PD.STORAGE_USAGE.MAX.WARN}`|Warning||
### LLD rule Region labels discovery
|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Region labels discovery|<p>Discovery region labels specific metrics.</p>|Dependent item|pd.region_labels.discovery<p>**Preprocessing**</p><ul><li><p>JavaScript: `The text is too long. Please see the template.`</p></li><li><p>Discard unchanged with heartbeat: `1h`</p></li></ul>|
### Item prototypes for Region labels discovery
|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|TiDB cluster: Regions label: {#TYPE}|<p>The number of Regions in different label levels.</p>|Dependent item|pd.region_labels[{#TYPE}]<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.labels.type == "{#TYPE}")].value.first()`</p></li></ul>|
### LLD rule Region status discovery
|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Region status discovery|<p>Discovery region status specific metrics.</p>|Dependent item|pd.region_status.discovery<p>**Preprocessing**</p><ul><li><p>JavaScript: `The text is too long. Please see the template.`</p></li><li><p>Discard unchanged with heartbeat: `1h`</p></li></ul>|
### Item prototypes for Region status discovery
|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|TiDB cluster: Regions status: {#TYPE}|<p>The health status of Regions indicated via the count of unusual Regions including pending peers, down peers, extra peers, offline peers, missing peers, learner peers and incorrect namespaces.</p>|Dependent item|pd.region_status[{#TYPE}]<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.labels.type == "{#TYPE}")].value.first()`</p></li></ul>|
### Trigger prototypes for Region status discovery
|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
|TiDB cluster: Too many missed regions|<p>The number of Region replicas is smaller than the value of max-replicas. When a TiKV machine is down and its downtime exceeds max-down-time, it usually leads to missing replicas for some Regions during a period of time. When a TiKV node is made offline, it might result in a small number of Regions with missing replicas.</p>|`min(/TiDB PD by HTTP/pd.region_status[{#TYPE}],5m)>{$PD.MISS_REGION.MAX.WARN}`|Warning||
|TiDB cluster: There are unresponsive peers|<p>The number of Regions with an unresponsive peer reported by the Raft leader.</p>|`min(/TiDB PD by HTTP/pd.region_status[{#TYPE}],5m)>0`|Warning||
### LLD rule Running scheduler discovery
|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Running scheduler discovery|<p>Discovery scheduler specific metrics.</p>|Dependent item|pd.scheduler.discovery<p>**Preprocessing**</p><ul><li><p>JavaScript: `The text is too long. Please see the template.`</p></li><li><p>Discard unchanged with heartbeat: `1h`</p></li></ul>|
### Item prototypes for Running scheduler discovery
|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|TiDB cluster: Scheduler status: {#KIND}|<p>The current running schedulers.</p>|Dependent item|pd.scheduler[{#KIND}]<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.labels.kind == "{#KIND}")].value.first()`</p><p>Custom on fail: Set value to: `0`</p></li></ul>|
### LLD rule gRPC commands discovery
|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|gRPC commands discovery|<p>Discovery grpc commands specific metrics.</p>|Dependent item|pd.grpc_command.discovery<p>**Preprocessing**</p><ul><li><p>JavaScript: `The text is too long. Please see the template.`</p></li><li><p>Discard unchanged with heartbeat: `1h`</p></li></ul>|
### Item prototypes for gRPC commands discovery
|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|PD: gRPC Commands: {#GRPC_METHOD}, rate|<p>The rate per command type at which gRPC commands are completed.</p>|Dependent item|pd.grpc_command.rate[{#GRPC_METHOD}]<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.labels.grpc_method == "{#GRPC_METHOD}")].value.first()`</p></li><li>Change per second</li></ul>|
### LLD rule Region discovery
|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Region discovery|<p>Discovery region specific metrics.</p>|Dependent item|pd.region.discovery<p>**Preprocessing**</p><ul><li><p>JavaScript: `The text is too long. Please see the template.`</p></li><li><p>Discard unchanged with heartbeat: `1h`</p></li></ul>|
### Item prototypes for Region discovery
|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|PD: Get metrics: {#STORE_ADDRESS}|<p>Get region metrics for {#STORE_ADDRESS}.</p>|Dependent item|pd.region_heartbeat.get_metrics[{#STORE_ADDRESS}]<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.labels.address == "{#STORE_ADDRESS}")]`</p><p>Custom on fail: Discard value</p></li></ul>|
|PD: Region heartbeat: active, rate|<p>The count of heartbeats with the ok status per second.</p>|Dependent item|pd.region_heartbeat.ok.rate[{#STORE_ADDRESS}]<p>**Preprocessing**</p><ul><li><p>JSON Path: `The text is too long. Please see the template.`</p><p>Custom on fail: Set value to: `0`</p></li><li>Change per second</li></ul>|
|PD: Region heartbeat: error, rate|<p>The count of heartbeats with the error status per second.</p>|Dependent item|pd.region_heartbeat.error.rate[{#STORE_ADDRESS}]<p>**Preprocessing**</p><ul><li><p>JSON Path: `The text is too long. Please see the template.`</p><p>Custom on fail: Set value to: `0`</p></li><li>Change per second</li></ul>|
|PD: Region heartbeat: total, rate|<p>The count of heartbeats reported to PD per instance per second.</p>|Dependent item|pd.region_heartbeat.rate[{#STORE_ADDRESS}]<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.labels.type == "report")].value.sum()`</p><p>Custom on fail: Set value to: `0`</p></li><li>Change per second</li></ul>|
|PD: Region schedule push: total, rate||Dependent item|pd.region_heartbeat.push.err.rate[{#STORE_ADDRESS}]<p>**Preprocessing**</p><ul><li><p>JSON Path: `$[?(@.labels.type == "push")].value.sum()`</p><p>Custom on fail: Set value to: `0`</p></li><li>Change per second</li></ul>|
## Feedback
Please report any issues with the template at [`https://support.zabbix.com`](https://support.zabbix.com)
You can also provide feedback, discuss the template, or ask for help at [`ZABBIX forums`](https://www.zabbix.com/forum/zabbix-suggestions-and-feedback)