# TiDB PD by HTTP

## Overview

This template monitors the PD server of a TiDB cluster with Zabbix and works without any external scripts.
Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

Template `TiDB PD by HTTP` collects metrics by HTTP agent from the PD /metrics endpoint and from the monitoring API.
See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api.

## Requirements

Zabbix version: 7.0 and higher.

## Tested versions

This template has been tested on:
- TiDB cluster 4.0.10, 6.5.1

## Configuration

> Zabbix should be configured according to the instructions in the [Templates out of the box](https://www.zabbix.com/documentation/7.0/manual/config/templates_out_of_the_box) section.

## Setup

This template works with the PD server of a TiDB cluster.
Internal service metrics are collected from the PD /metrics endpoint and from the monitoring API.
See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api.

Don't forget to change the macros {$PD.URL}, {$PD.PORT}.
Also, see the Macros section for a list of macros used to set trigger values.

### Macros used

|Name|Description|Default|
|----|-----------|-------|
|{$PD.PORT}|The port of PD server metrics web endpoint|`2379`|
|{$PD.URL}|PD server URL|`localhost`|
|{$PD.MISS_REGION.MAX.WARN}|Maximum number of missed regions|`100`|
|{$PD.STORAGE_USAGE.MAX.WARN}|Maximum percentage of cluster space used|`80`|

### Items

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|PD: Get instance metrics|Get TiDB PD instance metrics.|HTTP agent|pd.get_metrics<br>**Preprocessing**<br>Check for not supported value<br>⛔️Custom on fail: Discard value|
Get TiDB PD instance status info.
|HTTP agent|pd.get_status**Preprocessing**
Check for not supported value
⛔️Custom on fail: Set value to: `{"status": "0"}`
Status of PD instance.
|Dependent item|pd.status**Preprocessing**
JSON Path: `$.status`
⛔️Custom on fail: Set value to: `1`
Discard unchanged with heartbeat: `1h`
The rate at which gRPC commands are completed.
|Dependent item|pd.grpc_command.rate**Preprocessing**
JSON Path: `The text is too long. Please see the template.`
⛔️Custom on fail: Discard value
Version of the PD instance.
|Dependent item|pd.version**Preprocessing**
JSON Path: `$.version`
Discard unchanged with heartbeat: `3h`
The runtime of each PD instance.
|Dependent item|pd.uptime**Preprocessing**
JSON Path: `$.start_timestamp`
JavaScript: `The text is too long. Please see the template.`
Get cluster metrics.
|Dependent item|pd.cluster_status.get_metrics**Preprocessing**
JSON Path: `$[?(@.name == "pd_cluster_status")]`
⛔️Custom on fail: Discard value
Get region metrics.
|Dependent item|pd.regions.get_metrics**Preprocessing**
JSON Path: `$[?(@.name == "pd_scheduler_region_heartbeat")]`
⛔️Custom on fail: Discard value
Get region label metrics.
|Dependent item|pd.region_labels.get_metrics**Preprocessing**
JSON Path: `$[?(@.name == "pd_regions_label_level")]`
⛔️Custom on fail: Discard value
Get region status metrics.
|Dependent item|pd.region_status.get_metrics**Preprocessing**
JSON Path: `$[?(@.name == "pd_regions_status")]`
⛔️Custom on fail: Discard value
Get gRPC command metrics.
|Dependent item|pd.grpc_commands.get_metrics**Preprocessing**
JSON Path: `$[?(@.name == "grpc_server_handling_seconds_count")]`
⛔️Custom on fail: Discard value
Get scheduler metrics.
|Dependent item|pd.scheduler.get_metrics**Preprocessing**
JSON Path: `The text is too long. Please see the template.`
⛔️Custom on fail: Discard value
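
The dependent items above each select one metric family from the master item with a JSONPath filter such as `$[?(@.name == "pd_cluster_status")]`. The following Python sketch imitates that selection outside Zabbix. It assumes the metrics reach the dependent items as a list of objects with `name`, `labels`, and `value` fields, which is what the filters imply. The helper names (`parse_metrics`, `filter_by_name`) and the sample text are hypothetical and are not part of the template.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: key="value".
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Turn Prometheus exposition text into a list of sample dicts."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append({"name": name, "labels": labels, "value": value})
    return samples


def filter_by_name(samples, name):
    """Rough equivalent of the JSONPath filter $[?(@.name == "<name>")]."""
    return [s for s in samples if s["name"] == name]


# Invented sample input; real PD output has more metrics and labels.
SAMPLE_TEXT = """\
# TYPE pd_cluster_status gauge
pd_cluster_status{type="store_up_count"} 3
pd_cluster_status{type="store_down_count"} 0
pd_regions_status{type="example_type"} 2
"""

cluster = filter_by_name(parse_metrics(SAMPLE_TEXT), "pd_cluster_status")
print([s["labels"]["type"] for s in cluster])
```

Splitting the data this way keeps a single HTTP request per polling cycle while each metric family is still processed separately.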
PD version has changed. Acknowledge to close the problem manually.
|`last(/TiDB PD by HTTP/pd.version,#1)<>last(/TiDB PD by HTTP/pd.version,#2) and length(last(/TiDB PD by HTTP/pd.version))>0`|Info|**Manual close**: Yes|
|PD: has been restarted|Uptime is less than 10 minutes.|`last(/TiDB PD by HTTP/pd.uptime)<10m`|Info|**Manual close**: Yes|

### LLD rule Cluster metrics discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Cluster metrics discovery|Discovers cluster-specific metrics.|Dependent item|pd.cluster.discovery<br>**Preprocessing**<br>JavaScript: `The text is too long. Please see the template.`<br>Discard unchanged with heartbeat: `1h`|
**Preprocessing**
JSON Path: `$[?(@.labels.type == "store_offline_count")].value.first()`
Discard unchanged with heartbeat: `1h`
The count of tombstone stores.
|Dependent item|pd.cluster_status.store_tombstone[{#SINGLETON}]**Preprocessing**
JSON Path: `$[?(@.labels.type == "store_tombstone_count")].value.first()`
Discard unchanged with heartbeat: `1h`
The count of down stores.
|Dependent item|pd.cluster_status.store_down[{#SINGLETON}]**Preprocessing**
JSON Path: `$[?(@.labels.type == "store_down_count")].value.first()`
Discard unchanged with heartbeat: `1h`
The count of low space stores.
|Dependent item|pd.cluster_status.store_low_space[{#SINGLETON}]**Preprocessing**
JSON Path: `$[?(@.labels.type == "store_low_space_count")].value.first()`
Discard unchanged with heartbeat: `1h`
The count of unhealthy stores.
|Dependent item|pd.cluster_status.store_unhealth[{#SINGLETON}]**Preprocessing**
JSON Path: `$[?(@.labels.type == "store_unhealth_count")].value.first()`
Discard unchanged with heartbeat: `1h`
The count of disconnected stores.
|Dependent item|pd.cluster_status.store_disconnected[{#SINGLETON}]**Preprocessing**
JSON Path: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `1h`
The count of healthy storage instances.
|Dependent item|pd.cluster_status.store_up[{#SINGLETON}]**Preprocessing**
JSON Path: `$[?(@.labels.type == "store_up_count")].value.first()`
Discard unchanged with heartbeat: `1h`
The total storage capacity for this TiDB cluster.
|Dependent item|pd.cluster_status.storage_capacity[{#SINGLETON}]**Preprocessing**
JSON Path: `$[?(@.labels.type == "storage_capacity")].value.first()`
Discard unchanged with heartbeat: `1h`
The storage size that is currently used by the TiDB cluster.
|Dependent item|pd.cluster_status.storage_size[{#SINGLETON}]**Preprocessing**
JSON Path: `$[?(@.labels.type == "storage_size")].value.first()`
The total count of cluster Regions.
|Dependent item|pd.cluster_status.leader_count[{#SINGLETON}]**Preprocessing**
JSON Path: `$[?(@.labels.type == "leader_count")].value.first()`
The current count of all cluster peers.
|Dependent item|pd.cluster_status.region_count[{#SINGLETON}]**Preprocessing**
JSON Path: `$[?(@.labels.type == "region_count")].value.first()`
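
The cluster status items above pick a single value by its `type` label, and the storage usage trigger in this section divides `storage_size` by `storage_capacity` and compares the result with `{$PD.STORAGE_USAGE.MAX.WARN}`. Below is a minimal Python sketch of both steps. The sample values are invented, the function names are hypothetical, and the sketch uses a single sample where the trigger uses the 5-minute minimum of `storage_size`.

```python
def first_value(samples, label_type):
    """Mimic $[?(@.labels.type == "<label_type>")].value.first()."""
    for sample in samples:
        if sample["labels"].get("type") == label_type:
            return float(sample["value"])
    raise LookupError(f"no sample with type={label_type!r}")


def storage_usage_percent(samples):
    """Used space as a percentage of total capacity."""
    used = first_value(samples, "storage_size")
    capacity = first_value(samples, "storage_capacity")
    return used / capacity * 100


# Invented values for illustration only.
SAMPLES = [
    {"name": "pd_cluster_status", "labels": {"type": "storage_capacity"}, "value": "1000"},
    {"name": "pd_cluster_status", "labels": {"type": "storage_size"}, "value": "850"},
    {"name": "pd_cluster_status", "labels": {"type": "store_up_count"}, "value": "3"},
]

WARN_THRESHOLD = 80  # default of {$PD.STORAGE_USAGE.MAX.WARN}

usage = storage_usage_percent(SAMPLES)
print(f"usage={usage:.1f}% warn={usage > WARN_THRESHOLD}")
```
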
PD has not received a TiKV heartbeat for a long time.
|`last(/TiDB PD by HTTP/pd.cluster_status.store_down[{#SINGLETON}])>0`|Average||
|TiDB cluster: There are low space TiKV nodes|Indicates that there is insufficient space on the TiKV node.|`last(/TiDB PD by HTTP/pd.cluster_status.store_low_space[{#SINGLETON}])>0`|Average||
|TiDB cluster: There are disconnected TiKV nodes|PD does not receive a TiKV heartbeat within 20 seconds. Normally a TiKV heartbeat comes in every 10 seconds.|`last(/TiDB PD by HTTP/pd.cluster_status.store_disconnected[{#SINGLETON}])>0`|Warning||
|TiDB cluster: Current storage usage is too high|Over {$PD.STORAGE_USAGE.MAX.WARN}% of the cluster space is occupied.|`min(/TiDB PD by HTTP/pd.cluster_status.storage_size[{#SINGLETON}],5m)/last(/TiDB PD by HTTP/pd.cluster_status.storage_capacity[{#SINGLETON}])*100>{$PD.STORAGE_USAGE.MAX.WARN}`|Warning||

### LLD rule Region labels discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Region labels discovery|Discovers metrics specific to region labels.|Dependent item|pd.region_labels.discovery<br>**Preprocessing**<br>JavaScript: `The text is too long. Please see the template.`<br>Discard unchanged with heartbeat: `1h`|
The number of Regions in different label levels.
|Dependent item|pd.region_labels[{#TYPE}]**Preprocessing**
JSON Path: `$[?(@.labels.type == "{#TYPE}")].value.first()`
Discovers metrics specific to region status.
|Dependent item|pd.region_status.discovery**Preprocessing**
JavaScript: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `1h`
The health status of Regions indicated via the count of unusual Regions including pending peers, down peers, extra peers, offline peers, missing peers, learner peers and incorrect namespaces.
|Dependent item|pd.region_status[{#TYPE}]**Preprocessing**
JSON Path: `$[?(@.labels.type == "{#TYPE}")].value.first()`
The number of Region replicas is smaller than the value of max-replicas. When a TiKV machine is down and its downtime exceeds max-down-time, it usually leads to missing replicas for some Regions during a period of time. When a TiKV node is made offline, it might result in a small number of Regions with missing replicas.
|`min(/TiDB PD by HTTP/pd.region_status[{#TYPE}],5m)>{$PD.MISS_REGION.MAX.WARN}`|Warning||
|TiDB cluster: There are unresponsive peers|The number of Regions with an unresponsive peer reported by the Raft leader.|`min(/TiDB PD by HTTP/pd.region_status[{#TYPE}],5m)>0`|Warning||

### LLD rule Running scheduler discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Running scheduler discovery|Discovers scheduler-specific metrics.|Dependent item|pd.scheduler.discovery<br>**Preprocessing**<br>JavaScript: `The text is too long. Please see the template.`<br>Discard unchanged with heartbeat: `1h`|
The current running schedulers.
|Dependent item|pd.scheduler[{#KIND}]**Preprocessing**
JSON Path: `$[?(@.labels.kind == "{#KIND}")].value.first()`
⛔️Custom on fail: Set value to: `0`
Discovers metrics specific to gRPC commands.
|Dependent item|pd.grpc_command.discovery**Preprocessing**
JavaScript: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `1h`
The rate per command type at which gRPC commands are completed.
|Dependent item|pd.grpc_command.rate[{#GRPC_METHOD}]**Preprocessing**
JSON Path: `$[?(@.labels.grpc_method == "{#GRPC_METHOD}")].value.first()`
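
The gRPC items above report a per-second rate, while the underlying `grpc_server_handling_seconds_count` metric is a counter that only grows. The exact preprocessing steps are not shown in this document, so the following Python sketch is an assumption about how such a rate can be derived from two consecutive counter samples. The function name and the sample numbers are hypothetical.

```python
def per_second_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Per-second change between two counter samples.

    Returns None when the counter went backwards (for example after a
    PD restart), so the caller can skip that sample.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("timestamps must be strictly increasing")
    if curr_value < prev_value:
        return None
    return (curr_value - prev_value) / elapsed


# Two invented samples taken 60 seconds apart.
rate = per_second_rate(prev_value=100, prev_ts=0, curr_value=160, curr_ts=60)
print(rate)
```

Returning `None` on a counter reset avoids reporting a large negative rate right after a restart.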
Discovers region-specific metrics.
|Dependent item|pd.region.discovery**Preprocessing**
JavaScript: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `1h`
Get region metrics for {#STORE_ADDRESS}.
|Dependent item|pd.region_heartbeat.get_metrics[{#STORE_ADDRESS}]**Preprocessing**
JSON Path: `$[?(@.labels.address == "{#STORE_ADDRESS}")]`
⛔️Custom on fail: Discard value
The count of heartbeats with the ok status per second.
|Dependent item|pd.region_heartbeat.ok.rate[{#STORE_ADDRESS}]**Preprocessing**
JSON Path: `The text is too long. Please see the template.`
⛔️Custom on fail: Set value to: `0`
The count of heartbeats with the error status per second.
|Dependent item|pd.region_heartbeat.error.rate[{#STORE_ADDRESS}]**Preprocessing**
JSON Path: `The text is too long. Please see the template.`
⛔️Custom on fail: Set value to: `0`
The count of heartbeats reported to PD per instance per second.
|Dependent item|pd.region_heartbeat.rate[{#STORE_ADDRESS}]**Preprocessing**
JSON Path: `$[?(@.labels.type == "report")].value.sum()`
⛔️Custom on fail: Set value to: `0`
**Preprocessing**
JSON Path: `$[?(@.labels.type == "push")].value.sum()`
⛔️Custom on fail: Set value to: `0`
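
The heartbeat items above sum all values whose `type` label matches (for example `report` or `push`) and fall back to `0` through "Custom on fail". A small Python sketch of that behaviour follows. It assumes the aggregation fails when nothing matches, which is why the fallback exists. The function name, the `status` label, and the sample data are illustrative and not taken from the template.

```python
def sum_values(samples, label_type, on_fail=0.0):
    """Mimic $[?(@.labels.type == "<label_type>")].value.sum() with a fallback."""
    matched = [
        float(s["value"])
        for s in samples
        if s["labels"].get("type") == label_type
    ]
    if not matched:
        return on_fail  # stands in for "Custom on fail: Set value to 0"
    return sum(matched)


# Invented heartbeat samples for one store address.
HEARTBEATS = [
    {"labels": {"type": "report", "status": "ok"}, "value": "12"},
    {"labels": {"type": "report", "status": "err"}, "value": "1"},
]

print(sum_values(HEARTBEATS, "report"))
print(sum_values(HEARTBEATS, "push"))
```
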