TiDB PD by HTTP
Overview
This template monitors the PD server of a TiDB cluster with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.
The template TiDB PD by HTTP collects metrics by HTTP agent from the PD /metrics endpoint and from the monitoring API.
See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api.
Requirements
Zabbix version: 7.0 and higher.
Tested versions
This template has been tested on:
- TiDB cluster 4.0.10, 6.5.1
Configuration
Zabbix should be configured according to the instructions in the Templates out of the box section.
Setup
This template works with the PD server of a TiDB cluster. Internal service metrics are collected from the PD /metrics endpoint and from the monitoring API. See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api. Don't forget to change the macros {$PD.URL} and {$PD.PORT}. Also, see the Macros used section for the macros that set trigger thresholds.
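Before linking the template, it can help to confirm that the PD server answers on the configured address. The following is a minimal sketch, not part of the template: the function names are hypothetical, the http scheme is assumed, and the status path /pd/api/v1/status is an assumption about the PD API rather than something stated in this README.

```python
from urllib.request import urlopen


def pd_endpoints(url="localhost", port=2379):
    """Build the endpoint URLs from the {$PD.URL} and {$PD.PORT} macro values."""
    base = f"http://{url}:{port}"  # http scheme is an assumption
    return {
        "metrics": f"{base}/metrics",
        # Assumed path of the PD status API; adjust it if your PD version differs.
        "status": f"{base}/pd/api/v1/status",
    }


def check(endpoints, timeout=5):
    """Return the HTTP status code for each endpoint, or the error text on failure."""
    results = {}
    for name, endpoint in endpoints.items():
        try:
            with urlopen(endpoint, timeout=timeout) as response:
                results[name] = response.status
        except OSError as exc:  # URLError and HTTPError are subclasses of OSError
            results[name] = str(exc)
    return results
```

Calling `check(pd_endpoints("pd1.example.com", 2379))` (the host name is a placeholder) returns one entry per endpoint, which makes it easy to see whether a wrong macro value or a firewall is the problem.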
Macros used
Name | Description | Default |
---|---|---|
{$PD.PORT} | The port of PD server metrics web endpoint | 2379 |
{$PD.URL} | PD server URL | localhost |
{$PD.MISS_REGION.MAX.WARN} | Maximum number of missed regions | 100 |
{$PD.STORAGE_USAGE.MAX.WARN} | Maximum percentage of cluster space used | 80 |
Items
Name | Description | Type | Key and additional info |
---|---|---|---|
PD: Get instance metrics | Get TiDB PD instance metrics. | HTTP agent | pd.get_metrics |
PD: Get instance status | Get TiDB PD instance status info. | HTTP agent | pd.get_status |
PD: Status | Status of PD instance. | Dependent item | pd.status |
PD: gRPC Commands total, rate | The rate at which gRPC commands are completed. | Dependent item | pd.grpc_command.rate |
PD: Version | Version of the PD instance. | Dependent item | pd.version |
PD: Uptime | The runtime of each PD instance. | Dependent item | pd.uptime |
PD: Get cluster metrics | Get cluster metrics. | Dependent item | pd.cluster_status.get_metrics |
PD: Get region metrics | Get region metrics. | Dependent item | pd.regions.get_metrics |
PD: Get region label metrics | Get region label metrics. | Dependent item | pd.region_labels.get_metrics |
PD: Get region status metrics | Get region status metrics. | Dependent item | pd.region_status.get_metrics |
PD: Get gRPC command metrics | Get gRPC command metrics. | Dependent item | pd.grpc_commands.get_metrics |
PD: Get scheduler metrics | Get scheduler metrics. | Dependent item | pd.scheduler.get_metrics |
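The dependent items above all derive their values from the text that the two HTTP agent items fetch once per poll. As a rough illustration of that idea (not Zabbix's own preprocessing code), the sketch below parses Prometheus-style text and selects samples by metric name and labels. The helper names are hypothetical, and the metric name and labels in the sample are invented for the example.

```python
import re

# Matches lines such as: metric_name{label="value",other="x"} 12.5
LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Invented sample; real PD output may use different names and labels.
SAMPLE = """\
# TYPE pd_cluster_status gauge
pd_cluster_status{namespace="global",type="store_up_count"} 3
pd_cluster_status{namespace="global",type="store_down_count"} 0
"""


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def select(samples, name, **labels):
    """Return values of samples with the given name whose labels include `labels`."""
    return [
        value
        for sample_name, sample_labels, value in samples
        if sample_name == name
        and all(sample_labels.get(k) == v for k, v in labels.items())
    ]
```

With this sketch, `select(parse_metrics(SAMPLE), "pd_cluster_status", type="store_up_count")` picks one value out of a payload that was fetched only once, which is the reason the template needs just two HTTP requests per poll.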
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
PD: Instance is not responding | | last(/TiDB PD by HTTP/pd.status)=0 | Average | |
PD: Version has changed | PD version has changed. Acknowledge to close the problem manually. | last(/TiDB PD by HTTP/pd.version,#1)<>last(/TiDB PD by HTTP/pd.version,#2) and length(last(/TiDB PD by HTTP/pd.version))>0 | Info | Manual close: Yes |
PD: has been restarted | Uptime is less than 10 minutes. | last(/TiDB PD by HTTP/pd.uptime)<10m | Info | Manual close: Yes |
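The version and restart expressions can be read as two small predicates. The sketch below restates them in Python for illustration only; the function names are hypothetical, and Zabbix evaluates the real expressions itself.

```python
def version_changed(values):
    """Mirror last(#1) <> last(#2) and length(last) > 0.

    values: collected pd.version strings, ordered oldest to newest.
    """
    if len(values) < 2:
        return False  # nothing to compare yet
    latest, previous = values[-1], values[-2]
    return latest != previous and len(latest) > 0


def recently_restarted(uptime_seconds, window_seconds=10 * 60):
    """Mirror last(pd.uptime) < 10m."""
    return uptime_seconds < window_seconds
```

The length check keeps an empty version string, for example from a failed poll, from being reported as a version change.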
LLD rule Cluster metrics discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Cluster metrics discovery | Discovers cluster-specific metrics. | Dependent item | pd.cluster.discovery |
Item prototypes for Cluster metrics discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
TiDB cluster: Offline stores | | Dependent item | pd.cluster_status.store_offline[{#SINGLETON}] |
TiDB cluster: Tombstone stores | The count of tombstone stores. | Dependent item | pd.cluster_status.store_tombstone[{#SINGLETON}] |
TiDB cluster: Down stores | The count of down stores. | Dependent item | pd.cluster_status.store_down[{#SINGLETON}] |
TiDB cluster: Lowspace stores | The count of low space stores. | Dependent item | pd.cluster_status.store_low_space[{#SINGLETON}] |
TiDB cluster: Unhealth stores | The count of unhealthy stores. | Dependent item | pd.cluster_status.store_unhealth[{#SINGLETON}] |
TiDB cluster: Disconnect stores | The count of disconnected stores. | Dependent item | pd.cluster_status.store_disconnected[{#SINGLETON}] |
TiDB cluster: Normal stores | The count of healthy storage instances. | Dependent item | pd.cluster_status.store_up[{#SINGLETON}] |
TiDB cluster: Storage capacity | The total storage capacity for this TiDB cluster. | Dependent item | pd.cluster_status.storage_capacity[{#SINGLETON}] |
TiDB cluster: Storage size | The storage size that is currently used by the TiDB cluster. | Dependent item | pd.cluster_status.storage_size[{#SINGLETON}] |
TiDB cluster: Number of regions | The total count of cluster Regions. | Dependent item | pd.cluster_status.leader_count[{#SINGLETON}] |
TiDB cluster: Current peer count | The current count of all cluster peers. | Dependent item | pd.cluster_status.region_count[{#SINGLETON}] |
Trigger prototypes for Cluster metrics discovery
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
TiDB cluster: There are offline TiKV nodes | PD has not received a TiKV heartbeat for a long time. | last(/TiDB PD by HTTP/pd.cluster_status.store_down[{#SINGLETON}])>0 | Average | |
TiDB cluster: There are low space TiKV nodes | Indicates that there is no sufficient space on the TiKV node. | last(/TiDB PD by HTTP/pd.cluster_status.store_low_space[{#SINGLETON}])>0 | Average | |
TiDB cluster: There are disconnected TiKV nodes | PD does not receive a TiKV heartbeat within 20 seconds. Normally a TiKV heartbeat comes in every 10 seconds. | last(/TiDB PD by HTTP/pd.cluster_status.store_disconnected[{#SINGLETON}])>0 | Warning | |
TiDB cluster: Current storage usage is too high | Over {$PD.STORAGE_USAGE.MAX.WARN}% of the cluster space is occupied. | min(/TiDB PD by HTTP/pd.cluster_status.storage_size[{#SINGLETON}],5m)/last(/TiDB PD by HTTP/pd.cluster_status.storage_capacity[{#SINGLETON}])*100>{$PD.STORAGE_USAGE.MAX.WARN} | Warning | |
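The storage usage expression divides the lowest storage size seen in the last 5 minutes by the latest capacity and compares the percentage with {$PD.STORAGE_USAGE.MAX.WARN}. The sketch below restates that arithmetic with a hypothetical function name; the handling of missing data is a simplification and does not reflect how Zabbix treats unsupported values.

```python
def storage_usage_too_high(size_samples, capacity, max_warn=80):
    """Evaluate min(storage_size, 5m) / last(storage_capacity) * 100 > max_warn.

    size_samples: storage_size values collected in the last 5 minutes.
    capacity: the most recent storage_capacity value, in the same unit.
    max_warn: threshold in percent, as set by {$PD.STORAGE_USAGE.MAX.WARN}.
    """
    if not size_samples or capacity <= 0:
        return False  # simplification: report no problem when data is missing
    usage_percent = min(size_samples) / capacity * 100
    return usage_percent > max_warn
```

Because the minimum is used, every sample in the window has to be above the threshold before the trigger fires, so a short spike in used space does not raise a problem.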
LLD rule Region labels discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Region labels discovery | Discovers metrics specific to region labels. | Dependent item | pd.region_labels.discovery |
Item prototypes for Region labels discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
TiDB cluster: Regions label: {#TYPE} | The number of Regions in different label levels. | Dependent item | pd.region_labels[{#TYPE}] |
LLD rule Region status discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Region status discovery | Discovers metrics specific to region status. | Dependent item | pd.region_status.discovery |
Item prototypes for Region status discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
TiDB cluster: Regions status: {#TYPE} | The health status of Regions indicated via the count of unusual Regions including pending peers, down peers, extra peers, offline peers, missing peers, learner peers and incorrect namespaces. | Dependent item | pd.region_status[{#TYPE}] |
Trigger prototypes for Region status discovery
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
TiDB cluster: Too many missed regions | The number of Region replicas is smaller than the value of max-replicas. When a TiKV machine is down and its downtime exceeds max-down-time, it usually leads to missing replicas for some Regions during a period of time. When a TiKV node is made offline, it might result in a small number of Regions with missing replicas. | min(/TiDB PD by HTTP/pd.region_status[{#TYPE}],5m)>{$PD.MISS_REGION.MAX.WARN} | Warning | |
TiDB cluster: There are unresponsive peers | The number of Regions with an unresponsive peer reported by the Raft leader. | min(/TiDB PD by HTTP/pd.region_status[{#TYPE}],5m)>0 | Warning | |
LLD rule Running scheduler discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Running scheduler discovery | Discovers scheduler-specific metrics. | Dependent item | pd.scheduler.discovery |
Item prototypes for Running scheduler discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
TiDB cluster: Scheduler status: {#KIND} | The current running schedulers. | Dependent item | pd.scheduler[{#KIND}] |
LLD rule gRPC commands discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
gRPC commands discovery | Discovers metrics specific to gRPC commands. | Dependent item | pd.grpc_command.discovery |
Item prototypes for gRPC commands discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
PD: gRPC Commands: {#GRPC_METHOD}, rate | The rate per command type at which gRPC commands are completed. | Dependent item | pd.grpc_command.rate[{#GRPC_METHOD}] |
LLD rule Region discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Region discovery | Discovers region-specific metrics. | Dependent item | pd.region.discovery |
Item prototypes for Region discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
PD: Get metrics: {#STORE_ADDRESS} | Get region metrics for {#STORE_ADDRESS}. | Dependent item | pd.region_heartbeat.get_metrics[{#STORE_ADDRESS}] |
PD: Region heartbeat: active, rate | The count of heartbeats with the ok status per second. | Dependent item | pd.region_heartbeat.ok.rate[{#STORE_ADDRESS}] |
PD: Region heartbeat: error, rate | The count of heartbeats with the error status per second. | Dependent item | pd.region_heartbeat.error.rate[{#STORE_ADDRESS}] |
PD: Region heartbeat: total, rate | The count of heartbeats reported to PD per instance per second. | Dependent item | pd.region_heartbeat.rate[{#STORE_ADDRESS}] |
PD: Region schedule push: total, rate | | Dependent item | pd.region_heartbeat.push.err.rate[{#STORE_ADDRESS}] |
Feedback
Please report any issues with the template at https://support.zabbix.com.
You can also provide feedback, discuss the template, or ask for help at the ZABBIX forums.