TiDB PD by HTTP

Overview

This template monitors the PD server of a TiDB cluster with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

The template TiDB PD by HTTP collects metrics with the HTTP agent from the PD /metrics endpoint and from the monitoring API. See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api.

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

TiDB cluster 4.0.10, 6.5.1

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

This template works with the PD server of a TiDB cluster. Internal service metrics are collected from the PD /metrics endpoint and from the monitoring API. See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api. Remember to set the macros {$PD.URL} and {$PD.PORT} to match your PD server. See also the Macros used section for the macros that set trigger thresholds.
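
As a rough illustration of how the two macros combine, the sketch below builds the URLs that would be polled for one PD instance. The function name `pd_endpoints`, the default `http://` scheme, and the `/pd/api/v1/status` path are assumptions made for this example; check the template itself for the exact URLs it requests.

```python
def pd_endpoints(pd_url: str = "localhost", pd_port: int = 2379) -> dict:
    """Build the metrics and status URLs from {$PD.URL} and {$PD.PORT}."""
    # Assumption: a bare host name gets an http:// scheme prepended.
    base = pd_url if "://" in pd_url else f"http://{pd_url}"
    base = f"{base.rstrip('/')}:{pd_port}"
    return {
        "metrics": f"{base}/metrics",
        # Assumed path of the PD status API, not confirmed by this README.
        "status": f"{base}/pd/api/v1/status",
    }


if __name__ == "__main__":
    for name, url in pd_endpoints().items():
        print(name, url)
```

With the default macro values this points both requests at port 2379 on localhost, so the macros usually need changing when the Zabbix server or proxy does not run on the PD host.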

Macros used

Name	Description	Default
{$PD.PORT}	The port of the PD server metrics web endpoint	`2379`
{$PD.URL}	PD server URL	`localhost`
{$PD.MISS_REGION.MAX.WARN}	Maximum number of missed regions	`100`
{$PD.STORAGE_USAGE.MAX.WARN}	Maximum percentage of cluster space used	`80`

Items

Name	Description	Type	Key and additional info
PD: Get instance metrics	Get TiDB PD instance metrics.	HTTP agent	pd.get_metrics Preprocessing Check for not supported value ⛔️Custom on fail: Discard value Prometheus to JSON
PD: Get instance status	Get TiDB PD instance status info.	HTTP agent	pd.get_status Preprocessing Check for not supported value ⛔️Custom on fail: Set value to: `{"status": "0"}`
PD: Status	Status of PD instance.	Dependent item	pd.status Preprocessing JSON Path: `$.status` ⛔️Custom on fail: Set value to: `1` Discard unchanged with heartbeat: `1h`
PD: gRPC Commands total, rate	The rate at which gRPC commands are completed.	Dependent item	pd.grpc_command.rate Preprocessing JSON Path: `The text is too long. Please see the template.` ⛔️Custom on fail: Discard value Change per second
PD: Version	Version of the PD instance.	Dependent item	pd.version Preprocessing JSON Path: `$.version` Discard unchanged with heartbeat: `3h`
PD: Uptime	The runtime of each PD instance.	Dependent item	pd.uptime Preprocessing JSON Path: `$.start_timestamp` JavaScript: `The text is too long. Please see the template.`
PD: Get cluster metrics	Get cluster metrics.	Dependent item	pd.cluster_status.get_metrics Preprocessing JSON Path: `$[?(@.name == "pd_cluster_status")]` ⛔️Custom on fail: Discard value
PD: Get region metrics	Get region metrics.	Dependent item	pd.regions.get_metrics Preprocessing JSON Path: `$[?(@.name == "pd_scheduler_region_heartbeat")]` ⛔️Custom on fail: Discard value
PD: Get region label metrics	Get region label metrics.	Dependent item	pd.region_labels.get_metrics Preprocessing JSON Path: `$[?(@.name == "pd_regions_label_level")]` ⛔️Custom on fail: Discard value
PD: Get region status metrics	Get region status metrics.	Dependent item	pd.region_status.get_metrics Preprocessing JSON Path: `$[?(@.name == "pd_regions_status")]` ⛔️Custom on fail: Discard value
PD: Get gRPC command metrics	Get gRPC command metrics.	Dependent item	pd.grpc_commands.get_metrics Preprocessing JSON Path: `$[?(@.name == "grpc_server_handling_seconds_count")]` ⛔️Custom on fail: Discard value
PD: Get scheduler metrics	Get scheduler metrics.	Dependent item	pd.scheduler.get_metrics Preprocessing JSON Path: `The text is too long. Please see the template.` ⛔️Custom on fail: Discard value
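
The items above follow one pattern: a master item converts the Prometheus text to JSON, and dependent items filter that JSON by metric name. The sketch below imitates the pattern outside Zabbix. It is a simplified stand-in for the "Prometheus to JSON" step, not the Zabbix implementation; the sample metric lines and the helper names `prometheus_to_rows` and `select_by_name` are invented for illustration.

```python
import json
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Invented sample input, shaped like a /metrics response.
SAMPLE_METRICS = """\
# TYPE pd_cluster_status gauge
pd_cluster_status{type="store_up_count"} 3
pd_cluster_status{type="store_down_count"} 0
pd_regions_status{type="example_status"} 12
grpc_server_handling_seconds_count{grpc_method="GetRegion"} 120
"""


def prometheus_to_rows(text):
    """Turn Prometheus text lines into dicts with name, value and labels."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, HELP and TYPE comments
        match = SAMPLE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        rows.append({"name": name, "value": value, "labels": labels})
    return rows


def select_by_name(rows, name):
    """Rough equivalent of the JSON Path $[?(@.name == "<name>")]."""
    return [row for row in rows if row["name"] == name]


if __name__ == "__main__":
    rows = prometheus_to_rows(SAMPLE_METRICS)
    print(json.dumps(select_by_name(rows, "pd_cluster_status"), indent=2))
```

Filtering once per metric family keeps each dependent item's input small, which is why the template defines a separate "Get ... metrics" item per family.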

Triggers

Name	Description	Expression	Severity	Dependencies and additional info
PD: Instance is not responding		`last(/TiDB PD by HTTP/pd.status)=0`	Average
PD: Version has changed	PD version has changed. Acknowledge to close the problem manually.	`last(/TiDB PD by HTTP/pd.version,#1)<>last(/TiDB PD by HTTP/pd.version,#2) and length(last(/TiDB PD by HTTP/pd.version))>0`	Info	Manual close: Yes
PD: has been restarted	Uptime is less than 10 minutes.	`last(/TiDB PD by HTTP/pd.uptime)<10m`	Info	Manual close: Yes
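
The restart trigger compares the uptime item with a 10-minute window. The sketch below shows one way that check could be reproduced, assuming `start_timestamp` is a Unix timestamp in seconds and that uptime is simply the current time minus that value. The template's own JavaScript is not shown in this README, so treat the calculation as an assumption; the function names are hypothetical.

```python
import time

RESTART_WINDOW_SECONDS = 10 * 60  # the trigger's 10m threshold


def uptime_seconds(start_timestamp, now=None):
    """Seconds since the instance started (assumes epoch seconds)."""
    now = time.time() if now is None else now
    return max(0, int(now - start_timestamp))


def has_been_restarted(uptime):
    """Mirror of last(pd.uptime)<10m."""
    return uptime < RESTART_WINDOW_SECONDS


if __name__ == "__main__":
    started = time.time() - 120  # pretend PD started two minutes ago
    print(has_been_restarted(uptime_seconds(started)))
```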

LLD rule Cluster metrics discovery

Name	Description	Type	Key and additional info
Cluster metrics discovery	Discovers cluster-specific metrics.	Dependent item	pd.cluster.discovery Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`

Item prototypes for Cluster metrics discovery

Name	Description	Type	Key and additional info
TiDB cluster: Offline stores		Dependent item	pd.cluster_status.store_offline[{#SINGLETON}] Preprocessing JSON Path: `$[?(@.labels.type == "store_offline_count")].value.first()` Discard unchanged with heartbeat: `1h`
TiDB cluster: Tombstone stores	The count of tombstone stores.	Dependent item	pd.cluster_status.store_tombstone[{#SINGLETON}] Preprocessing JSON Path: `$[?(@.labels.type == "store_tombstone_count")].value.first()` Discard unchanged with heartbeat: `1h`
TiDB cluster: Down stores	The count of down stores.	Dependent item	pd.cluster_status.store_down[{#SINGLETON}] Preprocessing JSON Path: `$[?(@.labels.type == "store_down_count")].value.first()` Discard unchanged with heartbeat: `1h`
TiDB cluster: Lowspace stores	The count of low space stores.	Dependent item	pd.cluster_status.store_low_space[{#SINGLETON}] Preprocessing JSON Path: `$[?(@.labels.type == "store_low_space_count")].value.first()` Discard unchanged with heartbeat: `1h`
TiDB cluster: Unhealth stores	The count of unhealthy stores.	Dependent item	pd.cluster_status.store_unhealth[{#SINGLETON}] Preprocessing JSON Path: `$[?(@.labels.type == "store_unhealth_count")].value.first()` Discard unchanged with heartbeat: `1h`
TiDB cluster: Disconnect stores	The count of disconnected stores.	Dependent item	pd.cluster_status.store_disconnected[{#SINGLETON}] Preprocessing JSON Path: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
TiDB cluster: Normal stores	The count of healthy storage instances.	Dependent item	pd.cluster_status.store_up[{#SINGLETON}] Preprocessing JSON Path: `$[?(@.labels.type == "store_up_count")].value.first()` Discard unchanged with heartbeat: `1h`
TiDB cluster: Storage capacity	The total storage capacity for this TiDB cluster.	Dependent item	pd.cluster_status.storage_capacity[{#SINGLETON}] Preprocessing JSON Path: `$[?(@.labels.type == "storage_capacity")].value.first()` Discard unchanged with heartbeat: `1h`
TiDB cluster: Storage size	The storage size that is currently used by the TiDB cluster.	Dependent item	pd.cluster_status.storage_size[{#SINGLETON}] Preprocessing JSON Path: `$[?(@.labels.type == "storage_size")].value.first()`
TiDB cluster: Number of regions	The total count of cluster Regions.	Dependent item	pd.cluster_status.leader_count[{#SINGLETON}] Preprocessing JSON Path: `$[?(@.labels.type == "leader_count")].value.first()`
TiDB cluster: Current peer count	The current count of all cluster peers.	Dependent item	pd.cluster_status.region_count[{#SINGLETON}] Preprocessing JSON Path: `$[?(@.labels.type == "region_count")].value.first()`
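
Most item prototypes above use the same JSON Path shape: filter by the `type` label, then take the first value. A minimal Python equivalent is sketched below. The sample rows are invented, `first_value_by_type` is a hypothetical helper, and raising an error on no match is this sketch's choice rather than a statement about Zabbix behaviour.

```python
# Invented rows, shaped like the output of "PD: Get cluster metrics".
SAMPLE_ROWS = [
    {"name": "pd_cluster_status", "value": "3", "labels": {"type": "store_up_count"}},
    {"name": "pd_cluster_status", "value": "1", "labels": {"type": "store_down_count"}},
    {"name": "pd_cluster_status", "value": "1000", "labels": {"type": "storage_capacity"}},
]


def first_value_by_type(rows, metric_type):
    """Rough equivalent of $[?(@.labels.type == "<type>")].value.first()."""
    matches = [
        row["value"]
        for row in rows
        if row.get("labels", {}).get("type") == metric_type
    ]
    if not matches:
        raise LookupError(f"no row with type {metric_type!r}")
    return matches[0]


if __name__ == "__main__":
    print(first_value_by_type(SAMPLE_ROWS, "store_down_count"))
```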

Trigger prototypes for Cluster metrics discovery

Name	Description	Expression	Severity
TiDB cluster: There are offline TiKV nodes	PD has not received a TiKV heartbeat for a long time.	`last(/TiDB PD by HTTP/pd.cluster_status.store_down[{#SINGLETON}])>0`	Average
TiDB cluster: There are low space TiKV nodes	Indicates that there is not enough space on the TiKV node.	`last(/TiDB PD by HTTP/pd.cluster_status.store_low_space[{#SINGLETON}])>0`	Average
TiDB cluster: There are disconnected TiKV nodes	PD does not receive a TiKV heartbeat within 20 seconds. Normally a TiKV heartbeat comes in every 10 seconds.	`last(/TiDB PD by HTTP/pd.cluster_status.store_disconnected[{#SINGLETON}])>0`	Warning
TiDB cluster: Current storage usage is too high	Over {$PD.STORAGE_USAGE.MAX.WARN}% of the cluster space is occupied.	`min(/TiDB PD by HTTP/pd.cluster_status.storage_size[{#SINGLETON}],5m)/last(/TiDB PD by HTTP/pd.cluster_status.storage_capacity[{#SINGLETON}])*100>{$PD.STORAGE_USAGE.MAX.WARN}`	Warning
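
The storage usage expression divides the lowest storage size seen in the last 5 minutes by the latest capacity and compares the percentage with {$PD.STORAGE_USAGE.MAX.WARN}. The sketch below restates that arithmetic; the function name and the sample numbers are illustrative only, and returning False for missing data is an assumption of this sketch.

```python
def storage_usage_too_high(size_samples_5m, last_capacity, max_warn_percent=80):
    """Mirror of min(storage_size,5m)/last(storage_capacity)*100 > threshold."""
    if not size_samples_5m or last_capacity <= 0:
        return False  # no data to evaluate in this sketch
    usage_percent = min(size_samples_5m) / last_capacity * 100
    return usage_percent > max_warn_percent


if __name__ == "__main__":
    # Every sample in the window is above 80% of capacity.
    print(storage_usage_too_high([850, 900, 870], 1000))
    # One sample dropped to 70%, so the minimum stays below the threshold.
    print(storage_usage_too_high([700, 900], 1000))
```

Using the minimum over the window means a single high sample does not fire the trigger; usage has to stay above the threshold for the whole 5 minutes.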

LLD rule Region labels discovery

Name	Description	Type	Key and additional info
Region labels discovery	Discovers region label-specific metrics.	Dependent item	pd.region_labels.discovery Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`

Item prototypes for Region labels discovery

Name	Description	Type	Key and additional info
TiDB cluster: Regions label: {#TYPE}	The number of Regions in different label levels.	Dependent item	pd.region_labels[{#TYPE}] Preprocessing JSON Path: `$[?(@.labels.type == "{#TYPE}")].value.first()`
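
To show how a discovery rule and its item prototype fit together, the sketch below derives one low-level discovery row per distinct `type` label and then expands the prototype key for each row. The discovery JavaScript is not reproduced in this README, so the output shape is an assumption; the sample label values and the helper names are invented.

```python
import json


def discover_types(rows, macro="{#TYPE}"):
    """Build one LLD row per distinct labels.type value, in first-seen order."""
    seen = []
    for row in rows:
        metric_type = row.get("labels", {}).get("type")
        if metric_type is not None and metric_type not in seen:
            seen.append(metric_type)
    return [{macro: metric_type} for metric_type in seen]


def expand_key(prototype_key, lld_row):
    """Substitute LLD macros into an item prototype key."""
    for macro, value in lld_row.items():
        prototype_key = prototype_key.replace(macro, value)
    return prototype_key


if __name__ == "__main__":
    # Invented rows; real label values come from the PD /metrics output.
    rows = [
        {"name": "pd_regions_label_level", "value": "5", "labels": {"type": "level_a"}},
        {"name": "pd_regions_label_level", "value": "7", "labels": {"type": "level_b"}},
    ]
    lld = discover_types(rows)
    print(json.dumps(lld))
    print([expand_key("pd.region_labels[{#TYPE}]", row) for row in lld])
```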

LLD rule Region status discovery

Name	Description	Type	Key and additional info
Region status discovery	Discovers region status-specific metrics.	Dependent item	pd.region_status.discovery Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`

Item prototypes for Region status discovery

Name	Description	Type	Key and additional info
TiDB cluster: Regions status: {#TYPE}	The health status of Regions indicated via the count of unusual Regions including pending peers, down peers, extra peers, offline peers, missing peers, learner peers and incorrect namespaces.	Dependent item	pd.region_status[{#TYPE}] Preprocessing JSON Path: `$[?(@.labels.type == "{#TYPE}")].value.first()`

Trigger prototypes for Region status discovery

Name	Description	Expression	Severity	Dependencies and additional info
TiDB cluster: Too many missed regions	The number of Region replicas is smaller than the value of max-replicas. When a TiKV machine is down and its downtime exceeds max-down-time, it usually leads to missing replicas for some Regions during a period of time. When a TiKV node is made offline, it might result in a small number of Regions with missing replicas.	`min(/TiDB PD by HTTP/pd.region_status[{#TYPE}],5m)>{$PD.MISS_REGION.MAX.WARN}`	Warning
TiDB cluster: There are unresponsive peers	The number of Regions with an unresponsive peer reported by the Raft leader.	`min(/TiDB PD by HTTP/pd.region_status[{#TYPE}],5m)>0`	Warning

LLD rule Running scheduler discovery

Name	Description	Type	Key and additional info
Running scheduler discovery	Discovers scheduler-specific metrics.	Dependent item	pd.scheduler.discovery Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`

Item prototypes for Running scheduler discovery

Name	Description	Type	Key and additional info
TiDB cluster: Scheduler status: {#KIND}	The current running schedulers.	Dependent item	pd.scheduler[{#KIND}] Preprocessing JSON Path: `$[?(@.labels.kind == "{#KIND}")].value.first()` ⛔️Custom on fail: Set value to: `0`

LLD rule gRPC commands discovery

Name	Description	Type	Key and additional info
gRPC commands discovery	Discovers gRPC command-specific metrics.	Dependent item	pd.grpc_command.discovery Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`

Item prototypes for gRPC commands discovery

Name	Description	Type	Key and additional info
PD: gRPC Commands: {#GRPC_METHOD}, rate	The rate per command type at which gRPC commands are completed.	Dependent item	pd.grpc_command.rate[{#GRPC_METHOD}] Preprocessing JSON Path: `$[?(@.labels.grpc_method == "{#GRPC_METHOD}")].value.first()` Change per second
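
The "Change per second" step turns a growing counter into a rate by dividing the difference between two consecutive values by the time between them. The sketch below shows that calculation in isolation; the function name is hypothetical, and returning `None` when the counter goes backwards (for example after a PD restart) is how this sketch represents a discarded value.

```python
def change_per_second(prev_value, prev_time, value, now):
    """(value - prev_value) / (now - prev_time), or None if not computable."""
    elapsed = now - prev_time
    if elapsed <= 0 or value < prev_value:
        return None  # no valid interval, or the counter was reset
    return (value - prev_value) / elapsed


if __name__ == "__main__":
    # 60 more completed commands over 60 seconds.
    print(change_per_second(100, 0, 160, 60))
    # Counter reset: no rate can be derived from this pair.
    print(change_per_second(160, 60, 5, 120))
```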

LLD rule Region discovery

Name	Description	Type	Key and additional info
Region discovery	Discovers region-specific metrics.	Dependent item	pd.region.discovery Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`

Item prototypes for Region discovery

Name	Description	Type	Key and additional info
PD: Get metrics: {#STORE_ADDRESS}	Get region metrics for {#STORE_ADDRESS}.	Dependent item	pd.region_heartbeat.get_metrics[{#STORE_ADDRESS}] Preprocessing JSON Path: `$[?(@.labels.address == "{#STORE_ADDRESS}")]` ⛔️Custom on fail: Discard value
PD: Region heartbeat: active, rate	The count of heartbeats with the ok status per second.	Dependent item	pd.region_heartbeat.ok.rate[{#STORE_ADDRESS}] Preprocessing JSON Path: `The text is too long. Please see the template.` ⛔️Custom on fail: Set value to: `0` Change per second
PD: Region heartbeat: error, rate	The count of heartbeats with the error status per second.	Dependent item	pd.region_heartbeat.error.rate[{#STORE_ADDRESS}] Preprocessing JSON Path: `The text is too long. Please see the template.` ⛔️Custom on fail: Set value to: `0` Change per second
PD: Region heartbeat: total, rate	The count of heartbeats reported to PD per instance per second.	Dependent item	pd.region_heartbeat.rate[{#STORE_ADDRESS}] Preprocessing JSON Path: `$[?(@.labels.type == "report")].value.sum()` ⛔️Custom on fail: Set value to: `0` Change per second
PD: Region schedule push: total, rate		Dependent item	pd.region_heartbeat.push.err.rate[{#STORE_ADDRESS}] Preprocessing JSON Path: `$[?(@.labels.type == "push")].value.sum()` ⛔️Custom on fail: Set value to: `0` Change per second
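
The heartbeat totals above sum every matching value and fall back to `0` when the step fails. The sketch below imitates that combination for rows already filtered to one store address. The sample rows and the helper name are invented, and treating "no matching rows" as the failure case is an assumption of this sketch.

```python
def sum_by_type(rows, metric_type, on_fail=0.0):
    """Rough equivalent of $[?(@.labels.type == "<type>")].value.sum()
    followed by "Custom on fail: Set value to 0"."""
    values = [
        float(row["value"])
        for row in rows
        if row.get("labels", {}).get("type") == metric_type
    ]
    return sum(values) if values else on_fail


if __name__ == "__main__":
    # Invented rows for a single store address.
    rows = [
        {"value": "40", "labels": {"type": "report", "status": "ok"}},
        {"value": "2", "labels": {"type": "report", "status": "err"}},
        {"value": "7", "labels": {"type": "push", "status": "ok"}},
    ]
    print(sum_by_type(rows, "report"))
    print(sum_by_type(rows, "unknown"))
```

The summed counter would then go through "Change per second" to become a rate.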

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help on the ZABBIX forums.
