TiDB PD by HTTP

Overview

This template monitors the PD server of a TiDB cluster with Zabbix and works without any external scripts. Most metrics are collected in a single request, thanks to Zabbix bulk data collection.

The TiDB PD by HTTP template collects metrics with the HTTP agent from the PD /metrics endpoint and from the monitoring API. See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api.

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

  • TiDB cluster 4.0.10, 6.5.1

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

This template works with the PD server of a TiDB cluster. Internal service metrics are collected from the PD /metrics endpoint and from the monitoring API. See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api. Remember to set the macros {$PD.URL} and {$PD.PORT} for your environment. See also the Macros used section for the macros that set trigger thresholds.
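
As a rough illustration of how the two macros combine into the polled URLs, here is a minimal Python sketch. The /metrics path comes from the text above; the status path /pd/api/v1/status, the http:// default scheme, and the helper name pd_endpoints are assumptions made for the example, so check the template YAML for the exact URLs.

```python
def pd_endpoints(pd_url: str = "localhost", pd_port: int = 2379) -> dict:
    """Combine {$PD.URL} and {$PD.PORT} into the two polled URLs.

    The status path and the default scheme are assumptions, not taken
    from the template itself.
    """
    base = f"{pd_url.rstrip('/')}:{pd_port}"
    if "://" not in base:
        # Assumed: plain HTTP when the macro value carries no scheme.
        base = "http://" + base
    return {
        "metrics": base + "/metrics",
        "status": base + "/pd/api/v1/status",  # assumed path
    }


if __name__ == "__main__":
    for name, url in pd_endpoints().items():
        print(name, url)
```

Requesting the printed URLs with curl from the Zabbix server or proxy host is a quick way to confirm that the macro values are reachable before linking the template.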

Macros used

| Name | Description | Default |
|------|-------------|---------|
| {$PD.PORT} | The port of PD server metrics web endpoint | 2379 |
| {$PD.URL} | PD server URL | localhost |
| {$PD.MISS_REGION.MAX.WARN} | Maximum number of missed regions | 100 |
| {$PD.STORAGE_USAGE.MAX.WARN} | Maximum percentage of cluster space used | 80 |

Items

Name Description Type Key and additional info
PD: Get instance metrics

Get TiDB PD instance metrics.

HTTP agent pd.get_metrics

Preprocessing

  • Check for not supported value

    Custom on fail: Discard value

  • Prometheus to JSON
PD: Get instance status

Get TiDB PD instance status info.

HTTP agent pd.get_status

Preprocessing

  • Check for not supported value

    Custom on fail: Set value to: {"status": "0"}

PD: Status

Status of PD instance.

Dependent item pd.status

Preprocessing

  • JSON Path: $.status

    Custom on fail: Set value to: 1

  • Discard unchanged with heartbeat: 1h

PD: gRPC Commands total, rate

The rate at which gRPC commands are completed.

Dependent item pd.grpc_command.rate

Preprocessing

  • JSON Path: The text is too long. Please see the template.

    Custom on fail: Discard value

  • Change per second
PD: Version

Version of the PD instance.

Dependent item pd.version

Preprocessing

  • JSON Path: $.version

  • Discard unchanged with heartbeat: 3h

PD: Uptime

The runtime of each PD instance.

Dependent item pd.uptime

Preprocessing

  • JSON Path: $.start_timestamp

  • JavaScript: The text is too long. Please see the template.

PD: Get cluster metrics

Get cluster metrics.

Dependent item pd.cluster_status.get_metrics

Preprocessing

  • JSON Path: $[?(@.name == "pd_cluster_status")]

    Custom on fail: Discard value

PD: Get region metrics

Get region metrics.

Dependent item pd.regions.get_metrics

Preprocessing

  • JSON Path: $[?(@.name == "pd_scheduler_region_heartbeat")]

    Custom on fail: Discard value

PD: Get region label metrics

Get region label metrics.

Dependent item pd.region_labels.get_metrics

Preprocessing

  • JSON Path: $[?(@.name == "pd_regions_label_level")]

    Custom on fail: Discard value

PD: Get region status metrics

Get region status metrics.

Dependent item pd.region_status.get_metrics

Preprocessing

  • JSON Path: $[?(@.name == "pd_regions_status")]

    Custom on fail: Discard value

PD: Get gRPC command metrics

Get gRPC command metrics.

Dependent item pd.grpc_commands.get_metrics

Preprocessing

  • JSON Path: $[?(@.name == "grpc_server_handling_seconds_count")]

    Custom on fail: Discard value

PD: Get scheduler metrics

Get scheduler metrics.

Dependent item pd.scheduler.get_metrics

Preprocessing

  • JSON Path: The text is too long. Please see the template.

    Custom on fail: Discard value
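
The items above form a chain: the master item converts the Prometheus text to JSON, and the dependent "Get ... metrics" items filter that JSON by metric name. The Python sketch below imitates the idea with a simplified parser. It is not Zabbix's implementation, the sample metric lines are invented, and the real "Prometheus to JSON" step emits more fields per object than the three shown here.

```python
import re

# metric_name{label="value",...} value
LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Invented sample; real PD output has many more metrics and labels.
SAMPLE = """# TYPE pd_cluster_status gauge
pd_cluster_status{type="store_up_count"} 3
pd_cluster_status{type="storage_capacity"} 1000
pd_regions_status{type="miss_peer_region_count"} 0
"""


def prometheus_to_json(text):
    """Turn Prometheus exposition text into a list of metric objects."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        rows.append({"name": name, "labels": labels, "value": value})
    return rows


def by_name(rows, name):
    """Rough equivalent of the JSON Path $[?(@.name == "<name>")]."""
    return [row for row in rows if row["name"] == name]


if __name__ == "__main__":
    print(by_name(prometheus_to_json(SAMPLE), "pd_cluster_status"))
```

Filtering once per metric family keeps each dependent item small, so the item prototypes further down only have to search a short array.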

Triggers

| Name | Description | Expression | Severity | Dependencies and additional info |
|------|-------------|------------|----------|----------------------------------|
| PD: Instance is not responding | | last(/TiDB PD by HTTP/pd.status)=0 | Average | |
| PD: Version has changed | PD version has changed. Acknowledge to close the problem manually. | last(/TiDB PD by HTTP/pd.version,#1)<>last(/TiDB PD by HTTP/pd.version,#2) and length(last(/TiDB PD by HTTP/pd.version))>0 | Info | Manual close: Yes |
| PD: has been restarted | Uptime is less than 10 minutes. | last(/TiDB PD by HTTP/pd.uptime)<10m | Info | Manual close: Yes |
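
To make the restart trigger concrete, the sketch below derives an uptime from a start timestamp and applies the 10-minute threshold. The template's own JavaScript step is not reproduced here; treating uptime as "now minus start_timestamp" in seconds is an assumption, and both function names are hypothetical.

```python
import time


def uptime_seconds(start_timestamp, now=None):
    """Assumed derivation: seconds elapsed since the PD start timestamp."""
    now = time.time() if now is None else now
    return max(0, int(now - start_timestamp))


def recently_restarted(uptime, threshold_seconds=600):
    """Mirror of last(pd.uptime)<10m, with 10m written as 600 seconds."""
    return uptime < threshold_seconds


if __name__ == "__main__":
    up = uptime_seconds(start_timestamp=1_700_000_000, now=1_700_000_300)
    print(up, recently_restarted(up))
```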

LLD rule Cluster metrics discovery

Name Description Type Key and additional info
Cluster metrics discovery

Discovers cluster-specific metrics.

Dependent item pd.cluster.discovery

Preprocessing

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h
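
The "Discard unchanged with heartbeat" step used throughout this template can be pictured as a small filter: a value passes if it differs from the last stored one, or if the heartbeat period has elapsed since the last stored value. The class below is an illustrative Python sketch of that behaviour, not Zabbix code; the class name and the exact boundary handling are assumptions.

```python
class DiscardUnchangedWithHeartbeat:
    """Pass a value only if it changed or the heartbeat period elapsed."""

    def __init__(self, heartbeat_seconds):
        self.heartbeat_seconds = heartbeat_seconds
        self._last_value = None
        self._last_ts = None

    def accept(self, value, ts):
        first = self._last_ts is None
        changed = first or value != self._last_value
        expired = (not first) and ts - self._last_ts >= self.heartbeat_seconds
        if changed or expired:
            # Remember what was stored and when.
            self._last_value, self._last_ts = value, ts
            return True
        return False


if __name__ == "__main__":
    step = DiscardUnchangedWithHeartbeat(heartbeat_seconds=3600)
    for ts, value in [(0, "[]"), (60, "[]"), (120, "[1]"), (3720, "[1]")]:
        print(ts, value, step.accept(value, ts))
```

With a 1h heartbeat on a discovery rule, an unchanged discovery payload is processed about once per hour instead of on every poll.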

Item prototypes for Cluster metrics discovery

Name Description Type Key and additional info
TiDB cluster: Offline stores Dependent item pd.cluster_status.store_offline[{#SINGLETON}]

Preprocessing

  • JSON Path: $[?(@.labels.type == "store_offline_count")].value.first()

  • Discard unchanged with heartbeat: 1h

TiDB cluster: Tombstone stores

The count of tombstone stores.

Dependent item pd.cluster_status.store_tombstone[{#SINGLETON}]

Preprocessing

  • JSON Path: $[?(@.labels.type == "store_tombstone_count")].value.first()

  • Discard unchanged with heartbeat: 1h

TiDB cluster: Down stores

The count of down stores.

Dependent item pd.cluster_status.store_down[{#SINGLETON}]

Preprocessing

  • JSON Path: $[?(@.labels.type == "store_down_count")].value.first()

  • Discard unchanged with heartbeat: 1h

TiDB cluster: Lowspace stores

The count of low space stores.

Dependent item pd.cluster_status.store_low_space[{#SINGLETON}]

Preprocessing

  • JSON Path: $[?(@.labels.type == "store_low_space_count")].value.first()

  • Discard unchanged with heartbeat: 1h

TiDB cluster: Unhealth stores

The count of unhealthy stores.

Dependent item pd.cluster_status.store_unhealth[{#SINGLETON}]

Preprocessing

  • JSON Path: $[?(@.labels.type == "store_unhealth_count")].value.first()

  • Discard unchanged with heartbeat: 1h

TiDB cluster: Disconnect stores

The count of disconnected stores.

Dependent item pd.cluster_status.store_disconnected[{#SINGLETON}]

Preprocessing

  • JSON Path: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

TiDB cluster: Normal stores

The count of healthy storage instances.

Dependent item pd.cluster_status.store_up[{#SINGLETON}]

Preprocessing

  • JSON Path: $[?(@.labels.type == "store_up_count")].value.first()

  • Discard unchanged with heartbeat: 1h

TiDB cluster: Storage capacity

The total storage capacity for this TiDB cluster.

Dependent item pd.cluster_status.storage_capacity[{#SINGLETON}]

Preprocessing

  • JSON Path: $[?(@.labels.type == "storage_capacity")].value.first()

  • Discard unchanged with heartbeat: 1h

TiDB cluster: Storage size

The storage size that is currently used by the TiDB cluster.

Dependent item pd.cluster_status.storage_size[{#SINGLETON}]

Preprocessing

  • JSON Path: $[?(@.labels.type == "storage_size")].value.first()

TiDB cluster: Number of regions

The total count of cluster Regions.

Dependent item pd.cluster_status.leader_count[{#SINGLETON}]

Preprocessing

  • JSON Path: $[?(@.labels.type == "leader_count")].value.first()

TiDB cluster: Current peer count

The current count of all cluster peers.

Dependent item pd.cluster_status.region_count[{#SINGLETON}]

Preprocessing

  • JSON Path: $[?(@.labels.type == "region_count")].value.first()
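
Each item prototype above picks a single value out of the pd_cluster_status array with a filter on labels.type followed by first(). A rough Python equivalent is sketched below; the sample rows are invented, and raising an error on no match stands in for the JSON Path step failing.

```python
# Invented sample shaped like the filtered pd_cluster_status array.
SAMPLE_ROWS = [
    {"name": "pd_cluster_status", "labels": {"type": "store_up_count"}, "value": "3"},
    {"name": "pd_cluster_status", "labels": {"type": "storage_capacity"}, "value": "1000"},
    {"name": "pd_cluster_status", "labels": {"type": "storage_size"}, "value": "820"},
]


def first_value_by_type(rows, metric_type):
    """Rough equivalent of $[?(@.labels.type == "<type>")].value.first()."""
    matches = [
        row["value"]
        for row in rows
        if row.get("labels", {}).get("type") == metric_type
    ]
    if not matches:
        raise LookupError(f"no metric with type {metric_type!r}")
    return matches[0]


if __name__ == "__main__":
    print(first_value_by_type(SAMPLE_ROWS, "store_up_count"))
```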

Trigger prototypes for Cluster metrics discovery

| Name | Description | Expression | Severity | Dependencies and additional info |
|------|-------------|------------|----------|----------------------------------|
| TiDB cluster: There are offline TiKV nodes | PD has not received a TiKV heartbeat for a long time. | last(/TiDB PD by HTTP/pd.cluster_status.store_down[{#SINGLETON}])>0 | Average | |
| TiDB cluster: There are low space TiKV nodes | Indicates that there is no sufficient space on the TiKV node. | last(/TiDB PD by HTTP/pd.cluster_status.store_low_space[{#SINGLETON}])>0 | Average | |
| TiDB cluster: There are disconnected TiKV nodes | PD does not receive a TiKV heartbeat within 20 seconds. Normally a TiKV heartbeat comes in every 10 seconds. | last(/TiDB PD by HTTP/pd.cluster_status.store_disconnected[{#SINGLETON}])>0 | Warning | |
| TiDB cluster: Current storage usage is too high | Over {$PD.STORAGE_USAGE.MAX.WARN}% of the cluster space is occupied. | min(/TiDB PD by HTTP/pd.cluster_status.storage_size[{#SINGLETON}],5m)/last(/TiDB PD by HTTP/pd.cluster_status.storage_capacity[{#SINGLETON}])*100>{$PD.STORAGE_USAGE.MAX.WARN} | Warning | |
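
The storage usage expression divides the lowest storage size seen in the last 5 minutes by the latest capacity, so a short spike alone does not fire the trigger. The sketch below works through the arithmetic in Python with made-up sample values; the function name is hypothetical and the default threshold is the {$PD.STORAGE_USAGE.MAX.WARN} default of 80.

```python
def storage_usage_too_high(size_samples_5m, capacity_last, max_warn_pct=80.0):
    """Mirror of min(storage_size,5m)/last(storage_capacity)*100 > threshold."""
    usage_pct = min(size_samples_5m) / capacity_last * 100
    return usage_pct > max_warn_pct


if __name__ == "__main__":
    # Every sample in the window is above 80% of 1000, so this fires.
    print(storage_usage_too_high([850, 900, 820], capacity_last=1000))
    # One sample dipped to 79%, so the minimum keeps the trigger quiet.
    print(storage_usage_too_high([850, 790, 900], capacity_last=1000))
```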

LLD rule Region labels discovery

Name Description Type Key and additional info
Region labels discovery

Discovers region label-specific metrics.

Dependent item pd.region_labels.discovery

Preprocessing

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

Item prototypes for Region labels discovery

Name Description Type Key and additional info
TiDB cluster: Regions label: {#TYPE}

The number of Regions in different label levels.

Dependent item pd.region_labels[{#TYPE}]

Preprocessing

  • JSON Path: $[?(@.labels.type == "{#TYPE}")].value.first()
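
A discovery rule has to hand Zabbix a JSON array of objects whose keys are LLD macros such as {#TYPE}; each object then produces one item from the prototype above. The template's own JavaScript is not reproduced here. The Python sketch below only illustrates the output shape, assuming one row per distinct labels.type value, and uses invented sample data.

```python
import json

# Invented sample shaped like the filtered pd_regions_label_level array.
SAMPLE_ROWS = [
    {"name": "pd_regions_label_level", "labels": {"type": "0"}, "value": "12"},
    {"name": "pd_regions_label_level", "labels": {"type": "1"}, "value": "4"},
    {"name": "pd_regions_label_level", "labels": {"type": "0"}, "value": "12"},
]


def lld_rows(rows, label, macro):
    """Build an LLD JSON array with one object per distinct label value."""
    seen = []
    for row in rows:
        value = row.get("labels", {}).get(label)
        if value is not None and value not in seen:
            seen.append(value)
    return json.dumps([{macro: value} for value in seen])


if __name__ == "__main__":
    print(lld_rows(SAMPLE_ROWS, label="type", macro="{#TYPE}"))
```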

LLD rule Region status discovery

Name Description Type Key and additional info
Region status discovery

Discovers region status-specific metrics.

Dependent item pd.region_status.discovery

Preprocessing

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

Item prototypes for Region status discovery

Name Description Type Key and additional info
TiDB cluster: Regions status: {#TYPE}

The health status of Regions indicated via the count of unusual Regions including pending peers, down peers, extra peers, offline peers, missing peers, learner peers and incorrect namespaces.

Dependent item pd.region_status[{#TYPE}]

Preprocessing

  • JSON Path: $[?(@.labels.type == "{#TYPE}")].value.first()

Trigger prototypes for Region status discovery

| Name | Description | Expression | Severity | Dependencies and additional info |
|------|-------------|------------|----------|----------------------------------|
| TiDB cluster: Too many missed regions | The number of Region replicas is smaller than the value of max-replicas. When a TiKV machine is down and its downtime exceeds max-down-time, it usually leads to missing replicas for some Regions during a period of time. When a TiKV node is made offline, it might result in a small number of Regions with missing replicas. | min(/TiDB PD by HTTP/pd.region_status[{#TYPE}],5m)>{$PD.MISS_REGION.MAX.WARN} | Warning | |
| TiDB cluster: There are unresponsive peers | The number of Regions with an unresponsive peer reported by the Raft leader. | min(/TiDB PD by HTTP/pd.region_status[{#TYPE}],5m)>0 | Warning | |

LLD rule Running scheduler discovery

Name Description Type Key and additional info
Running scheduler discovery

Discovers scheduler-specific metrics.

Dependent item pd.scheduler.discovery

Preprocessing

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

Item prototypes for Running scheduler discovery

Name Description Type Key and additional info
TiDB cluster: Scheduler status: {#KIND}

The current running schedulers.

Dependent item pd.scheduler[{#KIND}]

Preprocessing

  • JSON Path: $[?(@.labels.kind == "{#KIND}")].value.first()

    Custom on fail: Set value to: 0

LLD rule gRPC commands discovery

Name Description Type Key and additional info
gRPC commands discovery

Discovers gRPC command-specific metrics.

Dependent item pd.grpc_command.discovery

Preprocessing

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

Item prototypes for gRPC commands discovery

Name Description Type Key and additional info
PD: gRPC Commands: {#GRPC_METHOD}, rate

The rate per command type at which gRPC commands are completed.

Dependent item pd.grpc_command.rate[{#GRPC_METHOD}]

Preprocessing

  • JSON Path: $[?(@.labels.grpc_method == "{#GRPC_METHOD}")].value.first()

  • Change per second
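
The gRPC items store a rate rather than the raw counter: the "Change per second" step divides the difference between two consecutive values by the time between them. A minimal Python sketch of that calculation follows; the function name is hypothetical, and returning None for a decreasing counter models the step storing nothing for that sample.

```python
def change_per_second(prev_value, prev_ts, value, ts):
    """(value - prev_value) / (ts - prev_ts); None if the counter went down."""
    if ts <= prev_ts:
        raise ValueError("timestamps must be strictly increasing")
    if value < prev_value:
        # Counter reset (for example after a PD restart): skip this sample.
        return None
    return (value - prev_value) / (ts - prev_ts)


if __name__ == "__main__":
    # 60 more completed commands over 60 seconds.
    print(change_per_second(prev_value=100, prev_ts=0, value=160, ts=60))
```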

LLD rule Region discovery

Name Description Type Key and additional info
Region discovery

Discovers region-specific metrics.

Dependent item pd.region.discovery

Preprocessing

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

Item prototypes for Region discovery

Name Description Type Key and additional info
PD: Get metrics: {#STORE_ADDRESS}

Get region metrics for {#STORE_ADDRESS}.

Dependent item pd.region_heartbeat.get_metrics[{#STORE_ADDRESS}]

Preprocessing

  • JSON Path: $[?(@.labels.address == "{#STORE_ADDRESS}")]

    Custom on fail: Discard value

PD: Region heartbeat: active, rate

The count of heartbeats with the ok status per second.

Dependent item pd.region_heartbeat.ok.rate[{#STORE_ADDRESS}]

Preprocessing

  • JSON Path: The text is too long. Please see the template.

    Custom on fail: Set value to: 0

  • Change per second
PD: Region heartbeat: error, rate

The count of heartbeats with the error status per second.

Dependent item pd.region_heartbeat.error.rate[{#STORE_ADDRESS}]

Preprocessing

  • JSON Path: The text is too long. Please see the template.

    Custom on fail: Set value to: 0

  • Change per second
PD: Region heartbeat: total, rate

The count of heartbeats reported to PD per instance per second.

Dependent item pd.region_heartbeat.rate[{#STORE_ADDRESS}]

Preprocessing

  • JSON Path: $[?(@.labels.type == "report")].value.sum()

    Custom on fail: Set value to: 0

  • Change per second
PD: Region schedule push: total, rate Dependent item pd.region_heartbeat.push.err.rate[{#STORE_ADDRESS}]

Preprocessing

  • JSON Path: $[?(@.labels.type == "push")].value.sum()

    Custom on fail: Set value to: 0

  • Change per second
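
The heartbeat items above add up every counter of one type for a store before the rate is taken, and fall back to 0 when nothing matches. The sketch below shows that sum-with-fallback step in Python. The sample rows and the status label are invented, and mapping "no match" to the "Set value to: 0" handler is a simplification of how a failed JSON Path step is handled.

```python
# Invented sample shaped like the per-store pd_scheduler_region_heartbeat array.
SAMPLE_ROWS = [
    {"labels": {"type": "report", "status": "ok"}, "value": "120"},
    {"labels": {"type": "report", "status": "err"}, "value": "5"},
    {"labels": {"type": "push", "status": "ok"}, "value": "40"},
]


def sum_values_by_type(rows, metric_type):
    """Rough equivalent of $[?(@.labels.type == "<type>")].value.sum()."""
    values = [
        float(row["value"])
        for row in rows
        if row.get("labels", {}).get("type") == metric_type
    ]
    # Simplified stand-in for "Custom on fail: Set value to: 0".
    return sum(values) if values else 0.0


if __name__ == "__main__":
    print(sum_values_by_type(SAMPLE_ROWS, "report"))
```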

Feedback

Please report any issues with the template at https://support.zabbix.com.

You can also provide feedback, discuss the template, or ask for help at the Zabbix forums.