TiDB by HTTP

Overview

This template monitors the TiDB server of a TiDB cluster with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

The TiDB by HTTP template collects metrics with the HTTP agent from the TiDB /metrics endpoint and from the monitoring API. See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api.

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

  • TiDB cluster 4.0.10, 6.5.1

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

This template works with the TiDB server of a TiDB cluster. Internal service metrics are collected from the TiDB /metrics endpoint and from the monitoring API. See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api. Don't forget to change the macros {$TIDB.URL} and {$TIDB.PORT}. Also, see the Macros section for a list of macros used to set trigger values.
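Before linking the template to a host, it can help to confirm that both endpoints answer from the Zabbix server or proxy. The sketch below only builds the two URLs from the macro values. How the template itself joins {$TIDB.URL} and {$TIDB.PORT} is an assumption here (check the YAML), and `endpoint_urls` is a hypothetical helper name. The `/status` path is the one described in the TiDB monitoring API documentation linked above.

```python
def endpoint_urls(url: str = "localhost", port: int = 10080) -> dict:
    """Build the metrics and status URLs from {$TIDB.URL} and {$TIDB.PORT}.

    Assumption: plain HTTP is used when the macro value carries no scheme.
    """
    base = url if "://" in url else "http://" + url
    base = base.rstrip("/")
    return {
        "metrics": f"{base}:{port}/metrics",  # Prometheus-format metrics
        "status": f"{base}:{port}/status",    # JSON status from the monitoring API
    }


urls = endpoint_urls()  # defaults mirror the macro defaults in this README
```

Requesting each resulting URL from the Zabbix host (for example with `curl`) should return Prometheus text for `metrics` and a small JSON document for `status`.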

Macros used

| Name | Description | Default |
|------|-------------|---------|
| {$TIDB.PORT} | The port of TiDB server metrics web endpoint | 10080 |
| {$TIDB.URL} | TiDB server URL | localhost |
| {$TIDB.OPEN.FDS.MAX.WARN} | Maximum percentage of used file descriptors | 90 |
| {$TIDB.HEAP.USAGE.MAX.WARN} | Maximum heap memory used | 10G |
| {$TIDB.DDL.WAITING.MAX.WARN} | Maximum number of DDL tasks that are waiting | 5 |
| {$TIDB.TIME_JUMP_BACK.MAX.WARN} | Maximum number of times that the operating system rewinds every second | 1 |
| {$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN} | Maximum number of schema lease errors | 0 |
| {$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN} | Maximum number of load schema errors | 1 |
| {$TIDB.GC_ACTIONS.ERRORS.MAX.WARN} | Maximum number of GC-related operations failures | 1 |
| {$TIDB.REGION_ERROR.MAX.WARN} | Maximum number of region related errors | 50 |
| {$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN} | Minimum number of keep alive operations | 10 |
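Macro values such as `10G` use Zabbix unit suffixes, which for sizes are powers of 1024. The following is a minimal sketch of how such a value compares against a raw byte count like the heap usage item. `to_number` is a hypothetical helper, not part of Zabbix, and it covers only the size suffixes.

```python
# Size suffixes as powers of 1024 (time suffixes such as "m" or "h" are not handled here).
SIZE_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_number(value: str) -> float:
    """Convert a macro value like '10G' or '90' to a plain number."""
    value = value.strip()
    if value and value[-1] in SIZE_SUFFIXES:
        return float(value[:-1]) * SIZE_SUFFIXES[value[-1]]
    return float(value)


heap_limit_bytes = to_number("10G")  # default of {$TIDB.HEAP.USAGE.MAX.WARN}
```

With the default value, the heap trigger threshold corresponds to 10 * 1024^3 bytes.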

Items

Name Description Type Key and additional info
TiDB: Get instance metrics

Get TiDB instance metrics.

HTTP agent tidb.get_metrics

Preprocessing

  • Check for not supported value

    Custom on fail: Discard value

  • Prometheus to JSON
TiDB: Get instance status

Get TiDB instance status info.

HTTP agent tidb.get_status

Preprocessing

  • Check for not supported value

    Custom on fail: Set value to: {"status": "0"}

TiDB: Status

Status of the TiDB instance.

Dependent item tidb.status

Preprocessing

  • JSON Path: $.status

    Custom on fail: Set value to: 1

  • Discard unchanged with heartbeat: 1h

TiDB: Get total server query metrics

Get information about server queries.

Dependent item tidb.server_query.get_metrics

Preprocessing

  • JSON Path: $[?(@.name == "tidb_server_query_total")]

    Custom on fail: Discard value

TiDB: Total "error" server query, rate

The number of queries per second on the TiDB instance whose command execution failed.

Dependent item tidb.server_query.error.rate

Preprocessing

  • JSON Path: $[?(@.labels.result == "Error")].value.sum()

  • Change per second
TiDB: Total "ok" server query, rate

The number of queries per second on the TiDB instance whose command execution succeeded.

Dependent item tidb.server_query.ok.rate

Preprocessing

  • JSON Path: $[?(@.labels.result == "OK")].value.sum()

  • Change per second
TiDB: Total server query, rate

The number of queries per second on TiDB instance.

Dependent item tidb.server_query.rate

Preprocessing

  • JSON Path: $..value.sum()

  • Change per second
TiDB: Get SQL statements metrics

Get SQL statements metrics.

Dependent item tidb.statement_total.get_metrics

Preprocessing

  • JSON Path: $[?(@.name=="tidb_executor_statement_total")]

    Custom on fail: Discard value

TiDB: SQL statements, rate

The total number of SQL statements executed per second.

Dependent item tidb.statement_total.rate

Preprocessing

  • JSON Path: $..value.sum()

  • Change per second
TiDB: Failed Query, rate

The number of errors per second that occurred when executing SQL statements (such as syntax errors and primary key conflicts).

Dependent item tidb.execute_error.rate

Preprocessing

  • JSON Path: $[?(@.name=="tidb_server_execute_error_total")].value.sum()

    Custom on fail: Discard value

  • Change per second
TiDB: Get TiKV client metrics

Get TiKV client metrics.

Dependent item tidb.tikvclient.get_metrics

Preprocessing

  • JSON Path: $[?(@.name=~"tidb_tikvclient_*")]

    Custom on fail: Discard value

TiDB: KV commands, rate

The number of executed KV commands per second.

Dependent item tidb.tikvclient_txn.rate

Preprocessing

  • JSON Path: The text is too long. Please see the template.

  • Change per second
TiDB: PD TSO commands, rate

The number of TSO commands that TiDB obtains from PD per second.

Dependent item tidb.pd_tso_cmd.rate

Preprocessing

  • JSON Path: The text is too long. Please see the template.

  • Change per second
TiDB: PD TSO requests, rate

The number of TSO requests that TiDB obtains from PD per second.

Dependent item tidb.pd_tso_request.rate

Preprocessing

  • JSON Path: The text is too long. Please see the template.

  • Change per second
TiDB: TiClient region errors, rate

The number of region related errors returned by TiKV per second.

Dependent item tidb.tikvclient_region_err.rate

Preprocessing

  • JSON Path: $[?(@.name=="tidb_tikvclient_region_err_total")].value.sum()

  • Change per second
TiDB: Lock resolves, rate

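The number of TiDB operations that resolve locks per second. When TiDB's read or write request encounters a lock, it tries to resolve the lock.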

Dependent item tidb.tikvclient_lock_resolver_action.rate

Preprocessing

  • JSON Path: The text is too long. Please see the template.

  • Change per second
TiDB: DDL waiting jobs

The number of DDL tasks that are waiting.

Dependent item tidb.ddl_waiting_jobs

Preprocessing

  • JSON Path: $[?(@.name=="tidb_ddl_waiting_jobs")].value.sum()

    Custom on fail: Set value to: 0

TiDB: Load schema total, rate

The statistics of the schemas that TiDB obtains from TiKV per second.

Dependent item tidb.domain_load_schema.rate

Preprocessing

  • JSON Path: $[?(@.name=="tidb_domain_load_schema_total")].value.sum()

  • Change per second
TiDB: Load schema failed, rate

The total number of failures to reload the latest schema information in TiDB per second.

Dependent item tidb.domain_load_schema.failed.rate

Preprocessing

  • JSON Path: The text is too long. Please see the template.

    Custom on fail: Discard value

  • Change per second
TiDB: Schema lease "outdate" errors , rate

The number of schema lease errors per second.

An "outdate" error means that the schema cannot be updated; this is the more serious error and triggers an alert.

Dependent item tidb.session_schema_lease_error.outdate.rate

Preprocessing

  • JSON Path: The text is too long. Please see the template.

    Custom on fail: Discard value

  • Change per second
TiDB: Schema lease "change" errors, rate

The number of schema lease errors per second.

A "change" error means that the schema has changed.

Dependent item tidb.session_schema_lease_error.change.rate

Preprocessing

  • JSON Path: The text is too long. Please see the template.

    Custom on fail: Discard value

  • Change per second
TiDB: KV backoff, rate

The number of errors returned by TiKV per second.

Dependent item tidb.tikvclient_backoff.rate

Preprocessing

  • JSON Path: $[?(@.name=="tidb_tikvclient_backoff_total")].value.sum()

    Custom on fail: Discard value

  • Change per second
TiDB: Keep alive, rate

The number of times that the metrics are refreshed on TiDB instance per minute.

Dependent item tidb.monitor_keep_alive.rate

Preprocessing

  • JSON Path: $[?(@.name=="tidb_monitor_keep_alive_total")].value.first()

    Custom on fail: Discard value

  • Simple change
TiDB: Server connections

The number of connections to the current TiDB instance.

Dependent item tidb.tidb_server_connections

Preprocessing

  • JSON Path: $[?(@.name=="tidb_server_connections")].value.first()

TiDB: Heap memory usage

Number of heap bytes that are in use.

Dependent item tidb.heap_bytes

Preprocessing

  • JSON Path: $[?(@.name=="go_memstats_heap_inuse_bytes")].value.first()

TiDB: RSS memory usage

Resident memory size in bytes.

Dependent item tidb.rss_bytes

Preprocessing

  • JSON Path: $[?(@.name=="process_resident_memory_bytes")].value.first()

TiDB: Goroutine count

The number of Goroutines on TiDB instance.

Dependent item tidb.goroutines

Preprocessing

  • JSON Path: $[?(@.name=="go_goroutines")].value.first()

TiDB: Open file descriptors

Number of open file descriptors.

Dependent item tidb.process_open_fds

Preprocessing

  • JSON Path: $[?(@.name=="process_open_fds")].value.first()

TiDB: Open file descriptors, max

Maximum number of open file descriptors.

Dependent item tidb.process_max_fds

Preprocessing

  • JSON Path: $[?(@.name=="process_max_fds")].value.first()

TiDB: CPU

Total user and system CPU usage ratio.

Dependent item tidb.cpu.util

Preprocessing

  • JSON Path: $[?(@.name=="process_cpu_seconds_total")].value.first()

  • Change per second
  • Custom multiplier: 100

TiDB: Uptime

The runtime of each TiDB instance.

Dependent item tidb.uptime

Preprocessing

  • JSON Path: $[?(@.name=="process_start_time_seconds")].value.first()

  • JavaScript: The text is too long. Please see the template.

TiDB: Version

Version of the TiDB instance.

Dependent item tidb.version

Preprocessing

  • JSON Path: $.version

  • Discard unchanged with heartbeat: 3h

TiDB: Time jump back, rate

The number of times that the operating system rewinds every second.

Dependent item tidb.monitor_time_jump_back.rate

Preprocessing

  • JSON Path: The text is too long. Please see the template.

  • Change per second
TiDB: Server critical error, rate

The number of critical errors that occurred in TiDB per second.

Dependent item tidb.tidb_server_critical_error_total.rate

Preprocessing

  • JSON Path: The text is too long. Please see the template.

  • Change per second
TiDB: Server panic, rate

The number of panics that occurred in TiDB per second.

Dependent item tidb.tidb_server_panic_total.rate

Preprocessing

  • JSON Path: $[?(@.name=="tidb_server_panic_total")].value.first()

    Custom on fail: Discard value

  • Change per second
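Most items above follow one pattern: the master item turns the Prometheus text into a JSON array, a JSONPath filter selects and sums the matching samples, and "Change per second" turns the counter into a rate. The sketch below mimics that chain outside Zabbix. The row shape (`name`, `labels`, `value`) is a simplified assumption about the "Prometheus to JSON" output, the sample values are invented, and all function names are hypothetical.

```python
import re

# One sample line: metric name, optional {labels}, value, optional timestamp.
LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def prometheus_to_rows(text):
    """Parse Prometheus exposition text into dicts (comments are skipped)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match is None:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        rows.append({"name": match.group("name"), "labels": labels,
                     "value": float(match.group("value"))})
    return rows


def summed(rows, name, **labels):
    """Rough equivalent of $[?(@.name==... && @.labels...)].value.sum()."""
    return sum(r["value"] for r in rows
               if r["name"] == name
               and all(r["labels"].get(k) == v for k, v in labels.items()))


def change_per_second(prev_value, prev_ts, value, ts):
    """(value - prev_value) / (ts - prev_ts), as in 'Change per second'."""
    return (value - prev_value) / (ts - prev_ts)


T0 = '''
# TYPE tidb_server_query_total counter
tidb_server_query_total{result="OK",type="Query"} 100
tidb_server_query_total{result="OK",type="Use"} 20
tidb_server_query_total{result="Error",type="Query"} 4
'''
T1 = '''
tidb_server_query_total{result="OK",type="Query"} 150
tidb_server_query_total{result="OK",type="Use"} 30
tidb_server_query_total{result="Error",type="Query"} 10
'''

rows0, rows1 = prometheus_to_rows(T0), prometheus_to_rows(T1)
name = "tidb_server_query_total"
# Two polls taken 60 seconds apart.
ok_rate = change_per_second(summed(rows0, name, result="OK"), 0,
                            summed(rows1, name, result="OK"), 60)
error_rate = change_per_second(summed(rows0, name, result="Error"), 0,
                               summed(rows1, name, result="Error"), 60)
```

Because the rate is computed from two consecutive polls, a rate item has no value until the second poll arrives.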

Triggers

| Name | Description | Expression | Severity | Dependencies and additional info |
|------|-------------|------------|----------|----------------------------------|
| TiDB: Instance is not responding | | last(/TiDB by HTTP/tidb.status)=0 | Average | |
| TiDB: Too many region related errors | | min(/TiDB by HTTP/tidb.tikvclient_region_err.rate,5m)>{$TIDB.REGION_ERROR.MAX.WARN} | Average | |
| TiDB: Too many DDL waiting jobs | | min(/TiDB by HTTP/tidb.ddl_waiting_jobs,5m)>{$TIDB.DDL.WAITING.MAX.WARN} | Warning | |
| TiDB: Too many schema lease errors | | min(/TiDB by HTTP/tidb.domain_load_schema.failed.rate,5m)>{$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN} | Average | |
| TiDB: Too many schema lease errors | The latest schema information is not reloaded in TiDB within one lease. | min(/TiDB by HTTP/tidb.session_schema_lease_error.outdate.rate,5m)>{$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN} | Average | |
| TiDB: Too few keep alive operations | Indicates whether the TiDB process still exists. If tidb_monitor_keep_alive_total increases by fewer than 10 per minute, the TiDB process might have already exited, and an alert is triggered. | max(/TiDB by HTTP/tidb.monitor_keep_alive.rate,5m)<{$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN} | Average | |
| TiDB: Heap memory usage is too high | | min(/TiDB by HTTP/tidb.heap_bytes,5m)>{$TIDB.HEAP.USAGE.MAX.WARN} | Warning | |
| TiDB: Current number of open files is too high | Heavy file descriptor usage (i.e., near the process's file descriptor limit) indicates a potential file descriptor exhaustion issue. | min(/TiDB by HTTP/tidb.process_open_fds,5m)/last(/TiDB by HTTP/tidb.process_max_fds)*100>{$TIDB.OPEN.FDS.MAX.WARN} | Warning | |
| TiDB: has been restarted | Uptime is less than 10 minutes. | last(/TiDB by HTTP/tidb.uptime)<10m | Info | Manual close: Yes |
| TiDB: Version has changed | TiDB version has changed. Acknowledge to close the problem manually. | last(/TiDB by HTTP/tidb.version,#1)<>last(/TiDB by HTTP/tidb.version,#2) and length(last(/TiDB by HTTP/tidb.version))>0 | Info | Manual close: Yes |
| TiDB: Too many time jump backs | | min(/TiDB by HTTP/tidb.monitor_time_jump_back.rate,5m)>{$TIDB.TIME_JUMP_BACK.MAX.WARN} | Warning | |
| TiDB: There are panicked TiDB threads | When a panic occurs, an alert is triggered. The thread is often recovered; otherwise, TiDB will frequently restart. | last(/TiDB by HTTP/tidb.tidb_server_panic_total.rate)>0 | Average | |
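Several trigger expressions wrap the item in `min(...,5m)` or `max(...,5m)`, so the condition must hold for every value in the five-minute window, not just the latest one. The sketch below reproduces two of the expressions on plain Python lists. The function names and sample values are hypothetical; the thresholds are the macro defaults from this README.

```python
def fires_open_fds(open_fds_5m, max_fds, warn_percent=90):
    """min(open_fds,5m) / last(max_fds) * 100 > {$TIDB.OPEN.FDS.MAX.WARN}"""
    return min(open_fds_5m) / max_fds * 100 > warn_percent


def fires_keep_alive(changes_5m, minimum=10):
    """max(keep_alive.rate,5m) < {$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN}"""
    return max(changes_5m) < minimum


# Descriptor usage stays above 90% of a 1024 limit for the whole window.
sustained = fires_open_fds([950, 960, 940], max_fds=1024)
# A single dip to 900 (about 88%) keeps the trigger quiet.
with_dip = fires_open_fds([950, 900, 960], max_fds=1024)
# Keep-alive counter barely moved in any poll of the window.
stalled = fires_keep_alive([0, 3, 2])
```

Using `min` for "too high" checks and `max` for "too low" checks filters out short spikes, at the cost of the alert firing only after the condition has held for the full window.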

LLD rule QPS metrics discovery

Name Description Type Key and additional info
QPS metrics discovery

Discovers QPS-specific metrics.

Dependent item tidb.qps.discovery

Preprocessing

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

Item prototypes for QPS metrics discovery

Name Description Type Key and additional info
TiDB: Get QPS metrics: {#TYPE}

Get QPS metrics of {#TYPE}.

Dependent item tidb.qps.get_metrics[{#TYPE}]

Preprocessing

  • JSON Path: $[?(@.labels.type == "{#TYPE}")]

    Custom on fail: Discard value

TiDB: Server query "OK": {#TYPE}, rate

The number of queries per second on the TiDB instance whose command execution succeeded.

Dependent item tidb.server_query.ok.rate[{#TYPE}]

Preprocessing

  • JSON Path: $[?(@.labels.result == "OK")].value.first()

  • Change per second
TiDB: Server query "Error": {#TYPE}, rate

The number of queries per second on the TiDB instance whose command execution failed.

Dependent item tidb.server_query.error.rate[{#TYPE}]

Preprocessing

  • JSON Path: $[?(@.labels.result == "Error")].value.first()

  • Change per second
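The discovery rule's JavaScript is not reproduced in this README, so the following is only a sketch of the general idea: collect the distinct `type` label values and emit them as low-level discovery rows keyed by `{#TYPE}`, from which Zabbix creates one set of item prototypes per value. The input series name, the row shape, and `discover_types` are assumptions for illustration and may differ from the template's actual script.

```python
import json


def discover_types(rows, name="tidb_server_query_total"):
    """Return LLD JSON with one {#TYPE} entry per distinct 'type' label."""
    types = sorted({r["labels"]["type"] for r in rows
                    if r["name"] == name and "type" in r["labels"]})
    return json.dumps([{"{#TYPE}": t} for t in types])


# Invented sample rows in a simplified "Prometheus to JSON" shape.
rows = [
    {"name": "tidb_server_query_total", "labels": {"result": "OK", "type": "Query"}, "value": 150},
    {"name": "tidb_server_query_total", "labels": {"result": "Error", "type": "Query"}, "value": 10},
    {"name": "tidb_server_query_total", "labels": {"result": "OK", "type": "Use"}, "value": 30},
    {"name": "go_goroutines", "labels": {}, "value": 42},
]

lld = discover_types(rows)
```

Each discovered value then fills `{#TYPE}` in the prototype keys and in the JSONPath filter `$[?(@.labels.type == "{#TYPE}")]`.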

LLD rule Statement metrics discovery

Name Description Type Key and additional info
Statement metrics discovery

Discovers statement-specific metrics.

Dependent item tidb.statement.discover

Preprocessing

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

Item prototypes for Statement metrics discovery

Name Description Type Key and additional info
TiDB: SQL statements: {#TYPE}, rate

The number of SQL statements executed per second.

Dependent item tidb.statement.rate[{#TYPE}]

Preprocessing

  • JSON Path: $[?(@.labels.type == "{#TYPE}")].value.first()

  • Change per second

LLD rule KV metrics discovery

Name Description Type Key and additional info
KV metrics discovery

Discovers KV-specific metrics.

Dependent item tidb.kv_ops.discovery

Preprocessing

  • JSON Path: The text is too long. Please see the template.

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

Item prototypes for KV metrics discovery

Name Description Type Key and additional info
TiDB: KV Commands: {#TYPE}, rate

The number of executed KV commands per second.

Dependent item tidb.tikvclient_txn.rate[{#TYPE}]

Preprocessing

  • JSON Path: The text is too long. Please see the template.

  • Change per second

LLD rule Lock resolves discovery

Name Description Type Key and additional info
Lock resolves discovery

Discovers metrics specific to lock resolves.

Dependent item tidb.tikvclient_lock_resolver_action.discovery

Preprocessing

  • JSON Path: $[?(@.name=="tidb_tikvclient_lock_resolver_actions_total")]

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

Item prototypes for Lock resolves discovery

Name Description Type Key and additional info
TiDB: Lock resolves: {#TYPE}, rate

The number of TiDB operations that resolve locks per second. When TiDB's read or write request encounters a lock, it tries to resolve the lock.

Dependent item tidb.tikvclient_lock_resolver_action.rate[{#TYPE}]

Preprocessing

  • JSON Path: The text is too long. Please see the template.

  • Change per second

LLD rule KV backoff discovery

Name Description Type Key and additional info
KV backoff discovery

Discovers metrics specific to KV backoff.

Dependent item tidb.tikvclient_backoff.discovery

Preprocessing

  • JSON Path: $[?(@.name=="tidb_tikvclient_backoff_total")]

    Custom on fail: Discard value

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

Item prototypes for KV backoff discovery

Name Description Type Key and additional info
TiDB: KV backoff: {#TYPE}, rate

The number of errors returned by TiKV per second.

Dependent item tidb.tikvclient_backoff.rate[{#TYPE}]

Preprocessing

  • JSON Path: The text is too long. Please see the template.

  • Change per second

LLD rule GC action results discovery

Name Description Type Key and additional info
GC action results discovery

Discovers GC action result metrics.

Dependent item tidb.tikvclient_gc_action.discovery

Preprocessing

  • JSON Path: $[?(@.name=="tidb_tikvclient_gc_action_result")]

    Custom on fail: Discard value

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

Item prototypes for GC action results discovery

Name Description Type Key and additional info
TiDB: GC action result: {#TYPE}, rate

The number of results of GC-related operations per second.

Dependent item tidb.tikvclient_gc_action.rate[{#TYPE}]

Preprocessing

  • JSON Path: The text is too long. Please see the template.

  • Change per second

Trigger prototypes for GC action results discovery

Name Description Expression Severity Dependencies and additional info
TiDB: Too many failed GC-related operations min(/TiDB by HTTP/tidb.tikvclient_gc_action.rate[{#TYPE}],5m)>{$TIDB.GC_ACTIONS.ERRORS.MAX.WARN} Warning

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help at the Zabbix forums.