# TiDB by HTTP

## Overview

This template monitors the TiDB server of a TiDB cluster with Zabbix and works without any external scripts.
Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

The template `TiDB by HTTP` collects metrics by HTTP agent from the TiDB /metrics endpoint and from the monitoring API.
See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api.

## Requirements

Zabbix version: 7.0 and higher.

## Tested versions

This template has been tested on:
- TiDB cluster 4.0.10, 6.5.1

## Configuration

> Zabbix should be configured according to the instructions in the [Templates out of the box](https://www.zabbix.com/documentation/7.0/manual/config/templates_out_of_the_box) section.

## Setup

This template works with the TiDB server of a TiDB cluster.
Internal service metrics are collected from the TiDB /metrics endpoint and from the monitoring API.
See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api.
Don't forget to change the macros {$TIDB.URL} and {$TIDB.PORT}.
Also, see the Macros section for a list of macros used to set trigger values.

### Macros used

|Name|Description|Default|
|----|-----------|-------|
|{$TIDB.PORT}|The port of the TiDB server metrics web endpoint|`10080`|
|{$TIDB.URL}|TiDB server URL|`localhost`|
|{$TIDB.OPEN.FDS.MAX.WARN}|Maximum percentage of used file descriptors|`90`|
|{$TIDB.HEAP.USAGE.MAX.WARN}|Maximum heap memory used|`10G`|
|{$TIDB.DDL.WAITING.MAX.WARN}|Maximum number of DDL tasks that are waiting|`5`|
|{$TIDB.TIME_JUMP_BACK.MAX.WARN}|Maximum number of times per second that the operating system clock jumps back|`1`|
|{$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN}|Maximum number of schema lease errors|`0`|
|{$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN}|Maximum number of load schema errors|`1`|
|{$TIDB.GC_ACTIONS.ERRORS.MAX.WARN}|Maximum number of GC-related operation failures|`1`|
|{$TIDB.REGION_ERROR.MAX.WARN}|Maximum number of region-related errors|`50`|
|{$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN}|Minimum number of keep alive operations|`10`|

### Items

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|TiDB: Get instance metrics|Get TiDB instance metrics.|HTTP agent|tidb.get_metrics|
|TiDB: Get instance status|Get TiDB instance status info.|HTTP agent|tidb.get_status|
|TiDB: Status|Status of the TiDB instance.|Dependent item|tidb.status|
|TiDB: Get total server query metrics|Get information about server queries.|Dependent item|tidb.server_query.get_metrics|
|TiDB: Total "error" server query, rate|The number of queries per second on the TiDB instance whose command execution failed.|Dependent item|tidb.server_query.error.rate|
|TiDB: Total "ok" server query, rate|The number of queries per second on the TiDB instance whose command execution succeeded.|Dependent item|tidb.server_query.ok.rate|
|TiDB: Total server query, rate|The number of queries per second on the TiDB instance.|Dependent item|tidb.server_query.rate|
|TiDB: Get SQL statements metrics|Get SQL statements metrics.|Dependent item|tidb.statement_total.get_metrics|
|TiDB: SQL statements, rate|The total number of SQL statements executed per second.|Dependent item|tidb.statement_total.rate|
|TiDB: Failed Query, rate|The number of errors per second that occurred when executing SQL statements (such as syntax errors and primary key conflicts).|Dependent item|tidb.execute_error.rate|
|TiDB: Get TiKV client metrics|Get TiKV client metrics.|Dependent item|tidb.tikvclient.get_metrics|
|TiDB: KV commands, rate|The number of executed KV commands per second.|Dependent item|tidb.tikvclient_txn.rate|
|TiDB: PD TSO commands, rate|The number of TSO commands that TiDB obtains from PD per second.|Dependent item|tidb.pd_tso_cmd.rate|
|TiDB: PD TSO requests, rate|The number of TSO requests that TiDB obtains from PD per second.|Dependent item|tidb.pd_tso_request.rate|
|TiDB: TiClient region errors, rate|The number of region-related errors returned by TiKV per second.|Dependent item|tidb.tikvclient_region_err.rate|
|TiDB: Lock resolves, rate|The number of TiDB operations that resolve locks per second. When TiDB's read or write request encounters a lock, it tries to resolve the lock.|Dependent item|tidb.tikvclient_lock_resolver_action.rate|
|TiDB: DDL waiting jobs|The number of DDL tasks that are waiting.|Dependent item|tidb.ddl_waiting_jobs|
|TiDB: Load schema total, rate|The statistics of the schemas that TiDB obtains from TiKV per second.|Dependent item|tidb.domain_load_schema.rate|
|TiDB: Load schema failed, rate|The total number of failures to reload the latest schema information in TiDB per second.|Dependent item|tidb.domain_load_schema.failed.rate|
|TiDB: Schema lease "outdate" errors , rate|The number of schema lease errors per second. An "outdate" error means that the schema cannot be updated; this is a more serious error and triggers an alert.|Dependent item|tidb.session_schema_lease_error.outdate.rate|
|TiDB: Schema lease "change" errors, rate|The number of schema lease errors per second. A "change" error means that the schema has changed.|Dependent item|tidb.session_schema_lease_error.change.rate|
|TiDB: KV backoff, rate|The number of errors returned by TiKV.|Dependent item|tidb.tikvclient_backoff.rate|
|TiDB: Keep alive, rate|The number of times that the metrics are refreshed on the TiDB instance per minute.|Dependent item|tidb.monitor_keep_alive.rate|
|TiDB: Server connections|The number of connections to the current TiDB instance.|Dependent item|tidb.tidb_server_connections|
|TiDB: Heap memory usage|Number of heap bytes that are in use.|Dependent item|tidb.heap_bytes|
|TiDB: RSS memory usage|Resident memory size in bytes.|Dependent item|tidb.rss_bytes|
|TiDB: Goroutine count|The number of Goroutines on the TiDB instance.|Dependent item|tidb.goroutines|
|TiDB: Open file descriptors|Number of open file descriptors.|Dependent item|tidb.process_open_fds|
|TiDB: Open file descriptors, max|Maximum number of open file descriptors.|Dependent item|tidb.process_max_fds|
|TiDB: CPU|Total user and system CPU usage ratio.|Dependent item|tidb.cpu.util|
|TiDB: Uptime|The runtime of each TiDB instance.|Dependent item|tidb.uptime|
|TiDB: Version|Version of the TiDB instance.|Dependent item|tidb.version|
|TiDB: Time jump back, rate|The number of times per second that the operating system clock jumps back.|Dependent item|tidb.monitor_time_jump_back.rate|
|TiDB: Server critical error, rate|The number of critical errors that occurred in TiDB per second.|Dependent item|tidb.tidb_server_critical_error_total.rate|
|TiDB: Server panic, rate|The number of panics that occurred in TiDB per second.|Dependent item|tidb.tidb_server_panic_total.rate|

### Triggers

|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
|TiDB: Instance is not responding||`last(/TiDB by HTTP/tidb.status)=0`|Average||
|TiDB: Too many region related errors||`min(/TiDB by HTTP/tidb.tikvclient_region_err.rate,5m)>{$TIDB.REGION_ERROR.MAX.WARN}`|Average||
|TiDB: Too many DDL waiting jobs||`min(/TiDB by HTTP/tidb.ddl_waiting_jobs,5m)>{$TIDB.DDL.WAITING.MAX.WARN}`|Warning||
|TiDB: Too many schema lease errors||`min(/TiDB by HTTP/tidb.domain_load_schema.failed.rate,5m)>{$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN}`|Average||
|TiDB: Too many schema lease errors|The latest schema information is not reloaded in TiDB within one lease.|`min(/TiDB by HTTP/tidb.session_schema_lease_error.outdate.rate,5m)>{$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN}`|Average||
|TiDB: Too few keep alive operations|Indicates whether the TiDB process still exists. If tidb_monitor_keep_alive_total increases by fewer than 10 per minute, the TiDB process might have already exited, and an alert is triggered.|`max(/TiDB by HTTP/tidb.monitor_keep_alive.rate,5m)<{$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN}`|Average||
|TiDB: Heap memory usage is too high||`min(/TiDB by HTTP/tidb.heap_bytes,5m)>{$TIDB.HEAP.USAGE.MAX.WARN}`|Warning||
|TiDB: Current number of open files is too high|Heavy file descriptor usage (i.e., near the process's file descriptor limit) indicates a potential file descriptor exhaustion issue.|`min(/TiDB by HTTP/tidb.process_open_fds,5m)/last(/TiDB by HTTP/tidb.process_max_fds)*100>{$TIDB.OPEN.FDS.MAX.WARN}`|Warning||
|TiDB: has been restarted|Uptime is less than 10 minutes.|`last(/TiDB by HTTP/tidb.uptime)<10m`|Info|**Manual close**: Yes|
|TiDB: Version has changed|TiDB version has changed. Acknowledge to close the problem manually.|`last(/TiDB by HTTP/tidb.version,#1)<>last(/TiDB by HTTP/tidb.version,#2) and length(last(/TiDB by HTTP/tidb.version))>0`|Info|**Manual close**: Yes|
|TiDB: Too many time jump backs||`min(/TiDB by HTTP/tidb.monitor_time_jump_back.rate,5m)>{$TIDB.TIME_JUMP_BACK.MAX.WARN}`|Warning||
|TiDB: There are panicked TiDB threads|When a panic occurs, an alert is triggered. The thread is usually recovered; otherwise, TiDB restarts frequently.|`last(/TiDB by HTTP/tidb.tidb_server_panic_total.rate)>0`|Average||

### LLD rule QPS metrics discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|QPS metrics discovery|Discovers QPS-specific metrics.|Dependent item|tidb.qps.discovery|

### Item prototypes for QPS metrics discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|TiDB: Get QPS metrics: {#TYPE}|Get QPS metrics of {#TYPE}.|Dependent item|tidb.qps.get_metrics[{#TYPE}]|
|TiDB: Server query "OK": {#TYPE}, rate|The number of queries per second on the TiDB instance whose command execution succeeded.|Dependent item|tidb.server_query.ok.rate[{#TYPE}]|
|TiDB: Server query "Error": {#TYPE}, rate|The number of queries per second on the TiDB instance whose command execution failed.|Dependent item|tidb.server_query.error.rate[{#TYPE}]|

### LLD rule Statement metrics discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Statement metrics discovery|Discovers statement-specific metrics.|Dependent item|tidb.statement.discover|

### Item prototypes for Statement metrics discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|TiDB: SQL statements: {#TYPE}, rate|The number of SQL statements executed per second.|Dependent item|tidb.statement.rate[{#TYPE}]|

### LLD rule KV metrics discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|KV metrics discovery|Discovers KV-specific metrics.|Dependent item|tidb.kv_ops.discovery|

### Item prototypes for KV metrics discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|TiDB: KV Commands: {#TYPE}, rate|The number of executed KV commands per second.|Dependent item|tidb.tikvclient_txn.rate[{#TYPE}]|

### LLD rule Lock resolves discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Lock resolves discovery|Discovers metrics specific to lock resolves.|Dependent item|tidb.tikvclient_lock_resolver_action.discovery|

### Item prototypes for Lock resolves discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|TiDB: Lock resolves: {#TYPE}, rate|The number of TiDB operations that resolve locks per second. When TiDB's read or write request encounters a lock, it tries to resolve the lock.|Dependent item|tidb.tikvclient_lock_resolver_action.rate[{#TYPE}]|

### LLD rule KV backoff discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|KV backoff discovery|Discovers metrics specific to KV backoff.|Dependent item|tidb.tikvclient_backoff.discovery|

### Item prototypes for KV backoff discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|TiDB: KV backoff: {#TYPE}, rate|The number of errors returned by TiKV.|Dependent item|tidb.tikvclient_backoff.rate[{#TYPE}]|

### LLD rule GC action results discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|GC action results discovery|Discovers GC action result metrics.|Dependent item|tidb.tikvclient_gc_action.discovery|

### Item prototypes for GC action results discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|TiDB: GC action result: {#TYPE}, rate|The number of results of GC-related operations per second.|Dependent item|tidb.tikvclient_gc_action.rate[{#TYPE}]|

### Trigger prototypes for GC action results discovery

|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
|TiDB: Too many failed GC-related operations||`min(/TiDB by HTTP/tidb.tikvclient_gc_action.rate[{#TYPE}],5m)>{$TIDB.GC_ACTIONS.ERRORS.MAX.WARN}`|Warning||

## Feedback

Please report any issues with the template at [`https://support.zabbix.com`](https://support.zabbix.com).

You can also provide feedback, discuss the template, or ask for help at [`ZABBIX forums`](https://www.zabbix.com/forum/zabbix-suggestions-and-feedback).
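
As a rough illustration of how the `TiDB: Current number of open files is too high` trigger evaluates, the sketch below parses Prometheus-style text of the kind a /metrics endpoint returns and applies the same arithmetic as the trigger expression. The metric names `process_open_fds` and `process_max_fds`, the sample text, and all function names are assumptions made for this example; they are not taken from the template, and Zabbix itself does this work through item preprocessing and trigger functions rather than Python.

```python
"""Sketch of the open-file-descriptor check, under the assumptions stated above."""
from typing import Sequence


def parse_metric(text: str, name: str) -> float:
    """Return the value of the first unlabelled sample called `name`.

    Assumes the plain Prometheus text format: `<name> <value>` per line,
    with comment lines starting with `#`.
    """
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    raise KeyError(name)


def fds_used_percent(open_fds_samples: Sequence[float], max_fds: float) -> float:
    """Mirror min(open_fds,5m)/last(max_fds)*100 from the trigger expression."""
    return min(open_fds_samples) / max_fds * 100


def fds_problem(open_fds_samples: Sequence[float], max_fds: float,
                warn: float = 90) -> bool:
    """True when usage exceeds the threshold.

    `warn` defaults to 90, the default of {$TIDB.OPEN.FDS.MAX.WARN}.
    """
    return fds_used_percent(open_fds_samples, max_fds) > warn


# Hypothetical /metrics excerpt, invented for illustration only.
SAMPLE = """\
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1000
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 950
"""

if __name__ == "__main__":
    max_fds = parse_metric(SAMPLE, "process_max_fds")
    latest = parse_metric(SAMPLE, "process_open_fds")
    # Pretend these values were collected over the last 5 minutes.
    window = [920.0, 935.0, latest]
    print(fds_used_percent(window, max_fds), fds_problem(window, max_fds))
```

Because the expression takes the minimum over the 5-minute window, a single spike does not raise the problem; usage has to stay above the threshold for the whole window.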