# TiDB by HTTP

## Overview

This template monitors the TiDB server of a TiDB cluster with Zabbix and works without any external scripts.
Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

The template `TiDB by HTTP` collects metrics by HTTP agent from the TiDB /metrics endpoint and from the monitoring API.
See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api.

## Requirements

Zabbix version: 7.0 and higher.

## Tested versions

This template has been tested on:
- TiDB cluster 4.0.10, 6.5.1

## Configuration

> Zabbix should be configured according to the instructions in the [Templates out of the box](https://www.zabbix.com/documentation/7.0/manual/config/templates_out_of_the_box) section.

## Setup

This template works with the TiDB server of a TiDB cluster.
Internal service metrics are collected from the TiDB /metrics endpoint and from the monitoring API.
See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api.

Don't forget to change the macros {$TIDB.URL}, {$TIDB.PORT}.
Also, see the Macros section for a list of macros used to set trigger values.

### Macros used

|Name|Description|Default|
|----|-----------|-------|
|{$TIDB.PORT}|The port of TiDB server metrics web endpoint|`10080`|
|{$TIDB.URL}|TiDB server URL|`localhost`|
|{$TIDB.OPEN.FDS.MAX.WARN}|Maximum percentage of used file descriptors|`90`|
|{$TIDB.HEAP.USAGE.MAX.WARN}|Maximum heap memory used|`10G`|
|{$TIDB.DDL.WAITING.MAX.WARN}|Maximum number of DDL tasks that are waiting|`5`|
|{$TIDB.TIME_JUMP_BACK.MAX.WARN}|Maximum number of times that the operating system rewinds every second|`1`|
|{$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN}|Maximum number of schema lease errors|`0`|
|{$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN}|Maximum number of load schema errors|`1`|
|{$TIDB.GC_ACTIONS.ERRORS.MAX.WARN}|Maximum number of GC-related operations failures|`1`|
|{$TIDB.REGION_ERROR.MAX.WARN}|Maximum number of region-related errors|`50`|
|{$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN}|Minimum number of keep alive operations|`10`|

### Items

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|TiDB: Get instance metrics|Get TiDB instance metrics.
|HTTP agent|tidb.get_metrics**Preprocessing**
Check for not supported value
⛔️Custom on fail: Discard value
Get TiDB instance status info.
|HTTP agent|tidb.get_status**Preprocessing**
Check for not supported value
⛔️Custom on fail: Set value to: `{"status": "0"}`
Status of TiDB instance.
|Dependent item|tidb.status**Preprocessing**
JSON Path: `$.status`
⛔️Custom on fail: Set value to: `1`
Discard unchanged with heartbeat: `1h`
Get information about server queries.
|Dependent item|tidb.server_query.get_metrics**Preprocessing**
JSON Path: `$[?(@.name == "tidb_server_query_total")]`
⛔️Custom on fail: Discard value
The number of queries per second on the TiDB instance whose command execution failed.
|Dependent item|tidb.server_query.error.rate**Preprocessing**
JSON Path: `$[?(@.labels.result == "Error")].value.sum()`
The number of queries per second on the TiDB instance whose command execution succeeded.
|Dependent item|tidb.server_query.ok.rate**Preprocessing**
JSON Path: `$[?(@.labels.result == "OK")].value.sum()`
The number of queries per second on TiDB instance.
|Dependent item|tidb.server_query.rate**Preprocessing**
JSON Path: `$..value.sum()`
Get SQL statements metrics.
|Dependent item|tidb.statement_total.get_metrics**Preprocessing**
JSON Path: `$[?(@.name=="tidb_executor_statement_total")]`
⛔️Custom on fail: Discard value
The total number of SQL statements executed per second.
|Dependent item|tidb.statement_total.rate**Preprocessing**
JSON Path: `$..value.sum()`
The number of errors that occurred when executing SQL statements per second (such as syntax errors and primary key conflicts).
|Dependent item|tidb.execute_error.rate**Preprocessing**
JSON Path: `$[?(@.name=="tidb_server_execute_error_total")].value.sum()`
⛔️Custom on fail: Discard value
Get TiKV client metrics.
|Dependent item|tidb.tikvclient.get_metrics**Preprocessing**
JSON Path: `$[?(@.name=~"tidb_tikvclient_*")]`
⛔️Custom on fail: Discard value
The number of executed KV commands per second.
|Dependent item|tidb.tikvclient_txn.rate**Preprocessing**
JSON Path: `The text is too long. Please see the template.`
The number of TSO commands that TiDB obtains from PD per second.
|Dependent item|tidb.pd_tso_cmd.rate**Preprocessing**
JSON Path: `The text is too long. Please see the template.`
The number of TSO requests that TiDB obtains from PD per second.
|Dependent item|tidb.pd_tso_request.rate**Preprocessing**
JSON Path: `The text is too long. Please see the template.`
The number of region related errors returned by TiKV per second.
|Dependent item|tidb.tikvclient_region_err.rate**Preprocessing**
JSON Path: `$[?(@.name=="tidb_tikvclient_region_err_total")].value.sum()`
The number of TiDB operations that resolve locks per second. When TiDB's read or write request encounters a lock, it tries to resolve the lock.
|Dependent item|tidb.tikvclient_lock_resolver_action.rate**Preprocessing**
JSON Path: `The text is too long. Please see the template.`
The number of DDL tasks that are waiting.
|Dependent item|tidb.ddl_waiting_jobs**Preprocessing**
JSON Path: `$[?(@.name=="tidb_ddl_waiting_jobs")].value.sum()`
⛔️Custom on fail: Set value to: `0`
The statistics of the schemas that TiDB obtains from TiKV per second.
|Dependent item|tidb.domain_load_schema.rate**Preprocessing**
JSON Path: `$[?(@.name=="tidb_domain_load_schema_total")].value.sum()`
The total number of failures to reload the latest schema information in TiDB per second.
|Dependent item|tidb.domain_load_schema.failed.rate**Preprocessing**
JSON Path: `The text is too long. Please see the template.`
⛔️Custom on fail: Discard value
The number of schema lease errors per second.
"outdate" errors mean that the schema cannot be updated, which is a more serious error and triggers an alert.
|Dependent item|tidb.session_schema_lease_error.outdate.rate**Preprocessing**
JSON Path: `The text is too long. Please see the template.`
⛔️Custom on fail: Discard value
The number of schema lease errors per second.
"change" errors mean that the schema has changed.
|Dependent item|tidb.session_schema_lease_error.change.rate**Preprocessing**
JSON Path: `The text is too long. Please see the template.`
⛔️Custom on fail: Discard value
The number of errors returned by TiKV.
|Dependent item|tidb.tikvclient_backoff.rate**Preprocessing**
JSON Path: `$[?(@.name=="tidb_tikvclient_backoff_total")].value.sum()`
⛔️Custom on fail: Discard value
The number of times that the metrics are refreshed on TiDB instance per minute.
|Dependent item|tidb.monitor_keep_alive.rate**Preprocessing**
JSON Path: `$[?(@.name=="tidb_monitor_keep_alive_total")].value.first()`
⛔️Custom on fail: Discard value
The number of connections on the current TiDB instance.
|Dependent item|tidb.tidb_server_connections**Preprocessing**
JSON Path: `$[?(@.name=="tidb_server_connections")].value.first()`
Number of heap bytes that are in use.
|Dependent item|tidb.heap_bytes**Preprocessing**
JSON Path: `$[?(@.name=="go_memstats_heap_inuse_bytes")].value.first()`
Resident memory size in bytes.
|Dependent item|tidb.rss_bytes**Preprocessing**
JSON Path: `$[?(@.name=="process_resident_memory_bytes")].value.first()`
The number of Goroutines on TiDB instance.
|Dependent item|tidb.goroutines**Preprocessing**
JSON Path: `$[?(@.name=="go_goroutines")].value.first()`
Number of open file descriptors.
|Dependent item|tidb.process_open_fds**Preprocessing**
JSON Path: `$[?(@.name=="process_open_fds")].value.first()`
Maximum number of open file descriptors.
|Dependent item|tidb.process_max_fds**Preprocessing**
JSON Path: `$[?(@.name=="process_max_fds")].value.first()`
Total user and system CPU usage ratio.
|Dependent item|tidb.cpu.util**Preprocessing**
JSON Path: `$[?(@.name=="process_cpu_seconds_total")].value.first()`
Custom multiplier: `100`
The runtime of each TiDB instance.
|Dependent item|tidb.uptime**Preprocessing**
JSON Path: `$[?(@.name=="process_start_time_seconds")].value.first()`
JavaScript: `The text is too long. Please see the template.`
Version of the TiDB instance.
|Dependent item|tidb.version**Preprocessing**
JSON Path: `$.version`
Discard unchanged with heartbeat: `3h`
The number of times that the operating system rewinds every second.
|Dependent item|tidb.monitor_time_jump_back.rate**Preprocessing**
JSON Path: `The text is too long. Please see the template.`
The number of critical errors that occurred in TiDB per second.
|Dependent item|tidb.tidb_server_critical_error_total.rate**Preprocessing**
JSON Path: `The text is too long. Please see the template.`
The number of panics that occurred in TiDB per second.
|Dependent item|tidb.tidb_server_panic_total.rate**Preprocessing**
JSON Path: `$[?(@.name=="tidb_server_panic_total")].value.first()`
⛔️Custom on fail: Discard value
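
The JSONPath steps listed above filter an array of metric objects by `name` and `labels` and then aggregate `value` with `.sum()` or `.first()`. The sketch below imitates that selection in plain Python on a made-up sample of Prometheus text. It assumes each metric is represented as an object with `name`, `labels`, and `value` fields, which is inferred from the JSONPath expressions rather than stated in this README. The sample values and the helper names `parse_metrics` and `select` are hypothetical, and the values shown are raw counters, not per-second rates.

```python
import re

# Hypothetical sample of a /metrics response (values are made up).
SAMPLE = """\
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 312
tidb_server_query_total{result="OK",type="Query"} 1500
tidb_server_query_total{result="Error",type="Query"} 3
tidb_server_query_total{result="OK",type="Ping"} 40
"""

# Simplified patterns: timestamps after the value are not handled.
LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Turn Prometheus text lines into dicts with name, labels, and value."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        rows.append({"name": name, "labels": labels, "value": float(value)})
    return rows


def select(rows, name=None, **labels):
    """Rough equivalent of $[?(@.name == ... && @.labels.x == ...)]."""
    return [
        row for row in rows
        if (name is None or row["name"] == name)
        and all(row["labels"].get(k) == v for k, v in labels.items())
    ]


rows = parse_metrics(SAMPLE)

# Like $[?(@.name == "tidb_server_query_total")]
queries = select(rows, name="tidb_server_query_total")

# Like $[?(@.labels.result == "OK")].value.sum()
ok_total = sum(row["value"] for row in select(queries, result="OK"))

# Like $[?(@.labels.result == "Error")].value.sum()
error_total = sum(row["value"] for row in select(queries, result="Error"))

# Like $[?(@.name=="go_goroutines")].value.first()
goroutines = select(rows, name="go_goroutines")[0]["value"]

print(ok_total, error_total, goroutines)
```

Filtering once into an intermediate list (`queries`) mirrors how the dependent `get_metrics` items narrow the data before the per-metric items aggregate it.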

### Triggers

|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
||The latest schema information is not reloaded in TiDB within one lease.|`min(/TiDB by HTTP/tidb.session_schema_lease_error.outdate.rate,5m)>{$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN}`|Average||
|TiDB: Too few keep alive operations|Indicates whether the TiDB process still exists. If `tidb_monitor_keep_alive_total` increases fewer than 10 times per minute, the TiDB process might have already exited, and an alert is triggered.|`max(/TiDB by HTTP/tidb.monitor_keep_alive.rate,5m)<{$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN}`|Average||
|TiDB: Heap memory usage is too high||`min(/TiDB by HTTP/tidb.heap_bytes,5m)>{$TIDB.HEAP.USAGE.MAX.WARN}`|Warning||
|TiDB: Current number of open files is too high|Heavy file descriptor usage (i.e., near the process's file descriptor limit) indicates a potential file descriptor exhaustion issue.|`min(/TiDB by HTTP/tidb.process_open_fds,5m)/last(/TiDB by HTTP/tidb.process_max_fds)*100>{$TIDB.OPEN.FDS.MAX.WARN}`|Warning||
|TiDB: has been restarted|Uptime is less than 10 minutes.|`last(/TiDB by HTTP/tidb.uptime)<10m`|Info|**Manual close**: Yes|
|TiDB: Version has changed|TiDB version has changed. Acknowledge to close the problem manually.|`last(/TiDB by HTTP/tidb.version,#1)<>last(/TiDB by HTTP/tidb.version,#2) and length(last(/TiDB by HTTP/tidb.version))>0`|Info|**Manual close**: Yes|
|TiDB: Too many time jump backs||`min(/TiDB by HTTP/tidb.monitor_time_jump_back.rate,5m)>{$TIDB.TIME_JUMP_BACK.MAX.WARN}`|Warning||
|TiDB: There are panicked TiDB threads|When a panic occurs, an alert is triggered. The thread is often recovered; otherwise, TiDB will frequently restart.|`last(/TiDB by HTTP/tidb.tidb_server_panic_total.rate)>0`|Average||

### LLD rule QPS metrics discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|QPS metrics discovery|Discovery of QPS-specific metrics.
|Dependent item|tidb.qps.discovery**Preprocessing**
JavaScript: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `1h`
Get QPS metrics of {#TYPE}.
|Dependent item|tidb.qps.get_metrics[{#TYPE}]**Preprocessing**
JSON Path: `$[?(@.labels.type == "{#TYPE}")]`
⛔️Custom on fail: Discard value
The number of queries per second on the TiDB instance whose command execution succeeded.
|Dependent item|tidb.server_query.ok.rate[{#TYPE}]**Preprocessing**
JSON Path: `$[?(@.labels.result == "OK")].value.first()`
The number of queries per second on the TiDB instance whose command execution failed.
|Dependent item|tidb.server_query.error.rate[{#TYPE}]**Preprocessing**
JSON Path: `$[?(@.labels.result == "Error")].value.first()`
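
The QPS discovery above creates one set of items per `{#TYPE}` value, and each item then picks its own sample with `$[?(@.labels.type == "{#TYPE}")]` followed by a filter on `labels.result`. The template's actual discovery JavaScript is not reproduced in this README, so the following is only a hypothetical sketch of the idea: collect the distinct `type` labels into low-level discovery rows, then look up one value per type and result. The sample data and the function names `discover_types` and `first_value` are assumptions made for illustration.

```python
import json

# Hypothetical, already-filtered tidb_server_query_total samples.
queries = [
    {"name": "tidb_server_query_total",
     "labels": {"result": "OK", "type": "Query"}, "value": 1500.0},
    {"name": "tidb_server_query_total",
     "labels": {"result": "Error", "type": "Query"}, "value": 3.0},
    {"name": "tidb_server_query_total",
     "labels": {"result": "OK", "type": "Ping"}, "value": 40.0},
]


def discover_types(rows):
    """Build LLD rows, one per distinct 'type' label, in first-seen order."""
    seen = []
    for row in rows:
        type_label = row["labels"].get("type")
        if type_label is not None and type_label not in seen:
            seen.append(type_label)
    return [{"{#TYPE}": type_label} for type_label in seen]


def first_value(rows, type_label, result):
    """Rough equivalent of filtering by type, then by result, then .first()."""
    for row in rows:
        labels = row["labels"]
        if labels.get("type") == type_label and labels.get("result") == result:
            return row["value"]
    return None  # no matching sample for this type and result


lld = discover_types(queries)
print(json.dumps(lld))

for entry in lld:
    type_label = entry["{#TYPE}"]
    print(type_label,
          first_value(queries, type_label, "OK"),
          first_value(queries, type_label, "Error"))
```

In this sample the `Ping` type has no `Error` series, so the lookup returns `None`. How the template's items behave when a series is missing is not described in this README.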
Discovery of statement-specific metrics.
|Dependent item|tidb.statement.discover**Preprocessing**
JavaScript: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `1h`
The number of SQL statements executed per second.
|Dependent item|tidb.statement.rate[{#TYPE}]**Preprocessing**
JSON Path: `$[?(@.labels.type == "{#TYPE}")].value.first()`
Discovery of KV-specific metrics.
|Dependent item|tidb.kv_ops.discovery**Preprocessing**
JSON Path: `The text is too long. Please see the template.`
JavaScript: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `1h`
The number of executed KV commands per second.
|Dependent item|tidb.tikvclient_txn.rate[{#TYPE}]**Preprocessing**
JSON Path: `The text is too long. Please see the template.`
Discovery of metrics specific to lock resolves.
|Dependent item|tidb.tikvclient_lock_resolver_action.discovery**Preprocessing**
JSON Path: `$[?(@.name=="tidb_tikvclient_lock_resolver_actions_total")]`
JavaScript: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `1h`
The number of TiDB operations that resolve locks per second. When TiDB's read or write request encounters a lock, it tries to resolve the lock.
|Dependent item|tidb.tikvclient_lock_resolver_action.rate[{#TYPE}]**Preprocessing**
JSON Path: `The text is too long. Please see the template.`
Discovery of metrics specific to KV backoff.
|Dependent item|tidb.tikvclient_backoff.discovery**Preprocessing**
JSON Path: `$[?(@.name=="tidb_tikvclient_backoff_total")]`
⛔️Custom on fail: Discard value
JavaScript: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `1h`
The number of errors returned by TiKV per second, by backoff type.
|Dependent item|tidb.tikvclient_backoff.rate[{#TYPE}]**Preprocessing**
JSON Path: `The text is too long. Please see the template.`
Discovery of GC action result metrics.
|Dependent item|tidb.tikvclient_gc_action.discovery**Preprocessing**
JSON Path: `$[?(@.name=="tidb_tikvclient_gc_action_result")]`
⛔️Custom on fail: Discard value
JavaScript: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `1h`
The number of results of GC-related operations per second.
|Dependent item|tidb.tikvclient_gc_action.rate[{#TYPE}]**Preprocessing**
JSON Path: `The text is too long. Please see the template.`