# Kubernetes Controller manager by HTTP
## Overview
This template monitors the Kubernetes Controller manager with Zabbix and works without any external scripts.
Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.
The template `Kubernetes Controller manager by HTTP` collects metrics with the HTTP agent from the Controller manager /metrics endpoint.
## Requirements
Zabbix version: 7.0 and higher.
## Tested versions
This template has been tested on:
- Kubernetes Controller manager 1.19.10
## Configuration
> Zabbix should be configured according to the instructions in the [Templates out of the box](https://www.zabbix.com/documentation/7.0/manual/config/templates_out_of_the_box) section.
## Setup
Internal service metrics are collected from the /metrics endpoint.
The template requires authorization via an API token.
Don't forget to change the macros {$KUBE.CONTROLLER.SERVER.URL} and {$KUBE.API.TOKEN}.
Also, see the Macros section for a list of macros used to set trigger values.
*NOTE.* You might need to set the `--bind-address` option for the Controller Manager to an address where the Zabbix proxy can reach it.
For example, for clusters created with `kubeadm` it can be set in the following manifest file (changes will be applied immediately):
- /etc/kubernetes/manifests/kube-controller-manager.yaml
*NOTE.* Some metrics may not be collected depending on your Kubernetes Controller manager instance version and configuration.
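
Before linking the template, it can help to check by hand that the endpoint accepts the token. The following is a minimal sketch of the kind of request involved. It assumes the token is sent as a standard `Authorization: Bearer` header; the function name and the token value are hypothetical, and the URL is the default of {$KUBE.CONTROLLER.SERVER.URL}.

```python
import urllib.request


def build_metrics_request(url: str, token: str) -> urllib.request.Request:
    """Build a GET request for the /metrics endpoint with a bearer token.

    Assumption: the API token is passed as "Authorization: Bearer <token>".
    """
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Placeholder values: use the values you put into {$KUBE.CONTROLLER.SERVER.URL}
# and {$KUBE.API.TOKEN}.
request = build_metrics_request("https://localhost:10257/metrics", "example-token")
```

The request is only built here, not sent. Sending it with `urllib.request.urlopen(request, context=...)` additionally needs an `ssl.SSLContext` that trusts the cluster CA, which depends on your setup.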
### Macros used
|Name|Description|Default|
|----|-----------|-------|
|{$KUBE.CONTROLLER.SERVER.URL}|<p>Kubernetes Controller manager metrics endpoint URL.</p>|`https://localhost:10257/metrics`|
|{$KUBE.API.TOKEN}|<p>API Authorization Token</p>||
|{$KUBE.CONTROLLER.HTTP.CLIENT.ERROR}|<p>Maximum number of HTTP client request failures per second used for the trigger.</p>|`2`|
### Items
|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Kubernetes Controller: Get Controller metrics|<p>Get raw metrics from Controller instance /metrics endpoint.</p>|HTTP agent|kubernetes.controller.get_metrics<p>**Preprocessing**</p><ul><li><p>Check for not supported value</p><p>Custom on fail: Discard value</p></li></ul>|
|Kubernetes Controller Manager: Leader election status|<p>Gauge showing whether the reporting system is the master of the relevant lease: 0 indicates backup, 1 indicates master.</p>|Dependent item|kubernetes.controller.leader_election_master_status<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `VALUE(leader_election_master_status)`</p><p>Custom on fail: Discard value</p></li></ul>|
|Kubernetes Controller Manager: Virtual memory, bytes|<p>Virtual memory size in bytes.</p>|Dependent item|kubernetes.controller.process_virtual_memory_bytes<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `VALUE(process_virtual_memory_bytes)`</p><p>Custom on fail: Discard value</p></li></ul>|
|Kubernetes Controller Manager: Resident memory, bytes|<p>Resident memory size in bytes.</p>|Dependent item|kubernetes.controller.process_resident_memory_bytes<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `VALUE(process_resident_memory_bytes)`</p><p>Custom on fail: Discard value</p></li></ul>|
|Kubernetes Controller Manager: CPU|<p>Total user and system CPU usage ratio.</p>|Dependent item|kubernetes.controller.cpu.util<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `VALUE(process_cpu_seconds_total)`</p></li><li>Change per second</li><li><p>Custom multiplier: `100`</p></li></ul>|
|Kubernetes Controller Manager: Goroutines|<p>Number of goroutines that currently exist.</p>|Dependent item|kubernetes.controller.go_goroutines<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `SUM(go_goroutines)`</p><p>Custom on fail: Discard value</p></li></ul>|
|Kubernetes Controller Manager: Go threads|<p>Number of OS threads created.</p>|Dependent item|kubernetes.controller.go_threads<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `VALUE(go_threads)`</p><p>Custom on fail: Discard value</p></li></ul>|
|Kubernetes Controller Manager: Fds open|<p>Number of open file descriptors.</p>|Dependent item|kubernetes.controller.open_fds<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `VALUE(process_open_fds)`</p><p>Custom on fail: Discard value</p></li></ul>|
|Kubernetes Controller Manager: Fds max|<p>Maximum allowed open file descriptors.</p>|Dependent item|kubernetes.controller.max_fds<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `VALUE(process_max_fds)`</p><p>Custom on fail: Discard value</p></li></ul>|
|Kubernetes Controller Manager: REST Client requests: 2xx, rate|<p>Number of HTTP requests with 2xx status code per second.</p>|Dependent item|kubernetes.controller.client_http_requests_200.rate<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `SUM(rest_client_requests_total{code =~ "2.."})`</p><p>Custom on fail: Discard value</p></li><li>Change per second</li></ul>|
|Kubernetes Controller Manager: REST Client requests: 3xx, rate|<p>Number of HTTP requests with 3xx status code per second.</p>|Dependent item|kubernetes.controller.client_http_requests_300.rate<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `SUM(rest_client_requests_total{code =~ "3.."})`</p><p>Custom on fail: Discard value</p></li><li>Change per second</li></ul>|
|Kubernetes Controller Manager: REST Client requests: 4xx, rate|<p>Number of HTTP requests with 4xx status code per second.</p>|Dependent item|kubernetes.controller.client_http_requests_400.rate<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `SUM(rest_client_requests_total{code =~ "4.."})`</p><p>Custom on fail: Discard value</p></li><li>Change per second</li></ul>|
|Kubernetes Controller Manager: REST Client requests: 5xx, rate|<p>Number of HTTP requests with 5xx status code per second.</p>|Dependent item|kubernetes.controller.client_http_requests_500.rate<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `SUM(rest_client_requests_total{code =~ "5.."})`</p><p>Custom on fail: Discard value</p></li><li>Change per second</li></ul>|
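
The preprocessing chains in the table above combine a Prometheus pattern with "Change per second" and, for the CPU item, a custom multiplier. The sketch below imitates those steps in plain Python to show what the resulting numbers mean. It is not how Zabbix implements them: the parser handles only simple sample lines, `=~` is assumed to be a full match as in Prometheus, and the scrape data is invented.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric name, labels dict, float value) for each sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE.match(line)
        if match is None:
            continue
        labels = dict(LABEL.findall(match.group(2) or ""))
        yield match.group(1), labels, float(match.group(3))


def prometheus_sum(text, metric, label, pattern):
    """Imitate SUM(metric{label =~ "pattern"}) with full-match semantics."""
    return sum(
        value
        for name, labels, value in parse_samples(text)
        if name == metric and re.fullmatch(pattern, labels.get(label, ""))
    )


def prometheus_value(text, metric):
    """Imitate VALUE(metric): exactly one sample must match."""
    values = [value for name, _, value in parse_samples(text) if name == metric]
    if len(values) != 1:
        raise ValueError(f"expected one sample for {metric}, got {len(values)}")
    return values[0]


def change_per_second(prev_value, prev_time, value, time):
    """Imitate "Change per second": difference divided by elapsed seconds."""
    return (value - prev_value) / (time - prev_time)


# Invented scrape results, taken 60 seconds apart.
FIRST_SCRAPE = """\
# TYPE rest_client_requests_total counter
rest_client_requests_total{code="200",method="GET"} 1000
rest_client_requests_total{code="500",method="GET"} 4
rest_client_requests_total{code="503",method="PUT"} 6
process_cpu_seconds_total 120.5
"""
SECOND_SCRAPE = """\
# TYPE rest_client_requests_total counter
rest_client_requests_total{code="200",method="GET"} 1600
rest_client_requests_total{code="500",method="GET"} 10
rest_client_requests_total{code="503",method="PUT"} 30
process_cpu_seconds_total 126.5
"""

# 5xx item: SUM over all 5xx codes, then change per second.
rate_5xx = change_per_second(
    prometheus_sum(FIRST_SCRAPE, "rest_client_requests_total", "code", "5.."), 0.0,
    prometheus_sum(SECOND_SCRAPE, "rest_client_requests_total", "code", "5.."), 60.0,
)

# CPU item: VALUE, then change per second, then custom multiplier 100.
cpu_util = 100 * change_per_second(
    prometheus_value(FIRST_SCRAPE, "process_cpu_seconds_total"), 0.0,
    prometheus_value(SECOND_SCRAPE, "process_cpu_seconds_total"), 60.0,
)
```

Because the counters only ever grow, the raw values say little on their own. The per-second change is what the items store and what the trigger evaluates.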
### Triggers
|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
|Kubernetes Controller Manager: Too many HTTP client errors|<p>Kubernetes Controller manager is experiencing a high error rate (with 5xx HTTP codes).</p>|`min(/Kubernetes Controller manager by HTTP/kubernetes.controller.client_http_requests_500.rate,5m)>{$KUBE.CONTROLLER.HTTP.CLIENT.ERROR}`|Warning||
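
The trigger expression uses `min(...,5m)`, so a single spike does not raise a problem: the 5xx rate has to stay above the threshold for the whole window. A small sketch of that logic follows. The function name and sample values are made up for illustration, and treating an empty window as "not firing" is a simplification of this sketch.

```python
def trigger_fires(rates_last_5m, threshold=2.0):
    """Imitate min(item,5m) > threshold.

    The lowest rate seen in the window must exceed the threshold, which is
    the same as saying every sample in the window exceeds it.
    """
    if not rates_last_5m:
        return False  # simplification: no data, no problem raised
    return min(rates_last_5m) > threshold


# Invented 5xx rates (requests per second) collected over five minutes.
sustained_errors = [2.5, 3.0, 4.1, 2.2, 6.0]
short_spike = [0.0, 0.1, 9.0, 0.0, 0.2]

fires_on_sustained = trigger_fires(sustained_errors)
fires_on_spike = trigger_fires(short_spike)
```

The default threshold of `2.0` mirrors the default of {$KUBE.CONTROLLER.HTTP.CLIENT.ERROR}.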
### LLD rule Workqueue metrics discovery
|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Workqueue metrics discovery||Dependent item|kubernetes.controller.workqueue.discovery<p>**Preprocessing**</p><ul><li><p>Prometheus to JSON: `{__name__=~ "workqueue_*", name =~ ".*"}`</p></li><li><p>JavaScript: `The text is too long. Please see the template.`</p></li><li><p>Discard unchanged with heartbeat: `3h`</p></li></ul>|
### Item prototypes for Workqueue metrics discovery
|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue adds total, rate|<p>Total number of adds handled by workqueue per second.</p>|Dependent item|kubernetes.controller.workqueue_adds_total["{#NAME}"]<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `VALUE(workqueue_adds_total{name = "{#NAME}"})`</p><p>Custom on fail: Discard value</p></li><li>Change per second</li></ul>|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue depth|<p>Current depth of workqueue.</p>|Dependent item|kubernetes.controller.workqueue_depth["{#NAME}"]<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `VALUE(workqueue_depth{name = "{#NAME}"})`</p><p>Custom on fail: Discard value</p></li></ul>|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue unfinished work, sec|<p>How many seconds of work has been done that is in progress and hasn't yet been observed by work_duration. Large values indicate stuck threads. One can deduce the number of stuck threads by observing the rate at which this increases.</p>|Dependent item|kubernetes.controller.workqueue_unfinished_work_seconds["{#NAME}"]<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `VALUE(workqueue_unfinished_work_seconds{name = "{#NAME}"})`</p><p>Custom on fail: Discard value</p></li></ul>|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue retries, rate|<p>Total number of retries handled by workqueue per second.</p>|Dependent item|kubernetes.controller.workqueue_retries_total["{#NAME}"]<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `VALUE(workqueue_retries_total{name = "{#NAME}"})`</p><p>Custom on fail: Discard value</p></li><li>Change per second</li></ul>|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue longest running processor, sec|<p>How many seconds has the longest running processor for workqueue been running.</p>|Dependent item|kubernetes.controller.workqueue_longest_running_processor_seconds["{#NAME}"]<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `The text is too long. Please see the template.`</p><p>Custom on fail: Discard value</p></li></ul>|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue work duration, p90|<p>90th percentile of how long in seconds processing an item from workqueue takes, by queue.</p>|Calculated|kubernetes.controller.workqueue_work_duration_seconds_p90["{#NAME}"]|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue work duration, p95|<p>95th percentile of how long in seconds processing an item from workqueue takes, by queue.</p>|Calculated|kubernetes.controller.workqueue_work_duration_seconds_p95["{#NAME}"]|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue work duration, p99|<p>99th percentile of how long in seconds processing an item from workqueue takes, by queue.</p>|Calculated|kubernetes.controller.workqueue_work_duration_seconds_p99["{#NAME}"]|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue work duration, 50p|<p>50th percentile of how long in seconds processing an item from workqueue takes, by queue.</p>|Calculated|kubernetes.controller.workqueue_work_duration_seconds_p50["{#NAME}"]|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue queue duration, p90|<p>90th percentile of how long in seconds an item stays in workqueue before being requested, by queue.</p>|Calculated|kubernetes.controller.workqueue_queue_duration_seconds_p90["{#NAME}"]|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue queue duration, p95|<p>95th percentile of how long in seconds an item stays in workqueue before being requested, by queue.</p>|Calculated|kubernetes.controller.workqueue_queue_duration_seconds_p95["{#NAME}"]|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue queue duration, p99|<p>99th percentile of how long in seconds an item stays in workqueue before being requested, by queue.</p>|Calculated|kubernetes.controller.workqueue_queue_duration_seconds_p99["{#NAME}"]|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue queue duration, 50p|<p>50th percentile of how long in seconds an item stays in workqueue before being requested. If there are no requests for 5 minutes, the item value will be discarded.</p>|Calculated|kubernetes.controller.workqueue_queue_duration_seconds_p50["{#NAME}"]<p>**Preprocessing**</p><ul><li><p>Check for not supported value</p><p>Custom on fail: Discard value</p></li></ul>|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue duration seconds bucket, {#LE}|<p>How long in seconds processing an item from workqueue takes.</p>|Dependent item|kubernetes.controller.duration_seconds_bucket[{#LE},"{#NAME}"]<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `The text is too long. Please see the template.`</p><p>Custom on fail: Discard value</p></li></ul>|
|Kubernetes Controller Manager: ["{#NAME}"]: Queue duration seconds bucket, {#LE}|<p>How long in seconds an item stays in workqueue before being requested.</p>|Dependent item|kubernetes.controller.queue_duration_seconds_bucket[{#LE},"{#NAME}"]<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `The text is too long. Please see the template.`</p><p>Custom on fail: Discard value</p></li></ul>|
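
The percentile items above are calculated from the `{#LE}` bucket items. The exact formulas live in the template and are not reproduced in this README. As a rough illustration of the idea only, the sketch below estimates a percentile from cumulative histogram buckets by linear interpolation inside the bucket that contains the target rank, which is the usual approach for Prometheus-style histograms. The function name and the bucket data are invented.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets: iterable of (upper bound "le", cumulative count) pairs and must
    include the +Inf bucket. Values are assumed to be spread evenly inside
    each bucket, with the first bucket starting at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in ordered:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls into the +Inf bucket: report the highest finite bound.
                return lower_bound
            share = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * share
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Invented cumulative counts for one workqueue over a 5 minute window.
work_duration_buckets = [
    (0.001, 10),
    (0.01, 50),
    (0.1, 90),
    (1.0, 100),
    (math.inf, 100),
]

p50 = estimate_quantile(0.50, work_duration_buckets)
p95 = estimate_quantile(0.95, work_duration_buckets)
p99 = estimate_quantile(0.99, work_duration_buckets)
```

The estimate can never be more precise than the bucket boundaries, so percentile values should be read as approximations.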
## Feedback
Please report any issues with the template at [`https://support.zabbix.com`](https://support.zabbix.com).
You can also provide feedback, discuss the template, or ask for help at [`ZABBIX forums`](https://www.zabbix.com/forum/zabbix-suggestions-and-feedback).