# Kubernetes Controller manager by HTTP

## Overview

This template monitors the Kubernetes Controller manager with Zabbix and works without any external scripts.
Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

The template `Kubernetes Controller manager by HTTP` collects metrics with the HTTP agent from the Controller manager's /metrics endpoint.

## Requirements

Zabbix version: 7.0 and higher.

## Tested versions

This template has been tested on:

- Kubernetes Controller manager 1.19.10

## Configuration

> Zabbix should be configured according to the instructions in the [Templates out of the box](https://www.zabbix.com/documentation/7.0/manual/config/templates_out_of_the_box) section.

## Setup

Internal service metrics are collected from the /metrics endpoint.
The template requires authorization via an API token.

Don't forget to change the macros {$KUBE.CONTROLLER.SERVER.URL} and {$KUBE.API.TOKEN}.
Also, see the Macros section for a list of macros used to set trigger values.
A quick way to verify both values before assigning them is sketched right after the `Macros used` table below.

*NOTE.* You might need to set the `--bind-address` option for Controller Manager to an address where the Zabbix proxy can reach it.
For example, for clusters created with `kubeadm` it can be set in the following manifest file (changes will be applied immediately):

- /etc/kubernetes/manifests/kube-controller-manager.yaml

*NOTE.* Some metrics may not be collected depending on your Kubernetes Controller manager instance version and configuration.

### Macros used

|Name|Description|Default|
|----|-----------|-------|
|{$KUBE.CONTROLLER.SERVER.URL}|Kubernetes Controller manager metrics endpoint URL.|`https://localhost:10257/metrics`|
|{$KUBE.API.TOKEN}|API Authorization Token.||
|{$KUBE.CONTROLLER.HTTP.CLIENT.ERROR}|Maximum number of HTTP client request failures used for the trigger.|`2`|
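Before wiring the values into the template, it can help to confirm that the endpoint URL and the token actually return metrics. Below is a minimal Python sketch of such a check, assuming the token belongs to a service account that is allowed to read the controller manager's /metrics endpoint and that the endpoint is reachable from the host running the check; the URL and token values are placeholders, and TLS verification is skipped only because the controller manager usually serves a self-signed certificate.

```python
# Minimal connectivity check for the values intended for
# {$KUBE.CONTROLLER.SERVER.URL} and {$KUBE.API.TOKEN}.
# Assumptions: the token belongs to a service account allowed to read
# /metrics, and this host can reach the controller manager.
import ssl
import urllib.request

SERVER_URL = "https://localhost:10257/metrics"   # {$KUBE.CONTROLLER.SERVER.URL}
API_TOKEN = "<service-account-token>"            # {$KUBE.API.TOKEN} (placeholder)

# The controller manager typically serves a self-signed certificate,
# so verification is disabled for this one-off check only.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

req = urllib.request.Request(
    SERVER_URL,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
with urllib.request.urlopen(req, context=ctx, timeout=10) as resp:
    body = resp.read().decode()
    print(resp.status)             # expect 200
    print(body.splitlines()[0])    # e.g. a "# HELP ..." metrics line
```

If this prints `200` followed by Prometheus-formatted text, the same values can be used for the macros.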
### Items

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Kubernetes Controller: Get Controller metrics|Get raw metrics from Controller instance /metrics endpoint.|HTTP agent|kubernetes.controller.get_metrics<p>**Preprocessing**</p><p>Check for not supported value</p><p>⛔️Custom on fail: Discard value</p>|
||Gauge of whether the reporting system is master of the relevant lease: 0 indicates backup, 1 indicates master.|Dependent item|kubernetes.controller.leader_election_master_status<p>**Preprocessing**</p><p>Prometheus pattern: `VALUE(leader_election_master_status)`</p><p>⛔️Custom on fail: Discard value</p>|
||Virtual memory size in bytes.|Dependent item|kubernetes.controller.process_virtual_memory_bytes<p>**Preprocessing**</p><p>Prometheus pattern: `VALUE(process_virtual_memory_bytes)`</p><p>⛔️Custom on fail: Discard value</p>|
||Resident memory size in bytes.|Dependent item|kubernetes.controller.process_resident_memory_bytes<p>**Preprocessing**</p><p>Prometheus pattern: `VALUE(process_resident_memory_bytes)`</p><p>⛔️Custom on fail: Discard value</p>|
||Total user and system CPU usage ratio.|Dependent item|kubernetes.controller.cpu.util<p>**Preprocessing**</p><p>Prometheus pattern: `VALUE(process_cpu_seconds_total)`</p><p>Custom multiplier: `100`</p>|
||Number of goroutines that currently exist.|Dependent item|kubernetes.controller.go_goroutines<p>**Preprocessing**</p><p>Prometheus pattern: `SUM(go_goroutines)`</p><p>⛔️Custom on fail: Discard value</p>|
||Number of OS threads created.|Dependent item|kubernetes.controller.go_threads<p>**Preprocessing**</p><p>Prometheus pattern: `VALUE(go_threads)`</p><p>⛔️Custom on fail: Discard value</p>|
||Number of open file descriptors.|Dependent item|kubernetes.controller.open_fds<p>**Preprocessing**</p><p>Prometheus pattern: `VALUE(process_open_fds)`</p><p>⛔️Custom on fail: Discard value</p>|
||Maximum allowed open file descriptors.|Dependent item|kubernetes.controller.max_fds<p>**Preprocessing**</p><p>Prometheus pattern: `VALUE(process_max_fds)`</p><p>⛔️Custom on fail: Discard value</p>|
||Number of HTTP requests with 2xx status code per second.|Dependent item|kubernetes.controller.client_http_requests_200.rate<p>**Preprocessing**</p><p>Prometheus pattern: `SUM(rest_client_requests_total{code =~ "2.."})`</p><p>⛔️Custom on fail: Discard value</p>|
||Number of HTTP requests with 3xx status code per second.|Dependent item|kubernetes.controller.client_http_requests_300.rate<p>**Preprocessing**</p><p>Prometheus pattern: `SUM(rest_client_requests_total{code =~ "3.."})`</p><p>⛔️Custom on fail: Discard value</p>|
||Number of HTTP requests with 4xx status code per second.|Dependent item|kubernetes.controller.client_http_requests_400.rate<p>**Preprocessing**</p><p>Prometheus pattern: `SUM(rest_client_requests_total{code =~ "4.."})`</p><p>⛔️Custom on fail: Discard value</p>|
||Number of HTTP requests with 5xx status code per second.|Dependent item|kubernetes.controller.client_http_requests_500.rate<p>**Preprocessing**</p><p>Prometheus pattern: `SUM(rest_client_requests_total{code =~ "5.."})`</p><p>⛔️Custom on fail: Discard value</p>|
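The `*.rate` items above are derived from the `rest_client_requests_total` counter: series whose `code` label matches a status class are summed, and the counter is then expressed per second. The following sketch mirrors that aggregation outside Zabbix for sanity-checking item values; the metric and label names come from the preprocessing patterns above, while the text parsing and the manual rate step are simplified stand-ins for Zabbix's built-in Prometheus preprocessing.

```python
# Sum rest_client_requests_total by HTTP status class, mirroring the
# `SUM(rest_client_requests_total{code =~ "2.."})` style patterns above.
# Simplified text-format parsing for illustration only.
import re
from collections import defaultdict

SAMPLE_RE = re.compile(r'^rest_client_requests_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
CODE_RE = re.compile(r'code="(\d)\d\d"')

def sum_by_code_class(metrics_text: str) -> dict:
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        m = SAMPLE_RE.match(line)
        if not m:
            continue
        code = CODE_RE.search(m.group("labels"))
        if code:
            totals[f"{code.group(1)}xx"] += float(m.group("value"))
    return dict(totals)

# Two snapshots taken one polling interval apart turn the counters into
# the per-second values that the *.rate items report.
prev = {"2xx": 1000.0, "5xx": 4.0}    # invented example numbers
curr = {"2xx": 1060.0, "5xx": 10.0}
interval = 60                          # seconds between polls
rates = {k: (curr[k] - prev.get(k, 0.0)) / interval for k in curr}
print(rates)                           # {'2xx': 1.0, '5xx': 0.1}
```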
### Triggers

|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
||Kubernetes Controller manager is experiencing high error rate (with 5xx HTTP code).|`min(/Kubernetes Controller manager by HTTP/kubernetes.controller.client_http_requests_500.rate,5m)>{$KUBE.CONTROLLER.HTTP.CLIENT.ERROR}`|Warning||

### LLD rule Workqueue metrics discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Workqueue metrics discovery||Dependent item|kubernetes.controller.workqueue.discovery<p>**Preprocessing**</p><p>Prometheus to JSON: `{__name__=~ "workqueue_*", name =~ ".*"}`</p><p>JavaScript: `The text is too long. Please see the template.`</p><p>Discard unchanged with heartbeat: `3h`</p>|
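The discovery rule converts every `workqueue_*` sample to JSON and then a JavaScript step (too long to quote here, per the template) reduces the result to the distinct queue names that become the `{#NAME}` macro. The sketch below is a rough Python equivalent of that reduction, assuming raw Prometheus text as input; the sample lines are invented for illustration.

```python
# Rough equivalent of the discovery logic: collect the distinct workqueue
# `name` label values and emit them as Zabbix LLD rows ({#NAME}).
# The template itself does this with "Prometheus to JSON" plus a JavaScript
# step; this only illustrates the end result.
import json
import re

WORKQUEUE_RE = re.compile(r'^workqueue_\w+\{(?P<labels>[^}]*)\}')
NAME_RE = re.compile(r'name="([^"]+)"')

def discover_workqueues(metrics_text: str) -> str:
    names = set()
    for line in metrics_text.splitlines():
        m = WORKQUEUE_RE.match(line)
        if m:
            n = NAME_RE.search(m.group("labels"))
            if n:
                names.add(n.group(1))
    return json.dumps([{"{#NAME}": name} for name in sorted(names)])

sample = (
    'workqueue_depth{name="deployment"} 0\n'
    'workqueue_depth{name="replicaset"} 2\n'
    'workqueue_adds_total{name="deployment"} 153\n'
)
print(discover_workqueues(sample))
# [{"{#NAME}": "deployment"}, {"{#NAME}": "replicaset"}]
```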
### Item prototypes for Workqueue metrics discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
||Total number of adds handled by workqueue per second.|Dependent item|kubernetes.controller.workqueue_adds_total["{#NAME}"]<p>**Preprocessing**</p><p>Prometheus pattern: `VALUE(workqueue_adds_total{name = "{#NAME}"})`</p><p>⛔️Custom on fail: Discard value</p>|
||Current depth of workqueue.|Dependent item|kubernetes.controller.workqueue_depth["{#NAME}"]<p>**Preprocessing**</p><p>Prometheus pattern: `VALUE(workqueue_depth{name = "{#NAME}"})`</p><p>⛔️Custom on fail: Discard value</p>|
||How many seconds of work has been done that is in progress and hasn't been observed by work_duration. Large values indicate stuck threads. One can deduce the number of stuck threads by observing the rate at which this increases.|Dependent item|kubernetes.controller.workqueue_unfinished_work_seconds["{#NAME}"]<p>**Preprocessing**</p><p>Prometheus pattern: `VALUE(workqueue_unfinished_work_seconds{name = "{#NAME}"})`</p><p>⛔️Custom on fail: Discard value</p>|
||Total number of retries handled by workqueue per second.|Dependent item|kubernetes.controller.workqueue_retries_total["{#NAME}"]<p>**Preprocessing**</p><p>Prometheus pattern: `VALUE(workqueue_retries_total{name = "{#NAME}"})`</p><p>⛔️Custom on fail: Discard value</p>|
||How many seconds has the longest running processor for workqueue been running.|Dependent item|kubernetes.controller.workqueue_longest_running_processor_seconds["{#NAME}"]<p>**Preprocessing**</p><p>Prometheus pattern: `The text is too long. Please see the template.`</p><p>⛔️Custom on fail: Discard value</p>|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue work duration, p90|90 percentile of how long in seconds processing an item from workqueue takes, by queue.|Calculated|kubernetes.controller.workqueue_work_duration_seconds_p90["{#NAME}"]|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue work duration, p95|95 percentile of how long in seconds processing an item from workqueue takes, by queue.|Calculated|kubernetes.controller.workqueue_work_duration_seconds_p95["{#NAME}"]|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue work duration, p99|99 percentile of how long in seconds processing an item from workqueue takes, by queue.|Calculated|kubernetes.controller.workqueue_work_duration_seconds_p99["{#NAME}"]|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue work duration, 50p|50 percentile of how long in seconds processing an item from workqueue takes, by queue.|Calculated|kubernetes.controller.workqueue_work_duration_seconds_p50["{#NAME}"]|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue queue duration, p90|90 percentile of how long in seconds an item stays in workqueue before being requested, by queue.|Calculated|kubernetes.controller.workqueue_queue_duration_seconds_p90["{#NAME}"]|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue queue duration, p95|95 percentile of how long in seconds an item stays in workqueue before being requested, by queue.|Calculated|kubernetes.controller.workqueue_queue_duration_seconds_p95["{#NAME}"]|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue queue duration, p99|99 percentile of how long in seconds an item stays in workqueue before being requested, by queue.|Calculated|kubernetes.controller.workqueue_queue_duration_seconds_p99["{#NAME}"]|
|Kubernetes Controller Manager: ["{#NAME}"]: Workqueue queue duration, 50p|50 percentile of how long in seconds an item stays in workqueue before being requested. If there are no requests for 5 minutes, the item value will be discarded.|Calculated|kubernetes.controller.workqueue_queue_duration_seconds_p50["{#NAME}"]<p>**Preprocessing**</p><p>Check for not supported value</p><p>⛔️Custom on fail: Discard value</p>|
||How long in seconds processing an item from workqueue takes.|Dependent item|kubernetes.controller.duration_seconds_bucket[{#LE},"{#NAME}"]<p>**Preprocessing**</p><p>Prometheus pattern: `The text is too long. Please see the template.`</p><p>⛔️Custom on fail: Discard value</p>|
||How long in seconds an item stays in workqueue before being requested.|Dependent item|kubernetes.controller.queue_duration_seconds_bucket[{#LE},"{#NAME}"]<p>**Preprocessing**</p><p>Prometheus pattern: `The text is too long. Please see the template.`</p><p>⛔️Custom on fail: Discard value</p>|
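The p50/p90/p95/p99 Calculated item prototypes derive their values from the `*_bucket` dependent items above. The sketch below illustrates the underlying idea, linear interpolation within cumulative histogram buckets as in Prometheus `histogram_quantile`; the bucket bounds and counts are invented example numbers, and the exact formulas used by the template live in its calculated items.

```python
# Percentile estimation from cumulative histogram buckets, the idea behind
# the Calculated p50/p90/p95/p99 items that consume the *_bucket values above.
# Bucket bounds and counts below are invented example data.
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: (upper_bound, cumulative_count) pairs, sorted, last bound +Inf."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound          # rank falls in the +Inf bucket
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Example workqueue_work_duration_seconds distribution.
work_duration = [
    (0.001, 40.0), (0.01, 120.0), (0.1, 180.0),
    (1.0, 198.0), (10.0, 200.0), (float("inf"), 200.0),
]
for q in (0.50, 0.90, 0.95, 0.99):
    print(f"p{round(q * 100)}: {histogram_quantile(q, work_duration):.4f} s")
```

Running it prints the four estimated percentiles for the example distribution.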