Kubernetes Controller manager by HTTP

Overview

This template monitors Kubernetes Controller manager with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

The template Kubernetes Controller manager by HTTP collects metrics with the HTTP agent from the Controller manager /metrics endpoint.

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

Kubernetes Controller manager 1.19.10

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

Internal service metrics are collected from the /metrics endpoint. The template requires authorization via an API token.

Don't forget to change the macros {$KUBE.CONTROLLER.SERVER.URL} and {$KUBE.API.TOKEN}. Also, see the Macros section for a list of macros used to set trigger values.

NOTE. You might need to set the --bind-address option for Controller Manager to an address where Zabbix proxy can reach it. For example, for clusters created with kubeadm it can be set in the following manifest file (changes will be applied immediately):

/etc/kubernetes/manifests/kube-controller-manager.yaml

NOTE. Some metrics may not be collected depending on your Kubernetes Controller manager instance version and configuration.
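
To check the endpoint and token outside Zabbix, the request made by the HTTP agent item can be approximated with a short Python sketch. It assumes the token is sent as a Bearer token in the Authorization header (the usual Kubernetes convention). The function names `build_metrics_request` and `fetch_metrics` are hypothetical helpers, not part of the template.

```python
import ssl
import urllib.request
from typing import Optional


def build_metrics_request(url: str, token: str) -> urllib.request.Request:
    """Build a GET request that carries the API token as a Bearer header.

    Assumption: the endpoint accepts "Authorization: Bearer <token>".
    """
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def fetch_metrics(
    url: str,
    token: str,
    timeout: float = 5.0,
    context: Optional[ssl.SSLContext] = None,
) -> str:
    """Fetch the raw Prometheus text exposed by the /metrics endpoint."""
    request = build_metrics_request(url, token)
    with urllib.request.urlopen(request, timeout=timeout, context=context) as resp:
        return resp.read().decode("utf-8")
```

For example, `fetch_metrics("https://localhost:10257/metrics", token)` uses the default value of {$KUBE.CONTROLLER.SERVER.URL}. Pass an `ssl.SSLContext` that trusts the cluster CA if the endpoint uses a certificate your system does not trust.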

Macros used

Name	Description	Default
{$KUBE.CONTROLLER.SERVER.URL}	Kubernetes Controller manager metrics endpoint URL.	`https://localhost:10257/metrics`
{$KUBE.API.TOKEN}	API Authorization Token
{$KUBE.CONTROLLER.HTTP.CLIENT.ERROR}	Maximum number of HTTP client request failures used for the trigger.	`2`

Items

Name	Description	Type	Key and additional info
Kubernetes Controller: Get Controller metrics	Get raw metrics from Controller instance /metrics endpoint.	HTTP agent	kubernetes.controller.get_metrics Preprocessing Check for not supported value ⛔️Custom on fail: Discard value
Kubernetes Controller Manager: Leader election status	Gauge showing whether the reporting system is master of the relevant lease: 0 indicates backup, 1 indicates master.	Dependent item	kubernetes.controller.leader_election_master_status Preprocessing Prometheus pattern: `VALUE(leader_election_master_status)` ⛔️Custom on fail: Discard value
Kubernetes Controller Manager: Virtual memory, bytes	Virtual memory size in bytes.	Dependent item	kubernetes.controller.process_virtual_memory_bytes Preprocessing Prometheus pattern: `VALUE(process_virtual_memory_bytes)` ⛔️Custom on fail: Discard value
Kubernetes Controller Manager: Resident memory, bytes	Resident memory size in bytes.	Dependent item	kubernetes.controller.process_resident_memory_bytes Preprocessing Prometheus pattern: `VALUE(process_resident_memory_bytes)` ⛔️Custom on fail: Discard value
Kubernetes Controller Manager: CPU	Total user and system CPU usage ratio.	Dependent item	kubernetes.controller.cpu.util Preprocessing Prometheus pattern: `VALUE(process_cpu_seconds_total)` Change per second Custom multiplier: `100`
Kubernetes Controller Manager: Goroutines	Number of goroutines that currently exist.	Dependent item	kubernetes.controller.go_goroutines Preprocessing Prometheus pattern: `SUM(go_goroutines)` ⛔️Custom on fail: Discard value
Kubernetes Controller Manager: Go threads	Number of OS threads created.	Dependent item	kubernetes.controller.go_threads Preprocessing Prometheus pattern: `VALUE(go_threads)` ⛔️Custom on fail: Discard value
Kubernetes Controller Manager: Fds open	Number of open file descriptors.	Dependent item	kubernetes.controller.open_fds Preprocessing Prometheus pattern: `VALUE(process_open_fds)` ⛔️Custom on fail: Discard value
Kubernetes Controller Manager: Fds max	Maximum allowed open file descriptors.	Dependent item	kubernetes.controller.max_fds Preprocessing Prometheus pattern: `VALUE(process_max_fds)` ⛔️Custom on fail: Discard value
Kubernetes Controller Manager: REST Client requests: 2xx, rate	Number of HTTP requests with 2xx status code per second.	Dependent item	kubernetes.controller.client_http_requests_200.rate Preprocessing Prometheus pattern: `SUM(rest_client_requests_total{code =~ "2.."})` ⛔️Custom on fail: Discard value Change per second
Kubernetes Controller Manager: REST Client requests: 3xx, rate	Number of HTTP requests with 3xx status code per second.	Dependent item	kubernetes.controller.client_http_requests_300.rate Preprocessing Prometheus pattern: `SUM(rest_client_requests_total{code =~ "3.."})` ⛔️Custom on fail: Discard value Change per second
Kubernetes Controller Manager: REST Client requests: 4xx, rate	Number of HTTP requests with 4xx status code per second.	Dependent item	kubernetes.controller.client_http_requests_400.rate Preprocessing Prometheus pattern: `SUM(rest_client_requests_total{code =~ "4.."})` ⛔️Custom on fail: Discard value Change per second
Kubernetes Controller Manager: REST Client requests: 5xx, rate	Number of HTTP requests with 5xx status code per second.	Dependent item	kubernetes.controller.client_http_requests_500.rate Preprocessing Prometheus pattern: `SUM(rest_client_requests_total{code =~ "5.."})` ⛔️Custom on fail: Discard value Change per second
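
The dependent items above combine a Prometheus pattern with the "Change per second" step. As a rough illustration of what a pattern such as `SUM(rest_client_requests_total{code =~ "5.."})` followed by "Change per second" computes, here is a minimal Python sketch. It is not the Zabbix implementation: the parser handles only simple exposition lines, and the function names `sum_metric` and `change_per_second` are hypothetical.

```python
import re
from typing import Optional

# Matches simple sample lines such as:
#   rest_client_requests_total{code="200",host="x"} 42
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def sum_metric(text: str, metric: str, label: str, pattern: str) -> float:
    """Sum all samples of `metric` whose `label` value fully matches `pattern`."""
    total = 0.0
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
        # Prometheus-style label regexes are anchored, hence fullmatch.
        if re.fullmatch(pattern, labels.get(label, "")):
            total += float(match.group("value"))
    return total


def change_per_second(
    prev_value: float, prev_ts: float, value: float, ts: float
) -> Optional[float]:
    """(value - prev_value) / (ts - prev_ts); None if the counter decreased."""
    if value < prev_value or ts <= prev_ts:
        return None  # modelled here as "no value stored"
    return (value - prev_value) / (ts - prev_ts)
```

The CPU item follows the same idea: the per-second change of `process_cpu_seconds_total` multiplied by 100 gives CPU usage in percent.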

Triggers

Name	Description	Expression	Severity	Dependencies and additional info
Kubernetes Controller Manager: Too many HTTP client errors	Kubernetes Controller manager is experiencing a high error rate (with 5xx HTTP code).	`min(/Kubernetes Controller manager by HTTP/kubernetes.controller.client_http_requests_500.rate,5m)>{$KUBE.CONTROLLER.HTTP.CLIENT.ERROR}`	Warning
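
The trigger expression uses `min(...,5m)`, so it fires only when every 5xx rate value in the last five minutes is above the threshold; a single spike is not enough. A minimal Python sketch of that logic follows. The function name `too_many_client_errors` is hypothetical, and treating an empty window as "not firing" is an assumption made for this example.

```python
from typing import Sequence


def too_many_client_errors(rates_last_5m: Sequence[float], threshold: float = 2.0) -> bool:
    """True when even the lowest 5xx rate in the window exceeds the threshold.

    `threshold` defaults to the default of {$KUBE.CONTROLLER.HTTP.CLIENT.ERROR}.
    """
    if not rates_last_5m:
        return False  # assumption: no data means the condition is not met
    return min(rates_last_5m) > threshold
```

With the default threshold, the values `[3.0, 2.5, 4.0]` fire the trigger, while `[3.0, 1.0, 4.0]` do not.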

LLD rule Workqueue metrics discovery

Name	Description	Type	Key and additional info
Workqueue metrics discovery		Dependent item	kubernetes.controller.workqueue.discovery Preprocessing Prometheus to JSON: `{__name__=~ "workqueue_*", name =~ ".*"}` JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `3h`
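
The JavaScript step is not reproduced in this README. As a sketch of the general idea only, the following Python function turns "Prometheus to JSON" output into low-level discovery rows, one per distinct workqueue name. It assumes each JSON entry carries a `labels` object with a `name` key and that the rule emits a {#NAME} macro; the function name `workqueue_lld_rows` is hypothetical, and the actual template script may do more.

```python
import json


def workqueue_lld_rows(prom_json: str) -> list:
    """Return one {"{#NAME}": ...} row per distinct workqueue name, in input order."""
    seen = set()
    rows = []
    for sample in json.loads(prom_json):
        name = sample.get("labels", {}).get("name")
        if name and name not in seen:
            seen.add(name)
            rows.append({"{#NAME}": name})
    return rows
```

Each resulting row is what lets the item prototypes substitute {#NAME} into keys such as `kubernetes.controller.workqueue_depth["{#NAME}"]`.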

Item prototypes for Workqueue metrics discovery

Name	Description	Type	Key and additional info
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue adds total, rate	Total number of adds handled by workqueue per second.	Dependent item	kubernetes.controller.workqueue_adds_total["{#NAME}"] Preprocessing Prometheus pattern: `VALUE(workqueue_adds_total{name = "{#NAME}"})` ⛔️Custom on fail: Discard value Change per second
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue depth	Current depth of workqueue.	Dependent item	kubernetes.controller.workqueue_depth["{#NAME}"] Preprocessing Prometheus pattern: `VALUE(workqueue_depth{name = "{#NAME}"})` ⛔️Custom on fail: Discard value
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue unfinished work, sec	How many seconds of work has been done that is in progress and hasn't been observed by work_duration. Large values indicate stuck threads. One can deduce the number of stuck threads by observing the rate at which this increases.	Dependent item	kubernetes.controller.workqueue_unfinished_work_seconds["{#NAME}"] Preprocessing Prometheus pattern: `VALUE(workqueue_unfinished_work_seconds{name = "{#NAME}"})` ⛔️Custom on fail: Discard value
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue retries, rate	Total number of retries handled by workqueue per second.	Dependent item	kubernetes.controller.workqueue_retries_total["{#NAME}"] Preprocessing Prometheus pattern: `VALUE(workqueue_retries_total{name = "{#NAME}"})` ⛔️Custom on fail: Discard value Change per second
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue longest running processor, sec	How many seconds has the longest running processor for workqueue been running.	Dependent item	kubernetes.controller.workqueue_longest_running_processor_seconds["{#NAME}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.` ⛔️Custom on fail: Discard value
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue work duration, p90	90th percentile of how long in seconds processing an item from workqueue takes, by queue.	Calculated	kubernetes.controller.workqueue_work_duration_seconds_p90["{#NAME}"]
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue work duration, p95	95th percentile of how long in seconds processing an item from workqueue takes, by queue.	Calculated	kubernetes.controller.workqueue_work_duration_seconds_p95["{#NAME}"]
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue work duration, p99	99th percentile of how long in seconds processing an item from workqueue takes, by queue.	Calculated	kubernetes.controller.workqueue_work_duration_seconds_p99["{#NAME}"]
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue work duration, 50p	50th percentile of how long in seconds processing an item from workqueue takes, by queue.	Calculated	kubernetes.controller.workqueue_work_duration_seconds_p50["{#NAME}"]
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue queue duration, p90	90th percentile of how long in seconds an item stays in workqueue before being requested, by queue.	Calculated	kubernetes.controller.workqueue_queue_duration_seconds_p90["{#NAME}"]
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue queue duration, p95	95th percentile of how long in seconds an item stays in workqueue before being requested, by queue.	Calculated	kubernetes.controller.workqueue_queue_duration_seconds_p95["{#NAME}"]
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue queue duration, p99	99th percentile of how long in seconds an item stays in workqueue before being requested, by queue.	Calculated	kubernetes.controller.workqueue_queue_duration_seconds_p99["{#NAME}"]
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue queue duration, 50p	50th percentile of how long in seconds an item stays in workqueue before being requested. If there are no requests for 5 minutes, the item value will be discarded.	Calculated	kubernetes.controller.workqueue_queue_duration_seconds_p50["{#NAME}"] Preprocessing Check for not supported value ⛔️Custom on fail: Discard value
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue duration seconds bucket, {#LE}	How long in seconds processing an item from workqueue takes.	Dependent item	kubernetes.controller.duration_seconds_bucket[{#LE},"{#NAME}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.` ⛔️Custom on fail: Discard value
Kubernetes Controller Manager: ["{#NAME}"]: Queue duration seconds bucket, {#LE}	How long in seconds an item stays in workqueue before being requested.	Dependent item	kubernetes.controller.queue_duration_seconds_bucket[{#LE},"{#NAME}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.` ⛔️Custom on fail: Discard value
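
The formulas behind the Calculated percentile items are not shown in this README. Assuming they follow the usual Prometheus-style estimate over cumulative histogram buckets (the {#LE} bucket items), the calculation can be sketched in Python as below. The function name `histogram_quantile` and the input shape (a list of `(le, cumulative_count)` pairs ending with `+Inf`) are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Uses linear interpolation inside the bucket that contains the target rank.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with le=+Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Rank falls into the +Inf bucket: report the highest finite bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    upper, count = buckets[idx]
    lower, below = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    return lower + (upper - lower) * (rank - below) / (count - below)
```

For buckets `[(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]`, the median lands in the 0.1 to 0.5 bucket and is interpolated to roughly 0.42 seconds.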

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help at the ZABBIX forums.
