Kubernetes Controller manager by HTTP

Overview

This template monitors Kubernetes Controller manager with Zabbix and works without any external scripts. Most of the metrics are collected in a single request, thanks to Zabbix bulk data collection.

The template Kubernetes Controller manager by HTTP collects metrics with an HTTP agent item from the Controller manager /metrics endpoint.

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

  • Kubernetes Controller manager 1.19.10

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

Internal service metrics are collected from the /metrics endpoint. The template requires authorization via an API token.

Don't forget to change the macros {$KUBE.CONTROLLER.SERVER.URL} and {$KUBE.API.TOKEN}. Also, see the Macros section for a list of macros used to set trigger values.
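To check the endpoint and token by hand before linking the template, a request like the one the HTTP agent item makes can be sketched in Python. This is a minimal sketch, assuming the token is sent as a Bearer token in the Authorization header; the function names and the verify_tls switch are illustrative and not part of the template.

```python
import ssl
import urllib.request


def build_metrics_request(url: str, token: str) -> urllib.request.Request:
    """Build a GET request that carries the API token as a Bearer header.

    Assumption: the endpoint accepts "Authorization: Bearer <token>".
    """
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_metrics(url: str, token: str, verify_tls: bool = True, timeout: float = 10.0) -> str:
    """Return the raw text served by the /metrics endpoint."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Only for quick tests against self-signed cluster certificates.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    request = build_metrics_request(url, token)
    with urllib.request.urlopen(request, context=context, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling fetch_metrics with the values intended for {$KUBE.CONTROLLER.SERVER.URL} and {$KUBE.API.TOKEN} should return Prometheus-format text if the address and token are valid.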

NOTE. You might need to set the --bind-address option for Controller Manager to an address where Zabbix proxy can reach it. For example, for clusters created with kubeadm it can be set in the following manifest file (changes will be applied immediately):

  • /etc/kubernetes/manifests/kube-controller-manager.yaml

NOTE. Some metrics may not be collected depending on your Kubernetes Controller manager instance version and configuration.

Macros used

Name Description Default
{$KUBE.CONTROLLER.SERVER.URL}

Kubernetes Controller manager metrics endpoint URL.

https://localhost:10257/metrics
{$KUBE.API.TOKEN}

API Authorization Token

{$KUBE.CONTROLLER.HTTP.CLIENT.ERROR}

Maximum number of HTTP client request failures used for the trigger.

2

Items

Name Description Type Key and additional info
Kubernetes Controller: Get Controller metrics

Get raw metrics from Controller instance /metrics endpoint.

HTTP agent kubernetes.controller.get_metrics

Preprocessing

  • Check for not supported value

    Custom on fail: Discard value

Kubernetes Controller Manager: Leader election status

Gauge showing whether the reporting system is master of the relevant lease: 0 indicates backup, 1 indicates master.

Dependent item kubernetes.controller.leader_election_master_status

Preprocessing

  • Prometheus pattern: VALUE(leader_election_master_status)

    Custom on fail: Discard value

Kubernetes Controller Manager: Virtual memory, bytes

Virtual memory size in bytes.

Dependent item kubernetes.controller.process_virtual_memory_bytes

Preprocessing

  • Prometheus pattern: VALUE(process_virtual_memory_bytes)

    Custom on fail: Discard value

Kubernetes Controller Manager: Resident memory, bytes

Resident memory size in bytes.

Dependent item kubernetes.controller.process_resident_memory_bytes

Preprocessing

  • Prometheus pattern: VALUE(process_resident_memory_bytes)

    Custom on fail: Discard value

Kubernetes Controller Manager: CPU

Total user and system CPU usage ratio.

Dependent item kubernetes.controller.cpu.util

Preprocessing

  • Prometheus pattern: VALUE(process_cpu_seconds_total)

  • Change per second
  • Custom multiplier: 100

Kubernetes Controller Manager: Goroutines

Number of goroutines that currently exist.

Dependent item kubernetes.controller.go_goroutines

Preprocessing

  • Prometheus pattern: SUM(go_goroutines)

    Custom on fail: Discard value

Kubernetes Controller Manager: Go threads

Number of OS threads created.

Dependent item kubernetes.controller.go_threads

Preprocessing

  • Prometheus pattern: VALUE(go_threads)

    Custom on fail: Discard value

Kubernetes Controller Manager: Fds open

Number of open file descriptors.

Dependent item kubernetes.controller.open_fds

Preprocessing

  • Prometheus pattern: VALUE(process_open_fds)

    Custom on fail: Discard value

Kubernetes Controller Manager: Fds max

Maximum allowed open file descriptors.

Dependent item kubernetes.controller.max_fds

Preprocessing

  • Prometheus pattern: VALUE(process_max_fds)

    Custom on fail: Discard value

Kubernetes Controller Manager: REST Client requests: 2xx, rate

Number of HTTP requests with 2xx status code per second.

Dependent item kubernetes.controller.client_http_requests_200.rate

Preprocessing

  • Prometheus pattern: SUM(rest_client_requests_total{code =~ "2.."})

    Custom on fail: Discard value

  • Change per second
Kubernetes Controller Manager: REST Client requests: 3xx, rate

Number of HTTP requests with 3xx status code per second.

Dependent item kubernetes.controller.client_http_requests_300.rate

Preprocessing

  • Prometheus pattern: SUM(rest_client_requests_total{code =~ "3.."})

    Custom on fail: Discard value

  • Change per second
Kubernetes Controller Manager: REST Client requests: 4xx, rate

Number of HTTP requests with 4xx status code per second.

Dependent item kubernetes.controller.client_http_requests_400.rate

Preprocessing

  • Prometheus pattern: SUM(rest_client_requests_total{code =~ "4.."})

    Custom on fail: Discard value

  • Change per second
Kubernetes Controller Manager: REST Client requests: 5xx, rate

Number of HTTP requests with 5xx status code per second.

Dependent item kubernetes.controller.client_http_requests_500.rate

Preprocessing

  • Prometheus pattern: SUM(rest_client_requests_total{code =~ "5.."})

    Custom on fail: Discard value

  • Change per second
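The preprocessing chains above can be illustrated outside Zabbix. The following is a minimal sketch, not the Zabbix implementation: it parses Prometheus text, sums the samples whose label matches a fully anchored regular expression (as in SUM(rest_client_requests_total{code =~ "5.."})), and derives a per-second rate from two readings. The sample data, function names, and the handling of counter resets are assumptions made for illustration.

```python
import re

# Hypothetical excerpt of a /metrics response.
SAMPLE = '''\
# TYPE rest_client_requests_total counter
rest_client_requests_total{code="200",host="10.0.0.1:6443",method="GET"} 1200
rest_client_requests_total{code="500",host="10.0.0.1:6443",method="GET"} 30
rest_client_requests_total{code="503",host="10.0.0.1:6443",method="PUT"} 12
process_cpu_seconds_total 250.5
'''

LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric_name, labels_dict, value) for each sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            yield name, dict(LABEL_RE.findall(labels or "")), float(value)


def sum_matching(text, metric, label, pattern):
    """Sum the values of `metric` whose `label` fully matches `pattern`."""
    regex = re.compile(pattern)
    return sum(
        value
        for name, labels, value in parse_samples(text)
        if name == metric and regex.fullmatch(labels.get(label, ""))
    )


def change_per_second(previous, current, seconds):
    """Rate between two counter readings; None if the counter went backwards.

    Assumption: a decrease is treated as a counter reset and yields no value.
    """
    if current < previous or seconds <= 0:
        return None
    return (current - previous) / seconds


errors_5xx = sum_matching(SAMPLE, "rest_client_requests_total", "code", "5..")  # 30 + 12
```

The CPU item follows the same idea: the change per second of process_cpu_seconds_total between two polls, multiplied by 100, gives a utilization percentage.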

Triggers

Name Description Expression Severity Dependencies and additional info
Kubernetes Controller Manager: Too many HTTP client errors

Kubernetes Controller manager is experiencing a high error rate (with 5xx HTTP code).

min(/Kubernetes Controller manager by HTTP/kubernetes.controller.client_http_requests_500.rate,5m)>{$KUBE.CONTROLLER.HTTP.CLIENT.ERROR} Warning
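The trigger expression fires only when the lowest 5xx rate seen in the last 5 minutes is still above the threshold, so a single spike does not raise a problem. A minimal sketch of that logic, assuming samples are (timestamp, rate) pairs and using a hypothetical function name:

```python
def too_many_client_errors(samples, now, threshold=2.0, window=300):
    """Return True if every rate within the last `window` seconds exceeds `threshold`.

    samples: iterable of (unix_timestamp, rate_per_second) pairs.
    threshold: corresponds to {$KUBE.CONTROLLER.HTTP.CLIENT.ERROR} (default 2).
    """
    recent = [rate for ts, rate in samples if now - window < ts <= now]
    # With no data in the window there is nothing to compare, so do not fire.
    return bool(recent) and min(recent) > threshold
```

For example, rates of 3, 4 and 5 within the window would fire the trigger, while 3, 1 and 5 would not, because the minimum is below the threshold.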

LLD rule Workqueue metrics discovery

Name Description Type Key and additional info
Workqueue metrics discovery Dependent item kubernetes.controller.workqueue.discovery

Preprocessing

  • Prometheus to JSON: {__name__=~ "workqueue_*", name =~ ".*"}

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 3h
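The template's JavaScript step is not reproduced in this document, so the following is only a sketch of what a discovery step of this kind could look like: it takes rows shaped like the output of the Prometheus to JSON step (objects with name, labels and value keys, which is an assumption here) and emits one low-level discovery entry per distinct workqueue name. The item prototypes below also reference {#LE}, which this sketch does not attempt to reproduce.

```python
import json


def discover_workqueues(rows):
    """Build LLD JSON with one {#NAME} entry per distinct workqueue.

    rows: list of dicts such as
        {"name": "workqueue_depth", "labels": {"name": "job"}, "value": "2"}
    """
    names = sorted({
        row["labels"]["name"]
        for row in rows
        if row.get("name", "").startswith("workqueue_")
        and "name" in row.get("labels", {})
    })
    return json.dumps([{"{#NAME}": name} for name in names])


# Hypothetical input rows for illustration.
ROWS = [
    {"name": "workqueue_depth", "labels": {"name": "deployment"}, "value": "0"},
    {"name": "workqueue_adds_total", "labels": {"name": "deployment"}, "value": "17"},
    {"name": "workqueue_depth", "labels": {"name": "job"}, "value": "2"},
]
```

Each {#NAME} value then fills the "{#NAME}" placeholder in the item prototype keys and Prometheus patterns.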

Item prototypes for Workqueue metrics discovery

Name Description Type Key and additional info
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue adds total, rate

Total number of adds handled by workqueue per second.

Dependent item kubernetes.controller.workqueue_adds_total["{#NAME}"]

Preprocessing

  • Prometheus pattern: VALUE(workqueue_adds_total{name = "{#NAME}"})

    Custom on fail: Discard value

  • Change per second
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue depth

Current depth of workqueue.

Dependent item kubernetes.controller.workqueue_depth["{#NAME}"]

Preprocessing

  • Prometheus pattern: VALUE(workqueue_depth{name = "{#NAME}"})

    Custom on fail: Discard value

Kubernetes Controller Manager: ["{#NAME}"]: Workqueue unfinished work, sec

How many seconds of work has been done that is in progress and hasn't been observed by work_duration. Large values indicate stuck threads. One can deduce the number of stuck threads by observing the rate at which this increases.

Dependent item kubernetes.controller.workqueue_unfinished_work_seconds["{#NAME}"]

Preprocessing

  • Prometheus pattern: VALUE(workqueue_unfinished_work_seconds{name = "{#NAME}"})

    Custom on fail: Discard value

Kubernetes Controller Manager: ["{#NAME}"]: Workqueue retries, rate

Total number of retries handled by workqueue per second.

Dependent item kubernetes.controller.workqueue_retries_total["{#NAME}"]

Preprocessing

  • Prometheus pattern: VALUE(workqueue_retries_total{name = "{#NAME}"})

    Custom on fail: Discard value

  • Change per second
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue longest running processor, sec

How many seconds has the longest running processor for workqueue been running.

Dependent item kubernetes.controller.workqueue_longest_running_processor_seconds["{#NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    Custom on fail: Discard value

Kubernetes Controller Manager: ["{#NAME}"]: Workqueue work duration, p90

90th percentile of how long in seconds processing an item from workqueue takes, by queue.

Calculated kubernetes.controller.workqueue_work_duration_seconds_p90["{#NAME}"]
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue work duration, p95

95th percentile of how long in seconds processing an item from workqueue takes, by queue.

Calculated kubernetes.controller.workqueue_work_duration_seconds_p95["{#NAME}"]
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue work duration, p99

99th percentile of how long in seconds processing an item from workqueue takes, by queue.

Calculated kubernetes.controller.workqueue_work_duration_seconds_p99["{#NAME}"]
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue work duration, 50p

50th percentile of how long in seconds processing an item from workqueue takes, by queue.

Calculated kubernetes.controller.workqueue_work_duration_seconds_p50["{#NAME}"]
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue queue duration, p90

90th percentile of how long in seconds an item stays in workqueue before being requested, by queue.

Calculated kubernetes.controller.workqueue_queue_duration_seconds_p90["{#NAME}"]
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue queue duration, p95

95th percentile of how long in seconds an item stays in workqueue before being requested, by queue.

Calculated kubernetes.controller.workqueue_queue_duration_seconds_p95["{#NAME}"]
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue queue duration, p99

99th percentile of how long in seconds an item stays in workqueue before being requested, by queue.

Calculated kubernetes.controller.workqueue_queue_duration_seconds_p99["{#NAME}"]
Kubernetes Controller Manager: ["{#NAME}"]: Workqueue queue duration, 50p

50th percentile of how long in seconds an item stays in workqueue before being requested. If there are no requests for 5 minutes, the item value will be discarded.

Calculated kubernetes.controller.workqueue_queue_duration_seconds_p50["{#NAME}"]

Preprocessing

  • Check for not supported value

    Custom on fail: Discard value

Kubernetes Controller Manager: ["{#NAME}"]: Workqueue duration seconds bucket, {#LE}

How long in seconds processing an item from workqueue takes.

Dependent item kubernetes.controller.duration_seconds_bucket[{#LE},"{#NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    Custom on fail: Discard value

Kubernetes Controller Manager: ["{#NAME}"]: Queue duration seconds bucket, {#LE}

How long in seconds an item stays in workqueue before being requested.

Dependent item kubernetes.controller.queue_duration_seconds_bucket[{#LE},"{#NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    Custom on fail: Discard value
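The percentile items are calculated from the bucket items, one per {#LE} upper bound. The exact formula used by the template is not shown in this document; the sketch below only illustrates the general idea of estimating a percentile from cumulative histogram buckets by linear interpolation, in the style of Prometheus histogram quantiles. The function name and the sample buckets are assumptions.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) pairs, including
    the (math.inf, total_count) bucket. Returns None if there are no
    observations.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return None
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the open-ended bucket:
                # report the highest finite bound instead.
                return prev_bound
            # Interpolate linearly inside the bucket that holds the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for le = 0.1, 0.5, 1.0 and +Inf.
BUCKETS = [(0.1, 50), (0.5, 80), (1.0, 95), (math.inf, 100)]
p90 = histogram_quantile(0.90, BUCKETS)  # falls between 0.5 and 1.0 seconds
```

Because the estimate is interpolated inside a bucket, its precision is bounded by the bucket widths exposed by the Controller manager.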

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help at ZABBIX forums