Kubernetes Scheduler by HTTP

Overview

This template monitors Kubernetes Scheduler with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

The template Kubernetes Scheduler by HTTP collects metrics with the HTTP agent from the Scheduler /metrics endpoint.

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

  • Kubernetes Scheduler 1.19.10

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

Internal service metrics are collected from the /metrics endpoint. The template requires authorization via an API token.

Don't forget to change the macros {$KUBE.SCHEDULER.SERVER.URL} and {$KUBE.API.TOKEN}. Also, see the Macros section for a list of macros used to set trigger values.

NOTE. You might need to set the --bind-address option for Scheduler to an address where Zabbix proxy can reach it. For example, for clusters created with kubeadm it can be set in the following manifest file (changes will be applied immediately):

  • /etc/kubernetes/manifests/kube-scheduler.yaml

NOTE. Some metrics may not be collected depending on your Kubernetes Scheduler instance version and configuration.
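
Before linking the template, it can help to confirm that the endpoint answers when the token is supplied. The following is a minimal Python sketch, not part of the template: it assumes the token is sent as a Bearer Authorization header, uses the default URL from the Macros section, and the helper names build_metrics_request and fetch_metrics are hypothetical. Disabling certificate verification is only meant for a quick manual check.

```python
import ssl
import urllib.request


def build_metrics_request(url: str, token: str) -> urllib.request.Request:
    """Build a GET request that carries the token as a Bearer header (assumed scheme)."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_metrics(url: str, token: str, verify_tls: bool = True, timeout: float = 5.0) -> str:
    """Fetch the raw Prometheus text from the Scheduler /metrics endpoint."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Only for a manual check against a self-signed certificate.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    request = build_metrics_request(url, token)
    with urllib.request.urlopen(request, context=context, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example call (needs network access to the Scheduler; the token is a placeholder):
# print(fetch_metrics("https://localhost:10259/metrics", "<token>", verify_tls=False)[:500])
```

If the call returns Prometheus text, the same URL and token can be placed in {$KUBE.SCHEDULER.SERVER.URL} and {$KUBE.API.TOKEN}.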

Macros used

| Name | Description | Default |
|------|-------------|---------|
| {$KUBE.SCHEDULER.SERVER.URL} | Kubernetes Scheduler metrics endpoint URL. | https://localhost:10259/metrics |
| {$KUBE.API.TOKEN} | API Authorization Token. | |
| {$KUBE.SCHEDULER.HTTP.CLIENT.ERROR} | Maximum number of HTTP client request failures used for the trigger. | 2 |
| {$KUBE.SCHEDULER.UNSCHEDULABLE} | Maximum number of scheduling failures with 'unschedulable' used for the trigger. | 2 |
| {$KUBE.SCHEDULER.ERROR} | Maximum number of scheduling failures with 'error' used for the trigger. | 2 |

Items

Name Description Type Key and additional info
Kubernetes Scheduler: Get Scheduler metrics

Get raw metrics from Scheduler instance /metrics endpoint.

HTTP agent kubernetes.scheduler.get_metrics

Preprocessing

  • Check for not supported value

    Custom on fail: Discard value

Kubernetes Scheduler: Virtual memory, bytes

Virtual memory size in bytes.

Dependent item kubernetes.scheduler.process_virtual_memory_bytes

Preprocessing

  • Prometheus pattern: VALUE(process_virtual_memory_bytes)

    Custom on fail: Discard value

Kubernetes Scheduler: Resident memory, bytes

Resident memory size in bytes.

Dependent item kubernetes.scheduler.process_resident_memory_bytes

Preprocessing

  • Prometheus pattern: VALUE(process_resident_memory_bytes)

    Custom on fail: Discard value

Kubernetes Scheduler: CPU

Total user and system CPU usage ratio.

Dependent item kubernetes.scheduler.cpu.util

Preprocessing

  • Prometheus pattern: VALUE(process_cpu_seconds_total)

  • Change per second
  • Custom multiplier: 100

Kubernetes Scheduler: Goroutines

Number of goroutines that currently exist.

Dependent item kubernetes.scheduler.go_goroutines

Preprocessing

  • Prometheus pattern: SUM(go_goroutines)

    Custom on fail: Discard value

Kubernetes Scheduler: Go threads

Number of OS threads created.

Dependent item kubernetes.scheduler.go_threads

Preprocessing

  • Prometheus pattern: VALUE(go_threads)

    Custom on fail: Discard value

Kubernetes Scheduler: Fds open

Number of open file descriptors.

Dependent item kubernetes.scheduler.open_fds

Preprocessing

  • Prometheus pattern: VALUE(process_open_fds)

    Custom on fail: Discard value

Kubernetes Scheduler: Fds max

Maximum allowed open file descriptors.

Dependent item kubernetes.scheduler.max_fds

Preprocessing

  • Prometheus pattern: VALUE(process_max_fds)

    Custom on fail: Discard value

Kubernetes Scheduler: REST Client requests: 2xx, rate

Number of HTTP requests with 2xx status code per second.

Dependent item kubernetes.scheduler.client_http_requests_200.rate

Preprocessing

  • Prometheus pattern: SUM(rest_client_requests_total{code =~ "2.."})

    Custom on fail: Discard value

  • Change per second
Kubernetes Scheduler: REST Client requests: 3xx, rate

Number of HTTP requests with 3xx status code per second.

Dependent item kubernetes.scheduler.client_http_requests_300.rate

Preprocessing

  • Prometheus pattern: SUM(rest_client_requests_total{code =~ "3.."})

    Custom on fail: Discard value

  • Change per second
Kubernetes Scheduler: REST Client requests: 4xx, rate

Number of HTTP requests with 4xx status code per second.

Dependent item kubernetes.scheduler.client_http_requests_400.rate

Preprocessing

  • Prometheus pattern: SUM(rest_client_requests_total{code =~ "4.."})

    Custom on fail: Discard value

  • Change per second
Kubernetes Scheduler: REST Client requests: 5xx, rate

Number of HTTP requests with 5xx status code per second.

Dependent item kubernetes.scheduler.client_http_requests_500.rate

Preprocessing

  • Prometheus pattern: SUM(rest_client_requests_total{code =~ "5.."})

    Custom on fail: Discard value

  • Change per second
Kubernetes Scheduler: Schedule attempts: scheduled

Number of attempts to schedule pods with result "scheduled" per second.

Dependent item kubernetes.scheduler.scheduler_schedule_attempts.scheduled.rate

Preprocessing

  • Prometheus pattern: SUM(scheduler_schedule_attempts_total{result = "scheduled"})

    Custom on fail: Discard value

  • Change per second
Kubernetes Scheduler: Schedule attempts: unschedulable

Number of attempts to schedule pods with result "unschedulable" per second.

Dependent item kubernetes.scheduler.scheduler_schedule_attempts.unschedulable.rate

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    Custom on fail: Discard value

  • Change per second
Kubernetes Scheduler: Schedule attempts: error

Number of attempts to schedule pods with result "error" per second.

Dependent item kubernetes.scheduler.scheduler_schedule_attempts.error.rate

Preprocessing

  • Prometheus pattern: SUM(scheduler_schedule_attempts_total{result = "error"})

    Custom on fail: Discard value

  • Change per second
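
To make the rate items above more concrete, here is a rough Python sketch of what a step such as SUM(rest_client_requests_total{code =~ "5.."}) followed by "Change per second" computes. It illustrates the idea only and is not how Zabbix implements preprocessing; the function names, the simplified parser, and the sample data are assumptions made for the example.

```python
import re

# Simplified matcher for one Prometheus text-format sample line.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_PAIR = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def sum_metric(text, name, label=None, pattern=None):
    """Sum all samples of `name`, optionally keeping only those whose
    `label` value fully matches the regular expression `pattern`."""
    total = 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if not match or match.group("name") != name:
            continue
        labels = dict(LABEL_PAIR.findall(match.group("labels") or ""))
        if label is not None and not re.fullmatch(pattern, labels.get(label, "")):
            continue
        total += float(match.group("value"))
    return total


def change_per_second(prev_value, prev_ts, value, ts):
    """Per-second rate between two counter readings.
    In this sketch a decreasing counter or non-positive interval yields None."""
    if ts <= prev_ts or value < prev_value:
        return None
    return (value - prev_value) / (ts - prev_ts)


SAMPLE = """\
# TYPE rest_client_requests_total counter
rest_client_requests_total{code="200",host="10.0.0.1:6443",method="GET"} 120
rest_client_requests_total{code="500",host="10.0.0.1:6443",method="GET"} 4
rest_client_requests_total{code="503",host="10.0.0.1:6443",method="PUT"} 2
"""

errors_5xx = sum_metric(SAMPLE, "rest_client_requests_total", "code", "5..")
rate = change_per_second(0.0, 0, errors_5xx, 60)
```

With the sample data, the two 5xx series are added into a single counter value, and the rate is that counter's growth divided by the seconds elapsed between two polls.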

Triggers

Name Description Expression Severity Dependencies and additional info
Kubernetes Scheduler: Too many REST Client errors

Kubernetes Scheduler REST Client requests are experiencing a high error rate (with 5xx HTTP code).

min(/Kubernetes Scheduler by HTTP/kubernetes.scheduler.client_http_requests_500.rate,5m)>{$KUBE.SCHEDULER.HTTP.CLIENT.ERROR} Warning
Kubernetes Scheduler: Too many unschedulable pods

Number of attempts to schedule pods with 'unschedulable' result is too high. 'unschedulable' means a pod could not be scheduled.

min(/Kubernetes Scheduler by HTTP/kubernetes.scheduler.scheduler_schedule_attempts.unschedulable.rate,5m)>{$KUBE.SCHEDULER.UNSCHEDULABLE} Warning
Kubernetes Scheduler: Too many schedule attempts with errors

Number of attempts to schedule pods with 'error' result is too high. 'error' means an internal scheduler problem.

min(/Kubernetes Scheduler by HTTP/kubernetes.scheduler.scheduler_schedule_attempts.error.rate,5m)>{$KUBE.SCHEDULER.ERROR} Warning
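
All three trigger expressions share the same shape: the minimum value over the last 5 minutes must exceed the macro threshold, so a single short spike does not fire the trigger. A small Python sketch of that logic, with hypothetical function names and made-up samples:

```python
def min_over_window(samples, now, window_seconds=300):
    """Lowest value among (timestamp, value) samples within the last window."""
    recent = [value for ts, value in samples if now - window_seconds < ts <= now]
    return min(recent) if recent else None


def trigger_fires(samples, now, threshold=2.0, window_seconds=300):
    """True only when every sample in the window is above the threshold."""
    lowest = min_over_window(samples, now, window_seconds)
    return lowest is not None and lowest > threshold


# One reading per minute; values are rates (events per second).
sustained = [(60, 3.1), (120, 2.8), (180, 4.0), (240, 3.3), (300, 2.5)]
spike = [(60, 0.0), (120, 9.0), (180, 0.1), (240, 0.0), (300, 0.0)]
```

Here trigger_fires(sustained, now=300) is true because even the lowest reading is above 2, while trigger_fires(spike, now=300) is false because the rate dropped back below the threshold within the window.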

LLD rule Scheduling algorithm histogram

Name Description Type Key and additional info
Scheduling algorithm histogram

Discovery raw data of scheduling algorithm latency.

Dependent item kubernetes.scheduler.scheduling_algorithm.discovery

Preprocessing

  • Prometheus to JSON: The text is too long. Please see the template.

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 3h

Item prototypes for Scheduling algorithm histogram

Name Description Type Key and additional info
Kubernetes Scheduler: Scheduling algorithm duration bucket, {#LE}

Scheduling algorithm latency in seconds.

Dependent item kubernetes.scheduler.scheduling_algorithm_duration[{#LE}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    Custom on fail: Discard value

Kubernetes Scheduler: Scheduling algorithm duration, p90

90 percentile of scheduling algorithm latency in seconds.

Calculated kubernetes.scheduler.scheduling_algorithm_duration_p90[{#SINGLETON}]
Kubernetes Scheduler: Scheduling algorithm duration, p95

95 percentile of scheduling algorithm latency in seconds.

Calculated kubernetes.scheduler.scheduling_algorithm_duration_p95[{#SINGLETON}]
Kubernetes Scheduler: Scheduling algorithm duration, p99

99 percentile of scheduling algorithm latency in seconds.

Calculated kubernetes.scheduler.scheduling_algorithm_duration_p99[{#SINGLETON}]
Kubernetes Scheduler: Scheduling algorithm duration, p50

50 percentile of scheduling algorithm latency in seconds.

Calculated kubernetes.scheduler.scheduling_algorithm_duration_p50[{#SINGLETON}]
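
The formulas of the calculated percentile items are not shown in this README. As an illustration of how a percentile can be estimated from cumulative histogram buckets such as the {#LE} items above, the sketch below uses linear interpolation inside the bucket that contains the target rank, similar in spirit to Prometheus histogram_quantile. The function name and the sample bucket counts are assumptions, and the template's actual calculation may differ.

```python
import math


def percentile_from_buckets(buckets, percentile):
    """Estimate a percentile from (upper_bound, cumulative_count) pairs.
    The list is expected to include the +Inf bucket."""
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return None  # no observations in the window
    rank = percentile * total / 100.0
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical bucket counts observed during one window (le in seconds).
BUCKETS = [(0.001, 10), (0.002, 50), (0.004, 90), (0.008, 100), (math.inf, 100)]

p50 = percentile_from_buckets(BUCKETS, 50)
p95 = percentile_from_buckets(BUCKETS, 95)
```

The accuracy of such an estimate depends on the bucket boundaries: a percentile can never be resolved more finely than the width of the bucket it falls into.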

LLD rule Binding histogram

Name Description Type Key and additional info
Binding histogram

Discovery raw data of binding latency.

Dependent item kubernetes.scheduler.binding.discovery

Preprocessing

  • Prometheus to JSON: {__name__=~ "scheduler_binding_duration_seconds_*"}

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 3h

Item prototypes for Binding histogram

Name Description Type Key and additional info
Kubernetes Scheduler: Binding duration bucket, {#LE}

Binding latency in seconds.

Dependent item kubernetes.scheduler.binding_duration[{#LE}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    Custom on fail: Discard value

Kubernetes Scheduler: Binding duration, p90

90 percentile of binding latency in seconds.

Calculated kubernetes.scheduler.binding_duration_p90[{#SINGLETON}]
Kubernetes Scheduler: Binding duration, p95

95 percentile of binding latency in seconds.

Calculated kubernetes.scheduler.binding_duration_p95[{#SINGLETON}]
Kubernetes Scheduler: Binding duration, p99

99 percentile of binding latency in seconds.

Calculated kubernetes.scheduler.binding_duration_p99[{#SINGLETON}]
Kubernetes Scheduler: Binding duration, p50

50 percentile of binding latency in seconds.

Calculated kubernetes.scheduler.binding_duration_p50[{#SINGLETON}]

LLD rule e2e scheduling histogram

Name Description Type Key and additional info
e2e scheduling histogram

Discovery raw data and percentile items of e2e scheduling latency.

Dependent item kubernetes.controller.e2e_scheduling.discovery

Preprocessing

  • Prometheus to JSON: The text is too long. Please see the template.

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 3h

Item prototypes for e2e scheduling histogram

Name Description Type Key and additional info
Kubernetes Scheduler: ["{#RESULT}"]: e2e scheduling seconds bucket, {#LE}

E2e scheduling latency in seconds (scheduling algorithm + binding).

Dependent item kubernetes.scheduler.e2e_scheduling_bucket[{#LE},"{#RESULT}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    Custom on fail: Discard value

Kubernetes Scheduler: ["{#RESULT}"]: e2e scheduling, p50

50 percentile of e2e scheduling latency.

Calculated kubernetes.scheduler.e2e_scheduling_p50["{#RESULT}"]
Kubernetes Scheduler: ["{#RESULT}"]: e2e scheduling, p90

90 percentile of e2e scheduling latency.

Calculated kubernetes.scheduler.e2e_scheduling_p90["{#RESULT}"]
Kubernetes Scheduler: ["{#RESULT}"]: e2e scheduling, p95

95 percentile of e2e scheduling latency.

Calculated kubernetes.scheduler.e2e_scheduling_p95["{#RESULT}"]
Kubernetes Scheduler: ["{#RESULT}"]: e2e scheduling, p99

99 percentile of e2e scheduling latency.

Calculated kubernetes.scheduler.e2e_scheduling_p99["{#RESULT}"]

Feedback

Please report any issues with the template at https://support.zabbix.com.

You can also provide feedback, discuss the template, or ask for help at the ZABBIX forums.