Kubernetes Scheduler by HTTP

Overview

This template monitors Kubernetes Scheduler with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

The template Kubernetes Scheduler by HTTP collects metrics with an HTTP agent item from the Scheduler /metrics endpoint.

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

Kubernetes Scheduler 1.19.10

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

Internal service metrics are collected from the /metrics endpoint. The template needs to authorize its requests with an API token.

Don't forget to change the macros {$KUBE.SCHEDULER.SERVER.URL} and {$KUBE.API.TOKEN}. Also, see the Macros section for a list of macros used to set trigger values.
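To check the two macros before saving them in Zabbix, it can help to request the endpoint by hand. The following is a minimal sketch, not the template's implementation. It assumes the token is sent as a Bearer credential in the Authorization header, and the names `SCHEDULER_URL`, `API_TOKEN`, `build_metrics_request`, and `fetch_metrics` are hypothetical.

```python
import urllib.request

# Hypothetical stand-ins for {$KUBE.SCHEDULER.SERVER.URL} and {$KUBE.API.TOKEN}.
SCHEDULER_URL = "https://localhost:10259/metrics"
API_TOKEN = "replace-with-a-real-token"


def build_metrics_request(url: str, token: str) -> urllib.request.Request:
    """Build a GET request that carries the token as a Bearer credential.

    The Bearer scheme is an assumption made for this sketch.
    """
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def fetch_metrics(url: str, token: str, timeout: float = 5.0) -> str:
    """Return the raw metrics text; this needs a reachable Scheduler."""
    request = build_metrics_request(url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_metrics(SCHEDULER_URL, API_TOKEN)` should return the plain-text metrics if the URL and token are valid. If the Scheduler serves a self-signed certificate, certificate verification may fail; handling that is left out of this sketch.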

NOTE. You might need to set the --bind-address option of the Scheduler to an address that the Zabbix proxy can reach. For example, in clusters created with kubeadm, it can be set in the following manifest file (changes will be applied immediately):

/etc/kubernetes/manifests/kube-scheduler.yaml

NOTE. Some metrics may not be collected depending on your Kubernetes Scheduler instance version and configuration.

Macros used

Name	Description	Default
{$KUBE.SCHEDULER.SERVER.URL}	Kubernetes Scheduler metrics endpoint URL.	`https://localhost:10259/metrics`
{$KUBE.API.TOKEN}	API Authorization Token.
{$KUBE.SCHEDULER.HTTP.CLIENT.ERROR}	Maximum number of HTTP client request failures, used for the trigger.	`2`
{$KUBE.SCHEDULER.UNSCHEDULABLE}	Maximum number of scheduling failures with the 'unschedulable' result, used for the trigger.	`2`
{$KUBE.SCHEDULER.ERROR}	Maximum number of scheduling failures with the 'error' result, used for the trigger.	`2`

Items

Name	Description	Type	Key and additional info
Kubernetes Scheduler: Get Scheduler metrics	Get raw metrics from Scheduler instance /metrics endpoint.	HTTP agent	kubernetes.scheduler.get_metrics Preprocessing Check for not supported value ⛔️Custom on fail: Discard value
Kubernetes Scheduler: Virtual memory, bytes	Virtual memory size in bytes.	Dependent item	kubernetes.scheduler.process_virtual_memory_bytes Preprocessing Prometheus pattern: `VALUE(process_virtual_memory_bytes)` ⛔️Custom on fail: Discard value
Kubernetes Scheduler: Resident memory, bytes	Resident memory size in bytes.	Dependent item	kubernetes.scheduler.process_resident_memory_bytes Preprocessing Prometheus pattern: `VALUE(process_resident_memory_bytes)` ⛔️Custom on fail: Discard value
Kubernetes Scheduler: CPU	Total user and system CPU usage ratio.	Dependent item	kubernetes.scheduler.cpu.util Preprocessing Prometheus pattern: `VALUE(process_cpu_seconds_total)` Change per second Custom multiplier: `100`
Kubernetes Scheduler: Goroutines	Number of goroutines that currently exist.	Dependent item	kubernetes.scheduler.go_goroutines Preprocessing Prometheus pattern: `SUM(go_goroutines)` ⛔️Custom on fail: Discard value
Kubernetes Scheduler: Go threads	Number of OS threads created.	Dependent item	kubernetes.scheduler.go_threads Preprocessing Prometheus pattern: `VALUE(go_threads)` ⛔️Custom on fail: Discard value
Kubernetes Scheduler: Fds open	Number of open file descriptors.	Dependent item	kubernetes.scheduler.open_fds Preprocessing Prometheus pattern: `VALUE(process_open_fds)` ⛔️Custom on fail: Discard value
Kubernetes Scheduler: Fds max	Maximum allowed open file descriptors.	Dependent item	kubernetes.scheduler.max_fds Preprocessing Prometheus pattern: `VALUE(process_max_fds)` ⛔️Custom on fail: Discard value
Kubernetes Scheduler: REST Client requests: 2xx, rate	Number of HTTP requests with 2xx status code per second.	Dependent item	kubernetes.scheduler.client_http_requests_200.rate Preprocessing Prometheus pattern: `SUM(rest_client_requests_total{code =~ "2.."})` ⛔️Custom on fail: Discard value Change per second
Kubernetes Scheduler: REST Client requests: 3xx, rate	Number of HTTP requests with 3xx status code per second.	Dependent item	kubernetes.scheduler.client_http_requests_300.rate Preprocessing Prometheus pattern: `SUM(rest_client_requests_total{code =~ "3.."})` ⛔️Custom on fail: Discard value Change per second
Kubernetes Scheduler: REST Client requests: 4xx, rate	Number of HTTP requests with 4xx status code per second.	Dependent item	kubernetes.scheduler.client_http_requests_400.rate Preprocessing Prometheus pattern: `SUM(rest_client_requests_total{code =~ "4.."})` ⛔️Custom on fail: Discard value Change per second
Kubernetes Scheduler: REST Client requests: 5xx, rate	Number of HTTP requests with 5xx status code per second.	Dependent item	kubernetes.scheduler.client_http_requests_500.rate Preprocessing Prometheus pattern: `SUM(rest_client_requests_total{code =~ "5.."})` ⛔️Custom on fail: Discard value Change per second
Kubernetes Scheduler: Schedule attempts: scheduled	Number of attempts to schedule pods with result "scheduled" per second.	Dependent item	kubernetes.scheduler.scheduler_schedule_attempts.scheduled.rate Preprocessing Prometheus pattern: `SUM(scheduler_schedule_attempts_total{result = "scheduled"})` ⛔️Custom on fail: Discard value Change per second
Kubernetes Scheduler: Schedule attempts: unschedulable	Number of attempts to schedule pods with result "unschedulable" per second.	Dependent item	kubernetes.scheduler.scheduler_schedule_attempts.unschedulable.rate Preprocessing Prometheus pattern: `The text is too long. Please see the template.` ⛔️Custom on fail: Discard value Change per second
Kubernetes Scheduler: Schedule attempts: error	Number of attempts to schedule pods with result "error" per second.	Dependent item	kubernetes.scheduler.scheduler_schedule_attempts.error.rate Preprocessing Prometheus pattern: `SUM(scheduler_schedule_attempts_total{result = "error"})` ⛔️Custom on fail: Discard value Change per second
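As an illustration of what a dependent item such as the 2xx rate does with the raw text, the sketch below sums the samples that match a label pattern and then derives a per-second rate from two consecutive sums. It is a simplified model, not Zabbix code. The parser handles only simple exposition lines, and the sample data and the names `sum_matching` and `change_per_second` are hypothetical.

```python
import re

# Matches "name{labels} value" or "name value" in Prometheus text format.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Hypothetical scrape result, for illustration only.
SAMPLE = """\
# TYPE rest_client_requests_total counter
rest_client_requests_total{code="200",host="10.0.0.1:6443",method="GET"} 120
rest_client_requests_total{code="201",host="10.0.0.1:6443",method="POST"} 30
rest_client_requests_total{code="500",host="10.0.0.1:6443",method="GET"} 4
"""


def sum_matching(text, metric, label, pattern):
    """Sum samples of `metric` whose `label` value fully matches `pattern`."""
    total = 0.0
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        if re.fullmatch(pattern, labels.get(label, "")):
            total += float(match.group("value"))
    return total


def change_per_second(prev_value, prev_ts, value, ts):
    """Return (value - prev_value) / (ts - prev_ts), or None if it cannot be computed."""
    if ts <= prev_ts or value < prev_value:
        return None  # a counter that went backwards yields no rate here
    return (value - prev_value) / (ts - prev_ts)
```

For example, `sum_matching(SAMPLE, "rest_client_requests_total", "code", "2..")` adds the 200 and 201 samples, and `change_per_second` turns two such sums taken 60 seconds apart into requests per second.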

Triggers

Name	Description	Expression	Severity
Kubernetes Scheduler: Too many REST Client errors	Kubernetes Scheduler REST Client requests are experiencing a high error rate (with 5xx HTTP code).	`min(/Kubernetes Scheduler by HTTP/kubernetes.scheduler.client_http_requests_500.rate,5m)>{$KUBE.SCHEDULER.HTTP.CLIENT.ERROR}`	Warning
Kubernetes Scheduler: Too many unschedulable pods	Number of attempts to schedule pods with 'unschedulable' result is too high. 'unschedulable' means a pod could not be scheduled.	`min(/Kubernetes Scheduler by HTTP/kubernetes.scheduler.scheduler_schedule_attempts.unschedulable.rate,5m)>{$KUBE.SCHEDULER.UNSCHEDULABLE}`	Warning
Kubernetes Scheduler: Too many schedule attempts with errors	Number of attempts to schedule pods with 'error' result is too high. 'error' means an internal scheduler problem.	`min(/Kubernetes Scheduler by HTTP/kubernetes.scheduler.scheduler_schedule_attempts.error.rate,5m)>{$KUBE.SCHEDULER.ERROR}`	Warning
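All three expressions share one shape: the minimum of a rate over the last 5 minutes must exceed a macro value. The sketch below models that condition in plain Python to show why a single spike does not fire the trigger. It is an illustration under simplifying assumptions, not Zabbix's evaluation engine, and the name `trigger_fires` is hypothetical.

```python
def trigger_fires(samples, now, threshold, window=300):
    """Return True if every rate in the last `window` seconds exceeds `threshold`.

    samples: list of (unix_timestamp, rate) pairs.
    Comparing the window minimum with the threshold is the same as
    requiring every value in the window to be above it.
    """
    recent = [rate for ts, rate in samples if now - window < ts <= now]
    if not recent:
        return False  # no data in the window; this sketch treats it as "not firing"
    return min(recent) > threshold


# Sustained error rate above the default threshold of 2: fires.
sustained = [(60, 3.0), (120, 4.5), (180, 3.2), (240, 5.0), (300, 2.1)]
# One sample at or below the threshold within the window: does not fire.
spiky = [(60, 9.0), (120, 0.0), (180, 9.0), (240, 9.0), (300, 9.0)]
```

With `now=300` and `threshold=2`, `trigger_fires(sustained, 300, 2)` is true and `trigger_fires(spiky, 300, 2)` is false, because the minimum of the second series is 0.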

LLD rule Scheduling algorithm histogram

Name	Description	Type	Key and additional info
Scheduling algorithm histogram	Discovery raw data of scheduling algorithm latency.	Dependent item	kubernetes.scheduler.scheduling_algorithm.discovery Preprocessing Prometheus to JSON: `The text is too long. Please see the template.` JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `3h`

Item prototypes for Scheduling algorithm histogram

Name	Description	Type	Key and additional info
Kubernetes Scheduler: Scheduling algorithm duration bucket, {#LE}	Scheduling algorithm latency in seconds.	Dependent item	kubernetes.scheduler.scheduling_algorithm_duration[{#LE}] Preprocessing Prometheus pattern: `The text is too long. Please see the template.` ⛔️Custom on fail: Discard value
Kubernetes Scheduler: Scheduling algorithm duration, p90	90 percentile of scheduling algorithm latency in seconds.	Calculated	kubernetes.scheduler.scheduling_algorithm_duration_p90[{#SINGLETON}]
Kubernetes Scheduler: Scheduling algorithm duration, p95	95 percentile of scheduling algorithm latency in seconds.	Calculated	kubernetes.scheduler.scheduling_algorithm_duration_p95[{#SINGLETON}]
Kubernetes Scheduler: Scheduling algorithm duration, p99	99 percentile of scheduling algorithm latency in seconds.	Calculated	kubernetes.scheduler.scheduling_algorithm_duration_p99[{#SINGLETON}]
Kubernetes Scheduler: Scheduling algorithm duration, p50	50 percentile of scheduling algorithm latency in seconds.	Calculated	kubernetes.scheduler.scheduling_algorithm_duration_p50[{#SINGLETON}]
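The formulas of the calculated percentile items are not shown in this document. As a rough illustration of how a percentile can be estimated from cumulative histogram buckets, the sketch below interpolates linearly inside the bucket that contains the requested rank. This is an assumption about the general technique, not the template's formula. It also assumes that `{#LE}` holds the bucket's upper bound, and the name `percentile_from_buckets` and the sample counts are hypothetical.

```python
import math


def percentile_from_buckets(buckets, percent):
    """Estimate a percentile from cumulative histogram buckets.

    buckets: dict mapping upper bound in seconds (may be math.inf)
             to the cumulative count of observations <= that bound.
    percent: requested percentile, for example 90 for p90.
    """
    ordered = sorted(buckets.items())
    total = ordered[-1][1]
    if total <= 0:
        return None  # no observations yet
    rank = percent / 100.0 * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the +Inf bucket
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts per upper bound.
example = {0.1: 50, 0.5: 90, 1.0: 100, math.inf: 100}
```

With these counts, `percentile_from_buckets(example, 95)` falls halfway into the 0.5 to 1.0 bucket. The estimate is only as fine as the bucket boundaries, which is why a bucket item exists for every discovered bound.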

LLD rule Binding histogram

Name	Description	Type	Key and additional info
Binding histogram	Discovery raw data of binding latency.	Dependent item	kubernetes.scheduler.binding.discovery Preprocessing Prometheus to JSON: `{__name__=~ "scheduler_binding_duration_seconds_*"}` JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `3h`

Item prototypes for Binding histogram

Name	Description	Type	Key and additional info
Kubernetes Scheduler: Binding duration bucket, {#LE}	Binding latency in seconds.	Dependent item	kubernetes.scheduler.binding_duration[{#LE}] Preprocessing Prometheus pattern: `The text is too long. Please see the template.` ⛔️Custom on fail: Discard value
Kubernetes Scheduler: Binding duration, p90	90 percentile of binding latency in seconds.	Calculated	kubernetes.scheduler.binding_duration_p90[{#SINGLETON}]
Kubernetes Scheduler: Binding duration, p95	95 percentile of binding latency in seconds.	Calculated	kubernetes.scheduler.binding_duration_p95[{#SINGLETON}]
Kubernetes Scheduler: Binding duration, p99	99 percentile of binding latency in seconds.	Calculated	kubernetes.scheduler.binding_duration_p99[{#SINGLETON}]
Kubernetes Scheduler: Binding duration, p50	50 percentile of binding latency in seconds.	Calculated	kubernetes.scheduler.binding_duration_p50[{#SINGLETON}]

LLD rule e2e scheduling histogram

Name	Description	Type	Key and additional info
e2e scheduling histogram	Discovery raw data and percentile items of e2e scheduling latency.	Dependent item	kubernetes.controller.e2e_scheduling.discovery Preprocessing Prometheus to JSON: `The text is too long. Please see the template.` JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `3h`

Item prototypes for e2e scheduling histogram

Name	Description	Type	Key and additional info
Kubernetes Scheduler: ["{#RESULT}"]: e2e scheduling seconds bucket, {#LE}	E2e scheduling latency in seconds (scheduling algorithm + binding)	Dependent item	kubernetes.scheduler.e2e_scheduling_bucket[{#LE},"{#RESULT}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.` ⛔️Custom on fail: Discard value
Kubernetes Scheduler: ["{#RESULT}"]: e2e scheduling, p50	50 percentile of e2e scheduling latency.	Calculated	kubernetes.scheduler.e2e_scheduling_p50["{#RESULT}"]
Kubernetes Scheduler: ["{#RESULT}"]: e2e scheduling, p90	90 percentile of e2e scheduling latency.	Calculated	kubernetes.scheduler.e2e_scheduling_p90["{#RESULT}"]
Kubernetes Scheduler: ["{#RESULT}"]: e2e scheduling, p95	95 percentile of e2e scheduling latency.	Calculated	kubernetes.scheduler.e2e_scheduling_p95["{#RESULT}"]
Kubernetes Scheduler: ["{#RESULT}"]: e2e scheduling, p99	99 percentile of e2e scheduling latency.	Calculated	kubernetes.scheduler.e2e_scheduling_p99["{#RESULT}"]

Feedback

Please report any issues with the template at https://support.zabbix.com.

You can also provide feedback, discuss the template, or ask for help at the ZABBIX forums.
