
# Kubernetes Scheduler by HTTP

## Overview

This template monitors Kubernetes Scheduler with Zabbix and works without any external scripts.
Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

Template `Kubernetes Scheduler by HTTP` - collects metrics by HTTP agent from the Scheduler /metrics endpoint.

## Requirements

Zabbix version: 7.0 and higher.

## Tested versions

This template has been tested on:
- Kubernetes Scheduler 1.19.10

## Configuration

> Zabbix should be configured according to the instructions in the [Templates out of the box](https://www.zabbix.com/documentation/7.0/manual/config/templates_out_of_the_box) section.

## Setup

Internal service metrics are collected from the /metrics endpoint.
The template uses authorization via an API token.

Don't forget to change the macros {$KUBE.SCHEDULER.SERVER.URL} and {$KUBE.API.TOKEN}.
Also, see the Macros section for a list of macros used to set trigger values.

*NOTE.* You might need to set the `--bind-address` option for Scheduler to the address where Zabbix proxy can reach it.
For example, for clusters created with `kubeadm` it can be set in the following manifest file (changes will be applied immediately):

- /etc/kubernetes/manifests/kube-scheduler.yaml

*NOTE.* Some metrics may not be collected depending on your Kubernetes Scheduler instance version and configuration.

### Macros used

|Name|Description|Default|
|----|-----------|-------|
|{$KUBE.SCHEDULER.SERVER.URL}|<p>Kubernetes Scheduler metrics endpoint URL.</p>|`https://localhost:10259/metrics`|
|{$KUBE.API.TOKEN}|<p>API Authorization Token.</p>||
|{$KUBE.SCHEDULER.HTTP.CLIENT.ERROR}|<p>Maximum number of HTTP client request failures used for the trigger.</p>|`2`|
|{$KUBE.SCHEDULER.UNSCHEDULABLE}|<p>Maximum number of scheduling failures with 'unschedulable' used for the trigger.</p>|`2`|
|{$KUBE.SCHEDULER.ERROR}|<p>Maximum number of scheduling failures with 'error' used for the trigger.</p>|`2`|

### Items

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Kubernetes Scheduler: Get Scheduler metrics|<p>Get raw metrics from Scheduler instance /metrics endpoint.</p>|HTTP agent|kubernetes.scheduler.get_metrics<p>**Preprocessing**</p><ul><li><p>Check for not supported value</p><p>⛔️Custom on fail: Discard value</p></li></ul>|
| |<p>Virtual memory size in bytes.</p>|Dependent item|kubernetes.scheduler.process_virtual_memory_bytes<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `VALUE(process_virtual_memory_bytes)`</p><p>⛔️Custom on fail: Discard value</p></li></ul>|
| |<p>Resident memory size in bytes.</p>|Dependent item|kubernetes.scheduler.process_resident_memory_bytes<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `VALUE(process_resident_memory_bytes)`</p><p>⛔️Custom on fail: Discard value</p></li></ul>|
| |<p>Total user and system CPU usage ratio.</p>|Dependent item|kubernetes.scheduler.cpu.util<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `VALUE(process_cpu_seconds_total)`</p></li><li><p>Custom multiplier: `100`</p></li></ul>|
| |<p>Number of goroutines that currently exist.</p>|Dependent item|kubernetes.scheduler.go_goroutines<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `SUM(go_goroutines)`</p><p>⛔️Custom on fail: Discard value</p></li></ul>|
| |<p>Number of OS threads created.</p>|Dependent item|kubernetes.scheduler.go_threads<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `VALUE(go_threads)`</p><p>⛔️Custom on fail: Discard value</p></li></ul>|
| |<p>Number of open file descriptors.</p>|Dependent item|kubernetes.scheduler.open_fds<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `VALUE(process_open_fds)`</p><p>⛔️Custom on fail: Discard value</p></li></ul>|
| |<p>Maximum allowed open file descriptors.</p>|Dependent item|kubernetes.scheduler.max_fds<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `VALUE(process_max_fds)`</p><p>⛔️Custom on fail: Discard value</p></li></ul>|
| |<p>Number of HTTP requests with 2xx status code per second.</p>|Dependent item|kubernetes.scheduler.client_http_requests_200.rate<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `SUM(rest_client_requests_total{code =~ "2.."})`</p><p>⛔️Custom on fail: Discard value</p></li></ul>|
| |<p>Number of HTTP requests with 3xx status code per second.</p>|Dependent item|kubernetes.scheduler.client_http_requests_300.rate<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `SUM(rest_client_requests_total{code =~ "3.."})`</p><p>⛔️Custom on fail: Discard value</p></li></ul>|
| |<p>Number of HTTP requests with 4xx status code per second.</p>|Dependent item|kubernetes.scheduler.client_http_requests_400.rate<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `SUM(rest_client_requests_total{code =~ "4.."})`</p><p>⛔️Custom on fail: Discard value</p></li></ul>|
| |<p>Number of HTTP requests with 5xx status code per second.</p>|Dependent item|kubernetes.scheduler.client_http_requests_500.rate<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `SUM(rest_client_requests_total{code =~ "5.."})`</p><p>⛔️Custom on fail: Discard value</p></li></ul>|
| |<p>Number of attempts to schedule pods with result "scheduled" per second.</p>|Dependent item|kubernetes.scheduler.scheduler_schedule_attempts.scheduled.rate<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `SUM(scheduler_schedule_attempts_total{result = "scheduled"})`</p><p>⛔️Custom on fail: Discard value</p></li></ul>|
| |<p>Number of attempts to schedule pods with result "unschedulable" per second.</p>|Dependent item|kubernetes.scheduler.scheduler_schedule_attempts.unschedulable.rate<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `The text is too long. Please see the template.`</p><p>⛔️Custom on fail: Discard value</p></li></ul>|
| |<p>Number of attempts to schedule pods with result "error" per second.</p>|Dependent item|kubernetes.scheduler.scheduler_schedule_attempts.error.rate<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `SUM(scheduler_schedule_attempts_total{result = "error"})`</p><p>⛔️Custom on fail: Discard value</p></li></ul>|

### Triggers

|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
| |<p>Kubernetes Scheduler REST Client requests are experiencing a high error rate (with 5xx HTTP code).</p>|`min(/Kubernetes Scheduler by HTTP/kubernetes.scheduler.client_http_requests_500.rate,5m)>{$KUBE.SCHEDULER.HTTP.CLIENT.ERROR}`|Warning||
|Kubernetes Scheduler: Too many unschedulable pods|<p>Number of attempts to schedule pods with 'unschedulable' result is too high. 'unschedulable' means a pod could not be scheduled.</p>|`min(/Kubernetes Scheduler by HTTP/kubernetes.scheduler.scheduler_schedule_attempts.unschedulable.rate,5m)>{$KUBE.SCHEDULER.UNSCHEDULABLE}`|Warning||
|Kubernetes Scheduler: Too many schedule attempts with errors|<p>Number of attempts to schedule pods with 'error' result is too high. 'error' means an internal scheduler problem.</p>|`min(/Kubernetes Scheduler by HTTP/kubernetes.scheduler.scheduler_schedule_attempts.error.rate,5m)>{$KUBE.SCHEDULER.ERROR}`|Warning||

### LLD rule Scheduling algorithm histogram

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Scheduling algorithm histogram|<p>Discovery raw data of scheduling algorithm latency.</p>|Dependent item|kubernetes.scheduler.scheduling_algorithm.discovery<p>**Preprocessing**</p><ul><li><p>Prometheus to JSON: `The text is too long. Please see the template.`</p></li><li><p>JavaScript: `The text is too long. Please see the template.`</p></li><li><p>Discard unchanged with heartbeat: `3h`</p></li></ul>|
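
Many of the dependent items in the Items table extract their value with a Prometheus pattern such as `SUM(rest_client_requests_total{code =~ "2.."})`. The following Python sketch illustrates what such a pattern selects from the raw /metrics text. It is not Zabbix's implementation: the `parse_samples` and `sum_matching` helpers and the `SAMPLE` payload are hypothetical, and the regular expression is assumed to be fully anchored, as in Prometheus label selectors.

```python
import re

# One exposition line: metric name, optional {labels}, then the sample value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


def sum_matching(text, metric, label, pattern):
    """Sum samples of `metric` whose `label` fully matches the regex `pattern`."""
    regex = re.compile(pattern)
    return sum(
        value
        for name, labels, value in parse_samples(text)
        if name == metric and regex.fullmatch(labels.get(label, ""))
    )


# Hypothetical payload, shaped like the Scheduler /metrics output.
SAMPLE = """\
# TYPE rest_client_requests_total counter
rest_client_requests_total{code="200",host="10.0.0.1:6443",method="GET"} 1523
rest_client_requests_total{code="201",host="10.0.0.1:6443",method="POST"} 12
rest_client_requests_total{code="404",host="10.0.0.1:6443",method="GET"} 3
rest_client_requests_total{code="500",host="10.0.0.1:6443",method="GET"} 1
go_goroutines 42
"""

total_2xx = sum_matching(SAMPLE, "rest_client_requests_total", "code", "2..")
```

The result is a running counter total, not a rate; the `.rate` items report a per-second value, which would additionally require comparing two consecutive samples.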

### Item prototypes for Scheduling algorithm histogram

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
| |<p>Scheduling algorithm latency in seconds.</p>|Dependent item|kubernetes.scheduler.scheduling_algorithm_duration[{#LE}]<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `The text is too long. Please see the template.`</p><p>⛔️Custom on fail: Discard value</p></li></ul>|
| |<p>90 percentile of scheduling algorithm latency in seconds.</p>|Calculated|kubernetes.scheduler.scheduling_algorithm_duration_p90[{#SINGLETON}]|
|Kubernetes Scheduler: Scheduling algorithm duration, p95|<p>95 percentile of scheduling algorithm latency in seconds.</p>|Calculated|kubernetes.scheduler.scheduling_algorithm_duration_p95[{#SINGLETON}]|
|Kubernetes Scheduler: Scheduling algorithm duration, p99|<p>99 percentile of scheduling algorithm latency in seconds.</p>|Calculated|kubernetes.scheduler.scheduling_algorithm_duration_p99[{#SINGLETON}]|
|Kubernetes Scheduler: Scheduling algorithm duration, p50|<p>50 percentile of scheduling algorithm latency in seconds.</p>|Calculated|kubernetes.scheduler.scheduling_algorithm_duration_p50[{#SINGLETON}]|

### LLD rule Binding histogram

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Binding histogram|<p>Discovery raw data of binding latency.</p>|Dependent item|kubernetes.scheduler.binding.discovery<p>**Preprocessing**</p><ul><li><p>Prometheus to JSON: `{__name__=~ "scheduler_binding_duration_seconds_*"}`</p></li><li><p>JavaScript: `The text is too long. Please see the template.`</p></li><li><p>Discard unchanged with heartbeat: `3h`</p></li></ul>|
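
The p50, p90, p95, and p99 items in this template are of type Calculated, while the raw latency data is discovered as one item per histogram bucket bound (`{#LE}`). As a rough illustration of how a percentile can be estimated from cumulative buckets, here is a minimal Python sketch using linear interpolation inside the bucket that contains the target rank. This is the general technique for Prometheus-style histograms and is not taken from the template; the `estimate_percentile` function and the bucket values are hypothetical.

```python
import math


def estimate_percentile(buckets, percentile):
    """Estimate a percentile from cumulative histogram buckets.

    `buckets` maps an upper bound (the `le` label, may be math.inf) to the
    cumulative count of observations at or below that bound.
    """
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]
    if total == 0:
        return None  # no observations, nothing to estimate
    rank = percentile / 100.0 * total
    lower_bound, lower_count = 0.0, 0.0
    for bound in bounds:
        count = buckets[bound]
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the open-ended bucket: report the
                # highest finite bound instead of extrapolating.
                return lower_bound
            if count == lower_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (bound - lower_bound) * fraction
        lower_bound, lower_count = bound, count
    return lower_bound


# Hypothetical cumulative bucket counts, keyed by upper bound in seconds.
latency_buckets = {0.001: 10, 0.002: 50, 0.004: 90, 0.008: 100, math.inf: 100}
p95 = estimate_percentile(latency_buckets, 95)
```

Because the estimate is interpolated, its accuracy is limited by the bucket widths the Scheduler exposes.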

### Item prototypes for Binding histogram

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
| |<p>Binding latency in seconds.</p>|Dependent item|kubernetes.scheduler.binding_duration[{#LE}]<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `The text is too long. Please see the template.`</p><p>⛔️Custom on fail: Discard value</p></li></ul>|
| |<p>90 percentile of binding latency in seconds.</p>|Calculated|kubernetes.scheduler.binding_duration_p90[{#SINGLETON}]|
|Kubernetes Scheduler: Binding duration, p95|<p>95 percentile of binding latency in seconds.</p>|Calculated|kubernetes.scheduler.binding_duration_p95[{#SINGLETON}]|
|Kubernetes Scheduler: Binding duration, p99|<p>99 percentile of binding latency in seconds.</p>|Calculated|kubernetes.scheduler.binding_duration_p99[{#SINGLETON}]|
|Kubernetes Scheduler: Binding duration, p50|<p>50 percentile of binding latency in seconds.</p>|Calculated|kubernetes.scheduler.binding_duration_p50[{#SINGLETON}]|

### LLD rule e2e scheduling histogram

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|e2e scheduling histogram|<p>Discovery raw data and percentile items of e2e scheduling latency.</p>|Dependent item|kubernetes.controller.e2e_scheduling.discovery<p>**Preprocessing**</p><ul><li><p>Prometheus to JSON: `The text is too long. Please see the template.`</p></li><li><p>JavaScript: `The text is too long. Please see the template.`</p></li><li><p>Discard unchanged with heartbeat: `3h`</p></li></ul>|
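
Each discovery rule ends with the step "Discard unchanged with heartbeat: `3h`", so an identical discovery payload is dropped unless three hours have passed since the last stored one. The Python sketch below shows that throttling logic in isolation. It is only an approximation of the behaviour under stated assumptions: the `HeartbeatThrottle` class is hypothetical, and timestamps are plain seconds supplied by the caller.

```python
class HeartbeatThrottle:
    """Keep a value only if it changed or the heartbeat interval elapsed."""

    def __init__(self, heartbeat_seconds):
        self.heartbeat = heartbeat_seconds
        self._last_value = None
        self._last_kept_at = None

    def keep(self, value, now):
        """Return True if `value` observed at time `now` should be stored."""
        unchanged = self._last_kept_at is not None and value == self._last_value
        if unchanged and now - self._last_kept_at < self.heartbeat:
            return False  # same value, heartbeat not yet due: discard
        # Value changed, or the heartbeat is due: store it and restart the timer.
        self._last_value = value
        self._last_kept_at = now
        return True


throttle = HeartbeatThrottle(3 * 3600)  # 3h, as in the discovery rules
decisions = [
    throttle.keep('[{"{#LE}": "0.001"}]', now=0),
    throttle.keep('[{"{#LE}": "0.001"}]', now=60),
    throttle.keep('[{"{#LE}": "0.002"}]', now=120),
]
```

The practical effect is that discovery is reprocessed only when the set of histogram buckets changes, with a periodic refresh as a safety net.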

### Item prototypes for e2e scheduling histogram

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
| |<p>E2e scheduling latency in seconds (scheduling algorithm + binding).</p>|Dependent item|kubernetes.scheduler.e2e_scheduling_bucket[{#LE},"{#RESULT}"]<p>**Preprocessing**</p><ul><li><p>Prometheus pattern: `The text is too long. Please see the template.`</p><p>⛔️Custom on fail: Discard value</p></li></ul>|
| |<p>50 percentile of e2e scheduling latency.</p>|Calculated|kubernetes.scheduler.e2e_scheduling_p50["{#RESULT}"]|
|Kubernetes Scheduler: ["{#RESULT}"]: e2e scheduling, p90|<p>90 percentile of e2e scheduling latency.</p>|Calculated|kubernetes.scheduler.e2e_scheduling_p90["{#RESULT}"]|
|Kubernetes Scheduler: ["{#RESULT}"]: e2e scheduling, p95|<p>95 percentile of e2e scheduling latency.</p>|Calculated|kubernetes.scheduler.e2e_scheduling_p95["{#RESULT}"]|
|Kubernetes Scheduler: ["{#RESULT}"]: e2e scheduling, p99|<p>99 percentile of e2e scheduling latency.</p>|Calculated|kubernetes.scheduler.e2e_scheduling_p99["{#RESULT}"]|

## Feedback

Please report any issues with the template at [`https://support.zabbix.com`](https://support.zabbix.com).

You can also provide feedback, discuss the template, or ask for help at [`ZABBIX forums`](https://www.zabbix.com/forum/zabbix-suggestions-and-feedback).