Kubernetes API server by HTTP

Overview

This template monitors the Kubernetes API server and works without any external scripts. Most of the metrics are collected in a single request, thanks to Zabbix bulk data collection.

The template Kubernetes API server by HTTP collects metrics with the HTTP agent from the API server /metrics endpoint.

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

  • Kubernetes API server 1.19.10

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

Internal service metrics are collected from the /metrics endpoint. The template authenticates with an API token.

Don't forget to change the macros {$KUBE.API.SERVER.URL} and {$KUBE.API.TOKEN}. Also, see the Macros section for a list of macros used to set trigger values.

NOTE. Some metrics may not be collected depending on your Kubernetes API server instance version and configuration.
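
Before linking the template, it can help to confirm that the URL and token work outside Zabbix. The following is a minimal Python sketch, assuming the token is sent as a bearer token in the Authorization header. The function names and the placeholder token are hypothetical and are not part of the template.

```python
import ssl
import urllib.request
from typing import Optional


def build_metrics_request(url: str, token: str) -> urllib.request.Request:
    """Build a GET request for the /metrics endpoint with a bearer token."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_metrics(url: str, token: str, ca_file: Optional[str] = None) -> str:
    """Fetch the raw Prometheus text exposition from the API server."""
    # ca_file should point at the cluster CA bundle; None uses the system trust store.
    context = ssl.create_default_context(cafile=ca_file)
    request = build_metrics_request(url, token)
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return response.read().decode("utf-8")


# Values mirror {$KUBE.API.SERVER.URL} and {$KUBE.API.TOKEN}; the token is a placeholder.
example = build_metrics_request("https://localhost:6443/metrics", "<api-token>")
```

Calling fetch_metrics with a valid token should return plain text in the Prometheus exposition format, which is the same payload the master item stores.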

Macros used

| Name | Description | Default |
|------|-------------|---------|
| {$KUBE.API.SERVER.URL} | Kubernetes API server metrics endpoint URL. | https://localhost:6443/metrics |
| {$KUBE.API.TOKEN} | API authorization token. | |
| {$KUBE.API.CERT.EXPIRATION} | Number of days before client certificate expiration at which the trigger fires. | 7 |
| {$KUBE.API.HTTP.CLIENT.ERROR} | Maximum number of HTTP client request failures, used for the trigger. | 2 |
| {$KUBE.API.HTTP.SERVER.ERROR} | Maximum number of HTTP server request failures, used for the trigger. | 2 |

Items

Name Description Type Key and additional info
Kubernetes API: Get API instance metrics

Get raw metrics from API instance /metrics endpoint.

HTTP agent kubernetes.api.get_metrics

Preprocessing

  • Check for not supported value

    Custom on fail: Discard value

Kubernetes API: Audit events, total

Accumulated number of audit events generated and sent to the audit backend.

Dependent item kubernetes.api.audit_event_total

Preprocessing

  • Prometheus pattern: SUM(apiserver_audit_event_total)

    Custom on fail: Discard value

Kubernetes API: Virtual memory, bytes

Virtual memory size in bytes.

Dependent item kubernetes.api.process_virtual_memory_bytes

Preprocessing

  • Prometheus pattern: VALUE(process_virtual_memory_bytes)

    Custom on fail: Discard value

Kubernetes API: Resident memory, bytes

Resident memory size in bytes.

Dependent item kubernetes.api.process_resident_memory_bytes

Preprocessing

  • Prometheus pattern: VALUE(process_resident_memory_bytes)

    Custom on fail: Discard value

Kubernetes API: CPU

Total user and system CPU usage ratio.

Dependent item kubernetes.api.cpu.util

Preprocessing

  • Prometheus pattern: VALUE(process_cpu_seconds_total)

  • Change per second
  • Custom multiplier: 100
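
The two steps above turn the cumulative process_cpu_seconds_total counter into a percentage. A minimal Python sketch of that arithmetic, assuming two consecutive samples with timestamps in seconds (the function name is hypothetical):

```python
def cpu_utilization_percent(prev_seconds, prev_ts, curr_seconds, curr_ts):
    """Change per second of a CPU-seconds counter, multiplied by 100."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0 or curr_seconds < prev_seconds:
        # No usable delta (bad timestamps or a counter reset): skip this sample.
        return None
    return (curr_seconds - prev_seconds) / elapsed * 100


# 30 CPU-seconds consumed over a 60-second interval.
usage = cpu_utilization_percent(1200.0, 0, 1230.0, 60)
```

Here usage is 50.0, that is, half of one CPU core on average over the interval. Values above 100 are possible when the process uses more than one core.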

Kubernetes API: Goroutines

Number of goroutines that currently exist.

Dependent item kubernetes.api.go_goroutines

Preprocessing

  • Prometheus pattern: SUM(go_goroutines)

    Custom on fail: Discard value

Kubernetes API: Go threads

Number of OS threads created.

Dependent item kubernetes.api.go_threads

Preprocessing

  • Prometheus pattern: VALUE(go_threads)

    Custom on fail: Discard value

Kubernetes API: Fds open

Number of open file descriptors.

Dependent item kubernetes.api.open_fds

Preprocessing

  • Prometheus pattern: VALUE(process_open_fds)

    Custom on fail: Discard value

Kubernetes API: Fds max

Maximum allowed open file descriptors.

Dependent item kubernetes.api.max_fds

Preprocessing

  • Prometheus pattern: VALUE(process_max_fds)

    Custom on fail: Discard value

Kubernetes API: gRPCs client started, rate

Total number of RPCs started per second.

Dependent item kubernetes.api.grpc_client_started.rate

Preprocessing

  • Prometheus pattern: SUM(grpc_client_started_total)

    Custom on fail: Discard value

  • Change per second
Kubernetes API: gRPCs messages received, rate

Total number of gRPC stream messages received per second.

Dependent item kubernetes.api.grpc_client_msg_received.rate

Preprocessing

  • Prometheus pattern: SUM(grpc_client_msg_received_total)

    Custom on fail: Discard value

  • Change per second
Kubernetes API: gRPCs messages sent, rate

Total number of gRPC stream messages sent per second.

Dependent item kubernetes.api.grpc_client_msg_sent.rate

Preprocessing

  • Prometheus pattern: SUM(grpc_client_msg_sent_total)

    Custom on fail: Discard value

  • Change per second
Kubernetes API: Request terminations, rate

Number of requests which apiserver terminated in self-defense per second.

Dependent item kubernetes.api.apiserver_request_terminations

Preprocessing

  • Prometheus pattern: SUM(apiserver_request_terminations_total)

    Custom on fail: Discard value

  • Change per second
Kubernetes API: TLS handshake errors, rate

Number of requests dropped with 'TLS handshake error from' error per second.

Dependent item kubernetes.api.apiserver_tls_handshake_errors_total.rate

Preprocessing

  • Prometheus pattern: SUM(apiserver_tls_handshake_errors_total)

    Custom on fail: Discard value

Kubernetes API: API server requests: 5xx, rate

Counter of apiserver requests broken out for each HTTP response code.

Dependent item kubernetes.api.apiserver_request_total_500.rate

Preprocessing

  • Prometheus pattern: SUM(apiserver_request_total{code =~ "5.."})

    Custom on fail: Discard value

  • Change per second
Kubernetes API: API server requests: 4xx, rate

Counter of apiserver requests broken out for each HTTP response code.

Dependent item kubernetes.api.apiserver_request_total_400.rate

Preprocessing

  • Prometheus pattern: SUM(apiserver_request_total{code =~ "4.."})

    Custom on fail: Discard value

  • Change per second
Kubernetes API: API server requests: 3xx, rate

Counter of apiserver requests broken out for each HTTP response code.

Dependent item kubernetes.api.apiserver_request_total_300.rate

Preprocessing

  • Prometheus pattern: SUM(apiserver_request_total{code =~ "3.."})

    Custom on fail: Discard value

  • Change per second
Kubernetes API: API server requests: 0

Counter of apiserver requests broken out for each HTTP response code.

Dependent item kubernetes.api.apiserver_request_total_0.rate

Preprocessing

  • Prometheus pattern: SUM(apiserver_request_total{code = "0"})

    Custom on fail: Discard value

  • Change per second
Kubernetes API: API server requests: 2xx, rate

Counter of apiserver requests broken out for each HTTP response code.

Dependent item kubernetes.api.apiserver_request_total_200.rate

Preprocessing

  • Prometheus pattern: SUM(apiserver_request_total{code =~ "2.."})

    Custom on fail: Discard value

  • Change per second
Kubernetes API: HTTP requests: 5xx, rate

Number of HTTP requests with 5xx status code per second.

Dependent item kubernetes.api.rest_client_requests_total_500.rate

Preprocessing

  • Prometheus pattern: SUM(rest_client_requests_total{code =~ "5.."})

    Custom on fail: Discard value

  • Change per second
Kubernetes API: HTTP requests: 4xx, rate

Number of HTTP requests with 4xx status code per second.

Dependent item kubernetes.api.rest_client_requests_total_400.rate

Preprocessing

  • Prometheus pattern: SUM(rest_client_requests_total{code =~ "4.."})

    Custom on fail: Discard value

  • Change per second
Kubernetes API: HTTP requests: 3xx, rate

Number of HTTP requests with 3xx status code per second.

Dependent item kubernetes.api.rest_client_requests_total_300.rate

Preprocessing

  • Prometheus pattern: SUM(rest_client_requests_total{code =~ "3.."})

    Custom on fail: Discard value

  • Change per second
Kubernetes API: HTTP requests: 2xx, rate

Number of HTTP requests with 2xx status code per second.

Dependent item kubernetes.api.rest_client_requests_total_200.rate

Preprocessing

  • Prometheus pattern: SUM(rest_client_requests_total{code =~ "2.."})

    Custom on fail: Discard value

  • Change per second
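
Most of the dependent items above apply a Prometheus pattern such as SUM(apiserver_request_total{code =~ "5.."}) to the raw text from the master item. The following Python sketch illustrates what such a step computes. It is a simplified parser written for this example, not the Zabbix implementation, and the sample payload is made up.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def sum_metric(text, metric, label=None, pattern=None):
    """Sum all samples of `metric` whose `label` fully matches the regex `pattern`."""
    total, matched = 0.0, False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        sample = SAMPLE_RE.match(line)
        if not sample or sample.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(sample.group("labels") or ""))
        if label is not None and not re.fullmatch(pattern, labels.get(label, "")):
            continue
        total += float(sample.group("value"))
        matched = True
    return total if matched else None  # None: nothing matched


SAMPLE = """\
# TYPE apiserver_request_total counter
apiserver_request_total{code="200",verb="GET"} 1500
apiserver_request_total{code="500",verb="GET"} 3
apiserver_request_total{code="503",verb="LIST"} 4
apiserver_request_total{code="404",verb="GET"} 12
"""

server_errors = sum_metric(SAMPLE, "apiserver_request_total", "code", "5..")
```

With the sample payload, server_errors is 7.0 (3 + 4). In the template, a later "Change per second" step turns that running total into a rate.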

Triggers

Name Description Expression Severity Dependencies and additional info
Kubernetes API: Too many server errors

Kubernetes API server is experiencing a high error rate (with 5xx HTTP code).

min(/Kubernetes API server by HTTP/kubernetes.api.apiserver_request_total_500.rate,5m)>{$KUBE.API.HTTP.SERVER.ERROR} Warning
Kubernetes API: Too many client errors

Kubernetes API client is experiencing a high error rate (with 5xx HTTP code).

min(/Kubernetes API server by HTTP/kubernetes.api.rest_client_requests_total_500.rate,5m)>{$KUBE.API.HTTP.CLIENT.ERROR} Warning
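
Both expressions use min() over a 5-minute window, so a trigger fires only when every sample in that window is above the threshold, and a single spike does not raise a problem. A minimal Python sketch of that logic, assuming samples are (timestamp, value) pairs; the function name is hypothetical:

```python
def min_exceeds(samples, now, window_seconds, threshold):
    """Return True when the smallest value inside the window is above the threshold."""
    recent = [value for ts, value in samples if now - window_seconds < ts <= now]
    if not recent:
        return False  # no data in the window: this sketch reports no problem
    return min(recent) > threshold


# One sample per minute, compared against the default threshold of 2.
sustained = [(0, 5.0), (60, 3.0), (120, 4.0), (180, 6.0), (240, 2.5)]
spike = [(0, 0.1), (60, 9.0), (120, 0.2), (180, 0.1), (240, 0.3)]
fires_on_sustained = min_exceeds(sustained, now=240, window_seconds=300, threshold=2)
fires_on_spike = min_exceeds(spike, now=240, window_seconds=300, threshold=2)
```

The sustained series fires because its minimum (2.5) is above 2, while the series with a single spike does not.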

LLD rule Long-running requests

Name Description Type Key and additional info
Long-running requests

Discovery of long-running requests by verb, resource and scope.

Dependent item kubernetes.api.longrunning_gauge.discovery

Preprocessing

  • Prometheus to JSON: The text is too long. Please see the template.

    Custom on fail: Discard value

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 3h
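
The "Discard unchanged with heartbeat" step keeps the discovery rule from being reprocessed when nothing has changed. A value is passed on only if it differs from the previous one or if the heartbeat period has elapsed. A minimal Python sketch of that behaviour, with a hypothetical class name:

```python
class DiscardUnchangedWithHeartbeat:
    """Pass a value on only if it changed or the heartbeat period has elapsed."""

    def __init__(self, heartbeat_seconds):
        self.heartbeat = heartbeat_seconds
        self._last_value = None
        self._last_passed = None  # time the last value was passed on

    def process(self, value, now):
        """Return the value if it is kept, or None if it is discarded."""
        unchanged = self._last_passed is not None and value == self._last_value
        if unchanged and now - self._last_passed < self.heartbeat:
            return None
        self._last_value = value
        self._last_passed = now
        return value


throttle = DiscardUnchangedWithHeartbeat(3 * 60 * 60)  # 3h, as in the rule above
```

With a 3h heartbeat, an identical discovery payload arriving every minute is processed once and then again only after three hours, while any change is processed immediately.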

Item prototypes for Long-running requests

Name Description Type Key and additional info
Kubernetes API: Long-running ["{#VERB}"] requests ["{#RESOURCE}"]: {#SCOPE}

Gauge of all active long-running apiserver requests broken out by verb, resource and scope. Not all requests are tracked this way.

Dependent item kubernetes.api.longrunning_gauge["{#RESOURCE}","{#SCOPE}","{#VERB}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    Custom on fail: Discard value

LLD rule Request duration histogram

Name Description Type Key and additional info
Request duration histogram

Discovery of raw data and percentile items of request duration.

Dependent item kubernetes.api.requests_bucket.discovery

Preprocessing

  • Prometheus to JSON: {__name__=~ "apiserver_request_duration_*", verb =~ ".*"}

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 3h

Item prototypes for Request duration histogram

Name Description Type Key and additional info
Kubernetes API: ["{#VERB}"] Requests bucket: {#LE}

Response latency distribution in seconds for each verb.

Dependent item kubernetes.api.request_duration_seconds_bucket[{#LE},"{#VERB}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

Kubernetes API: ["{#VERB}"] Requests, p90

90th percentile of response latency distribution in seconds for each verb.

Calculated kubernetes.api.request_duration_seconds_p90["{#VERB}"]
Kubernetes API: ["{#VERB}"] Requests, p95

95th percentile of response latency distribution in seconds for each verb.

Calculated kubernetes.api.request_duration_seconds_p95["{#VERB}"]
Kubernetes API: ["{#VERB}"] Requests, p99

99th percentile of response latency distribution in seconds for each verb.

Calculated kubernetes.api.request_duration_seconds_p99["{#VERB}"]
Kubernetes API: ["{#VERB}"] Requests, p50

50th percentile of response latency distribution in seconds for each verb.

Calculated kubernetes.api.request_duration_seconds_p50["{#VERB}"]
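
The percentile items are calculated from the discovered bucket items. The exact formulas are in the template and are not reproduced here. As an illustration only, the following Python sketch estimates a quantile from cumulative histogram buckets with linear interpolation, in the style of the Prometheus histogram_quantile function. The function name and the bucket values are assumptions made for this example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    The bucket list must include the +Inf bucket, as Prometheus histograms do.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return None  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls into the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up cumulative counts for one verb; bounds are in seconds.
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, example_buckets)
```

With these buckets the estimated p95 is about 0.81 seconds. The result is only as precise as the bucket boundaries allow, which is why the raw bucket items are collected alongside the percentiles.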

LLD rule Requests inflight discovery

Name Description Type Key and additional info
Requests inflight discovery

Discovery of inflight requests by kind.

Dependent item kubernetes.api.inflight_requests.discovery

Preprocessing

  • Prometheus to JSON: apiserver_current_inflight_requests{request_kind =~ ".*"}

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 3h
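
Low-level discovery expects a JSON array in which each object maps LLD macros to values. The JavaScript step used by this rule is not shown in this README, so the following Python sketch only illustrates the general idea under that assumption: collect the distinct request_kind label values and emit one object per value with the {#KIND} macro. The function name and the sample payload are hypothetical.

```python
import json
import re

KIND_RE = re.compile(
    r'^apiserver_current_inflight_requests\{[^}]*\brequest_kind="([^"]*)"'
)


def discover_kinds(metrics_text):
    """Build an LLD-style JSON array with one {#KIND} entry per request kind."""
    kinds = set()
    for line in metrics_text.splitlines():
        match = KIND_RE.match(line.strip())
        if match:
            kinds.add(match.group(1))
    # Sorting keeps the output stable, so unchanged input yields identical JSON.
    return json.dumps([{"{#KIND}": kind} for kind in sorted(kinds)])


SAMPLE_METRICS = """\
apiserver_current_inflight_requests{request_kind="mutating"} 1
apiserver_current_inflight_requests{request_kind="readOnly"} 4
"""

lld_json = discover_kinds(SAMPLE_METRICS)
```

A stable output order matters here because the following "Discard unchanged with heartbeat" step compares each payload with the previous one.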

Item prototypes for Requests inflight discovery

Name Description Type Key and additional info
Kubernetes API: Requests current: {#KIND}

Maximum number of inflight requests in use on this apiserver, per request kind, over the last second.

Dependent item kubernetes.api.current_inflight_requests["{#KIND}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    Custom on fail: Discard value

LLD rule gRPC completed requests discovery

Name Description Type Key and additional info
gRPC completed requests discovery

Discovery of completed gRPC requests by gRPC code.

Dependent item kubernetes.api.grpc_client_handled.discovery

Preprocessing

  • Prometheus to JSON: grpc_client_handled_total{grpc_code =~ ".*"}

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 3h

Item prototypes for gRPC completed requests discovery

Name Description Type Key and additional info
Kubernetes API: gRPCs completed: {#GRPC_CODE}, rate

Total number of RPCs completed by the client regardless of success or failure per second.

Dependent item kubernetes.api.grpc_client_handled_total.rate["{#GRPC_CODE}"]

Preprocessing

  • Prometheus pattern: SUM(grpc_client_handled_total{grpc_code = "{#GRPC_CODE}"})

    Custom on fail: Discard value

  • Change per second

LLD rule Authentication attempts discovery

Name Description Type Key and additional info
Authentication attempts discovery

Discovery of authentication attempts by result.

Dependent item kubernetes.api.authentication_attempts.discovery

Preprocessing

  • Prometheus to JSON: authentication_attempts{result =~ ".*"}

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 3h

Item prototypes for Authentication attempts discovery

Name Description Type Key and additional info
Kubernetes API: Authentication attempts: {#RESULT}, rate

Authentication attempts by result per second.

Dependent item kubernetes.api.authentication_attempts.rate["{#RESULT}"]

Preprocessing

  • Prometheus pattern: SUM(authentication_attempts{result = "{#RESULT}"})

    Custom on fail: Discard value

  • Change per second

LLD rule Authentication requests discovery

Name Description Type Key and additional info
Authentication requests discovery

Discovery of authenticated requests by username.

Dependent item kubernetes.api.authenticated_user_requests.discovery

Preprocessing

  • Prometheus to JSON: authenticated_user_requests{username =~ ".*"}

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 3h

Item prototypes for Authentication requests discovery

Name Description Type Key and additional info
Kubernetes API: Authenticated requests: {#NAME}, rate

Counter of authenticated requests broken out by username per second.

Dependent item kubernetes.api.authenticated_user_requests.rate["{#NAME}"]

Preprocessing

  • Prometheus pattern: VALUE(authenticated_user_requests{username = "{#NAME}"})

    Custom on fail: Discard value

  • Change per second

LLD rule Watchers metrics discovery

Name Description Type Key and additional info
Watchers metrics discovery

Discovery of watchers by kind.

Dependent item kubernetes.api.apiserver_registered_watchers.discovery

Preprocessing

  • Prometheus to JSON: apiserver_registered_watchers{kind =~ ".*"}

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 3h

Item prototypes for Watchers metrics discovery

Name Description Type Key and additional info
Kubernetes API: Watchers: {#KIND}

Number of currently registered watchers for a given resource.

Dependent item kubernetes.api.apiserver_registered_watchers["{#KIND}"]

Preprocessing

  • Prometheus pattern: VALUE(apiserver_registered_watchers{kind = "{#KIND}"})

    Custom on fail: Discard value

LLD rule Etcd objects metrics discovery

Name Description Type Key and additional info
Etcd objects metrics discovery

Discovery of etcd objects by resource.

Dependent item kubernetes.api.etcd_object_counts.discovery

Preprocessing

  • Prometheus to JSON: etcd_object_counts{resource =~ ".*"}

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 3h

Item prototypes for Etcd objects metrics discovery

Name Description Type Key and additional info
Kubernetes API: etcd objects: {#RESOURCE}

Number of stored objects at the time of last check split by kind.

Dependent item kubernetes.api.etcd_object_counts["{#RESOURCE}"]

Preprocessing

  • Prometheus pattern: VALUE(etcd_object_counts{ resource = "{#RESOURCE}"})

    Custom on fail: Discard value

LLD rule Workqueue metrics discovery

Name Description Type Key and additional info
Workqueue metrics discovery

Discovery of workqueue metrics by name.

Dependent item kubernetes.api.workqueue.discovery

Preprocessing

  • Prometheus to JSON: workqueue_adds_total{name =~ ".*"}

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 3h

Item prototypes for Workqueue metrics discovery

Name Description Type Key and additional info
Kubernetes API: ["{#NAME}"] Workqueue depth

Current depth of workqueue.

Dependent item kubernetes.api.workqueue_depth["{#NAME}"]

Preprocessing

  • Prometheus pattern: VALUE(workqueue_depth{name = "{#NAME}"})

    Custom on fail: Discard value

Kubernetes API: ["{#NAME}"] Workqueue adds total, rate

Total number of adds handled by workqueue per second.

Dependent item kubernetes.api.workqueue_adds_total.rate["{#NAME}"]

Preprocessing

  • Prometheus pattern: VALUE(workqueue_adds_total{name = "{#NAME}"})

    Custom on fail: Discard value

  • Change per second

LLD rule Client certificate expiration histogram

Name Description Type Key and additional info
Client certificate expiration histogram

Discovery of raw data of client certificate expiration.

Dependent item kubernetes.api.certificate_expiration.discovery

Preprocessing

  • Prometheus to JSON: The text is too long. Please see the template.

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 3h

Item prototypes for Client certificate expiration histogram

Name Description Type Key and additional info
Kubernetes API: Certificate expiration seconds bucket, {#LE}

Distribution of the remaining lifetime on the certificate used to authenticate a request.

Dependent item kubernetes.api.client_certificate_expiration_seconds_bucket[{#LE}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    Custom on fail: Discard value

Kubernetes API: Client certificate expiration, p1

1st percentile of the remaining lifetime on the certificate used to authenticate a request.

Calculated kubernetes.api.client_certificate_expiration_p1[{#SINGLETON}]

Trigger prototypes for Client certificate expiration histogram

Name Description Expression Severity Dependencies and additional info
Kubernetes API: Kubernetes client certificate is expiring

A client certificate used to authenticate to the apiserver expires in less than {$KUBE.API.CERT.EXPIRATION} days.

last(/Kubernetes API server by HTTP/kubernetes.api.client_certificate_expiration_p1[{#SINGLETON}]) > 0 and last(/Kubernetes API server by HTTP/kubernetes.api.client_certificate_expiration_p1[{#SINGLETON}]) < {$KUBE.API.CERT.EXPIRATION}*24*60*60 Warning Depends on:
  • Kubernetes API: Kubernetes client certificate expires soon
Kubernetes API: Kubernetes client certificate expires soon

A client certificate used to authenticate to the apiserver expires in less than 24 hours.

last(/Kubernetes API server by HTTP/kubernetes.api.client_certificate_expiration_p1[{#SINGLETON}]) > 0 and last(/Kubernetes API server by HTTP/kubernetes.api.client_certificate_expiration_p1[{#SINGLETON}]) < 24*60*60 Warning
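
Both trigger prototypes compare the p1 item, a remaining lifetime in seconds, against a threshold, and the first depends on the second. A minimal Python sketch of how the two expressions and the dependency interact, assuming the default of 7 days for {$KUBE.API.CERT.EXPIRATION}; the function and key names are hypothetical:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def certificate_alerts(p1_seconds, expiration_days=7):
    """Evaluate both certificate trigger expressions for one p1 value."""
    is_expiring = 0 < p1_seconds < expiration_days * SECONDS_PER_DAY
    expires_soon = 0 < p1_seconds < SECONDS_PER_DAY
    return {
        # The dependency hides "is expiring" while "expires soon" is in a problem state.
        "is_expiring": is_expiring and not expires_soon,
        "expires_soon": expires_soon,
    }


three_days_left = certificate_alerts(3 * SECONDS_PER_DAY)
one_hour_left = certificate_alerts(60 * 60)
```

With three days left only "is expiring" is reported, and with one hour left only "expires soon" is reported. A p1 value of 0 raises neither, because both expressions require a value greater than 0.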

Feedback

Please report any issues with the template at https://support.zabbix.com.

You can also provide feedback, discuss the template, or ask for help on the Zabbix forums.