22 KiB
Apache Kafka by JMX
Overview
This template is designed for the effortless deployment of Apache Kafka monitoring by Zabbix via JMX and doesn't require any external scripts.
Requirements
Zabbix version: 7.0 and higher.
Tested versions
This template has been tested on:
- Apache Kafka 2.6.0
Configuration
Zabbix should be configured according to the instructions in the Templates out of the box section.
Setup
Metrics are collected by JMX.
- Enable and configure JMX access to Apache Kafka. See documentation for instructions.
- Set the user name and password in host macros {$KAFKA.USER} and {$KAFKA.PASSWORD}.
Macros used
Name | Description | Default |
---|---|---|
{$KAFKA.USER} | zabbix |
|
{$KAFKA.PASSWORD} | zabbix |
|
{$KAFKA.TOPIC.MATCHES} | Filter of discoverable topics |
.* |
{$KAFKA.TOPIC.NOT_MATCHES} | Filter to exclude discovered topics |
__consumer_offsets |
{$KAFKA.NET_PROC_AVG_IDLE.MIN.WARN} | The minimum Network processor average idle percent for trigger expression. |
30 |
{$KAFKA.REQUEST_HANDLER_AVG_IDLE.MIN.WARN} | The minimum Request handler average idle percent for trigger expression. |
30 |
Items
Name | Description | Type | Key and additional info |
---|---|---|---|
Kafka: Leader election per second | Number of leader elections per second. |
JMX agent | jmx["kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs","Count"] |
Kafka: Unclean leader election per second | Number of “unclean” elections per second. |
JMX agent | jmx["kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec","Count"] Preprocessing
|
Kafka: Controller state on broker | One indicates that the broker is the controller for the cluster. |
JMX agent | jmx["kafka.controller:type=KafkaController,name=ActiveControllerCount","Value"] Preprocessing
|
Kafka: Ineligible pending replica deletes | The number of ineligible pending replica deletes. |
JMX agent | jmx["kafka.controller:type=KafkaController,name=ReplicasIneligibleToDeleteCount","Value"] |
Kafka: Pending replica deletes | The number of pending replica deletes. |
JMX agent | jmx["kafka.controller:type=KafkaController,name=ReplicasToDeleteCount","Value"] |
Kafka: Ineligible pending topic deletes | The number of ineligible pending topic deletes. |
JMX agent | jmx["kafka.controller:type=KafkaController,name=TopicsIneligibleToDeleteCount","Value"] |
Kafka: Pending topic deletes | The number of pending topic deletes. |
JMX agent | jmx["kafka.controller:type=KafkaController,name=TopicsToDeleteCount","Value"] |
Kafka: Offline log directory count | The number of offline log directories (for example, after a hardware failure). |
JMX agent | jmx["kafka.log:type=LogManager,name=OfflineLogDirectoryCount","Value"] |
Kafka: Offline partitions count | Number of partitions that don't have an active leader. |
JMX agent | jmx["kafka.controller:type=KafkaController,name=OfflinePartitionsCount","Value"] |
Kafka: Bytes out per second | The rate at which data is fetched and read from the broker by consumers. |
JMX agent | jmx["kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec","Count"] Preprocessing
|
Kafka: Bytes in per second | The rate at which data sent from producers is consumed by the broker. |
JMX agent | jmx["kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec","Count"] Preprocessing
|
Kafka: Messages in per second | The rate at which individual messages are consumed by the broker. |
JMX agent | jmx["kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec","Count"] Preprocessing
|
Kafka: Bytes rejected per second | The rate at which bytes rejected per second by the broker. |
JMX agent | jmx["kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec","Count"] Preprocessing
|
Kafka: Client fetch request failed per second | Number of client fetch request failures per second. |
JMX agent | jmx["kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec","Count"] Preprocessing
|
Kafka: Produce requests failed per second | Number of failed produce requests per second. |
JMX agent | jmx["kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec","Count"] Preprocessing
|
Kafka: Request handler average idle percent | Indicates the percentage of time that the request handler (IO) threads are not in use. |
JMX agent | jmx["kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent","OneMinuteRate"] Preprocessing
|
Kafka: Fetch-Consumer response send time, mean | Average time taken, in milliseconds, to send the response. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchConsumer","Mean"] |
Kafka: Fetch-Consumer response send time, p95 | The time taken, in milliseconds, to send the response for 95th percentile. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchConsumer","95thPercentile"] |
Kafka: Fetch-Consumer response send time, p99 | The time taken, in milliseconds, to send the response for 99th percentile. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchConsumer","99thPercentile"] |
Kafka: Fetch-Follower response send time, mean | Average time taken, in milliseconds, to send the response. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchFollower","Mean"] |
Kafka: Fetch-Follower response send time, p95 | The time taken, in milliseconds, to send the response for 95th percentile. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchFollower","95thPercentile"] |
Kafka: Fetch-Follower response send time, p99 | The time taken, in milliseconds, to send the response for 99th percentile. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchFollower","99thPercentile"] |
Kafka: Produce response send time, mean | Average time taken, in milliseconds, to send the response. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=Produce","Mean"] |
Kafka: Produce response send time, p95 | The time taken, in milliseconds, to send the response for 95th percentile. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=Produce","95thPercentile"] |
Kafka: Produce response send time, p99 | The time taken, in milliseconds, to send the response for 99th percentile. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=Produce","99thPercentile"] |
Kafka: Fetch-Consumer request total time, mean | Average time in ms to serve the Fetch-Consumer request. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer","Mean"] |
Kafka: Fetch-Consumer request total time, p95 | Time in ms to serve the Fetch-Consumer request for 95th percentile. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer","95thPercentile"] |
Kafka: Fetch-Consumer request total time, p99 | Time in ms to serve the specified Fetch-Consumer for 99th percentile. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer","99thPercentile"] |
Kafka: Fetch-Follower request total time, mean | Average time in ms to serve the Fetch-Follower request. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower","Mean"] |
Kafka: Fetch-Follower request total time, p95 | Time in ms to serve the Fetch-Follower request for 95th percentile. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower","95thPercentile"] |
Kafka: Fetch-Follower request total time, p99 | Time in ms to serve the Fetch-Follower request for 99th percentile. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower","99thPercentile"] |
Kafka: Produce request total time, mean | Average time in ms to serve the Produce request. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce","Mean"] |
Kafka: Produce request total time, p95 | Time in ms to serve the Produce requests for 95th percentile. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce","95thPercentile"] |
Kafka: Produce request total time, p99 | Time in ms to serve the Produce requests for 99th percentile. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce","99thPercentile"] |
Kafka: Fetch-Consumer request total time, mean | Average time for a request to update metadata. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=UpdateMetadata","Mean"] |
Kafka: UpdateMetadata request total time, p95 | Time for update metadata requests for 95th percentile. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=UpdateMetadata","95thPercentile"] |
Kafka: UpdateMetadata request total time, p99 | Time for update metadata requests for 99th percentile. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=UpdateMetadata","99thPercentile"] |
Kafka: Temporary memory size in bytes (Fetch), max | The maximum of temporary memory used for converting message formats and decompressing messages. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request=Fetch","Max"] |
Kafka: Temporary memory size in bytes (Fetch), min | The minimum of temporary memory used for converting message formats and decompressing messages. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request=Fetch","Mean"] |
Kafka: Temporary memory size in bytes (Produce), max | The maximum of temporary memory used for converting message formats and decompressing messages. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request=Produce","Max"] |
Kafka: Temporary memory size in bytes (Produce), avg | The amount of temporary memory used for converting message formats and decompressing messages. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request=Produce","Mean"] |
Kafka: Temporary memory size in bytes (Produce), min | The minimum of temporary memory used for converting message formats and decompressing messages. |
JMX agent | jmx["kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request=Produce","Min"] |
Kafka: Network processor average idle percent | The average percentage of time that the network processors are idle. |
JMX agent | jmx["kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent","Value"] Preprocessing
|
Kafka: Requests in producer purgatory | Number of requests waiting in producer purgatory. |
JMX agent | jmx["kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Fetch","Value"] |
Kafka: Requests in fetch purgatory | Number of requests waiting in fetch purgatory. |
JMX agent | jmx["kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce","Value"] |
Kafka: Replication maximum lag | The maximum lag between the time that messages are received by the leader replica and by the follower replicas. |
JMX agent | jmx["kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica","Value"] |
Kafka: Under minimum ISR partition count | The number of partitions under the minimum In-Sync Replica (ISR) count. |
JMX agent | jmx["kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount","Value"] |
Kafka: Under replicated partitions | The number of partitions that have not been fully replicated in the follower replicas (the number of non-reassigning replicas - the number of ISR > 0). |
JMX agent | jmx["kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions","Value"] |
Kafka: ISR expands per second | The rate at which the number of ISRs in the broker increases. |
JMX agent | jmx["kafka.server:type=ReplicaManager,name=IsrExpandsPerSec","Count"] Preprocessing
|
Kafka: ISR shrink per second | Rate of replicas leaving the ISR pool. |
JMX agent | jmx["kafka.server:type=ReplicaManager,name=IsrShrinksPerSec","Count"] Preprocessing
|
Kafka: Leader count | The number of replicas for which this broker is the leader. |
JMX agent | jmx["kafka.server:type=ReplicaManager,name=LeaderCount","Value"] |
Kafka: Partition count | The number of partitions in the broker. |
JMX agent | jmx["kafka.server:type=ReplicaManager,name=PartitionCount","Value"] |
Kafka: Number of reassigning partitions | The number of reassigning leader partitions on a broker. |
JMX agent | jmx["kafka.server:type=ReplicaManager,name=ReassigningPartitions","Value"] |
Kafka: Request queue size | The size of the delay queue. |
JMX agent | jmx["kafka.server:type=Request","queue-size"] |
Kafka: Version | Current version of broker. |
JMX agent | jmx["kafka.server:type=app-info","version"] Preprocessing
|
Kafka: Uptime | The service uptime expressed in seconds. |
JMX agent | jmx["kafka.server:type=app-info","start-time-ms"] Preprocessing
|
Kafka: ZooKeeper client request latency | Latency in milliseconds for ZooKeeper requests from broker. |
JMX agent | jmx["kafka.server:type=ZooKeeperClientMetrics,name=ZooKeeperRequestLatencyMs","Count"] |
Kafka: ZooKeeper connection status | Connection status of broker's ZooKeeper session. |
JMX agent | jmx["kafka.server:type=SessionExpireListener,name=SessionState","Value"] Preprocessing
|
Kafka: ZooKeeper disconnect rate | ZooKeeper client disconnect per second. |
JMX agent | jmx["kafka.server:type=SessionExpireListener,name=ZooKeeperDisconnectsPerSec","Count"] Preprocessing
|
Kafka: ZooKeeper session expiration rate | ZooKeeper client session expiration per second. |
JMX agent | jmx["kafka.server:type=SessionExpireListener,name=ZooKeeperExpiresPerSec","Count"] Preprocessing
|
Kafka: ZooKeeper readonly rate | ZooKeeper client readonly per second. |
JMX agent | jmx["kafka.server:type=SessionExpireListener,name=ZooKeeperReadOnlyConnectsPerSec","Count"] Preprocessing
|
Kafka: ZooKeeper sync rate | ZooKeeper client sync per second. |
JMX agent | jmx["kafka.server:type=SessionExpireListener,name=ZooKeeperSyncConnectsPerSec","Count"] Preprocessing
|
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
Kafka: Unclean leader election detected | Unclean leader elections occur when there is no qualified partition leader among Kafka brokers. If Kafka is configured to allow an unclean leader election, a leader is chosen from the out-of-sync replicas, and any messages that were not synced prior to the loss of the former leader are lost forever. Essentially, unclean leader elections sacrifice consistency for availability. |
last(/Apache Kafka by JMX/jmx["kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec","Count"])>0 |
Average | |
Kafka: There are offline log directories | The offline log directory count metric indicate the number of log directories which are offline (due to a hardware failure for example) so that the broker cannot store incoming messages anymore. |
last(/Apache Kafka by JMX/jmx["kafka.log:type=LogManager,name=OfflineLogDirectoryCount","Value"]) > 0 |
Warning | |
Kafka: One or more partitions have no leader | Any partition without an active leader will be completely inaccessible, and both consumers and producers of that partition will be blocked until a leader becomes available. |
last(/Apache Kafka by JMX/jmx["kafka.controller:type=KafkaController,name=OfflinePartitionsCount","Value"]) > 0 |
Warning | |
Kafka: Request handler average idle percent is too low | The request handler idle ratio metric indicates the percentage of time the request handlers are not in use. The lower this number, the more loaded the broker is. |
max(/Apache Kafka by JMX/jmx["kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent","OneMinuteRate"],15m)<{$KAFKA.REQUEST_HANDLER_AVG_IDLE.MIN.WARN} |
Average | |
Kafka: Network processor average idle percent is too low | The network processor idle ratio metric indicates the percentage of time the network processor are not in use. The lower this number, the more loaded the broker is. |
max(/Apache Kafka by JMX/jmx["kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent","Value"],15m)<{$KAFKA.NET_PROC_AVG_IDLE.MIN.WARN} |
Average | |
Kafka: Failed to fetch info data | Zabbix has not received data for items for the last 15 minutes |
nodata(/Apache Kafka by JMX/jmx["kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent","Value"],15m)=1 |
Warning | |
Kafka: There are partitions under the min ISR | The Under min ISR partitions metric displays the number of partitions, where the number of In-Sync Replicas (ISR) is less than the minimum number of in-sync replicas specified. The two most common causes of under-min ISR partitions are that one or more brokers is unresponsive, or the cluster is experiencing performance issues and one or more brokers are falling behind. |
last(/Apache Kafka by JMX/jmx["kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount","Value"])>0 |
Average | |
Kafka: There are under replicated partitions | The Under replicated partitions metric displays the number of partitions that do not have enough replicas to meet the desired replication factor. A partition will also be considered under-replicated if the correct number of replicas exist, but one or more of the replicas have fallen significantly behind the partition leader. The two most common causes of under-replicated partitions are that one or more brokers is unresponsive, or the cluster is experiencing performance issues and one or more brokers have fallen behind. |
last(/Apache Kafka by JMX/jmx["kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions","Value"])>0 |
Average | |
Kafka: Version has changed | The Kafka version has changed. Acknowledge to close the problem manually. |
last(/Apache Kafka by JMX/jmx["kafka.server:type=app-info","version"],#1)<>last(/Apache Kafka by JMX/jmx["kafka.server:type=app-info","version"],#2) and length(last(/Apache Kafka by JMX/jmx["kafka.server:type=app-info","version"]))>0 |
Info | Manual close: Yes |
Kafka: has been restarted | Uptime is less than 10 minutes. |
last(/Apache Kafka by JMX/jmx["kafka.server:type=app-info","start-time-ms"])<10m |
Info | Manual close: Yes |
Kafka: Broker is not connected to ZooKeeper | find(/Apache Kafka by JMX/jmx["kafka.server:type=SessionExpireListener,name=SessionState","Value"],,"regexp","CONNECTED")=0 |
Average |
LLD rule Topic Metrics (write)
Name | Description | Type | Key and additional info |
---|---|---|---|
Topic Metrics (write) | JMX agent | jmx.discovery[beans,"kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=*"] |
Item prototypes for Topic Metrics (write)
Name | Description | Type | Key and additional info |
---|---|---|---|
Kafka {#JMXTOPIC}: Messages in per second | The rate at which individual messages are consumed by topic. |
JMX agent | jmx["kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic={#JMXTOPIC}","Count"] Preprocessing
|
Kafka {#JMXTOPIC}: Bytes in per second | The rate at which data sent from producers is consumed by topic. |
JMX agent | jmx["kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic={#JMXTOPIC}","Count"] Preprocessing
|
LLD rule Topic Metrics (read)
Name | Description | Type | Key and additional info |
---|---|---|---|
Topic Metrics (read) | JMX agent | jmx.discovery[beans,"kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=*"] |
Item prototypes for Topic Metrics (read)
Name | Description | Type | Key and additional info |
---|---|---|---|
Kafka {#JMXTOPIC}: Bytes out per second | The rate at which data is fetched and read from the broker by consumers (by topic). |
JMX agent | jmx["kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic={#JMXTOPIC}","Count"] Preprocessing
|
LLD rule Topic Metrics (errors)
Name | Description | Type | Key and additional info |
---|---|---|---|
Topic Metrics (errors) | JMX agent | jmx.discovery[beans,"kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec,topic=*"] |
Item prototypes for Topic Metrics (errors)
Name | Description | Type | Key and additional info |
---|---|---|---|
Kafka {#JMXTOPIC}: Bytes rejected per second | Rejected bytes rate by topic. |
JMX agent | jmx["kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec,topic={#JMXTOPIC}","Count"] Preprocessing
|
Feedback
Please report any issues with the template at https://support.zabbix.com
You can also provide feedback, discuss the template, or ask for help at ZABBIX forums