Hadoop by HTTP

Overview

This template monitors Hadoop over HTTP and works without any external scripts. It collects metrics by polling the Hadoop API remotely using an HTTP agent and JSONPath preprocessing. Zabbix server (or proxy) executes direct requests to the ResourceManager, NodeManager, NameNode, and DataNode APIs. All metrics are collected at once, thanks to Zabbix bulk data collection.
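
The bulk collection idea can be illustrated outside Zabbix. The following is a minimal Python sketch, not the template's implementation. It assumes the Hadoop daemon returns its metrics as JSON with a top-level `beans` array (the shape implied by the template's JSONPath expressions) and that the document is served at `/jmx` on the Web-UI port; `fetch_stats`, `bean_value`, and the `SAMPLE` data are hypothetical names introduced for illustration. One response plays the role of the HTTP agent master item, and each dependent value is read from it without a further request.

```python
import json
from urllib.request import urlopen


def fetch_stats(host: str, port: int, timeout: float = 10.0) -> dict:
    """Fetch the metrics document once (the role of the HTTP agent master item).

    The /jmx path is an assumption made for this sketch.
    """
    with urlopen(f"http://{host}:{port}/jmx", timeout=timeout) as resp:
        return json.load(resp)


def bean_value(stats: dict, bean_name: str, attribute: str):
    """Return `attribute` from the first bean whose name matches exactly."""
    for bean in stats.get("beans", []):
        if bean.get("name") == bean_name:
            return bean.get(attribute)
    return None


# Hand-written sample in the assumed response shape (not real cluster output).
SAMPLE = {
    "beans": [
        {"name": "java.lang:type=Runtime", "Uptime": 4523000},
        {"name": "java.lang:type=Threading", "ThreadCount": 87},
    ]
}

# Two "dependent" values derived from the same document, with no extra request.
# Uptime is reported in milliseconds, so it is scaled by 0.001 to get seconds.
uptime_seconds = bean_value(SAMPLE, "java.lang:type=Runtime", "Uptime") * 0.001
thread_count = bean_value(SAMPLE, "java.lang:type=Threading", "ThreadCount")
```

Against a live cluster, `fetch_stats(host, port)` would replace `SAMPLE`; the extraction step stays the same.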

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

Hadoop 3.1 and later

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

Define the IP address (or FQDN) and Web-UI port of the ResourceManager in the {$HADOOP.RESOURCEMANAGER.HOST} and {$HADOOP.RESOURCEMANAGER.PORT} macros, and those of the NameNode in the {$HADOOP.NAMENODE.HOST} and {$HADOOP.NAMENODE.PORT} macros. Macros can be set in the template or overridden at the host level.
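
The override order described above can be sketched in a few lines of Python. This is an illustration only: `TEMPLATE_MACROS`, `resolve`, and `endpoint` are hypothetical names, and the default values are the ones listed under "Macros used". A value set on the host wins; otherwise the template default applies.

```python
# Template-level defaults, copied from the "Macros used" table.
TEMPLATE_MACROS = {
    "{$HADOOP.RESOURCEMANAGER.HOST}": "ResourceManager",
    "{$HADOOP.RESOURCEMANAGER.PORT}": "8088",
    "{$HADOOP.NAMENODE.HOST}": "NameNode",
    "{$HADOOP.NAMENODE.PORT}": "9870",
}


def resolve(macro: str, host_macros: dict) -> str:
    """Return the host-level value if one is set, else the template default."""
    if macro in host_macros:
        return host_macros[macro]
    return TEMPLATE_MACROS[macro]


def endpoint(component: str, host_macros: dict) -> str:
    """Build "host:port" for RESOURCEMANAGER or NAMENODE from resolved macros."""
    host = resolve(f"{{$HADOOP.{component}.HOST}}", host_macros)
    port = resolve(f"{{$HADOOP.{component}.PORT}}", host_macros)
    return f"{host}:{port}"


# A host that overrides only the ResourceManager address (example value).
overrides = {"{$HADOOP.RESOURCEMANAGER.HOST}": "rm.example.com"}
rm_endpoint = endpoint("RESOURCEMANAGER", overrides)
nn_endpoint = endpoint("NAMENODE", overrides)
```

Here the ResourceManager endpoint uses the overridden host with the default port, while the NameNode endpoint falls back to both template defaults.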

Macros used

Name	Description	Default
{$HADOOP.RESOURCEMANAGER.HOST}	The Hadoop ResourceManager host IP address or FQDN.	`ResourceManager`
{$HADOOP.RESOURCEMANAGER.PORT}	The Hadoop ResourceManager Web-UI port.	`8088`
{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN}	The Hadoop ResourceManager API page maximum response time in seconds for trigger expression.	`10s`
{$HADOOP.NAMENODE.HOST}	The Hadoop NameNode host IP address or FQDN.	`NameNode`
{$HADOOP.NAMENODE.PORT}	The Hadoop NameNode Web-UI port.	`9870`
{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN}	The Hadoop NameNode API page maximum response time in seconds for trigger expression.	`10s`
{$HADOOP.CAPACITY_REMAINING.MIN.WARN}	The Hadoop cluster capacity remaining percent for trigger expression.	`20`

Items

Name	Description	Type	Key and additional info
ResourceManager: Service status	Hadoop ResourceManager API port availability.	Simple check	net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"] Preprocessing Discard unchanged with heartbeat: `10m`
ResourceManager: Service response time	Hadoop ResourceManager API performance.	Simple check	net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"]
Hadoop: Get ResourceManager stats		HTTP agent	hadoop.resourcemanager.get
ResourceManager: Uptime		Dependent item	hadoop.resourcemanager.uptime Preprocessing JSON Path: `$.beans[?(@.name=='java.lang:type=Runtime')].Uptime.first()` Custom multiplier: `0.001`
ResourceManager: Get info		Dependent item	hadoop.resourcemanager.info Preprocessing JSON Path: `$.beans[?(@.name=~'Hadoop:service=ResourceManager,name=*')]` ⛔️Custom on fail: Set value to: `[]`
ResourceManager: RPC queue & processing time	Average time spent on processing RPC requests.	Dependent item	hadoop.resourcemanager.rpc_processing_time_avg Preprocessing JSON Path: `The text is too long. Please see the template.`
ResourceManager: Active NMs	Number of Active NodeManagers.	Dependent item	hadoop.resourcemanager.num_active_nm Preprocessing JSON Path: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `6h`
ResourceManager: Decommissioning NMs	Number of Decommissioning NodeManagers.	Dependent item	hadoop.resourcemanager.num_decommissioning_nm Preprocessing JSON Path: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `6h`
ResourceManager: Decommissioned NMs	Number of Decommissioned NodeManagers.	Dependent item	hadoop.resourcemanager.num_decommissioned_nm Preprocessing JSON Path: `The text is too long. Please see the template.`
ResourceManager: Lost NMs	Number of Lost NodeManagers.	Dependent item	hadoop.resourcemanager.num_lost_nm Preprocessing JSON Path: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `6h`
ResourceManager: Unhealthy NMs	Number of Unhealthy NodeManagers.	Dependent item	hadoop.resourcemanager.num_unhealthy_nm Preprocessing JSON Path: `The text is too long. Please see the template.`
ResourceManager: Rebooted NMs	Number of Rebooted NodeManagers.	Dependent item	hadoop.resourcemanager.num_rebooted_nm Preprocessing JSON Path: `The text is too long. Please see the template.`
ResourceManager: Shutdown NMs	Number of Shutdown NodeManagers.	Dependent item	hadoop.resourcemanager.num_shutdown_nm Preprocessing JSON Path: `The text is too long. Please see the template.`
NameNode: Service status	Hadoop NameNode API port availability.	Simple check	net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"] Preprocessing Discard unchanged with heartbeat: `10m`
NameNode: Service response time	Hadoop NameNode API performance.	Simple check	net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"]
Hadoop: Get NameNode stats		HTTP agent	hadoop.namenode.get
NameNode: Uptime		Dependent item	hadoop.namenode.uptime Preprocessing JSON Path: `$.beans[?(@.name=='java.lang:type=Runtime')].Uptime.first()` Custom multiplier: `0.001`
NameNode: Get info		Dependent item	hadoop.namenode.info Preprocessing JSON Path: `$.beans[?(@.name=~'Hadoop:service=NameNode,name=*')]` ⛔️Custom on fail: Set value to: `[]`
NameNode: RPC queue & processing time	Average time spent on processing RPC requests.	Dependent item	hadoop.namenode.rpc_processing_time_avg Preprocessing JSON Path: `The text is too long. Please see the template.`
NameNode: Block Pool Renaming		Dependent item	hadoop.namenode.percent_block_pool_used Preprocessing JSON Path: `The text is too long. Please see the template.`
NameNode: Transactions since last checkpoint	Total number of transactions since last checkpoint.	Dependent item	hadoop.namenode.transactions_since_last_checkpoint Preprocessing JSON Path: `The text is too long. Please see the template.`
NameNode: Percent capacity remaining	Available capacity in percent.	Dependent item	hadoop.namenode.percent_remaining Preprocessing JSON Path: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `6h`
NameNode: Capacity remaining	Available capacity.	Dependent item	hadoop.namenode.capacity_remaining Preprocessing JSON Path: `The text is too long. Please see the template.`
NameNode: Corrupt blocks	Number of corrupt blocks.	Dependent item	hadoop.namenode.corrupt_blocks Preprocessing JSON Path: `The text is too long. Please see the template.`
NameNode: Missing blocks	Number of missing blocks.	Dependent item	hadoop.namenode.missing_blocks Preprocessing JSON Path: `The text is too long. Please see the template.`
NameNode: Failed volumes	Number of failed volumes.	Dependent item	hadoop.namenode.volume_failures_total Preprocessing JSON Path: `The text is too long. Please see the template.`
NameNode: Alive DataNodes	Count of alive DataNodes.	Dependent item	hadoop.namenode.num_live_data_nodes Preprocessing JSON Path: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `6h`
NameNode: Dead DataNodes	Count of dead DataNodes.	Dependent item	hadoop.namenode.num_dead_data_nodes Preprocessing JSON Path: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `6h`
NameNode: Stale DataNodes	DataNodes that do not send a heartbeat within 30 seconds are marked as "stale".	Dependent item	hadoop.namenode.num_stale_data_nodes Preprocessing JSON Path: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `6h`
NameNode: Total files	Total count of files tracked by the NameNode.	Dependent item	hadoop.namenode.files_total Preprocessing JSON Path: `The text is too long. Please see the template.`
NameNode: Total load	The current number of concurrent file accesses (read/write) across all DataNodes.	Dependent item	hadoop.namenode.total_load Preprocessing JSON Path: `The text is too long. Please see the template.`
NameNode: Blocks allocable	Maximum number of blocks allocable.	Dependent item	hadoop.namenode.block_capacity Preprocessing JSON Path: `The text is too long. Please see the template.`
NameNode: Total blocks	Count of blocks tracked by NameNode.	Dependent item	hadoop.namenode.blocks_total Preprocessing JSON Path: `The text is too long. Please see the template.`
NameNode: Under-replicated blocks	The number of blocks with insufficient replication.	Dependent item	hadoop.namenode.under_replicated_blocks Preprocessing JSON Path: `The text is too long. Please see the template.`
Hadoop: Get NodeManagers states		HTTP agent	hadoop.nodemanagers.get Preprocessing JavaScript: `The text is too long. Please see the template.`
Hadoop: Get DataNodes states		HTTP agent	hadoop.datanodes.get Preprocessing JavaScript: `The text is too long. Please see the template.`
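
Two preprocessing steps recur in the table above: a JSONPath filter with a regular-expression match (`=~`) that falls back to `[]` on failure, and "Discard unchanged with heartbeat". The Python sketch below approximates both under stated assumptions. It treats `=~` as a partial regex match, treats a heartbeat as elapsed once the configured number of seconds has passed since the last stored value, and uses hand-written sample bean names. `select_beans` and `DiscardUnchangedWithHeartbeat` are hypothetical helpers, not Zabbix APIs.

```python
import re


def select_beans(stats: dict, name_pattern: str) -> list:
    """Approximate `$.beans[?(@.name=~'<pattern>')]`.

    Returning an empty list when nothing matches mirrors
    "Custom on fail: Set value to []".
    """
    return [
        bean for bean in stats.get("beans", [])
        if re.search(name_pattern, bean.get("name", ""))
    ]


class DiscardUnchangedWithHeartbeat:
    """Store a value only if it changed or the heartbeat period has elapsed."""

    def __init__(self, heartbeat_seconds: float):
        self.heartbeat = heartbeat_seconds
        self.last_value = None
        self.last_stored_at = None

    def accept(self, value, now: float) -> bool:
        first = self.last_stored_at is None
        changed = value != self.last_value
        expired = (not first) and (now - self.last_stored_at >= self.heartbeat)
        if first or changed or expired:
            self.last_value = value
            self.last_stored_at = now
            return True
        return False  # unchanged within the heartbeat period: discarded


# Hand-written sample in the assumed response shape (not real cluster output).
SAMPLE = {
    "beans": [
        {"name": "Hadoop:service=ResourceManager,name=ClusterMetrics"},
        {"name": "java.lang:type=Runtime", "Uptime": 4523000},
    ]
}
rm_info = select_beans(SAMPLE, "Hadoop:service=ResourceManager,name=*")
nn_info = select_beans(SAMPLE, "Hadoop:service=NameNode,name=*")  # no match

# Service status polled every minute with a 10m (600 s) heartbeat.
throttle = DiscardUnchangedWithHeartbeat(600)
stored = [throttle.accept(v, t) for t, v in [(0, 1), (60, 1), (120, 0), (180, 0), (720, 0)]]
```

With a heartbeat, an unchanged value is still written periodically, so `nodata()`-style checks keep receiving data even when nothing changes.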

Triggers

Name	Description	Expression	Severity	Dependencies and additional info
ResourceManager: Service is unavailable		`last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"])=0`	Average	Manual close: Yes
ResourceManager: Service response time is too high		`min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"],5m)>{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN}`	Warning	Manual close: Yes Depends on: ResourceManager: Service is unavailable
ResourceManager: Service has been restarted	Uptime is less than 10 minutes.	`last(/Hadoop by HTTP/hadoop.resourcemanager.uptime)<10m`	Info	Manual close: Yes
ResourceManager: Failed to fetch ResourceManager API page	Zabbix has not received any data for items for the last 30 minutes.	`nodata(/Hadoop by HTTP/hadoop.resourcemanager.uptime,30m)=1`	Warning	Manual close: Yes Depends on: ResourceManager: Service is unavailable
ResourceManager: Cluster has no active NodeManagers	Cluster is unable to execute any jobs without at least one NodeManager.	`max(/Hadoop by HTTP/hadoop.resourcemanager.num_active_nm,5m)=0`	High
ResourceManager: Cluster has unhealthy NodeManagers	YARN considers any node with disk utilization exceeding the value specified under the property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage (in yarn-site.xml) to be unhealthy. Ample disk space is critical to ensure uninterrupted operation of a Hadoop cluster, and large numbers of unhealthyNodes (the number to alert on depends on the size of your cluster) should be quickly investigated and resolved.	`min(/Hadoop by HTTP/hadoop.resourcemanager.num_unhealthy_nm,15m)>0`	Average
NameNode: Service is unavailable		`last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"])=0`	Average	Manual close: Yes
NameNode: Service response time is too high		`min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"],5m)>{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN}`	Warning	Manual close: Yes Depends on: NameNode: Service is unavailable
NameNode: Service has been restarted	Uptime is less than 10 minutes.	`last(/Hadoop by HTTP/hadoop.namenode.uptime)<10m`	Info	Manual close: Yes
NameNode: Failed to fetch NameNode API page	Zabbix has not received any data for items for the last 30 minutes.	`nodata(/Hadoop by HTTP/hadoop.namenode.uptime,30m)=1`	Warning	Manual close: Yes Depends on: NameNode: Service is unavailable
NameNode: Cluster capacity remaining is low	A good practice is to ensure that disk use never exceeds 80 percent capacity.	`max(/Hadoop by HTTP/hadoop.namenode.percent_remaining,15m)<{$HADOOP.CAPACITY_REMAINING.MIN.WARN}`	Warning
NameNode: Cluster has missing blocks	A missing block is far worse than a corrupt block, because a missing block cannot be recovered by copying a replica.	`min(/Hadoop by HTTP/hadoop.namenode.missing_blocks,15m)>0`	Average
NameNode: Cluster has volume failures	HDFS now allows for disks to fail in place, without affecting DataNode operations, until a threshold value is reached. This is set on each DataNode via the dfs.datanode.failed.volumes.tolerated property; it defaults to 0, meaning that any volume failure will shut down the DataNode; on a production cluster where DataNodes typically have 6, 8, or 12 disks, setting this parameter to 1 or 2 is typically the best practice.	`min(/Hadoop by HTTP/hadoop.namenode.volume_failures_total,15m)>0`	Average
NameNode: Cluster has DataNodes in Dead state	The death of a DataNode causes a flurry of network activity, as the NameNode initiates replication of blocks lost on the dead nodes.	`min(/Hadoop by HTTP/hadoop.namenode.num_dead_data_nodes,5m)>0`	Average
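
The expressions above pair a history function over a time window with a comparison. As a rough illustration of why `min(...,15m)>0` and `max(...,15m)<threshold` behave as "sustained" conditions, the Python sketch below evaluates them over in-memory samples. It is a simplification under stated assumptions: `values_in_window`, `sustained_above_zero`, and `sustained_below` are hypothetical helpers, the window is taken as `(now - period, now]`, and an empty window is treated as "not firing", which is not necessarily how Zabbix handles missing data.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


def values_in_window(history: List[Sample], now: float, period: float) -> List[float]:
    """Values whose timestamp falls within the last `period` seconds."""
    return [value for ts, value in history if now - period < ts <= now]


def sustained_above_zero(history: List[Sample], now: float, period: float) -> bool:
    """Like `min(item,period)>0`: true only if every sample in the window is > 0."""
    values = values_in_window(history, now, period)
    return bool(values) and min(values) > 0


def sustained_below(history: List[Sample], now: float, period: float, threshold: float) -> bool:
    """Like `max(item,period)<threshold`: true only if every sample is below it."""
    values = values_in_window(history, now, period)
    return bool(values) and max(values) < threshold


# Missing blocks: one sample of 0 inside the window keeps the trigger quiet.
missing_blocks = [(0, 0), (300, 2), (600, 2), (900, 3)]
fires_15m = sustained_above_zero(missing_blocks, now=900, period=900)
fires_wider = sustained_above_zero(missing_blocks, now=900, period=901)

# Percent capacity remaining against the default threshold of 20.
remaining = [(100, 25), (400, 18), (700, 17)]
low_at_900 = sustained_below(remaining, now=900, period=900, threshold=20)
low_at_1200 = sustained_below(remaining, now=1200, period=900, threshold=20)
```

Using `min` for "greater than" checks and `max` for "less than" checks means a single good sample inside the window is enough to keep the trigger from firing, which filters out short spikes.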

LLD rule Node manager discovery

Name	Description	Type	Key and additional info
Node manager discovery		HTTP agent	hadoop.nodemanager.discovery Preprocessing JavaScript: `The text is too long. Please see the template.`

Item prototypes for Node manager discovery

Name	Description	Type	Key and additional info
Hadoop NodeManager {#HOSTNAME}: Get stats		HTTP agent	hadoop.nodemanager.get[{#HOSTNAME}]
{#HOSTNAME}: RPC queue & processing time	Average time spent on processing RPC requests.	Dependent item	hadoop.nodemanager.rpc_processing_time_avg[{#HOSTNAME}] Preprocessing JSON Path: `The text is too long. Please see the template.`
{#HOSTNAME}: Container launch avg duration		Dependent item	hadoop.nodemanager.container_launch_duration_avg[{#HOSTNAME}] Preprocessing JSON Path: `The text is too long. Please see the template.`
{#HOSTNAME}: JVM Threads	The number of JVM threads.	Dependent item	hadoop.nodemanager.jvm.threads[{#HOSTNAME}] Preprocessing JSON Path: `The text is too long. Please see the template.`
{#HOSTNAME}: JVM Garbage collection time	The JVM garbage collection time in milliseconds.	Dependent item	hadoop.nodemanager.jvm.gc_time[{#HOSTNAME}] Preprocessing JSON Path: `The text is too long. Please see the template.`
{#HOSTNAME}: JVM Heap usage	The JVM heap usage in MBytes.	Dependent item	hadoop.nodemanager.jvm.mem_heap_used[{#HOSTNAME}] Preprocessing JSON Path: `The text is too long. Please see the template.`
{#HOSTNAME}: Uptime		Dependent item	hadoop.nodemanager.uptime[{#HOSTNAME}] Preprocessing JSON Path: `$.beans[?(@.name=='java.lang:type=Runtime')].Uptime.first()` Custom multiplier: `0.001`
Hadoop NodeManager {#HOSTNAME}: Get raw info		Dependent item	hadoop.nodemanager.raw_info[{#HOSTNAME}] Preprocessing JSON Path: `$.[?(@.HostName=='{#HOSTNAME}')].first()` ⛔️Custom on fail: Discard value
{#HOSTNAME}: State	State of the node - valid values are: NEW, RUNNING, UNHEALTHY, DECOMMISSIONING, DECOMMISSIONED, LOST, REBOOTED, SHUTDOWN.	Dependent item	hadoop.nodemanager.state[{#HOSTNAME}] Preprocessing JSON Path: `$.State` Discard unchanged with heartbeat: `6h`
{#HOSTNAME}: Version		Dependent item	hadoop.nodemanager.version[{#HOSTNAME}] Preprocessing JSON Path: `$.NodeManagerVersion` Discard unchanged with heartbeat: `6h`
{#HOSTNAME}: Number of containers		Dependent item	hadoop.nodemanager.numcontainers[{#HOSTNAME}] Preprocessing JSON Path: `$.NumContainers`
{#HOSTNAME}: Used memory		Dependent item	hadoop.nodemanager.usedmemory[{#HOSTNAME}] Preprocessing JSON Path: `$.UsedMemoryMB`
{#HOSTNAME}: Available memory		Dependent item	hadoop.nodemanager.availablememory[{#HOSTNAME}] Preprocessing JSON Path: `$.AvailableMemoryMB`
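
To show how the prototypes above turn into per-node items, the Python sketch below expands a prototype key for each discovered node and then picks that node's record in the way `$.[?(@.HostName=='{#HOSTNAME}')].first()` would. The shapes are assumptions: the discovery JavaScript is not shown in this document, so the discovery rows are assumed to carry only `{#HOSTNAME}`, and the node list is hand-written using the field names from the JSONPath expressions in the table. `expand_prototype` and `raw_info` are hypothetical helpers.

```python
import re


def expand_prototype(prototype: str, row: dict) -> str:
    """Replace every {#MACRO} in `prototype` with its value from a discovery row."""
    return re.sub(
        r"\{#[A-Z0-9_.]+\}",
        lambda m: str(row.get(m.group(0), m.group(0))),  # unknown macros stay as-is
        prototype,
    )


def raw_info(nodes: list, hostname: str):
    """Return the first record whose HostName matches; None stands for "Discard value"."""
    return next((node for node in nodes if node.get("HostName") == hostname), None)


# Assumed discovery output: one row per NodeManager (example hostnames).
DISCOVERY = [{"{#HOSTNAME}": "nm1.example.com"}, {"{#HOSTNAME}": "nm2.example.com"}]

# Assumed shape of the NodeManagers state list (hand-written sample).
NODES = [
    {"HostName": "nm1.example.com", "State": "RUNNING", "NumContainers": 4,
     "UsedMemoryMB": 4096, "AvailableMemoryMB": 4096},
    {"HostName": "nm2.example.com", "State": "LOST", "NumContainers": 0,
     "UsedMemoryMB": 0, "AvailableMemoryMB": 8192},
]

state_keys = [expand_prototype("hadoop.nodemanager.state[{#HOSTNAME}]", row) for row in DISCOVERY]
states = {
    row["{#HOSTNAME}"]: raw_info(NODES, row["{#HOSTNAME}"])["State"]
    for row in DISCOVERY
}
```

Each discovered hostname yields its own item key, and the per-node fields (`$.State`, `$.NumContainers`, and so on) are then read from that node's record.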

Trigger prototypes for Node manager discovery

Name	Description	Expression	Severity	Dependencies and additional info
{#HOSTNAME}: Service has been restarted	Uptime is less than 10 minutes.	`last(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}])<10m`	Info	Manual close: Yes
{#HOSTNAME}: Failed to fetch NodeManager API page	Zabbix has not received any data for items for the last 30 minutes.	`nodata(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}],30m)=1`	Warning	Manual close: Yes Depends on: {#HOSTNAME}: NodeManager has state {ITEM.VALUE}.
{#HOSTNAME}: NodeManager has state {ITEM.VALUE}.	The state is different from normal.	`last(/Hadoop by HTTP/hadoop.nodemanager.state[{#HOSTNAME}])<>"RUNNING"`	Average
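
The "Depends on" entries above mean that a dependent trigger is held back while the trigger it depends on is in a problem state. A simplified Python sketch of that idea follows. It only filters a set of firing triggers and does not model Zabbix event handling; `visible_problems` and the trigger labels are hypothetical.

```python
def visible_problems(firing: set, depends_on: dict) -> set:
    """Return firing triggers that are not masked by a firing dependency."""
    return {
        name for name in firing
        if not any(dep in firing for dep in depends_on.get(name, ()))
    }


# Dependency taken from the table: the fetch failure depends on the state trigger.
DEPENDS_ON = {"Failed to fetch NodeManager API page": ["NodeManager has state"]}

# Node is LOST and also returns no data: only the state problem is reported.
both = visible_problems(
    {"Failed to fetch NodeManager API page", "NodeManager has state"}, DEPENDS_ON
)

# Node is RUNNING but its API page cannot be fetched: the fetch problem is reported.
fetch_only = visible_problems({"Failed to fetch NodeManager API page"}, DEPENDS_ON)
```

This keeps a node that is already known to be in a non-RUNNING state from raising a second, redundant "no data" problem.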

LLD rule Data node discovery