# Hadoop by HTTP

## Overview

This template monitors Hadoop over HTTP and works without any external scripts. It collects metrics by polling the Hadoop API remotely, using an HTTP agent and JSONPath preprocessing. Zabbix server (or proxy) executes direct requests to the ResourceManager, NodeManager, NameNode, and DataNode APIs. All metrics are collected at once, thanks to Zabbix bulk data collection.

## Requirements

Zabbix version: 7.0 and higher.

## Tested versions

This template has been tested on:
- Hadoop 3.1 and later

## Configuration

> Zabbix should be configured according to the instructions in the [Templates out of the box](https://www.zabbix.com/documentation/7.0/manual/config/templates_out_of_the_box) section.

## Setup

Define the IP address (or FQDN) and Web-UI port of the ResourceManager in the {$HADOOP.RESOURCEMANAGER.HOST} and {$HADOOP.RESOURCEMANAGER.PORT} macros, and of the NameNode in the {$HADOOP.NAMENODE.HOST} and {$HADOOP.NAMENODE.PORT} macros, respectively. Macros can be set in the template or overridden at the host level.

### Macros used

|Name|Description|Default|
|----|-----------|-------|
|{$HADOOP.RESOURCEMANAGER.HOST}|The Hadoop ResourceManager host IP address or FQDN.|`ResourceManager`|
|{$HADOOP.RESOURCEMANAGER.PORT}|The Hadoop ResourceManager Web-UI port.|`8088`|
|{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN}|The maximum Hadoop ResourceManager API page response time in seconds for trigger expressions.|`10s`|
|{$HADOOP.NAMENODE.HOST}|The Hadoop NameNode host IP address or FQDN.|`NameNode`|
|{$HADOOP.NAMENODE.PORT}|The Hadoop NameNode Web-UI port.|`9870`|
|{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN}|The maximum Hadoop NameNode API page response time in seconds for trigger expressions.|`10s`|
|{$HADOOP.CAPACITY_REMAINING.MIN.WARN}|The minimum percentage of remaining Hadoop cluster capacity for trigger expressions.|`20`|
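Before linking the template, you can verify that the two API endpoints it polls are reachable from the Zabbix server or proxy. The template reads Hadoop's JSON JMX servlet (the `$.beans[...]` JSON paths below point at its output); a minimal check in Python, assuming the macro defaults above (replace the hostnames and ports with your own values):

```python
import json
from urllib.request import urlopen

# Hostnames and ports mirror the macro defaults above -- substitute the
# values you set in {$HADOOP.RESOURCEMANAGER.*} and {$HADOOP.NAMENODE.*}.
ENDPOINTS = {
    "ResourceManager": "http://ResourceManager:8088/jmx",
    "NameNode": "http://NameNode:9870/jmx",
}

for name, url in ENDPOINTS.items():
    with urlopen(url, timeout=10) as resp:
        beans = json.load(resp)["beans"]  # the JMX servlet wraps MBeans in {"beans": [...]}
    print(f"{name}: reachable, {len(beans)} MBeans exposed")
```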
### Items

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|ResourceManager: Service status|Hadoop ResourceManager API port availability.|Simple check|net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"]<br>**Preprocessing**<br>Discard unchanged with heartbeat: `10m`|
|ResourceManager: Service response time|Hadoop ResourceManager API performance.|Simple check|net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"]|
|Hadoop: Get ResourceManager stats||HTTP agent|hadoop.resourcemanager.get|
|ResourceManager: Uptime||Dependent item|hadoop.resourcemanager.uptime<br>**Preprocessing**<br>JSON Path: `$.beans[?(@.name=='java.lang:type=Runtime')].Uptime.first()`<br>Custom multiplier: `0.001`|
|Hadoop: Get ResourceManager metrics||Dependent item|**Preprocessing**<br>JSON Path: `$.beans[?(@.name=~'Hadoop:service=ResourceManager,name=*')]`<br>⛔️Custom on fail: Set value to: `[]`|
|ResourceManager: RPC processing time, avg|Average time spent on processing RPC requests.|Dependent item|hadoop.resourcemanager.rpc_processing_time_avg<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|ResourceManager: Active NMs|Number of Active NodeManagers.|Dependent item|hadoop.resourcemanager.num_active_nm<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`<br>Discard unchanged with heartbeat: `6h`|
|ResourceManager: Decommissioning NMs|Number of Decommissioning NodeManagers.|Dependent item|hadoop.resourcemanager.num_decommissioning_nm<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`<br>Discard unchanged with heartbeat: `6h`|
|ResourceManager: Decommissioned NMs|Number of Decommissioned NodeManagers.|Dependent item|hadoop.resourcemanager.num_decommissioned_nm<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|ResourceManager: Lost NMs|Number of Lost NodeManagers.|Dependent item|hadoop.resourcemanager.num_lost_nm<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`<br>Discard unchanged with heartbeat: `6h`|
|ResourceManager: Unhealthy NMs|Number of Unhealthy NodeManagers.|Dependent item|hadoop.resourcemanager.num_unhealthy_nm<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|ResourceManager: Rebooted NMs|Number of Rebooted NodeManagers.|Dependent item|hadoop.resourcemanager.num_rebooted_nm<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|ResourceManager: Shutdown NMs|Number of Shutdown NodeManagers.|Dependent item|hadoop.resourcemanager.num_shutdown_nm<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|NameNode: Service status|Hadoop NameNode API port availability.|Simple check|net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"]<br>**Preprocessing**<br>Discard unchanged with heartbeat: `10m`|
|NameNode: Service response time|Hadoop NameNode API performance.|Simple check|net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"]|
|Hadoop: Get NameNode stats||HTTP agent|hadoop.namenode.get|
|NameNode: Uptime||Dependent item|hadoop.namenode.uptime<br>**Preprocessing**<br>JSON Path: `$.beans[?(@.name=='java.lang:type=Runtime')].Uptime.first()`<br>Custom multiplier: `0.001`|
|Hadoop: Get NameNode metrics||Dependent item|**Preprocessing**<br>JSON Path: `$.beans[?(@.name=~'Hadoop:service=NameNode,name=*')]`<br>⛔️Custom on fail: Set value to: `[]`|
|NameNode: RPC processing time, avg|Average time spent on processing RPC requests.|Dependent item|hadoop.namenode.rpc_processing_time_avg<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
| ||Dependent item|**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|NameNode: Transactions since last checkpoint|Total number of transactions since last checkpoint.|Dependent item|hadoop.namenode.transactions_since_last_checkpoint<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|NameNode: Percent capacity remaining|Available capacity in percent.|Dependent item|hadoop.namenode.percent_remaining<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`<br>Discard unchanged with heartbeat: `6h`|
|NameNode: Capacity remaining|Available capacity.|Dependent item|hadoop.namenode.capacity_remaining<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|NameNode: Corrupt blocks|Number of corrupt blocks.|Dependent item|hadoop.namenode.corrupt_blocks<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|NameNode: Missing blocks|Number of missing blocks.|Dependent item|hadoop.namenode.missing_blocks<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|NameNode: Failed volumes|Number of failed volumes.|Dependent item|hadoop.namenode.volume_failures_total<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|NameNode: Alive DataNodes|Count of alive DataNodes.|Dependent item|hadoop.namenode.num_live_data_nodes<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`<br>Discard unchanged with heartbeat: `6h`|
|NameNode: Dead DataNodes|Count of dead DataNodes.|Dependent item|hadoop.namenode.num_dead_data_nodes<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`<br>Discard unchanged with heartbeat: `6h`|
|NameNode: Stale DataNodes|DataNodes that do not send a heartbeat within 30 seconds are marked as "stale".|Dependent item|hadoop.namenode.num_stale_data_nodes<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`<br>Discard unchanged with heartbeat: `6h`|
|NameNode: Total files|Total count of files tracked by the NameNode.|Dependent item|hadoop.namenode.files_total<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|NameNode: Total load|The current number of concurrent file accesses (read/write) across all DataNodes.|Dependent item|hadoop.namenode.total_load<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|NameNode: Block capacity|Maximum number of blocks allocable.|Dependent item|hadoop.namenode.block_capacity<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|NameNode: Total blocks|Count of blocks tracked by the NameNode.|Dependent item|hadoop.namenode.blocks_total<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|NameNode: Under-replicated blocks|The number of blocks with insufficient replication.|Dependent item|hadoop.namenode.under_replicated_blocks<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|Hadoop: Get NodeManagers states||HTTP agent|**Preprocessing**<br>JavaScript: `The text is too long. Please see the template.`|
|Hadoop: Get DataNodes states||HTTP agent|**Preprocessing**<br>JavaScript: `The text is too long. Please see the template.`|
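The `Uptime` items above are among the few whose preprocessing chain is short enough to be shown in full, and they make a good worked example of how the template turns a raw `/jmx` document into a metric. A sketch of the same two steps in Python (the sample payload is illustrative):

```python
import json

# Illustrative /jmx excerpt: the java.lang Runtime MBean reports Uptime in milliseconds.
sample = '{"beans": [{"name": "java.lang:type=Runtime", "Uptime": 86400000}]}'

# Step 1 -- JSON Path: $.beans[?(@.name=='java.lang:type=Runtime')].Uptime.first()
beans = json.loads(sample)["beans"]
uptime_ms = next(b["Uptime"] for b in beans if b["name"] == "java.lang:type=Runtime")

# Step 2 -- custom multiplier 0.001: milliseconds to seconds, the unit the
# uptime triggers (e.g. `last(...)<10m`) compare against.
print(uptime_ms * 0.001)  # 86400.0, i.e. exactly one day
```

The same fetch-once, fan-out-with-JSONPath pattern applies to every dependent item above, which is what keeps the template to a single HTTP request per daemon.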
### Triggers

|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
|ResourceManager: Service is unavailable||`last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"])=0`|Average|**Manual close**: Yes|
|ResourceManager: Service response time is too high||`min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"],5m)>{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN}`|Warning|**Manual close**: Yes|
|ResourceManager: Service has been restarted|Uptime is less than 10 minutes.|`last(/Hadoop by HTTP/hadoop.resourcemanager.uptime)<10m`|Info|**Manual close**: Yes|
|ResourceManager: Failed to fetch ResourceManager API page|Zabbix has not received any data for items for the last 30 minutes.|`nodata(/Hadoop by HTTP/hadoop.resourcemanager.uptime,30m)=1`|Warning|**Manual close**: Yes|
|ResourceManager: Cluster has no active NodeManagers|Cluster is unable to execute any jobs without at least one NodeManager.|`max(/Hadoop by HTTP/hadoop.resourcemanager.num_active_nm,5m)=0`|High||
|ResourceManager: Cluster has unhealthy NodeManagers|YARN considers any node with disk utilization exceeding the value specified under the property `yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage` (in yarn-site.xml) to be unhealthy. Ample disk space is critical to ensure uninterrupted operation of a Hadoop cluster, and large numbers of unhealthy nodes (the number to alert on depends on the size of your cluster) should be quickly investigated and resolved.|`min(/Hadoop by HTTP/hadoop.resourcemanager.num_unhealthy_nm,15m)>0`|Average||
|NameNode: Service is unavailable||`last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"])=0`|Average|**Manual close**: Yes|
|NameNode: Service response time is too high||`min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"],5m)>{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN}`|Warning|**Manual close**: Yes|
|NameNode: Service has been restarted|Uptime is less than 10 minutes.|`last(/Hadoop by HTTP/hadoop.namenode.uptime)<10m`|Info|**Manual close**: Yes|
|NameNode: Failed to fetch NameNode API page|Zabbix has not received any data for items for the last 30 minutes.|`nodata(/Hadoop by HTTP/hadoop.namenode.uptime,30m)=1`|Warning|**Manual close**: Yes|
|NameNode: Cluster capacity remaining is low|A good practice is to ensure that disk use never exceeds 80 percent capacity.|`max(/Hadoop by HTTP/hadoop.namenode.percent_remaining,15m)<{$HADOOP.CAPACITY_REMAINING.MIN.WARN}`|Warning||
|NameNode: Cluster has missing blocks|A missing block is far worse than a corrupt block, because a missing block cannot be recovered by copying a replica.|`min(/Hadoop by HTTP/hadoop.namenode.missing_blocks,15m)>0`|Average||
|NameNode: Cluster has volume failures|HDFS allows disks to fail in place without affecting DataNode operation, until a threshold is reached. The threshold is set per DataNode via the `dfs.datanode.failed.volumes.tolerated` property; it defaults to 0, meaning that any volume failure will shut down the DataNode. On a production cluster where DataNodes typically have 6, 8, or 12 disks, setting this parameter to 1 or 2 is usually best practice.|`min(/Hadoop by HTTP/hadoop.namenode.volume_failures_total,15m)>0`|Average||
|NameNode: Cluster has DataNodes in Dead state|The death of a DataNode causes a flurry of network activity, as the NameNode initiates replication of the blocks lost on the dead nodes.|`min(/Hadoop by HTTP/hadoop.namenode.num_dead_data_nodes,5m)>0`|Average||

### LLD rule Node manager discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Node manager discovery||HTTP agent|hadoop.nodemanager.discovery<br>**Preprocessing**<br>JavaScript: `The text is too long. Please see the template.`|
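The discovery rule's JavaScript is too long to be shown here, but the JSON paths of the item prototypes below (`$.State`, `$.NodeManagerVersion`, `$.UsedMemoryMB`, ...) match the fields of the ResourceManager's `RMNMInfo` MBean, whose `LiveNodeManagers` attribute is a JSON-encoded string listing every NodeManager. A rough Python equivalent of what the discovery appears to do, under that assumption:

```python
import json
from urllib.request import urlopen

# Assumption: discovery is based on the RMNMInfo MBean; its LiveNodeManagers
# attribute is a JSON string describing each NodeManager known to the RM.
url = ("http://ResourceManager:8088/jmx"
       "?qry=Hadoop:service=ResourceManager,name=RMNMInfo")
with urlopen(url, timeout=10) as resp:
    bean = json.load(resp)["beans"][0]

nodes = json.loads(bean["LiveNodeManagers"])

# One {#HOSTNAME} low-level discovery macro per NodeManager.
lld = [{"{#HOSTNAME}": node["HostName"]} for node in nodes]
print(json.dumps(lld, indent=2))
```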
### Item prototypes for Node manager discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Hadoop NodeManager {#HOSTNAME}: Get stats||HTTP agent|hadoop.nodemanager.get[{#HOSTNAME}]|
|{#HOSTNAME}: RPC processing time, avg|Average time spent on processing RPC requests.|Dependent item|hadoop.nodemanager.rpc_processing_time_avg[{#HOSTNAME}]<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
| ||Dependent item|**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|{#HOSTNAME}: JVM Threads|The number of JVM threads.|Dependent item|hadoop.nodemanager.jvm.threads[{#HOSTNAME}]<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|{#HOSTNAME}: JVM Garbage collection time|The JVM garbage collection time in milliseconds.|Dependent item|hadoop.nodemanager.jvm.gc_time[{#HOSTNAME}]<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|{#HOSTNAME}: JVM Heap usage|The JVM heap usage in MBytes.|Dependent item|hadoop.nodemanager.jvm.mem_heap_used[{#HOSTNAME}]<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|{#HOSTNAME}: Uptime||Dependent item|hadoop.nodemanager.uptime[{#HOSTNAME}]<br>**Preprocessing**<br>JSON Path: `$.beans[?(@.name=='java.lang:type=Runtime')].Uptime.first()`<br>Custom multiplier: `0.001`|
|Hadoop NodeManager {#HOSTNAME}: Get raw info||Dependent item|**Preprocessing**<br>JSON Path: `$.[?(@.HostName=='{#HOSTNAME}')].first()`<br>⛔️Custom on fail: Discard value|
|{#HOSTNAME}: State|State of the node - valid values are: NEW, RUNNING, UNHEALTHY, DECOMMISSIONING, DECOMMISSIONED, LOST, REBOOTED, SHUTDOWN.|Dependent item|hadoop.nodemanager.state[{#HOSTNAME}]<br>**Preprocessing**<br>JSON Path: `$.State`<br>Discard unchanged with heartbeat: `6h`|
|{#HOSTNAME}: Version||Dependent item|hadoop.nodemanager.version[{#HOSTNAME}]<br>**Preprocessing**<br>JSON Path: `$.NodeManagerVersion`<br>Discard unchanged with heartbeat: `6h`|
|{#HOSTNAME}: Number of containers||Dependent item|**Preprocessing**<br>JSON Path: `$.NumContainers`|
|{#HOSTNAME}: Used memory||Dependent item|**Preprocessing**<br>JSON Path: `$.UsedMemoryMB`|
|{#HOSTNAME}: Available memory||Dependent item|**Preprocessing**<br>JSON Path: `$.AvailableMemoryMB`|
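The JSON paths of the JVM prototypes above are elided in this README, but Hadoop daemons expose their JVM statistics through the standard `JvmMetrics` MBean, so a per-node spot check can be made directly against a NodeManager's own JMX servlet. A sketch, assuming the default NodeManager web UI port (8042) and a hypothetical host name:

```python
import json
from urllib.request import urlopen

# Assumptions: the NodeManager web UI listens on the default port 8042 and
# the JVM items map to Hadoop's JvmMetrics MBean; the host is hypothetical.
host = "nm1.example.com"  # stands in for a discovered {#HOSTNAME}
url = f"http://{host}:8042/jmx?qry=Hadoop:service=NodeManager,name=JvmMetrics"

with urlopen(url, timeout=10) as resp:
    jvm = json.load(resp)["beans"][0]

print("JVM heap used, MB:", jvm["MemHeapUsedM"])  # cf. jvm.mem_heap_used[{#HOSTNAME}]
print("JVM GC time, ms:  ", jvm["GcTimeMillis"])  # cf. jvm.gc_time[{#HOSTNAME}]
```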
### Trigger prototypes for Node manager discovery

|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
|{#HOSTNAME}: Service has been restarted|Uptime is less than 10 minutes.|`last(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}])<10m`|Info|**Manual close**: Yes|
|{#HOSTNAME}: Failed to fetch NodeManager API page|Zabbix has not received any data for items for the last 30 minutes.|`nodata(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}],30m)=1`|Warning|**Manual close**: Yes|
|{#HOSTNAME}: NodeManager has state {ITEM.VALUE}|The state is different from normal.|`last(/Hadoop by HTTP/hadoop.nodemanager.state[{#HOSTNAME}])<>"RUNNING"`|Average||

### LLD rule Data node discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Data node discovery||HTTP agent|hadoop.datanode.discovery<br>**Preprocessing**<br>JavaScript: `The text is too long. Please see the template.`|
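As with NodeManagers, the discovery script itself is elided; the lowercase fields used by the prototypes below (`$.version`, `$.adminState`, `$.operState`) match the entries of the NameNode's `NameNodeInfo` MBean, whose `LiveNodes` attribute is a JSON-encoded string keyed by DataNode name. A rough equivalent under that assumption:

```python
import json
from urllib.request import urlopen

# Assumption: DataNode discovery is based on the NameNodeInfo MBean; its
# LiveNodes attribute is a JSON string keyed by "host[:port]" of each DataNode.
url = "http://NameNode:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo"
with urlopen(url, timeout=10) as resp:
    bean = json.load(resp)["beans"][0]

live_nodes = json.loads(bean["LiveNodes"])

# One {#HOSTNAME} macro per DataNode, with any ":port" suffix stripped.
lld = [{"{#HOSTNAME}": key.split(":")[0]} for key in live_nodes]
print(json.dumps(lld, indent=2))
```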
### Item prototypes for Data node discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Hadoop DataNode {#HOSTNAME}: Get stats||HTTP agent|hadoop.datanode.get[{#HOSTNAME}]|
|{#HOSTNAME}: Remaining|Remaining disk space.|Dependent item|hadoop.datanode.remaining[{#HOSTNAME}]<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|{#HOSTNAME}: Used|Used disk space.|Dependent item|hadoop.datanode.dfs_used[{#HOSTNAME}]<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|{#HOSTNAME}: Number of failed volumes|Number of failed storage volumes.|Dependent item|hadoop.datanode.numfailedvolumes[{#HOSTNAME}]<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|{#HOSTNAME}: JVM Threads|The number of JVM threads.|Dependent item|hadoop.datanode.jvm.threads[{#HOSTNAME}]<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|{#HOSTNAME}: JVM Garbage collection time|The JVM garbage collection time in milliseconds.|Dependent item|hadoop.datanode.jvm.gc_time[{#HOSTNAME}]<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|{#HOSTNAME}: JVM Heap usage|The JVM heap usage in MBytes.|Dependent item|hadoop.datanode.jvm.mem_heap_used[{#HOSTNAME}]<br>**Preprocessing**<br>JSON Path: `The text is too long. Please see the template.`|
|{#HOSTNAME}: Uptime||Dependent item|hadoop.datanode.uptime[{#HOSTNAME}]<br>**Preprocessing**<br>JSON Path: `$.beans[?(@.name=='java.lang:type=Runtime')].Uptime.first()`<br>Custom multiplier: `0.001`|
|Hadoop DataNode {#HOSTNAME}: Get raw info||Dependent item|**Preprocessing**<br>JSON Path: `$.[?(@.HostName=='{#HOSTNAME}')].first()`<br>⛔️Custom on fail: Discard value|
|{#HOSTNAME}: Version|DataNode software version.|Dependent item|hadoop.datanode.version[{#HOSTNAME}]<br>**Preprocessing**<br>JSON Path: `$.version`<br>Discard unchanged with heartbeat: `6h`|
|{#HOSTNAME}: Admin state|Administrative state.|Dependent item|hadoop.datanode.admin_state[{#HOSTNAME}]<br>**Preprocessing**<br>JSON Path: `$.adminState`<br>Discard unchanged with heartbeat: `6h`|
|{#HOSTNAME}: Oper state|Operational state.|Dependent item|hadoop.datanode.oper_state[{#HOSTNAME}]<br>**Preprocessing**<br>JSON Path: `$.operState`<br>Discard unchanged with heartbeat: `6h`|
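The `$.version`, `$.adminState`, and `$.operState` paths above show the shape of the per-DataNode raw-info object that the state prototypes consume; the Oper state value is what the trigger prototype below compares against `"Live"`. A small illustration with made-up values:

```python
# Illustrative raw-info object for one DataNode, shaped after the
# $.version / $.adminState / $.operState paths above (values are made up).
raw_info = {
    "version": "3.3.6",
    "adminState": "In Service",
    "operState": "Live",
}

state = raw_info["operState"]  # cf. hadoop.datanode.oper_state[{#HOSTNAME}]
if state != "Live":
    print(f"DataNode state is {state!r}: expect an Average-severity alert")
```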
### Trigger prototypes for Data node discovery

|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
|{#HOSTNAME}: Service has been restarted|Uptime is less than 10 minutes.|`last(/Hadoop by HTTP/hadoop.datanode.uptime[{#HOSTNAME}])<10m`|Info|**Manual close**: Yes|
|{#HOSTNAME}: Failed to fetch DataNode API page|Zabbix has not received any data for items for the last 30 minutes.|`nodata(/Hadoop by HTTP/hadoop.datanode.uptime[{#HOSTNAME}],30m)=1`|Warning|**Manual close**: Yes|
|{#HOSTNAME}: DataNode has state {ITEM.VALUE}|The state is different from normal.|`last(/Hadoop by HTTP/hadoop.datanode.oper_state[{#HOSTNAME}])<>"Live"`|Average||

## Feedback

Please report any issues with the template at [`https://support.zabbix.com`](https://support.zabbix.com).

You can also provide feedback, discuss the template, or ask for help at [`ZABBIX forums`](https://www.zabbix.com/forum/zabbix-suggestions-and-feedback).