# Hadoop by HTTP

## Overview

This template monitors Hadoop over HTTP and works without any external scripts. It collects metrics by polling the Hadoop API remotely using an HTTP agent and JSONPath preprocessing. The Zabbix server (or proxy) executes direct requests to the ResourceManager, NodeManager, NameNode, and DataNode APIs. All metrics are collected at once, thanks to Zabbix bulk data collection.

## Requirements

Zabbix version: 7.0 and higher.

## Tested versions

This template has been tested on:

- Hadoop 3.1 and later

## Configuration

> Zabbix should be configured according to the instructions in the [Templates out of the box](https://www.zabbix.com/documentation/7.0/manual/config/templates_out_of_the_box) section.

## Setup

Define the IP address (or FQDN) and Web-UI port of the ResourceManager in the {$HADOOP.RESOURCEMANAGER.HOST} and {$HADOOP.RESOURCEMANAGER.PORT} macros, and of the NameNode in the {$HADOOP.NAMENODE.HOST} and {$HADOOP.NAMENODE.PORT} macros, respectively. Macros can be set in the template or overridden at the host level.

### Macros used

|Name|Description|Default|
|----|-----------|-------|
|{$HADOOP.RESOURCEMANAGER.HOST}|

The Hadoop ResourceManager host IP address or FQDN.

|`ResourceManager`| |{$HADOOP.RESOURCEMANAGER.PORT}|

The Hadoop ResourceManager Web-UI port.

|`8088`| |{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN}|

The Hadoop ResourceManager API page maximum response time in seconds for trigger expression.

|`10s`| |{$HADOOP.NAMENODE.HOST}|

The Hadoop NameNode host IP address or FQDN.

|`NameNode`| |{$HADOOP.NAMENODE.PORT}|

The Hadoop NameNode Web-UI port.

|`9870`| |{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN}|

The Hadoop NameNode API page maximum response time in seconds for trigger expression.

|`10s`| |{$HADOOP.CAPACITY_REMAINING.MIN.WARN}|

The Hadoop cluster capacity remaining percent for trigger expression.

|`20`|

### Items

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|ResourceManager: Service status|

Hadoop ResourceManager API port availability.

|Simple check|net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"]

**Preprocessing**

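The two service items above are plain TCP availability and timing probes. A minimal Python sketch of the same connect test (the host and port would come from the template macros; this is an illustration, not Zabbix's actual implementation):

```python
# Rough Python equivalent of the net.tcp.service[] / net.tcp.service.perf[]
# simple checks: attempt a TCP connect and time it.
import socket
import time

def tcp_service(host: str, port: int, timeout: float = 3.0):
    """Return (available, response_time_s) for a plain TCP connect check."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 1, time.monotonic() - start  # 1 = port accepts connections
    except OSError:
        return 0, 0.0  # 0 = unavailable, as net.tcp.service[] reports
```

The availability value feeds the "Service is unavailable" trigger, and the timing value feeds the "Service response time is too high" trigger.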
| |ResourceManager: Service response time|

Hadoop ResourceManager API performance.

|Simple check|net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"]|
|Hadoop: Get ResourceManager stats||HTTP agent|hadoop.resourcemanager.get|
|ResourceManager: Uptime||Dependent item|hadoop.resourcemanager.uptime

**Preprocessing**

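The `Hadoop: Get ResourceManager stats` HTTP agent item fetches the daemon's built-in `/jmx` endpoint once, and the dependent items pick individual values out of that payload with JSONPath preprocessing. A sketch of that extraction step against a trimmed payload (the bean and attribute names follow Hadoop's JMX output, but treat the exact JSONPath sources as illustrative):

```python
import json

# Trimmed example of what http://<rm-host>:8088/jmx returns: a "beans" array,
# one object per JMX MBean. All dependent items share this single payload.
payload = json.loads("""
{"beans": [
  {"name": "Hadoop:service=ResourceManager,name=ClusterMetrics",
   "NumActiveNMs": 3, "NumUnhealthyNMs": 0, "NumLostNMs": 1}
]}
""")

def jmx_attr(doc, bean_name, attr):
    """Plain-Python analogue of $.beans[?(@.name=='<bean>')].<attr>.first()."""
    for bean in doc["beans"]:
        if bean["name"] == bean_name:
            return bean.get(attr)
    return None

active = jmx_attr(payload,
                  "Hadoop:service=ResourceManager,name=ClusterMetrics",
                  "NumActiveNMs")
print(active)  # -> 3
```

This is why the template needs no external scripts: one HTTP request per daemon, with all per-metric work done in preprocessing.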
| |ResourceManager: Get info||Dependent item|hadoop.resourcemanager.info

**Preprocessing**

| |ResourceManager: RPC queue & processing time|

Average time spent on processing RPC requests.

|Dependent item|hadoop.resourcemanager.rpc_processing_time_avg

**Preprocessing**

| |ResourceManager: Active NMs|

Number of Active NodeManagers.

|Dependent item|hadoop.resourcemanager.num_active_nm

**Preprocessing**

| |ResourceManager: Decommissioning NMs|

Number of Decommissioning NodeManagers.

|Dependent item|hadoop.resourcemanager.num_decommissioning_nm

**Preprocessing**

| |ResourceManager: Decommissioned NMs|

Number of Decommissioned NodeManagers.

|Dependent item|hadoop.resourcemanager.num_decommissioned_nm

**Preprocessing**

| |ResourceManager: Lost NMs|

Number of Lost NodeManagers.

|Dependent item|hadoop.resourcemanager.num_lost_nm

**Preprocessing**

| |ResourceManager: Unhealthy NMs|

Number of Unhealthy NodeManagers.

|Dependent item|hadoop.resourcemanager.num_unhealthy_nm

**Preprocessing**

| |ResourceManager: Rebooted NMs|

Number of Rebooted NodeManagers.

|Dependent item|hadoop.resourcemanager.num_rebooted_nm

**Preprocessing**

| |ResourceManager: Shutdown NMs|

Number of Shutdown NodeManagers.

|Dependent item|hadoop.resourcemanager.num_shutdown_nm

**Preprocessing**

| |NameNode: Service status|

Hadoop NameNode API port availability.

|Simple check|net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"]

**Preprocessing**

| |NameNode: Service response time|

Hadoop NameNode API performance.

|Simple check|net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"]|
|Hadoop: Get NameNode stats||HTTP agent|hadoop.namenode.get|
|NameNode: Uptime||Dependent item|hadoop.namenode.uptime

**Preprocessing**

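The uptime items feed the "Service has been restarted" triggers, which simply flag uptime under 10 minutes. A sketch of the derivation, assuming the raw JMX value is a milliseconds-since-start counter (as the JVM Runtime MBean reports; the template's actual JSONPath source may differ):

```python
def uptime_seconds(uptime_ms: int) -> float:
    """Convert a raw milliseconds counter to seconds (a 0.001 multiplier step)."""
    return uptime_ms / 1000.0

def recently_restarted(uptime_s: float, threshold_s: int = 600) -> bool:
    # Mirrors the trigger: last(/Hadoop by HTTP/hadoop.namenode.uptime) < 10m
    return uptime_s < threshold_s

print(recently_restarted(uptime_seconds(120_000)))    # 2 minutes  -> True
print(recently_restarted(uptime_seconds(7_200_000)))  # 2 hours   -> False
```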
| |NameNode: Get info||Dependent item|hadoop.namenode.info

**Preprocessing**

| |NameNode: RPC queue & processing time|

Average time spent on processing RPC requests.

|Dependent item|hadoop.namenode.rpc_processing_time_avg

**Preprocessing**

| |NameNode: Percent block pool used||Dependent item|hadoop.namenode.percent_block_pool_used

**Preprocessing**

| |NameNode: Transactions since last checkpoint|

Total number of transactions since last checkpoint.

|Dependent item|hadoop.namenode.transactions_since_last_checkpoint

**Preprocessing**

| |NameNode: Percent capacity remaining|

Available capacity in percent.

|Dependent item|hadoop.namenode.percent_remaining

**Preprocessing**

| |NameNode: Capacity remaining|

Available capacity.

|Dependent item|hadoop.namenode.capacity_remaining

**Preprocessing**

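The percent-remaining item drives the "Cluster capacity remaining is low" trigger, which fires when the 15-minute maximum drops below {$HADOOP.CAPACITY_REMAINING.MIN.WARN} (20 by default). The percentage itself is just remaining capacity over total capacity; a sketch (input values are illustrative):

```python
def percent_remaining(capacity_remaining: int, capacity_total: int) -> float:
    """Available HDFS capacity as a percentage of total capacity."""
    return 100.0 * capacity_remaining / capacity_total

def capacity_low(percent: float, min_warn: float = 20.0) -> bool:
    # Mirrors: max(...percent_remaining,15m) < {$HADOOP.CAPACITY_REMAINING.MIN.WARN}
    return percent < min_warn

pct = percent_remaining(15 * 2**40, 100 * 2**40)  # 15 TiB free of 100 TiB
print(pct)           # -> 15.0
print(capacity_low(pct))  # -> True
```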
| |NameNode: Corrupt blocks|

Number of corrupt blocks.

|Dependent item|hadoop.namenode.corrupt_blocks

**Preprocessing**

| |NameNode: Missing blocks|

Number of missing blocks.

|Dependent item|hadoop.namenode.missing_blocks

**Preprocessing**

| |NameNode: Failed volumes|

Number of failed volumes.

|Dependent item|hadoop.namenode.volume_failures_total

**Preprocessing**

| |NameNode: Alive DataNodes|

Count of alive DataNodes.

|Dependent item|hadoop.namenode.num_live_data_nodes

**Preprocessing**

| |NameNode: Dead DataNodes|

Count of dead DataNodes.

|Dependent item|hadoop.namenode.num_dead_data_nodes

**Preprocessing**

| |NameNode: Stale DataNodes|

DataNodes that do not send a heartbeat within 30 seconds are marked as "stale".

|Dependent item|hadoop.namenode.num_stale_data_nodes

**Preprocessing**

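Staleness is decided from heartbeat age: a DataNode that has not reported within the stale interval (30 seconds by default) is counted here. A sketch of that classification, assuming per-node heartbeat ages in seconds similar to the `lastContact` field in the NameNode's LiveNodes JSON (field name and input shape are illustrative):

```python
def count_stale(nodes: dict, stale_interval_s: int = 30) -> int:
    """Count DataNodes whose last heartbeat is older than the stale interval."""
    return sum(1 for info in nodes.values()
               if info.get("lastContact", 0) > stale_interval_s)

live_nodes = {  # shape loosely modelled on the LiveNodes attribute
    "dn1:9866": {"lastContact": 2},
    "dn2:9866": {"lastContact": 45},  # missed heartbeats -> stale
    "dn3:9866": {"lastContact": 1},
}
print(count_stale(live_nodes))  # -> 1
```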
| |NameNode: Total files|

Total count of files tracked by the NameNode.

|Dependent item|hadoop.namenode.files_total

**Preprocessing**

| |NameNode: Total load|

The current number of concurrent file accesses (read/write) across all DataNodes.

|Dependent item|hadoop.namenode.total_load

**Preprocessing**

| |NameNode: Blocks allocable|

Maximum number of blocks allocable.

|Dependent item|hadoop.namenode.block_capacity

**Preprocessing**

| |NameNode: Total blocks|

Count of blocks tracked by NameNode.

|Dependent item|hadoop.namenode.blocks_total

**Preprocessing**

| |NameNode: Under-replicated blocks|

The number of blocks with insufficient replication.

|Dependent item|hadoop.namenode.under_replicated_blocks

**Preprocessing**

| |Hadoop: Get NodeManagers states||HTTP agent|hadoop.nodemanagers.get

**Preprocessing**

| |Hadoop: Get DataNodes states||HTTP agent|hadoop.datanodes.get

**Preprocessing**

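These two "Get ... states" master items are also what the discovery rules below consume: the node list is reshaped into Zabbix low-level discovery rows keyed by macros such as {#HOSTNAME}, and one set of item/trigger prototypes is instantiated per discovered node. A sketch of that reshaping (the input list is illustrative):

```python
import json

def to_lld(hostnames):
    """Turn a list of node hostnames into Zabbix LLD rows."""
    return json.dumps([{"{#HOSTNAME}": h} for h in hostnames])

nodes = ["nm-node1.example.com", "nm-node2.example.com"]
print(to_lld(nodes))
# -> [{"{#HOSTNAME}": "nm-node1.example.com"}, {"{#HOSTNAME}": "nm-node2.example.com"}]
```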
|

### Triggers

|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
|ResourceManager: Service is unavailable||`last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"])=0`|Average|**Manual close**: Yes|
|ResourceManager: Service response time is too high||`min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"],5m)>{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN}`|Warning|**Manual close**: Yes
**Depends on**:
| |ResourceManager: Service has been restarted|

Uptime is less than 10 minutes.

|`last(/Hadoop by HTTP/hadoop.resourcemanager.uptime)<10m`|Info|**Manual close**: Yes|
|ResourceManager: Failed to fetch ResourceManager API page|

Zabbix has not received any data for items for the last 30 minutes.

|`nodata(/Hadoop by HTTP/hadoop.resourcemanager.uptime,30m)=1`|Warning|**Manual close**: Yes
**Depends on**:
| |ResourceManager: Cluster has no active NodeManagers|

Cluster is unable to execute any jobs without at least one NodeManager.

|`max(/Hadoop by HTTP/hadoop.resourcemanager.num_active_nm,5m)=0`|High||
|ResourceManager: Cluster has unhealthy NodeManagers|

YARN considers any node whose disk utilization exceeds the value set in the yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage property (in yarn-site.xml) to be unhealthy. Ample disk space is critical for uninterrupted operation of a Hadoop cluster, so a large number of unhealthy nodes (the number to alert on depends on the size of your cluster) should be investigated and resolved quickly.

|`min(/Hadoop by HTTP/hadoop.resourcemanager.num_unhealthy_nm,15m)>0`|Average||
|NameNode: Service is unavailable||`last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"])=0`|Average|**Manual close**: Yes|
|NameNode: Service response time is too high||`min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"],5m)>{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN}`|Warning|**Manual close**: Yes
**Depends on**:
| |NameNode: Service has been restarted|

Uptime is less than 10 minutes.

|`last(/Hadoop by HTTP/hadoop.namenode.uptime)<10m`|Info|**Manual close**: Yes|
|NameNode: Failed to fetch NameNode API page|

Zabbix has not received any data for items for the last 30 minutes.

|`nodata(/Hadoop by HTTP/hadoop.namenode.uptime,30m)=1`|Warning|**Manual close**: Yes
**Depends on**:
| |NameNode: Cluster capacity remaining is low|

A good practice is to ensure that disk use never exceeds 80 percent capacity.

|`max(/Hadoop by HTTP/hadoop.namenode.percent_remaining,15m)<{$HADOOP.CAPACITY_REMAINING.MIN.WARN}`|Warning||
|NameNode: Cluster has missing blocks|

A missing block is far worse than a corrupt block, because a missing block cannot be recovered by copying a replica.

|`min(/Hadoop by HTTP/hadoop.namenode.missing_blocks,15m)>0`|Average||
|NameNode: Cluster has volume failures|

HDFS allows disks to fail in place without affecting DataNode operations, until a threshold value is reached. The threshold is set on each DataNode via the dfs.datanode.failed.volumes.tolerated property; it defaults to 0, meaning that any volume failure will shut down the DataNode. On a production cluster, where DataNodes typically have 6, 8, or 12 disks, setting this parameter to 1 or 2 is usually the best practice.

|`min(/Hadoop by HTTP/hadoop.namenode.volume_failures_total,15m)>0`|Average||
|NameNode: Cluster has DataNodes in Dead state|

The death of a DataNode causes a flurry of network activity, as the NameNode initiates replication of blocks lost on the dead nodes.

|`min(/Hadoop by HTTP/hadoop.namenode.num_dead_data_nodes,5m)>0`|Average||

### LLD rule Node manager discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Node manager discovery||HTTP agent|hadoop.nodemanager.discovery

**Preprocessing**

|

### Item prototypes for Node manager discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Hadoop NodeManager {#HOSTNAME}: Get stats||HTTP agent|hadoop.nodemanager.get[{#HOSTNAME}]|
|{#HOSTNAME}: RPC queue & processing time|

Average time spent on processing RPC requests.

|Dependent item|hadoop.nodemanager.rpc_processing_time_avg[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Container launch avg duration||Dependent item|hadoop.nodemanager.container_launch_duration_avg[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: JVM Threads|

The number of JVM threads.

|Dependent item|hadoop.nodemanager.jvm.threads[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: JVM Garbage collection time|

The JVM garbage collection time in milliseconds.

|Dependent item|hadoop.nodemanager.jvm.gc_time[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: JVM Heap usage|

The JVM heap usage in MBytes.

|Dependent item|hadoop.nodemanager.jvm.mem_heap_used[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Uptime||Dependent item|hadoop.nodemanager.uptime[{#HOSTNAME}]

**Preprocessing**

| |Hadoop NodeManager {#HOSTNAME}: Get raw info||Dependent item|hadoop.nodemanager.raw_info[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: State|

State of the node - valid values are: NEW, RUNNING, UNHEALTHY, DECOMMISSIONING, DECOMMISSIONED, LOST, REBOOTED, SHUTDOWN.

|Dependent item|hadoop.nodemanager.state[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Version||Dependent item|hadoop.nodemanager.version[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Number of containers||Dependent item|hadoop.nodemanager.numcontainers[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Used memory||Dependent item|hadoop.nodemanager.usedmemory[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Available memory||Dependent item|hadoop.nodemanager.availablememory[{#HOSTNAME}]

**Preprocessing**

|

### Trigger prototypes for Node manager discovery

|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
|{#HOSTNAME}: Service has been restarted|

Uptime is less than 10 minutes.

|`last(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}])<10m`|Info|**Manual close**: Yes|
|{#HOSTNAME}: Failed to fetch NodeManager API page|

Zabbix has not received any data for items for the last 30 minutes.

|`nodata(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}],30m)=1`|Warning|**Manual close**: Yes
**Depends on**:
| |{#HOSTNAME}: NodeManager has state {ITEM.VALUE}.|

The state is different from normal.

|`last(/Hadoop by HTTP/hadoop.nodemanager.state[{#HOSTNAME}])<>"RUNNING"`|Average||

### LLD rule Data node discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Data node discovery||HTTP agent|hadoop.datanode.discovery

**Preprocessing**

|

### Item prototypes for Data node discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Hadoop DataNode {#HOSTNAME}: Get stats||HTTP agent|hadoop.datanode.get[{#HOSTNAME}]|
|{#HOSTNAME}: Remaining|

Remaining disk space.

|Dependent item|hadoop.datanode.remaining[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Used|

Used disk space.

|Dependent item|hadoop.datanode.dfs_used[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Number of failed volumes|

Number of failed storage volumes.

|Dependent item|hadoop.datanode.numfailedvolumes[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: JVM Threads|

The number of JVM threads.

|Dependent item|hadoop.datanode.jvm.threads[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: JVM Garbage collection time|

The JVM garbage collection time in milliseconds.

|Dependent item|hadoop.datanode.jvm.gc_time[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: JVM Heap usage|

The JVM heap usage in MBytes.

|Dependent item|hadoop.datanode.jvm.mem_heap_used[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Uptime||Dependent item|hadoop.datanode.uptime[{#HOSTNAME}]

**Preprocessing**

| |Hadoop DataNode {#HOSTNAME}: Get raw info||Dependent item|hadoop.datanode.raw_info[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Version|

DataNode software version.

|Dependent item|hadoop.datanode.version[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Admin state|

Administrative state.

|Dependent item|hadoop.datanode.admin_state[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Oper state|

Operational state.

|Dependent item|hadoop.datanode.oper_state[{#HOSTNAME}]

**Preprocessing**

|

### Trigger prototypes for Data node discovery

|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
|{#HOSTNAME}: Service has been restarted|

Uptime is less than 10 minutes.

|`last(/Hadoop by HTTP/hadoop.datanode.uptime[{#HOSTNAME}])<10m`|Info|**Manual close**: Yes|
|{#HOSTNAME}: Failed to fetch DataNode API page|

Zabbix has not received any data for items for the last 30 minutes.

|`nodata(/Hadoop by HTTP/hadoop.datanode.uptime[{#HOSTNAME}],30m)=1`|Warning|**Manual close**: Yes
**Depends on**:
| |{#HOSTNAME}: DataNode has state {ITEM.VALUE}.|

The state is different from normal.

|`last(/Hadoop by HTTP/hadoop.datanode.oper_state[{#HOSTNAME}])<>"Live"`|Average||

## Feedback

Please report any issues with the template at [`https://support.zabbix.com`](https://support.zabbix.com).

You can also provide feedback, discuss the template, or ask for help at [`ZABBIX forums`](https://www.zabbix.com/forum/zabbix-suggestions-and-feedback).