# Hadoop by HTTP

## Overview

This template monitors Hadoop over HTTP and works without any external scripts. It collects metrics by polling the Hadoop API remotely using an HTTP agent and JSONPath preprocessing. The Zabbix server (or proxy) executes direct requests to the ResourceManager, NodeManager, NameNode, and DataNode APIs. All metrics are collected at once, thanks to Zabbix bulk data collection.

## Requirements

Zabbix version: 7.0 and higher.

## Tested versions

This template has been tested on:

- Hadoop 3.1 and later

## Configuration

> Zabbix should be configured according to the instructions in the [Templates out of the box](https://www.zabbix.com/documentation/7.0/manual/config/templates_out_of_the_box) section.

## Setup

Define the IP address (or FQDN) and Web-UI port of the ResourceManager in the {$HADOOP.RESOURCEMANAGER.HOST} and {$HADOOP.RESOURCEMANAGER.PORT} macros, and of the NameNode in the {$HADOOP.NAMENODE.HOST} and {$HADOOP.NAMENODE.PORT} macros, respectively. Macros can be set in the template or overridden at the host level.

### Macros used

|Name|Description|Default|
|----|-----------|-------|
|{$HADOOP.RESOURCEMANAGER.HOST}|

The Hadoop ResourceManager host IP address or FQDN.

|`ResourceManager`| |{$HADOOP.RESOURCEMANAGER.PORT}|

The Hadoop ResourceManager Web-UI port.

|`8088`| |{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN}|

The Hadoop ResourceManager API page maximum response time in seconds for trigger expression.

|`10s`| |{$HADOOP.NAMENODE.HOST}|

The Hadoop NameNode host IP address or FQDN.

|`NameNode`| |{$HADOOP.NAMENODE.PORT}|

The Hadoop NameNode Web-UI port.

|`9870`| |{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN}|

The Hadoop NameNode API page maximum response time in seconds for trigger expression.

|`10s`| |{$HADOOP.CAPACITY_REMAINING.MIN.WARN}|

The Hadoop cluster capacity remaining percent for trigger expression.

|`20`|

### Items

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|ResourceManager: Service status|

Hadoop ResourceManager API port availability.

|Simple check|net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"]

**Preprocessing**

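The two service items above are plain TCP availability and timing probes. A minimal Python sketch of the same connect test (the host and port would come from the template macros; this is an illustration, not Zabbix's actual implementation):

```python
# Rough Python equivalent of the net.tcp.service[] / net.tcp.service.perf[]
# simple checks: attempt a TCP connect and time it.
import socket
import time

def tcp_service(host: str, port: int, timeout: float = 3.0):
    """Return (available, response_time_s) for a plain TCP connect check."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 1, time.monotonic() - start  # 1 = port accepts connections
    except OSError:
        return 0, 0.0  # 0 = unavailable, as net.tcp.service[] reports
```

The availability value feeds the "Service is unavailable" trigger, and the timing value feeds the "Service response time is too high" trigger.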
| |ResourceManager: Service response time|

Hadoop ResourceManager API performance.

|Simple check|net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"]|
|Hadoop: Get ResourceManager stats||HTTP agent|hadoop.resourcemanager.get|
|ResourceManager: Uptime||Dependent item|hadoop.resourcemanager.uptime

**Preprocessing**

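The `Hadoop: Get ResourceManager stats` HTTP agent item fetches the daemon's built-in `/jmx` endpoint once, and the dependent items pick individual values out of that payload with JSONPath preprocessing. A sketch of that extraction step against a trimmed payload (the bean and attribute names follow Hadoop's JMX output, but treat the exact JSONPath sources as illustrative):

```python
import json

# Trimmed example of what http://<rm-host>:8088/jmx returns: a "beans" array,
# one object per JMX MBean. All dependent items share this single payload.
payload = json.loads("""
{"beans": [
  {"name": "Hadoop:service=ResourceManager,name=ClusterMetrics",
   "NumActiveNMs": 3, "NumUnhealthyNMs": 0, "NumLostNMs": 1}
]}
""")

def jmx_attr(doc, bean_name, attr):
    """Plain-Python analogue of $.beans[?(@.name=='<bean>')].<attr>.first()."""
    for bean in doc["beans"]:
        if bean["name"] == bean_name:
            return bean.get(attr)
    return None

active = jmx_attr(payload,
                  "Hadoop:service=ResourceManager,name=ClusterMetrics",
                  "NumActiveNMs")
print(active)  # -> 3
```

This is why the template needs no external scripts: one HTTP request per daemon, with all per-metric work done in preprocessing.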
| |ResourceManager: Get info||Dependent item|hadoop.resourcemanager.info

**Preprocessing**

| |ResourceManager: RPC queue & processing time|

Average time spent on processing RPC requests.

|Dependent item|hadoop.resourcemanager.rpc_processing_time_avg

**Preprocessing**

| |ResourceManager: Active NMs|

Number of Active NodeManagers.

|Dependent item|hadoop.resourcemanager.num_active_nm

**Preprocessing**

| |ResourceManager: Decommissioning NMs|

Number of Decommissioning NodeManagers.

|Dependent item|hadoop.resourcemanager.num_decommissioning_nm

**Preprocessing**

| |ResourceManager: Decommissioned NMs|

Number of Decommissioned NodeManagers.

|Dependent item|hadoop.resourcemanager.num_decommissioned_nm

**Preprocessing**

| |ResourceManager: Lost NMs|

Number of Lost NodeManagers.

|Dependent item|hadoop.resourcemanager.num_lost_nm

**Preprocessing**

| |ResourceManager: Unhealthy NMs|

Number of Unhealthy NodeManagers.

|Dependent item|hadoop.resourcemanager.num_unhealthy_nm

**Preprocessing**

| |ResourceManager: Rebooted NMs|

Number of Rebooted NodeManagers.

|Dependent item|hadoop.resourcemanager.num_rebooted_nm

**Preprocessing**

| |ResourceManager: Shutdown NMs|

Number of Shutdown NodeManagers.

|Dependent item|hadoop.resourcemanager.num_shutdown_nm

**Preprocessing**

| |NameNode: Service status|

Hadoop NameNode API port availability.

|Simple check|net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"]

**Preprocessing**

| |NameNode: Service response time|

Hadoop NameNode API performance.

|Simple check|net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"]|
|Hadoop: Get NameNode stats||HTTP agent|hadoop.namenode.get|
|NameNode: Uptime||Dependent item|hadoop.namenode.uptime

**Preprocessing**

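The uptime items feed the "Service has been restarted" triggers, which simply flag uptime under 10 minutes. A sketch of the derivation, assuming the raw JMX value is a milliseconds-since-start counter (as the JVM Runtime MBean reports; the template's actual JSONPath source may differ):

```python
def uptime_seconds(uptime_ms: int) -> float:
    """Convert a raw milliseconds counter to seconds (a 0.001 multiplier step)."""
    return uptime_ms / 1000.0

def recently_restarted(uptime_s: float, threshold_s: int = 600) -> bool:
    # Mirrors the trigger: last(/Hadoop by HTTP/hadoop.namenode.uptime) < 10m
    return uptime_s < threshold_s

print(recently_restarted(uptime_seconds(120_000)))    # 2 minutes  -> True
print(recently_restarted(uptime_seconds(7_200_000)))  # 2 hours   -> False
```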
| |NameNode: Get info||Dependent item|hadoop.namenode.info

**Preprocessing**

| |NameNode: RPC queue & processing time|

Average time spent on processing RPC requests.

|Dependent item|hadoop.namenode.rpc_processing_time_avg

**Preprocessing**

| |NameNode: Percent block pool used||Dependent item|hadoop.namenode.percent_block_pool_used

**Preprocessing**

| |NameNode: Transactions since last checkpoint|

Total number of transactions since last checkpoint.

|Dependent item|hadoop.namenode.transactions_since_last_checkpoint

**Preprocessing**

| |NameNode: Percent capacity remaining|

Available capacity in percent.

|Dependent item|hadoop.namenode.percent_remaining

**Preprocessing**

| |NameNode: Capacity remaining|

Available capacity.

|Dependent item|hadoop.namenode.capacity_remaining

**Preprocessing**

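The percent-remaining item drives the "Cluster capacity remaining is low" trigger, which fires when the 15-minute maximum drops below {$HADOOP.CAPACITY_REMAINING.MIN.WARN} (20 by default). The percentage itself is just remaining capacity over total capacity; a sketch (input values are illustrative):

```python
def percent_remaining(capacity_remaining: int, capacity_total: int) -> float:
    """Available HDFS capacity as a percentage of total capacity."""
    return 100.0 * capacity_remaining / capacity_total

def capacity_low(percent: float, min_warn: float = 20.0) -> bool:
    # Mirrors: max(...percent_remaining,15m) < {$HADOOP.CAPACITY_REMAINING.MIN.WARN}
    return percent < min_warn

pct = percent_remaining(15 * 2**40, 100 * 2**40)  # 15 TiB free of 100 TiB
print(pct)           # -> 15.0
print(capacity_low(pct))  # -> True
```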
| |NameNode: Corrupt blocks|

Number of corrupt blocks.

|Dependent item|hadoop.namenode.corrupt_blocks

**Preprocessing**

| |NameNode: Missing blocks|

Number of missing blocks.

|Dependent item|hadoop.namenode.missing_blocks

**Preprocessing**

| |NameNode: Failed volumes|

Number of failed volumes.

|Dependent item|hadoop.namenode.volume_failures_total

**Preprocessing**

| |NameNode: Alive DataNodes|

Count of alive DataNodes.

|Dependent item|hadoop.namenode.num_live_data_nodes

**Preprocessing**

| |NameNode: Dead DataNodes|

Count of dead DataNodes.

|Dependent item|hadoop.namenode.num_dead_data_nodes

**Preprocessing**

| |NameNode: Stale DataNodes|

DataNodes that do not send a heartbeat within 30 seconds are marked as "stale".

|Dependent item|hadoop.namenode.num_stale_data_nodes

**Preprocessing**

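Staleness is decided from heartbeat age: a DataNode that has not reported within the stale interval (30 seconds by default) is counted here. A sketch of that classification, assuming per-node heartbeat ages in seconds similar to the `lastContact` field in the NameNode's LiveNodes JSON (field name and input shape are illustrative):

```python
def count_stale(nodes: dict, stale_interval_s: int = 30) -> int:
    """Count DataNodes whose last heartbeat is older than the stale interval."""
    return sum(1 for info in nodes.values()
               if info.get("lastContact", 0) > stale_interval_s)

live_nodes = {  # shape loosely modelled on the LiveNodes attribute
    "dn1:9866": {"lastContact": 2},
    "dn2:9866": {"lastContact": 45},  # missed heartbeats -> stale
    "dn3:9866": {"lastContact": 1},
}
print(count_stale(live_nodes))  # -> 1
```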
| |NameNode: Total files|

Total count of files tracked by the NameNode.

|Dependent item|hadoop.namenode.files_total

**Preprocessing**

| |NameNode: Total load|

The current number of concurrent file accesses (read/write) across all DataNodes.

|Dependent item|hadoop.namenode.total_load

**Preprocessing**

| |NameNode: Blocks allocable|

Maximum number of blocks allocable.

|Dependent item|hadoop.namenode.block_capacity

**Preprocessing**

| |NameNode: Total blocks|

Count of blocks tracked by NameNode.

|Dependent item|hadoop.namenode.blocks_total

**Preprocessing**

| |NameNode: Under-replicated blocks|

The number of blocks with insufficient replication.

|Dependent item|hadoop.namenode.under_replicated_blocks

**Preprocessing**

| |Hadoop: Get NodeManagers states||HTTP agent|hadoop.nodemanagers.get

**Preprocessing**

| |Hadoop: Get DataNodes states||HTTP agent|hadoop.datanodes.get

**Preprocessing**

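These two "Get ... states" master items are also what the discovery rules below consume: the node list is reshaped into Zabbix low-level discovery rows keyed by macros such as {#HOSTNAME}, and one set of item/trigger prototypes is instantiated per discovered node. A sketch of that reshaping (the input list is illustrative):

```python
import json

def to_lld(hostnames):
    """Turn a list of node hostnames into Zabbix LLD rows."""
    return json.dumps([{"{#HOSTNAME}": h} for h in hostnames])

nodes = ["nm-node1.example.com", "nm-node2.example.com"]
print(to_lld(nodes))
# -> [{"{#HOSTNAME}": "nm-node1.example.com"}, {"{#HOSTNAME}": "nm-node2.example.com"}]
```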
|

### Triggers

|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
|ResourceManager: Service is unavailable||`last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"])=0`|Average|**Manual close**: Yes|
|ResourceManager: Service response time is too high||`min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"],5m)>{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN}`|Warning|**Manual close**: Yes
**Depends on**:
| |ResourceManager: Service has been restarted|

Uptime is less than 10 minutes.

|`last(/Hadoop by HTTP/hadoop.resourcemanager.uptime)<10m`|Info|**Manual close**: Yes|
|ResourceManager: Failed to fetch ResourceManager API page|

Zabbix has not received any data for items for the last 30 minutes.

|`nodata(/Hadoop by HTTP/hadoop.resourcemanager.uptime,30m)=1`|Warning|**Manual close**: Yes
**Depends on**:
| |ResourceManager: Cluster has no active NodeManagers|

Cluster is unable to execute any jobs without at least one NodeManager.

|`max(/Hadoop by HTTP/hadoop.resourcemanager.num_active_nm,5m)=0`|High||
|ResourceManager: Cluster has unhealthy NodeManagers|

YARN considers any node whose disk utilization exceeds the value set in the yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage property (in yarn-site.xml) to be unhealthy. Ample disk space is critical for uninterrupted operation of a Hadoop cluster, so a large number of unhealthy nodes (the number to alert on depends on the size of your cluster) should be investigated and resolved quickly.

|`min(/Hadoop by HTTP/hadoop.resourcemanager.num_unhealthy_nm,15m)>0`|Average||
|NameNode: Service is unavailable||`last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"])=0`|Average|**Manual close**: Yes|
|NameNode: Service response time is too high||`min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"],5m)>{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN}`|Warning|**Manual close**: Yes
**Depends on**:
| |NameNode: Service has been restarted|

Uptime is less than 10 minutes.

|`last(/Hadoop by HTTP/hadoop.namenode.uptime)<10m`|Info|**Manual close**: Yes|
|NameNode: Failed to fetch NameNode API page|

Zabbix has not received any data for items for the last 30 minutes.

|`nodata(/Hadoop by HTTP/hadoop.namenode.uptime,30m)=1`|Warning|**Manual close**: Yes
**Depends on**:
| |NameNode: Cluster capacity remaining is low|

A good practice is to ensure that disk use never exceeds 80 percent capacity.

|`max(/Hadoop by HTTP/hadoop.namenode.percent_remaining,15m)<{$HADOOP.CAPACITY_REMAINING.MIN.WARN}`|Warning||
|NameNode: Cluster has missing blocks|

A missing block is far worse than a corrupt block, because a missing block cannot be recovered by copying a replica.

|`min(/Hadoop by HTTP/hadoop.namenode.missing_blocks,15m)>0`|Average||
|NameNode: Cluster has volume failures|

HDFS allows disks to fail in place without affecting DataNode operations, until a threshold value is reached. The threshold is set on each DataNode via the dfs.datanode.failed.volumes.tolerated property; it defaults to 0, meaning that any volume failure will shut down the DataNode. On a production cluster, where DataNodes typically have 6, 8, or 12 disks, setting this parameter to 1 or 2 is usually the best practice.

|`min(/Hadoop by HTTP/hadoop.namenode.volume_failures_total,15m)>0`|Average||
|NameNode: Cluster has DataNodes in Dead state|

The death of a DataNode causes a flurry of network activity, as the NameNode initiates replication of blocks lost on the dead nodes.

|`min(/Hadoop by HTTP/hadoop.namenode.num_dead_data_nodes,5m)>0`|Average||

### LLD rule Node manager discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Node manager discovery||HTTP agent|hadoop.nodemanager.discovery

**Preprocessing**

|

### Item prototypes for Node manager discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Hadoop NodeManager {#HOSTNAME}: Get stats||HTTP agent|hadoop.nodemanager.get[{#HOSTNAME}]|
|{#HOSTNAME}: RPC queue & processing time|

Average time spent on processing RPC requests.

|Dependent item|hadoop.nodemanager.rpc_processing_time_avg[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Container launch avg duration||Dependent item|hadoop.nodemanager.container_launch_duration_avg[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: JVM Threads|

The number of JVM threads.

|Dependent item|hadoop.nodemanager.jvm.threads[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: JVM Garbage collection time|

The JVM garbage collection time in milliseconds.

|Dependent item|hadoop.nodemanager.jvm.gc_time[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: JVM Heap usage|

The JVM heap usage in MBytes.

|Dependent item|hadoop.nodemanager.jvm.mem_heap_used[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Uptime||Dependent item|hadoop.nodemanager.uptime[{#HOSTNAME}]

**Preprocessing**

| |Hadoop NodeManager {#HOSTNAME}: Get raw info||Dependent item|hadoop.nodemanager.raw_info[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: State|

State of the node - valid values are: NEW, RUNNING, UNHEALTHY, DECOMMISSIONING, DECOMMISSIONED, LOST, REBOOTED, SHUTDOWN.

|Dependent item|hadoop.nodemanager.state[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Version||Dependent item|hadoop.nodemanager.version[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Number of containers||Dependent item|hadoop.nodemanager.numcontainers[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Used memory||Dependent item|hadoop.nodemanager.usedmemory[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Available memory||Dependent item|hadoop.nodemanager.availablememory[{#HOSTNAME}]

**Preprocessing**

|

### Trigger prototypes for Node manager discovery

|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
|{#HOSTNAME}: Service has been restarted|

Uptime is less than 10 minutes.

|`last(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}])<10m`|Info|**Manual close**: Yes|
|{#HOSTNAME}: Failed to fetch NodeManager API page|

Zabbix has not received any data for items for the last 30 minutes.

|`nodata(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}],30m)=1`|Warning|**Manual close**: Yes
**Depends on**:
| |{#HOSTNAME}: NodeManager has state {ITEM.VALUE}.|

The state is different from normal.

|`last(/Hadoop by HTTP/hadoop.nodemanager.state[{#HOSTNAME}])<>"RUNNING"`|Average||

### LLD rule Data node discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Data node discovery||HTTP agent|hadoop.datanode.discovery

**Preprocessing**

|

### Item prototypes for Data node discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Hadoop DataNode {#HOSTNAME}: Get stats||HTTP agent|hadoop.datanode.get[{#HOSTNAME}]|
|{#HOSTNAME}: Remaining|

Remaining disk space.

|Dependent item|hadoop.datanode.remaining[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Used|

Used disk space.

|Dependent item|hadoop.datanode.dfs_used[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Number of failed volumes|

Number of failed storage volumes.

|Dependent item|hadoop.datanode.numfailedvolumes[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: JVM Threads|

The number of JVM threads.

|Dependent item|hadoop.datanode.jvm.threads[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: JVM Garbage collection time|

The JVM garbage collection time in milliseconds.

|Dependent item|hadoop.datanode.jvm.gc_time[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: JVM Heap usage|

The JVM heap usage in MBytes.

|Dependent item|hadoop.datanode.jvm.mem_heap_used[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Uptime||Dependent item|hadoop.datanode.uptime[{#HOSTNAME}]

**Preprocessing**

| |Hadoop DataNode {#HOSTNAME}: Get raw info||Dependent item|hadoop.datanode.raw_info[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Version|

DataNode software version.

|Dependent item|hadoop.datanode.version[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Admin state|

Administrative state.

|Dependent item|hadoop.datanode.admin_state[{#HOSTNAME}]

**Preprocessing**

| |{#HOSTNAME}: Oper state|

Operational state.

|Dependent item|hadoop.datanode.oper_state[{#HOSTNAME}]

**Preprocessing**

|

### Trigger prototypes for Data node discovery

|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
|{#HOSTNAME}: Service has been restarted|

Uptime is less than 10 minutes.

|`last(/Hadoop by HTTP/hadoop.datanode.uptime[{#HOSTNAME}])<10m`|Info|**Manual close**: Yes|
|{#HOSTNAME}: Failed to fetch DataNode API page|

Zabbix has not received any data for items for the last 30 minutes.

|`nodata(/Hadoop by HTTP/hadoop.datanode.uptime[{#HOSTNAME}],30m)=1`|Warning|**Manual close**: Yes
**Depends on**:
| |{#HOSTNAME}: DataNode has state {ITEM.VALUE}.|

The state is different from normal.

|`last(/Hadoop by HTTP/hadoop.datanode.oper_state[{#HOSTNAME}])<>"Live"`|Average||

## Feedback

Please report any issues with the template at [`https://support.zabbix.com`](https://support.zabbix.com).

You can also provide feedback, discuss the template, or ask for help at [`ZABBIX forums`](https://www.zabbix.com/forum/zabbix-suggestions-and-feedback).