Hadoop by HTTP
Overview
This template monitors Hadoop over HTTP and works without any external scripts. It collects metrics by polling the Hadoop API remotely using an HTTP agent and JSONPath preprocessing. The Zabbix server (or proxy) sends requests directly to the ResourceManager, NodeManager, NameNode, and DataNode APIs. All metrics are collected at once, thanks to Zabbix bulk data collection.
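The bulk collection idea can be illustrated with a minimal Python sketch: one raw JSON document is fetched once, and several values are then extracted from it, much as dependent items apply JSONPath preprocessing to a master HTTP agent item. The sample payload below is invented for illustration and only assumes the general shape of a Hadoop `/jmx` response (a `beans` array whose entries carry a `name`); the `extract` helper is hypothetical and is not part of the template.

```python
import json

# Hypothetical sample shaped like a Hadoop /jmx response.
# The values are invented for illustration only.
SAMPLE = json.dumps({
    "beans": [
        {
            "name": "Hadoop:service=NameNode,name=FSNamesystem",
            "MissingBlocks": 0,
            "CorruptBlocks": 2,
        },
        {"name": "java.lang:type=Runtime", "Uptime": 123456},
    ]
})


def extract(raw, bean_name, attribute):
    """Return one attribute of the bean called bean_name.

    This mimics a JSONPath filter on the 'beans' array: select the bean
    by name, then read a single field from it.
    """
    for bean in json.loads(raw)["beans"]:
        if bean.get("name") == bean_name:
            return bean[attribute]
    raise KeyError(f"bean not found: {bean_name}")


# One raw document, several derived values (the "dependent items").
corrupt = extract(SAMPLE, "Hadoop:service=NameNode,name=FSNamesystem", "CorruptBlocks")
uptime_ms = extract(SAMPLE, "java.lang:type=Runtime", "Uptime")
print(corrupt, uptime_ms)
```

Because every derived value comes from the same raw document, the monitored service is queried once per polling interval regardless of how many metrics are extracted.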
Requirements
Zabbix version: 7.0 and higher.
Tested versions
This template has been tested on:
- Hadoop 3.1 and later
Configuration
Zabbix should be configured according to the instructions in the Templates out of the box section.
Setup
Define the IP address (or FQDN) and Web-UI port of the ResourceManager in the {$HADOOP.RESOURCEMANAGER.HOST} and {$HADOOP.RESOURCEMANAGER.PORT} macros, and those of the NameNode in the {$HADOOP.NAMENODE.HOST} and {$HADOOP.NAMENODE.PORT} macros. Macros can be set in the template or overridden at the host level.
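As a rough illustration of how host-level values take precedence over template defaults, the Python sketch below merges two macro dictionaries and substitutes them into a URL. The macro names and default values are the ones documented for this template; the `/jmx` URL pattern and the `resolve` helper are assumptions made for the example, not the template's actual request definition.

```python
# Template-level defaults, as documented for this template.
TEMPLATE_MACROS = {
    "{$HADOOP.RESOURCEMANAGER.HOST}": "ResourceManager",
    "{$HADOOP.RESOURCEMANAGER.PORT}": "8088",
    "{$HADOOP.NAMENODE.HOST}": "NameNode",
    "{$HADOOP.NAMENODE.PORT}": "9870",
}

# Assumed URL pattern, used only to show macro substitution.
RM_URL = "http://{$HADOOP.RESOURCEMANAGER.HOST}:{$HADOOP.RESOURCEMANAGER.PORT}/jmx"


def resolve(text, host_macros):
    """Expand macros in text; host-level values override template defaults."""
    merged = {**TEMPLATE_MACROS, **host_macros}
    for name, value in merged.items():
        text = text.replace(name, value)
    return text


# No host-level override: template defaults are used.
print(resolve(RM_URL, {}))
# Host-level override of the ResourceManager address only.
print(resolve(RM_URL, {"{$HADOOP.RESOURCEMANAGER.HOST}": "rm1.example.com"}))
```

Here `rm1.example.com` is a placeholder host name; only the overridden macro changes, while the port still comes from the template default.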
Macros used
Name | Description | Default |
---|---|---|
{$HADOOP.RESOURCEMANAGER.HOST} | The Hadoop ResourceManager host IP address or FQDN. | ResourceManager |
{$HADOOP.RESOURCEMANAGER.PORT} | The Hadoop ResourceManager Web-UI port. | 8088 |
{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN} | The Hadoop ResourceManager API page maximum response time in seconds for trigger expression. | 10s |
{$HADOOP.NAMENODE.HOST} | The Hadoop NameNode host IP address or FQDN. | NameNode |
{$HADOOP.NAMENODE.PORT} | The Hadoop NameNode Web-UI port. | 9870 |
{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN} | The Hadoop NameNode API page maximum response time in seconds for trigger expression. | 10s |
{$HADOOP.CAPACITY_REMAINING.MIN.WARN} | The Hadoop cluster capacity remaining percent for trigger expression. | 20 |
Items
Name | Description | Type | Key and additional info |
---|---|---|---|
ResourceManager: Service status | Hadoop ResourceManager API port availability. | Simple check | net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"] Preprocessing |
ResourceManager: Service response time | Hadoop ResourceManager API performance. | Simple check | net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"] |
Hadoop: Get ResourceManager stats | | HTTP agent | hadoop.resourcemanager.get |
ResourceManager: Uptime | | Dependent item | hadoop.resourcemanager.uptime Preprocessing |
ResourceManager: Get info | | Dependent item | hadoop.resourcemanager.info Preprocessing |
ResourceManager: RPC queue & processing time | Average time spent on processing RPC requests. | Dependent item | hadoop.resourcemanager.rpc_processing_time_avg Preprocessing |
ResourceManager: Active NMs | Number of Active NodeManagers. | Dependent item | hadoop.resourcemanager.num_active_nm Preprocessing |
ResourceManager: Decommissioning NMs | Number of Decommissioning NodeManagers. | Dependent item | hadoop.resourcemanager.num_decommissioning_nm Preprocessing |
ResourceManager: Decommissioned NMs | Number of Decommissioned NodeManagers. | Dependent item | hadoop.resourcemanager.num_decommissioned_nm Preprocessing |
ResourceManager: Lost NMs | Number of Lost NodeManagers. | Dependent item | hadoop.resourcemanager.num_lost_nm Preprocessing |
ResourceManager: Unhealthy NMs | Number of Unhealthy NodeManagers. | Dependent item | hadoop.resourcemanager.num_unhealthy_nm Preprocessing |
ResourceManager: Rebooted NMs | Number of Rebooted NodeManagers. | Dependent item | hadoop.resourcemanager.num_rebooted_nm Preprocessing |
ResourceManager: Shutdown NMs | Number of Shutdown NodeManagers. | Dependent item | hadoop.resourcemanager.num_shutdown_nm Preprocessing |
NameNode: Service status | Hadoop NameNode API port availability. | Simple check | net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"] Preprocessing |
NameNode: Service response time | Hadoop NameNode API performance. | Simple check | net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"] |
Hadoop: Get NameNode stats | | HTTP agent | hadoop.namenode.get |
NameNode: Uptime | | Dependent item | hadoop.namenode.uptime Preprocessing |
NameNode: Get info | | Dependent item | hadoop.namenode.info Preprocessing |
NameNode: RPC queue & processing time | Average time spent on processing RPC requests. | Dependent item | hadoop.namenode.rpc_processing_time_avg Preprocessing |
NameNode: Block Pool Renaming | | Dependent item | hadoop.namenode.percent_block_pool_used Preprocessing |
NameNode: Transactions since last checkpoint | Total number of transactions since last checkpoint. | Dependent item | hadoop.namenode.transactions_since_last_checkpoint Preprocessing |
NameNode: Percent capacity remaining | Available capacity in percent. | Dependent item | hadoop.namenode.percent_remaining Preprocessing |
NameNode: Capacity remaining | Available capacity. | Dependent item | hadoop.namenode.capacity_remaining Preprocessing |
NameNode: Corrupt blocks | Number of corrupt blocks. | Dependent item | hadoop.namenode.corrupt_blocks Preprocessing |
NameNode: Missing blocks | Number of missing blocks. | Dependent item | hadoop.namenode.missing_blocks Preprocessing |
NameNode: Failed volumes | Number of failed volumes. | Dependent item | hadoop.namenode.volume_failures_total Preprocessing |
NameNode: Alive DataNodes | Count of alive DataNodes. | Dependent item | hadoop.namenode.num_live_data_nodes Preprocessing |
NameNode: Dead DataNodes | Count of dead DataNodes. | Dependent item | hadoop.namenode.num_dead_data_nodes Preprocessing |
NameNode: Stale DataNodes | DataNodes that do not send a heartbeat within 30 seconds are marked as "stale". | Dependent item | hadoop.namenode.num_stale_data_nodes Preprocessing |
NameNode: Total files | Total count of files tracked by the NameNode. | Dependent item | hadoop.namenode.files_total Preprocessing |
NameNode: Total load | The current number of concurrent file accesses (read/write) across all DataNodes. | Dependent item | hadoop.namenode.total_load Preprocessing |
NameNode: Blocks allocable | Maximum number of blocks allocable. | Dependent item | hadoop.namenode.block_capacity Preprocessing |
NameNode: Total blocks | Count of blocks tracked by NameNode. | Dependent item | hadoop.namenode.blocks_total Preprocessing |
NameNode: Under-replicated blocks | The number of blocks with insufficient replication. | Dependent item | hadoop.namenode.under_replicated_blocks Preprocessing |
Hadoop: Get NodeManagers states | | HTTP agent | hadoop.nodemanagers.get Preprocessing |
Hadoop: Get DataNodes states | | HTTP agent | hadoop.datanodes.get Preprocessing |
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
ResourceManager: Service is unavailable | | last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"])=0 | Average | Manual close: Yes |
ResourceManager: Service response time is too high | | min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"],5m)>{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN} | Warning | Manual close: Yes Depends on: |
ResourceManager: Service has been restarted | Uptime is less than 10 minutes. | last(/Hadoop by HTTP/hadoop.resourcemanager.uptime)<10m | Info | Manual close: Yes |
ResourceManager: Failed to fetch ResourceManager API page | Zabbix has not received any data for items for the last 30 minutes. | nodata(/Hadoop by HTTP/hadoop.resourcemanager.uptime,30m)=1 | Warning | Manual close: Yes Depends on: |
ResourceManager: Cluster has no active NodeManagers | Cluster is unable to execute any jobs without at least one NodeManager. | max(/Hadoop by HTTP/hadoop.resourcemanager.num_active_nm,5m)=0 | High | |
ResourceManager: Cluster has unhealthy NodeManagers | YARN considers any node with disk utilization exceeding the value specified under the property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage (in yarn-site.xml) to be unhealthy. Ample disk space is critical to ensure uninterrupted operation of a Hadoop cluster, and large numbers of unhealthyNodes (the number to alert on depends on the size of your cluster) should be quickly investigated and resolved. | min(/Hadoop by HTTP/hadoop.resourcemanager.num_unhealthy_nm,15m)>0 | Average | |
NameNode: Service is unavailable | | last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"])=0 | Average | Manual close: Yes |
NameNode: Service response time is too high | | min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"],5m)>{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN} | Warning | Manual close: Yes Depends on: |
NameNode: Service has been restarted | Uptime is less than 10 minutes. | last(/Hadoop by HTTP/hadoop.namenode.uptime)<10m | Info | Manual close: Yes |
NameNode: Failed to fetch NameNode API page | Zabbix has not received any data for items for the last 30 minutes. | nodata(/Hadoop by HTTP/hadoop.namenode.uptime,30m)=1 | Warning | Manual close: Yes Depends on: |
NameNode: Cluster capacity remaining is low | A good practice is to ensure that disk use never exceeds 80 percent capacity. | max(/Hadoop by HTTP/hadoop.namenode.percent_remaining,15m)<{$HADOOP.CAPACITY_REMAINING.MIN.WARN} | Warning | |
NameNode: Cluster has missing blocks | A missing block is far worse than a corrupt block, because a missing block cannot be recovered by copying a replica. | min(/Hadoop by HTTP/hadoop.namenode.missing_blocks,15m)>0 | Average | |
NameNode: Cluster has volume failures | HDFS now allows for disks to fail in place, without affecting DataNode operations, until a threshold value is reached. This is set on each DataNode via the dfs.datanode.failed.volumes.tolerated property; it defaults to 0, meaning that any volume failure will shut down the DataNode; on a production cluster where DataNodes typically have 6, 8, or 12 disks, setting this parameter to 1 or 2 is typically the best practice. | min(/Hadoop by HTTP/hadoop.namenode.volume_failures_total,15m)>0 | Average | |
NameNode: Cluster has DataNodes in Dead state | The death of a DataNode causes a flurry of network activity, as the NameNode initiates replication of blocks lost on the dead nodes. | min(/Hadoop by HTTP/hadoop.namenode.num_dead_data_nodes,5m)>0 | Average | |
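
Several expressions above wrap the item in min() or max() over a time window rather than testing the last value. The Python sketch below models why: min(...,15m)>0 holds only when every sample in the window is above zero, and max(...,15m)<threshold holds only when every sample is below the threshold, so a single transient reading does not raise a problem. This is a simplified model with hypothetical helper names and invented sample data, assuming samples are (timestamp in seconds, value) pairs; it is not how Zabbix evaluates expressions internally.

```python
def window_values(samples, now, window_s):
    """Values whose timestamp falls within the last window_s seconds."""
    return [value for ts, value in samples if now - window_s < ts <= now]


def min_above_zero(samples, now, window_s):
    """Model of min(item, window) > 0: true only if ALL samples are > 0."""
    values = window_values(samples, now, window_s)
    return bool(values) and min(values) > 0


def max_below(samples, now, window_s, threshold):
    """Model of max(item, window) < threshold: ALL samples below threshold."""
    values = window_values(samples, now, window_s)
    return bool(values) and max(values) < threshold


NOW = 900          # current time, in seconds
WINDOW = 15 * 60   # 15m window

# Missing-blocks count that briefly recovered: no problem is raised.
flapping = [(300, 3), (600, 0), (900, 2)]
# Missing-blocks count that stayed above zero for the whole window.
sustained = [(300, 1), (600, 1), (900, 4)]
# Percent capacity remaining, compared against the documented default of 20.
capacity = [(300, 18.0), (600, 19.5), (900, 17.2)]

print(min_above_zero(flapping, NOW, WINDOW))
print(min_above_zero(sustained, NOW, WINDOW))
print(max_below(capacity, NOW, WINDOW, 20))
```

The window length trades detection speed for noise suppression: the 5m windows used for dead DataNodes and active NodeManagers react faster than the 15m windows used for capacity and block counts.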
LLD rule Node manager discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Node manager discovery | | HTTP agent | hadoop.nodemanager.discovery Preprocessing |
Item prototypes for Node manager discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Hadoop NodeManager {#HOSTNAME}: Get stats | | HTTP agent | hadoop.nodemanager.get[{#HOSTNAME}] |
{#HOSTNAME}: RPC queue & processing time | Average time spent on processing RPC requests. | Dependent item | hadoop.nodemanager.rpc_processing_time_avg[{#HOSTNAME}] Preprocessing |
{#HOSTNAME}: Container launch avg duration | | Dependent item | hadoop.nodemanager.container_launch_duration_avg[{#HOSTNAME}] Preprocessing |
{#HOSTNAME}: JVM Threads | The number of JVM threads. | Dependent item | hadoop.nodemanager.jvm.threads[{#HOSTNAME}] Preprocessing |
{#HOSTNAME}: JVM Garbage collection time | The JVM garbage collection time in milliseconds. | Dependent item | hadoop.nodemanager.jvm.gc_time[{#HOSTNAME}] Preprocessing |
{#HOSTNAME}: JVM Heap usage | The JVM heap usage in MBytes. | Dependent item | hadoop.nodemanager.jvm.mem_heap_used[{#HOSTNAME}] Preprocessing |
{#HOSTNAME}: Uptime | | Dependent item | hadoop.nodemanager.uptime[{#HOSTNAME}] Preprocessing |
Hadoop NodeManager {#HOSTNAME}: Get raw info | | Dependent item | hadoop.nodemanager.raw_info[{#HOSTNAME}] Preprocessing |
{#HOSTNAME}: State | State of the node - valid values are: NEW, RUNNING, UNHEALTHY, DECOMMISSIONING, DECOMMISSIONED, LOST, REBOOTED, SHUTDOWN. | Dependent item | hadoop.nodemanager.state[{#HOSTNAME}] Preprocessing |
{#HOSTNAME}: Version | | Dependent item | hadoop.nodemanager.version[{#HOSTNAME}] Preprocessing |
{#HOSTNAME}: Number of containers | | Dependent item | hadoop.nodemanager.numcontainers[{#HOSTNAME}] Preprocessing |
{#HOSTNAME}: Used memory | | Dependent item | hadoop.nodemanager.usedmemory[{#HOSTNAME}] Preprocessing |
{#HOSTNAME}: Available memory | | Dependent item | hadoop.nodemanager.availablememory[{#HOSTNAME}] Preprocessing |
Trigger prototypes for Node manager discovery
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
{#HOSTNAME}: Service has been restarted | Uptime is less than 10 minutes. | last(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}])<10m | Info | Manual close: Yes |
{#HOSTNAME}: Failed to fetch NodeManager API page | Zabbix has not received any data for items for the last 30 minutes. | nodata(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}],30m)=1 | Warning | Manual close: Yes Depends on: |
{#HOSTNAME}: NodeManager has state {ITEM.VALUE}. | The state is different from normal. | last(/Hadoop by HTTP/hadoop.nodemanager.state[{#HOSTNAME}])<>"RUNNING" | Average | |
LLD rule Data node discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Data node discovery | | HTTP agent | hadoop.datanode.discovery Preprocessing |
Item prototypes for Data node discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Hadoop DataNode {#HOSTNAME}: Get stats | | HTTP agent | hadoop.datanode.get[{#HOSTNAME}] |
{#HOSTNAME}: Remaining | Remaining disk space. | Dependent item | hadoop.datanode.remaining[{#HOSTNAME}] Preprocessing |
{#HOSTNAME}: Used | Used disk space. | Dependent item | hadoop.datanode.dfs_used[{#HOSTNAME}] Preprocessing |
{#HOSTNAME}: Number of failed volumes | Number of failed storage volumes. | Dependent item | hadoop.datanode.numfailedvolumes[{#HOSTNAME}] Preprocessing |
{#HOSTNAME}: JVM Threads | The number of JVM threads. | Dependent item | hadoop.datanode.jvm.threads[{#HOSTNAME}] Preprocessing |
{#HOSTNAME}: JVM Garbage collection time | The JVM garbage collection time in milliseconds. | Dependent item | hadoop.datanode.jvm.gc_time[{#HOSTNAME}] Preprocessing |
{#HOSTNAME}: JVM Heap usage | The JVM heap usage in MBytes. | Dependent item | hadoop.datanode.jvm.mem_heap_used[{#HOSTNAME}] Preprocessing |
{#HOSTNAME}: Uptime | | Dependent item | hadoop.datanode.uptime[{#HOSTNAME}] Preprocessing |
Hadoop DataNode {#HOSTNAME}: Get raw info | | Dependent item | hadoop.datanode.raw_info[{#HOSTNAME}] Preprocessing |
{#HOSTNAME}: Version | DataNode software version. | Dependent item | hadoop.datanode.version[{#HOSTNAME}] Preprocessing |
{#HOSTNAME}: Admin state | Administrative state. | Dependent item | hadoop.datanode.admin_state[{#HOSTNAME}] Preprocessing |
{#HOSTNAME}: Oper state | Operational state. | Dependent item | hadoop.datanode.oper_state[{#HOSTNAME}] Preprocessing |
Trigger prototypes for Data node discovery
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
{#HOSTNAME}: Service has been restarted | Uptime is less than 10 minutes. | last(/Hadoop by HTTP/hadoop.datanode.uptime[{#HOSTNAME}])<10m | Info | Manual close: Yes |
{#HOSTNAME}: Failed to fetch DataNode API page | Zabbix has not received any data for items for the last 30 minutes. | nodata(/Hadoop by HTTP/hadoop.datanode.uptime[{#HOSTNAME}],30m)=1 | Warning | Manual close: Yes Depends on: |
{#HOSTNAME}: DataNode has state {ITEM.VALUE}. | The state is different from normal. | last(/Hadoop by HTTP/hadoop.datanode.oper_state[{#HOSTNAME}])<>"Live" | Average | |
Feedback
Please report any issues with the template at https://support.zabbix.com.
You can also provide feedback, discuss the template, or ask for help on the ZABBIX forums.