# Envoy Proxy by HTTP

## Overview

This template monitors Envoy Proxy with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

The template `Envoy Proxy by HTTP` collects metrics with the HTTP agent from the {$ENVOY.METRICS.PATH} endpoint (default: `/stats/prometheus`).

## Requirements

Zabbix version: 7.0 and higher.

## Tested versions

This template has been tested on:

- Envoy Proxy 1.20.2

## Configuration

> Zabbix should be configured according to the instructions in the [Templates out of the box](https://www.zabbix.com/documentation/7.0/manual/config/templates_out_of_the_box) section.

## Setup

Internal service metrics are collected from the {$ENVOY.METRICS.PATH} endpoint (default: `/stats/prometheus`); see https://www.envoyproxy.io/docs/envoy/v1.20.0/operations/stats_overview for an overview of Envoy statistics.

Don't forget to set the {$ENVOY.URL} and {$ENVOY.METRICS.PATH} macros. Also, see the Macros section for a list of macros used to set trigger values.

*NOTE.* Some metrics may not be collected depending on your Envoy Proxy instance version and configuration.

### Macros used

|Name|Description|Default|
|----|-----------|-------|
|{$ENVOY.URL}|Instance URL.|`http://localhost:9901`|
|{$ENVOY.METRICS.PATH}|The path Zabbix will scrape Prometheus-format metrics from.|`/stats/prometheus`|
|{$ENVOY.CERT.MIN}|Minimum number of days before certificate expiration, used in the trigger expression.|`7`|

### Items

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Envoy Proxy: Get node metrics|Get server metrics.|HTTP agent|envoy.get_metrics<br>**Preprocessing**<br>Check for not supported value<br>⛔️Custom on fail: Discard value|
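The dependent items below extract single values from this master item's payload via `Prometheus pattern` preprocessing. As a rough illustration (not Zabbix's actual implementation), a step such as `VALUE(envoy_server_uptime)` behaves like the sketch below; the sample payload is an assumed shape for the text exposition format:

```python
import re

# Assumed sample of the text-format payload served at {$ENVOY.URL}{$ENVOY.METRICS.PATH}.
SAMPLE = """\
# TYPE envoy_server_uptime gauge
envoy_server_uptime{} 12345
# TYPE envoy_server_live gauge
envoy_server_live{} 1
"""

def prometheus_value(payload: str, metric: str) -> float:
    """Roughly what a `Prometheus pattern: VALUE(<metric>)` step extracts."""
    # Match the metric name, an optional label set, then the sample value.
    pattern = re.compile(rf"^{re.escape(metric)}(?:\{{[^}}]*\}})?\s+(\S+)", re.M)
    match = pattern.search(payload)
    if match is None:
        raise LookupError(f"metric {metric!r} not found in payload")
    return float(match.group(1))

print(prometheus_value(SAMPLE, "envoy_server_uptime"))  # -> 12345.0
```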
State of the server.
Live - (default) Server is live and serving traffic.
Draining - Server is draining listeners in response to external health checks failing.
Pre initializing - Server has not yet completed cluster manager initialization.
Initializing - Server is running the cluster manager initialization callbacks (e.g., RDS).
|Dependent item|envoy.server.state**Preprocessing**
Prometheus pattern: `VALUE(envoy_server_state)`
Discard unchanged with heartbeat: `3h`
1 if the server is not currently draining, 0 otherwise.
|Dependent item|envoy.server.live**Preprocessing**
Prometheus pattern: `VALUE(envoy_server_live)`
Discard unchanged with heartbeat: `3h`
Current server uptime in seconds.
|Dependent item|envoy.server.uptime**Preprocessing**
Prometheus pattern: `VALUE(envoy_server_uptime)`
⛔️Custom on fail: Discard value
Number of days until the next certificate being managed will expire.
|Dependent item|envoy.server.days_until_first_cert_expiring**Preprocessing**
Prometheus pattern: `VALUE(envoy_server_days_until_first_cert_expiring)`
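The {$ENVOY.CERT.MIN} macro is compared against this item in the certificate-expiry trigger. A minimal sketch of that comparison (the helper name is hypothetical):

```python
def cert_expiring_soon(days_left: float, min_days: float = 7.0) -> bool:
    """True when days-until-expiry drops below {$ENVOY.CERT.MIN} (default 7)."""
    return days_left < min_days

print(cert_expiring_soon(3.0), cert_expiring_soon(30.0))  # -> True False
```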
Number of worker threads.
|Dependent item|envoy.server.concurrency**Preprocessing**
Prometheus pattern: `VALUE(envoy_server_concurrency)`
Current amount of allocated memory in bytes. Total of both new and old Envoy processes on hot restart.
|Dependent item|envoy.server.memory_allocated**Preprocessing**
Prometheus pattern: `VALUE(envoy_server_memory_allocated)`
Current reserved heap size in bytes. New Envoy process heap size on hot restart.
|Dependent item|envoy.server.memory_heap_size**Preprocessing**
Prometheus pattern: `VALUE(envoy_server_memory_heap_size)`
Current estimate of total bytes of the physical memory. New Envoy process physical memory size on hot restart.
|Dependent item|envoy.server.memory_physical_size**Preprocessing**
Prometheus pattern: `VALUE(envoy_server_memory_physical_size)`
Total number of times internal flush buffers are written to a file due to a flush timeout, per second.
|Dependent item|envoy.filesystem.flushed_by_timer.rate**Preprocessing**
Prometheus pattern: `VALUE(envoy_filesystem_flushed_by_timer)`
Total number of times a file was written per second.
|Dependent item|envoy.filesystem.write_completed.rate**Preprocessing**
Prometheus pattern: `VALUE(envoy_filesystem_write_completed)`
Total number of times an error occurred during a file write operation per second.
|Dependent item|envoy.filesystem.write_failed.rate**Preprocessing**
Prometheus pattern: `VALUE(envoy_filesystem_write_failed)`
Total number of times a file failed to be opened, per second.
|Dependent item|envoy.filesystem.reopen_failed.rate**Preprocessing**
Prometheus pattern: `VALUE(envoy_filesystem_reopen_failed)`
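The `.rate` items above expose monotonic counters converted to per-second rates, presumably via Zabbix's `Change per second` preprocessing step. That conversion amounts to a counter delta over elapsed time:

```python
def change_per_second(prev_value: float, curr_value: float,
                      prev_ts: float, curr_ts: float) -> float:
    """Counter delta divided by elapsed seconds, as `Change per second` computes it."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("timestamps must be strictly increasing")
    return (curr_value - prev_value) / elapsed

# e.g. envoy_filesystem_write_completed moved from 100 to 160 over a 30 s poll interval
print(change_per_second(100, 160, 1000.0, 1030.0))  # -> 2.0
```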
Total connections of both new and old Envoy processes.
|Dependent item|envoy.server.total_connections**Preprocessing**
Prometheus pattern: `VALUE(envoy_server_total_connections)`
Total connections of the old Envoy process on hot restart.
|Dependent item|envoy.server.parent_connections**Preprocessing**
Prometheus pattern: `VALUE(envoy_server_parent_connections)`
Number of currently warming (not active) clusters.
|Dependent item|envoy.cluster_manager.warming_clusters**Preprocessing**
Prometheus pattern: `VALUE(envoy_cluster_manager_warming_clusters)`
Number of currently active (warmed) clusters.
|Dependent item|envoy.cluster_manager.active_clusters**Preprocessing**
Prometheus pattern: `VALUE(envoy_cluster_manager_active_clusters)`
Total clusters added (either via static config or CDS) per second.
|Dependent item|envoy.cluster_manager.cluster_added.rate**Preprocessing**
Prometheus pattern: `VALUE(envoy_cluster_manager_cluster_added)`
Total clusters modified (via CDS) per second.
|Dependent item|envoy.cluster_manager.cluster_modified.rate**Preprocessing**
Prometheus pattern: `VALUE(envoy_cluster_manager_cluster_modified)`
Total clusters removed (via CDS) per second.
|Dependent item|envoy.cluster_manager.cluster_removed.rate**Preprocessing**
Prometheus pattern: `VALUE(envoy_cluster_manager_cluster_removed)`
Total cluster updates per second.
|Dependent item|envoy.cluster_manager.cluster_updated.rate**Preprocessing**
Prometheus pattern: `VALUE(envoy_cluster_manager_cluster_updated)`
Number of currently active listeners.
|Dependent item|envoy.listener_manager.total_listeners_active**Preprocessing**
Prometheus pattern: `SUM(envoy_listener_manager_total_listeners_active)`
Number of currently draining listeners.
|Dependent item|envoy.listener_manager.total_listeners_draining**Preprocessing**
Prometheus pattern: `SUM(envoy_listener_manager_total_listeners_draining)`
Number of currently warming listeners.
|Dependent item|envoy.listener_manager.total_listeners_warming**Preprocessing**
Prometheus pattern: `SUM(envoy_listener_manager_total_listeners_warming)`
A boolean (1 if started and 0 otherwise) that indicates whether listeners have been initialized on workers.
|Dependent item|envoy.listener_manager.workers_started**Preprocessing**
Prometheus pattern: `VALUE(envoy_listener_manager_workers_started)`
Discard unchanged with heartbeat: `3h`
Total failed listener object additions to workers per second.
|Dependent item|envoy.listener_manager.listener_create_failure.rate**Preprocessing**
Prometheus pattern: `VALUE(envoy_listener_manager_listener_create_failure)`
Total listener objects successfully added to workers per second.
|Dependent item|envoy.listener_manager.listener_create_success.rate**Preprocessing**
Prometheus pattern: `VALUE(envoy_listener_manager_listener_create_success)`
Total listeners added (either via static config or LDS) per second.
|Dependent item|envoy.listener_manager.listener_added.rate**Preprocessing**
Prometheus pattern: `VALUE(envoy_listener_manager_listener_added)`
Total listeners stopped per second.
|Dependent item|envoy.listener_manager.listener_stopped.rate**Preprocessing**
Prometheus pattern: `VALUE(envoy_listener_manager_listener_stopped)`
### Triggers

|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----------|--------|--------------------------------|
|Envoy Proxy: has been restarted|Uptime is less than 10 minutes.|`last(/Envoy Proxy by HTTP/envoy.server.uptime)<10m`|Info|**Manual close**: Yes|
|Envoy Proxy: Failed to fetch metrics data|Zabbix has not received data for items for the last 10 minutes.|`nodata(/Envoy Proxy by HTTP/envoy.server.uptime,10m)=1`|Warning|**Manual close**: Yes|
|Envoy Proxy: SSL certificate expires soon|Please check the certificate. Fewer than {$ENVOY.CERT.MIN} days are left until the next managed certificate expires.|`last(/Envoy Proxy by HTTP/envoy.server.days_until_first_cert_expiring)<{$ENVOY.CERT.MIN}`|Warning||

### LLD rule Cluster metrics discovery

|Name|Description|Type|Key and additional info|
|----|-----------|----|-----------------------|
|Cluster metrics discovery||Dependent item|envoy.lld.cluster**Preprocessing**
Prometheus to JSON: `envoy_cluster_membership_total`
JavaScript: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `3h`
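The `Prometheus to JSON` and JavaScript steps turn labeled metric lines into Zabbix LLD rows. A rough sketch, under the assumption that Envoy tags the metric with an `envoy_cluster_name` label (the sample payload and function name are illustrative):

```python
import json
import re

# Assumed sample: envoy_cluster_membership_total carries the cluster name as a label.
CLUSTER_SAMPLE = """\
envoy_cluster_membership_total{envoy_cluster_name="service_a"} 4
envoy_cluster_membership_total{envoy_cluster_name="service_b"} 2
"""

def cluster_lld(payload: str) -> str:
    """One LLD row per discovered cluster, keyed by the {#CLUSTER_NAME} macro."""
    names = re.findall(r'envoy_cluster_name="([^"]+)"', payload)
    return json.dumps([{"{#CLUSTER_NAME}": name} for name in names])

print(cluster_lld(CLUSTER_SAMPLE))
```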
Current cluster membership total.
|Dependent item|envoy.cluster.membership_total["{#CLUSTER_NAME}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Current cluster healthy total (inclusive of both health checking and outlier detection).
|Dependent item|envoy.cluster.membership_healthy["{#CLUSTER_NAME}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Current cluster unhealthy total.
|Calculated|envoy.cluster.membership_unhealthy["{#CLUSTER_NAME}"]| |Envoy Proxy: Cluster ["{#CLUSTER_NAME}"]: Membership, degraded|Current cluster degraded total.
|Dependent item|envoy.cluster.membership_degraded["{#CLUSTER_NAME}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
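The `envoy.cluster.membership_unhealthy` item above is a Calculated item; assuming it is derived as membership total minus healthy total, the arithmetic is simply:

```python
def membership_unhealthy(membership_total: int, membership_healthy: int) -> int:
    # Assumed formula: hosts counted in membership that are not currently healthy.
    return membership_total - membership_healthy

print(membership_unhealthy(5, 3))  # -> 2
```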
Current cluster total connections.
|Dependent item|envoy.cluster.upstream_cx_total["{#CLUSTER_NAME}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Current cluster total active connections.
|Dependent item|envoy.cluster.upstream_cx_active["{#CLUSTER_NAME}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Current cluster request total per second.
|Dependent item|envoy.cluster.upstream_rq_total.rate["{#CLUSTER_NAME}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Current cluster requests that timed out waiting for a response per second.
|Dependent item|envoy.cluster.upstream_rq_timeout.rate["{#CLUSTER_NAME}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Total upstream requests completed per second.
|Dependent item|envoy.cluster.upstream_rq_completed.rate["{#CLUSTER_NAME}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Aggregate HTTP response codes per second.
|Dependent item|envoy.cluster.upstream_rq_2x.rate["{#CLUSTER_NAME}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Aggregate HTTP response codes per second.
|Dependent item|envoy.cluster.upstream_rq_3x.rate["{#CLUSTER_NAME}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Aggregate HTTP response codes per second.
|Dependent item|envoy.cluster.upstream_rq_4x.rate["{#CLUSTER_NAME}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Aggregate HTTP response codes per second.
|Dependent item|envoy.cluster.upstream_rq_5x.rate["{#CLUSTER_NAME}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Total active requests pending a connection pool connection.
|Dependent item|envoy.cluster.upstream_rq_pending_active["{#CLUSTER_NAME}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Total active requests.
|Dependent item|envoy.cluster.upstream_rq_active["{#CLUSTER_NAME}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Total sent connection bytes per second.
|Dependent item|envoy.cluster.upstream_cx_tx_bytes_total.rate["{#CLUSTER_NAME}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Total received connection bytes per second.
|Dependent item|envoy.cluster.upstream_cx_rx_bytes_total.rate["{#CLUSTER_NAME}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
### LLD rule Listeners metrics discovery

**Preprocessing**
Prometheus to JSON: `envoy_listener_downstream_cx_active`
JavaScript: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `3h`
Total active connections.
|Dependent item|envoy.listener.downstream_cx_active["{#LISTENER_ADDRESS}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Total connections per second.
|Dependent item|envoy.listener.downstream_cx_total.rate["{#LISTENER_ADDRESS}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Sockets currently undergoing listener filter processing.
|Dependent item|envoy.listener.downstream_pre_cx_active["{#LISTENER_ADDRESS}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
### LLD rule HTTP metrics discovery

**Preprocessing**
Prometheus to JSON: `envoy_http_downstream_rq_total`
JavaScript: `The text is too long. Please see the template.`
Discard unchanged with heartbeat: `3h`
Total requests per second.
|Dependent item|envoy.http.downstream_rq_total.rate["{#CONN_MANAGER}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Total active requests.
|Dependent item|envoy.http.downstream_rq_active["{#CONN_MANAGER}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Total requests closed due to a timeout on the request path per second.
|Dependent item|envoy.http.downstream_rq_timeout["{#CONN_MANAGER}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Total connections per second.
|Dependent item|envoy.http.downstream_cx_total["{#CONN_MANAGER}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Total active connections.
|Dependent item|envoy.http.downstream_cx_active["{#CONN_MANAGER}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Total bytes received per second.
|Dependent item|envoy.http.downstream_cx_rx_bytes_total.rate["{#CONN_MANAGER}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`
Total bytes sent per second.
|Dependent item|envoy.http.downstream_cx_tx_bytes_total.rate["{#CONN_MANAGER}"]**Preprocessing**
Prometheus pattern: `The text is too long. Please see the template.`