Health Checks

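In this chapter we will learn how to add a health check that verifies whether the services in a cluster are able to receive traffic. With health checking enabled, Envoy stops sending traffic to a service once that service has crashed.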

1. Proxy Configuration

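First, create an Envoy configuration file named envoy.yaml that proxies requests for any domain to two upstream services, 172.17.0.3 and 172.17.0.4. The complete configuration is shown below: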

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          codec_type: auto
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: backend
              domains:
                - "*"
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: targetCluster
          http_filters:
          - name: envoy.router
  clusters:
  - name: targetCluster
    connect_timeout: 0.25s
    type: STRICT_DNS
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    hosts: [
      { socket_address: { address: 172.17.0.3, port_value: 80 }},
      { socket_address: { address: 172.17.0.4, port_value: 80 }}
    ]

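Suppose the upstream service at 172.17.0.3 fails. As configured, Envoy keeps forwarding traffic to it, so users hit an unavailable service on some of their requests. What we would prefer is for Envoy to detect that the service is unavailable and automatically remove it from the set of nodes it routes to. This is achieved by adding a health check to the cluster.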
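
To make the problem concrete, the following minimal sketch models round-robin routing that ignores host health. It is not Envoy code: the function names and the `DOWN` set are hypothetical, and only the two addresses come from envoy.yaml.

```python
from itertools import cycle

HOSTS = ["172.17.0.3", "172.17.0.4"]  # upstream addresses from envoy.yaml
DOWN = {"172.17.0.3"}                 # assumption: this host has failed


def route_round_robin(hosts, request_count):
    """Pick a host for each request in strict rotation, ignoring health."""
    rotation = cycle(hosts)
    return [next(rotation) for _ in range(request_count)]


def count_failures(picks, down_hosts):
    """Count how many routed requests landed on a failed host."""
    return sum(1 for host in picks if host in down_hosts)


picks = route_round_robin(HOSTS, 10)
failed = count_failures(picks, DOWN)
print(f"{failed} of {len(picks)} requests hit a failed host")
```

With two hosts and one of them down, half of the requests in this model are routed to the failed host, which is the situation a health check is meant to prevent.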

2. Adding a Health Check

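Health checks are added to the cluster configuration in Envoy. The configuration below probes the /health endpoint on every node defined in the cluster, and Envoy decides whether a node is healthy based on the HTTP status the endpoint returns.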

health_checks:
- timeout: 1s
  interval: 10s
  interval_jitter: 1s
  unhealthy_threshold: 6
  healthy_threshold: 1
  http_health_check:
    path: "/health"

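Here is a brief explanation of the key fields in the health check configuration above:

  • interval: the time between consecutive health checks
  • unhealthy_threshold: the number of failed health checks required before a host is marked unhealthy (in other words, how many failures it takes for the host to be considered unhealthy)
  • healthy_threshold: the number of successful health checks required before a host is marked healthy (in other words, how many successes it takes for the host to be considered healthy)
  • http_health_check.path: the path used for health check requests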
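
The interaction between the two thresholds can be sketched as a small state machine. This is a simplified model under stated assumptions, not Envoy's implementation: the class `HostHealth` and the helper `status_is_healthy` are hypothetical names, only a 200 response is assumed to count as a pass, and any additional rules Envoy applies are not modeled.

```python
from dataclasses import dataclass


@dataclass
class HostHealth:
    unhealthy_threshold: int = 6   # value from the configuration above
    healthy_threshold: int = 1     # value from the configuration above
    healthy: bool = True
    streak: int = 0  # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Feed one health check result and return the resulting state."""
        if check_passed == self.healthy:
            self.streak = 0  # result agrees with the current state
            return self.healthy
        self.streak += 1
        limit = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self.streak >= limit:
            self.healthy = not self.healthy
            self.streak = 0
        return self.healthy


def status_is_healthy(http_status: int) -> bool:
    # Assumption: only a 200 response counts as a passing check.
    return http_status == 200


host = HostHealth()
states = [host.record(status_is_healthy(500)) for _ in range(6)]
print(states)     # state after each of six failed checks
recovered = host.record(status_is_healthy(200))
print(recovered)  # state after one passing check
```

In this model the host stays in rotation through the first five failed checks and flips to unhealthy on the sixth, while a single passing check is enough to bring it back.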

For more health check fields, see the official documentation: https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/core/health_check.proto

3. Starting the Proxy

Once the health check is added, Envoy checks the health of every node defined in the cluster. As before, start the Envoy proxy with the following command:

$ docker run -d --name proxy1 -p 80:8080 -v $(pwd)/manifests:/etc/envoy envoyproxy/envoy:latest

Then start two nodes, both in a healthy state:

$ docker run -d cnych/docker-http-server:healthy; docker run -d cnych/docker-http-server:healthy;

Once they are up, we can send requests to Envoy, and both upstream services return normal responses:

$ curl localhost -i
HTTP/1.1 200 OK
date: Wed, 15 Apr 2020 04:13:01 GMT
content-length: 63
content-type: text/html; charset=utf-8
x-envoy-upstream-service-time: 0
server: envoy

<h1>A healthy request was processed by host: b6336e79951d</h1>
$ curl localhost -i
HTTP/1.1 200 OK
date: Wed, 15 Apr 2020 04:13:02 GMT
content-length: 63
content-type: text/html; charset=utf-8
x-envoy-upstream-service-time: 0
server: envoy

<h1>A healthy request was processed by host: 9a4c07cc4306</h1>

4. Testing

Next, let's test how Envoy handles an unhealthy node. In a separate terminal, start a loop that sends requests continuously so that we can watch the state change:

$ while true; do curl localhost; sleep .5; done
......

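Then use the following command to find out which Docker container has the IP 172.17.0.3. We will make that node unhealthy, and Envoy will then automatically remove it from load balancing.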

$ docker ps -q | xargs -n 1 docker inspect --format '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}} {{ .Config.Hostname }}' | sed 's/ \// /'
172.17.0.4 b6336e79951d
172.17.0.3 9a4c07cc4306
172.17.0.2 6928df882c4f

To make the node unhealthy, request its unhealthy endpoint directly:

$ curl 172.17.0.3/unhealthy

At this point the request loop in the other terminal starts showing unhealthy responses:

......
<h1>A unhealthy request was processed by host: 9a4c07cc4306</h1>
<h1>A unhealthy request was processed by host: 9a4c07cc4306</h1>
<h1>A healthy request was processed by host: b6336e79951d</h1>
<h1>A unhealthy request was processed by host: 9a4c07cc4306</h1>
......
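
When the loop output scrolls quickly, it can help to tally responses per host. The sketch below parses lines in the format shown above; the regular expression and the `tally` helper are illustrative assumptions, and the sample lines are copied from the output above.

```python
import re
from collections import Counter

# Matches the response body printed by the demo HTTP server.
LINE_PATTERN = re.compile(
    r"<h1>A (healthy|unhealthy) request was processed by host: (\w+)</h1>"
)


def tally(lines):
    """Count responses per (host, state) pair, skipping lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LINE_PATTERN.search(line)
        if match:
            state, host = match.groups()
            counts[(host, state)] += 1
    return counts


sample = [
    "<h1>A unhealthy request was processed by host: 9a4c07cc4306</h1>",
    "<h1>A unhealthy request was processed by host: 9a4c07cc4306</h1>",
    "<h1>A healthy request was processed by host: b6336e79951d</h1>",
    "<h1>A unhealthy request was processed by host: 9a4c07cc4306</h1>",
]
counts = tally(sample)
for (host, state), n in sorted(counts.items()):
    print(host, state, n)
```

The same function could be fed the saved output of the request loop to see how the share of unhealthy responses changes over time.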

Requests sent directly to that container now return a 500 status code:

$ curl 172.17.0.3 -i
HTTP/1.1 500 Internal Server Error
Date: Wed, 15 Apr 2020 04:23:59 GMT
Content-Length: 65
Content-Type: text/html; charset=utf-8

<h1>A unhealthy request was processed by host: 9a4c07cc4306</h1>

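During this period Envoy keeps sending requests to the health check endpoint. While the endpoint is failing, Envoy continues to route traffic to the service until the number of failed health checks reaches unhealthy_threshold, at which point Envoy removes the host from the load balancer. The request loop in the other terminal then shows responses from only one container: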

......
<h1>A healthy request was processed by host: b6336e79951d</h1>
<h1>A healthy request was processed by host: b6336e79951d</h1>
<h1>A healthy request was processed by host: b6336e79951d</h1>
<h1>A healthy request was processed by host: b6336e79951d</h1>
......
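
How long the failing host keeps receiving traffic follows from the health check settings. The back-of-the-envelope sketch below uses the values from the configuration above (interval 10s, interval_jitter 1s, unhealthy_threshold 6). The function name is hypothetical, and the estimate ignores the time each check takes to complete, so treat the result as a rough range rather than a guarantee.

```python
def seconds_until_removed(interval_s, unhealthy_threshold, jitter_s=0.0):
    """Rough bounds on how long a failing host keeps receiving traffic."""
    # Best case: the first failing check fires right after the host breaks,
    # so only the gaps between the remaining checks are waited for.
    best = (unhealthy_threshold - 1) * interval_s
    # Worst case: the host breaks just after a passing check, and every
    # interval is stretched by the full jitter.
    worst = unhealthy_threshold * (interval_s + jitter_s)
    return best, worst


best, worst = seconds_until_removed(interval_s=10, unhealthy_threshold=6, jitter_s=1)
print(f"host removed after roughly {best}s to {worst}s")
```

Under these assumptions the unhealthy responses in the request loop would be expected to last for about a minute before they disappear. Lowering unhealthy_threshold or interval shortens that window at the cost of more sensitivity to transient failures.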

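Meanwhile, Envoy continues to probe the health check endpoint to see whether the host has become available again. As soon as it has, the host is added back to Envoy's upstream cluster.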

We can request the healthy endpoint of the unhealthy container to bring it back to normal:

$ curl 172.17.0.3/healthy

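Our health check interval is 10s and healthy_threshold is 1, so Envoy adds the container back as soon as one health check succeeds. The request loop in the other terminal then shows both containers again: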

......
<h1>A healthy request was processed by host: 9a4c07cc4306</h1>
<h1>A healthy request was processed by host: 9a4c07cc4306</h1>
<h1>A healthy request was processed by host: b6336e79951d</h1>
<h1>A healthy request was processed by host: b6336e79951d</h1>
......

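Next, let's test what happens when all services are unavailable. At the moment there are two healthy upstream servers, and the Envoy proxy load balances between them.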

As before, request the unhealthy endpoint on both upstream services to make them both unhealthy:

$ curl 172.17.0.3/unhealthy
$ curl 172.17.0.4/unhealthy

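Both upstream services are now unhealthy, so a request to Envoy returns the following. Note that the response is still produced by an upstream host rather than by Envoy itself: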

$ curl localhost -i
HTTP/1.1 500 Internal Server Error
date: Wed, 15 Apr 2020 06:19:01 GMT
content-length: 65
content-type: text/html; charset=utf-8
x-envoy-upstream-service-time: 0
server: envoy

<h1>A unhealthy request was processed by host: b6336e79951d</h1>
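
One likely reason the request still reaches an upstream host is Envoy's panic threshold: when the share of healthy hosts in a cluster drops below the threshold (50% by default), Envoy disregards health status and balances across all hosts. The sketch below is a simplified model of that rule, not Envoy code. The function name and the dictionary layout are hypothetical.

```python
def hosts_for_load_balancing(health_by_host, panic_threshold=0.5):
    """Return the hosts a request may be routed to.

    health_by_host maps a host address to True (healthy) or False (unhealthy).
    """
    if not health_by_host:
        return []
    healthy = [host for host, ok in health_by_host.items() if ok]
    # Panic mode: too few healthy hosts, so health status is ignored.
    if len(healthy) / len(health_by_host) < panic_threshold:
        return list(health_by_host)
    return healthy


one_down = hosts_for_load_balancing({"172.17.0.3": False, "172.17.0.4": True})
all_down = hosts_for_load_balancing({"172.17.0.3": False, "172.17.0.4": False})
print(one_down)  # only the healthy host is eligible
print(all_down)  # panic mode: both hosts are eligible again
```

In this model, one unhealthy host out of two leaves exactly 50% healthy, which is not below the threshold, so only the healthy host is used. That matches the earlier test where a single container served all requests.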

5. Passive Health Checking

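Unlike the active health checks above, passive health checking determines whether an endpoint is healthy from the responses to real requests. After an endpoint is removed, Envoy uses a timeout-based approach to reinsert it: the unhealthy host is added back to the cluster after a configurable interval, and each subsequent removal lengthens that interval, so that an unhealthy endpoint affects user traffic as little as possible.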

Like active health checks, passive health checks are configured per cluster. The configuration below removes a host for 30s after it returns 3 consecutive 5xx errors:

outlier_detection:
  consecutive_5xx: "3"
  base_ejection_time: "30s"

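  • consecutive_5xx: the number of consecutive 5xx responses an upstream host must return before it is removed. Note that in this context a 5xx means either an actual 5xx response code or an event that causes the HTTP router to return one on the upstream's behalf (a reset, a connection failure, and so on).
  • base_ejection_time: the base time for which a host is removed. The actual time equals the base time multiplied by the number of times the host has been removed. The default is 30000ms, or 30s.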
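
The growth of the removal time described above can be written out as a short calculation. This is a minimal sketch of the rule as stated in the text (base time multiplied by the number of removals). The function name is hypothetical, and any upper limit Envoy may apply is not modeled.

```python
def ejection_seconds(base_ejection_time_s, times_ejected):
    """Removal time for a host that has now been removed `times_ejected` times."""
    return base_ejection_time_s * times_ejected


durations = [ejection_seconds(30, n) for n in (1, 2, 3)]
for n, seconds in enumerate(durations, start=1):
    print(f"removal #{n}: host stays out for {seconds}s")
```

With the 30s base used here, a host that keeps failing stays out of rotation for 30s, then 60s, then 90s.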

With passive health checking enabled, Envoy removes hosts based on the responses to actual requests. As before, first start two new upstream nodes:

$ docker run -d cnych/docker-http-server:healthy; docker run -d cnych/docker-http-server:healthy;

Then start a new Envoy proxy with a configuration file named envoy1.yaml, whose contents are as follows:

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          codec_type: auto
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: backend
              domains:
                - "*"
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: targetCluster
          http_filters:
          - name: envoy.router
  clusters:
  - name: targetCluster
    connect_timeout: 0.25s
    type: STRICT_DNS
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    hosts: [
      { socket_address: { address: 172.17.0.5, port_value: 80 }},
      { socket_address: { address: 172.17.0.6, port_value: 80 }}
    ]
    outlier_detection:
      consecutive_5xx: "3"
      base_ejection_time: "30s"

Then run the following command to start the Envoy proxy:

$ docker run -d --name proxy2 -p 81:8080 \
    -v $(pwd)/manifests/envoy1.yaml:/etc/envoy/envoy.yaml \
    envoyproxy/envoy

Once it is up, run the following command in a separate terminal to send requests in a loop and watch the state change:

$ while true; do curl localhost:81; sleep .5; done

Then make the endpoint 172.17.0.5 unhealthy:

$ curl 172.17.0.5/unhealthy

After this command, every request to that endpoint returns a 500 error:

$ curl 172.17.0.5 -i
HTTP/1.1 500 Internal Server Error
Date: Wed, 15 Apr 2020 06:55:02 GMT
Content-Length: 65
Content-Type: text/html; charset=utf-8

<h1>A unhealthy request was processed by host: 55e0950029b8</h1>

In the loop terminal we then see 3 unhealthy responses, after which Envoy removes that upstream service:

......
<h1>A healthy request was processed by host: 5749edc61125</h1>
<h1>A healthy request was processed by host: 55e0950029b8</h1>
<h1>A healthy request was processed by host: 5749edc61125</h1>
<h1>A unhealthy request was processed by host: 55e0950029b8</h1>
<h1>A unhealthy request was processed by host: 55e0950029b8</h1>
<h1>A healthy request was processed by host: 5749edc61125</h1>
<h1>A unhealthy request was processed by host: 55e0950029b8</h1>
<h1>A healthy request was processed by host: 5749edc61125</h1>
<h1>A healthy request was processed by host: 5749edc61125</h1>
<h1>A healthy request was processed by host: 5749edc61125</h1>
......
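
The removal seen above can be reproduced with a small per-host counter. The sketch below is a simplified model of consecutive_5xx, not Envoy's implementation: the class `Consecutive5xxDetector` is hypothetical, the status codes in the replay are inferred from the healthy and unhealthy lines in the output above, and re-admission after the removal time is not modeled.

```python
class Consecutive5xxDetector:
    def __init__(self, consecutive_5xx=3):
        self.limit = consecutive_5xx
        self.streaks = {}     # host -> current run of consecutive 5xx responses
        self.ejected = set()

    def observe(self, host, status):
        """Record one response; return True if this response ejects the host."""
        if 500 <= status <= 599:
            self.streaks[host] = self.streaks.get(host, 0) + 1
        else:
            self.streaks[host] = 0  # any non-5xx response resets the run
        if self.streaks[host] >= self.limit and host not in self.ejected:
            self.ejected.add(host)
            return True
        return False


A, B = "5749edc61125", "55e0950029b8"
# Responses in the order shown in the loop output above.
replay = [(A, 200), (B, 200), (A, 200), (B, 500), (B, 500), (A, 200), (B, 500)]

detector = Consecutive5xxDetector(consecutive_5xx=3)
ejected_at = [i for i, (host, status) in enumerate(replay) if detector.observe(host, status)]
print(ejected_at, detector.ejected)
```

Because the counter is kept per host, the healthy response from the other host in between does not reset the run, and the third 500 from 55e0950029b8 triggers the removal.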

We can then mark 172.17.0.5 as healthy again with the following command:

$ curl 172.17.0.5/healthy

After roughly 30s, we can see that Envoy has added the endpoint back into load balancing:

......
<h1>A healthy request was processed by host: 5749edc61125</h1>
<h1>A healthy request was processed by host: 5749edc61125</h1>
<h1>A healthy request was processed by host: 5749edc61125</h1>
<h1>A healthy request was processed by host: 55e0950029b8</h1>
<h1>A healthy request was processed by host: 5749edc61125</h1>
<h1>A healthy request was processed by host: 5749edc61125</h1>
<h1>A healthy request was processed by host: 55e0950029b8</h1>
<h1>A healthy request was processed by host: 5749edc61125</h1>
......

This completes the health check configuration in Envoy.