How to Achieve Zero-Downtime Rolling Updates When Updating Applications in Kubernetes
Posted Jun 16, 2020 • 9 min read
Authors | Zi Bai (Alibaba Cloud Development Engineer), Xi Heng (Alibaba Cloud Technical Expert)
In Kubernetes clusters, services are usually exposed as a Deployment plus a LoadBalancer-type Service. The typical deployment architecture is shown in Figure 1. This architecture is simple and convenient to deploy and operate, but a service interruption may occur during an application update or upgrade, causing problems in production. Below we analyze in detail why this architecture causes service interruptions during application updates, and how to avoid them.
Figure 1 Business deployment diagram
Why do service interruptions occur
During a rolling update, a new pod is created first, and the old pod is deleted only after the new pod reaches the Running state.
Figure 2 Schematic diagram of service interruption
Cause of interruption: once the Pod is Running it is added to the Endpoint backend, and after the container service detects the Endpoint change, the Node is added to the SLB backend. At this point requests are forwarded from the SLB to the Pod, but the Pod's service code has not finished initializing and cannot handle them, so the service is interrupted, as shown in Figure 2.
Solution: configure a readiness probe for the Pod, so that the node is added to the SLB backend only after the business code has finished initializing.
When an old pod is deleted, multiple objects (such as the Endpoint, ipvs/iptables rules, and the SLB) need to be synchronized, and these synchronization operations run asynchronously. The overall process is shown in Figure 3.
Figure 3 Deployment timing diagram
- Pod status change: the Pod is set to Terminating status and removed from the Endpoints list of every Service. The Pod stops receiving new traffic, but containers running in the Pod are not affected;
- preStop Hook execution: the preStop Hook is triggered when the Pod is deleted. It supports a bash script, a TCP check, or an HTTP request;
- SIGTERM signal: SIGTERM is sent to the containers in the Pod;
- Wait for the specified time: the terminationGracePeriodSeconds field controls the waiting time, with a default of 30 seconds. This countdown runs concurrently with the preStop Hook, so terminationGracePeriodSeconds needs to be longer than the preStop duration; otherwise the Pod may be killed before preStop completes;
- SIGKILL signal: after the specified time has elapsed, SIGKILL is sent to the containers in the Pod and the Pod is deleted.
Cause of interruption: steps 1, 2, 3, and 4 above happen concurrently, so the Pod may receive SIGTERM and stop working before it has been removed from the Endpoints. In that case a request forwarded from the SLB reaches a Pod that has already stopped, causing a service interruption, as shown in Figure 4.
Figure 4 Schematic diagram of service interruption
Solution: configure a preStop Hook for the Pod so that on receiving SIGTERM it sleeps for a period of time instead of stopping immediately, ensuring that traffic already forwarded from the SLB can still be processed by the Pod.
Cause of interruption: when the pod enters the Terminating state, it is removed from the Endpoints of all Services, and kube-proxy cleans up the corresponding iptables/ipvs entries. After the container service detects the Endpoint change, it calls the SLB OpenAPI to remove the backend, which takes a few seconds. Because these two operations run concurrently, the iptables/ipvs entries on the node may already be cleaned up while the node has not yet been removed from the SLB. Traffic then flows in from the SLB to a node with no matching iptables/ipvs rule, causing a service interruption, as shown in Figure 5.
Figure 5 Schematic diagram of service interruption
- Cluster mode: in Cluster mode, kube-proxy writes all Services into the iptables/ipvs rules of every Node. If the current Node has no business pod, the request is forwarded to another Node, so there is no service interruption, as shown in Figure 6;
Figure 6 Schematic diagram of cluster mode request forwarding
- Local mode: in Local mode, kube-proxy only writes the pods running on the Node into iptables/ipvs. When the only pod on a Node enters the Terminating state, its iptables/ipvs record is removed; a request forwarded to this node then finds no matching rule and fails. This problem can be avoided by upgrading in place, i.e. keeping at least one Running pod on the Node throughout the update. An in-place upgrade ensures that the Node's iptables/ipvs rules always contain a business record, so there is no service interruption, as shown in Figure 7;
Figure 7 Schematic diagram of request forwarding when upgrading in-place in Local Mode
- ENI mode Service: ENI mode bypasses kube-proxy and mounts Pods directly to the SLB backend, so there is no service interruption caused by iptables/ipvs.
Figure 8 Schematic diagram of ENI mode request forwarding
Figure 9 Schematic diagram of service interruption
Cause of interruption: after the container service detects that the Endpoints have changed, it removes the Node from the SLB backend. When the Node is removed, the SLB immediately drops the long-lived connections to that node, causing a service interruption.
Solution: configure graceful termination of long-lived connections on the SLB (the mechanism depends on the specific cloud vendor).
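On Alibaba Cloud, for example, connection draining can be enabled through annotations on the Service that the cloud controller manager translates into listener settings. The annotation names and values below are an assumption based on the ACK documentation and may differ across CCM versions, so verify them for your cluster:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
  annotations:
    # Assumed ACK CCM annotations -- verify against your provider's docs.
    # Enable connection draining on the SLB listeners...
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-connection-drain: "on"
    # ...and keep established connections open for up to 30 s after the
    # backend is removed, so in-flight requests can finish.
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-connection-drain-timeout: "30"
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: nginx
  type: LoadBalancer
```

Other vendors expose the same idea under different names (e.g. deregistration delay on AWS ELB), so the drain timeout should be sized to your longest expected request or connection lifetime.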
How to avoid service interruption
To avoid service interruptions, you can start with two types of resources: the Pod and the Service. The configuration methods corresponding to each cause of interruption above are introduced next.
Pod configuration

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: default
spec:
  containers:
  - name: nginx
    image: nginx
    # Liveness probe
    livenessProbe:
      failureThreshold: 3
      initialDelaySeconds: 30
      periodSeconds: 30
      successThreshold: 1
      tcpSocket:
        port: 5084
      timeoutSeconds: 1
    # Readiness probe
    readinessProbe:
      failureThreshold: 3
      initialDelaySeconds: 30
      periodSeconds: 30
      successThreshold: 1
      tcpSocket:
        port: 5084
      timeoutSeconds: 1
    # Graceful exit
    lifecycle:
      preStop:
        exec:
          command:
          - sleep
          - "30"
  terminationGracePeriodSeconds: 60
```
Note: the readiness probe's detection frequency, delay time, failure threshold, and related parameters must be set appropriately. Some applications take a long time to start; if the configured time is too short, the pod will restart repeatedly.
- livenessProbe is a liveness check. If the number of failures reaches the threshold (failureThreshold), the container is restarted. For the detailed configuration, see the [official documentation](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/);
- readinessProbe is a readiness check. Only after the readiness check passes is the pod added to the Endpoint. After the container service detects the Endpoint change, it mounts the node to the SLB backend;
- The preStop duration should be set to the time the service needs to finish processing all in-flight requests; terminationGracePeriodSeconds should be set to the preStop duration plus at least 30 seconds.
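For applications whose startup time varies widely, stretching initialDelaySeconds is fragile; Kubernetes (1.16+) also offers a startupProbe that holds off the liveness probe until the first successful check. A minimal sketch, with an illustrative port and thresholds rather than values from this article:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    # Allow up to 30 * 10 = 300 s for the app to start; the liveness
    # probe only begins once the startup probe has succeeded.
    startupProbe:
      failureThreshold: 30
      periodSeconds: 10
      tcpSocket:
        port: 5084
    livenessProbe:
      failureThreshold: 3
      periodSeconds: 30
      tcpSocket:
        port: 5084
```

This keeps liveness detection tight during normal operation while tolerating a slow first boot, avoiding the repeated-restart problem noted above.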
Cluster mode

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: default
spec:
  externalTrafficPolicy: Cluster
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: nginx
  type: LoadBalancer
```
The container service mounts all nodes in the cluster to the SLB backend (except nodes excluded via the BackendLabel configuration), so SLB quota is consumed quickly. SLB limits the number of SLB backends each ECS instance can join; the default is 50. When the quota is exhausted, new listeners and SLBs cannot be created.
In Cluster mode, if the current node has no business pod, the request is forwarded to another node. Cross-node forwarding requires NAT, so the source IP is lost.
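If quota pressure is a concern in Cluster mode, ACK can restrict which nodes are mounted to the SLB via a backend label. The annotation name below is an assumption based on the ACK documentation, so double-check it for your CCM version:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
  annotations:
    # Assumed annotation: only nodes carrying this label are added to the
    # SLB backend, instead of every node in the cluster.
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-backend-label: "backend=nginx"
spec:
  externalTrafficPolicy: Cluster
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: nginx
  type: LoadBalancer
```

The corresponding nodes would need to be labeled (e.g. `backend=nginx`) beforehand; only those labeled nodes then count against the per-ECS SLB quota.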
Local mode

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: default
spec:
  externalTrafficPolicy: Local
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: nginx
  type: LoadBalancer
```

```yaml
# During the update, keep at least one Running Pod on every node as much as possible.
# Modify the UpdateStrategy and use nodeAffinity to favor in-place rolling updates:
# * set maxUnavailable to 0 in the UpdateStrategy, so a new pod starts before the previous pod stops
# * label a fixed set of nodes used for scheduling
# * use nodeAffinity plus a replica count exceeding the number of labeled nodes,
#   so that new Pods are created in place as much as possible
# For example:
apiVersion: apps/v1
kind: Deployment
# ...
  strategy:
    rollingUpdate:
      maxSurge: 50%
      maxUnavailable: 0%
    type: RollingUpdate
# ...
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: deploy
                operator: In
                values:
                - nginx
```
By default in Local mode, the container service only adds the nodes where the Service's Pods run to the SLB backend, so SLB quota is consumed slowly. Requests are forwarded directly to the node where the pod is located, with no cross-node forwarding, so the source IP is preserved. In Local mode, service interruptions can be avoided through in-place upgrades, as configured in the YAML above.
ENI mode (Alibaba Cloud specific)
```yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/backend-type: "eni"
  name: nginx
spec:
  ports:
  - name: http
    port: 30080
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer
```
In the Terway network mode, setting the service.beta.kubernetes.io/backend-type: "eni" annotation creates an SLB in ENI mode. In ENI mode, Pods are mounted directly to the SLB backend without going through kube-proxy, so there is no service interruption. Because requests are forwarded directly to the pod, the source IP is also preserved.
A comparison of the three Service modes is shown in the table below.
Figure 10 Service comparison
Terway network mode (recommended)
Choose an ENI-mode Service + Pod graceful termination + readiness probe.
Flannel network mode
- If there are not many SLBs in the cluster and there is no need to preserve the source IP: choose Cluster mode + Pod graceful termination + readiness probe;
- If there are a large number of SLBs in the cluster or the source IP must be preserved: choose Local mode + Pod graceful termination + readiness probe + in-place upgrades (ensuring at least one Running pod on each node during the update).
References
- Container Lifecycle Hooks
- Configure Liveness, Readiness and Startup Probes
- Access Services through a load balancer
- Kubernetes best practices: graceful termination
- Kubernetes community discussions: "Create ability to do zero downtime deployments when using externalTrafficPolicy: Local", "Graceful Termination for External Traffic Policy Local"
- Graceful online/offline of applications in Container Service for Kubernetes (ACK)