In a large-scale K8S scenario, how to optimize Service performance?

Posted Jun 29, 2020 · 9 min read

Summary: Kubernetes native Service load balancing is based on Iptables, whose rule chains grow linearly with the number of Services, seriously degrading Service performance in large-scale scenarios. This article shares Huawei Cloud's exploration and practice in optimizing Kubernetes Service performance.

During business promotions, traffic peaks in different business areas usually arrive at different times, and absorbing these staggered bursts often requires a large amount of extra network resources. Today we share some optimization practices around Kubernetes Service, a networking topic. First we introduce the Kubernetes Service mechanism and the three Service modes: the original Userspace and Iptables modes, and the IPVS mode we later contributed. The second part describes how the community originally implemented Service load balancing with Iptables; the third part covers the problems of that Iptables implementation; next is how IPVS implements Service load balancing; and finally a comparison of the two.

Service mechanism of Kubernetes

First look at Service in Kubernetes. Before Kubernetes, with just a container network, the most direct way to access an application was for the client to reach a backend container directly. This approach is intuitive and easy, but its problems are obvious: when the application has multiple backend containers, how is load balancing done? How are sessions maintained? What happens when a container is rescheduled and its IP changes? How are health checks configured? What if a domain name should serve as the access entry? These are exactly the problems Kubernetes Service was introduced to solve.

01 Kubernetes Service and Endpoints

This picture shows the correspondence between Service and several other objects. The first is the Service, which stores the access information of the service (such as IP and port). It can be simply understood as a load balancer built into Kubernetes, whose role is to provide load balancing for multiple Pods.

The picture shows a Service fronting two Pods deployed by a Replication Controller. We know that an RC is associated with its Pods through a label selector, and a Service works the same way: its Selector matches the Pods it should load-balance. There is actually an object in between, called Endpoints. Why is it needed? Because in practice, a newly created Pod may not be able to serve traffic immediately, and a Pod may be deleted or enter some other bad state; in all these cases we want client requests not to be routed to a Pod that cannot serve them. The Endpoints object is introduced to track exactly those Pods that can serve external traffic. Each address in an Endpoints object also corresponds to an internal Kubernetes domain name, through which the specific Pod can be accessed directly.

Look at the definitions of Service and Endpoints. Note that a Service has a ClusterIP attribute field, which can be simply understood as a virtual IP; resolving the Service's domain name usually returns this ClusterIP. It is also worth noting that Service supports port mapping, that is, the port exposed by the Service does not have to be the same as the container port.
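As a minimal sketch of those definitions (all names and addresses here are made up for illustration; the Endpoints object is normally generated automatically from the Selector), a Service and its Endpoints might look like:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service           # hypothetical service name
spec:
  selector:
    app: my-app              # matches Pods labeled app=my-app
  clusterIP: 10.0.0.10       # the virtual IP (normally auto-assigned)
  ports:
  - port: 80                 # port exposed by the Service
    targetPort: 8080         # container port (port mapping)
---
apiVersion: v1
kind: Endpoints
metadata:
  name: my-service           # same name as the Service
subsets:
- addresses:                 # only Pods that are ready to serve
  - ip: 192.168.1.11
  - ip: 192.168.1.12
  ports:
  - port: 8080
```

Note how port 80 on the ClusterIP maps to port 8080 on the Pods.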

02 Service internal logic

Having introduced the relationship between Service, Pods and Endpoints, let's look at the internal logic of Service. Here we mainly look at the Endpoint Controller, which watches Service objects and Pods for changes and maintains the corresponding Endpoints information. Then, on each node, kube-proxy maintains local routing rules based on Services and Endpoints.

In fact, whenever an Endpoints object changes (that is, when a Service or the state of its associated Pods changes), kube-proxy refreshes the corresponding rules on each node. So this is really load balancing near the client: when a Pod accesses Pods behind another Service, the request has already been steered to its destination Pod by local routing rules before it even leaves the node.

Iptables to achieve load balancing

Ok, let's see how the Iptables mode is implemented.

Iptables consists of two parts: a command-line tool that runs in user space, and a kernel module; essentially it is a wrapper around the Netfilter kernel framework. The strength of Iptables is that it supports a rich set of packet-matching operations.

This is a flowchart of how Iptables processes a network packet. As you can see, each packet traverses several hook points in order. The first is PREROUTING, where a routing decision determines whether the packet is destined for a local process or for another machine. If it is for another machine, the packet goes to the FORWARD chain, then through another routing decision to determine where it should be forwarded, and finally leaves via POSTROUTING. If it is destined for the local host, it enters the INPUT chain and is delivered to the local process. When that process replies, the new packet goes through OUTPUT and then out via POSTROUTING.

01 Iptables to achieve traffic forwarding and load balancing

We know that Iptables is primarily a firewall, so how does it do traffic forwarding, load balancing, and even session persistence? As shown below:
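The original figure is not reproduced here; as an illustrative sketch (the VIP 10.0.0.10 and the backend addresses are made up), Iptables can do all three jobs with the nat table plus two match extensions:

```
# Forwarding: rewrite the destination to a real backend (DNAT)
iptables -t nat -A PREROUTING -d 10.0.0.10 -p tcp --dport 80 \
    -j DNAT --to-destination 192.168.1.11:8080

# Load balancing: the statistic module picks a backend with a
# given probability; unmatched traffic falls through to the next rule
iptables -t nat -A PREROUTING -d 10.0.0.10 -p tcp --dport 80 \
    -m statistic --mode random --probability 0.5 \
    -j DNAT --to-destination 192.168.1.11:8080
iptables -t nat -A PREROUTING -d 10.0.0.10 -p tcp --dport 80 \
    -j DNAT --to-destination 192.168.1.12:8080

# Session persistence: the recent module routes a client seen within
# the last 180 seconds back to the same backend
iptables -t nat -A PREROUTING -d 10.0.0.10 -p tcp --dport 80 \
    -m recent --rcheck --seconds 180 --name SESSION1 \
    -j DNAT --to-destination 192.168.1.11:8080
```

This is essentially the toolbox kube-proxy's Iptables mode builds on.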

02 Iptables application example in Kubernetes

So, how does Kubernetes use Iptables to implement load balancing? Let's look at a practical example. In Kubernetes, on the path from VIP to RIP, the packet traverses the following Iptables chains: PREROUTING/OUTPUT (depending on whether the traffic comes from the local machine or an external machine) -> KUBE-SERVICES (the entry point of all Kubernetes custom chains) -> KUBE-SVC-XXX (where the hash suffix is generated from the Service's virtual IP) -> KUBE-SEP-XXX (where the hash suffix is generated from the actual IP of the back-end Pod).
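A trimmed, hypothetical iptables-save excerpt for one Service with two endpoints might look like the following (hash suffixes and addresses are invented for illustration):

```
-A PREROUTING -j KUBE-SERVICES
-A KUBE-SERVICES -d 10.0.0.10/32 -p tcp --dport 80 -j KUBE-SVC-ABCDE12345
-A KUBE-SVC-ABCDE12345 -m statistic --mode random --probability 0.5 -j KUBE-SEP-AAAAA11111
-A KUBE-SVC-ABCDE12345 -j KUBE-SEP-BBBBB22222
-A KUBE-SEP-AAAAA11111 -p tcp -j DNAT --to-destination 192.168.1.11:8080
-A KUBE-SEP-BBBBB22222 -p tcp -j DNAT --to-destination 192.168.1.12:8080
```

Every Service adds its own KUBE-SVC and KUBE-SEP chains, which is why the total rule count grows linearly with the number of Services.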

Problems with current Iptables implementation

01 Iptables do load balancing problem

So what are the main drawbacks of using Iptables for load balancing? At first we only analyzed the principle; later we measured it in large-scale scenarios and found that the problems are actually very pronounced.

  • The first is delay: both rule-matching delay and rule-update delay. As the example above shows, the virtual IP of each Kubernetes Service corresponds to a chain under KUBE-SERVICES. Iptables rule matching is linear, so the matching time complexity is O(N). Rule updates are non-incremental: adding or deleting even a single rule rewrites the whole Netfilter rule table.
  • Second is scalability. We know that when the number of Iptables rules in the system is large, updates become very slow. Moreover, because the full table is committed under the protection of a kernel lock, concurrent updaters can only wait.
  • Finally, availability. When a service scales out or in, the refresh of Iptables rules causes connection drops and service unavailability.

02 Iptables rule matching delay

The figure above shows that service access delay grows with the number of rules. But this by itself is acceptable, since the maximum delay is about 8000us (8ms), which indicates the real performance bottleneck is not here.

03 Iptables rule update delay

So where does the slowness of Iptables rule updates come from?

First of all, Iptables rule updates are full updates; even --noflush does not help (--noflush only guarantees that old chains are not deleted during iptables-restore).

Moreover, kube-proxy periodically refreshes the Iptables state: it first copies the system's Iptables state with iptables-save, then updates some rules, and finally writes everything back to the kernel with iptables-restore. When the number of rules grows large, this process becomes very slow.
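A simplified sketch of that refresh cycle (requires root; kube-proxy does the editing step in memory rather than via a temp file):

```
# 1. Dump the current nat table from the kernel
iptables-save -t nat > /tmp/rules.txt

# 2. Rewrite the dump: add/remove rules for the changed endpoints

# 3. Write the WHOLE table back. The kernel replaces the table as one
#    unit, so the cost is proportional to the total rule count, not to
#    the size of the diff. -n (--noflush) leaves other chains untouched.
iptables-restore -n < /tmp/rules.txt
```

This full save/modify/restore cycle is why a one-rule change can still take seconds or worse at scale.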

There are many reasons for such high latency, and they differ across kernel versions. The delay is also closely related to the system's current memory usage, because Iptables updates the Netfilter rule table as a whole, allocating a large block of kernel memory (>128MB) at once.

04 Iptables periodic refresh causes TPS jitter

The above figure shows that under a high-concurrency loadrunner stress test, kube-proxy's periodic Iptables refresh disconnects back-end services and causes periodic TPS fluctuations.

K8S Scalability

So this places a very big limitation on the performance of the Kubernetes data plane. We know the scale the community has established for the management plane: it has supported 5000 nodes since last year. For the data plane, however, there is no authoritative definition and no specifications are given.

We have evaluated in multiple scenarios and found that the number of services can easily reach tens of thousands, so optimization is still necessary. At that time, there were two main optimization solutions:

  • Use a tree structure to organize the Iptables rules, turning matching and rule updates into tree operations, thereby optimizing both delays.
  • Using IPVS, the benefits will be discussed later.

An example of using a tree structure to organize Iptables rules is as follows:

In this example, the root of the tree matches a 16-bit prefix, the root's two child nodes match 24-bit prefixes, and the virtual IPs are leaf nodes, hung under different tree nodes according to their network segments. In this way, the rule-matching delay drops from O(N) to roughly O(M), where M is the height of the tree. But the cost of doing so is that the Iptables rules become much more complicated.
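A hypothetical sketch of such tree-structured chains (all chain names and prefixes are made up):

```
# Root: dispatch on a 16-bit prefix
-A KUBE-SERVICES -d 10.0.0.0/16 -j TREE-10-0

# Level 2: dispatch on 24-bit prefixes
-A TREE-10-0 -d 10.0.1.0/24 -j TREE-10-0-1
-A TREE-10-0 -d 10.0.2.0/24 -j TREE-10-0-2

# Leaves: the per-Service chains, as before
-A TREE-10-0-1 -d 10.0.1.10/32 -p tcp --dport 80 -j KUBE-SVC-AAAAA11111
-A TREE-10-0-2 -d 10.0.2.20/32 -p tcp --dport 80 -j KUBE-SVC-BBBBB22222
```

A lookup now walks one chain per tree level instead of scanning every Service's rule, at the price of maintaining the intermediate dispatch chains.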

IPVS achieves service load balancing

01 What is IPVS

  • An implementation of transport-layer (L4) load balancing; the load balancer used by LVS;
  • Also based on Netfilter, but uses a hash table for lookups;
  • Supports TCP, UDP and SCTP, over IPv4 and IPv6;
  • Supports multiple load balancing algorithms, such as rr, wrr, lc, wlc, sh, dh and lblc;
  • Supports session persistence via the persistent connection scheduling option.

02 Three forwarding modes of IPVS

IPVS has three forwarding modes: DR, tunnel and NAT.

  • DR mode works at L2 by rewriting the destination MAC address, and is the fastest. The request packet is forwarded to the back-end server through the IPVS director, while the response packet is returned directly to the client. The disadvantage is that it does not support port mapping, so this mode is unfortunately ruled out.
  • Tunnel mode encapsulates the IP packet inside another IP packet. After receiving the tunnel packet, the back-end server strips the outer IP header, and the response packet is again returned directly to the client. Tunnel mode does not support port mapping either, so it is also ruled out.
  • NAT mode supports port mapping. Unlike the previous two modes, NAT mode requires the return packet to pass back through the IPVS director. The kernel-native IPVS only does DNAT, not SNAT.

03 Use IPVS to achieve traffic forwarding

Using IPVS for traffic forwarding only needs to go through the following simple steps.

  • Bind VIP

Since the DNAT hook of IPVS is attached to the INPUT chain, the kernel must recognize the VIP as a local IP. There are at least three ways to bind the VIP:

  1. Create a dummy network card and bind the VIP to it, as shown below ($VIP is a placeholder for the Service's virtual IP).
    # ip link add dev dummy0 type dummy
    # ip addr add $VIP/32 dev dummy0

  2. Add the VIP directly to the local routing table.
    # ip route add to local $VIP/32 dev eth0 proto kernel

  3. Add a network card alias carrying the VIP to the local network card.
    # ifconfig eth0:1 $VIP up

  • Create an IPVS virtual server for this virtual IP

# ipvsadm -A -t $VIP:$PORT -s rr -p 600
In the above example, an IPVS virtual server is created at $VIP:$PORT (placeholders for the virtual IP and port), the scheduling algorithm is round-robin (-s rr), and the session persistence timeout is 600s (-p 600).

  • Create a corresponding real server for this IPVS service

# ipvsadm -a -t $VIP:$PORT -r $RIP1:$PORT -m
# ipvsadm -a -t $VIP:$PORT -r $RIP2:$PORT -m

In the above example, two real servers, $RIP1:$PORT and $RIP2:$PORT (placeholders for the back-end Pod addresses), are added to the IPVS virtual server; the -m flag selects NAT (masquerade) forwarding.
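Putting the steps together with hypothetical addresses (10.0.0.10 as the VIP, two backends on port 8080; requires root and the ip_vs kernel module), and checking the result with ipvsadm -Ln:

```
# ip link add dev dummy0 type dummy
# ip addr add 10.0.0.10/32 dev dummy0
# ipvsadm -A -t 10.0.0.10:80 -s rr -p 600
# ipvsadm -a -t 10.0.0.10:80 -r 192.168.1.11:8080 -m
# ipvsadm -a -t 10.0.0.10:80 -r 192.168.1.12:8080 -m
# ipvsadm -Ln
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port   Forward Weight ActiveConn InActConn
TCP  10.0.0.10:80 rr persistent 600
  -> 192.168.1.11:8080    Masq    1      0          0
  -> 192.168.1.12:8080    Masq    1      0          0
```

Because the virtual server lives in one hash table rather than a rule chain per Service, adding the N-th Service costs the same as adding the first.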

Iptables vs. IPVS

01 Iptables vs. IPVS rules increase latency

It is easy to find by observing the above picture:

  • The delay of adding an Iptables rule rises at an "exponential" level as the number of existing rules grows;
  • When the number of services in the cluster reaches 20,000, the delay of adding a new rule grows from 50us to about 5 hours;
  • The delay of adding an IPVS rule always stays within 100us and is hardly affected by the number of existing rules. This small variation can even be regarded as systematic measurement error.

02 Iptables vs. IPVS network bandwidth

This is the network bandwidth we actually measured with iperf in the two modes. You can see the bandwidth difference between the first service and the last service in Iptables mode: the bandwidth of the last service is significantly smaller than that of the first, and as the number of services grows, the difference becomes more and more obvious.

In IPVS mode, overall bandwidth is higher than in Iptables mode. When the number of services in the cluster reaches 25,000, bandwidth in Iptables mode drops essentially to zero, while services in IPVS mode can still keep about half of their earlier bandwidth and serve traffic normally.

03 Iptables vs. IPVS CPU/Memory consumption

Obviously, the IPVS metrics in both CPU/memory dimensions are much lower than Iptables.

Feature Community Status

This feature was introduced as Alpha in version 1.8 and graduated to Beta in version 1.9, which fixed most of the problems. It is now quite stable and highly recommended. In addition, this feature is currently maintained by our Huawei Cloud K8S open source team; if you find any problems during use, you are welcome to report them to the community or to us.

The cloud-native era has arrived, and Huawei Cloud has taken the first step in building cloud-native infrastructure through Kubernetes. Although there are still many challenges in practice that are waiting for us to deal with, we believe that as we continue to invest in technology, these problems will be solved one by one.
