Service Mesh High Availability in Enterprise Production

Posted May 27, 2020 · 21 min read


Service Mesh Virtual Meetup is an online live-streaming series jointly hosted by the ServiceMesher community and CNCF. In this issue, Service Mesh Virtual Meetup #1, four guests from different companies were invited to share Service Mesh application practice from different perspectives: production practices of Service Mesh at Momo and Baidu, how observability in a Service Mesh differs from observability in traditional microservices, and how to observe a Service Mesh with SkyWalking.

This article is organized from the talk "Service Mesh High Availability in Enterprise Production" given by Baidu senior engineer Luo Guangming. The end of the article contains the replay link for the talk and the download address of the slides.


Service Mesh faces many challenges when landing in enterprises, and the availability challenge is especially severe when it is co-deployed with traditional microservice applications. Based on the premise of interconnection and common governance between Service Mesh and Spring Cloud applications, this talk focuses on a Consul-based high-availability registry solution, on ensuring high availability of back-end services through various rate-limiting and circuit-breaking strategies, and on achieving high availability of inter-service calls through intelligent routing strategies (load balancing, instance fault tolerance, etc.).

Interconnection between Service Mesh and Spring Cloud applications

Microservices are a technology hotspot nowadays, and a large number of Internet companies are promoting and adopting microservice architectures; at the same time, many traditional enterprises are undergoing Internet-style technology transformation based on microservices and containers. In this transformation, one phenomenon stands out in China: microservice development frameworks represented by Spring Cloud and Dubbo are extremely popular. In recent years, the emerging Service Mesh technology has also attracted more and more developers' attention and shows every sign of prevailing in the future.

When people in the community discuss microservice technology selection, I often notice an either/or question: adopt Spring Cloud, or Service Mesh technology represented by Istio? However, the answer is not black and white: it is entirely possible to use Spring Cloud for some applications and Service Mesh (Istio) for others. Today I will discuss this issue with you.


First, let's take a look at Spring Cloud, the traditional intrusive microservice framework. It has the following advantages:

  • Integration: Spring Cloud covers all aspects of the microservice architecture, selecting mature, battle-tested service frameworks developed by various companies;
  • Lightweight components: most of the components Spring Cloud integrates are relatively lightweight, and they are leaders in their respective fields;
  • Easy development: Spring Cloud encapsulates the various components, which simplifies development;
  • Flexible development: Spring Cloud components are decoupled, and developers can flexibly choose components as needed;

Special thanks go to Netflix, a company that successfully practiced microservices long before most others; a few years ago it contributed almost its entire microservice framework stack to the community. Early Spring Cloud mainly wrapped Netflix's open-source components, but in the past two years the Spring Cloud community has started to develop many new components on its own and has also incorporated excellent practices from other Internet companies.


Next, let's take a brief look at the Service Mesh framework. It has brought two major changes: the decoupling of microservice governance from business logic, and the unified governance of heterogeneous systems. In addition, the service mesh has three major technical advantages over traditional microservice frameworks: observability, traffic control, and security. The service mesh has brought tremendous changes and has strong technical advantages; it is called the second-generation "microservice architecture".

However, as mentioned earlier, there is no silver bullet in software development. Just as the traditional microservice architecture has many pain points, so does the service mesh: it has its own limitations, including increased complexity of the call chain and of operations, the need for more specialized operations skills, a certain amount of added latency, and the work of adapting to the platform.

For more details about the advantages and disadvantages of Spring Cloud and Service Mesh, please read the "Service Mesh Overview" chapter of the Istio Handbook.


As mentioned earlier, the choice between the traditional microservice framework Spring Cloud and the emerging Service Mesh framework is not black and white; extending this to microservices versus monolithic architectures, the two can likewise coexist.

This can be compared to a hybrid cloud, which includes public clouds, private clouds, and possibly other private infrastructure. Hybrid cloud is currently a popular practice; in fact, it may be difficult to find an organization with a completely single cloud model. For most organizations, mobilizing the development resources to refactor a monolithic application entirely into microservices is a serious problem; adopting a hybrid microservice strategy is a better way for the development team to bring the microservice architecture within reach. Otherwise, the team may be unable to take on the refactoring of the monolith due to lack of time and experience.

Best practices for building a hybrid microservice architecture:

  • Refactor first the parts that yield the greatest return;
  • Prefer the Service Mesh framework for non-Java applications;

Hybrid microservices emerged to better support smooth migration, to maximize the level of service governance, and to reduce operations and communication costs, and the hybrid state may persist for quite a long time. The prerequisite for this architecture is the "interconnection" of services.


To achieve the "hybrid microservice architecture" described above, runtime support services are essential. They mainly include three components: the service registry, the service gateway, and the centralized configuration center.

Combining traditional microservices with Service Mesh (dual-mode microservices), that is, "traditional SDK-based microservices" plus "Sidecar-based Service Mesh microservices", can achieve the following goals:

  • Interconnection: applications in the two systems can access each other;
  • Smooth migration: an application can be migrated between the two systems transparently, without other applications that call it being aware of the move;
  • Flexible evolution: once interconnection and smooth migration are in place, we can carry out application transformation and architecture evolution flexibly according to the actual situation;

This also imposes requirements on the application runtime platform: applications under both systems must be able to run on virtual machines as well as on containers/Kubernetes. We do not want to bind users to Kubernetes, so Service Mesh does not use the Kubernetes Service mechanism for service registration and discovery, which highlights the importance of the registry.

The Baidu Intelligent Cloud CNAP team implemented the hybrid microservice architecture described above, that is, application interconnection, smooth migration, and flexible evolution across the two microservice systems. The hybrid microservice architecture diagram above includes the following components:

  • API Server: front-end/back-end decoupling, interface permission control, request forwarding, localized exception handling, etc.;
  • Microservice control center: the main logic of microservice governance, including multi-tenant handling of service registration, creation and conversion of governance rules (routing, rate limiting, and circuit breaking), and management of microservice configuration;
  • Monitoring data storage, message queue: mainly used by the Trace-based monitoring solution;
  • Configuration center: the microservice configuration center, whose most important function is configuration management, including the storage and distribution of all microservice configuration, such as governance rules and user configuration updates;

Next, let's take a look at the service registration and discovery mechanism of the registration center:

  • Spring Cloud applications register with the registry through the SDK, and Service Mesh applications through the Sidecar; registration requests first pass through the microservice control center for authentication and multi-tenant isolation;
  • The Mesh control plane connects directly to the registry to obtain service instances, while Spring Cloud applications obtain service instances through the SDK;
  • Dual-mode and heterogeneous: both container and virtual machine models are supported;

Registration Center and High Availability Solution

As mentioned earlier, the registry is the key to implementing a hybrid microservice architecture. Speaking of registries, the current mainstream open-source options include:

  • ZooKeeper: a distributed coordination system developed at Yahoo that can serve as a registry; many companies still use it as one today;
  • Eureka: a Netflix open-source component for service registration and discovery, well known to Spring Cloud developers; unfortunately, it is no longer maintained and is no longer recommended by the Spring Cloud ecosystem;
  • Consul: HashiCorp's product, which can be used as a registry and is the focus of this article;
  • Etcd: officially defined as a reliable distributed KV store;

We chose Consul for our registry. Consul provides the following important features:

  • Service discovery: services can be registered, and registered services can be discovered via HTTP or DNS;
  • A rich health-check mechanism;
  • Service mesh capability: the latest versions support Envoy as the data plane;
  • KV storage: a distributed configuration center can be built on Consul's KV store;
  • Multiple datacenters: with multi-datacenter support you can build multi-region scenarios without an additional abstraction layer, with multi-DC data synchronization and cross-region disaster recovery;
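As a concrete illustration of the service-discovery feature, the sketch below builds a registration payload for Consul's HTTP API (`PUT /v1/agent/service/register`). The service name, address, and the `tenant` metadata key are illustrative assumptions, not part of the original talk:

```python
import json
import urllib.request

def build_registration(name, address, port, tenant):
    """Build a payload for Consul's /v1/agent/service/register endpoint.

    The field names (Name, Address, Port, Meta, Check) follow Consul's
    agent API; the 'tenant' Meta key is our own convention, used later
    for multi-tenant filtering.
    """
    return {
        "Name": name,
        "Address": address,
        "Port": port,
        "Meta": {"tenant": tenant},
        "Check": {  # HTTP health check polled by the local agent
            "HTTP": f"http://{address}:{port}/health",
            "Interval": "10s",
            "Timeout": "1s",
        },
    }

def register(payload, agent="http://127.0.0.1:8500"):
    """PUT the payload to the local Consul agent (requires a running agent)."""
    req = urllib.request.Request(
        f"{agent}/v1/agent/service/register",
        data=json.dumps(payload).encode(),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

payload = build_registration("order-service", "10.0.0.12", 8080, "tenant-a")
```

Discovery is then a matter of querying `/v1/health/service/order-service` over HTTP, or resolving `order-service.service.consul` over DNS.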


The picture above is the architecture diagram from Consul's official website. The core concepts in the Consul architecture are as follows:

  • Agent: a daemon process that runs on every node of a Consul cluster, started with the consul agent command; an agent can run in Client or Server mode;
  • Client: an agent that forwards all RPC requests to a server. The client is stateless; it mainly participates in the LAN gossip pool and consumes very few resources and very little network bandwidth;
  • Server: an agent whose responsibilities include participating in the Raft quorum, maintaining cluster state, responding to RPC requests, exchanging information with other datacenters over WAN gossip, and forwarding query requests to the leader or to a remote datacenter;
  • Datacenter: a private, low-latency, high-bandwidth network environment that excludes interaction over public networks;

As a basic component, the registry's own availability is particularly important, and high-availability design requires distributed deployment. At the same time, given the complexity of distributed environments, nodes may fail for various reasons; in a distributed cluster deployment, we want the cluster to keep serving normally even when some nodes fail. As microservice infrastructure, the registry therefore has definite requirements on disaster tolerance and robustness, mainly reflected in:

  • As microservice infrastructure, the registry must continue to operate normally after certain failures (such as node crashes or network partitions);
  • A registry failure must not affect normal calls between services;


Consul uses the Raft protocol as its distributed consensus protocol, which by itself tolerates a certain number of failed nodes. Within a single datacenter, the number of nodes in a Consul cluster is kept at 2n + 1, where n is the number of machine failures that can be tolerated. Quorum size: a Raft election and log commit require writes to succeed on more than half of the nodes; for example, a 3-node cluster (quorum 2) tolerates 1 failure, and a 5-node cluster (quorum 3) tolerates 2.

Q1: Can the number of nodes be even?
A1: Yes, but deploying an even number of nodes is not recommended. On the one hand, a 4-node cluster tolerates the same number of failures (one) as a 3-node cluster; on the other hand, an even number of nodes may split the vote during leader election (although Consul resets the election timeout and re-elects), so an odd number of nodes is recommended.

Q2: The more server nodes the better?
A2: No. Although a larger number of servers means a larger number of tolerable failures, readers familiar with the Raft protocol will recall log replication: every write must reach a quorum of servers, so performance drops as the number of servers grows. Combined with the actual scenario, deploying 3 server nodes is generally recommended.

Three or five server nodes are recommended as the best balance of performance and fault tolerance.
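The quorum arithmetic behind these recommendations can be written down in a few lines (a sketch; `quorum` and `fault_tolerance` are illustrative helper names):

```python
def quorum(servers: int) -> int:
    """Raft quorum: a strict majority of the server nodes."""
    return servers // 2 + 1

def fault_tolerance(servers: int) -> int:
    """How many servers can fail while a quorum still remains."""
    return servers - quorum(servers)

# servers -> (quorum, tolerable failures)
table = {n: (quorum(n), fault_tolerance(n)) for n in range(1, 8)}
# 3 servers: quorum 2, tolerates 1; 4: quorum 3, still tolerates only 1;
# 5: quorum 3, tolerates 2 -- which is why 4 nodes buy nothing over 3.
```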


An important premise of registry design is that the registry must not, through its own faults or failures, affect calls between services. In practice, even if the registry itself goes down or becomes unavailable, inter-service calls must not be affected. This requires the SDK that talks to the registry to implement client-side disaster recovery for this special situation; a "client cache" is an effective method. When the registry cannot provide service, the client simply stops updating its local cache and uses the service list it has already cached to complete inter-service calls normally.
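A minimal sketch of such a client cache might look as follows; `CachedDiscoveryClient` and the injected `fetch` callable are hypothetical names, not an actual SDK API:

```python
class CachedDiscoveryClient:
    """Wrap a registry lookup with a local instance cache.

    If the registry call fails, the last successfully fetched instance
    list is served, so inter-service calls keep working while the
    registry is down. 'fetch' is a hypothetical callable that queries
    the registry (e.g. Consul) and returns a list of addresses.
    """
    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = {}

    def instances(self, service):
        try:
            result = self._fetch(service)
            self._cache[service] = result      # refresh cache on success
        except Exception:
            result = self._cache.get(service, [])  # registry down: serve cache
        return result

# usage: the second lookup survives a registry outage
calls = {"n": 0}
def flaky_fetch(service):
    calls["n"] += 1
    if calls["n"] > 1:
        raise ConnectionError("registry unavailable")
    return ["10.0.0.1:8080", "10.0.0.2:8080"]

client = CachedDiscoveryClient(flaky_fetch)
first = client.instances("order-service")   # fetched from the registry
second = client.instances("order-service")  # registry down, served from cache
```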


In our design we deployed a 3-node Consul cluster within a single datacenter to ensure high availability: when one node in the cluster fails, the cluster can still operate normally. At the same time, the 3 nodes are deployed in different machine rooms to achieve machine-room disaster recovery.

A cloud environment involves many regions, so in the architecture design we map one Consul datacenter to one cloud region, which fits Consul's definition of a datacenter (a private, low-latency, high-bandwidth network environment). An intermediate proxy layer implements functions such as service authentication and multi-tenant isolation; it can also connect to multiple registries.

Cloud environments also require multi-tenant isolation, that is, a service of tenant A may only discover instances of tenant A's services. For this scenario, multi-tenant isolation is implemented in the "intermediate proxy layer". The main idea is to use the Filtering feature of the Consul API:

  • Use the Filtering feature to meet the tenant-isolation requirement;
  • Reduce network load when querying the registry's interfaces;
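As a hedged sketch of this idea: since Consul 1.4, many read endpoints accept a `filter` query parameter containing a filter expression, so tenant scoping can be pushed to the server side. The `tenant` metadata key below is our own assumed convention:

```python
from urllib.parse import urlencode

def tenant_services_url(agent: str, tenant: str) -> str:
    """Build an agent services query filtered to one tenant.

    Uses Consul's filter expression syntax (Meta.tenant == "...").
    Filtering server-side also trims the response payload, which is
    exactly the network-load benefit mentioned above.
    """
    expr = f'Meta.tenant == "{tenant}"'
    return f"{agent}/v1/agent/services?" + urlencode({"filter": expr})

url = tenant_services_url("http://127.0.0.1:8500", "tenant-a")
```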

Ensure high service availability through governance strategies

What is high availability? Wikipedia defines it as the ability of a system to perform its functions without interruption; it represents the system's degree of availability and is one of the criteria in system design. We usually describe system availability in terms of N nines. Reaching 4 nines means the system has automatic recovery capability; reaching 5 nines means the system is extremely robust with extremely high availability, and that level is very difficult to achieve.
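The nines translate directly into allowed downtime per year; a quick check of the arithmetic:

```python
def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for N nines of availability."""
    availability = 1 - 10 ** (-nines)
    return (1 - availability) * 365 * 24 * 60  # minutes in a (non-leap) year

# 3 nines (99.9%)  -> about 525.6 minutes (~8.8 hours) per year
# 4 nines (99.99%) -> about 52.6 minutes per year
# 5 nines (99.999%)-> about 5.3 minutes per year
```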

Common causes of system unavailability include: bugs in programs and configuration, machine failures, machine-room failures, insufficient capacity, and timeouts in dependent services. Common levers for high availability include: R&D quality, test quality, change management, monitoring and alerting, failure contingency plans, capacity planning, fault-injection drills, on-call inspections, and so on. Here we mainly introduce how governance strategies, as a high-availability design method, help ensure high availability.


High availability is a complex proposition, so a high-availability solution touches every aspect of a system, and countless details appear along the way; we therefore need a top-level design for such a microservice high-availability solution.

For example, service redundancy:

  • Redundancy strategy: every machine and every service can fail, so the first consideration is that every service must have more than one copy. Multiple consistent copies of a service are what we mean by redundancy, where "service" here covers machine services, container services, and the microservices themselves. At the machine level, you also need to consider whether redundant machines are isolated from each other in physical space.
  • Statelessness: we want to be able to scale a service out or in at any time, and that requires the service to be stateless. Stateless means every instance of the service is identical and shares no local data, so any instance can serve any request.

For example, flexible degradation/asynchrony:

  • Flexibility means that, where the business permits, we do not have to give users a 100%-available service; through degradation we provide users with as much service as possible, instead of handing in either a 100-point or a 0-point answer sheet every time. Flexibility is more a way of thinking and requires a deep understanding of the business scenario.
  • Asynchrony: in every call, the longer it takes, the greater the risk of timeout; and the more complex the logic and the more steps executed, the greater the risk of failure. If the business permits, a call should synchronously return only the results the user actually needs; anything that need not be synchronous can be processed asynchronously elsewhere, which reduces timeout risk and splits complex business to reduce complexity.

Most of the methods mentioned above for improving service availability must be implemented from the business, deployment, and maintenance perspectives. Below we focus on the high-availability service governance strategies provided by the SDK/Sidecar. These strategies are non-intrusive or only weakly intrusive to the business, so most services can easily achieve high availability with them.


Once routing is established between microservices, data flows between services. Different services provide different resources and carry different traffic; to prevent a single Consumer from occupying too many of a Provider's resources, or a sudden traffic surge from overwhelming the Provider, rate limiting is needed to ensure high availability of the service.

In service governance, although rate-limiting rules help keep services from taking on excessive traffic, service failures are still hard to avoid entirely in production. When some services in a system fail and no timely measures are taken, the failure can spread through inter-service calls, eventually enlarging its scope and even bringing down the whole system; we call this phenomenon an "avalanche". Circuit breaking and degradation are used well beyond service governance; they are also widely applied in the financial industry, for example when an exchange suspends trading to control risk once a stock index's volatility exceeds the prescribed breaker threshold.

Load balancing is a key component of a high-availability architecture. It is mainly used to improve performance and availability: traffic is distributed across multiple servers, and having multiple servers also eliminates single points of failure.

The governance rules above can to some extent be aligned across the two frameworks, Spring Cloud and Service Mesh: a single set of governance configuration can be converted and distributed to the SDK of a Spring Cloud application and to the Sidecar of a Service Mesh application. The rules can be delivered by the Config-server or by the Service Mesh control plane, depending on the specific architecture.

Rate limiting

For any application system there is a limit to the number of concurrent requests it can handle, that is, there is always a TPS/QPS threshold. If the threshold is exceeded, the system stops responding to user requests or responds very slowly, so we had better apply overload protection to prevent a flood of requests from overwhelming the system. The purpose of rate limiting is to protect the system by limiting the rate of concurrent access/requests, or the number of requests within a time window; once the limit is reached, the service can deny requests or perform traffic shaping.

Common microservice rate-limiting architectures include:

  • Rate limiting at the access layer (api-gateway):

    • Single instance;
    • Multiple instances: a distributed rate-limiting algorithm is required;
  • Calling an external rate-limiting service:

    • After receiving a request, the microservice asks the rate-limiting service, through its exposed RPC interface, whether the threshold has been exceeded;
    • The rate-limiting service must be deployed separately;
  • Rate limiting in the aspect layer (SDK):

    • The rate-limiting function is integrated in the microservice framework's aspect layer, decoupled from the business;
    • It can be combined with a remote configuration center;

Common rate-limiting strategies include:

  • Rejection:

    • If the threshold is exceeded, an error is returned directly;
    • The caller can handle it with circuit breaking and degradation;
  • Delayed processing:

    • A traffic buffer pool is set up at the front: all requests are buffered into the pool rather than processed immediately, and the real back-end business processor takes requests from the pool and processes them in order, commonly implemented with a queue (MQ: peak shaving and valley filling);
    • Asynchrony reduces back-end processing pressure;
  • Privileged processing:

    • This mode classifies users: based on preset classes, the system serves user groups with high-assurance needs first, and delays or simply drops requests from other user groups;

Common rate-limiting algorithms include:

  • Fixed time window:

    • First choose a starting point in time, then increment a counter for every interface request. If, within the current time window, the accumulated count exceeds the limit set by the rule (for example, at most 100 requests per second), the rate limiter rejects further requests; when the next time window starts, the counter is reset and counting begins again;
    • The disadvantage is that the strategy is too coarse to handle burst traffic around the boundary between two adjacent time windows.
  • Sliding time window:

    • After traffic is shaped by the sliding-window algorithm, no time window can exceed the maximum allowed limit; the traffic curve is smoother, and the boundary-burst problem above is partially solved. It is an improvement on the fixed-window algorithm;
    • The disadvantage is that the arrival time of every request within the window must be recorded, which occupies more memory.
  • Token bucket:

    • If the interface allows at most n accesses in t seconds, a token is put into the bucket every t/n seconds;
    • The bucket holds at most b tokens; if the bucket is full when a token arrives, the token is discarded;
    • An interface request first takes a token from the bucket; with a token it is processed, and without one it blocks or is refused service.
  • Leaky bucket:

    • The rate at which requests leave the bucket is also limited, to a fixed pace of one every t/n seconds;
    • The implementation usually relies on a queue: an arriving request is put into the queue if the queue is not full, and a processor takes requests from the head of the queue at a fixed rate; if requests pile up and the queue fills, new requests are discarded;
    • The two algorithms share broadly similar ideas; the leaky bucket can be seen as a variant of the token-bucket rate-limiting algorithm;
In some scenarios (memory consumption, handling burst traffic), the token-bucket and leaky-bucket algorithms perform better than the time-window algorithms and become the first choice.
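A minimal token-bucket sketch of the steps above (with a lazy refill instead of a timer thread, and an injectable clock so the behavior is deterministic; all names are illustrative):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: refill at a fixed rate, burst up to capacity."""
    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate          # tokens added per second (n/t in the text)
        self.capacity = capacity  # bucket size b (maximum burst)
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # lazily refill based on elapsed time instead of a background thread
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # out of tokens: reject (or queue/shape the request)

# deterministic usage with a fake clock
t = [0.0]
bucket = TokenBucket(rate=10, capacity=5, clock=lambda: t[0])
burst = sum(bucket.allow() for _ in range(10))  # burst of 10: only 5 pass
t[0] += 0.2                                     # 0.2 s later: 2 tokens refilled
later = sum(bucket.allow() for _ in range(10))  # now only 2 pass
```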


The circuit breaker pattern is one of the most widely adopted patterns in microservice architectures. It is designed to minimize the impact of failures, prevent cascading failures and avalanches, and ensure end-to-end performance. We will compare the advantages and disadvantages of two different ways of implementing it: Hystrix and Istio.


In the electrical domain, a circuit breaker is an automatically operated electrical switch designed to protect a circuit: its basic function is to interrupt the current after a fault is detected, after which it can be reset (manually or automatically) to resume normal operation once the fault is resolved. This looks very similar to our problem: to protect an application from excessive requests, it is best to interrupt communication between the front end and the back end immediately when the back end detects repeated errors. Michael Nygard uses this analogy in his book "Release It!" and provides a typical case of the design pattern applied to the timeout problem above, which can be summarized with the figure above.
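Before looking at the two implementations, the canonical closed/open/half-open state machine described above can be sketched as follows (an illustrative toy, not the Hystrix or Istio implementation):

```python
import time

class CircuitBreaker:
    """Minimal sketch of the closed / open / half-open circuit breaker."""
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds the circuit stays open
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"   # let one trial request through
            else:
                return fallback()          # fail fast while the circuit is open
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # a failed trial, or too many consecutive failures, (re)opens the circuit
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        self.state = "closed"              # success closes the circuit again
        return result

# deterministic usage with a fake clock
t = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: t[0])
def failing_backend():
    raise RuntimeError("backend down")
breaker.call(failing_backend, lambda: "cached")  # failure 1
breaker.call(failing_backend, lambda: "cached")  # failure 2 -> circuit opens
state_after_failures = breaker.state             # "open": callers now fail fast
t[0] = 11.0                                      # wait past reset_timeout
recovered = breaker.call(lambda: "live price", lambda: "cached")  # trial succeeds
```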


Following the model above, Istio implements the circuit breaker pattern through the DestinationRule, or more precisely through the path trafficPolicy -> outlierDetection:

  • consecutiveErrors: the number of errors before the circuit breaker opens (the host is ejected);
  • interval: the time interval between circuit-breaker analysis sweeps;
  • baseEjectionTime: the minimum open (ejection) time; a host remains ejected for a period equal to the product of this minimum duration and the number of times it has been ejected;
  • maxEjectionPercent: the maximum percentage of hosts in the upstream service's load-balancing pool that can be ejected; once this threshold is reached, no further hosts are ejected;
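Wired together, these four fields might appear in a DestinationRule like the following sketch (the host name is illustrative; note that newer Istio versions rename `consecutiveErrors` to more specific fields such as `consecutive5xxErrors`, so check the reference for your version):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: backend-circuit-breaker
spec:
  host: backend                  # illustrative upstream service name
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 5       # errors before a host is ejected ("opened")
      interval: 10s              # analysis sweep interval
      baseEjectionTime: 30s      # ejection lasts baseEjectionTime * times ejected
      maxEjectionPercent: 50     # never eject more than half of the pool
```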

Compared with the canonical circuit breaker described above, there are two main deviations:

  • There is no half-open state. However, how long the breaker stays open depends on how many times the called service has failed before: a continuously failing service causes the breaker to stay open longer and longer.
  • In the basic model there is only one called application (the backend). In a more realistic production environment, multiple instances of the same application may be deployed behind a load balancer. Some instances may fail while others keep working, and because Istio is also a load balancer, it can track failing instances and remove them from the load-balancing pool, up to a point: the maxEjectionPercent attribute exists to keep a minimum fraction of the instance pool in service.

Hystrix provides a circuit breaker implementation that allows a fallback mechanism to run when the circuit opens. The key methods are HystrixCommand's run() and getFallback():

  • run(): the actual code to execute, e.g. fetch the price from the quotation service;
  • getFallback(): returns the fallback result when the circuit breaker is open, e.g. return the cached price;

Spring Cloud is a framework built on Spring Boot and is well integrated with Spring. It lets developers simply annotate the required fallback method instead of instantiating Hystrix command objects by hand.


There are two ways to implement a circuit breaker: a black-box approach and a white-box approach. Istio, as a proxy management tool, uses the black-box approach: it is simple to implement, does not depend on the underlying technology stack, and can be configured after the fact. The Hystrix library, on the other hand, uses the white-box approach, which allows all the different types of fallback:

  • Single default value;
  • A cache;
  • Call other services;

It also supports cascading fallbacks. These additional features come at a price: fallback decisions must be made during the development phase.

The best fit between the two approaches depends on your own context: in some cases, such as a referenced service, a white-box fallback strategy may be the better choice, while in other cases failing fast may be perfectly acceptable, such as for a centralized remote logging service.

Common circuit-breaking approaches include automatic and manual breaking, and when a break occurs you can choose fail-fast or a fallback; these can be combined flexibly according to demand.

Smart routing


Finally, let's look at the high availability brought by intelligent routing, which here covers (client-side) load balancing and instance fault-tolerance strategies. In the Spring Cloud framework this capability is provided by Ribbon, which supports load-balancing algorithms such as random, round robin, and weighted response time. In the Service Mesh framework this capability is provided by Envoy, which supports random, (weighted) round robin, ring hash, and other algorithms. To align the rules of the two systems uniformly, the intersection of the two algorithm sets can be used.

The fault tolerance strategy includes:

  • failover: automatically switch to another server after a failure, with a configurable number of retries;
  • failfast: report an error immediately after a failure, with no retry;
  • failresend: put the failed request into a cache queue and process it asynchronously; usually used together with failover;

Istio supports retry policy configuration; fail-fast corresponds to a retry count of 0.
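A failover strategy with a bounded retry count can be sketched like this (the `invoke` transport callable and the addresses are hypothetical; fail-fast is simply the `retries=0` special case):

```python
def call_with_failover(instances, invoke, retries=2):
    """Failover sketch: after a failure, retry on the next instance.

    'invoke' is a hypothetical transport call taking an instance address;
    at most 'retries' extra attempts are made, each on the next instance
    in the list (wrapping around if needed).
    """
    last_error = None
    for attempt in range(retries + 1):
        instance = instances[attempt % len(instances)]
        try:
            return invoke(instance)
        except Exception as err:
            last_error = err  # this instance failed: move on to the next one
    raise last_error          # every attempt failed: surface the error

# usage: the first instance is down, the call fails over to the second
log = []
def invoke(addr):
    log.append(addr)
    if addr == "10.0.0.1:8080":
        raise ConnectionError("instance down")
    return f"ok from {addr}"

result = call_with_failover(["10.0.0.1:8080", "10.0.0.2:8080"], invoke)
```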

Summary

The high availability of microservices is a complex issue that often needs to be viewed from multiple perspectives, including:

  1. From the means of high availability: the main technical means are redundant backup and failover of services and data. A set of services or data is backed up across multiple nodes, so that when a machine goes down or misbehaves, you can switch from the current service to another available one without affecting system availability or losing data.
  2. From the architecture: keep the architecture simple. Most websites currently adopt the classic layered architecture of application layer, service layer, and data layer: the application layer handles business logic, the service layer provides data- and business-related services, and the data layer is responsible for reading and writing data. A simple architecture keeps the application and service layers stateless and horizontally scalable, which makes computation highly available. During architecture design you should also take the CAP theorem into account.
  3. From the hardware: first accept that hardware always breaks and networks are always unstable. The answer is redundancy: if one server is not enough, use several; if one rack is not enough, use several; if one machine room is not enough, use several.
  4. From the software: sloppy development and non-standardized releases also lead to all kinds of unavailability. Controlling quality throughout the development process, and using testing, pre-release, gray (canary) release, and similar means, are all measures that reduce unavailability.
  5. From governance: standardize services, split services in advance, monitor services, anticipate unavailability, and find and solve problems before they cause outages; for example, after a service goes online, configure rate-limiting rules and automatic circuit-breaking rules based on experience.

The above is all the content shared in this issue.

Live playback address:
Share PPT download address: